Control your Vocab (or not)

I am a NINES Graduate Fellow for 2009-2010, and this post was written for the NINES Blog.

Yesterday I had two conversations about controlled vocabulary in digital humanities projects (a.k.a. my definition of a really good day). Both conversations centered on the same question: what is the best way to associate documents with subject information? If you don’t attach some keywords or subject categories to your documents, then you can forget about finding anything later. There are, in my estimation, two main camps for doing this in a digital project: tags and pre-selected keywords.

In my humble opinion, tags are best when you want your users to take ownership of the data. They decide the categories, so in some sense they have a stake in the larger project and how it evolves. You might even be able to tell why people are using the data in the first place by looking at what tags they associate with your (or their) content. On the downside, tags can be problematic for first-time users who need to search (rather than explore) your data. On several occasions I have been confronted with tag clouds that have descended (or ascended) into the realm of performance art. They are fascinating in and of themselves, but fail to provide a meaningful path into the data.

Pre-selected keywords often work best when a clearly defined set of people is in charge of marking up the content. They are great for searching and, if indexed in a hierarchical structure, can provide semantically powerful groupings (especially for geographical information). And if you have a Third Normal Form database, then each keyword is stored exactly once, so you don’t have to worry about variant spellings or inconsistent associations creeping in between your keywords (Disclaimer: I love 3NF databases. I know they don’t work for every project, but when your data fits that structure, life is good). As a historian, however, I am wary of keywords that are imposed on a text. If someone calls himself a “justice,” I balk at calling him a “judge” even if it means a more efficient search.
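
To make the database point a little more concrete, here is a minimal sketch of a normalized keyword schema using Python's built-in sqlite3 module. The table names, column names, and sample rows are all invented for illustration and are not taken from any real project. Each keyword lives in exactly one row, and a linking table ties documents to keywords, so the database itself can refuse a link to a keyword that does not exist.

```python
import sqlite3

# In-memory database; all table and column names here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
conn.executescript("""
    CREATE TABLE document (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE keyword  (id INTEGER PRIMARY KEY, term TEXT NOT NULL UNIQUE);
    CREATE TABLE document_keyword (
        document_id INTEGER NOT NULL REFERENCES document(id),
        keyword_id  INTEGER NOT NULL REFERENCES keyword(id),
        PRIMARY KEY (document_id, keyword_id)
    );
""")

# Invented sample rows, for illustration only.
conn.execute("INSERT INTO document (id, title) VALUES (1, 'Sample document')")
conn.execute("INSERT INTO keyword (id, term) VALUES (1, 'judge')")
conn.execute("INSERT INTO document_keyword VALUES (1, 1)")


def try_link(document_id, keyword_id):
    """Return True if the link was stored, False if a constraint rejected it."""
    try:
        conn.execute(
            "INSERT INTO document_keyword VALUES (?, ?)",
            (document_id, keyword_id),
        )
        return True
    except sqlite3.IntegrityError:
        return False
```

With this in place, `try_link(1, 99)` is rejected because keyword 99 does not exist, and a second attempt to insert the term 'judge' into the keyword table fails on the UNIQUE constraint. Note that the schema only prevents duplicates and dangling links; someone still has to spell the keyword correctly the first time.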

Of course, it all depends on your data and what you want to do with it, but my favorite solution is to have, at minimum, two layers of keywords. The bottom layer reflects the language in the text (similar to tagging), but those terms are then grouped into pre-selected types. So “justice,” “justice of the peace,” “judge,” “lawyer,” “barrister,” and “counselor” all get associated with the type “legal.” You can fake hierarchies with tags, but it requires far more careful attention to tag choices than I typically associate with that methodology.
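
The two-layer idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not the actual Project Quincy implementation: the document identifiers and function names are hypothetical, and the only real data is the list of legal terms from the paragraph above. The bottom layer keeps the term exactly as the text uses it, and the top layer maps each term to a pre-selected type.

```python
from collections import defaultdict

# Top layer: each term found in the texts maps to one pre-selected type.
TERM_TYPES = {
    "justice": "legal",
    "justice of the peace": "legal",
    "judge": "legal",
    "lawyer": "legal",
    "barrister": "legal",
    "counselor": "legal",
}

# Bottom layer: the terms as they appear in each document.
# These document identifiers are hypothetical placeholders.
DOCUMENT_TERMS = {
    "doc-1": {"justice"},
    "doc-2": {"barrister"},
    "doc-3": {"judge", "lawyer"},
}


def documents_by_type(document_terms, term_types):
    """Group document ids under the type of every term they carry."""
    index = defaultdict(set)
    for doc_id, terms in document_terms.items():
        for term in terms:
            type_name = term_types.get(term)
            if type_name is not None:  # terms without a type are simply skipped
                index[type_name].add(doc_id)
    return dict(index)


def documents_with_term(document_terms, term):
    """Find documents that use one exact term, preserving the text's own language."""
    return {doc_id for doc_id, terms in document_terms.items() if term in terms}
```

A search on the type "legal" finds all three sample documents, while `documents_with_term(DOCUMENT_TERMS, "justice")` returns only the one whose text actually says "justice." That is the point of keeping both layers: the broad search stays efficient and nobody gets relabeled as a "judge."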

I implemented the two-tiered approach in Project Quincy, but I would love to hear other suggestions and opinions.

