Research Data Management

Background and links to more information about data management issues.

Creating Digital Text Collections or Corpora

Digital Text Collections

Digital text repositories, like the HathiTrust Digital Library, store millions of digitized texts. For copyrighted material, your use of these texts may be restricted to keyword searches (returning only snippets) or, when allowed, renting these books for a set amount of time. More valuably, these repositories offer access to millions of full-text copies of books and other texts that are in the public domain. You may view images of these texts (like the image to the top left above) through their browser. For larger text analysis projects, however, you may download a plain text copy (bottom left) of these texts and construct your own digital corpus or text collection for analysis. Note: HathiTrust and other repositories produce these plain text copies automatically using optical character recognition (OCR) software, so the copies contain some errors, depending on the quality of the page image.

Converting Formulaic Texts into Structured Data

Would you like to convert a formulaic text - like an encyclopedia, gazetteer, statistical abstract, directory, etc. - into a structured dataset that may be queried, sorted, and filtered?

Regular expressions offer a powerful means to parse and extract specific types of information from such texts. For an introductory tutorial on this technique using LibreOffice, see the Programming Historian's Understanding Regular Expressions lesson. For more advanced applications of this technique using the programming language Python, see the Programming Historian's Generating an Ordered Dataset from a Text File lesson.
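As a rough illustration of the idea, the Python sketch below turns a few lines of a directory-style text into rows that can be sorted and filtered. The sample lines, the "surname, given name, occupation, address" layout, and the names ENTRY and parse_entries are all assumptions made up for this example; a real source will need its own pattern and will usually contain OCR errors that the pattern must tolerate.

```python
import re

# Hypothetical directory lines (invented for illustration only).
# Assumed layout: "Surname, Given name, occupation, address".
SAMPLE = """\
Adams, John, carpenter, 12 Main St
Baker, Mary, teacher, 4 Elm St
Not a directory entry
Clark, Wm., grocer, 101 High St
"""

# One named group per field; fields are separated by commas.
ENTRY = re.compile(
    r"^(?P<surname>[^,]+),\s*(?P<given>[^,]+),\s*"
    r"(?P<occupation>[^,]+),\s*(?P<address>.+)$"
)

def parse_entries(text):
    """Return one dict per line that matches the assumed entry layout."""
    rows = []
    for line in text.splitlines():
        match = ENTRY.match(line.strip())
        if match:  # lines that do not fit the pattern are skipped
            rows.append(match.groupdict())
    return rows

rows = parse_entries(SAMPLE)

# Once the text is structured, it can be filtered and sorted like any dataset.
teachers = [r for r in rows if r["occupation"] == "teacher"]
by_surname = sorted(rows, key=lambda r: r["surname"])
```

Lines that do not fit the pattern are silently skipped here; in a real project it is usually worth collecting them in a separate list so they can be reviewed by hand.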

Frequency Lists and n-grams

For copyrighted texts, some repositories offer word and term frequency lists to researchers in place of full-text content. JSTOR, for example, allows researchers to download n-gram lists (one-, two-, and three-word terms ordered by their frequency) for journal articles and book reviews in their database.
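To make the idea of an n-gram list concrete, here is a minimal Python sketch that counts one- and two-word terms in a short sample sentence. The sample text and the function name ngram_counts are invented for this example, and the simple tokenizer is an assumption; this is not a description of how JSTOR or any other repository builds its lists.

```python
import re
from collections import Counter

def ngram_counts(text, n):
    """Count n-word terms in text (a simple sketch, not a full tokenizer)."""
    # Lowercase the text and keep runs of letters and apostrophes as words.
    tokens = re.findall(r"[a-z']+", text.lower())
    # Slide a window of n words across the token list.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)

# Hypothetical sample text, invented for illustration.
sample = "The state of the union is strong. The state of the economy is strong."

unigrams = ngram_counts(sample, 1)
bigrams = ngram_counts(sample, 2)

# most_common() orders terms by frequency, as in a downloaded n-gram list.
top_bigrams = bigrams.most_common(3)
```

Note that this sketch ignores sentence boundaries, so it also counts a two-word term such as "strong the" that spans two sentences.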

Annotated / Encoded Text Corpora

Annotated text corpora include special encoding or "tags" to identify and add commentary on the content of texts. By encoding texts with XML tags, these corpora allow web developers to sort, filter, or represent given elements of texts in a variety of ways. These tags have an additional use, however: they also allow researchers to transform these texts into searchable databases.
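The short Python sketch below shows how tags make a text queryable. It uses the standard library's xml.etree.ElementTree module. The sample document, the people and places named in it, and the tag names persName and placeName are assumptions invented for this example; an actual corpus will follow its own encoding scheme, often with XML namespaces that need extra handling.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical encoded text, invented for illustration only.
SAMPLE_XML = """\
<text>
  <p>Letter from <persName>Jane Doe</persName> to
     <persName>Richard Roe</persName>, written at
     <placeName>Springfield</placeName>.</p>
  <p><persName>Richard Roe</persName> replied from
     <placeName>Riverton</placeName>.</p>
</text>
"""

root = ET.fromstring(SAMPLE_XML)

def tagged_values(element, tag):
    """Return the text inside every occurrence of tag, in document order."""
    return [el.text for el in element.iter(tag)]

people = tagged_values(root, "persName")
places = tagged_values(root, "placeName")

# Counting tagged names is one simple way to query an encoded text.
mentions = Counter(people)
```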

Metadata Analysis


While full-text analyses of large corpora attract much of the attention, we can still learn much by reviewing the metadata of a corpus (i.e. the titles, authors, dates, and abstracts of books and articles). In a 2014 blog post, "Still Playing Catch Up," Cameron Blevins examines the American Historical Review's slow progress toward gender equality. Blevins observes that although female dissertation authors had more or less caught up to their male counterparts, male authors of books reviewed by the AHR still outnumbered female authors 2 to 1 as of 2013.* A curious gender imbalance also exists among the reviewers: while the reviewers of female-authored books have achieved near gender parity, 3 out of 4 reviewers of male-authored books are men.

Blevins Gender Study

* As Blevins notes, using a study of first names to identify the gender of authors is not without its pitfalls, including the fact that it "subtly reinforces an insidious gender binary framework."



Revolution Graph



In "Searching for the Victorians," Dan Cohen examines trends in the words found in book titles. Some words, such as "revolution," demonstrate predictable trends (see the graph to the right). Others indicate patterns that deserve further exploration. For example, Cohen charts indicators of a growing pessimism in the nineteenth century, such as the decline in the use of words like "progress" and "happiness."
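A title-word trend of this kind can be approximated from nothing more than a list of publication years and titles. The Python sketch below computes, for each decade, the share of titles that contain a given word. The records, the function name share_by_decade, and the choice to group by decade are all assumptions made for this example; it is not a reconstruction of Cohen's data or method.

```python
from collections import defaultdict

# Hypothetical (year, title) metadata records, invented for illustration.
RECORDS = [
    (1805, "Progress of the Nation"),
    (1808, "A History of the Revolution"),
    (1851, "Essays on Progress and Happiness"),
    (1856, "The Revolution in Europe"),
    (1859, "Sermons"),
]

def share_by_decade(records, word):
    """Return {decade: share of that decade's titles containing word}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for year, title in records:
        decade = year // 10 * 10  # e.g. 1856 -> 1850
        totals[decade] += 1
        if word in title.lower().split():
            hits[decade] += 1
    # Dividing by the decade total controls for how many titles were published.
    return {decade: hits[decade] / totals[decade] for decade in sorted(totals)}

trend = share_by_decade(RECORDS, "revolution")
```

Reporting a share rather than a raw count matters because the number of titles published usually changes a great deal from decade to decade.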



Word Searches, Frequencies, and N-Grams

[other examples]

Exploring Changes Over Time

SOTU graph

One of the most commonly mined sets of texts, at least for the United States, is the corpus of State of the Union addresses that U.S. presidents have delivered annually since 1790. A 2015 article in The Atlantic describes some of the insights that text analysis provides about how presidential priorities have changed over more than 200 years.






[work in progress - more examples will be added here...]