Text Analysis

Dartmouth College Library's guide to text analysis tools and platforms

Need help?

Dartmouth College Library offers text analysis platforms, training, and support for your research and teaching needs. We currently provide access to two text analysis platforms, Constellate and ProQuest TDM Studio, along with individual or group training. To get started, contact the Research Data Help team at researchdatahelp@groups.dartmouth.edu.

Introduction to Text Analysis

Text analysis platforms provide researchers with the ability to search, manipulate, and explore knowledge in the scholarly record at a large scale to uncover important information and gain new insights.  Text analysis can enable us to answer questions on how texts are interconnected, what sentiments they contain, and when significant terms change within a collection of unstructured texts.

Text analysis research questions explore a wide range of topics, from biomedical discovery to literary history. Research questions that are well suited to text analysis methods often involve one or more of these characteristics:

  • Change over time
  • Pattern recognition
  • Comparative analysis

There are five main questions that text analysis can help answer:

  1.     What are these texts about?
  2.     How are these texts connected?
  3.     What emotions (or affects) are found within these texts?
  4.     What names are used in these texts?
  5.     Which of these texts are most similar?

Question 1: What are these texts about?

  •     Word Frequency (Beginner)

Counting how often a word appears in a given text. Bag of Words models are built from these counts, and TF-IDF (listed separately below) reweights them. Example: "Which of these texts focus on women?"
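As a rough illustration of word frequency, the sketch below counts lowercase word tokens using only the Python standard library. The function name `word_frequencies`, the simple tokenizing pattern, and the sample sentence are all assumptions made for this example; they are not taken from Constellate or ProQuest TDM Studio.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Return a Counter mapping each lowercase word to its number of occurrences."""
    # Assumption: a "word" is a run of letters, optionally with one internal apostrophe.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(tokens)

# Invented sample text, for illustration only.
sample = "Women voted. Women organized, and women led."
freqs = word_frequencies(sample)
print(freqs["women"])        # count for a single word
print(freqs.most_common(3))  # the three most frequent words
```

A real project would usually also remove very common words (stop words) before comparing texts.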

  •     Collocation (Beginner)

Examining where words occur close to one another. Example: "Where are women mentioned in relation to home ownership?"
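One simple way to approach collocation is to count the words that fall within a fixed window around a target word. The sketch below does this with the standard library; the function name `collocates`, the window size, and the sample sentences are assumptions chosen for illustration.

```python
import re
from collections import Counter

def collocates(text, target, window=3):
    """Count words appearing within `window` tokens on either side of `target`."""
    # Assumption: tokens are runs of letters, so numbers and punctuation are dropped.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == target:
            before = tokens[max(0, i - window):i]
            after = tokens[i + 1:i + 1 + window]
            counts.update(before + after)
    return counts

# Invented sample text, for illustration only.
sample = "Few women owned a home in 1900. By 1950 more women held a home loan."
near = collocates(sample, "women", window=3)
print(near["home"])  # how often "home" falls within three words of "women"
```

Widening the window finds looser associations at the cost of more noise.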

  •     Topic Analysis (or Topic Modeling) (Intermediate)

Discovering the topics within a group of texts. Example: "What are the most frequent topics discussed in this newspaper?"

  •     TF-IDF (Intermediate)

Finding the words that are distinctive of one text relative to the rest of a collection. Example: "What language is most significant within 1970s political speech?"
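To make the idea concrete, here is a minimal sketch of one common TF-IDF formulation: term frequency (a word's share of its document) multiplied by the natural log of the number of documents divided by the number of documents containing the word. Libraries typically use smoothed variants, so scores will differ from theirs. The function names and the three short "speeches" are invented for this example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase runs of letters."""
    return re.findall(r"[a-z]+", text.lower())

def tf_idf(documents):
    """Return one {term: score} dict per document, using unsmoothed TF-IDF."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

# Invented sample documents, for illustration only.
speeches = [
    "the economy and the inflation crisis",
    "the war and the peace talks",
    "the energy crisis and the economy",
]
scores = tf_idf(speeches)
print(scores[1]["the"])  # a word found in every document scores zero
print(scores[1]["war"])  # a word unique to one document scores higher
```

Words that appear in every document, such as "the", receive a score of zero under this formulation, which is why TF-IDF surfaces distinctive vocabulary rather than merely frequent vocabulary.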

Question 2: How are these texts connected?

  •     Concordance (Beginner)

Where is this word or phrase used in these documents? Example: "Which journal articles mention Maya Angelou's phrase, 'If you're for the right thing, then you do it without thinking'?"
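A concordance is often displayed as "keyword in context" lines. The sketch below finds each case-insensitive occurrence of a phrase and returns it with a few characters on either side. The function name `concordance`, the bracket markers, and the sample text are assumptions for illustration.

```python
import re

def concordance(text, phrase, width=30):
    """Return one snippet per match, with the matched phrase wrapped in brackets."""
    snippets = []
    # re.escape treats the phrase as literal text rather than a pattern.
    for match in re.finditer(re.escape(phrase), text, flags=re.IGNORECASE):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        snippets.append(f"{left}[{match.group(0)}]{right}")
    return snippets

# Invented sample text, for illustration only.
sample = "She said to do the right thing. Doing the Right Thing takes courage."
hits = concordance(sample, "the right thing", width=10)
for line in hits:
    print(line)
```

Running the same function over each document in a corpus, and recording which documents produce hits, answers the "which articles mention this phrase" form of the question.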

  •     Network Analysis (Advanced)

How are the authors of these texts connected? Example: "What local communities formed around civil rights in 1963?"
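Network analysis usually starts from a list of connections. As a minimal sketch, the code below builds a co-authorship network in which two authors are linked whenever they share a document, then counts how many distinct collaborators each author has. The function names and the placeholder author names are hypothetical; dedicated network libraries offer far richer measures.

```python
from collections import Counter, defaultdict
from itertools import combinations

def coauthor_network(author_lists):
    """Return a Counter of (author, author) pairs weighted by shared documents."""
    edges = Counter()
    for authors in author_lists:
        # Sorting makes (A, B) and (B, A) the same edge.
        for a, b in combinations(sorted(set(authors)), 2):
            edges[(a, b)] += 1
    return edges

def degree(edges):
    """Return the number of distinct collaborators for each author."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return {name: len(linked) for name, linked in neighbors.items()}

# Placeholder author lists, one per document, for illustration only.
papers = [
    ["Author A", "Author B"],
    ["Author B", "Author C"],
    ["Author A", "Author B"],
]
edges = coauthor_network(papers)
print(edges)
print(degree(edges))
```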

Question 3: What emotions (or affects) are found within these texts?

  •     Sentiment Analysis (Intermediate)

Does the author use positive or negative language? Example: "How do presidents describe gun control?"
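One of the simplest forms of sentiment analysis is lexicon-based scoring: count positive and negative words and compare. The sketch below assumes two tiny word lists invented for this example; real sentiment lexicons are much larger, and many tools use trained models instead.

```python
import re

# Assumption: toy word lists made up for this example, not a published lexicon.
POSITIVE = {"safe", "protect", "support", "good", "hope"}
NEGATIVE = {"danger", "fear", "violence", "bad", "threat"}

def sentiment_score(text):
    """Return (positive - negative) word count divided by total words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    positive = sum(token in POSITIVE for token in tokens)
    negative = sum(token in NEGATIVE for token in tokens)
    return (positive - negative) / len(tokens)

# Invented sample sentences, for illustration only.
print(sentiment_score("We hope to protect families"))
print(sentiment_score("Fear and violence"))
```

A score above zero suggests more positive than negative vocabulary. This approach ignores negation and context ("not safe" counts as positive), which is one reason the technique is rated Intermediate.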

Question 4: What names are used in these texts?

  •     Named Entity Recognition (Intermediate)

Listing every instance of a given kind of entity (such as people, places, or organizations) in these texts. Example: "What are all of the geographic locations mentioned by Tolstoy?"
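Full named entity recognition relies on trained language models, which are beyond a short example. The sketch below swaps in a much simpler technique, gazetteer lookup, which counts occurrences of names from a list supplied in advance. The function name `find_places`, the three-item gazetteer, and the sample sentence are all assumptions for illustration.

```python
import re
from collections import Counter

def find_places(text, gazetteer):
    """Count whole-word, case-sensitive occurrences of each gazetteer entry."""
    counts = Counter()
    for place in gazetteer:
        pattern = r"\b" + re.escape(place) + r"\b"
        found = len(re.findall(pattern, text))
        if found:
            counts[place] = found
    return counts

# Assumption: a hand-made list of place names; a real gazetteer would be far larger.
gazetteer = {"Moscow", "Petersburg", "Smolensk"}
# Invented sample sentence, not a quotation.
sample = "Pierre left Moscow for Petersburg, then returned to Moscow."
places = find_places(sample, gazetteer)
print(places)
```

Unlike a trained model, this lookup cannot discover names that are missing from the list, so it works best when the entities of interest are known ahead of time.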

Question 5: Which of these texts are most similar?

  •     Authorship Attribution (Advanced)

Find the author of an anonymous document. Example: "Who wrote The Federalist Papers?"
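Authorship attribution often compares how frequently writers use common function words. The sketch below is a deliberately simplified version of that idea: it builds a function-word profile for each known author and picks the one closest to the unknown text. The word list, the distance measure (sum of absolute differences), the function names, and the sample texts are all assumptions for this example; published studies use larger word lists, much longer texts, and more careful statistics.

```python
import re
from collections import Counter

# Assumption: a short illustrative list of function words.
FUNCTION_WORDS = ["the", "of", "to", "and", "in", "upon", "by", "on", "while", "whilst"]

def profile(text):
    """Return the relative frequency of each function word in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero for empty text
    return [counts[word] / total for word in FUNCTION_WORDS]

def distance(p, q):
    """Sum of absolute differences between two profiles."""
    return sum(abs(a - b) for a, b in zip(p, q))

def attribute(unknown_text, known_texts):
    """Return the author label whose profile is closest to the unknown text."""
    target = profile(unknown_text)
    return min(known_texts, key=lambda author: distance(target, profile(known_texts[author])))

# Invented writing samples and placeholder author labels, for illustration only.
known = {
    "Author A": "upon the hill while the sun set upon the town",
    "Author B": "on the hill whilst the sun set on the town",
}
unknown = "upon the road while the rain fell upon the field"
print(attribute(unknown, known))
```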

  •     Clustering (Advanced)

Which texts are the most similar? Example: "Is this play closer to comedy or tragedy?"
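Clustering depends on a way to measure how similar two texts are. The sketch below shows only that building block, cosine similarity between word-count vectors, rather than a full clustering algorithm. The function names and the three one-line "plots" are invented for this example.

```python
import math
import re
from collections import Counter

def vectorize(text):
    """Represent a text as a Counter of lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(a, b):
    """Return the cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Invented one-line summaries, for illustration only.
play = vectorize("a wedding and a feast end the quarrel")
comedy = vectorize("a wedding and a dance end the feud")
tragedy = vectorize("a duel and a poison end the prince")
print(cosine_similarity(play, comedy))
print(cosine_similarity(play, tragedy))
```

A clustering algorithm would apply a measure like this across every pair of texts and group the ones that score highest together. In practice the raw counts are usually replaced with TF-IDF weights so that common words do not dominate.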

  •     Supervised Machine Learning (Advanced)

Are there other texts similar to this? Example: "Are there other Jim Crow laws like these we have already identified?"
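Supervised machine learning learns from examples that a researcher has already labeled. As one possible sketch, the code below implements a small multinomial Naive Bayes classifier with add-one smoothing using only the standard library. The labels, the four training sentences, and the function names are invented for illustration; a real project would need many more labeled examples and would normally use an established machine learning library.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Split text into lowercase runs of letters."""
    return re.findall(r"[a-z]+", text.lower())

def train(labeled_texts):
    """Count words per label and documents per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in labeled_texts:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    return word_counts, label_counts

def classify(text, word_counts, label_counts):
    """Return the label with the highest log-probability for the text."""
    vocabulary = {word for counts in word_counts.values() for word in counts}
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        # Add-one smoothing keeps unseen words from zeroing out a label.
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for token in tokenize(text):
            score += math.log((word_counts[label][token] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented training sentences and hypothetical labels, for illustration only.
training = [
    ("separate railway cars for white and colored passengers", "segregation"),
    ("separate schools shall be maintained for colored children", "segregation"),
    ("a tax on the sale of livestock within the county", "other"),
    ("the county shall repair bridges and public roads", "other"),
]
word_counts, label_counts = train(training)
print(classify("separate waiting rooms for colored passengers", word_counts, label_counts))
```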

 

This lesson is a remixed version of Teaching Text Analysis with Constellate, a Jupyter Book by Nathan Kelber and Ted Lawless for Constellate, licensed CC BY.

Text Analysis Process

Like all computational research methodologies, text analysis follows several steps to produce clear, reproducible results. Regardless of platform, users must determine the content they need, build the corpus dataset (of rights-cleared texts), complete the analysis using either templates or custom computational scripts, and then export the derived results.

Select Content

- Select specific publication titles or all titles in a database. Note: only selected, rights-cleared content should be used.

- Refine content by keyword, date, source type or document type
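On the platforms themselves this refinement is done through search forms, but the same logic can be sketched in code. The example below filters a list of records by keyword, date range, and document type. The record fields (`title`, `year`, `doc_type`, `text`), the function name `refine`, and the sample records are hypothetical and do not reflect either platform's actual data format.

```python
def refine(records, keyword=None, start_year=None, end_year=None, doc_type=None):
    """Return the records that satisfy every filter that was supplied."""
    selected = []
    for record in records:
        if keyword and keyword.lower() not in record["text"].lower():
            continue
        if start_year is not None and record["year"] < start_year:
            continue
        if end_year is not None and record["year"] > end_year:
            continue
        if doc_type and record["doc_type"] != doc_type:
            continue
        selected.append(record)
    return selected

# Hypothetical records with invented fields, for illustration only.
records = [
    {"title": "Item 1", "year": 1962, "doc_type": "article", "text": "A report on civil rights marches."},
    {"title": "Item 2", "year": 1963, "doc_type": "editorial", "text": "Civil rights and the local press."},
    {"title": "Item 3", "year": 1971, "doc_type": "article", "text": "Notes on city planning."},
]
subset = refine(records, keyword="civil rights", start_year=1963, end_year=1969)
print([record["title"] for record in subset])
```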

Create Dataset

- Use standard platform visualizations

or

- Copy or import the dataset into a computational notebook environment (Jupyter/Colab)

Explore using Computational Notebooks

- Use sample scripts to explore your dataset

  • word frequencies
  • significant terms
  • sentiment analysis
  • topic modeling
  • named entity recognition

Create Custom Scripts

- Set up an R or Python environment to create your own scripts

Export Derived Results

- Export files and visualizations for dissemination
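Derived results such as word counts are commonly exported as CSV files that can be opened in a spreadsheet or shared alongside a publication. The sketch below writes term counts in descending order using Python's built-in `csv` module. The function name `write_frequencies`, the column names, and the sample counts are assumptions for illustration; an in-memory buffer stands in for a real file.

```python
import csv
import io

def write_frequencies(freqs, handle):
    """Write term counts as CSV rows, highest count first, ties in alphabetical order."""
    writer = csv.writer(handle)
    writer.writerow(["term", "count"])
    for term, count in sorted(freqs.items(), key=lambda item: (-item[1], item[0])):
        writer.writerow([term, count])

# Invented counts, for illustration only.
buffer = io.StringIO()
write_frequencies({"women": 3, "vote": 2, "home": 2}, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with `open("results.csv", "w", newline="")`, where `results.csv` is a placeholder filename.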