Hesburgh Library
Navari Family Center for Digital Scholarship, 250E
University of Notre Dame
Notre Dame, IN 46556
(574) 631-8604
emorgan@nd.edu
Text mining and natural language processing are two sides of the same analysis ("reading") coin. Given a corpus of one or more texts, text mining and natural language processing are processes of:
1. articulating a research question
2. identifying materials that can address the question
3. obtaining the materials
4. converting the materials into plain text
5. creating a corpus from the materials, complete with metadata
6. counting & tabulating features from the corpus
7. modeling the counts and tabulations
8. evaluating the models for patterns and anomalies
9. addressing the research question
Text mining is different from natural language processing in that the former assumes no underlying structure in the way words are put together to communicate ideas. Natural language processing, on the other hand, does assume such structures, including parts-of-speech values (nouns, verbs, adjectives, etc.), named entities (persons, places, organizations, etc.), and, to quote the English linguist J. R. Firth, the premise that "You shall know a word by the company it keeps."
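For example, a natural language processing toolkit such as spaCy can enumerate parts-of-speech and named entities. The following is only a minimal sketch; it assumes spaCy and its small English model (en_core_web_sm) have been installed, and the sample sentence is merely illustrative:

# enumerate parts-of-speech and named entities with spaCy
import spacy

# load a small English language model; assumes it has already been downloaded
nlp = spacy.load("en_core_web_sm")

# apply the model to a sample sentence
doc = nlp("Call me Ishmael. Some years ago I sailed out of Nantucket.")

# list each token with its part-of-speech value
for token in doc:
    print(token.text, token.pos_)

# list each named entity with its type (person, place, organization, etc.)
for entity in doc.ents:
    print(entity.text, entity.label_)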
Such is an outline of using text mining to address research questions, and there are a few things one ought to take away from the process. First, this is an iterative process. In reality, it is never done. Similarly, do not attempt to completely finish one step before you go on to the next. If you do, then you will never get past step #3. Moreover, computers do not mind doing processes over and over again; thus, many of the subprocesses can be repeated as often as necessary.
Second, this process is best done by a team of at least two. One person plays the role of domain expert armed with the research question. The other person knows how to manipulate different types of data structures (different types of lists) with a computer.
Be forewarned. Text mining and natural language processing are not replacements for the traditional reading process. Instead, they are supplements. Yes, they do have a number of advantages over traditional reading. For example, they are scalable, meaning they offer the opportunity to process much more content than a person can alone. They are repeatable, and thus more demonstrative. Text mining can be especially useful even when the student, researcher, or scholar does not know the given language; pure text mining is language-independent. But computers are stupid, and consequently, they do not interpret nuance very well. People are better at this, and thus traditional reading is better for this purpose. Such is often called "close reading". Computers are excellent tools for addressing quantitative-esque questions, but they are lousy at addressing questions regarding why. It is up to a person to interpret observations (counts, tabulations, and models) in order to make judgements. Computers can't do such things. Additionally, language is ambiguous, and computers do not handle ambiguity very well. For all these reasons, you are encouraged to use both the processes described in this guide as well as the traditional reading process you have been taught for the whole of your formal education. Heck, even though I advocate the use of computers for analyzing the whole of an author's oeuvre, I also encourage the student, researcher, or scholar to print their reading materials, bind them into books, and actively read the materials with pen or pencil in hand.
The balance of this guide first outlines in more detail each of the numbered steps above, and then describes how to use some specific software tools to do the work.
This is one of the more difficult parts of the process, and the questions can range from the mundane to the sublime. Examples might include: 1) how big is this corpus, 2) what words are used in this corpus, 3) how have given ideas ebbed & flowed over time, or 4) what is St. Augustine's definition of love and how does it compare with Rousseau's?
Not asking questions ahead of time is like going on a road-trip without knowing the destination, and consequently, you will never know when you get there.
These items may range from sets of social media posts, to sets of journal articles, to sets of reports, to sets of books, etc. In short, point to the collection of documents.
Even in the age of the Internet, when we are all suffering from information overload, you would be surprised how difficult it is to accomplish this step. One might search a bibliographic index and download articles. One might exploit some sort of application programming interface (API) to download tweets. One might do a whole lot of copying & pasting. Whatever the process, I suggest saving each and every file in a single directory with some sort of meaningful name.
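By way of illustration, the following sketch downloads a short list of documents and saves each one in a single directory with a meaningful name. The URLs, file names, and directory name are hypothetical, and the requests library is assumed to be installed:

# download a few documents and save them in a single directory with meaningful names
import requests
from pathlib import Path

# hypothetical list of (name, URL) pairs; substitute your own
documents = [
    ("moby-dick", "https://example.org/moby-dick.txt"),
    ("walden",    "https://example.org/walden.txt"),
]

# create the directory, if necessary
directory = Path("corpus")
directory.mkdir(exist_ok=True)

# loop through each document, download it, and save it with a meaningful name
for (name, url) in documents:
    response = requests.get(url)
    (directory / (name + ".txt")).write_text(response.text, encoding="utf-8")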
Text mining is not possible without plain text; you must have plain text to do the work. This means PDF files, Word documents, spreadsheets, etc. need to have their underlying texts extracted. Tika is a very good tool for doing and automating this process. Save each item in your corpus as a corresponding plain text file.
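As a sketch, the tika-python library (which calls Tika behind the scenes) can extract plain text from PDF files, Word documents, and the like. The directory names below are hypothetical, and both Java and tika-python are assumed to be installed:

# use Tika (via the tika-python library) to convert documents into plain text
from pathlib import Path
from tika import parser

# hypothetical input and output directories; substitute your own
documents = Path("documents")
corpus = Path("corpus")
corpus.mkdir(exist_ok=True)

# loop through each document, extract its text, and save the result as a .txt file
for document in documents.iterdir():
    parsed = parser.from_file(str(document))
    text = parsed.get("content") or ""
    (corpus / (document.stem + ".txt")).write_text(text, encoding="utf-8")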
Put another way, this means creating a list, where each item on the list is described with attributes which are directly related to the research question. Dates are an obvious attribute. If your research question compares and contrasts authorship, then you will need author names. You might need to denote language. If your research question revolves around types of authors, then you will need to associate each item with a type. If you want to compare & contrast ideas between different types of documents, then you will need to associate each document with a type. To make your corpus more meaningful, you will probably want to associate each item with a title value. Adding metadata is tedious. Be forewarned.
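One simple way to record such metadata is a delimited file with one row per item. The following sketch writes a hypothetical metadata.csv file using nothing more than Python's csv module; the records and column names are merely illustrative:

# create a rudimentary metadata file (one row per item in the corpus)
import csv

# hypothetical records; author, title, date, and type are attributes
# directly related to the research question
records = [
    ("moby-dick.txt", "Melville, Herman", "Moby Dick", "1851", "fiction"),
    ("walden.txt", "Thoreau, Henry David", "Walden", "1854", "nonfiction"),
]

# save the list as a CSV file, complete with a header row
with open("metadata.csv", "w", newline="", encoding="utf-8") as handle:
    writer = csv.writer(handle)
    writer.writerow(["file", "author", "title", "date", "type"])
    writer.writerows(records)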
In this case, the word "features" is text mining parlance for enumerated characteristics of a text, and the list of such things is quite long. It includes: the size of documents measured in number of words, counts & tabulations (frequencies) of ngrams, readability scores, frequencies of parts-of-speech, frequencies of named entities, frequencies of given grammars such as noun phrases, etc. There are many different tools for doing this work.
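For example, the following sketch counts & tabulates the most frequent words (unigrams) and two-word phrases (bigrams) in a plain text file. It uses only the Python standard library, and the file name moby-dick.txt is merely an assumption:

# count & tabulate the most frequent unigrams and bigrams in a plain text file
import re
from collections import Counter

# read the text and normalize it into a list of lower-cased words
text = open("moby-dick.txt", encoding="utf-8").read()
words = re.findall(r"[a-z']+", text.lower())

# size of the document, measured in number of words
print("number of words:", len(words))

# the most frequent words (unigrams)
print(Counter(words).most_common(10))

# the most frequent two-word phrases (bigrams)
bigrams = zip(words, words[1:])
print(Counter(bigrams).most_common(10))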
Given a set of features, once all the prep work is done, one can actually begin to address the research question, and there are a number of tools and subprocesses that can be applied here.

Concordancing is one of the quickest and easiest. From the features, identify a word of interest. Load the plain text into a concordance. Search for the word, and examine the surrounding words to see how the word was used. This is like ^F on steroids.

Topic modeling is a useful process for denoting themes. Load the texts into a topic modeler, and denote the number of desired topics. Run the modeler. Evaluate the results. Repeat. Associate each document with a metadata value, such as date. Run the modeler. Pivot the results on the date value. Plot the results as a line chart to see how topics ebbed & flowed over time.

If the corpus is big enough (at least a million words long), then word embedding is a useful way to learn what words are used in conjunction with other words. Those words can then be fed back into a concordance.

Full text indexing is also a useful analysis tool. Index the corpus, complete with metadata. Identify words or phrases of interest. Search the index to learn what documents are most relevant. Use a concordance to read just those documents.

Listing grammars is also useful. Identify a thing (noun) of interest. Identify an action (verb) of interest. Apply a language model to a given text and output a list of all sentences containing the given thing and action to learn how they are used together. An example is "Ahab has", and the result will be lists of matching sentences including "Ahab has...", "Ahab had...", or "Ahab will have..."
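As a small example of the quickest of these tools (concordancing), the NLTK library can be used to create a rudimentary concordance. The sketch below assumes NLTK and its "punkt" tokenizer data have been installed, and that a plain text file named moby-dick.txt exists:

# a rudimentary concordance built with NLTK
from nltk.tokenize import word_tokenize
from nltk.text import Text

# read the plain text and tokenize it into words
raw = open("moby-dick.txt", encoding="utf-8").read()
tokens = word_tokenize(raw)

# load the tokens into NLTK's Text object, and search for a word of interest
text = Text(tokens)
text.concordance("whale", width=72, lines=20)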
Because computers only give you observations, and since you must interpret the observations, this step is difficult. Remember, using a computer to help you with your reading does not give you truth. Only you can do that.
For example, what does it mean when you discover such and such is a constant theme over time? Is there something to be said when you observe that one theme waxes while another wanes? Now that you have created a list of all the I-sentences in a corpus, and you observe that their predicates all belong to the same family of actions, what sorts of conclusions can you draw? Maybe you have identified a word or phrase of particular interest. Maybe you are not using the computer to give you observations so much as to help you winnow down and narrow your field of investigation; if so, then you can use the computer to identify a smaller set of documents for traditional reading.
Suppose you are interested in war. You create a collection of documents you think can help you learn more about war. You then use the computer -- in any number of ways -- to determine what words and phrases surround the word "war" and its synonyms. Based on the results, you would be able to describe war in a number of different ways. If the documents were associated with times and/or places, then the ideas of war could be compared across those times and places.
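To continue the example, a word embedding model is one way to learn what words keep company with "war". The following sketch assumes gensim (version 4 or later) is installed and that a directory named corpus contains plain text files totaling at least a million words; the results can then be fed back into a concordance:

# use word embedding (gensim's Word2Vec) to learn what words keep company with "war"
import re
from pathlib import Path
from gensim.models import Word2Vec

# read every plain text file in the corpus and split it into "sentences" (lists of words)
sentences = []
for file in Path("corpus").glob("*.txt"):
    text = file.read_text(encoding="utf-8").lower()
    for line in re.split(r"[.?!]", text):
        words = re.findall(r"[a-z']+", line)
        if words:
            sentences.append(words)

# train a small word embedding model
model = Word2Vec(sentences, vector_size=100, window=5, min_count=5)

# list the words most similar to "war"
print(model.wv.most_similar("war"))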
Once you (repeatably) model your corpus using ngrams, parts-of-speech, named-entities, statistically significant keywords, topic modeling, word embedding, concordancing, full-text indexing, grammars, etc., then you will become very familiar with the corpus, and you will not only be able to identify patterns and anomalies, but you will also be able to point to them and back up any assertions you make.
Ask yourself, “To what degree did I address the research question?” If the degree is high, or if you are tired, then stop. Otherwise, go to Step #1.