Hesburgh Library
Navari Family Center for Digital Scholarship, 250E
University of Notre Dame
Notre Dame, IN 46556
(574) 631-8604
emorgan@nd.edu
There is no single tool or computer program one can use to do text mining and natural language processing. Instead, the student, researcher, or scholar must employ a suite of different types of tools, such as the following:
In order to do text mining you must have plain text documents, not PDF files, not Word files, etc. What are "plain text files"? Plain text files are documents containing zero font styling and zero formatting beyond simple spacing and carriage returns. Plain text files usually have a file extension of ".txt". Don't have plain text files? Only have PDF documents or something else? No worries, because there are a number of applications enabling you to extract plain text from your PDF documents, etc.
To extract plain text data from PDF files, etc., I suggest a program called Tika. Tika is a piece of free software written in a programming language called Java, and Tika is as "free as a free kitten". Once Tika has been downloaded to your computer, you can use it in the graphical user interface (GUI) mode or from the command line. The GUI interface is useful for a small number of documents (four or five), but beyond that, you will want to use the command line interface. Don't worry. The process is not difficult. It just takes practice.
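To give a flavor of the command-line route, below is a minimal sketch, written in Python, that calls Tika on a folder of PDF files and saves the results as plain text. It assumes Java is installed, that the Tika application has been saved as "tika-app.jar", and that the PDF files live in a folder named "documents"; adjust the names to match your own setup.

  # batch-extract plain text from PDF files with Tika; the jar and folder
  # names are assumptions
  import pathlib
  import subprocess

  for pdf in pathlib.Path("documents").glob("*.pdf"):
      plain = pdf.with_suffix(".txt")
      with open(plain, "w", encoding="utf-8") as handle:
          subprocess.run(
              ["java", "-jar", "tika-app.jar", "--text", str(pdf)],
              stdout=handle,
              check=True,
          )

The result is one .txt file sitting next to each .pdf file, ready for the tools described below.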
Text mining and natural language processing are all about text, and you will need a text editor to do your work.
Text editors and word processors are not the same things. The biggest difference between a text editor and a word processor is that the former has nothing to do with formatting or page layout; in a text editor there is no such thing as bolding, italics, centering, etc. Just as importantly, a text editor will only save files as plain text (.txt) files, not binary files that are only readable with specific applications.
There are many things one might do with a text editor that a word processor cannot do. For example, a text editor can change the case of letters from upper to lower. A text editor will include a robust find/replace functionality with the ability to use regular expressions. Very important! A text editor will enable you to turn line wrapping on or off. A text editor will enable you to save a file in an encoding called UTF-8 as opposed to something more operating-system specific. A text editor will enable you to save a file as a Linux, Macintosh, or Windows text file with the proper line endings.
Many of the files we use on a daily basis are plain text files readable/writable with a text editor. Good examples are comma-separated value (.csv) files and HTML (.html) files. Text files are everywhere, and opening them up in a text editor can be quite enlightening. Opening up files in a text editor is a fundamental skill, especially when doing text mining and natural language processing.
Holy and religious wars are fought over which text editor is the best. I'm not going there. That said, you can get away with using NotePad or WordPad on Windows computers and TextEdit on Macintosh computers, but just barely. Instead, I recommend NotePad++ for people using Windows computers and BBEdit for people using Macintosh computers. Both NotePad++ and BBEdit are freely available to download and use. The latter's features are expanded if it is purchased, but BBEdit will not pester you to make a purchase, nor will it pester you about your usage. Both editors work quite well and support all of the features and functions outlined above.
Download and install a text editor. Practice using it. You might discover it to be a refreshing change because it allows you to focus on what you write as opposed to how it is presented.
The use of tag (word) clouds to visualize frequencies is often considered sophomoric, but used correctly and with enough context, they can result in compelling illustrations.
Given the whole of a text, a tag cloud application will count & tabulate all the words (tokens), optionally normalize the text into upper- or lower-case, optionally remove function (stop) words, and ultimately illustrate the frequency of the words, where more frequently occurring words appear larger as opposed to smaller. Some tag cloud applications can ingest frequencies instead of entire texts, which gives you more control over what is visualized and how.
As an example, below is a word cloud illustrating the frequency of statistically significant keywords from Homer's Iliad and Odyssey. From the visualization the student, researcher, or scholar can begin to learn of the works' aboutness:
An application called Wordle (not the currently popular word game) is my favorite tag cloud application. It is fast, easy-to-use, cross-platform, and creates beautiful visualizations. Unfortunately, Wordle has ceased being supported by its original developer, and finding it on the 'Net is difficult. Download Wordle from its archived location on the Wayback Machine.
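If Wordle proves hard to track down, or if you simply prefer to script the process, tag clouds can also be generated programmatically. Below is a minimal sketch using the third-party wordcloud library for Python; the input file ("iliad.txt") and the output file ("cloud.png") are assumptions for the sake of illustration.

  # count & tabulate the words in a plain text file, remove stop words, and
  # illustrate the frequencies as a word cloud
  from wordcloud import WordCloud, STOPWORDS

  text = open("iliad.txt", encoding="utf-8").read().lower()   # normalize to lower-case
  cloud = WordCloud(width=800, height=400,
                    background_color="white",
                    stopwords=STOPWORDS)                       # remove function (stop) words
  cloud.generate(text)                                         # count, tabulate, and scale
  cloud.to_file("cloud.png")                                   # save the illustration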
Originally developed in the 13th century for the purpose of understanding religious works, concordances are among the oldest of text mining tools.
A concordance quickly and easily enables you to see how a given word is used in context with other words. Open one or more plain text files in a concordance, enter a word or phrase of interest, and a concordance will return lists of words with the query placed in the middle and four or five words on either side. Since "words are known by the company they keep", a concordance makes it easy to find that company. For example, below, the student, researcher, or scholar can begin to see how the word "love" is used in Homer's Iliad and Odyssey:
I recommend a concordance application called AntConc. It is fast, full-featured, cross-platform, and free ("free as a free kitten", that is). Not only does it support concordancing functions, but it also supports other modeling techniques such as dispersion plots, collocations, ngrams, and frequencies.
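For the programmatically inclined, the gist of a concordance can be expressed in a few lines of Python. Below is a minimal keyword-in-context sketch; the file name ("iliad.txt"), the query ("love"), and the five-word window are assumptions, not requirements.

  # print each occurrence of a query word with a few words of context on
  # either side -- a poor man's concordance
  def concordance(tokens, query, width=5):
      query = query.lower()
      for i, token in enumerate(tokens):
          if token.lower().strip('.,;:!?"') == query:
              left = " ".join(tokens[max(0, i - width):i])
              right = " ".join(tokens[i + 1:i + 1 + width])
              print(f"{left:>40}  {token}  {right}")

  tokens = open("iliad.txt", encoding="utf-8").read().split()
  concordance(tokens, "love")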
Topic modeling is an unsupervised machine learning process used for the purposes of enumerating latent themes in a corpus. It is a popular clustering technique which outputs the aboutness of one or more documents. In most cases, topic modeling is implemented using an algorithm called Latent Dirichlet Allocation (LDA).
For example, if one were to topic model Homer's Iliad and Odyssey, and if one were to specify eight topics, then the resulting topics, weights, and elaborating features would look something like this:
If you know anything about the Iliad and the Odyssey, then you will see that the results are a pretty good enumeration of the things mentioned in the stories.
The grand-daddy of topic modeling tools is called MALLET. It is a full-featured application but must be run from the command line, which can be intimidating. A simplified version of MALLET exists in a graphical user interface (GUI) form, and it is called Topic Modeling Tool. I suggest you begin with Topic Modeling Tool, and then graduate to MALLET.
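Topic modeling can also be done in Python. Below is a minimal sketch using the gensim library's implementation of LDA, offered here as an alternative to MALLET and Topic Modeling Tool; the folder of plain text files ("corpus") and the choice of eight topics are assumptions.

  # model a folder of plain text files as eight topics with gensim's LDA
  import pathlib
  from gensim import corpora, models

  texts = []
  for path in pathlib.Path("corpus").glob("*.txt"):
      words = [w for w in path.read_text(encoding="utf-8").lower().split() if w.isalpha()]
      texts.append(words)

  dictionary = corpora.Dictionary(texts)                 # map each word to an id
  bows = [dictionary.doc2bow(t) for t in texts]          # bag-of-words vectors
  lda = models.LdaModel(bows, num_topics=8, id2word=dictionary, passes=10)

  for topic in lda.print_topics(num_words=8):            # topics, weights, and features
      print(topic)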
Like any type of reading, text mining and natural language processing is not a perfect process; misinterpretations, errors, and anomalies abound. These imperfections occur for two reasons: 1) human error, or 2) computer error. In either case, you will want to correct the imperfections to some degree, but it will be nearly impossible to correct everything. Correcting the imperfections is often called "cleaning", but I prefer the word "normalizing".
For example, you might want to change the capitalization of all the words in your document(s) to lower-case. You might want to remove all the words containing digits. You might want to replace runs of multiple carriage returns with double carriage returns in order to make your documents more human-readable. In these cases, you will want to exploit your text editor's find/replace functionality to do the good work.
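If the same normalizations need to be applied over and over, they can also be scripted. Below is a minimal sketch using Python's built-in re module; the input and output file names are assumptions, and each regular expression mirrors one of the find/replace operations described above.

  # normalize a plain text file: lower-case it, remove words containing
  # digits, and collapse runs of carriage returns
  import re

  text = open("document.txt", encoding="utf-8").read()
  text = text.lower()                                  # change to lower-case
  text = re.sub(r"\S*\d\S*", "", text)                 # remove words containing digits
  text = re.sub(r"\n{3,}", "\n\n", text)               # collapse multiple carriage returns
  open("document-normalized.txt", "w", encoding="utf-8").write(text)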
In other cases, a machine learning process may have been applied to your documents (such as named-entity extraction), and the result denoted Plato as a place or Christmas as a person. These things are incorrect. Since these machine learning processes usually result in a data set in the form of a matrix (comma-separated values or CSV files), a spreadsheet or database application is an appropriate tool to perform the corrections.
On the other hand, the correction process can be done more intelligently if you use a program called OpenRefine. OpenRefine excels at opening, evaluating, normalizing, and reporting on the content in matrix-like files. Like everything else outlined here, OpenRefine requires practice, but the time spent learning how to use OpenRefine turns into an investment that pays for itself quickly. Once you get to know OpenRefine you will find all sorts of interesting uses for it.
A small number of systems -- holistic systems -- try to do much of the above and more, but alas, the use of computers to do text analysis is in its infancy compared to the use of computers to operate on spreadsheets. Your expectations may need to be tempered.
One such system is called Voyant Tools or simply Voyant. Given one or more documents, Voyant will convert the documents to plain text, count & tabulate all sorts of things, and provide the means to visualize the results. It supports everything from simple word clouds, to concordancing, to topic modeling. This web-based tool is a very good introduction to what can be done.
Another system is one of my own design -- the Distant Reader and the Distant Reader Toolbox. Given one or more documents, the Toolbox will create a data set comprised of many different plain text and delimited files. These files can then be analyzed using additional functions of the Toolbox, graphical applications such as the ones outlined above, or through the use of computer languages such as Python or R.
API is an acronym for "Application Programming Interface", and programming APIs are software toolkits for doing specific tasks from within your own programs.
While just about any programming language can be used to do text mining and natural language processing, Python seems to be the most popular right now. This is true for two reasons: 1) in general, Python is used by a growing number of people, and 2) there are a number of mature natural language processing software toolboxes written in Python.
If you know a little bit of Python, then there are two tools I suggest you learn how to use. The first has been around for a while and it is called the Natural Language Toolkit or NLTK for short. Given a set of one or more plain text files, the programmer can easily parse the files into sentences, extract parts-of-speech, extract named-entities, compute statistically significant two-word collocations (think "bigrams"), output dispersion plots, etc. Most people cut their teeth on the NLTK.
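Below is a minimal NLTK sketch touching on each of the functions just mentioned. The file name ("iliad.txt") and the plotted words are assumptions, and it presumes the necessary NLTK data packages (tokenizers, taggers, chunkers) have already been fetched with nltk.download().

  # parse a plain text file into sentences, parts-of-speech, named entities,
  # collocations, and a dispersion plot with the NLTK
  import nltk

  raw = open("iliad.txt", encoding="utf-8").read()

  sentences = nltk.sent_tokenize(raw)                  # parse into sentences
  tokens = nltk.word_tokenize(raw)                     # parse into words (tokens)
  tagged = nltk.pos_tag(tokens)                        # extract parts-of-speech
  entities = nltk.ne_chunk(tagged[:500])               # extract named entities

  text = nltk.Text(tokens)
  text.collocations()                                  # significant two-word collocations
  text.dispersion_plot(["love", "war", "sea"])         # dispersion plot (needs matplotlib)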
The second recommended Python toolkit is called spaCy. It is more difficult to use than NLTK, but its results are more robust since it relies on sets of previously created machine learning models (semantics) instead of the mere shapes (syntax) of words. These models are available for dozens of languages, from English to German to French to Japanese. Moreover, the programmer can create their own models. Just like NLTK, given one or more plain text files, spaCy will parse the files into sentences, extract parts-of-speech, and extract named-entities. But it also supports dependency parsing, which is akin to diagramming sentences and learning of a sentence's different parts.
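Below is the spaCy equivalent, again a minimal sketch. The file name is an assumption, and it presumes the small English model has been installed beforehand with "python -m spacy download en_core_web_sm".

  # parse a plain text file into sentences, parts-of-speech, dependencies, and
  # named entities with spaCy
  import spacy

  nlp = spacy.load("en_core_web_sm")
  doc = nlp(open("iliad.txt", encoding="utf-8").read())

  for sentence in list(doc.sents)[:5]:                 # parse into sentences
      print(sentence.text.strip())

  for token in doc[:20]:                               # parts-of-speech and dependencies
      print(token.text, token.pos_, token.dep_, token.head.text)

  for entity in doc.ents[:20]:                         # extract named entities
      print(entity.text, entity.label_)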
Start with NLTK and then work towards spaCy.