Text mining and natural language processing

This is an introduction to text mining and natural language processing, and it outlines how you can use a computer to supplement the traditional reading process.

Text Preparation

A prerequisite for any text mining or natural language processing task is to transform your documents into a format called "plain text". 

In order to do text mining or natural language processing, you must have plain text files, and Tika can extract plain text from many different file types. Here's how:

  1. Download Tika (http://tika.apache.org)
  2. Save it on your desktop, and rename it to "tika.jar", which will make your life easier
  3. Double-click (open) it
  4. If that doesn't work, then try opening it from the command line: java -jar tika.jar
  5. Drag a file on to the resulting window, wait, and view the resulting "plain text"
  6. Repeat a number of times in order to practice
  7. From the command line, read the help text: java -jar tika.jar --help
  8. Download a sample corpus (http://dh.crc.nd.edu/tmp/tika-corpus.zip)
  9. Save the supplied corpus to your desktop
  10. Create a directory named output on your desktop
  11. From the command line, run Tika in a batch mode: java -jar tika.jar -t -i corpus -o output
  12. Repeat a few more times, but with your own content
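
By the way, the batch mode command in Step #11 can also be scripted. Below is a minimal sketch in Python, assuming Java is installed, tika.jar has been saved to your desktop (Step #2), and directories named corpus and output sit on your desktop as well (Steps #9 and #10):

# extract-plain-text.py - a minimal sketch; assumes Java is installed and
# tika.jar, corpus, and output live on the desktop as described above

import subprocess
from pathlib import Path

TIKA   = Path.home() / 'Desktop' / 'tika.jar'   # Step #2
CORPUS = Path.home() / 'Desktop' / 'corpus'     # Step #9
OUTPUT = Path.home() / 'Desktop' / 'output'     # Step #10

# make sure the output directory exists
OUTPUT.mkdir(exist_ok=True)

# run Tika in batch mode, just like Step #11
subprocess.run(['java', '-jar', str(TIKA), '-t', '-i', str(CORPUS), '-o', str(OUTPUT)], check=True)

print('Done; plain text files are in', OUTPUT)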

Concordancing

Concordancing is the poor man's search engine. It is a quick, easy, and effective way to begin analyzing your corpus.

Here is a cheatsheet for today's concordance workshop:

  1. Download a set of files: http://dh.crc.nd.edu/tmp/cooper-deerslayer-1841.zip
  2. Uncompress the result
  3. Download a concordance program (AntConc): https://www.laurenceanthony.net/software/antconc/
  4. Use AntConc to process the whole of the uncompressed directory
  5. Configure AntConc to use a stop word list (attached)
  6. Repeat Step #4, this time more thoroughly

A concordance is the oldest of text mining tools. Nowadays it is often called a keyword-in-context (KWIC) tool, and it is a wonderful way to see how a given word is used across a corpus.
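
If you are curious about what a keyword-in-context tool does under the hood, below is a minimal Python sketch of the idea. It is not AntConc, and the file name (deerslayer.txt) and the query word are examples only:

# kwic.py - a minimal keyword-in-context (concordance) sketch; the file
# name and the query word are examples only

import re

def kwic(text, word, width=30):
    '''Return a list of keyword-in-context lines for the given word.'''
    lines = []
    for match in re.finditer(r'\b%s\b' % re.escape(word), text, re.IGNORECASE):
        left  = text[max(0, match.start() - width):match.start()].replace('\n', ' ')
        right = text[match.end():match.end() + width].replace('\n', ' ')
        lines.append(left.rjust(width) + '  ' + match.group(0) + '  ' + right)
    return lines

# read a plain text file and display every occurrence of a given word
with open('deerslayer.txt') as handle:
    text = handle.read()

for line in kwic(text, 'rifle'):
    print(line)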
 

Topic Modeling

Topic modeling is an "unsupervised machine learning" process used to divide a corpus into sub-corpora. It enables the reader to identify themes and observe possible trends. Here's how:

  1. Download and install a topic modeling tool (https://github.com/senderle/topic-modeling-tool)
  2. Download a few sample corpora (http://dh.crc.nd.edu/sandbox/workshop-data/modeling.zip)
  3. Model the American corpus with a single topic and single dimension
  4. Examine the raw modeling output as well as both the HTML and CSV output
  5. Return to Step #3 a few times, slowly increasing the number of topics & dimensions
  6. Model the Deerslayer corpus using seven (or so) topics and a few dimensions
  7. Ask yourself, "What single word can I associate with each 'topic'?"
  8. Using Excel, open topics-metadata.csv and create a pivot table illustrating the flow of topics over chapters
  9. Model the American corpus (again), but this time specify the use of a metadata file (American.csv)
  10. Repeat Steps #8 & #9, but this time illustrate the themes across authors
  11. If you have time, repeat Step #9 & #10 for the Baxter corpus and Baxter.csv, but this time illustrate themes (trends) over years

Topic modeling is one way to "read" a large amount of content quickly & easily, but it is not a substitute for the traditional reading process. Instead, topic modeling supplements traditional reading.
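
The Topic Modeling Tool is point-and-click, but the underlying technique (latent Dirichlet allocation) can also be driven from Python. The following is a minimal sketch using scikit-learn; it is not the workshop's tool, and the directory name and the number of topics are assumptions you will want to adjust:

# topics.py - a minimal topic modeling sketch using scikit-learn's
# implementation of latent Dirichlet allocation; the corpus directory
# and the number of topics are assumptions

from pathlib import Path
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

CORPUS = Path('corpus')   # a directory of plain text files
TOPICS = 7                # the number of topics, as in Step #6

# read each file in the corpus into a list of documents
documents = [file.read_text(errors='ignore') for file in CORPUS.glob('*.txt')]

# transform the documents into a document-term matrix, sans stop words
vectorizer = CountVectorizer(stop_words='english')
matrix = vectorizer.fit_transform(documents)

# do the actual modeling
model = LatentDirichletAllocation(n_components=TOPICS, random_state=0)
model.fit(matrix)

# for each topic, output its most significant words
words = vectorizer.get_feature_names_out()
for index, topic in enumerate(model.components_):
    top = [words[i] for i in topic.argsort()[-8:][::-1]]
    print('topic %d: %s' % (index, ' '.join(top)))

Once the topics are output, ask yourself the same question as in Step #7: what single word can you associate with each topic?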

Parts-Of-Speech And Named-Entities

Language follows patterns, and because of those patterns it is possible to extract the parts-of-speech and named entities from a text. Once these things are in hand, the reader (you) can count & tabulate them to answer questions regarding who, what, when, how, etc. Here's how:

  1. Download a sample corpus (http://dh.crc.nd.edu/tmp/corpus.zip)
  2. Open any of the resulting files and ask yourself, "Who is in these texts? What do they do? To what places do they refer? How are things described?"
  3. Go to http://dh.crc.nd.edu/sandbox/pande/txt2pos.cgi
  4. Copy & paste the whole of carol.txt into the resulting form, and submit the form
  5. When the script is done, save the result to a file named pos.txt
  6. Open pos.txt and ask yourself the questions again
  7. Download & install OpenRefine (http://openrefine.org)
  8. Create a new OpenRefine project with the pos.txt file
  9. Use the Text Facet and other options to count & tabulate words, lemmas, and parts-of-speech
  10. Yet again, ask yourself the questions
  11. Repeat Steps #3 through #10 but with named entities (http://dh.crc.nd.edu/sandbox/pande/txt2ent.cgi)
  12. Repeat Steps #3 through #11 with the other items in the corpus

Part-of-speech and named-entity extraction are language dependent; such is a limitation of the technology. On the other hand, the use of this technology is the first step to extracting meaning (as opposed to mere data) from a corpus.
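
The txt2pos.cgi and txt2ent.cgi scripts above do their work on a remote server. If you would rather work locally, libraries such as spaCy produce the same sort of output. Here is a minimal sketch, assuming spaCy and its small English model (en_core_web_sm) have already been installed, and using carol.txt from Step #4:

# pos-and-entities.py - a minimal sketch using spaCy to extract
# parts-of-speech and named entities; assumes spaCy and the
# en_core_web_sm model have been installed beforehand

import spacy

# load the (English) language model and read the text
nlp = spacy.load('en_core_web_sm')
with open('carol.txt') as handle:
    doc = nlp(handle.read())

# output words, lemmas, and parts-of-speech, one token per line
for token in doc:
    if token.is_alpha:
        print('%s\t%s\t%s' % (token.text, token.lemma_, token.pos_))

# output named entities and their types
for entity in doc.ents:
    print('%s\t%s' % (entity.text, entity.label_))

The resulting tab-delimited output can be saved to a file and opened in OpenRefine, just as in Steps #7 through #10.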

Voyant Tools


Voyant Tools is a quick and easy way to begin learning about natural language processing. Here's how:

  1. Begin by opening your Web browser to https://voyant-tools.org
  2. Use the Upload button to submit one or more files from your computer

Voyant Tools will do good work against the content, and you will be able to see things like a word cloud illustrating the frequency of words, a dispersion plot illustrating how the most frequent words are distributed throughout the corpus, and a concordance (a sort of keyword-in-context tool).
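
Behind the word cloud is nothing more than word frequency. If you are curious, the same sort of tabulation can be computed with a few lines of Python; the file name below is only an example:

# frequencies.py - a minimal sketch for counting word frequencies, the raw
# material behind a word cloud; the file name is an example only

import re
from collections import Counter

# read a plain text file and reduce it to a list of lower-cased words
with open('carol.txt') as handle:
    words = re.findall(r'[a-z]+', handle.read().lower())

# count the words and output the twenty-five most frequent ones
for word, count in Counter(words).most_common(25):
    print('%s\t%d' % (word, count))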

There are many, many more features of Voyant, and you are encouraged to:

  1. Click on anything and everything to see what happens, and
  2. Read the online documentation: https://voyant-tools.org/docs/#!/guide
  
The Distant Reader

The Distant Reader is a tool of my own design. Given an (almost) arbitrary number of files of just about any type, the Reader creates a dataset which is amenable to computation -- "reading". For more complete instructions, see the official documentation.

The Reader is a suite of functions written in a programming language called Python. To get the Reader to work on your computer, you must install the Python programming language, a few supporting tools, and the Reader software itself. Here is an installation outline:

  1. Install Anaconda, a programming development environment
  2. Install Java, another programming environment
  3. From the command line, install the Reader: pip install reader-toolbox
  
Once you get this far, you ought to be able to run the Reader software from the command line with this command: rdr. The result ought to be a menu of subcommands and look something like this:

$ rdr
Usage: rdr [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  about          Output a brief description and version number of the...
  adr            Filter email addresses from <carrel>
  bib            Output rudimentary bibliographics from <carrel>
  browse         Peruse <carrel> as a file system
  build          Create <carrel> from files in <directory>
  catalog        List study carrels
  cluster        Apply dimension reduction to <carrel> and visualize the...
  collocations   Output network graph based on bigram collocations in...
  concordance    A poor man's search engine
  documentation  Use your Web browser to read the Toolbox (rdr) online...
  download       Cache <carrel> from the public library of study carrels
  edit           Modify the stop word list of <carrel>
  ent            Filter named entities and types of entities found in...
  get            Echo the values denoted by the set subcommand
  grammars       Extract sentence fragments from <carrel> as in:
  info           Output metadata describing <carrel>
  ngrams         Output and list words or phrases found in <carrel>
  notebooks      Download, list, and run Toolbox-specific Jupyter Notebooks
  play           Play the word game called hangman
  pos            Filter parts-of-speech, words, and lemmas found in <carrel>
  read           Open <carrel> in your Web browser
  readability    Report on the readability (Flesch score) of items in...
  search         Perform a full text query against <carrel>
  semantics      Apply semantic indexing against <carrel>
  set            Configure the location of study carrels, the subsystem...
  sizes          Report on the sizes (in words) of items in <carrel>
  sql            Use SQL queries against the database of <carrel>
  summarize      Summarize <carrel>
  tm             Apply topic modeling against <carrel>
  url            Filter URLs and domains from <carrel>
  web            Experimental Web interface to your Distant Reader study...
  wrd            Filter statistically computed keywords from <carrel>

To create data sets (affectionately called "study carrels"), you must use the rdr build command. Here's how:

  1. Create a new directory on your desktop and call it "practice"
  2. Put between four and twelve documents of any type into the new directory
  3. Open your command line tool, and change to the desktop directory
  4. Create a study carrel with the following command: rdr build practice practice -s -e
  
If all goes well, the Reader will process the content of the practice directory in less than a few minutes, and you will then be able to apply any of the subcommands to the result. For example, you will be able to count & tabulate all of the words in the carrel with the following command: rdr ngrams practice -c | more
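
Because the Toolbox is a command-line program, it can also be called from your own Python scripts. Here is a minimal sketch that simply shells out to the ngrams command from above; it assumes rdr has been installed and the practice carrel has already been built:

# count-ngrams.py - a minimal sketch calling the Reader from Python;
# assumes rdr is installed and the "practice" carrel has been built

import subprocess

# run the documented command: rdr ngrams practice -c
result = subprocess.run(['rdr', 'ngrams', 'practice', '-c'],
                        capture_output=True, text=True, check=True)

# echo the first twenty-five lines of the tabulation
for line in result.stdout.splitlines()[:25]:
    print(line)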

Again, for more complete instructions, see the official documentation.

Python And Text Mining

Python is a popular programming language, and it is often used for text mining and natural language processing. To give you an introduction to the use of Python for natural language processing, I have created a Jupyter Notebook and hosted it on GitHub. Please see the GitHub repository for more information.
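
To whet your appetite, the following is a tiny, self-contained sketch (not the contents of the notebook) using the Natural Language Toolkit (NLTK); it assumes NLTK and its punkt tokenizer data have been installed, and the file name is an example only:

# getting-started.py - a tiny taste of Python-based text mining with the
# Natural Language Toolkit (NLTK); assumes NLTK and its "punkt" tokenizer
# data have been installed; the file name is an example only

import nltk

# read a plain text file and tokenize it into lower-cased words
with open('carol.txt') as handle:
    words = [word.lower() for word in nltk.word_tokenize(handle.read()) if word.isalpha()]

# measure lexical diversity, the ratio of unique words to all words
print('lexical diversity: %.3f' % (len(set(words)) / len(words)))

# count & tabulate the most frequent two-word phrases (bigrams)
bigrams = nltk.FreqDist(nltk.bigrams(words))
for (first, second), count in bigrams.most_common(10):
    print('%s %s\t%d' % (first, second, count))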