Hesburgh Library
Navari Family Center for Digital Scholarship, 250E
University of Notre Dame
Notre Dame, IN 46556
(574) 631-8604
emorgan@nd.edu
A prerequisite for any text mining or natural language processing task is to transform your documents into a format called "plain text".
In order to do text mining or natural language processing, you MUST have plain text files. Tika extracts plain text from MANY different file types. Here's how:
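For example, Tika can be called from Python through the tika-python library. The following is only a sketch, and it assumes Java and tika-python (pip install tika) are installed; the file names are placeholders:

# extract plain text with Apache Tika via the tika-python library;
# the library starts a local Tika server behind the scenes, so Java is required
from tika import parser

parsed = parser.from_file('mydocument.pdf')      # hypothetical input file

# save the extracted text as a plain text file for later mining
with open('mydocument.txt', 'w') as handle:
    handle.write(parsed['content'] or '')

Run against a PDF file, a Word document, or an HTML page, the result is a plain text file ready for analysis.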
Concordancing is the poor man's search engine. It is a quick, easy, and effective way to begin analyzing your corpus.
Here is a cheatsheet for today's concordance workshop:
A concordance is the oldest of text mining tools. Nowadays, a concordance is often called a keyword-in-context tool. A concordance is a wonderful way to see how a given word is used in a corpus.
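As a sketch of concordancing in code, the NLTK library includes a ready-made keyword-in-context function. This assumes NLTK is installed (pip install nltk); the file name and the query word are placeholders:

# a simple concordance using NLTK
import nltk
from nltk.text import Text

nltk.download('punkt')                           # tokenizer models; first run only
raw = open('walden.txt').read()                  # any plain text file
text = Text(nltk.word_tokenize(raw))

# list occurrences of a word surrounded by its context
text.concordance('pond', width=80, lines=10)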
Topic modeling is an "unsupervised machine learning" process used to divide a corpus into sub-corpora. It enables the reader to identify themes and observe possible trends. Here's how:
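As one possible sketch (not necessarily the tool used in the workshop), scikit-learn's implementation of latent Dirichlet allocation (LDA) divides a small corpus into topics; the file names and the number of topics below are assumptions:

# topic modeling with LDA via scikit-learn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# read a handful of plain text files; the names are placeholders
files = ['alpha.txt', 'beta.txt', 'gamma.txt', 'delta.txt']
documents = [open(file).read() for file in files]

# count the words, and then model the corpus as three topics
vectorizer = CountVectorizer(stop_words='english')
matrix = vectorizer.fit_transform(documents)
model = LatentDirichletAllocation(n_components=3, random_state=0)
model.fit(matrix)

# output the most significant words of each topic
words = vectorizer.get_feature_names_out()
for i, topic in enumerate(model.components_):
    top = [words[j] for j in topic.argsort()[::-1][:7]]
    print(i, ' '.join(top))

Each line of output is a "topic" -- a cluster of words that tend to appear together -- and it is the reader's job to give each cluster a label.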
Topic modeling is one way to "read" a large amount of content quickly & easily, but it is not a substitute for the traditional reading process. Instead, topic modeling is a supplement to traditional reading.
Language follows patterns, and because of those patterns it is possible to extract the parts-of-speech and named entities from a text. Once these things are in hand, the reader (you) can count & tabulate them to answer questions regarding who, what, when, how, etc. Here's how:
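A sketch using the spaCy library follows. It assumes spaCy (pip install spacy) and its small English model (python -m spacy download en_core_web_sm) have been installed; the input file name is a placeholder:

# extract and tabulate parts-of-speech and named entities with spaCy
import spacy
from collections import Counter

nlp = spacy.load('en_core_web_sm')               # small English language model
doc = nlp(open('mydocument.txt').read())         # hypothetical plain text file

# count & tabulate the parts-of-speech and the named entities
pos = Counter(token.pos_ for token in doc)
entities = Counter((ent.text, ent.label_) for ent in doc.ents)
print(pos.most_common(10))
print(entities.most_common(10))

The tabulated entities (people, places, organizations, dates, etc.) begin to answer the who, what, when, and where questions.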
The extraction of parts-of-speech and named entities is language dependent; this is a limitation of the technology. On the other hand, the use of this technology is the first step toward extracting meaning (as opposed to mere data) from a corpus.
Voyant Tools is a quick and easy way to begin learning about natural language processing. Here's how:
1. Begin by opening your Web browser to https://voyant-tools.org
2. Use the Upload button to submit one or more files from your computer
Voyant Tools will do good work against the content, and you will be able to see things like a word cloud illustrating the frequency of words, a dispersion plot illustrating how frequent words are used throughout the corpus, and a concordance -- a sort of keyword-in-context tool.
There are many, many more features of Voyant, and you are encouraged to:
1. Click on anything and everything to see what happens, and
2. Read the online documentation: https://voyant-tools.org/docs/#!/guide
The Distant Reader is a tool of my own design. Given an (almost) arbitrary number of files of just about any type, the Reader creates a dataset which is amenable to computation -- "reading". For more complete instructions, see the official documentation.
The Reader is a suite of functions written in a programming language called Python. To get the Reader to work on your computer, you must install the Python programming language, a few supporting tools, and the Reader software itself. Here is an installation outline:
1. install Anaconda, a programming development environment
2. install Java, another programming environment
3. from the command line, install the Reader: pip install reader-toolbox
Once you get this far, you ought to be able to run the Reader software from the command line with this command: rdr. The result ought to be a menu of subcommands, and it will look something like this:
$ rdr
Usage: rdr [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  about          Output a brief description and version number of the...
  adr            Filter email addresses from <carrel>
  bib            Output rudimentary bibliographics from <carrel>
  browse         Peruse <carrel> as a file system
  build          Create <carrel> from files in <directory>
  catalog        List study carrels
  cluster        Apply dimension reduction to <carrel> and visualize the...
  collocations   Output network graph based on bigram collocations in...
  concordance    A poor man's search engine
  documentation  Use your Web browser to read the Toolbox (rdr) online...
  download       Cache <carrel> from the public library of study carrels
  edit           Modify the stop word list of <carrel>
  ent            Filter named entities and types of entities found in...
  get            Echo the values denoted by the set subcommand
  grammars       Extract sentence fragments from <carrel> as in:
  info           Output metadata describing <carrel>
  ngrams         Output and list words or phrases found in <carrel>
  notebooks      Download, list, and run Toolbox-specific Jupyter Notebooks
  play           Play the word game called hangman
  pos            Filter parts-of-speech, words, and lemmas found in <carrel>
  read           Open <carrel> in your Web browser
  readability    Report on the readability (Flesch score) of items in...
  search         Perform a full text query against <carrel>
  semantics      Apply semantic indexing against <carrel>
  set            Configure the location of study carrels, the subsystem...
  sizes          Report on the sizes (in words) of items in <carrel>
  sql            Use SQL queries against the database of <carrel>
  summarize      Summarize <carrel>
  tm             Apply topic modeling against <carrel>
  url            Filter URLs and domains from <carrel>
  web            Experimental Web interface to your Distant Reader study...
  wrd            Filter statistically computed keywords from <carrel>
To create data sets (affectionately called "study carrels"), you must use the rdr build command. Here's how:
1. create a new directory on your desktop and call it "practice"
2. put between four and twelve documents of any type into the new directory
3. open your command line tool, and change to the desktop directory
4. create a study carrel with the following command: rdr build practice practice -s -e
If all goes well, the Reader will process the content of the practice directory in a few minutes or less, and you will then be able to apply any of the subcommands to the result. For example, you will be able to count & tabulate all of the words in the carrel with the following command: rdr ngrams practice -c | more
Again, for more complete instructions, see the official documentation.
Python is a popular programming language, and it is often used for text mining and natural language processing. To give you an introduction to the use of Python and natural language processing, I have created a Jupyter Notebook and hosted it on GitHub. Please see the GitHub repository for more information.
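As a tiny, assumed taste of Python applied to natural language processing (not taken from the notebook), the following counts the most frequent words in a plain text file with NLTK; the file name is a placeholder:

# count & tabulate the words in a plain text file
import nltk
from collections import Counter

nltk.download('punkt')                           # tokenizer models; first run only
raw = open('practice.txt').read().lower()        # hypothetical plain text file
words = [word for word in nltk.word_tokenize(raw) if word.isalpha()]

# output the ten most frequent words
print(Counter(words).most_common(10))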