Introduction to Text Analysis
"Text analysis" is a broad term covering various processes by which text and natural language documents can be modified so that they can be organized and described.
This guide collects resources for several phases of the text analysis process, including text collection, text parsing and cleaning, text summary and analysis methods, and text visualization.
Overviews/summaries
Ted Underwood – Where to start with text mining
Tooling Up for Digital Humanities – Text Analysis
John Laudin – Text Analytics 101
O'Connor, Bamman, & Smith (2011) – Computational Text Analysis for Social Science
Ben Schmidt – Comparing Corpuses by Word Use
Possible Sources of Text
Native digital text
HTML
RSS feeds
Sample specific services:
Corpus of Global Web-Based English (GloWbE)
Tutorials for data collection from various services
Digitized
Hathi Trust (Hathi Download Helper)
JSTOR Data for Research* (with Early Journal Content bundle, also from archive.org)
Open American National Corpus (collection of American English from various sources)
WordHoard* (tagged literary texts)
Corpus of Contemporary American English
* - also has some processing/analysis capabilities
Cleaning Text for Analysis
Before you can start a text analysis project, you often need to do a lot of cleaning and parsing of the text. Most text is created and stored so that humans can understand it, and it is not always easy for a computer to process.
Computers work well when a data source has structure or, at least, some regular patterns that they can identify. Most cleaning and parsing for text analysis involves increasing the regularity (for example, fixing typos) or adding structure (tagging certain words as important, or even splitting documents into sections that have special meaning, such as title, authors, and chapters).
The major ways of analyzing texts are listed under Analysis Methods, and you may need to know a bit about your analysis methods and the tools you'll be using before you know what type of cleaning you need to do. For example, some techniques and tools count individual words very precisely, so they may count lower-case and upper-case versions of the same word separately. Here are some other cleaning and parsing techniques you might need to look into:
Removing stop words (deleting very common words like "a", "the", "and", etc.)
Stemming or lemmatization (ways of combining words that have the same linguistic root or stem)
Tip: Tools like Wordle may remove stop words, but they will likely count a word and its plural separately, or preserve differences in case as mentioned above. Try converting everything to lower case and using a quick stemming tool before loading things into word cloud generators.
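As a rough illustration of the cleaning steps above, the following Python sketch lower-cases a text, drops stop words, and folds simple plurals together. The tiny stop-word list and the `crude_stem` helper are hypothetical stand-ins made up for this example; `crude_stem` is not the Porter stemmer, and a real project would likely use a prepared stop-word list and an established stemmer or lemmatizer.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it"}


def crude_stem(word):
    """Fold simple English plurals together. NOT a real stemmer such as Porter's."""
    if len(word) > 4 and word.endswith("ies"):
        return word[:-3] + "y"          # stories -> story
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]                # cats -> cat
    return word


def clean_tokens(text):
    """Lower-case the text, split it into words, drop stop words, fold plurals."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


print(clean_tokens("The cats and the Cat sat; stories of a story."))
```

After this step, "cats" and "Cat" count as the same word, which is usually what a word cloud or frequency list should show.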
File Conversion
Extracting from PDFs:
More timesavers to unlock public records data (PDFs into spreadsheets)
Tabula (Java program for all platforms)
gImageReader (OCR for images, PDFs)
Cleaning HTML/XML:
scrubber (also lemmatizes, removes stop words with prepared lists)
HTML to Text (or Story) from Data Science Toolkit
Changing tabs to commas, removing line breaks, etc.
Sort My List (also changes case, removes punctuation)
Transformer (rescue texts from old file formats)
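If you prefer a script to a web tool, small conversions such as changing tabs to commas or removing line breaks can be sketched in a few lines of Python using only the standard library. The function names `tsv_to_csv` and `unwrap_lines` are hypothetical names chosen for this example.

```python
import csv
import io
import re


def tsv_to_csv(tsv_text):
    """Convert tab-separated text to comma-separated text.

    The csv module adds quotes around fields that themselves contain commas.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()


def unwrap_lines(text):
    """Join hard-wrapped lines into paragraphs; blank lines still separate paragraphs."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)


print(tsv_to_csv("title\tyear\nMoby-Dick, or The Whale\t1851\n"))
print(unwrap_lines("It was the best\nof times.\n\nIt was the worst\nof times.\n"))
```

Using the `csv` module rather than a plain find-and-replace keeps fields that already contain commas intact.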
Correcting/Standardizing Text
Google Refine for entity normalization
Vard 2 for cleaning historical text
TextFixer for changing case, removing whitespace, sorting
Porter stemmer online for stemming text
Microsoft Word to convert formatting to structure
Finding and replacing formatting and special characters in Word
Using regular expressions in Word
Convert text to table and back
Microsoft Excel to split, concatenate, filter data
Word Frequency in Excel with Filters, COUNTIF
Help with Regular Expressions
Text Editors with Regular Expression Capabilities
Windows
(See also Top 10 Cheap Windows Text Editors with Regular Expressions)
Microsoft Word (Extended Instructions)
Mac
(See also Top 10 Cheap Mac OS X Text Editors with Regular Expressions)
Microsoft Word (Extended Instructions)
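The same kind of find-and-replace pattern that these editors support can also be run from a script. The sketch below uses Python's built-in `re` module for three common cleanup tasks. The `tidy` function and the specific patterns are assumptions chosen for illustration, and regular expression syntax differs slightly between editors, so the patterns may need adjusting elsewhere.

```python
import re


def tidy(text):
    """Apply a few illustrative regular-expression cleanups to raw text."""
    # Rejoin words hyphenated across a line break: "analy-\nsis" -> "analysis".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Drop lines that contain only digits (often page numbers in scanned text).
    text = re.sub(r"^[ \t]*\d+[ \t]*$\n?", "", text, flags=re.MULTILINE)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    return text


print(tidy("Text analy-\nsis  is\tfun.\n12\nNext page."))
```

Test patterns like these on a copy of your text first, since a rule such as "delete digit-only lines" can also remove lines you meant to keep.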
Types of Text Analysis
Basic Text Summaries and Analyses
Word frequency (lists of words and their frequencies)
(See also: Word counts are amazing, Ted Underwood)
Collocation (words commonly appearing near each other)
Concordance (the contexts of a given word or set of words)
N-grams (common two-word, three-word, and longer phrases)
Entity recognition (identifying names, places, time periods, etc.)
Dictionary tagging (locating a specific set of words in the texts)
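To make three of these summaries concrete, here is a minimal Python sketch of word frequency, n-grams, and a keyword-in-context concordance. The helper names (`tokenize`, `ngrams`, `concordance`) and the sample sentence are made up for this example, and the tokenizer is deliberately simple; the tools listed in this guide handle punctuation, case, and large corpora far more carefully.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def concordance(tokens, keyword, width=3):
    """List each occurrence of keyword with up to `width` words of context per side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


tokens = tokenize("The whale swam. The whale dived, and the ship followed the whale.")
print(Counter(tokens).most_common(2))             # word frequency
print(Counter(ngrams(tokens, 2)).most_common(1))  # most common two-word phrase
print(concordance(tokens, "ship", width=2))       # keyword in context
```

Counting n-grams with `Counter` is the same operation as counting single words, just applied to tuples of adjacent tokens.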
High-level Goals for Text Analysis
(From Underwood, T. (2012). Where to start with text mining.)
Document categorization
Information retrieval (e.g., search engines)
Supervised classification (e.g., guessing genres)
Unsupervised clustering (e.g., alternative “genres”)
Corpora comparison (e.g., political speeches)
Language use over time (e.g., Google ngram viewer)
Detecting clusters of document features (i.e., topic modeling)
Entity recognition/extraction (e.g., geoparsing)
Visualization
Tools with Their Analysis Methods
Web Tools
Voyant Tools – word frequencies, concordance, word clouds, visualizations
TAPorWare – various data cleaning, annotating, and summarizing tools in a web interface
Netlytic – word frequencies, concordance, dictionary tagging, network analysis
Wmatrix – frequency profiles, concordances, compare frequency lists, n-grams and c-grams, collocations
Natural Language Processor & Analyzer - word frequencies, collocations, concordance, tokenizer, etc.
ManyEyes – interactive text visualizations (network diagram, word tree, phrase net, tag cloud, word cloud)
Overview – automatic topic tagging and visualization
Monk Workbench – corpus selection from library holdings, frequencies and corpora comparisons, supervised classification
LIWC - web version outputs a few linguistic dimensions; full version can be licensed for ~$100
Downloadable Applications
(no programming required)
AntWord – word frequencies
AntConc – frequency lists, concordances, collocations, keywords, n-grams
TextSTAT – word frequencies, concordances
Concordance – word frequencies, concordances, indexes
Cowo - semantic network
WordHoard - word frequencies, concordances, collocations, scripting (includes tagged literary corpora)
CasualConc - KWIC concordance lines, word clusters, collocation analysis, and word count
NVivo (Duke info) - can cluster sources based on text, also produces phrase nets and tag clouds
Tableau (LibGuide) - word clouds
Other Lists of Tools
TAPoRware recipes (tutorials)
DiRT - digital research tools
Advanced Text Analysis
Text Annotation Tools
Natural Language Processing
National Centre for Text Mining (includes some tools for medical texts)
Reporters' Lab Reviews: Entity Extraction
Natural (natural language facilities for Node.js)
Sentiment Analysis
Most powerful open source sentiment analysis tools
Bing Liu's Resources on Opinion Mining (including a sentiment lexicon)
NaCTeM Sentiment Analysis Test Site (web form)
pattern web mining module (Python)
List of sentiment analysis tools for Twitter
Programming Resources
The Programming Historian - Lessons
Basic Unix workflow for Text Processing
Similarity and Dissimilarity Measures
An introduction to text analysis with python
Basic Text Analysis in Mathematica
Zend Framework - PHP framework for collecting data
Text Analysis with R for Students of Literature
Python Programming for the Humanities
Examples of Text Visualizations
Various Text Analysis Projects with Visualizations
Various artistic analyses/interpretations of texts by Stefanie Posavec
The state of our union is... dumber
A Christmas Carol (TULP interactive)
Word Frequency Visualizations
Google n-gram viewer - word frequencies over time
bookworm Open Library - word frequencies over time
Historical culturomics of pronoun frequencies - pronoun frequencies by gender over time
The Words They Used - bubble cloud of words from national convention speeches, with size and color coding
Bib.ly - word frequencies throughout the Bible
Ye Shall Know Them By Their Words - word frequencies by topic for presidential nomination speeches (additional description)
FACTA+ Visualizer - tree map of term frequency
Inaugural language (Boston Globe) - radial scatterplots
Mining Books to Map Emotions - frequencies of sentiment terms over time
Topic Model Visualizations
Termite - tabular, proportional symbol visualization of words and topics
PMLA topic network - a network view of the topics from a topic model of PMLA, where links are created for shared words between topics (additional description)
Using Word Clouds for Topic Modeling Results - visualizing the distribution of words for each topic as separate word clouds
https://guides.library.duke.edu/text_analysis/text_vis