How to Approach Text Analysis: Steps, Tools, and Visualization

Introduction to Text Analysis

"Text analysis" is a broad term covering various processes by which text and natural language documents can be modified so that they can be organized and described.

This guide collects resources for several phases of the text analysis process, including text collection, text parsing and cleaning, text summary and analysis methods, and text visualization.

Overviews/summaries

Ted Underwood – Where to start with text mining

Tooling Up for Digital Humanities – Text Analysis

Ryan Shaw – Text Mining

John Laudin – Text Analytics 101

O'Connor, Bamman, & Smith (2011) – Computational Text Analysis for Social Science

Ben Schmidt – Comparing Corpuses by Word Use

Possible Sources of Text

Native digital text

Email

(Thunderbird extension, MUSE*)

HTML

RSS feeds

Sample specific services:

Twitter

Wikipedia

Data Liberation Front

New York Times API

CMU Movie Summary Corpus

Corpus of Global Web-Based English (GloWbE)

PLOS Text Mining Collection

Tutorials for data collection from various services

Digitized

Internet Archive

Project Gutenberg

Google Books

Hathi Trust (Hathi Download Helper)

JSTOR Data for Research* (with Early Journal Content bundle, also from archive.org)

PubMed Open Access Subset

Monk Workbench*

Document Cloud*

Open American National Corpus (collection of American English from various sources)

WordHoard* (tagged literary texts)

Corpus of Contemporary American English

* - also has some processing/analysis capabilities

Cleaning Text for Analysis

Before you can do a text analysis project, you often need to do a lot of cleaning and parsing of the text. This is because most text is created and stored so that humans can understand it, and it is not always easy for a computer to process that text.

Computers work well when there is structure to a data source or, at least, some regular patterns that they can identify. Most cleaning and parsing for text analysis involves increasing the regularity (for example, fixing typos) or adding structure (tagging certain words as important, or even splitting documents up into different sections that have special meaning: title, authors, chapters, etc.).

The major ways of analyzing texts are listed under Analysis Methods, and you may need to know a bit about your analysis methods and the tools you'll be using before you know what type of cleaning you need to do. For example, some techniques and tools will be very precise when counting the individual words, and they may count a lower-case and an upper-case version of the same word separately. Here are some other cleaning and parsing techniques you might need to look into:

Removing stop words (deleting very common words like "a", "the", "and", etc.)

Stemming or lemmatization (ways of combining words that have the same linguistic root or stem)

Tip: Tools like Wordle may remove stop words, but they will likely count a word and the plural of that word separately, or preserve differences in case as mentioned above. Try converting everything to lower case and using a quick stemming tool before loading things into word cloud generators.
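As a rough illustration of the tip above, the following Python sketch lowercases a text, drops stop words, and folds simple plurals together before the words are counted. The stop list and the naive_stem function are illustrative assumptions made for this example. They are not a real stemming algorithm, and a proper tool such as a Porter stemmer handles many more cases.

```python
import re

# Tiny illustrative stop list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "are"}


def naive_stem(word):
    """Very crude plural folding; a stand-in for a real stemmer such as Porter."""
    if len(word) > 3 and word.endswith("ies"):
        return word[:-3] + "y"
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def clean(text):
    """Lowercase, split into word tokens, remove stop words, then stem."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


print(clean("The Cats and the cat chased the Mice."))
# prints ['cat', 'cat', 'chased', 'mice']
```

Note that "Cats" and "cat" now count as the same word, while the irregular plural "Mice" is left alone. Irregular forms are where a lemmatizer does better than simple suffix stripping.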

File Conversion

Extracting from PDFs:

More timesavers to unlock public records data (PDFs into spreadsheets)

Tabula (Java program for all platforms)

gImageReader (OCR for images, PDFs)

Cleaning HTML/XML:

Beautiful Soup

scrubber (also lemmatizes, removes stop words with prepared lists)

HTML to Text (or Story) from Data Science Toolkit

Changing tabs to commas, removing line breaks, etc.

Sort My List (also changes case, removes punctuation)

TextFixer

Transformer (rescue texts from old file formats)

Text Mechanic
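For readers who prefer a script to a web form, the same kinds of conversions can be sketched in Python with the standard library. The function names tsv_to_csv and join_wrapped_lines are hypothetical names chosen for this example. The sketch assumes tab-separated input and paragraphs separated by blank lines.

```python
import csv
import io
import re


def tsv_to_csv(tsv_text):
    """Convert tab-separated text to comma-separated text, quoting where needed."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()


def join_wrapped_lines(text):
    """Remove line breaks inside paragraphs; keep blank lines between paragraphs."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)


print(tsv_to_csv("title\tauthor\nMoby Dick\tMelville, Herman\n"))
print(join_wrapped_lines("one\ntwo\n\nthree\nfour\n"))
```

Using the csv module rather than a plain find-and-replace means that a field which already contains a comma (such as "Melville, Herman") is quoted instead of being split into two columns.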

Correcting/Standardizing Text

Google Refine for entity normalization

Vard 2 for cleaning historical text

TextFixer for changing case, removing whitespace, sorting

Porter stemmer online for stemming text

Microsoft Word to convert formatting to structure

Finding and replacing formatting and special characters in Word

Using regular expressions in Word

Convert text to table and back

Microsoft Excel to split, concatenate, filter data

Excel Text to Columns tool

Excel Concatenate function

Word Frequency in Excel with Filters, COUNTIF

Help with Regular Expressions

Text Editors with Regular Expression Capabilities

Windows

(See also Top 10 Cheap Windows Text Editors with Regular Expressions)

Notepad++

GNU Emacs

Vim

Kate

jEdit (instructions)

NoteTab Light

Microsoft Word (Extended Instructions)

Notepad RE

Zeus Lite Editor

Programmer's Notepad

EditPad Lite

PSPad

SciTE

Crimson Editor

Sublime

Mac

(See also Top 10 Cheap Mac OS X Text Editors with Regular Expressions)

GNU Emacs

Vim

jEdit (instructions)

Kate

Aquamacs

TextWrangler

Sublime

Microsoft Word (Extended Instructions)

Types of Text Analysis

Basic Text Summaries and Analyses

Word frequency (lists of words and their frequencies)

(See also: Word counts are amazing, Ted Underwood)

Collocation (words commonly appearing near each other)

Concordance (the contexts of a given word or set of words)

N-grams (common two-word, three-word, and longer phrases)

Entity recognition (identifying names, places, time periods, etc.)

Dictionary tagging (locating a specific set of words in the texts)
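Several of the summaries listed above can be sketched in a few lines of Python using only the standard library. The helper names below (tokenize, ngrams, concordance) and the sample sentence are assumptions made for illustration. Dedicated tools add better tokenization, statistics, and display options.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def concordance(tokens, keyword, width=2):
    """Show each occurrence of keyword with `width` tokens of context per side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


tokens = tokenize("the cat sat on the mat and the cat slept")

# Word frequency: the two most common words.
print(Counter(tokens).most_common(2))             # [('the', 3), ('cat', 2)]
# N-grams: the most common two-word phrase.
print(Counter(ngrams(tokens, 2)).most_common(1))  # [(('the', 'cat'), 2)]
# Concordance: every context in which "cat" appears.
print(concordance(tokens, "cat"))
# ['the [cat] sat on', 'and the [cat] slept']
```

Dictionary tagging can be approximated the same way by counting only the tokens that appear in a prepared word list.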

High-level Goals for Text Analysis

(From Underwood, T. (2012). Where to start with text mining.)

Document categorization

Information retrieval (e.g., search engines)

Supervised classification (e.g., guessing genres)

Unsupervised clustering (e.g., alternative “genres”)

Corpora comparison (e.g., political speeches)

Language use over time (e.g., Google ngram viewer)

Detecting clusters of document features (i.e., topic modeling)

Entity recognition/extraction (e.g., geoparsing)

Visualization

Tools with Their Analysis Methods

Web Tools

Voyant Tools – word frequencies, concordance, word clouds, visualizations

TAPorWare – various data cleaning, annotating, and summarizing tools in a web interface

Netlytic – word frequencies, concordance, dictionary tagging, network analysis

Wmatrix – frequency profiles, concordances, compare frequency lists, n-grams and c-grams, collocations

Natural Language Processor & Analyzer – word frequencies, collocations, concordance, tokenizer, etc.

ManyEyes – interactive text visualizations (network diagram, word tree, phrase net, tag cloud, word cloud)

Overview – automatic topic tagging and visualization

Monk Workbench – corpus selection from library holdings, frequencies and corpora comparisons, supervised classification

LIWC – web version outputs a few linguistic dimensions; full version can be licensed for ~$100

Downloadable Applications

(no programming required)

AntWord – word frequencies

AntConc – frequency lists, concordances, collocations, keywords, n-grams

TextSTAT – word frequencies, concordances

Concordance – word frequencies, concordances, indexes

Cowo – semantic network

WordHoard – word frequencies, concordances, collocations, scripting (includes tagged literary corpora)

CasualConc – KWIC concordance lines, word clusters, collocation analysis, and word count

NVivo (Duke info) – can cluster sources based on text, also produces phrase nets and tag clouds

Tableau (LibGuide) – word clouds

Other Lists of Tools

TAPoR 2

TAPoRware recipes (tutorials)

DiRT – digital research tools

Advanced Text Analysis

Text Annotation Tools

NVivo

brat rapid annotation tool

Natural Language Processing

GATE

nltk

Stanford NLP Group Software

National Centre for Text Mining (includes some tools for medical texts)

Reporters' Lab Reviews: Entity Extraction

Michael Collins' notes on NLP

Natural (natural language facilities for Node.js)

Sentiment Analysis

Most powerful open source sentiment analysis tools

Bing Liu's Resources on Opinion Mining (including a sentiment lexicon)

NaCTeM Sentiment Analysis Test Site (web form)

pattern web mining module (Python)

SentiWordNet

Umigon (for tweets, etc.)

List of sentiment analysis tools for Twitter

Programming Resources

The Programming Historian - Lessons

Basic Unix workflow for Text Processing

Helpful Unix commands

Similarity and Dissimilarity Measures

An introduction to text analysis with python

Basic Text Analysis in Mathematica

Zend Framework – PHP framework for collecting data

Text Analysis with R for Students of Literature

Python Programming for the Humanities

Document Similarity with R
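To give a sense of what the similarity resources above cover, here is a minimal Python sketch of one common measure, cosine similarity between word-count vectors. It is an illustrative assumption about how such a measure might be coded, not the method used in any of the linked tutorials, and it ignores refinements such as stop word removal or TF-IDF weighting.

```python
import math
from collections import Counter


def cosine_similarity(a_tokens, b_tokens):
    """Cosine of the angle between two bag-of-words count vectors (0.0 to 1.0)."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    # Only words present in both documents contribute to the dot product.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


doc1 = "the cat sat on the mat".split()
doc2 = "the cat lay on the rug".split()
doc3 = "stock prices fell sharply".split()

print(cosine_similarity(doc1, doc2))  # high: most words are shared
print(cosine_similarity(doc1, doc3))  # 0.0: no words in common
```

Because the measure depends only on the proportions of words, a long document and a short one on the same subject can still score as similar.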

Examples of Text Visualizations

Various Text Analysis Projects with Visualizations

With Criminal Intent

Various artistic analyses/interpretations of texts by Stefanie Posavec

The state of our union is... dumber

wordcollider

Popcornjs sentiment tracker

Metropho.rs

Novel Views: Les Miserables

A Christmas Carol (TULP interactive)

Tolkien's Books Analyzed

Word Frequency Visualizations

Google n-gram viewer - word frequencies over time

bookworm Open Library - word frequencies over time

Historical culturomics of pronoun frequencies - pronoun frequencies by gender over time

The Words They Used - bubble cloud of words from national convention speeches, with size and color coding

Bib.ly - word frequencies throughout the Bible

Ye Shall Know Them By Their Words - word frequencies by topic for presidential nomination speeches (additional description)

FACTA+ Visualizer - tree map of term frequency

Inaugural language (Boston Globe) - radial scatterplots

Mining Books to Map Emotions - frequencies of sentiment terms over time

Topic Model Visualizations

Termite - tabular, proportional symbol visualization of words and topics

PMLA topic network - a network view of the topics from a topic model of PMLA, where links are created for shared words between topics (additional description)

Using Word Clouds for Topic Modeling Results - visualizing the distribution of words for each topic as separate word clouds

https://guides.library.duke.edu/text_analysis/text_vis
