[A complete worked example of exploratory data analysis and visualization for text data] "A Complete Exploratory Data Analysis and Visualization for Text Data" by Susan Li
Original English article: A Complete Exploratory Data Analysis and Visualization for Text Data
Visually representing the content of a text document is one of the most important tasks in the field of text mining. As data scientists or NLP specialists, we not only explore the content of documents from different aspects and at different levels of detail, but also summarize a single document, show its words and topics, detect events, and create storylines.
Paraphrase: Visually representing text content is an important task in text mining. As data scientists or NLP specialists, we need not only to explore text content from different aspects and at different levels of granularity, but also to summarize a single document and show its words, topics, events, and storylines.
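However, there are some gaps between visualizing unstructured (text) data and structured data. For example, many text visualizations do not represent the text directly; they represent an output of a language model (word count, character length, word sequences, etc.).
Paraphrase: Visualizing unstructured text data differs considerably from visualizing structured data. For example, many text visualizations do not represent the text directly; they represent it indirectly through the output of a language model, such as word counts, character lengths, and word sequences.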
In this post, we will use the Women's Clothing E-Commerce Reviews data set and try to explore and visualize as much as we can, using Plotly's Python graphing library and the Bokeh visualization library. We are going to explore not only the text data but also the numeric and categorical features. Let's get started!
Paraphrase: The data set used for text visualization is the Women's Clothing E-Commerce Reviews data set.
You can download the data after registering a Kaggle account.
Tool 1: Plotly's Python graphing library
Tool 2: Bokeh visualization library
All of the code: Jupyter notebook
Let's get started!
The Data
import pandas as pd
df = pd.read_csv('Womens Clothing E-Commerce Reviews.csv')
After a brief inspection of the data, we found a series of data preprocessing steps that we have to carry out.
·Remove the "Title" feature.
·Remove the rows where "Review Text" was missing.
Paraphrase: Do some preprocessing on the data; see the screenshot below for the specific operations:
·Clean the "Review Text" column.
·Use TextBlob to calculate the sentiment polarity, which lies in the range [-1, 1], where 1 means positive sentiment and -1 means negative sentiment.
·Create a new feature for the length of the review.
·Create a new feature for the word count of the review.
Paraphrase: The operations described in English above are illustrated in the figure below:
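The preprocessing steps listed above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the article's own code: the cleaning rules in `clean_text`, the column names `review_len` and `word_count`, and the `polarity_fn` parameter are all made up for illustration, while `polarity` matches the column name used later in this post. TextBlob is imported lazily so that a different scorer can be swapped in.

```python
import re

import pandas as pd


def textblob_polarity(text):
    """Score one review with TextBlob; the result lies in [-1, 1]."""
    # Imported lazily so the rest of the sketch loads without TextBlob installed.
    from textblob import TextBlob
    return TextBlob(text).sentiment.polarity


def clean_text(text):
    """Illustrative cleaning only (assumed rules, not the article's)."""
    text = re.sub(r"<[^>]+>", " ", text)  # replace HTML-like tags with a space
    text = re.sub(r"\s+", " ", text)      # collapse runs of whitespace
    return text.strip()


def preprocess(df, polarity_fn=textblob_polarity):
    """Apply the listed steps and return a new DataFrame."""
    # Remove the "Title" feature (ignored if the column is absent).
    df = df.drop(columns=["Title"], errors="ignore")
    # Remove the rows where "Review Text" is missing.
    df = df.dropna(subset=["Review Text"]).copy()
    # Clean the "Review Text" column.
    df["Review Text"] = df["Review Text"].astype(str).map(clean_text)
    # Sentiment polarity, review length in characters, and word count.
    df["polarity"] = df["Review Text"].map(polarity_fn)
    df["review_len"] = df["Review Text"].str.len()
    df["word_count"] = df["Review Text"].str.split().str.len()
    return df
```

Calling `df = preprocess(df)` on the frame loaded earlier uses the default scorer and therefore needs TextBlob installed (`pip install textblob`).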
To check whether the sentiment polarity score works, we randomly select 5 reviews with the highest sentiment polarity score (1):
Paraphrase: Randomly view 5 reviews with positive sentiment.
Then we randomly select 5 reviews with the most neutral sentiment polarity score (0):
Paraphrase: Randomly view 5 reviews with neutral sentiment.
There were only 2 reviews with the most negative sentiment polarity score:
Paraphrase: Print the 2 reviews with negative sentiment.
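The three spot checks above (polarity 1, 0, and -1) can be sketched with one small helper. It assumes the frame has a `polarity` column and a `Review Text` column; the function name `sample_reviews` and its `seed` parameter are hypothetical and not taken from the article.

```python
import pandas as pd


def sample_reviews(df, score, n=5, seed=None):
    """Return up to n randomly chosen review texts whose polarity equals score."""
    matches = df.loc[df["polarity"] == score, "Review Text"]
    if matches.empty:
        return []
    # Never ask for more rows than exist (e.g. only 2 reviews score -1).
    return matches.sample(n=min(n, len(matches)), random_state=seed).tolist()
```

For example, `sample_reviews(df, 1)`, `sample_reviews(df, 0)`, and `sample_reviews(df, -1)` would give the positive, neutral, and negative samples respectively.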
Univariate visualization with Plotly
Single-variable, or univariate, visualization is the simplest type of visualization; it consists of observations on only a single characteristic or attribute. Univariate visualizations include histograms, bar plots, and line charts.
The distribution of review sentiment polarity score
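Paraphrase: Visualize the data with Plotly. Single-variable visualization is the simplest kind; its forms include histograms, bar charts, and line charts.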
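A minimal sketch of a histogram of the polarity scores, assuming a `polarity` column already exists. It builds a plain Plotly figure specification and hands it to `plotly.graph_objects` directly instead of going through the cufflinks-provided `iplot`; the function names, the bin count of 50, and the title text are illustrative choices, not the article's.

```python
import pandas as pd


def polarity_histogram_spec(df, column="polarity", nbins=50):
    """Build a Plotly figure specification (a plain dict) for a histogram."""
    return {
        "data": [
            {"type": "histogram", "x": df[column].tolist(), "nbinsx": nbins}
        ],
        "layout": {
            "title": {"text": "Sentiment Polarity Distribution"},
            "xaxis": {"title": {"text": column}},
            "yaxis": {"title": {"text": "count"}},
        },
    }


def show_polarity_histogram(df, column="polarity", nbins=50):
    """Render the histogram; requires Plotly to be installed."""
    import plotly.graph_objects as go  # imported lazily on purpose
    go.Figure(polarity_histogram_spec(df, column, nbins)).show()
```

Keeping the specification separate from rendering means the figure contents can be inspected even where Plotly cannot display output.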
What is difference between plot and iplot in pandas?
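The rest of the content is not covered here because the .plot() call inside df['polarity'].iplot() raises an error when run locally, and this has not been resolved yet. For the remaining content, please see the original article.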