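Task02: Data Loading and Data Analysis

Learning objectives

Analyze how the competition data is distributed.
Define a reusable sentence-analysis class in this session, so that it can be applied again to any similar data.

Dataset size:

The training set has 200,000 samples; test set A and test set B each have 50,000 samples.

Loading the data: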

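import pandas as pd
import matplotlib.pyplot as plt  # used by the plots below

# Paths to the competition files
data_root={
    "train_path":"../data/train_set.csv",
    "test_path":"../data/test_a.csv",
    "sub_path":"../data/test_a_sample_submit.csv"
}
# The training set is tab-separated: label <TAB> text
train=pd.read_csv(data_root["train_path"], sep='\t')
test=pd.read_csv(data_root["test_path"])
sub=pd.read_csv(data_root["sub_path"])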

Training set format: the first column is the news category and the second column holds the characters of the news article.

label	text
6	57 44 66 56 2 3 3 37 5 41 9 55

Data analysis:

1. Sentence length analysis

Note that each text looks like this:

train.text.values[199998]
# output 
'6405 3203 6644 983 794 1913 1678 5736 1397 1913 5221 1722 2410'

So we first split(" ") and then take len:

train["text_len"]=train.text.apply(lambda x:len(x.split(" ")))

Take a look at the summary statistics:

train['text_len'].describe()
# output
count    200000.000000
mean        907.207110
std         996.029036
min           2.000000
25%         374.000000
50%         676.000000
75%        1131.000000
max       57921.000000
Name: text_len, dtype: float64

We plot the distribution with hist:

train.text_len.hist(bins=100)
plt.xlabel('Text char count')
plt.title("Histogram of char count");

2. News category distribution

df_label=train.groupby("label").agg({"text":["count"],"text_len":["max","min","mean"]})

label	count	len_max	len_min	len_mean
0	38918	18587	12	878.717663
1	36945	57921	9	870.363676
2	31425	41894	2	1014.429562
3	22133	10817	17	784.774726
4	15016	14928	25	649.705647
5	12232	15839	27	1116.054938
6	9985	25728	16	1249.114071
7	8841	14469	11	1157.883271
8	7847	15271	7	712.401172
9	5878	23866	17	833.627084
10	4920	20622	14	911.138008
11	3131	5729	21	608.889812
12	1821	8737	25	1194.969248
13	908	6399	26	735.325991

train["label"].value_counts().plot(kind="bar")
plt.title('News class count')
plt.xlabel("category");

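The labels map to categories as follows: {'Technology': 0, 'Stocks': 1, 'Sports': 2, 'Entertainment': 3, 'Politics': 4, 'Society': 5, 'Education': 6, 'Finance': 7, 'Home': 8, 'Games': 9, 'Real estate': 10, 'Fashion': 11, 'Lottery': 12, 'Horoscope': 13}
The statistics show that the categories in the dataset are quite unevenly distributed. In the training set, technology news is the most common, followed by stock news, and horoscope news is the least common.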
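To put a number on the imbalance, here is a minimal sketch that uses the counts from the table above. The inverse-frequency weighting formula is an assumption added for illustration; it is one common way to compensate for imbalance, not something the original analysis used.

```python
# Class counts copied from the table above (label -> number of training samples).
class_counts = {
    0: 38918, 1: 36945, 2: 31425, 3: 22133, 4: 15016, 5: 12232, 6: 9985,
    7: 8841, 8: 7847, 9: 5878, 10: 4920, 11: 3131, 12: 1821, 13: 908,
}

total = sum(class_counts.values())
n_classes = len(class_counts)

# Share of the training set taken by each class.
class_share = {label: count / total for label, count in class_counts.items()}

# "Balanced" inverse-frequency weights: total / (n_classes * count).
# Assumption: this weighting scheme is illustrative only.
class_weight = {label: total / (n_classes * count) for label, count in class_counts.items()}

# How many times larger the biggest class is than the smallest one.
imbalance_ratio = max(class_counts.values()) / min(class_counts.values())

print("largest / smallest class: {:.1f}x".format(imbalance_ratio))
for label in sorted(class_counts):
    print("label {:>2}: share {:.2%}, weight {:.2f}".format(
        label, class_share[label], class_weight[label]))
```

Rare classes such as label 13 get the largest weights, which is the usual intent of this kind of scheme.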

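3. Character distribution statistics

Each row is a sequence of characters separated by spaces, so we join the rows with spaces as well, call split(" ") on the result to get a list of all characters, and then count them.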

from collections import Counter
all_lines=" ".join(list(train["text"]))
word_count=Counter(all_lines.split(" "))
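Joining 200,000 articles into one string and splitting it again keeps several large copies in memory. As an alternative, here is a minimal sketch that updates the Counter row by row; the helper name count_tokens and the toy rows are made up for illustration, and the real train["text"] column could be passed in the same way.

```python
from collections import Counter

def count_tokens(texts):
    """Count space-separated tokens without building one giant joined string."""
    counter = Counter()
    for text in texts:
        counter.update(text.split(" "))
    return counter

# Toy rows standing in for train["text"] (hypothetical data).
toy_texts = ["3750 648 900", "3750 3750 57", "900 44"]
toy_count = count_tokens(toy_texts)

print(toy_count.most_common(2))  # -> [('3750', 3), ('900', 2)]
print(len(toy_count))            # -> 5 distinct tokens
```

The result is the same Counter as the join-and-split version, only built incrementally.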

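With the Counter class it is easy to get the most and least frequent characters (of course, many characters tie for least frequent).

The 10 most frequent characters; the character with ID 3750 appears most often.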

word_count.most_common(10)
#output
[('3750', 7482224),
 ('648', 4924890),
 ('900', 3262544),
 ('3370', 2020958),
 ('6122', 1602363),
 ('4464', 1544962),
 ('7399', 1455864),
 ('4939', 1387951),
 ('3659', 1251253),
 ('4811', 1159401)]

The 5 least frequent characters, each appearing only once.

word_count.most_common()[-5:]
#output
[('155', 1), ('1415', 1), ('1015', 1), ('4468', 1), ('3133', 1)]

Vocabulary size

len(word_count)
#output
6869

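The result shows that the training set contains 6,869 distinct characters in total.

Counting how many news articles each character appears in

From how often a character appears across articles, we can also infer which characters are punctuation marks.
To count the number of articles that contain each character, we only need to deduplicate the characters within each article, so a small change to the code above is enough.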

train['text_unique'] = train['text'].apply(lambda x: ' '.join(list(set(x.split(' ')))))
all_lines = ' '.join(list(train['text_unique']))
word_count = Counter(all_lines.split(" "))

Characters 3750, 900 and 648 each appear in more than 95% of the 200,000 news articles, so they are very likely punctuation marks.

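for k,v in word_count.most_common(10):
    print("Character {:>4} appears in {:.2%} of all articles".format(k,v/200000))
# output 
Character 3750 appears in 99.00% of all articles
Character  900 appears in 98.83% of all articles
Character  648 appears in 95.99% of all articles
Character 2465 appears in 88.66% of all articles
Character 6122 appears in 88.27% of all articles
Character 7399 appears in 88.12% of all articles
Character 4811 appears in 84.69% of all articles
Character 4464 appears in 83.58% of all articles
Character 1699 appears in 82.43% of all articles
Character 3659 appears in 81.59% of all articles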
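The deduplicate-then-count idea can also be written without the intermediate text_unique column. This is a minimal sketch on toy data; the function name document_frequency and the sample rows are hypothetical.

```python
from collections import Counter

def document_frequency(texts):
    """Return, for each token, the fraction of texts that contain it at least once."""
    counter = Counter()
    n_docs = 0
    for text in texts:
        # set(...) removes repeats, so each text contributes at most 1 per token.
        counter.update(set(text.split(" ")))
        n_docs += 1
    return {token: count / n_docs for token, count in counter.items()}

# Toy rows standing in for train["text"] (hypothetical data).
toy_texts = ["3750 3750 900", "3750 648", "57 900", "3750 44"]
toy_df = document_frequency(toy_texts)

print(toy_df["3750"])  # -> 0.75 (appears in 3 of 4 texts)
print(toy_df["900"])   # -> 0.5
```

Dividing by the number of rows actually seen avoids hard-coding 200000, so the same function works on the test set.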

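Conclusions from the data analysis:

From the analysis above we can conclude the following:

Each news article contains about 900 characters on average (mean 907), and some articles are much longer (up to 57,921 characters);
The news categories are unevenly distributed: technology has close to 40,000 samples, while horoscope has fewer than 1,000;
The competition data contains 7,000 to 8,000 distinct characters in total (including the test set);

The analysis also suggests the following:

Articles are long on average, so they may need to be truncated;
The class imbalance may seriously hurt model accuracy;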
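One possible way to truncate long articles is to keep the beginning and the end and drop the middle. The sketch below is only an illustration of that idea under my own assumptions: the function name truncate_head_tail, the head_ratio parameter and the toy input are not from the original analysis, and the right max_len would have to be chosen from the length statistics above.

```python
def truncate_head_tail(text, max_len, head_ratio=0.5):
    """Keep at most max_len tokens: the first part and the last part of the text."""
    tokens = text.split(" ")
    if len(tokens) <= max_len:
        return text  # short texts are returned unchanged
    n_head = int(max_len * head_ratio)
    n_tail = max_len - n_head
    tail = tokens[-n_tail:] if n_tail > 0 else []
    return " ".join(tokens[:n_head] + tail)

toy_text = "1 2 3 4 5 6 7 8 9 10"  # hypothetical 10-token article
print(truncate_head_tail(toy_text, max_len=4))   # -> 1 2 9 10
print(truncate_head_tail(toy_text, max_len=20))  # unchanged, already short enough
```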

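Homework for this chapter:

1. Assuming that characters 3750, 900 and 648 are sentence punctuation marks, how many sentences does each news article contain on average?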

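for symbol in ["3750","900","648"]:
    col_name="sentence_count_by_{}".format(symbol)
    # Note: str.split matches substrings, so "648" also matches inside a token such as "1648";
    # the counts are therefore approximate.
    train[col_name]=train.text.apply(lambda x:len(x.split(symbol)))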

Some summary statistics and plots:

for symbol in ["3750","900","648"]:
    col_name="sentence_count_by_{}".format(symbol)
    print(train[col_name].describe())
    train[col_name].hist(bins=100)
    plt.title(symbol)
    plt.show()

#output
count    200000.00000
mean         38.41112
std          40.87367
min           1.00000
25%          14.00000
50%          27.00000
75%          49.00000
max        1960.00000
Name: sentence_count_by_3750, dtype: float64

#output
count    200000.000000
mean         17.335255
std          18.047099
min           1.000000
25%           7.000000
50%          12.000000
75%          22.000000
max         735.000000
Name: sentence_count_by_900, dtype: float64

#output
count    200000.000000
mean         27.055995
std          31.958496
min           1.000000
25%           8.000000
50%          18.000000
75%          35.000000
max        1394.000000
Name: sentence_count_by_648, dtype: float64
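Calling split(symbol) on the raw string matches substrings, so "648" is also found inside a token like "1648". A token-level count avoids this. The sketch below is a minimal version under my own assumptions: the function name count_sentences and the toy text are hypothetical, and a trailing fragment without a closing mark is counted as one more sentence.

```python
def count_sentences(text, end_tokens):
    """Count sentences by counting whole tokens that are end marks."""
    tokens = text.split(" ")
    n_sentences = sum(1 for token in tokens if token in end_tokens)
    # If the text does not end with an end mark, the trailing fragment
    # is counted as one more (unterminated) sentence.
    if tokens[-1] not in end_tokens:
        n_sentences += 1
    return n_sentences

toy_text = "12 1648 900 7 3750 9 900"  # hypothetical article

# Substring split wrongly finds "648" inside the token "1648".
print(len(toy_text.split("648")))                   # -> 2
# Token-level counting sees no "648" token, so the text is one sentence.
print(count_sentences(toy_text, {"648"}))           # -> 1
print(count_sentences(toy_text, {"900"}))           # -> 2
print(count_sentences(toy_text, {"900", "3750"}))   # -> 3
```

Passing a set of end marks also makes it possible to treat all three candidate characters as sentence boundaries at once instead of one at a time.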

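This is still a bit odd: if these characters do not sit at the very end of an article, splitting on them leaves an incomplete fragment at the end, so we also analyze the last character of each article.

Last-character analysis:
Was the final part of each article cut off? The hypothesis is:
If the articles were not truncated, a large share of them should end with the same number.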

train["last_word"]=train.text.apply(lambda x: x.split(" ")[-1])
last_word_count=Counter(train["last_word"])
last_word_count.most_common(10)
# output
[('900', 85040),
 ('2662', 39273),
 ('885', 14473),
 ('1635', 7379),
 ('2465', 7076),
 ('57', 3284),
 ('3231', 2758),
 ('1633', 2706),
 ('3568', 1504),
 ('2265', 1389)]

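Check how often the three characters above appear at the very end of an article. The result suggests that 900 is most likely the full stop, and that most articles were not truncated.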

for symbol in ["3750","900","648"]:
    print(last_word_count[symbol])
# output
17
85040
17
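The counts above are easier to judge as shares of the 200,000 articles. This is a small worked calculation that uses only the numbers from the most_common(10) output; nothing here is recomputed from the data.

```python
# Counts copied from the most_common(10) output above (last token -> number of articles).
last_token_top10 = {
    "900": 85040, "2662": 39273, "885": 14473, "1635": 7379, "2465": 7076,
    "57": 3284, "3231": 2758, "1633": 2706, "3568": 1504, "2265": 1389,
}
n_articles = 200000  # size of the training set

# Fraction of articles ending with each of the top-10 tokens.
shares = {token: count / n_articles for token, count in last_token_top10.items()}
# Fraction of articles ending with any of the top-10 tokens.
top10_coverage = sum(last_token_top10.values()) / n_articles

for token, share in shares.items():
    print("token {:>4}: {:.2%} of articles end with it".format(token, share))
print("top-10 ending tokens cover {:.2%} of articles".format(top10_coverage))
```

About 42.5% of the articles end with 900, and roughly 82% end with one of just ten tokens, which is consistent with the idea that most articles end on a natural closing character.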

2. Find the most frequent characters in each news category

A dictionary that stores one Counter per news category is all we need.

def get_word_group_count():
    word_group_count={}
    for name,group in train[["label","text"]].groupby("label"):
        all_lines=" ".join(list(group.text))
        word_count=Counter(all_lines.split(" "))
        word_group_count[name]=word_count
    return word_group_count

word_group_count=get_word_group_count()
for i in range(14):
    print("Label {:>2d}: top 5 characters: {} ".format(i,word_group_count[i].most_common(5)))

Label  0: top 5 characters: [('3750', 1267331), ('648', 967653), ('900', 577742), ('3370', 503768), ('4464', 307431)] 
Label  1: top 5 characters: [('3750', 1200686), ('648', 714152), ('3370', 626708), ('900', 542884), ('4464', 445525)] 
Label  2: top 5 characters: [('3750', 1458331), ('648', 974639), ('900', 618294), ('7399', 351894), ('6122', 343850)] 
Label  3: top 5 characters: [('3750', 774668), ('648', 494477), ('900', 298663), ('6122', 187933), ('4939', 173606)] 
Label  4: top 5 characters: [('3750', 360839), ('648', 231863), ('900', 190842), ('4411', 120442), ('7399', 86190)] 
Label  5: top 5 characters: [('3750', 715740), ('648', 329051), ('900', 305241), ('6122', 159125), ('5598', 136713)] 
Label  6: top 5 characters: [('3750', 469540), ('648', 345372), ('900', 222488), ('6248', 193757), ('2555', 175234)] 
Label  7: top 5 characters: [('3750', 428638), ('648', 262220), ('900', 184131), ('3370', 159156), ('5296', 132136)] 
Label  8: top 5 characters: [('3750', 242367), ('648', 202399), ('900', 92207), ('6122', 57345), ('4939', 56147)] 
Label  9: top 5 characters: [('3750', 178783), ('648', 157291), ('900', 70680), ('7328', 46477), ('6122', 43411)] 
Label 10: top 5 characters: [('3750', 180259), ('648', 114512), ('900', 75185), ('3370', 67780), ('2465', 45163)] 
Label 11: top 5 characters: [('3750', 83834), ('648', 67353), ('900', 37240), ('4939', 18591), ('6122', 18438)] 
Label 12: top 5 characters: [('3750', 87412), ('4464', 51426), ('3370', 45815), ('648', 37041), ('2465', 36610)] 
Label 13: top 5 characters: [('3750', 33796), ('648', 26867), ('900', 11263), ('4939', 9651), ('669', 8925)]

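3. Conclusions from the homework analysis:

1. The last-character analysis suggests that 900 is most likely the full stop, and that most articles were not truncated.

2. In label group 12, "4464" suddenly becomes much more frequent, so lottery news contains a lot of this character; I have no hypothesis for it yet. In every category, the most frequent characters are generally punctuation marks.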
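Since the top of every per-category list is dominated by the presumed punctuation marks, it may be more informative to ignore them before ranking. Here is a minimal sketch on toy data; the function name top_tokens_per_label, the ASSUMED_PUNCTUATION set and the sample rows are all hypothetical, and treating 3750, 900 and 648 as punctuation is the assumption from the homework, not a confirmed fact.

```python
from collections import Counter, defaultdict

# Assumption taken from the homework: these three IDs are punctuation marks.
ASSUMED_PUNCTUATION = {"3750", "900", "648"}

def top_tokens_per_label(rows, k=3, ignore=ASSUMED_PUNCTUATION):
    """rows: iterable of (label, text). Returns label -> k most common non-ignored tokens."""
    counters = defaultdict(Counter)
    for label, text in rows:
        counters[label].update(tok for tok in text.split(" ") if tok not in ignore)
    return {label: counter.most_common(k) for label, counter in counters.items()}

# Toy rows standing in for zip(train["label"], train["text"]) (hypothetical data).
toy_rows = [
    (0, "3750 11 11 648 22"),
    (0, "11 900 22 33"),
    (1, "3750 44 44 44 900"),
    (1, "44 55 648"),
]
toy_top = top_tokens_per_label(toy_rows, k=2)

print(toy_top[0])  # -> [('11', 3), ('22', 2)]
print(toy_top[1])  # -> [('44', 4), ('55', 1)]
```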

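Defining a sentence-analysis class:

After tidying up the code above, I rebuilt it as a sentence-analysis class so that it can be called again when analyzing the test set. For more details, see another post on my blog.

The code has been uploaded to GitHub.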

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
class SentenceAnalysis():
    def __init__(self,data_path,n_classes=None,with_label=True):
        self.data_path=data_path
        self.with_label=with_label # set to False when loading the unlabeled test set
        self.n_classes=n_classes
        self.load_dataset()
    
    @property
    def data(self):
        if self.with_label:
            return self.X,self.Y
        else:
            return self.X
    
    def load_dataset(self):
        if self.with_label:
            train=pd.read_csv(self.data_path, sep='\t')
            self.X=train[[col for col in train.columns if col!="label"]]
            self.Y=train["label"]
        else:
            test=pd.read_csv(self.data_path)
            self.X=test
            self.Y=None
    
    def __len__(self):
        return self.X.shape[0]

    def __getitem__(self, index):
        '''Return one sample: (x, y) if labels are available, otherwise x.'''
        x = self.X.iloc[int(index)]
        if self.with_label:
            # Positional lookup, consistent with self.X.iloc above
            y=self.Y.iloc[int(index)]
            # y=one_hot(y,self.n_classes)
            return x,y
        else:
            return x
    
    def passage_length_ana(self,show_describe=True,show_hist=False):
        """
        Sentence length analysis
        """
        df=self.X.copy()
        df["text_len"]=df.text.apply(lambda x:len(x.split(" ")))
        if show_describe:
            print(df["text_len"].describe())
        if show_hist:
            # Plot the local df, not the global `train` from the notebook
            df.text_len.hist(bins=100)
            plt.xlabel('Text char count')
            plt.title("Histogram of char count");
        return df["text_len"]

    def show_hist(self,data,bins=100,title="Not define.",xlabel="no xlabel."):
        data.hist(bins=bins)
        plt.xlabel(xlabel)
        plt.title(title);
        return 

    def label_distribution(self,show_bar=True,title='class count',xlabel="category"):
        """
        Label distribution analysis
        """
        if not self.with_label:
            print("No labels available!")
            return
        df=self.X.copy()
        df["label"]=self.Y.values
        df_label=df.groupby("label").agg({"text":["count"]})
        if show_bar:
            df["label"].value_counts().plot(kind="bar")
            plt.title(title)
            plt.xlabel(xlabel);
        return df_label
    

    def word_distribution(self,show_most=1,show_least=1):
        """
        Character distribution
        """
        show_most,show_least=int(show_most),int(show_least)
        df=self.X.copy()
        all_lines=" ".join(list(df["text"]))
        word_count=Counter(all_lines.split(" "))
        if show_most>0:
            print("The {} most frequent characters:".format(show_most))
            print(word_count.most_common(int(show_most)))
        if show_least>0:
            print("The {} least frequent characters:".format(show_least))
            print(word_count.most_common()[-int(show_least):])
        print("Number of distinct characters in all documents: {}".format(len(word_count)))
        return word_count
    
    def word_in_sentece_distribution(self,show_most=1,show_least=0):
        """
        Count how many articles each character appears in
        """
        show_most,show_least=int(show_most),int(show_least)
        df=self.X.copy()
        df['text_unique'] = df['text'].apply(lambda x: ' '.join(list(set(x.split(' ')))))
        all_lines = ' '.join(list(df['text_unique']))
        word_count = Counter(all_lines.split(" "))
        if show_most>0:
            print("The {} most frequent characters:".format(show_most))
            for k,v in word_count.most_common(show_most):
                print("Character {:>4} appears in {:.2%} of all articles".format(k,v/self.X.shape[0]))
        if show_least>0:
            print("The {} least frequent characters:".format(show_least))
            for k,v in word_count.most_common()[-int(show_least):]:
                print("Character {:>4} appears in {:.2%} of all articles".format(k,v/self.X.shape[0]))
        return word_count
    
    def word_groupbylabel_count(self,show_most=1):
        """
        Find the most frequent characters in each news category
        """
        show_most=int(show_most)
        if not self.with_label:
            print("No labels available!")
            return
        df=self.X.copy()
        df["label"]=self.Y.values
        word_group_count={}
        for name,group in df[["label","text"]].groupby("label"):
            all_lines=" ".join(list(group.text))
            word_count=Counter(all_lines.split(" "))
            word_group_count[name]=word_count
        if show_most>0:
            if not self.n_classes:
                self.n_classes=self.Y.nunique()
            # Assumes labels are the integers 0 .. n_classes-1
            for i in range(self.n_classes):
                print("Label {:>2d}: top {} characters: {} ".format(i,show_most,word_group_count[i].most_common(show_most)))
        return word_group_count
    

    def last_word_ana(self,show_most=1,show_least=1):
        """
        Last-character analysis
        """
        show_most,show_least=int(show_most),int(show_least)
        df=self.X.copy()
        df["last_word"]=df.text.apply(lambda x: x.split(" ")[-1])
        last_word_count=Counter(df["last_word"])
        if show_most>0:
            print("The {} most frequent characters:".format(show_most))
            print(last_word_count.most_common(int(show_most)))
        if show_least>0:
            print("The {} least frequent characters:".format(show_least))
            print(last_word_count.most_common()[-int(show_least):])
        print("Number of distinct last characters in all documents: {}".format(len(last_word_count)))
        return last_word_count

Demo:

For the training set:

train_path="../data/train_set.csv"
sentence_train=SentenceAnalysis(train_path,n_classes=14,with_label=True)

# __getitem__
sentence_train[1]
# __len__
len(sentence_train)
# data
train_X,train_y=sentence_train.data
# Article length analysis
df_length=sentence_train.passage_length_ana()
# Helper plot (keyword arguments keep title and xlabel from being swapped)
sentence_train.show_hist(df_length,100,title="Histogram of char count",xlabel='Text char count')
# News category distribution
df_label=sentence_train.label_distribution()
# Character count distribution
word_dict=sentence_train.word_distribution(5,5)
# Number of articles each character appears in
word_in_sentece_dict=sentence_train.word_in_sentece_distribution(5)
# Most frequent characters for each label
word_group_count=sentence_train.word_groupbylabel_count(5)
# Last-character analysis
last_word_count=sentence_train.last_word_ana(2,3)

For the test set:

test_path="../data/test_a.csv"
sentence_test=SentenceAnalysis(test_path,n_classes=14,with_label=False)

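# Demo
# __getitem__
sentence_test[1]
# __len__
len(sentence_test)
# data
sentence_test.data
# Article length analysis
df_length=sentence_test.passage_length_ana()
# Helper plot (keyword arguments keep title and xlabel from being swapped)
sentence_test.show_hist(df_length,100,title="Histogram of char count",xlabel='Text char count')
# News category distribution (no labels, so a message is printed and the analysis is skipped)
sentence_test.label_distribution()
# Character count distribution
word_dict=sentence_test.word_distribution(5)
# Number of articles each character appears in
word_in_sentece_dict=sentence_test.word_in_sentece_distribution(2,3)
# Most frequent characters for each label (no labels, so a message is printed and the analysis is skipped)
word_group_count=sentence_test.word_groupbylabel_count(5)
# Last-character analysis
last_word_count=sentence_test.last_word_ana(2,3)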