31 Pandas: using explode to turn one row into many rows for counting

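A practical problem: a single field holds several values. How do you split that value into multiple rows and then compute statistics on it? For example, a movie can belong to several genres and a person can have several hobbies, and you need counts per genre or per hobby.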

1. Read the data

    import pandas as pd

    df = pd.read_csv(
        "./datas/movielens-1m/movies.dat",
        header=None,
        names="MovieID::Title::Genres".split("::"),
        sep="::",
        engine="python"
    )
    df.head()
       MovieID  Title                               Genres
    0  1        Toy Story (1995)                    Animation|Children's|Comedy
    1  2        Jumanji (1995)                      Adventure|Children's|Fantasy
    2  3        Grumpier Old Men (1995)             Comedy|Romance
    3  4        Waiting to Exhale (1995)            Comedy|Drama
    4  5        Father of the Bride Part II (1995)  Comedy

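Question: how do we count how many movies each genre has?

Approach:

  • Split Genres on the | separator
  • Expand Genres into multiple rows
  • Count the number of movies under each genre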
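
The three steps above can be sketched end to end on a tiny in-memory sample, so the flow is visible without the MovieLens file. This is only an illustration: the `sample` frame holds three rows copied from the preview above, and the variable names are not from the original article.

```python
import pandas as pd

# Three rows copied from the df.head() preview; not the full dataset.
sample = pd.DataFrame({
    "Title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "Genres": [
        "Animation|Children's|Comedy",
        "Adventure|Children's|Fantasy",
        "Comedy|Romance",
    ],
})

# Step 1: split the delimited string into a Python list.
sample["Genre"] = sample["Genres"].str.split("|")

# Step 2: expand each list element into its own row.
exploded = sample.explode("Genre")

# Step 3: count how many movies fall under each genre.
genre_counts = exploded["Genre"].value_counts()
print(genre_counts)
```

Each movie contributes one row per genre, so the exploded frame has more rows than the sample has movies.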

2. Split the Genres field into a list

    df.info()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 3883 entries, 0 to 3882
    Data columns (total 3 columns):
    MovieID    3883 non-null int64
    Title      3883 non-null object
    Genres     3883 non-null object
    dtypes: int64(1), object(2)
    memory usage: 91.1+ KB

    # The Genres field is currently a string
    type(df.iloc[0]["Genres"])

    str

    # Add a new column holding the split values
    df["Genre"] = df["Genres"].map(lambda x: x.split("|"))
    df.head()
       MovieID  Title                               Genres                        Genre
    0  1        Toy Story (1995)                    Animation|Children's|Comedy   [Animation, Children's, Comedy]
    1  2        Jumanji (1995)                      Adventure|Children's|Fantasy  [Adventure, Children's, Fantasy]
    2  3        Grumpier Old Men (1995)             Comedy|Romance                [Comedy, Romance]
    3  4        Waiting to Exhale (1995)            Comedy|Drama                  [Comedy, Drama]
    4  5        Father of the Bride Part II (1995)  Comedy                        [Comedy]
    # Each Genre value is a list
    print(df["Genre"][0])
    print(type(df["Genre"][0]))

    ['Animation', "Children's", 'Comedy']
    <class 'list'>

    df.info()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 3883 entries, 0 to 3882
    Data columns (total 4 columns):
    MovieID    3883 non-null int64
    Title      3883 non-null object
    Genres     3883 non-null object
    Genre      3883 non-null object
    dtypes: int64(1), object(3)
    memory usage: 121.5+ KB
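
As an alternative to `map` with a lambda, the split can also be written with the vectorized `Series.str.split`. The sketch below uses a made-up three-element Series (not the MovieLens data) to show that both give the same lists, and that `str.split` leaves a missing value as missing, whereas the lambda would raise on `None`. The `info()` output above shows no nulls in Genres, so this difference only matters for other datasets.

```python
import pandas as pd

# Hypothetical values for illustration; the last entry is deliberately missing.
genres = pd.Series(["Animation|Children's|Comedy", "Comedy", None])

# Vectorized split: missing values stay missing instead of raising.
split_genres = genres.str.split("|")
print(split_genres.tolist())

# The lambda version from the article, applied only to the non-missing values.
mapped = genres.dropna().map(lambda x: x.split("|"))
print(mapped.tolist())
```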

3. Use explode to split one row into multiple rows

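Syntax: pandas.DataFrame.explode(column)
Expands each element of a list-like value in the given column into its own row. The other columns are copied, and the index label is repeated for every new row.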
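
A small sketch of the index behaviour described here, using a hypothetical `demo` frame (the titles "A", "B", "C" are placeholders, not MovieLens rows). It also shows two details that are easy to miss: an empty list becomes a single row with a missing value, and `reset_index(drop=True)` can be used afterwards if unique row labels are needed.

```python
import pandas as pd

# Hypothetical frame: two genres, one genre, and an empty list.
demo = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Genre": [["Comedy", "Drama"], ["Comedy"], []],
})

exploded = demo.explode("Genre")

# Index labels are repeated, one per list element.
print(exploded.index.tolist())

# The empty list yields one row whose Genre is missing (NaN).
print(exploded)

# Renumber the rows if repeated index labels are not wanted.
renumbered = exploded.reset_index(drop=True)
print(renumbered.index.tolist())
```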

    df_new = df.explode("Genre")
    df_new.head(10)
       MovieID  Title                     Genres                        Genre
    0  1        Toy Story (1995)          Animation|Children's|Comedy   Animation
    0  1        Toy Story (1995)          Animation|Children's|Comedy   Children's
    0  1        Toy Story (1995)          Animation|Children's|Comedy   Comedy
    1  2        Jumanji (1995)            Adventure|Children's|Fantasy  Adventure
    1  2        Jumanji (1995)            Adventure|Children's|Fantasy  Children's
    1  2        Jumanji (1995)            Adventure|Children's|Fantasy  Fantasy
    2  3        Grumpier Old Men (1995)   Comedy|Romance                Comedy
    2  3        Grumpier Old Men (1995)   Comedy|Romance                Romance
    3  4        Waiting to Exhale (1995)  Comedy|Drama                  Comedy
    3  4        Waiting to Exhale (1995)  Comedy|Drama                  Drama

4. Count the genres after the split

    %matplotlib inline
    df_new["Genre"].value_counts().plot.bar()

    <matplotlib.axes._subplots.AxesSubplot at 0x23d73917cc8>
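
If the numbers are wanted as a table rather than a bar chart, one possible variant is to group by genre and count distinct MovieID values. This is a sketch, not the article's method: the `exploded_sample` frame below is hand-built from the ten-row preview in section 3 (first three movies only), and counting distinct IDs is an assumption that guards against a genre being listed twice for the same movie.

```python
import pandas as pd

# Hand-built stand-in for df_new, covering MovieID 1 to 3 from the preview.
exploded_sample = pd.DataFrame({
    "MovieID": [1, 1, 1, 2, 2, 2, 3, 3],
    "Genre": [
        "Animation", "Children's", "Comedy",
        "Adventure", "Children's", "Fantasy",
        "Comedy", "Romance",
    ],
})

# Distinct movies per genre, largest first.
counts = (
    exploded_sample.groupby("Genre")["MovieID"]
    .nunique()
    .sort_values(ascending=False)
)
print(counts)
```

The counts sum to more than the number of movies, because a movie with several genres is counted once under each of them.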
