31 Pandas使用explode实现一行变多行统计

解决实际问题：一个字段包含多个值，怎样将这个值拆分成多行，然后实现统计比如：一个电影有多个分类、一个人有多个喜好，需要按分类、喜好做统计

1、读取数据

import pandas as pd

df = pd.read_csv(
    "./datas/movielens-1m/movies.dat",
    header=None,
    names="MovieID::Title::Genres".split("::"),
    sep="::",
    engine="python"
)

df.head()

.dataframe tbody tr th:only-of-type { vertical-align: middle; } <pre><code>.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </code></pre>

	MovieID	Title	Genres
0	1	Toy Story (1995)	Animation\|Children's\|Comedy
1	2	Jumanji (1995)	Adventure\|Children's\|Fantasy
2	3	Grumpier Old Men (1995)	Comedy\|Romance
3	4	Waiting to Exhale (1995)	Comedy\|Drama
4	5	Father of the Bride Part II (1995)	Comedy

问题：怎样实现这样的统计，每个题材有多少部电影？

解决思路：

将Genres按照分隔符|拆分
按Genres拆分成多行
统计每个Genres下的电影数目

2、将Genres字段拆分成列表

df.info()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 3883 entries, 0 to 3882
    Data columns (total 3 columns):
    MovieID    3883 non-null int64
    Title      3883 non-null object
    Genres     3883 non-null object
    dtypes: int64(1), object(2)
    memory usage: 91.1+ KB
    


# 当前的Genres字段是字符串类型
type(df.iloc[0]["Genres"])


    str



# 新增一列
df["Genre"] = df["Genres"].map(lambda x:x.split("|"))

df.head()

.dataframe tbody tr th:only-of-type { vertical-align: middle; } <pre><code>.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </code></pre>

	MovieID	Title	Genres	Genre
0	1	Toy Story (1995)	Animation\|Children's\|Comedy	[Animation, Children's, Comedy]
1	2	Jumanji (1995)	Adventure\|Children's\|Fantasy	[Adventure, Children's, Fantasy]
2	3	Grumpier Old Men (1995)	Comedy\|Romance	[Comedy, Romance]
3	4	Waiting to Exhale (1995)	Comedy\|Drama	[Comedy, Drama]
4	5	Father of the Bride Part II (1995)	Comedy	[Comedy]

# Genre的类型是列表
print(df["Genre"][0])
print(type(df["Genre"][0]))

    ['Animation', "Children's", 'Comedy']
    <class 'list'>
    
df.info()


    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 3883 entries, 0 to 3882
    Data columns (total 4 columns):
    MovieID    3883 non-null int64
    Title      3883 non-null object
    Genres     3883 non-null object
    Genre      3883 non-null object
    dtypes: int64(1), object(3)
    memory usage: 121.5+ KB

3、使用explode将一行拆分成多行

语法：pandas.DataFrame.explode(column) 将dataframe的一个list-like的元素按行复制，index索引随之复制

df_new = df.explode("Genre")

df_new.head(10)

.dataframe tbody tr th:only-of-type { vertical-align: middle; } <pre><code>.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </code></pre>

	MovieID	Title	Genres	Genre
0	1	Toy Story (1995)	Animation\|Children's\|Comedy	Animation
0	1	Toy Story (1995)	Animation\|Children's\|Comedy	Children's
0	1	Toy Story (1995)	Animation\|Children's\|Comedy	Comedy
1	2	Jumanji (1995)	Adventure\|Children's\|Fantasy	Adventure
1	2	Jumanji (1995)	Adventure\|Children's\|Fantasy	Children's
1	2	Jumanji (1995)	Adventure\|Children's\|Fantasy	Fantasy
2	3	Grumpier Old Men (1995)	Comedy\|Romance	Comedy
2	3	Grumpier Old Men (1995)	Comedy\|Romance	Romance
3	4	Waiting to Exhale (1995)	Comedy\|Drama	Comedy
3	4	Waiting to Exhale (1995)	Comedy\|Drama	Drama

4、实现拆分后的题材的统计

%matplotlib inline
df_new["Genre"].value_counts().plot.bar()

    <matplotlib.axes._subplots.AxesSubplot at 0x23d73917cc8>

本文使用文章同步助手同步

31 Pandas使用explode实现一行变多行统计