Dashixiong's Python Machine Learning Notes: The Pandas Library

I. About Pandas

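1. Pandas and NumPy
  • Pandas is built on top of NumPy arrays and makes data preprocessing, cleaning, and analysis faster and simpler.
  • Pandas is designed for tabular and heterogeneous data; it can be thought of as Excel for Python.
  • NumPy is better suited to homogeneous numerical array data.
  • Pandas provides two main data structures: DataFrame and Series.
>>>import pandas as pd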
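2. The DataFrame structure
  • A DataFrame is a tabular data type; think of it as an Excel sheet.
  • A DataFrame can be viewed as a dict of Series that share the same index, one Series per column.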
>>>import pandas as pd
>>>data ={"name":["pp","qq","doudou","douding","xiaobudian"],
      "age":[10,1.5,0,5,7],
       "gender":["m","m","m","f","f"]
      }
>>>df = pd.DataFrame(data)
>>>print(df)
         name   age gender
0          pp  10.0      m
1          qq   1.5      m
2      doudou   0.0      m
3     douding   5.0      f
4  xiaobudian   7.0      f
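3. The Series structure
  • A Series is a one-dimensional, array-like object made of a set of data plus a set of labels for that data, that is, two parts: index and values.
  • Think of a Series as one column of an Excel sheet.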
>>>import pandas as pd
>>>import numpy as np
>>>random_num = np.random.rand(10)
>>>s = pd.Series(random_num)
>>>print(s)
0    0.241130
1    0.911937
2    0.276555
3    0.570505
4    0.915634
5    0.214568
6    0.179911
7    0.113886
8    0.449848
9    0.025474
dtype: float64
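The two parts of a Series can be inspected separately. The following is a minimal sketch; the labels and values are made up for illustration.

```python
import pandas as pd

# A Series pairs each value with a label; here the labels are chosen by hand.
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

labels = list(s.index)   # the index part
values = s.to_numpy()    # the values part, as a NumPy array
print(labels)            # ['a', 'b', 'c']
print(values)
print(s["b"])            # label-based lookup, prints 20
```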

II. Creating tables

1. Creating a Series
1.1 From a list
>>>import pandas as pd
>>>s = pd.Series(["a","b","c","d","e"])
>>>print(s)
0    a
1    b
2    c
3    d
4    e
dtype: object
1.2 From an ndarray
>>>import pandas as pd
>>>import numpy as np
>>>s = pd.Series(np.arange(5))
>>>print(s)
0    0
1    1
2    2
3    3
4    4
dtype: int32
1.3 From a dict
>>>import pandas as pd
>>>import numpy as np
>>>s = pd.Series({'a':1,'b':2,'c':3,'d':4,'e':5})
>>>print(s)
a    1
b    2
c    3
d    4
e    5
dtype: int64
1.4 Supplying the index as a list
>>>import pandas as pd
>>>import numpy as np
>>>s = pd.Series(np.arange(5),index=['e','d','c','b','a'])
>>>print(s)
e    0
d    1
c    2
b    3
a    4
dtype: int32
2. Creating a DataFrame
2.1 From an ndarray
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>print(df)
   One  Two  Three  Four  Five
A    0    1      2     3     4
B    5    6      7     8     9
C   10   11     12    13    14
D   15   16     17    18    19
2.2 From Series
>>>import pandas as pd
>>>index=list("abcde")
>>>s = {'one':pd.Series(range(5),index=index),
>>>    'two':pd.Series(range(4,9),index=index),
>>>    'three':pd.Series(range(8,13),index=index),
>>>    'four':pd.Series(range(12,17),index=index),}
>>>fd = pd.DataFrame(data=s)
>>>print(fd)
   one  two  three  four
a    0    4      8    12
b    1    5      9    13
c    2    6     10    14
d    3    7     11    15
e    4    8     12    16
2.3 From a list of dicts or Series
>>>import pandas as pd
>>>l1 = [{'one':1,'two':2,'tree':3},
>>>    {'one':5,'two':6},
>>>    {'three':7,'four':8},
>>>    {'four':4},]
>>>fd = pd.DataFrame(data=l1)
>>>print(fd)
   one  two  tree  three  four
0  1.0  2.0   3.0    NaN   NaN
1  5.0  6.0   NaN    NaN   NaN
2  NaN  NaN   NaN    7.0   8.0
3  NaN  NaN   NaN    NaN   4.0
2.4 From a dict of dicts
>>>import pandas as pd
>>>d1 = {'one':{'a':1,'b':2,'c':3,'d':4},
>>>    'two':{'a':5,'b':7,'c':6,'d':8},
>>>    'three':{'a':11,'c':12},
>>>    'four':{'b':13,'c':14},}
>>>fd = pd.DataFrame(data=d1)
>>>print(fd)
   one  two  three  four
a    1    5   11.0   NaN
b    2    7    NaN  13.0
c    3    6   12.0  14.0
d    4    8    NaN   NaN
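3. Reading tables from files
3.1 Reader functions
Function          Description
read_csv()        Loads delimited data from a file; the default delimiter is a comma.
read_table()      Loads delimited data from a file; the default delimiter is a tab.
read_fwf()        Reads fixed-width column data that has no delimiters.
read_clipboard()  Reads data from the clipboard.
read_excel()      Loads data from an XLS or XLSX file.
read_hdf()        Loads data from an HDF5 file.
read_html()       Loads tables from an HTML document.
read_json()       Loads data from a JSON string.
read_msgpack()    Reads pandas data encoded in the MessagePack binary format (removed in pandas 1.0).
read_pickle()     Reads data from a pickled object.
read_sas()        Reads a SAS dataset stored in one of the SAS system's custom storage formats.
read_sql()        Reads the result of a SQL query, using SQLAlchemy.
read_stata()      Reads data in the Stata file format.
read_feather()    Reads the Feather binary file format.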
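3.2 Common parameters
Parameter         Description
path              A string giving a file system location or URL, or a file-like object (the actual parameter name is filepath_or_buffer).
sep / delimiter   Character sequence or regular expression used to split the fields in each row.
header            Row number to use as the column names. Defaults to 0 (the first row); set header=None if the file has no header row.
index_col         Column number(s) or name(s) to use as the row index. Either a single name/number or a list of them (hierarchical index).
names             List of column names for the result; combine with header=None to supply the header yourself.
skiprows          Number of rows to skip at the start of the file, or a list of row numbers (starting from 0) to skip.
na_values         Values that should be treated as NA.
comment           A single character that marks the rest of a line as a comment to be dropped.
parse_dates       Attempts to parse data as dates; defaults to False. If True, tries to parse the index. It can also be a list of column numbers or names to parse. If an element of the list is itself a list or tuple, those columns are combined before being parsed as one date.
keep_date_col     When several columns are combined to parse a date, keep the source columns. Defaults to False.
converters        Dict mapping column numbers/names to functions. For example, {"age": f} applies the function f to every value in the age column.
dayfirst          When parsing ambiguous dates, treat them as day-first (international format). Defaults to False.
date_parser       Function used to parse dates.
nrows             Number of rows to read.
iterator          Returns a TextFileReader for reading the file piece by piece.
chunksize         Size of each file chunk (for iteration).
skipfooter        Number of rows to skip, counted from the end of the file (older releases spelled this skip_footer).
verbose           Prints parser diagnostics, such as the number of missing values found in non-numeric columns.
encoding          Text encoding, for example "utf-8" or "gbk".
squeeze           If the parsed data has only one column, return a Series (removed in pandas 2.0).
thousands         Thousands separator, such as "," or ".".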
>>>import pandas as pd
>>>import os
>>>path = os.path.join("d:\\","sample.et")
>>>fd = pd.read_table(path)
>>>print(fd)
  Unnamed: 0  one  two  three  four
0          a    1    2      3     4
1          b    5    6      7     8
2          c    9   10     11    12
3          d   13   14     15    16
4          e   17   18     19    20
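Several of the parameters above can be combined in a single call. The sketch below reads from an in-memory string instead of a file so that it is self-contained; the file contents and the column names key, value, and day are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical file contents: one leading line to skip, no header row,
# and "n/a" marking a missing value.
raw = io.StringIO(
    "exported sample\n"
    "a,1,2020-01-01\n"
    "b,n/a,2020-01-02\n"
    "c,3,2020-01-03\n"
)

df = pd.read_csv(
    raw,
    sep=",",
    skiprows=1,                      # skip the first line
    header=None,                     # the data has no header row
    names=["key", "value", "day"],   # so the column names are supplied here
    index_col="key",                 # use the "key" column as the row index
    na_values=["n/a"],               # treat "n/a" as NaN
    parse_dates=["day"],             # parse this column as dates
)
print(df)
print(df.dtypes)
```

Because one entry of value is missing, that column comes back as a floating-point column rather than an integer one.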

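III. Accessing, adding, deleting, modifying, and querying table data

1. Accessing data
1.1 Accessing Series data
  • Access data with series[label], much like looking up a key in a dict.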
>>>import pandas as pd
>>>s1 = pd.Series(["a","b","c"],index=["one","two","three"])
>>>print(s1["two"])
b
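1.2 Accessing DataFrame data

1) The loc indexer

  • Selects by row label and column label. loc is an indexer used with square brackets rather than a function call, and its label slices include both endpoints.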
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>v = df.loc['A':'B','One':'Two']
>>>print(v)
  One  Two
A    0    1
B    5    6

2) The iloc indexer

  • Absolute positional indexing by row number and column number, starting from 0; position slices exclude the stop.
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>v = df.iloc[1:3,2:4]
>>>print(v)
 Three  Four
B      7     8
C     12    13

3) The at indexer

  • Selects a single value by label; used like loc.
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>v = df.at['A','Two']
>>>print(v)
1

4) The iat indexer

  • Selects a single value by position; used like iloc.
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>v = df.iat[2,4]
>>>print(v)
14
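The difference between label slices and position slices is easy to miss. The following sketch, reusing the same 4x5 table, checks that a label slice with both endpoints included selects the same block as a position slice whose stop is excluded.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape((4, 5)),
                  index=list("ABCD"),
                  columns=["One", "Two", "Three", "Four", "Five"])

by_label = df.loc["A":"B", "One":"Two"]   # label slices include both endpoints
by_position = df.iloc[0:2, 0:2]           # position slices exclude the stop
print(by_label.equals(by_position))       # True

# at / iat return a single scalar for one row/column pair
print(df.at["C", "Five"], df.iat[2, 4])   # 14 14
```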
1.3 Getting information about a table

1) Dimensions

  • df.shape
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>print(df.shape)
(4, 5)

2) Basic information

  • df.info()
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>df.info() # info() prints the summary itself and returns None
<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, A to D
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   One     4 non-null      int32
 1   Two     4 non-null      int32
 2   Three   4 non-null      int32
 3   Four    4 non-null      int32
 4   Five    4 non-null      int32
dtypes: int32(5)
memory usage: 112.0+ bytes

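3) Data types

  • df.dtypes gives the dtype of every column.
  • df[column].dtype gives the dtype of a single column (dtype is a Series attribute; a DataFrame only has dtypes).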
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>print(df.dtypes,'\n') # dtypes of all columns
>>>print(df['Two'].dtype) # dtype of one column
One      int32
Two      int32
Three    int32
Four     int32
Five     int32
dtype: object 

int32

4) Checking for missing values

  • df.isnull()
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>print(df.isnull())
    One    Two  Three   Four   Five
A  False  False  False  False  False
B  False  False  False  False  False
C  False  False  False  False  False
D  False  False  False  False  False
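The table above has no missing values, so every cell is False. A small sketch with a few NaN entries (values invented for illustration) shows what the mask looks like otherwise, and how summing it counts the missing cells per column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"One": [0.0, np.nan, 10.0, 15.0],
                   "Two": [1.0, 6.0, np.nan, np.nan]},
                  index=list("ABCD"))

mask = df.isnull()                # True where a cell is missing
missing_per_column = mask.sum()   # True counts as 1, so this counts NaN cells
print(mask)
print(missing_per_column)
```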

5) Getting the unique values of a column

  • df[column].unique()
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>print(df['Three'].unique())
[ 2  7 12 17]

6) Getting all values

  • df.values
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>print(df.values)
[[ 0  1  2  3  4]
[ 5  6  7  8  9]
[10 11 12 13 14]
[15 16 17 18 19]]

7) Getting the column names

  • df.columns
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>print(df.columns)
Index(['One', 'Two', 'Three', 'Four', 'Five'], dtype='object')

8) Viewing the first and last rows

  • df.head() shows the first rows (5 by default).
  • df.tail() shows the last rows (5 by default).
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(100).reshape((20,5))
>>>index = list(range(1,21))
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>print(df.head(),'\n')
>>>print(df.tail())
  One  Two  Three  Four  Five
1    0    1      2     3     4
2    5    6      7     8     9
3   10   11     12    13    14
4   15   16     17    18    19
5   20   21     22    23    24 

   One  Two  Three  Four  Five
16   75   76     77    78    79
17   80   81     82    83    84
18   85   86     87    88    89
19   90   91     92    93    94
20   95   96     97    98    99
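2. Adding data
2.1 Adding data to a Series
  • Series.append() was removed in pandas 2.0; use pd.concat() to join Series.
  • Every object being joined must itself be a Series.
  • A Series created without an explicit index is labeled from 0.
>>>import pandas as pd
>>>s1 = pd.Series(["a","b","c"],index=["one","two","three"])
>>>s2 = pd.Series(["d"],index=["four"])
>>>s3 = pd.concat([s1,s2])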
>>>print(s3)
one      a
two      b
three    c
four     d
dtype: object

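2.2 Adding data to a DataFrame
  • DataFrame.append() was removed in pandas 2.0; use pd.concat() to add rows.
  • The new row can come from a Series, dict, or array once it is wrapped in a one-row DataFrame.
  • Either give the new row an index label or pass ignore_index=True to renumber the rows.
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>s1 = pd.Series([20,21,22,23,24],index=columns)
>>>df = pd.concat([df,s1.to_frame().T],ignore_index=True)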
>>>print(df)
   One  Two  Three  Four  Five
0    0    1      2     3     4
1    5    6      7     8     9
2   10   11     12    13    14
3   15   16     17    18    19
4   20   21     22    23    24
3. Deleting data
3.1 Deleting from a Series
  • drop(label) returns a new Series without the entry at that index label.
>>>import pandas as pd
>>>s1 = pd.Series(["a","b","c"],index=["one","two","three"])
>>>s1 = s1.drop("one")
>>>print(s1)
two      b
three    c
dtype: object
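3.2 Deleting from a DataFrame

1) Deleting columns

  • Use drop(labels, axis=1), or equivalently drop(columns=labels).
  • labels is a column name or a list of column names.
  • axis=1 means the labels refer to columns.
  • With inplace=True the original table is modified instead of a new one being returned.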
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>df = df.drop('Two',axis=1)
>>>print(df)
  One  Three  Four  Five
A    0      2     3     4
B    5      7     8     9
C   10     12    13    14
D   15     17    18    19

2) Deleting rows

  • Use drop(labels) to delete rows by index label.
  • axis defaults to 0 (rows).
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>df.drop(['A','B'],inplace=True)
>>>print(df)
  One  Two  Three  Four  Five
C   10   11     12    13    14
D   15   16     17    18    19
4. Modifying data
4.1 Renaming labels
  • rename(index=..., columns=...) changes row or column labels.
  • index and columns each take a dict that maps old names to new names.
>>>import pandas as pd
>>>s1 = pd.Series(["a","b","c"],index=["one","two","three"])
>>>s1.rename({"one":1},inplace=True)
>>>print(s1)
1        a
two      b
three    c
dtype: object
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>df.rename(columns={'One':'A','Two':'B','Three':'C','Four':'D','Five':'E'},index={'A':'One','B':'Two','C':'Three','D':'Four'},inplace=True)
>>>print(df)
        A   B   C   D   E
One     0   1   2   3   4
Two     5   6   7   8   9
Three  10  11  12  13  14
Four   15  16  17  18  19
4.2 Modifying values
  • Access the data, then assign to it directly.
>>>import pandas as pd
>>>s1 = pd.Series(["a","b","c"],index=["one","two","three"])
>>>s1["two"] = 2
>>>print(s1)
one      a
two      2
three    c
dtype: object
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>df.loc['A':'B','One':'Two'] = "new value"
>>>print(df)
         One        Two  Three  Four  Five
A  new value  new value      2     3     4
B  new value  new value      7     8     9
C         10         11     12    13    14
D         15         16     17    18    19
5. Querying data
5.1 Querying dict-style
>>>import pandas as pd
>>>s1 = pd.Series(["a","b","c"],index=["one","two","three"])
>>>print(s1["two"])
b
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>print(df['Three']) # select a column
>>>print(f"{'-'*20}")
>>>print(df[3:]) # select rows with a position slice
A     2
B     7
C    12
D    17
Name: Three, dtype: int32
--------------------
   One  Two  Three  Four  Five
D   15   16     17    18    19
5.2 Querying with indexers
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>print(df.loc['A':'B','One':'Two'])
   One  Two
A    0    1
B    5    6

IV. Data cleaning

1. Filling missing values with a given value
  • df.fillna(value=0)
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>df.loc['A':'B','One':'Two'] = None
>>>df.fillna(value=999,inplace=True)
>>>print(df)
     One    Two  Three  Four  Five
A  999.0  999.0      2     3     4
B  999.0  999.0      7     8     9
C   10.0   11.0     12    13    14
D   15.0   16.0     17    18    19
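fillna() also accepts a dict, so each column can get its own fill value. The sketch below fills one column with 0 and another with that column's mean; the data is invented for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"One": [np.nan, 5.0, 10.0, 15.0],
                   "Two": [1.0, np.nan, 11.0, 16.0]},
                  index=list("ABCD"))

# One fill value per column: a constant for "One", the column mean for "Two".
filled = df.fillna(value={"One": 0, "Two": df["Two"].mean()})
print(filled)
```

Without inplace=True the original df keeps its NaN entries and filled is a new table.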
2. Stripping whitespace from strings
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>df.loc['A':'D','One':'Two'] = " with space "
>>>df['One'] = df['One'].str.strip() # strip leading and trailing whitespace and store the result
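The snippet above prints nothing, so here is a minimal self-contained sketch of the same idea with visible output. The strings are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({"One": [" with space ", "  left", "right  "]})

# .str.strip() returns a new Series, so assign it back to keep the result.
df["One"] = df["One"].str.strip()
print(df["One"].tolist())   # ['with space', 'left', 'right']
```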
3. Changing case
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.array(["CONTENT"]*20).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>df['Two'] = df['Two'].str.lower()
>>>print(df)
       One      Two    Three     Four     Five
A  CONTENT  content  CONTENT  CONTENT  CONTENT
B  CONTENT  content  CONTENT  CONTENT  CONTENT
C  CONTENT  content  CONTENT  CONTENT  CONTENT
D  CONTENT  content  CONTENT  CONTENT  CONTENT
4. Changing the data type
  • Use astype(type) to change the data type.
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>df = df.astype('float')
>>>print(df)
    One   Two  Three  Four  Five
A   0.0   1.0    2.0   3.0   4.0
B   5.0   6.0    7.0   8.0   9.0
C  10.0  11.0   12.0  13.0  14.0
D  15.0  16.0   17.0  18.0  19.0
5. Removing duplicates
  • Use drop_duplicates(subset=None, keep='first', inplace=False) to remove duplicate rows.
  • keep='last' keeps the last row of each duplicate group instead.
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.random.randint(5, size=(4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>print(df)
   One  Two  Three  Four  Five
A    1    0      2     1     4
B    3    2      4     0     4
C    1    4      2     4     0
D    3    2      3     1     1 
>>>df.drop_duplicates(['One','Two'],keep='first',inplace=True) # among rows with the same (One, Two) pair, keep only the first
>>>print(df)
   One  Two  Three  Four  Five
A    1    0      2     1     4
B    3    2      4     0     4
C    1    4      2     4     0
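To see what keep changes, the sketch below rebuilds the same four rows with fixed values (rather than random ones) and compares keep='first' with keep='last'. Rows B and D share the pair (3, 2) in columns One and Two.

```python
import pandas as pd

columns = ["One", "Two", "Three", "Four", "Five"]
df = pd.DataFrame([[1, 0, 2, 1, 4],
                   [3, 2, 4, 0, 4],
                   [1, 4, 2, 4, 0],
                   [3, 2, 3, 1, 1]],
                  index=list("ABCD"), columns=columns)

first = df.drop_duplicates(["One", "Two"], keep="first")  # drops D, keeps B
last = df.drop_duplicates(["One", "Two"], keep="last")    # drops B, keeps D
print(list(first.index))   # ['A', 'B', 'C']
print(list(last.index))    # ['A', 'C', 'D']
```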
6. Replacing values
  • Use replace() to replace values in the table.
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.random.randint(5, size=(4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>df.replace(2,'B',inplace=True) # replace every 2 in the table with 'B'
>>>print(df)
  One Two Three Four  Five
A   1   0     0    0     1
B   B   B     3    B     1
C   3   1     B    B     4
D   0   3     B    3     4

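V. Data preprocessing

1. Combining tables

1) The pd.merge(left, right, how="inner") function

  • how selects the join type; "inner", "outer", "left", and "right" are the common choices, and the default is "inner". When no key is given, merge joins on all columns the two tables have in common.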
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>n2 = np.arange(20,40).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df1 = pd.DataFrame(data=n1,index=index,columns=columns)
>>>df2 = pd.DataFrame(data=n2,index=index,columns=columns)
>>>df3 = pd.merge(df1,df2,how='outer')
>>>print(df3)
  One  Two  Three  Four  Five
0    0    1      2     3     4
1    5    6      7     8     9
2   10   11     12    13    14
3   15   16     17    18    19
4   20   21     22    23    24
5   25   26     27    28    29
6   30   31     32    33    34
7   35   36     37    38    39
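The example above shares no rows between the two tables, so it does not show how the join types differ. The following sketch uses two small tables with an explicit key column (the names key, x, and y are invented for illustration) and lists which keys survive each kind of join.

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})

keys = {}
for how in ["inner", "outer", "left", "right"]:
    merged = pd.merge(left, right, on="key", how=how)
    keys[how] = sorted(merged["key"])
    print(how, keys[how])
```

An inner join keeps only keys present on both sides, an outer join keeps every key, and left/right joins keep all keys of the corresponding table, filling the other side's columns with NaN where there is no match.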

2) Stacking vertically with pd.concat()

  • Stacks two DataFrames on top of each other; each keeps its own row labels. DataFrame.append(), which did the same job, was removed in pandas 2.0.
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>n2 = np.arange(20,40).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df1 = pd.DataFrame(data=n1,index=index,columns=columns)
>>>df2 = pd.DataFrame(data=n2,index=index,columns=columns)
>>>df3 = pd.concat([df1,df2])
>>>print(df3)
   One  Two  Three  Four  Five
A    0    1      2     3     4
B    5    6      7     8     9
C   10   11     12    13    14
D   15   16     17    18    19
A   20   21     22    23    24
B   25   26     27    28    29
C   30   31     32    33    34
D   35   36     37    38    39

3) The DataFrame.join(df) function

  • Joins two DataFrames side by side, aligning rows on the index.
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>n2 = np.arange(20,40).reshape((4,5))
>>>index = list('ABCD')
>>>columns1 = ['One','Two','Three','Four','Five']
>>>columns2 = ['Six','Seven','Eight','Nine','Ten']
>>>df1 = pd.DataFrame(data=n1,index=index,columns=columns1)
>>>df2 = pd.DataFrame(data=n2,index=index,columns=columns2)
>>>df3 = df1.join(df2)
>>>print(df3)
  One  Two  Three  Four  Five  Six  Seven  Eight  Nine  Ten
A    0    1      2     3     4   20     21     22    23   24
B    5    6      7     8     9   25     26     27    28   29
C   10   11     12    13    14   30     31     32    33   34
D   15   16     17    18    19   35     36     37    38   39

4) The pd.concat(objs, axis=0, join='outer') function

  • Concatenates several DataFrames: along rows with axis=0 (the default) or along columns with axis=1.
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(10).reshape((2,5))
>>>n2 = np.arange(10,20).reshape((2,5))
>>>n3 = np.arange(20,30).reshape((2,5))
>>>columns = ['One','Two','Three','Four','Five']
>>>df1 = pd.DataFrame(data=n1,index=["A","B"],columns=columns)
>>>df2 = pd.DataFrame(data=n2,index=["C","D"],columns=columns)
>>>df3 = pd.DataFrame(data=n3,index=["E","F"],columns=columns)
>>>df4 = pd.concat([df1,df2,df3])
>>>print(df4)
 One  Two  Three  Four  Five
A    0    1      2     3     4
B    5    6      7     8     9
C   10   11     12    13    14
D   15   16     17    18    19
E   20   21     22    23    24
F   25   26     27    28    29
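The axis and join parameters in the signature matter once the tables do not line up exactly. A minimal sketch, with two small invented tables whose indexes only partly overlap:

```python
import pandas as pd

df1 = pd.DataFrame({"One": [0, 5]}, index=["A", "B"])
df2 = pd.DataFrame({"Two": [1, 6]}, index=["B", "C"])

# axis=1 places the tables side by side, aligned on the row index.
outer = pd.concat([df1, df2], axis=1)                # join='outer': union of labels
inner = pd.concat([df1, df2], axis=1, join="inner")  # only labels present in both
print(outer)
print(inner)
```

With join='outer' the rows A, B, and C all appear and the gaps are NaN; with join='inner' only row B remains.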
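2. Setting a composite index
  • set_index('id') turns a column into the row index; passing a list of columns builds a composite (multi-level) index.
  • reset_index('id') turns that index level back into a column.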
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(10).reshape((2,5))
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=["A","B"],columns=columns)
>>>df = df.set_index('Four')
>>>print(df)
      One  Two  Three  Five
Four                       
3       0    1      2     4
8       5    6      7     9

>>>df = df.reset_index('Four')
>>>print(df)
   Four  One  Two  Three  Five
0     3    0    1      2     4
1     8    5    6      7     9
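The example above uses a single column as the index. A composite index comes from passing a list of columns, as in this sketch; the column names city, year, and value are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({"city": ["X", "X", "Y"],
                   "year": [2020, 2021, 2020],
                   "value": [1, 2, 3]})

indexed = df.set_index(["city", "year"])     # two-level row index
print(indexed)
print(indexed.loc[("X", 2021), "value"])     # look up by a (city, year) pair: 2

restored = indexed.reset_index()             # both levels become columns again
print(list(restored.columns))
```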
3. Sorting

1) Sorting by value

  • DataFrame.sort_values(by)
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.random.randint(5, size=(4,5))
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=["A","B","C","D"],columns=columns)
>>>df.sort_values(by=["One"],inplace=True)
>>>print(df)
  One  Two  Three  Four  Five
D    2    0      3     3     3
A    3    0      2     3     2
C    3    2      3     2     1
B    4    2      2     3     3

2) Sorting by index

  • DataFrame.sort_index()
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=[4,1,3,2],columns=columns)
>>>df.sort_index(inplace=True)
>>>print(df)
  One  Two  Three  Four  Five
1    5    6      7     8     9
2   15   16     17    18    19
3   10   11     12    13    14
4    0    1      2     3     4
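Both sorting methods take an ascending argument, and sort_values() can break ties with a second column. A small sketch with fixed, invented values:

```python
import pandas as pd

df = pd.DataFrame({"One": [3, 2, 3, 4], "Two": [0, 0, 2, 2]},
                  index=list("ABCD"))

# Sort by One ascending; among equal One values, sort by Two descending.
by_value = df.sort_values(by=["One", "Two"], ascending=[True, False])
print(list(by_value.index))   # ['B', 'C', 'A', 'D']

# Sort by row label in descending order.
by_index = df.sort_index(ascending=False)
print(list(by_index.index))   # ['D', 'C', 'B', 'A']
```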
4. Group labeling

1) Labeling by value

>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=list("ABCD"),columns=columns)
>>>df['group'] = np.where(df['Three'] > 10,'high','low')
>>>print(df)
  One  Two  Three  Four  Five group
A    0    1      2     3     4   low
B    5    6      7     8     9   low
C   10   11     12    13    14  high
D   15   16     17    18    19  high

2) Labeling by a compound condition

>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.random.randint(5, size=(4,5))
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=list("ABCD"),columns=columns)
>>>df.loc[(df['Three']==4)&(df['Four']<3),'sign']= "target"
>>>print(df)
  One  Two  Three  Four  Five    sign
A    0    1      2     0     3     NaN
B    2    4      4     0     1  target
C    4    1      0     2     4     NaN
D    3    1      4     0     0  target

3) Grouping a field into a new table

>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.random.randint(5, size=(4,5))
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=list("ABCD"),columns=columns)
>>>d1 = pd.DataFrame(((x,'high' if x > 3 else 'low') for x in df['Two']),index=df.index,columns=['value','type'])
>>>print(d1)
  value  type
A      1   low
B      4  high
C      3   low
D      4  high

VI. Filtering data

1. "AND" filtering
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.random.randint(5, size=(4,5))
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=list("ABCD"),columns=columns)
>>>df1 = df.loc[(df['One']>3)&(df['Four']<3),columns]
>>>print(df1)
   One  Two  Three  Four  Five
B    4    2      2     2     4
2. "OR" filtering
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.random.randint(5, size=(4,5))
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=list("ABCD"),columns=columns)
>>>df1 = df.loc[(df['One']>3)|(df['Four']<3),columns]
>>>print(df1)
   One  Two  Three  Four  Five
B    1    4      4     0     2
C    3    1      4     2     2
3. "NOT" filtering
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.random.randint(5, size=(4,5))
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=list("ABCD"),columns=columns)
>>>df1 = df.loc[(df['One']!=3),columns]
>>>print(df1)
   One  Two  Three  Four  Five
A    1    2      1     2     0
B    4    2      4     0     2
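4. The DataFrame.query(expr, inplace=False, **kwargs) function
  • query() selects rows with a boolean expression over the columns.
  • expr is the query string; prefix a name with '@' to refer to a Python variable.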
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.random.randint(5, size=(4,5))
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=list("ABCD"),columns=columns)
>>>num = 3
>>>df.query('One<@num and Two>@num',inplace=True)
>>>print(df)
   One  Two  Three  Four  Five
A    1    4      0     0     3
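A query string is another way to write the boolean masks from the previous sections. The sketch below, with fixed invented values, checks that both forms select the same rows.

```python
import pandas as pd

df = pd.DataFrame({"One": [1, 4, 2, 0], "Two": [4, 4, 1, 5]},
                  index=list("ABCD"))
num = 3

by_query = df.query("One < @num and Two > @num")     # '@num' reads the variable num
by_mask = df[(df["One"] < num) & (df["Two"] > num)]  # the same filter as a mask
print(by_query)
print(by_query.equals(by_mask))   # True
```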

VII. Data statistics

1. Counting values
  • count(axis) counts the non-null values in each column (axis=0) or in each row (axis=1).
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.random.randint(5, size=(4,5))
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=list("ABCD"),columns=columns)
>>>print(df,'\n')
>>>print(df.count(1),'\n')
>>>print(df.count(0))
   One  Two  Three  Four  Five
A    4    2      4     1     0
B    1    4      0     1     3
C    0    3      4     0     1
D    4    3      3     4     0 

A    5
B    5
C    5
D    5
dtype: int64 

One      4
Two      4
Three    4
Four     4
Five     4
dtype: int64
2. Aggregation
  • Use DataFrame.agg() to aggregate column by column.
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.random.randint(5, size=(4,5))
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=list("ABCD"),columns=columns)
>>>np2 = df.agg([len,'sum','mean']) # length, sum, and mean of each column
>>>np2
    One  Two  Three  Four  Five
len   4.0  4.0    4.0   4.0  4.00
sum   6.0  6.0   10.0   8.0  9.00
mean  1.5  1.5    2.5   2.0  2.25
3. Simple random sampling
  • Use DataFrame.sample(n) to draw n random rows.
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.random.randint(5, size=(4,5))
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=list("ABCD"),columns=columns)
>>>df.sample(n=2)
   One  Two  Three  Four  Five
D    2    2      4     0     1
A    1    1      1     1     2
4. Weighted sampling
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.random.randint(5, size=(4,5))
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=list("ABCD"),columns=columns)
>>>weights = [0,1,0,1] # one sampling weight per row; rows with weight 0 are never drawn
>>>df.sample(n=2,weights=weights)
   One  Two  Three  Four  Five
B    2    2      4     3     1
D    0    2      1     0     0
5. Sampling with replacement
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.random.randint(5, size=(4,5))
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=list("ABCD"),columns=columns)
>>>df.sample(n=2,replace=True) # replace=True lets the same row be drawn more than once
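The effect of replace and weights is easier to see on a fixed table. In the sketch below, random_state=0 is an arbitrary seed added only to make the draw repeatable; it is not part of the original examples.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape((4, 5)),
                  index=list("ABCD"),
                  columns=["One", "Two", "Three", "Four", "Five"])

# With replacement, the sample may be larger than the table itself,
# so some rows necessarily appear more than once.
picked = df.sample(n=6, replace=True, random_state=0)
print(picked)

# Rows with weight 0 are never drawn, so only B and D can appear here.
weighted = df.sample(n=2, weights=[0, 1, 0, 1], random_state=0)
print(sorted(weighted.index))   # ['B', 'D']
```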
6. Descriptive statistics
  • DataFrame.describe() returns summary statistics for the data.
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.random.randint(5, size=(4,5))
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=list("ABCD"),columns=columns)
>>>df.describe().round(2).T
       count  mean   std  min   25%  50%   75%  max
One      4.0  1.75  1.50  1.0  1.00  1.0  1.75  4.0
Two      4.0  2.50  1.73  0.0  2.25  3.0  3.25  4.0
Three    4.0  2.50  1.29  1.0  1.75  2.5  3.25  4.0
Four     4.0  3.50  1.00  2.0  3.50  4.0  4.00  4.0
Five     4.0  3.25  0.50  3.0  3.00  3.0  3.25  4.0
7. Computing the standard deviation
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.random.randint(5, size=(4,5))
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=list("ABCD"),columns=columns)
>>>df.std() # standard deviation of each column
One      1.825742
Two      1.258306
Three    1.414214
Four     0.957427
Five     1.414214
dtype: float64
8. Computing covariance
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.random.randint(5, size=(4,5))
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=list("ABCD"),columns=columns)
>>>df.cov() # covariance matrix of the columns
            One       Two     Three      Four      Five
One    3.583333  3.083333 -1.666667 -1.833333 -2.916667
Two    3.083333  2.916667 -2.333333 -1.166667 -3.083333
Three -1.666667 -2.333333  4.000000 -0.666667  3.000000
Four  -1.833333 -1.166667 -0.666667  1.666667  0.833333
Five  -2.916667 -3.083333  3.000000  0.833333  4.250000
9. Correlation analysis
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.random.randint(5, size=(4,5))
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=list("ABCD"),columns=columns)
>>>df.corr()
            One       Two     Three      Four      Five
One    1.000000 -0.885615 -0.342997 -0.792118 -0.980196
Two   -0.885615  1.000000  0.258199  0.670820  0.948683
Three -0.342997  0.258199  1.000000 -0.288675  0.408248
Four  -0.792118  0.670820 -0.288675  1.000000  0.707107
Five  -0.980196  0.948683  0.408248  0.707107  1.000000
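The last three statistics are related: the default (Pearson) correlation between two columns is their covariance divided by the product of their standard deviations. The following sketch checks this numerically on a small fixed table; the values are invented for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"One": [1.0, 3.0, 1.0, 3.0],
                   "Two": [0.0, 2.0, 4.0, 2.0],
                   "Three": [2.0, 4.0, 2.0, 3.0]},
                  index=list("ABCD"))

cov = df.cov()
std = df.std()

# corr[i, j] = cov[i, j] / (std[i] * std[j])
corr_manual = cov / np.outer(std, std)
print(corr_manual)
print(np.allclose(corr_manual, df.corr()))       # True

# The diagonal of the covariance matrix holds the variances.
print(np.allclose(np.sqrt(np.diag(cov)), std))   # True
```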