Using pandas

pandas data structures

Import pandas before use: import pandas as pd

Series
DataFrame
Panel/PanelND

Series

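A Series is like a single column of data in Excel, but with a row index (index). It can also be seen as a one-dimensional array that carries row labels.

A Series can be created from three kinds of input: a Python dict, a numpy ndarray (numpy's basic data structure), or a scalar value. The index argument is usually given as a list, though any array-like of labels with a matching length works.

Creating from a dict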

dic = {'a':1,'b':2,'d':3}
s = pd.Series(dic)
a    1
b    2
d    3
dtype: int64

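If an index is given but a label has no matching key in the dict, the value is NaN (and the dtype becomes float64 to hold it)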
s = pd.Series(dic,index=['a','b','c','d'])
a    1.0
b    2.0
c    NaN
d    3.0
dtype: float64

From a numpy ndarray
import numpy as np

narray = np.random.randn(5)
s = pd.Series(narray)
0   -0.247940
1   -0.984350
2    1.688936
3    0.411693
4   -0.816367

s = pd.Series(narray,index=['a','b','c','d','e'])
a   -0.247940
b   -0.984350
c    1.688936
d    0.411693
e   -0.816367
dtype: float64

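If the index has fewer labels than there are values, an error is raised (the exact wording depends on the pandas version):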
ValueError: Wrong number of items passed 5, placement implies 4
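
The length rule can be checked with a small sketch. The variable names (values, ok, raised) are made up for illustration, and the message printed depends on the installed pandas version:

```python
import numpy as np
import pandas as pd

values = np.random.randn(5)

# Matching lengths: five values, five labels.
ok = pd.Series(values, index=list("abcde"))

# Mismatched lengths: five values, four labels -> ValueError.
try:
    pd.Series(values, index=list("abcd"))
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", exc)
```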

From a scalar value

s = pd.Series(5)
0    5
dtype: int64
s = pd.Series([5,4,3])
0    5
1    4
2    3
dtype: int64
s = pd.Series([5,4,3],index=list('abc'))
a    5
b    4
c    3
dtype: int64
s = pd.Series(5,index=list('abc'))
a    5
b    5
c    5
dtype: int64

The most commonly inspected attributes are the values and the index

s.values
array([5, 5, 5])
s.index
Index(['a', 'b', 'c'], dtype='object')

DataFrame

A DataFrame is like an Excel sheet,
roughly a two-dimensional array.
The row index is index
The column index is columns

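The general form for creating a DataFrame is pd.DataFrame(data, columns=, index=), where columns holds the column labels and index holds the row labels. If index or columns is not set, it defaults to integers starting from 0, which are also the absolute positions; the absolute positions stay available whatever labels are assigned. When a DataFrame is built from data that already carries column names (a dict, or data read from a file), watch out for column-name matching: any name given in columns that does not exist in the source produces a column filled with NaN. Another commonly used parameter is orient; it belongs to pd.DataFrame.from_dict rather than to the constructor, and controls whether the dict keys become columns (the default, 'columns') or rows ('index').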

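The part between the row and column labels is the data. It can be supplied in the following ways: from one-dimensional objects, from a two-dimensional ndarray, or from external input.
From a two-dimensional array, the simplest case, so covered first: pd.DataFrame(2d_array, columns=, index=)
External input means reading files such as csv or excel:
In short, call a reader function (pd.read_xxx, where xxx is the file type; common ones are csv, excel, and table). The reader returns a DataFrame directly, so no separate construction step is needed, but pay attention to how the columns are named.
From one-dimensional objects (mainly one-dimensional ndarrays, lists, dicts, and Series)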

# Creating a DataFrame from a dict and a Series
a = {'a':1,"b":2}
b = pd.Series([1,2,3],index=list('abc'))
pd.DataFrame([a,b],columns=list('abcd'))

   a  b    c   d
0  1  2  NaN NaN
1  1  2  3.0 NaN

a = {'m':1,"n":2}
pd.DataFrame([a,b],columns=list('abcd'))
     a    b    c   d
0  NaN  NaN  NaN NaN
1  1.0  2.0  3.0 NaN
# Putting both into a dict and creating from that
a = {'a':1,"b":2}
b = pd.Series([1,2,3],index=list('abc'))
data = {'one':a,'two':b}
pd.DataFrame(data,columns=['one','two','a','b'])
   one  two    a    b
a  1.0    1  NaN  NaN
b  2.0    2  NaN  NaN
c  NaN    3  NaN  NaN

Both approaches above require attention to column-name matching.
As with a Series, DataFrame.index and DataFrame.columns return the row labels and column labels of the DataFrame.
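
The two creation routes that have no example above (a two-dimensional array and external input) can be sketched as follows. This is only an illustration: io.StringIO stands in for a real file, and the labels r0, r1, x, y, z, w are made up.

```python
import io

import numpy as np
import pandas as pd

# From a two-dimensional ndarray, with explicit row and column labels.
arr = np.arange(6).reshape(2, 3)
df_arr = pd.DataFrame(arr, index=["r0", "r1"], columns=["x", "y", "z"])

# From external input: read_csv returns a DataFrame directly.
# io.StringIO is used here in place of a file on disk.
csv_text = "x,y,z\n1,2,3\n4,5,6\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# Column-name matching: "w" does not exist in the dict, so that column is all NaN.
df_sel = pd.DataFrame({"x": [1, 4], "y": [2, 5]}, columns=["x", "w"])

print(df_arr)
print(df_csv)
print(df_sel)
```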

Panel/PanelND

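A Panel can be thought of as a three-dimensional array, and PanelND as an N-dimensional array.
Skipped for now; to be filled in if needed later. Note that Panel was deprecated in pandas 0.20 and removed in 0.25, so current versions no longer provide it; a DataFrame with a MultiIndex is the usual substitute.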

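Operating on the data structures

Operating on a Series

Viewing: in short, values are looked up through the index, either by the label stored in index or by absolute position.

Lookup by absolute position
Lookup by position uses s[XXX], where XXX can be an integer position, a list, a boolean expression, and so on. In recent pandas versions, integer positions on a Series with a non-integer index are better written as s.iloc[XXX], since treating s[0] as positional is deprecated.
For example: s = pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])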

# integer
In [48]: s[0]
Out[48]: 5.0
# boolean expression
In [49]: s[s>1]
Out[49]: 
a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64
# list
In [50]: s[[4,3,1]]
Out[50]: 
e    5.0
d    5.0
b    5.0
dtype: float64
# slice
In [52]: s[3:]
Out[52]: 
d    5.0
e    5.0
dtype: float64

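Lookup by label
Label-based lookup can be done with s['a'], 'e' in s, or s.get('f', np.nan). s['a'] returns the value for the label and raises KeyError if the label does not exist; 'e' in s returns True/False; s.get('f') returns the value for the label, or None if it is not found, and passing np.nan as the second argument makes it return NaN instead.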

In [53]: s['a']
Out[53]: 5.0

In [54]: 'f' in s
Out[54]: False

In [55]: s.get('e')
Out[55]: 5.0

In [56]: s.get('f')

In [57]: s.get('f',np.nan)
Out[57]: nan
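
The lookup styles above can be put side by side in a short sketch. It uses the explicit .iloc and .loc accessors, which are not used in the session above; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

s = pd.Series(5.0, index=["a", "b", "c", "d", "e"])

# Explicit accessors make the intent unambiguous.
first_by_position = s.iloc[0]   # positional lookup
first_by_label = s.loc["a"]     # label-based lookup

# A missing label raises KeyError with [] ...
try:
    s["f"]
    missing_raises = False
except KeyError:
    missing_raises = True

# ... while get() returns None, or the supplied default.
default_none = s.get("f")
default_nan = s.get("f", np.nan)

print(first_by_position, first_by_label, missing_raises, default_none, default_nan)
```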

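Arithmetic: the usual operators +, -, *, /, functions such as np.exp, and comparison operators all work. An operation between two Series is applied label by label: each value in one Series is combined with the value at the same index label in the other. Subsets can be used as operands too, but only labels present in both operands produce a result; labels that do not overlap give NaN.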

s-s             
a    0.0            
b    0.0            
c    0.0            
d    0.0            
e    0.0            
s[1:]+s[:3]
a     NaN
b    10.0
c    10.0
d     NaN
e     NaN
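
When the NaN for non-overlapping labels is not wanted, one option is the add method with fill_value. A minimal sketch under that assumption (left, right, plain, filled are illustrative names):

```python
import pandas as pd

s = pd.Series(5.0, index=["a", "b", "c", "d", "e"])

left = s.iloc[1:]    # labels b, c, d, e
right = s.iloc[:3]   # labels a, b, c

# Plain + aligns on labels; labels present on only one side give NaN.
plain = left + right

# add() with fill_value treats a label missing on one side as 0 instead.
filled = left.add(right, fill_value=0)

print(plain)
print(filled)
```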

Naming: pass the name parameter at creation time, or use the rename method. The current name is available through the name attribute.

s = pd.Series(np.random.randn(5), name='something')
s.name
Output: 'something'
s2 = s.rename("different")
s2.name
Output: 'different'

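Operating on a DataFrame

Inspecting: DataFrame.head() shows the first rows, five by default; DataFrame.tail() shows the last rows, also five by default; DataFrame.describe() shows summary statistics (count, mean, std, min, quartiles, max) for the numeric columns rather than the data itself.

Sorting: df.sort_index(axis=, ascending=) sorts by labels. axis is 0 or 1, meaning sort by the row labels or by the column labels; ascending is a boolean, True for ascending and False for descending.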

df.sort_values(by=, ascending=) sorts by values; by names the column (or columns) to sort on.
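
A small sketch of both sorting calls, on a throwaway DataFrame whose labels are deliberately out of order (the data is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [4, 1, 7], "two": [5, 2, 8]},
    index=["B", "A", "C"],
)

by_row_label = df.sort_index(axis=0, ascending=True)    # rows ordered A, B, C
by_col_label = df.sort_index(axis=1, ascending=False)   # columns ordered two, one
by_value = df.sort_values(by="one", ascending=False)    # rows ordered C, B, A

print(by_row_label)
print(by_col_label)
print(by_value)
```

All three calls return new objects; df itself keeps its original order.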

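Deleting: use del or pop('column'). Note that both modify the original DataFrame in place instead of creating a new DataFrame in memory as most other methods do (DataFrame.drop, by contrast, returns a new object by default). Because pop removes a specific column, it also returns the values of that column.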

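# from_items was removed in pandas 1.0; from_dict with orient='index' builds the same frame.
# The later examples all start again from this freshly created df.
df = pd.DataFrame.from_dict({'A': [1, 2, 3], 'B': [4, 5, 6]}, orient='index', columns=['one', 'two', 'three'])
del df['two']
df.pop('one')
Output:
A    1
B    4
Name: one, dtype: int64
df
Output:
   three
A      3
B      6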
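
To contrast the in-place deletions with a call that leaves the original alone, here is a minimal sketch using DataFrame.drop, which is not used in the example above (without_two and popped are illustrative names):

```python
import pandas as pd

df = pd.DataFrame.from_dict(
    {"A": [1, 2, 3], "B": [4, 5, 6]},
    orient="index",
    columns=["one", "two", "three"],
)

# drop() returns a new DataFrame and leaves df untouched.
without_two = df.drop(columns=["two"])
columns_after_drop = list(df.columns)

# del and pop() modify df in place; pop() also returns the removed column.
del df["two"]
popped = df.pop("one")

print(without_two)
print(popped)
print(df)
```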
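Arithmetic: +, -, *, /, exp, comparison operators, and so on. As with Series, an operation between two DataFrames combines each value in one with the value at the matching row and column label in the other, so * here is element-wise and not matrix multiplication. For matrix products there is DataFrame.dot (or the @ operator), and the numpy.linalg module provides further matrix routines, which are not covered here. The transpose is DataFrame.T.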
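
The difference between element-wise and matrix multiplication is easy to see on a 2x2 example. A small sketch (the frame m is made up for illustration):

```python
import pandas as pd

m = pd.DataFrame([[1, 2], [3, 4]])

elementwise = m * m    # each cell multiplied by itself
matrix = m.dot(m)      # true matrix product
transposed = m.T       # rows and columns swapped

print(elementwise)
print(matrix)
print(transposed)
```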

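Besides operating on whole DataFrames, specific columns can take part in an operation, for example (shown for the syntax only; the outputs further down keep the original values of three):

df['three'] = df['one'] * df['two']
Modifying and adding to a DataFrame: plain assignment with = does the job, and the right-hand side can be restricted to a range. Assignment also changes the values in the original DataFrame. For example: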

df['fore'] = 1
df
Output:
   one  two  three  fore
A    1    2      3     1
B    4    5      6     1

df['five'] = df['one'][:1]
df
Output:
   one  two  three  fore  five
A    1    2      3     1   1.0
B    4    5      6     1   NaN
Likewise, when restricting the assigned range, watch out for the NaN values left in the rest of the column.

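Adding a new column: the first option is simply to build a new DataFrame; the second is the assignment shown above, assigning to a new column of the existing DataFrame as in the df['five'] example; the third is insert(loc, column, value, allow_duplicates=False), which also modifies the DataFrame in place. For example: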

df.insert(1, 'bar', df['one'])
df
Output:
   one  bar  two  three  fore  five
A    1    1    2      3     1   1.0
B    4    4    5      6     1   NaN
A DataFrame can also be changed through DataFrame.assign, which returns the modified DataFrame and leaves the original untouched

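df.assign(ration = df['one'] / df['one'])
Output:
   one  bar  two  three  fore  five  ration
A    1    1    2      3     1   1.0     1.0
B    4    4    5      6     1   NaN     1.0

df
Output:
   one  bar  two  three  fore  five
A    1    1    2      3     1   1.0
B    4    4    5      6     1   NaN
New columns can also be added through loc (iloc cannot create a column that does not exist yet); that is not covered further here.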

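Selecting/slicing:

Selecting directly by row or column:

Use column labels to select columns and a slice to select rows. Note: df['one'] selects a single column as a Series, and several columns can be selected at once by passing a list, as in df[['one', 'two']]. For rows, [] only accepts a range (such as [1:2]) or a boolean mask; a single row label such as df['A'] is looked up among the columns and raises KeyError.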

df['one']>2
Output:
A    False
B     True
Name: one, dtype: bool

df['two']
Output:
A    2
B    5
Name: two, dtype: int64

df[:1]
Output:
   one  bar  two  three  fore  five
A    1    1    2      3     1   1.0
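
The different forms that [] accepts can be sketched together on a small, made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [1, 4], "two": [2, 5], "three": [3, 6]},
    index=["A", "B"],
)

single = df["one"]                # one column -> Series
several = df[["three", "one"]]    # list of columns -> DataFrame
first_row = df[:1]                # slice -> rows by position
big_rows = df[df["one"] > 2]      # boolean mask -> rows where the condition holds

# A single row label inside [] is looked up among the columns and fails.
try:
    df["A"]
    row_label_raises = False
except KeyError:
    row_label_raises = True

print(several)
print(big_rows)
```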
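Using loc to select by label:

The general form is DataFrame.loc[index:index, ['columns']]. loc can select several columns at once. To select by column only, the row part can be left empty, but the colon (:) and the comma (,) must still be written, for example: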

df.loc[:,['two','one']]
Output:
   two  one
A    2    1
B    5    4

df.loc['A':'B',['one','two']]
Output:
   one  two
A    1    2
B    4    5
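loc can also be used as DataFrame.loc[index, ['columns']], where index is one specific label; this returns a Series. With DataFrame.loc[index, 'columns'], where both index and columns are single labels, a single value is returned. Because the dimension is reduced, pandas may convert the values to a common dtype (here the row mixes int and float columns, so the result shows as float64); the exact dtype can differ between versions. For example: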

df.loc['A',['one']]
Output:
one    1.0
Name: A, dtype: float64

df.loc['A','one']
Output: 1.0
Using iloc to select by absolute position:

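The idea is basically the same as loc, with labels replaced by absolute positions. One difference: loc slices include both endpoints, while iloc slices exclude the end position. A brief example: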

df.iloc[[0,1],2:3]
Output:
   two
A    2
B    5
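
The endpoint behaviour of the two accessors is a common source of off-by-one mistakes, so here is a small sketch on a made-up 3x3 DataFrame:

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [1, 4, 7], "two": [2, 5, 8], "three": [3, 6, 9]},
    index=["A", "B", "C"],
)

# loc slices by label and includes both endpoints.
by_label = df.loc["A":"B", "one":"two"]

# iloc slices by position and excludes the end, like ordinary Python slicing.
by_position = df.iloc[0:1, 0:2]

print(by_label)
print(by_position)
```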
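The where operation filters values with an expression and sets the filtered-out values to NaN. The result usually still needs further processing afterwards, and I have not used it much in practice.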

df[df>3]
Output:
   one  bar  two  three  fore  five
A  NaN  NaN  NaN    NaN   NaN   NaN
B  4.0  4.0  5.0    6.0   NaN   NaN
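
The same filtering can be written with the where method, which also accepts a replacement value so that a separate NaN clean-up step is not needed. A minimal sketch on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 4], "two": [2, 5]}, index=["A", "B"])

# Equivalent to df[df > 3]: values failing the condition become NaN.
masked = df.where(df > 3)

# The second argument replaces the failing values with 0 instead of NaN.
replaced = df.where(df > 3, 0)

print(masked)
print(replaced)
```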
Using the isin([value]) method:

isin picks out the rows whose value in a given column equals one of the listed values, and the result is a DataFrame. For example,

df[df['one'].isin([1])]
Output:
   one  bar  two  three  fore  five
A    1    1    2      3     1   1.0
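
isin returns a boolean Series, so it can be negated with ~ to drop the matching rows instead of keeping them. A small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 4, 7], "two": [2, 5, 8]}, index=["A", "B", "C"])

mask = df["one"].isin([1, 7])   # boolean Series: True, False, True

kept = df[mask]        # rows whose "one" value is in the list
dropped = df[~mask]    # rows whose "one" value is not in the list

print(kept)
print(dropped)
```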
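Handling NaN:

DataFrame.dropna(axis, how): the commonly used parameters are axis and how. axis is 0 or 1 (drop rows or drop columns); how is 'any' or 'all': 'any' drops the row/column if it contains any NaN, 'all' drops it only if every value is NaN.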

df.dropna(axis = 1, how ='any')
Output:
   one  bar  two  three  fore
A    1    1    2      3     1
B    4    4    5      6     1
DataFrame.fillna(value) replaces every NaN with value; it is simple enough that no example is given

DataFrame.isnull() tests each element for null and returns a DataFrame of booleans, which is also easy to follow
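
Even though these calls are simple, seeing them on one small frame makes the any/all distinction concrete. A sketch with made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"one": [1.0, np.nan], "two": [np.nan, np.nan], "three": [3.0, 6.0]},
    index=["A", "B"],
)

null_mask = df.isnull()                        # element-wise True/False
filled = df.fillna(0)                          # every NaN replaced by 0
drop_any_cols = df.dropna(axis=1, how="any")   # keeps only columns without any NaN
drop_all_cols = df.dropna(axis=1, how="all")   # drops only columns that are entirely NaN

print(null_mask)
print(filled)
print(drop_any_cols)
print(drop_all_cols)
```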

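Merging:

When combining DataFrames, try to keep the columns identical, which makes later steps easier

pd.concat([DataFrame1, ...], ignore_index=) combines several DataFrames, stacking them one below another by default. ignore_index is a boolean that decides whether the index is renumbered from 0.

pd.merge(DataFrame1, DataFrame2, on=) performs a database-style join: rows from DataFrame1 and DataFrame2 are matched on the key column(s) named by on and placed side by side, rather than one frame being stacked above the other. When joining on columns the result gets a fresh integer index, so index labels do not collide. merge is easy to get wrong: the result often contains more rows or columns than expected, and some selecting and dropping is usually needed afterwards. It also has many parameters and tricky cases, which later posts will dig into.

DataFrame.append(object, ignore_index) adds an object, either a DataFrame or a Series, to the end of the DataFrame; ignore_index again decides whether the index is renumbered from 0. Note that append was deprecated in pandas 1.4 and removed in 2.0, where pd.concat takes its place.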
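
A minimal sketch contrasting stacking (concat) with joining (merge). The frames left and right and the column name key are made up for illustration; how='outer' is one of several join types merge supports:

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})

# concat stacks rows; ignore_index=True renumbers the index from 0.
stacked = pd.concat([left, left], ignore_index=True)

# merge joins on the key column; the default is an inner join (keys in both frames).
inner = pd.merge(left, right, on="key")

# An outer join keeps every key and fills the missing side with NaN.
outer = pd.merge(left, right, on="key", how="outer")

print(stacked)
print(inner)
print(outer)
```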

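Grouping:

Grouping is done with groupby. It splits the data into groups according to some rule, applies a function to each group separately, and combines the results into one data structure.

DataFrame.groupby(by=None, axis=0, as_index=True)
by is the column name (or names) to group on; axis is the dimension to split along, 0 for rows and 1 for columns; as_index decides whether the group keys become the index of the result. With several keys and as_index=True, the keys form a MultiIndex; with as_index=False they stay as ordinary columns.
The aggregate(arg) method computes an aggregation for each group; arg can be a function, a dict, or a list.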

df2
Output:
     A      B  C  D
0  foo    one  1  1
1  bar    one  1  1
2  foo    two  1  1
3  bar  three  1  1
4  foo    two  1  1
5  bar    two  1  1
6  foo    one  1  1
7  foo  three  1  1

g = df2.groupby(['A','B'])
g.aggregate(np.sum)
Output:
           C  D
A   B
bar one    1  1
    three  1  1
    two    1  1
foo one    2  2
    three  1  1
    two    2  2

g = df2.groupby(['A','B'],as_index=False)
g.aggregate(np.sum)
Output:
     A      B  C  D
0  bar    one  1  1
1  bar  three  1  1
2  bar    two  1  1
3  foo    one  2  2
4  foo  three  1  1
5  foo    two  2  2
The agg(arg) method (an alias of aggregate) then runs computations on the groups; arg can be a dict or a list. For example:

g = df2.groupby('A')
g['D'].agg([np.mean])
Output:
     mean
A
bar   1.0
foo   1.0
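
The groupby examples can be reproduced end to end with the sketch below. It rebuilds df2 from the table shown earlier and passes string names such as 'sum' and 'mean' instead of np.sum and np.mean, since recent pandas versions warn when numpy functions are passed to agg; the dict form of arg is shown as well.

```python
import pandas as pd

df2 = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
    "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
    "C": 1,
    "D": 1,
})

# Group keys become a MultiIndex by default.
summed = df2.groupby(["A", "B"]).agg("sum")

# With as_index=False the keys stay as ordinary columns.
flat = df2.groupby(["A", "B"], as_index=False).agg("sum")

# A dict maps each column to one aggregation or a list of them.
per_column = df2.groupby("A").agg({"C": "sum", "D": ["mean", "max"]})

print(summed)
print(flat)
print(per_column)
```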