Joining tables in pandas

pandas offers three ways to combine DataFrames: merge, join, and concat.

merge

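merge is the counterpart of a SQL JOIN: it combines two DataFrames on columns or indexes that they share.
The on parameter names the shared column or index level to join on, and how sets the join type: left, right, outer, or inner (the default). If on is omitted, the columns common to both DataFrames are used as the join keys.
When the key columns have different names in the two DataFrames, use left_on and right_on to specify the column to join on in each.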

>>> import pandas as pd
>>> df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
...                     'value': [1, 2, 3, 5]})
>>> df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
...                     'value': [5, 6, 7, 8]})
>>> df1
  lkey  value
0  foo      1
1  bar      2
2  baz      3
3  foo      5
>>> df2
  rkey  value
0  foo      5
1  bar      6
2  baz      7
3  foo      8
>>> df1.merge(df2, left_on='lkey', right_on='rkey', suffixes=('_left', '_right'))
  lkey  value_left rkey  value_right
0  foo           1  foo            5
1  foo           1  foo            8
2  foo           5  foo            5
3  foo           5  foo            8
4  bar           2  bar            6
5  baz           3  baz            7
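
The effect of how, and of leaving on unset, can be sketched as follows. The frames left and right and their contents are hypothetical sample data chosen for illustration, not part of the example above.

```python
import pandas as pd

# Hypothetical sample data: the two frames share only the 'key' column.
left = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'y': [20, 30, 40]})

# No `on` given: the column common to both frames ('key') is the join key.
inner = left.merge(right)                 # how='inner' is the default
left_join = left.merge(right, how='left')
outer = left.merge(right, how='outer')

print(sorted(inner['key']))      # keys present in both frames
print(sorted(left_join['key']))  # every key from `left`
print(sorted(outer['key']))      # union of the keys from both frames
```

Rows without a match on the other side are filled with NaN, so the left join leaves y missing for key 'a'.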

join

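join also combines DataFrames side by side, much like merge, but it joins primarily on the row index.
When on is not given, it aligns the two DataFrames on their row indexes. This is similar to concat with axis=1, except that join defaults to a left join while concat defaults to an outer join.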

>>> df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
...                    'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
>>> df
  key   A
0  K0  A0
1  K1  A1
2  K2  A2
3  K3  A3
4  K4  A4
5  K5  A5
>>> other = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
...                       'B': ['B0', 'B1', 'B2']})
>>> other
  key   B
0  K0  B0
1  K1  B1
2  K2  B2

>>> df.join(other, lsuffix='_caller', rsuffix='_other')
  key_caller   A key_other    B
0         K0  A0        K0   B0
1         K1  A1        K1   B1
2         K2  A2        K2   B2
3         K3  A3       NaN  NaN
4         K4  A4       NaN  NaN
5         K5  A5       NaN  NaN

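To join on a column, set that column as the row index before joining. There are two ways to do this: set it as the index on both DataFrames, or set it only on other and pass on to name the key column of the calling DataFrame.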

>>> df.set_index('key').join(other.set_index('key'))
      A    B
key
K0   A0   B0
K1   A1   B1
K2   A2   B2
K3   A3  NaN
K4   A4  NaN
K5   A5  NaN

>>> df.join(other.set_index('key'), on='key')
  key   A    B
0  K0  A0   B0
1  K1  A1   B1
2  K2  A2   B2
3  K3  A3  NaN
4  K4  A4  NaN
5  K5  A5  NaN
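
As a rough cross-check, the join with on='key' should match a left merge on the same column. The sketch below reuses df and other from the example; the names via_join, via_merge and inner_join are introduced only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
                   'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
other = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                      'B': ['B0', 'B1', 'B2']})

# join defaults to how='left', so the merge must request it explicitly.
via_join = df.join(other.set_index('key'), on='key')
via_merge = df.merge(other, on='key', how='left')

# Normalise row order and index labels before comparing the two results.
a = via_join.sort_values('key').reset_index(drop=True)
b = via_merge.sort_values('key').reset_index(drop=True)
print(a.equals(b))

# Passing how='inner' keeps only the keys found in both frames.
inner_join = df.join(other.set_index('key'), on='key', how='inner')
print(len(inner_join))
```

Which of the two to use is largely a matter of taste: join is shorter when the key is already the index, while merge avoids the set_index step when the keys are ordinary columns.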

concat

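concat stacks DataFrames together.
The axis parameter sets the stacking direction: 0 (the default) stacks along the row index (vertically), and 1 stacks along the columns (horizontally).
The join parameter, inner or outer (the default), controls how the labels on the other axis are combined: inner keeps only the labels common to all inputs, and outer keeps their union.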

>>> df1 = pd.DataFrame([['a', 1], ['b', 2]], columns=['letter', 'number'])
>>> df1
  letter  number
0      a       1
1      b       2
>>> df2 = pd.DataFrame([['c', 3, 'cat'], ['d', 4, 'dog']],
...                    columns=['letter', 'number', 'animal'])
>>> df2
  letter  number animal
0      c       3    cat
1      d       4    dog
>>> pd.concat([df1, df2], join="inner")
  letter  number
0      a       1
1      b       2
0      c       3
1      d       4
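
For contrast, here is a small sketch of the default join="outer" and of axis=1, using the same df1 and df2. The names stacked and side are hypothetical, introduced only for this illustration.

```python
import pandas as pd

df1 = pd.DataFrame([['a', 1], ['b', 2]], columns=['letter', 'number'])
df2 = pd.DataFrame([['c', 3, 'cat'], ['d', 4, 'dog']],
                   columns=['letter', 'number', 'animal'])

# Default join='outer': the result keeps the union of the column labels,
# and rows coming from df1 have no value for 'animal'.
stacked = pd.concat([df1, df2])
print(list(stacked.columns))
print(stacked['animal'].isna().sum())

# axis=1 places the frames side by side, aligned on the row index.
side = pd.concat([df1, df2], axis=1)
print(side.shape)
```

Note that with axis=1 the column labels are not deduplicated, so letter and number each appear twice in side.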

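Another commonly used parameter is ignore_index, which controls whether the index labels of the original DataFrames are kept. With ignore_index=True, the result is relabeled 0 to n-1.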
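
A minimal sketch of ignore_index, again with df1 and df2 from the concat example. The names kept and relabeled are hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame([['a', 1], ['b', 2]], columns=['letter', 'number'])
df2 = pd.DataFrame([['c', 3, 'cat'], ['d', 4, 'dog']],
                   columns=['letter', 'number', 'animal'])

# Default: each input keeps its own labels, so 0 and 1 both appear twice.
kept = pd.concat([df1, df2], join='inner')
print(list(kept.index))

# ignore_index=True discards the old labels and numbers the rows afresh.
relabeled = pd.concat([df1, df2], join='inner', ignore_index=True)
print(list(relabeled.index))
```

Duplicate index labels are legal in pandas, but they make label-based lookups such as kept.loc[0] return several rows, which is often a reason to pass ignore_index=True.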