Pandas 如何拼接两个数据集而不重复|极客笔记 (deepinout.com)
要点
df = df1.append(df2,ignore_index = True),增加行最好
df = pd.concat(objs, axis = 0, ignore_index = False, join = "outer),可以增加行,也可以增加列
df = pd.merge(data1, data2, how = "inner",on = "None")
df = data1.join(data2,on = "key", how = "left")
20241008此话对不对?
concat/append,拼接,append为concat简化版,merge/join,合并,join为merge简化版
KEY1:append
- 增加行,一般是columns是一样的,df1.append(df2,ignore_index =True)
import pandas as pd
import numpy as np
data = np.random.randint(0,100,size=(5,3))
df = pd.DataFrame(data)
#增加行
df.append(df.sum(),ignore_index=True)
#增加列
pd.concat([df,df.sum(axis=1)],axis=1,ignore_index=True)
KEy2:concat
- 2个表用[]
- axis=0,columns相同,增加行
- axis=1,axis相同,增加列
pd.concat(objs, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=True)
- pd.concat([df1,df2]),2个列表,用[]
- 合并列,用axis=1
- join---'inner','outer'.inner,无nan值
- ignore_index = True,重新序号排列
- keys,多级索引,
concat,增加列,axis=1
import pandas as pd
df1 = pd.DataFrame({
'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2'],
'C': ['C0', 'C1', 'C2']
}, index=[0, 1, 2])
result = pd.concat([df1, df1['A'].str.len()],axis=1)
print(result)
示范代码2,keys=['第一组', '第二组']
pd.concat([df1, df2], keys=['第一组', '第二组'])
示范代码3,合并三个DataFrame
result = pd.concat([df1, df2, df3])
示范代码4
- 拼接数据,使用list,然后pd.concat拼接
import pandas as pd
import numpy as np
data = []
for i in range(10):
temp = pd.DataFrame(np.random.randint(1,10,(3,3)))
data.append(temp)
cat = pd.concat(data,ignore_index=True)
display(cat)
- 直接使用df.append拼接
import pandas as pd
import numpy as np
df = pd.DataFrame()
for i in range(3):
temp = pd.DataFrame(np.random.randint(1,10,(3,3)))
df = df.append(temp,ignore_index=True)
print(df)
- 按照列结合
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(1,10,(4,3)),columns=list('ABC'))
data = []
for key,item in df.iloc[:,:2].iteritems():
data.append(item)
# 按列结合
cat = pd.concat(data,axis=1)
display(cat)
df.join
如果有相同columns名,需要用lsuffix or rsuffix
主要是横向来连接,DataFrame
head.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False) -> 'DataFrame'
import pandas as pd
head = df.head()
tail = df.tail().reset_index(drop=True)
head.join(tail,lsuffix='_a',rsuffix='_b')
df.combine
用于比较两个序列的大小数据汇总
s1.combine(other, func, fill_value=None) -> 'Series'
s1 = pd.Series({'falcon': 330.0, 'eagle': 160.0})
s2 = pd.Series({'falcon': 345.0, 'eagle': 200.0, 'duck': 30.0})
s1.combine(s2,min,fill_value = 0)
duck 0.0
eagle 160.0
falcon 330.0
dtype: float64
df.combine_first
用于两个DataFrame之间的连接,以df1为主,吸收合并df2
import pandas as pd
df1 = pd.DataFrame({'A': [None, 0], 'B': [None, 4]})
df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})
df1.combine_first(df2)
df1.update(df2)
使用combine_first会只更新左表的nan值。而update则会更新左表的所有能在右表中找到的值(两表位置相对应)。
import numpy as np
import pandas as pd
df1 = pd.DataFrame([["Emil", "Tobias", "Linus"], [16, 14, 10]])
df2 = pd.DataFrame([["Em00il"], [17]])
df1.update(df2) # df1的数据全部更新
pd.merge 两个数据合并
pd.merge(
left,
right,
how: str = 'inner', # {'left', 'right', 'outer', 'inner'}, default 'inner'
on=None,
left_on=None,
right_on=None,
left_index: bool = False,
right_index: bool = False,
sort: bool = False,
suffixes=('_x', '_y'),
copy: bool = True,
indicator: bool = False,
validate=None,
)
# 关键的几个参数
pd.merge(
left,
right,
how: str = 'inner',
on=None,
left_on=None,
right_on=None,
left_index: bool = False,
right_index: bool = False,
)