1、concat连接
1.1、DataFrame连接
可以使用concat将下面的三个DataFrame连接。
ignore_index:是否忽略前面的索引。
axis:columns或者1表示按照行添加。
import pandas as pd
df1 = pd.read_csv('data/concat_1.csv')
df2 = pd.read_csv('data/concat_2.csv')
df3 = pd.read_csv('data/concat_3.csv')
row_concat = pd.concat([df1,df2,df3], ignore_index=False)
# A B C D
#0 a0 b0 c0 d0
#1 a1 b1 c1 d1
#2 a2 b2 c2 d2
#3 a3 b3 c3 d3
# A B C D
#0 a4 b4 c4 d4
#1 a5 b5 c5 d5
#2 a6 b6 c6 d6
#3 a7 b7 c7 d7
# A B C D
#0 a8 b8 c8 d8
#1 a9 b9 c9 d9
#2 a10 b10 c10 d10
#3 a11 b11 c11 d11
row_concat = pd.concat([df1,df2,df3], axis='columns')
# A B C D A B C D A B C D
#0 a0 b0 c0 d0 a4 b4 c4 d4 a8 b8 c8 d8
#1 a1 b1 c1 d1 a5 b5 c5 d5 a9 b9 c9 d9
#2 a2 b2 c2 d2 a6 b6 c6 d6 a10 b10 c10 d10
#3 a3 b3 c3 d3 a7 b7 c7 d7 a11 b11 c11 d11
1.2、连接DataFrame和Series
1、使用concat连接DataFrame和Series
new_series = pd.Series(['n1', 'n2', 'n3', 'n4'])
print(new_series)
#0 n1
#1 n2
#2 n3
#3 n4
#dtype: object
pd.concat([df1, new_series])
# A B C D 0
#0 a0 b0 c0 d0 NaN
#1 a1 b1 c1 d1 NaN
#2 a2 b2 c2 d2 NaN
#3 a3 b3 c3 d3 NaN
#0 NaN NaN NaN NaN n1
#1 NaN NaN NaN NaN n2
#2 NaN NaN NaN NaN n3
#3 NaN NaN NaN NaN n4
# 由于Series是列数据,concat方法默认是添加行,但是Series数据没有行索引
# 所以添加了一个新列,缺失的数据用NaN填充
2、将['n1','n2','n3','n4']作为行连接到df1后,可以创建DataFrame并指定列名
new_row_df = pd.DataFrame([['n1','n2', 'n3', 'n4']], columns=['A','B','C','D'])
# A B C D
#0 n1 n2 n3 n4
2、append函数
1、concat可以连接多个对象,如果只需要向现有DataFrame追加一个对象,可以通过append函数来实现
注:append函数未来可能会被替换
df1.append(df2)
2、添加列dataframe['列名']=['值']
axis=1表示按列添加
col_concat = pd.concat([df1,df2,df3], axis=1)
col_concat['new_col']=['n1','n2','n3','n4']
# A B C D A B C D A B C D new_col
#0 a0 b0 c0 d0 a4 b4 c4 d4 a8 b8 c8 d8 n1
#1 a1 b1 c1 d1 a5 b5 c5 d5 a9 b9 c9 d9 n2
#2 a2 b2 c2 d2 a6 b6 c6 d6 a10 b10 c10 d10 n3
#3 a3 b3 c3 d3 a7 b7 c7 d7 a11 b11 c11 d11 n4
3、dataframe['列名'] = Series对象
col_concat['new_series']=pd.Series(['n1','n2','n3','n4'])
# A B C D A B C D A B C D new_col new_series
#0 a0 b0 c0 d0 a4 b4 c4 d4 a8 b8 c8 d8 n1 n1
#1 a1 b1 c1 d1 a5 b5 c5 d5 a9 b9 c9 d9 n2 n2
#2 a2 b2 c2 d2 a6 b6 c6 d6 a10 b10 c10 d10 n3 n3
#3 a3 b3 c3 d3 a7 b7 c7 d7 a11 b11 c11 d11 n4 n4
3、从数据库加载数据
3.1、从数据库中读取表
第一个参数表名,第二个参数数据库连接对象
import _sqlite3
con = _sqlite3.connect('data/chinook.db')
tracks = pd.read_sql_query('SELECT * from tracks', con)
tracks.head(1)
3.2、一对一合并
1、通过DataFrame把一列与另一列数据连接(不重复)
tracks_subset = tracks.loc[[0,62,76,98,110,193,204,281,322,359]]
# TrackId Name AlbumId MediaTypeId \
#0 1 For Those About To Rock (We Salute You) 1 1
#62 63 Desafinado 8 1
#76 77 Enter Sandman 9 1
#98 99 Your Time Has Come 11 1
#110 111 Money 12 1
#193 194 First Time I Met The Blues 20 1
#204 205 Jorge Da Capadócia 21 1
#281 282 Girassol 26 1
#322 323 Dig-Dig, Lambe-Lambe (Ao Vivo) 29 1
#359 360 Vai-Vai 2001 32 1
2、通过GenreId列合并数据,how参数指定连接方式
how = ’left‘ 对应SQL中的 left outer 保留左侧表中的所有key
how = ’right‘ 对应SQL中的 right outer 保留右侧表中的所有key
how = 'outer' 对应SQL中的 full outer 保留左右两侧侧表中的所有key
how = 'inner' 对应SQL中的 inner 只保留左右两侧都有的key
3、left
genres = pd.read_sql_query('SELECT * from genres', con)
genres_track = genres.merge(tracks_subset[['TrackId', 'GenreId', 'Milliseconds']], on='GenreId', how='left')
4、right
genres_track = genres.merge(tracks_subset[['TrackId', 'GenreId', 'Milliseconds']], on='GenreId', how='right')
3.2、多对一合并
1、将cust中的CustomerId与invoice中CustomerId合并
将cust中的InvoiceId与ii中InvoiceId合并
cust = pd.read_sql_query("SELECT CustomerId,FirstName,LastName from customers",con)
invoice = pd.read_sql_query('SELECT InvoiceId,CustomerId from invoices',con)
ii = pd.read_sql_query('SELECT InvoiceId,UnitPrice,Quantity from invoice_items',con)
cust_inv = cust.merge(invoice, on='CustomerId').merge(ii, on='InvoiceId')
2、DataFrame的assign方法 创建新列
total = cust_inv['Quantity'] * cust_inv['UnitPrice']
cust_inv = cust_inv.assign(Total = total)
# CustomerId FirstName LastName InvoiceId UnitPrice Quantity Total
#0 1 Luís Gonçalves 98 1.99 1 1.99
3.3、join合并
1、使用join合并,可以是依据两个DataFrame的行索引,或者一个DataFrame的行索引另一个DataFrame的列索引进行数据合并
stocks_2016 = pd.read_csv('data/stocks_2016.csv')
stocks_2017 = pd.read_csv('data/stocks_2017.csv')
stocks_2018 = pd.read_csv('data/stocks_2018.csv')
2、依据两个DataFrame的行索引
如果合并的两个数据有相同的列名,需要通过lsuffix,和rsuffix,指定合并后的列名的后缀
stocks_2016.join(stocks_2017, lsuffix='_2016', rsuffix='_2017', how='outer’)
3、将两个DataFrame的Symbol设置为行索引,再次join数据
stocks_2016.set_index('Symbol').join(stocks_2018.set_index('Symbol'),lsuffix='_2016', rsuffix='_2018’)
4、将一个DataFrame的Symbol列设置为行索引,与另一个DataFrame的Symbol列进行join
stocks_2016.join(stocks_2018.set_index('Symbol'),lsuffix='_2016', rsuffix='_2018',on='Symbol')