Pandas具有功能全面的高性能内存中连接操作,与SQL等关系数据库非常相似。
Pandas提供了一个单独的merge()
函数,作为DataFrame对象之间所有标准数据库连接操作的入口 -
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=True)
在这里,有以下几个参数可以使用 -
True
,则使用左侧DataFrame中的索引(行标签)作为其连接键。 在具有MultiIndex(分层)的DataFrame的情况下,级别的数量必须与来自右DataFrame的连接键的数量相匹配。True
,设置为False
时,在很多情况下大大提高性能。现在创建两个不同的DataFrame并对其执行合并操作。
import pandas as pd left = pd.DataFrame({ 'id':[1,2,3,4,5], 'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'], 'subject_id':['sub1','sub2','sub4','sub6','sub5']}) right = pd.DataFrame( {'id':[1,2,3,4,5], 'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'], 'subject_id':['sub2','sub4','sub3','sub6','sub5']}) print (left) print("========================================") print (right)
执行上面示例代码,得到以下结果 -
Name id subject_id Alex 1 sub1 Amy 2 sub2 Allen 3 sub4 Alice 4 sub6 Ayoung 5 sub5 ======================================== Name id subject_id Billy 1 sub2 Brian 2 sub4 Bran 3 sub3 Bryce 4 sub6 Betty 5 sub5
在一个键上合并两个数据帧
import pandas as pd left = pd.DataFrame({ 'id':[1,2,3,4,5], 'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'], 'subject_id':['sub1','sub2','sub4','sub6','sub5']}) right = pd.DataFrame( {'id':[1,2,3,4,5], 'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'], 'subject_id':['sub2','sub4','sub3','sub6','sub5']}) rs = pd.merge(left,right,on='id') print(rs)
执行上面示例代码,得到以下结果 -
Name_x id subject_id_x Name_y subject_id_y Alex 1 sub1 Billy sub2 Amy 2 sub2 Brian sub4 Allen 3 sub4 Bran sub3 Alice 4 sub6 Bryce sub6 Ayoung 5 sub5 Betty sub5
合并多个键上的两个数据框
import pandas as pd left = pd.DataFrame({ 'id':[1,2,3,4,5], 'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'], 'subject_id':['sub1','sub2','sub4','sub6','sub5']}) right = pd.DataFrame( {'id':[1,2,3,4,5], 'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'], 'subject_id':['sub2','sub4','sub3','sub6','sub5']}) rs = pd.merge(left,right,on=['id','subject_id']) print(rs)
执行上面示例代码,得到以下结果 -
Name_x id subject_id Name_y Alice 4 sub6 Bryce Ayoung 5 sub5 Betty
如何合并参数指定如何确定哪些键将被包含在结果表中。如果组合键没有出现在左侧或右侧表中,则连接表中的值将为NA
。
这里是how
选项和SQL等效名称的总结 -
合并方法 | SQL等效 | 描述 |
---|---|---|
left |
LEFT OUTER JOIN |
使用左侧对象的键 |
right |
RIGHT OUTER JOIN |
使用右侧对象的键 |
outer |
FULL OUTER JOIN |
使用键的联合 |
inner |
INNER JOIN |
使用键的交集 |
Left Join示例
import pandas as pd left = pd.DataFrame({ 'id':[1,2,3,4,5], 'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'], 'subject_id':['sub1','sub2','sub4','sub6','sub5']}) right = pd.DataFrame( {'id':[1,2,3,4,5], 'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'], 'subject_id':['sub2','sub4','sub3','sub6','sub5']}) rs = pd.merge(left, right, on='subject_id', how='left') print (rs)
执行上面示例代码,得到以下结果 -
Name_x id_x subject_id Name_y id_y Alex 1 sub1 NaN NaN Amy 2 sub2 Billy 1.0 Allen 3 sub4 Brian 2.0 Alice 4 sub6 Bryce 4.0 Ayoung 5 sub5 Betty 5.0
Right Join示例
import pandas as pd left = pd.DataFrame({ 'id':[1,2,3,4,5], 'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'], 'subject_id':['sub1','sub2','sub4','sub6','sub5']}) right = pd.DataFrame( {'id':[1,2,3,4,5], 'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'], 'subject_id':['sub2','sub4','sub3','sub6','sub5']}) rs = pd.merge(left, right, on='subject_id', how='right') print (rs)
执行上面示例代码,得到以下结果 -
Name_x id_x subject_id Name_y id_y Amy 2.0 sub2 Billy 1 Allen 3.0 sub4 Brian 2 Alice 4.0 sub6 Bryce 4 Ayoung 5.0 sub5 Betty 5 NaN NaN sub3 Bran 3
Outer Join示例
import pandas as pd left = pd.DataFrame({ 'id':[1,2,3,4,5], 'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'], 'subject_id':['sub1','sub2','sub4','sub6','sub5']}) right = pd.DataFrame( {'id':[1,2,3,4,5], 'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'], 'subject_id':['sub2','sub4','sub3','sub6','sub5']}) rs = pd.merge(left, right, how='outer', on='subject_id') print (rs)
执行上面示例代码,得到以下结果 -
Name_x id_x subject_id Name_y id_y Alex 1.0 sub1 NaN NaN Amy 2.0 sub2 Billy 1.0 Allen 3.0 sub4 Brian 2.0 Alice 4.0 sub6 Bryce 4.0 Ayoung 5.0 sub5 Betty 5.0 NaN NaN sub3 Bran 3.0
Inner Join示例
连接将在索引上进行。连接(Join
)操作将授予它所调用的对象。所以,a.join(b)
不等于b.join(a)
。
import pandas as pd left = pd.DataFrame({ 'id':[1,2,3,4,5], 'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'], 'subject_id':['sub1','sub2','sub4','sub6','sub5']}) right = pd.DataFrame( {'id':[1,2,3,4,5], 'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'], 'subject_id':['sub2','sub4','sub3','sub6','sub5']}) rs = pd.merge(left, right, on='subject_id', how='inner') print (rs)
执行上面示例代码,得到以下结果 -
Name_x id_x subject_id Name_y id_y Amy 2 sub2 Billy 1 Allen 3 sub4 Brian 2 Alice 4 sub6 Bryce 4 Ayoung 5 sub5 Betty 5