任何分组(groupby)操作都涉及原始对象的以下操作之一。它们是 -
在许多情况下,我们将数据分成多个集合,并在每个子集上应用一些函数。在应用函数中,可以执行以下操作 -
下面来看看创建一个DataFrame对象并对其执行所有操作 -
import pandas as pd ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'], 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017], 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]} df = pd.DataFrame(ipl_data) print (df)
执行上面示例代码,得到以下结果 -
Points Rank Team Year 1 Riders 2014 2 Riders 2015 2 Devils 2014 3 Devils 2015 3 Kings 2014 4 kings 2015 1 Kings 2016 1 Kings 2017 2 Riders 2016 4 Royals 2014 1 Royals 2015 2 Riders 2017
Pandas对象可以分成任何对象。有多种方式来拆分对象,如 -
现在来看看如何将分组对象应用于DataFrame对象
示例
import pandas as pd ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'], 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017], 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]} df = pd.DataFrame(ipl_data) print (df.groupby('Team'))
执行上面示例代码,得到以下结果 -
<pandas.core.groupby.DataFrameGroupBy object at 0x00000245D60AD518>
import pandas as pd ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'], 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017], 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]} df = pd.DataFrame(ipl_data) print (df.groupby('Team').groups)
执行上面示例代码,得到以下结果 -
{ 'Devils': Int64Index([2, 3], dtype='int64'), 'Kings': Int64Index([4, 6, 7], dtype='int64'), 'Riders': Int64Index([0, 1, 8, 11], dtype='int64'), 'Royals': Int64Index([9, 10], dtype='int64'), 'kings': Int64Index([5], dtype='int64') }
示例
按多列分组 -
import pandas as pd ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'], 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017], 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]} df = pd.DataFrame(ipl_data) print (df.groupby(['Team','Year']).groups)
执行上面示例代码,得到以下结果 -
{ ('Devils', 2014): Int64Index([2], dtype='int64'), ('Devils', 2015): Int64Index([3], dtype='int64'), ('Kings', 2014): Int64Index([4], dtype='int64'), ('Kings', 2016): Int64Index([6], dtype='int64'), ('Kings', 2017): Int64Index([7], dtype='int64'), ('Riders', 2014): Int64Index([0], dtype='int64'), ('Riders', 2015): Int64Index([1], dtype='int64'), ('Riders', 2016): Int64Index([8], dtype='int64'), ('Riders', 2017): Int64Index([11], dtype='int64'), ('Royals', 2014): Int64Index([9], dtype='int64'), ('Royals', 2015): Int64Index([10], dtype='int64'), ('kings', 2015): Int64Index([5], dtype='int64') }
使用groupby
对象,可以遍历类似itertools.obj
的对象。
import pandas as pd ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'], 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017], 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]} df = pd.DataFrame(ipl_data) grouped = df.groupby('Year') for name,group in grouped: print (name) print (group)
执行上面示例代码,得到以下结果 -
2014 Points Rank Team Year 1 Riders 2014 2 Devils 2014 3 Kings 2014 4 Royals 2014 Points Rank Team Year 2 Riders 2015 3 Devils 2015 4 kings 2015 1 Royals 2015 Points Rank Team Year 1 Kings 2016 2 Riders 2016 Points Rank Team Year 1 Kings 2017 2 Riders 2017
默认情况下,groupby
对象具有与分组名相同的标签名称。
使用get_group()
方法,可以选择一个组。参考以下示例代码 -
import pandas as pd ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'], 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017], 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]} df = pd.DataFrame(ipl_data) grouped = df.groupby('Year') print (grouped.get_group(2014))
执行上面示例代码,得到以下结果 -
Points Rank Team Year 1 Riders 2014 2 Devils 2014 3 Kings 2014 4 Royals 2014
聚合函数为每个组返回单个聚合值。当创建了分组(group by)对象,就可以对分组数据执行多个聚合操作。
一个比较常用的是通过聚合或等效的agg
方法聚合 -
import pandas as pd import numpy as np ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'], 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017], 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]} df = pd.DataFrame(ipl_data) grouped = df.groupby('Year') print (grouped['Points'].agg(np.mean))
执行上面示例代码,得到以下结果 -
Year 795.25 769.50 725.00 739.00 Name: Points, dtype: float64
另一种查看每个分组的大小的方法是应用size()
函数 -
import pandas as pd import numpy as np ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'], 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017], 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]} df = pd.DataFrame(ipl_data) grouped = df.groupby('Team') print (grouped.agg(np.size))
执行上面示例代码,得到以下结果 -
Team Devils 2 2 2 Kings 3 3 3 Riders 4 4 4 Royals 2 2 2 kings 1 1 1
通过分组系列,还可以传递函数的列表或字典来进行聚合,并生成DataFrame
作为输出 -
import pandas as pd import numpy as np ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'], 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017], 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]} df = pd.DataFrame(ipl_data) grouped = df.groupby('Team') agg = grouped['Points'].agg([np.sum, np.mean, np.std]) print (agg)
执行上面示例代码,得到以下结果 -
sum mean std Team Devils 1536 768.000000 134.350288 Kings 2285 761.666667 24.006943 Riders 3049 762.250000 88.567771 Royals 1505 752.500000 72.831998 kings 812 812.000000 NaN
分组或列上的转换返回索引大小与被分组的索引相同的对象。因此,转换应该返回与组块大小相同的结果。
import pandas as pd import numpy as np ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'], 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017], 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]} df = pd.DataFrame(ipl_data) grouped = df.groupby('Team') score = lambda x: (x - x.mean()) / x.std()*10 print (grouped.transform(score))
执行上面示例代码,得到以下结果 -
Points Rank Year 12.843272 -15.000000 -11.618950 3.020286 5.000000 -3.872983 7.071068 -7.071068 -7.071068 -7.071068 7.071068 7.071068 -8.608621 11.547005 -10.910895 NaN NaN NaN -2.360428 -5.773503 2.182179 10.969049 -5.773503 8.728716 -7.705963 5.000000 3.872983 -7.071068 7.071068 -7.071068 7.071068 -7.071068 7.071068 -8.157595 5.000000 11.618950
过滤根据定义的标准过滤数据并返回数据的子集。filter()
函数用于过滤数据。
import pandas as pd import numpy as np ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'], 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017], 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]} df = pd.DataFrame(ipl_data) filter = df.groupby('Team').filter(lambda x: len(x) >= 3) print (filter)
执行上面示例代码,得到以下结果 -
Points Rank Team Year 1 Riders 2014 2 Riders 2015 3 Kings 2014 1 Kings 2016 1 Kings 2017 2 Riders 2016 2 Riders 2017
在上述过滤条件下,要求返回三次以上参加IPL的队伍。