http://joyfulpandas.datawhale.club/Content/index.html
为了更加清晰地说明具有多级索引的 DataFrame 结构,下面新构造一张表,读者可以忽略这里的构造方法,它们将会在第4小节被更详细地讲解。
In [70]: np.random.seed(0) In [71]: multi_index = pd.MultiIndex.from_product([list('ABCD'), ....: df.Gender.unique()], names=('School', 'Gender')) ....: In [72]: multi_column = pd.MultiIndex.from_product([['Height', 'Weight'], ....: df.Grade.unique()], names=('Indicator', 'Grade')) ....: In [73]: df_multi = pd.DataFrame(np.c_[(np.random.randn(8,4)*5 + 163).tolist(), ....: (np.random.randn(8,4)*5 + 65).tolist()], ....: index = multi_index, ....: columns = multi_column).round(1) ....: In [74]: df_multi Out[74]: Indicator Height Weight Grade Freshman Senior Sophomore Junior Freshman Senior Sophomore Junior School Gender A Female 171.8 165.0 167.9 174.2 60.6 55.1 63.3 65.8 Male 172.3 158.1 167.8 162.2 71.2 71.0 63.1 63.5 B Female 162.5 165.1 163.7 170.3 59.8 57.9 56.5 74.8 Male 166.8 163.6 165.2 164.7 62.5 62.8 58.7 68.9 C Female 170.5 162.0 164.6 158.7 56.9 63.9 60.5 66.9 Male 150.2 166.3 167.3 159.3 62.4 59.1 64.9 67.1 D Female 174.3 155.7 163.2 162.1 65.3 66.5 61.8 63.2 Male 170.7 170.3 163.8 164.9 61.6 63.2 60.9 56.4
下图通过颜色区分,标记了 DataFrame 的结构。与单层索引的表一样,具备元素值、行索引和列索引三个部分。其中,这里的行索引和列索引都是 MultiIndex 类型,只不过 索引中的一个元素是元组 而不是单层索引中的标量。例如,行索引的第四个元素为 (“B”, “Male”) ,列索引的第二个元素为 (“Height”, “Senior”) ,这里需要注意,外层连续出现相同的值时,第一次之后出现的会被隐藏显示,使结果的可读性增强。
与单层索引类似, MultiIndex 也具有名字属性,图中的 School 和 Gender 分别对应了表的第一层和第二层行索引的名字, Indicator 和 Grade 分别对应了第一层和第二层列索引的名字。
索引的名字和值属性分别可以通过 names 和 values 获得:
In [75]: df_multi.index.names Out[75]: FrozenList(['School', 'Gender']) In [76]: df_multi.columns.names Out[76]: FrozenList(['Indicator', 'Grade']) In [77]: df_multi.index.values Out[77]: array([('A', 'Female'), ('A', 'Male'), ('B', 'Female'), ('B', 'Male'), ('C', 'Female'), ('C', 'Male'), ('D', 'Female'), ('D', 'Male')], dtype=object) In [78]: df_multi.columns.values Out[78]: array([('Height', 'Freshman'), ('Height', 'Senior'), ('Height', 'Sophomore'), ('Height', 'Junior'), ('Weight', 'Freshman'), ('Weight', 'Senior'), ('Weight', 'Sophomore'), ('Weight', 'Junior')], dtype=object)
如果想要得到某一层的索引,则需要通过 get_level_values 获得:
In [79]: df_multi.index.get_level_values(0) Out[79]: Index(['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'], dtype='object', name='School')
但对于索引而言,无论是单层还是多层,用户都无法通过 index_obj[0] = item 的方式来修改元素,也不能通过 index_name[0] = new_name 的方式来修改名字,关于如何修改这些属性的话题将在第三节被讨论。
熟悉了结构后,现在回到原表,将学校和年级设为索引,此时的行为多级索引,列为单级索引,由于默认状态的列索引不含名字,因此对应于刚刚图中 Indicator 和 Grade 的索引名位置是空缺的。
In [80]: df_multi = df.set_index(['School', 'Grade']) In [81]: df_multi.head() Out[81]: Name Gender Weight Transfer School Grade Shanghai Jiao Tong University Freshman Gaopeng Yang Female 46.0 N Peking University Freshman Changqiang You Male 70.0 N Shanghai Jiao Tong University Senior Mei Sun Male 89.0 N Fudan University Sophomore Xiaojuan Sun Female 41.0 N Sophomore Gaojuan You Male 74.0 N
由于多级索引中的单个元素以元组为单位,因此之前在第一节介绍的 loc 和 iloc 方法完全可以照搬,只需把标量的位置替换成对应的元组。
当传入元组列表或单个元组或返回前二者的函数时,需要先进行索引排序以避免性能警告:
In [82]: with warnings.catch_warnings(): ....: warnings.filterwarnings('error') ....: try: ....: df_multi.loc[('Fudan University', 'Junior')].head() ....: except Warning as w: ....: Warning_Msg = w ....: In [83]: Warning_Msg Out[83]: pandas.errors.PerformanceWarning('indexing past lexsort depth may impact performance.') In [84]: df_sorted = df_multi.sort_index() In [85]: df_sorted.loc[('Fudan University', 'Junior')].head() Out[85]: Name Gender Weight Transfer School Grade Fudan University Junior Yanli You Female 48.0 N Junior Chunqiang Chu Male 72.0 N Junior Changfeng Lv Male 76.0 N Junior Yanjuan Lv Female 49.0 NaN Junior Gaoqiang Zhou Female 43.0 N In [86]: df_sorted.loc[[('Fudan University', 'Senior'), ....: ('Shanghai Jiao Tong University', 'Freshman')]].head() ....: Out[86]: Name Gender Weight Transfer School Grade Fudan University Senior Chengpeng Zheng Female 38.0 N Senior Feng Zhou Female 47.0 N Senior Gaomei Lv Female 34.0 N Senior Chunli Lv Female 56.0 N Senior Chengpeng Zhou Male 81.0 N In [87]: df_sorted.loc[df_sorted.Weight > 70].head() # 布尔列表也是可用的 Out[87]: Name Gender Weight Transfer School Grade Fudan University Freshman Feng Wang Male 74.0 N Junior Chunqiang Chu Male 72.0 N Junior Changfeng Lv Male 76.0 N Senior Chengpeng Zhou Male 81.0 N Senior Chengpeng Qian Male 73.0 Y In [88]: df_sorted.loc[lambda x:('Fudan University','Junior')].head() Out[88]: Name Gender Weight Transfer School Grade Fudan University Junior Yanli You Female 48.0 N Junior Chunqiang Chu Male 72.0 N Junior Changfeng Lv Male 76.0 N Junior Yanjuan Lv Female 49.0 NaN Junior Gaoqiang Zhou Female 43.0 N
当使用切片时需要注意,在单级索引中只要切片端点元素是唯一的,那么就可以进行切片,但在多级索引中,无论元组在索引中是否重复出现,都必须经过排序才能使用切片,否则报错:
In [89]: try: ....: df_multi.loc[('Fudan University', 'Senior'):].head() ....: except Exception as e: ....: Err_Msg = e ....: In [90]: Err_Msg Out[90]: pandas.errors.UnsortedIndexError('Key length (2) was greater than MultiIndex lexsort depth (0)') In [91]: df_sorted.loc[('Fudan University', 'Senior'):].head() Out[91]: Name Gender Weight Transfer School Grade Fudan University Senior Chengpeng Zheng Female 38.0 N Senior Feng Zhou Female 47.0 N Senior Gaomei Lv Female 34.0 N Senior Chunli Lv Female 56.0 N Senior Chengpeng Zhou Male 81.0 N In [92]: df_unique = df.drop_duplicates(subset=['School','Grade'] ....: ).set_index(['School', 'Grade']) ....: In [93]: df_unique.head() Out[93]: Name Gender Weight Transfer School Grade Shanghai Jiao Tong University Freshman Gaopeng Yang Female 46.0 N Peking University Freshman Changqiang You Male 70.0 N Shanghai Jiao Tong University Senior Mei Sun Male 89.0 N Fudan University Sophomore Xiaojuan Sun Female 41.0 N Tsinghua University Freshman Xiaoli Qian Female 51.0 N In [94]: try: ....: df_unique.loc[('Fudan University', 'Senior'):].head() ....: except Exception as e: ....: Err_Msg = e ....: In [95]: Err_Msg Out[95]: pandas.errors.UnsortedIndexError('Key length (2) was greater than MultiIndex lexsort depth (0)') In [96]: df_unique.sort_index().loc[('Fudan University', 'Senior'):].head() Out[96]: Name Gender Weight Transfer School Grade Fudan University Senior Chengpeng Zheng Female 38.0 N Sophomore Xiaojuan Sun Female 41.0 N Peking University Freshman Changqiang You Male 70.0 N Junior Juan Xu Female NaN N Senior Changli Lv Female 41.0 N
此外,在多级索引中的元组有一种特殊的用法,可以对多层的元素进行交叉组合后索引,但同时需要指定 loc 的列,全选则用 : 表示。其中,每一层需要选中的元素用列表存放,传入 loc 的形式为 [(level_0_list, level_1_list), cols] 。例如,想要得到所有北大和复旦的大二大三学生,可以如下写出:
In [97]: res = df_multi.loc[(['Peking University', 'Fudan University'], ....: ['Sophomore', 'Junior']), :] ....: In [98]: res.head() Out[98]: Name Gender Weight Transfer School Grade Peking University Sophomore Changmei Xu Female 43.0 N Sophomore Xiaopeng Qin Male NaN N Sophomore Mei Xu Female 39.0 N Sophomore Xiaoli Zhou Female 55.0 N Sophomore Peng Han Female 34.0 NaN In [99]: res.shape Out[99]: (33, 4)
下面的语句和上面类似,但仍然传入的是元素(这里为元组)的列表,它们的意义是不同的,表示的是选出北大的大三学生和复旦的大二学生:
In [100]: res = df_multi.loc[[('Peking University', 'Junior'), .....: ('Fudan University', 'Sophomore')]] .....: In [101]: res.head() Out[101]: Name Gender Weight Transfer School Grade Peking University Junior Juan Xu Female NaN N Junior Changjuan You Female 47.0 N Junior Gaoli Xu Female 48.0 N Junior Gaoquan Zhou Male 70.0 N Junior Qiang You Female 56.0 N In [102]: res.shape Out[102]: (16, 4)