version: postgresql10.17
总结
-------- | meta | -------- | root | -------- | left branch | branch | branch | ..... | branch | -------------------------------------------------------- | leaf | leaf | leaf | leaf | leaf | right leaf | -------------------------------------------------------- ! left branch的data为空 ! branch的data是指向页面的最小值 ! right leaf的第一条是当前页面的最小值 ! leaf的第一条是右sibling的最小值得copy
支持哪些索引类型?
postgres=# \d pg_am Table "pg_catalog.pg_am" Column | Type | Collation | Nullable | Default -----------+---------+-----------+----------+--------- amname | name | | not null | amhandler | regproc | | not null | amtype | "char" | | not null | postgres=# select * from pg_am; amname | amhandler | amtype --------+-------------+-------- btree | bthandler | i hash | hashhandler | i gist | gisthandler | i gin | ginhandler | i spgist | spghandler | i brin | brinhandler | i
ps. 9.6前这张系统表有很多method的属性信息,新版本这些属性保存在C代码中了。
amhandler的相关函数在pg_proc中,都是C函数。
select proname, prosrc from pg_proc where oid in (330,331,332,333,334,335); proname | prosrc -------------+------------- bthandler | bthandler hashhandler | hashhandler gisthandler | gisthandler ginhandler | ginhandler spghandler | spghandler brinhandler | brinhandler
例子:
postgres=# \d t2 Table "public.t2" Column | Type | Collation | Nullable | Default --------+---------+-----------+----------+--------- id | integer | | not null | info | integer | | | Indexes: "pk_t2_a" PRIMARY KEY, btree (id)
系统表含义:
postgres=# select * from pg_index where indexrelid='pk_t2_a'::regclass; -[ RECORD 1 ]--+------- indexrelid | 126817 indrelid | 126814 indnatts | 1 -- 列数 indisunique | t -- 唯一索引 indisprimary | t -- 主键索引 indisexclusion | f -- 支持exclusion constraint indimmediate | t -- 强制每次insert时检查唯一性 indisclustered | f -- 根据这个索引做了聚簇 indisvalid | t -- 是否用于查询 indcheckxmin | f -- 查询时需要检查索引元组xmin<事务的xmin才能用 indisready | t -- 索引是否可以进行插入、更新 indislive | t -- f表示索引正在被drop,其他操作应该忽略这个索引 indisreplident | f -- ALTER TABLE ... REPLICA IDENTITY USING INDEX indkey | 1 -- 用哪几个列建的 indcollation | 0 indclass | 1978 indoption | 0 indexprs | indpred | -- indisreplident订阅了update和delete,需要指定REPLICA IDENTITY才能执行更新删除 postgres=# select oid,* from pg_class order by oid desc limit 1; -[ RECORD 1 ]-------+-------- oid | 126817 relname | pk_t2_a relnamespace | 2200 reltype | 0 reloftype | 0 relowner | 10 relam | 403 relfilenode | 126817 reltablespace | 0 relpages | 1 reltuples | 0 relallvisible | 0 reltoastrelid | 0 relhasindex | f relisshared | f relpersistence | p relkind | i relnatts | 1 relchecks | 0 relhasoids | f relhaspkey | f relhasrules | f relhastriggers | f relhassubclass | f relrowsecurity | f relforcerowsecurity | f relispopulated | t relreplident | n relispartition | f relfrozenxid | 0 relminmxid | 0 relacl | reloptions | relpartbound |
|-----------------------------------------------------| | linp0(page header) | linp1 | linp2 | | |-----------------------------------------------------| | | | | |-----------------------------------------------------| | | itup2 | itup1 | special space(BTMetaPageData)| |-----------------------------------------------------|
BTPageOpaqueData在special space中,维护B-linked-tree的结构
typedef struct BTPageOpaqueData { BlockNumber btpo_prev; /* 左兄弟的块号,便于反向扫描 */ BlockNumber btpo_next; /* 右兄弟的块号,正向扫描*/ union { uint32 level; /* 在索引树中的层级level*/ TransactionId xact; /* 删除无用的ID */ } btpo; uint16 btpo_flags; /* 页面类型 */ BTCycleId btpo_cycleid; /* vacuum cycle ID of latest split */ } BTPageOpaqueData; /* Bits defined in btpo_flags */ #define BTP_LEAF (1 << 0) /* 有该标志表明有叶子页面 */ #define BTP_ROOT (1 << 1) /* 根页面*/ #define BTP_DELETED (1 << 2) /* 从索引树中删除的页面 */ #define BTP_META (1 << 3) /* 元页面*/ #define BTP_HALF_DEAD (1 << 4) /* 空页面,但仍保留*/ #define BTP_SPLIT_END (1 << 5) /* rightmost page of split group */ #define BTP_HAS_GARBAGE (1 << 6) /* page has LP_DEAD tuples,LP——DEAD元组 */ #define BTP_INCOMPLETE_SPLIT (1 << 7) /* right sibling's downlink is missing */ // BTMetaPageData typedef struct BTMetaPageData { uint32 btm_magic; /* should contain BTREE_MAGIC */ uint32 btm_version; /* should contain BTREE_VERSION */ BlockNumber btm_root; /* current root location */ uint32 btm_level; /* tree level of the root page */ BlockNumber btm_fastroot; /* current "fast" root location */ uint32 btm_fastlevel; /* tree level of the "fast" root page */ } BTMetaPageData;
总结
观察存储结构:
---------------- | meta page | ---------------- | root page | ----------------
创建测试表
create table t3(id int primary key, info text);
测试表的第0个页面是meta page,指向root page
postgres=# \d t3 Table "public.t3" Column | Type | Collation | Nullable | Default --------+---------+-----------+----------+--------- id | integer | | not null | info | text | | | Indexes: "t3_pkey" PRIMARY KEY, btree (id) postgres=# select * from bt_page_stats('t3_pkey',0); ERROR: block 0 is a meta page postgres=# select * from bt_page_items('t3_pkey',0); ERROR: block 0 is a meta page postgres=# select * from bt_metap('t3_pkey'); magic | version | root | level | fastroot | fastlevel --------+---------+------+-------+----------+----------- 340322 | 2 | 1 | 0 | 1 | 0
create table t3(id int primary key, info text); insert into t3 select generate_series(1,100), md5(random()::text);
level = 0
select * from bt_metap('t3_pkey');
btpo = 等级
btpo_flags = 类型
postgres=# select * from bt_page_stats('t3_pkey',1); blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------ 1 | l | 100 | 0 | 16 | 8192 | 6148 | 0 | 0 | 0 | 3 (1 row)
bt_page_stats数据都是从SpecialSpage的BTPageOpaque中取的,含义可以参考BTPageOpaque数据结构
bt_page_stats GetBTPageStatistics BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page); ... stat->btpo_prev = opaque->btpo_prev; stat->btpo_next = opaque->btpo_next; stat->btpo.level = opaque->btpo.level; stat->btpo_flags = opaque->btpo_flags; stat->btpo_cycleid = opaque->btpo_cycleid; ...
https://developer.aliyun.com/article/53701
层次可以从bt_page_stats的btpo得到,代表当前index page所处的层级。
注意层级并不是唯一的,例如btpo=3的层级,可能有分几个档。
btpo=0级表示最底层,处于这个层级的index pages存储的items(ctid)是指向heap page的。
类别和层级不挂钩,类别里面又可以有多个层级,但是只有层级=0的index page存储的ctid内容才是指向heap page的
其他层级index page存储的ctid内容都是指向同层级其他index page(双向链表),或者指下级的index page。
0层结构,只有meta和root页。
root页最多可以存储的item数,取决于索引字段数据的长度、以及索引页的大小。
当前
btpo = 等级 = 0
btpo_flags = 类型 = 3 = BTP_LEAF | BTP_ROOT
#define BTP_LEAF (1 << 0) /* leaf page, i.e. not internal page */ #define BTP_ROOT (1 << 1) /* root page (has no parent) */ #define BTP_DELETED (1 << 2) /* page has been deleted from tree */ #define BTP_META (1 << 3) /* meta-page */ #define BTP_HALF_DEAD (1 << 4) /* empty, but still in tree */ #define BTP_SPLIT_END (1 << 5) /* rightmost page of split group */ #define BTP_HAS_GARBAGE (1 << 6) /* page has LP_DEAD tuples */ #define BTP_INCOMPLETE_SPLIT (1 << 7) /* right sibling's downlink is missing */
btpo知道这是一个0级节点,0级节点指向的是heap tuple
postgres=# select * from bt_page_items('t3_pkey',1) limit 10; itemoffset | ctid | itemlen | nulls | vars | data ------------+--------+---------+-------+------+------------------------- 1 | (0,1) | 16 | f | f | 01 00 00 00 00 00 00 00 2 | (0,2) | 16 | f | f | 02 00 00 00 00 00 00 00 3 | (0,3) | 16 | f | f | 03 00 00 00 00 00 00 00 4 | (0,4) | 16 | f | f | 04 00 00 00 00 00 00 00 5 | (0,5) | 16 | f | f | 05 00 00 00 00 00 00 00 6 | (0,6) | 16 | f | f | 06 00 00 00 00 00 00 00 7 | (0,7) | 16 | f | f | 07 00 00 00 00 00 00 00 8 | (0,8) | 16 | f | f | 08 00 00 00 00 00 00 00 9 | (0,9) | 16 | f | f | 09 00 00 00 00 00 00 00 10 | (0,10) | 16 | f | f | 0a 00 00 00 00 00 00 00 postgres=# select * from t3 where ctid='(0,10)'; id | info ----+---------------------------------- 10 | 986dd76dcec6d1b705ea163fe6e11f37
create table t4(id int primary key, info text); insert into t4 select generate_series(1,1000), md5(random()::text);
level = 1(只有leaf、meta和root不计入)
select * from bt_metap('t4_pkey');
postgres=# select * from bt_metap('t4_pkey'); magic | version | root | level | fastroot | fastlevel --------+---------+------+-------+----------+----------- 340322 | 2 | 3 | 1 | 3 | 1
-- btpo=0:已经是最底层了 -- btpo_flags=1:BTP_LEAF postgres=# select * from bt_page_stats('t4_pkey',1); blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------ 1 | l | 367 | 0 | 16 | 8192 | 808 | 0 | 2 | 0 | 1 -- btpo=0:已经是最底层了 -- btpo_flags=1:BTP_LEAF postgres=# select * from bt_page_stats('t4_pkey',2); blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------ 2 | l | 367 | 0 | 16 | 8192 | 808 | 1 | 4 | 0 | 1 -- btpo=1:不是最底层 -- btpo_flags=2:BTP_ROOT postgres=# select * from bt_page_stats('t4_pkey',3); blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------ 3 | r | 3 | 0 | 13 | 8192 | 8096 | 0 | 0 | 1 | 2 -- btpo=0:已经是最底层了 -- btpo_flags=1:BTP_LEAF postgres=# select * from bt_page_stats('t4_pkey',4); blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------ 4 | l | 268 | 0 | 16 | 8192 | 2788 | 2 | 0 | 0 | 1 postgres=# select * from bt_page_stats('t4_pkey',5); ERROR: block number out of range
指向3个页面,data存储的是这个leaf page存储的最小值
postgres=# select * from bt_page_items('t4_pkey',3); itemoffset | ctid | itemlen | nulls | vars | data ------------+-------+---------+-------+------+------------------------- 1 | (1,1) | 8 | f | f | 2 | (2,1) | 16 | f | f | 6f 01 00 00 00 00 00 00 3 | (4,1) | 16 | f | f | dd 02 00 00 00 00 00 00
第一个0级页面:
postgres=# select * from bt_page_items('t4_pkey',1); itemoffset | ctid | itemlen | nulls | vars | data ------------+---------+---------+-------+------+------------------------- 1 | (3,7) | 16 | f | f | 6f 01 00 00 00 00 00 00 2 | (0,1) | 16 | f | f | 01 00 00 00 00 00 00 00 3 | (0,2) | 16 | f | f | 02 00 00 00 00 00 00 00 4 | (0,3) | 16 | f | f | 03 00 00 00 00 00 00 00 ... 365 | (3,4) | 16 | f | f | 6c 01 00 00 00 00 00 00 366 | (3,5) | 16 | f | f | 6d 01 00 00 00 00 00 00 367 | (3,6) | 16 | f | f | 6e 01 00 00 00 00 00 00
第二个0级页面:
postgres=# select * from bt_page_items('t4_pkey',2); itemoffset | ctid | itemlen | nulls | vars | data ------------+---------+---------+-------+------+------------------------- 1 | (6,13) | 16 | f | f | dd 02 00 00 00 00 00 00 2 | (3,7) | 16 | f | f | 6f 01 00 00 00 00 00 00 3 | (3,8) | 16 | f | f | 70 01 00 00 00 00 00 00 4 | (3,9) | 16 | f | f | 71 01 00 00 00 00 00 00 ... 365 | (6,10) | 16 | f | f | da 02 00 00 00 00 00 00 366 | (6,11) | 16 | f | f | db 02 00 00 00 00 00 00 367 | (6,12) | 16 | f | f | dc 02 00 00 00 00 00 00
第四个0级页面:
postgres=# select * from bt_page_items('t4_pkey',4); itemoffset | ctid | itemlen | nulls | vars | data ------------+---------+---------+-------+------+------------------------- 1 | (6,13) | 16 | f | f | dd 02 00 00 00 00 00 00 2 | (6,14) | 16 | f | f | de 02 00 00 00 00 00 00 3 | (6,15) | 16 | f | f | df 02 00 00 00 00 00 00 4 | (6,16) | 16 | f | f | e0 02 00 00 00 00 00 00 ... 266 | (8,38) | 16 | f | f | e6 03 00 00 00 00 00 00 267 | (8,39) | 16 | f | f | e7 03 00 00 00 00 00 00 268 | (8,40) | 16 | f | f | e8 03 00 00 00 00 00 00
总结:
create table t5(id int primary key, info text); insert into t5 select trunc(random()*10000000), md5(random()::text) from generate_series(1,1000000) on conflict on constraint t5_pkey do nothing;
level = 2(有branch和leaf)root是412
select * from bt_metap('t5_pkey'); magic | version | root | level | fastroot | fastlevel --------+---------+------+-------+----------+----------- 340322 | 2 | 412 | 2 | 412 | 2
查看root(data存的是左兄弟的最大值)
btpo = 2(在第二层)
btpo_flags = 2 = BTP_ROOT
postgres=# select * from bt_page_stats('t5_pkey', 412); blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------ 412 | r | 11 | 0 | 15 | 8192 | 7936 | 0 | 0 | 2 | 2 postgres=# select * from bt_page_items('t5_pkey', 412); itemoffset | ctid | itemlen | nulls | vars | data ------------+----------+---------+-------+------+------------------------- 1 | (3,1) | 8 | f | f | 2 | (2714,1) | 16 | f | f | 66 07 0a 00 00 00 00 00 3 | (1233,1) | 16 | f | f | a0 8c 16 00 00 00 00 00 4 | (2381,1) | 16 | f | f | 0c 62 23 00 00 00 00 00 5 | (583,1) | 16 | f | f | 4d 71 30 00 00 00 00 00 6 | (2286,1) | 16 | f | f | 6b 80 3d 00 00 00 00 00 7 | (1062,1) | 16 | f | f | 59 fb 4b 00 00 00 00 00 8 | (2046,1) | 16 | f | f | de b2 5a 00 00 00 00 00 9 | (411,1) | 16 | f | f | d6 df 6a 00 00 00 00 00 10 | (2006,1) | 16 | f | f | 06 6d 79 00 00 00 00 00 11 | (1380,1) | 16 | f | f | 8c 34 8a 00 00 00 00 00
这样看起来几个branch page是3、2714、1233、2381等11个页面。
postgres=# select * from bt_page_stats('t5_pkey', 3); blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------ 3 | i | 214 | 0 | 15 | 8192 | 3876 | 0 | 2714 | 1 | 0 postgres=# select * from bt_page_stats('t5_pkey', 2714); blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------ 2714 | i | 273 | 0 | 15 | 8192 | 2696 | 3 | 1233 | 1 | 0
下面看一下branch的内容,例如3页面:
postgres=# select * from bt_page_items('t5_pkey', 3); itemoffset | ctid | itemlen | nulls | vars | data ------------+----------+---------+-------+------+------------------------- 1 | (1852,1) | 16 | f | f | 66 07 0a 00 00 00 00 00 2 | (1,1) | 8 | f | f | 3 | (3176,1) | 16 | f | f | 5c 08 00 00 00 00 00 00 4 | (1612,1) | 16 | f | f | ef 11 00 00 00 00 00 00 5 | (3045,1) | 16 | f | f | 2d 1a 00 00 00 00 00 00 6 | (724,1) | 16 | f | f | e2 24 00 00 00 00 00 00 7 | (2822,1) | 16 | f | f | ae 2d 00 00 00 00 00 00 8 | (1369,1) | 16 | f | f | 96 38 00 00 00 00 00 00 9 | (2612,1) | 16 | f | f | ac 42 00 00 00 00 00 00 10 | (315,1) | 16 | f | f | 0d 4e 00 00 00 00 00 00 11 | (2458,1) | 16 | f | f | d6 58 00 00 00 00 00 00 ... ...
查看leaf页面1852的内容,第一条是右兄弟最小值的复制;第二条是当前最小值。
所以data是保存指向页面的最小数据。
postgres=# select * from bt_page_items('t5_pkey', 1852); itemoffset | ctid | itemlen | nulls | vars | data ------------+------------+---------+-------+------+------------------------- 1 | (151,77) | 16 | f | f | 2b 18 0a 00 00 00 00 00 2 | (1231,33) | 16 | f | f | 66 07 0a 00 00 00 00 00 <----------- 3 | (4098,24) | 16 | f | f | 67 07 0a 00 00 00 00 00 4 | (2373,32) | 16 | f | f | 73 07 0a 00 00 00 00 00 5 | (7888,95) | 16 | f | f | 9e 07 0a 00 00 00 00 00
查看root发现只有(3,1)没有data,data本应保存右兄弟的最小值。没有说明(3,1)是第一个页面。
postgres=# select * from bt_page_items('t5_pkey', 412); itemoffset | ctid | itemlen | nulls | vars | data ------------+----------+---------+-------+------+------------------------- 1 | (3,1) | 8 | f | f | 2 | (2714,1) | 16 | f | f | 66 07 0a 00 00 00 00 00 3 | (1233,1) | 16 | f | f | a0 8c 16 00 00 00 00 00 4 | (2381,1) | 16 | f | f | 0c 62 23 00 00 00 00 00
查看branch
postgres=# select * from bt_page_items('t5_pkey', 3); itemoffset | ctid | itemlen | nulls | vars | data ------------+----------+---------+-------+------+------------------------- 1 | (1852,1) | 16 | f | f | 66 07 0a 00 00 00 00 00 2 | (1,1) | 8 | f | f | 3 | (3176,1) | 16 | f | f | 5c 08 00 00 00 00 00 00 4 | (1612,1) | 16 | f | f | ef 11 00 00 00 00 00 00 5 | (3045,1) | 16 | f | f | 2d 1a 00 00 00 00 00 00 6 | (724,1) | 16 | f | f | e2 24 00 00 00 00 00 00 7 | (2822,1) | 16 | f | f | ae 2d 00 00 00 00 00 00 8 | (1369,1) | 16 | f | f | 96 38 00 00 00 00 00 00 9 | (2612,1) | 16 | f | f | ac 42 00 00 00 00 00 00
查看leaf,第一条是右sibling的最小值,第二条是当前页面的最小值。
postgres=# select * from bt_page_items('t5_pkey', 1); itemoffset | ctid | itemlen | nulls | vars | data ------------+------------+---------+-------+------+------------------------- 1 | (212,57) | 16 | f | f | 5c 08 00 00 00 00 00 00 2 | (7238,62) | 16 | f | f | 2a 00 00 00 00 00 00 00 3 | (430,12) | 16 | f | f | 2b 00 00 00 00 00 00 00 4 | (5535,81) | 16 | f | f | 3e 00 00 00 00 00 00 00 5 | (1278,47) | 16 | f | f | 3f 00 00 00 00 00 00 00 6 | (4255,45) | 16 | f | f | 46 00 00 00 00 00 00 00
查看数据
postgres=# select * from bt_page_items('t5_pkey', 1); itemoffset | ctid | itemlen | nulls | vars | data ------------+------------+---------+-------+------+------------------------- 1 | (212,57) | 16 | f | f | 5c 08 00 00 00 00 00 00 2 | (7238,62) | 16 | f | f | 2a 00 00 00 00 00 00 00 3 | (430,12) | 16 | f | f | 2b 00 00 00 00 00 00 00 4 | (5535,81) | 16 | f | f | 3e 00 00 00 00 00 00 00 5 | (1278,47) | 16 | f | f | 3f 00 00 00 00 00 00 00 6 | (4255,45) | 16 | f | f | 46 00 00 00 00 00 00 00 postgres=# select * from t5 where ctid='(7238,62)'; id | info ----+---------------------------------- 42 | faca6c2f77cfb6eea5e9ee6bf2740578 postgres=# select min(id) from t5; min ----- 42
https://developer.aliyun.com/article/53701
下一篇开始翻源码