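While studying the checkpoint mechanism recently, I took a close look at how WAL files are managed, switched, and recycled. This post is a brief summary.
PostgreSQL writes XLOG records into WAL segment files under the pg_xlog subdirectory (pg_wal since version 10). When the current segment file fills up, PostgreSQL switches to a new one. The number of WAL files varies with several configuration parameters and with server activity.
For the background theory on WAL, see my earlier article, "PostgreSQL可靠性浅谈" (A Brief Discussion of PostgreSQL Reliability).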
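A WAL segment switch happens in any of the following cases:
1) The WAL segment file is full
2) The function pg_switch_xlog() is called (pg_switch_wal() since version 10)
3) archive_mode is enabled and archive_timeout has elapsed
A switched-out WAL file is normally recycled (renamed and reused), but it is removed if it is not needed.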
The functions involved all live in src/backend/access/transam/xlog.c.
In CreateCheckPoint, you can see the following logic:
    ...
    /*
     * Update the average distance between checkpoints if the prior checkpoint
     * exists.
     */
    if (PriorRedoPtr != InvalidXLogRecPtr)
        UpdateCheckPointDistanceEstimate(RedoRecPtr - PriorRedoPtr);

    /*
     * Delete old log files, those no longer needed for last checkpoint to
     * prevent the disk holding the xlog from growing full.
     */
    XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
    KeepLogSeg(recptr, &_logSegNo);
    _logSegNo--;
    RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr);

    /*
     * Make more log segments if needed.  (Do this after recycling old log
     * segments, since that may supply some of the needed files.)
     */
    if (!shutdown)
        PreallocXlogFiles(recptr);
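First, UpdateCheckPointDistanceEstimate estimates how much XLOG is generated between two checkpoints. If the previous estimate is smaller than the amount just measured, the estimate jumps straight up to the new value. Otherwise it decays slowly toward the new value through a weighted moving average: CheckPointDistanceEstimate = (0.90 * CheckPointDistanceEstimate + 0.10 * (double) nbytes);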
    static void
    UpdateCheckPointDistanceEstimate(uint64 nbytes)
    {
        PrevCheckPointDistance = nbytes;
        if (CheckPointDistanceEstimate < nbytes)
            CheckPointDistanceEstimate = nbytes;
        else
            CheckPointDistanceEstimate =
                (0.90 * CheckPointDistanceEstimate + 0.10 * (double) nbytes);
    }
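To make the asymmetry of this estimator concrete, here is a minimal standalone sketch of the same update rule. The function name `next_estimate` and the choice to pass the current estimate as a parameter (instead of keeping it in a static variable, as xlog.c does) are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical standalone version of the update rule shown above.
 * current : the estimate before this checkpoint
 * nbytes  : WAL bytes written since the previous checkpoint's redo point
 * Returns the new estimate.
 */
double
next_estimate(double current, uint64_t nbytes)
{
    assert(current >= 0.0);

    if (current < (double) nbytes)
        return (double) nbytes;     /* rising load: jump up immediately */

    /* falling load: move only 10% of the way toward the new value */
    return 0.90 * current + 0.10 * (double) nbytes;
}
```

For example, starting from an estimate of 100 bytes, a checkpoint cycle that wrote only 50 bytes lowers the estimate to about 95, while a cycle that wrote 200 bytes raises it to 200 at once. The estimate therefore reacts quickly to bursts and forgets them slowly.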
Next, the XLByteToSeg macro computes the WAL segment number (that is, which WAL file) containing the given position:
    #define XLByteToSeg(xlrp, logSegNo, wal_segsz_bytes) \
        logSegNo = (xlrp) / (wal_segsz_bytes)

    #define XLByteToPrevSeg(xlrp, logSegNo, wal_segsz_bytes) \
        logSegNo = ((xlrp) - 1) / (wal_segsz_bytes)
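As a worked example, the sketch below applies the same arithmetic to a plain 64-bit LSN and then formats the WAL file name. The helper names are hypothetical, and the file name layout (8 hex digits of timeline, followed by the segment number split into two 8-digit parts) reflects my understanding of the XLogFileName macro rather than code quoted in this post.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Segment containing the byte at lsn; same arithmetic as XLByteToSeg. */
uint64_t
lsn_to_segno(uint64_t lsn, uint64_t wal_segsz_bytes)
{
    assert(wal_segsz_bytes > 0);
    return lsn / wal_segsz_bytes;
}

/* Segment containing the byte just before lsn; as XLByteToPrevSeg. */
uint64_t
lsn_to_prev_segno(uint64_t lsn, uint64_t wal_segsz_bytes)
{
    assert(wal_segsz_bytes > 0 && lsn > 0);
    return (lsn - 1) / wal_segsz_bytes;
}

/* Write the 24-character WAL file name for (tli, segno) into buf. */
void
wal_file_name(char *buf, size_t buflen, uint32_t tli,
              uint64_t segno, uint64_t wal_segsz_bytes)
{
    /* number of segments that fit in one 4 GB "xlogid" */
    uint64_t segs_per_xlogid = UINT64_C(0x100000000) / wal_segsz_bytes;

    assert(segs_per_xlogid > 0);
    snprintf(buf, buflen, "%08" PRIX32 "%08" PRIX32 "%08" PRIX32,
             tli,
             (uint32_t) (segno / segs_per_xlogid),
             (uint32_t) (segno % segs_per_xlogid));
}
```

With 16 MB segments, the LSN 0/183D1A8 (0x183D1A8 bytes) falls in segment 1, and on timeline 1 the file name comes out as 000000010000000000000001. The two macros differ only at a segment boundary: an LSN of exactly 16 MB belongs to segment 1, but the byte just before it belongs to segment 0.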
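Then KeepLogSeg(recptr, &_logSegNo) computes the oldest segment number that must be retained, taking into account what standbys still need (replication slots) and wal_keep_segments. Files from segment _logSegNo onward must not be deleted; earlier ones are removed or recycled.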
    static void
    KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo)
    {
        XLogSegNo   segno;
        XLogRecPtr  keep;

        XLByteToSeg(recptr, segno, wal_segment_size);
        keep = XLogGetReplicationSlotMinimumLSN();

        /* compute limit for wal_keep_segments first */
        if (wal_keep_segments > 0)
        {
            /* avoid underflow, don't go below 1 */
            /* e.g. current segment is 5 and wal_keep_segments >= 5: keep everything back to segment 1 */
            if (segno <= wal_keep_segments)
                segno = 1;
            /* otherwise keep back to (segno - wal_keep_segments); earlier segments may be removed */
            else
                segno = segno - wal_keep_segments;
        }

        /* then check whether slots limit removal further */
        if (max_replication_slots > 0 && keep != InvalidXLogRecPtr)
        {
            XLogSegNo   slotSegNo;

            XLByteToSeg(keep, slotSegNo, wal_segment_size);

            /* if the slot needs a segment older than the wal_keep_segments limit, keep back to that segment */
            if (slotSegNo <= 0)
                segno = 1;
            else if (slotSegNo < segno)
                segno = slotSegNo;
        }

        /* don't delete WAL segments newer than the calculated segment */
        if (segno < *logSegNo)
            *logSegNo = segno;
    }
The overall structure is shown in the figure:
Figure 3.1
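The retention rule in KeepLogSeg can also be written as a pure function, which makes it easy to try out with different settings. This is only a sketch: the function name and parameters are hypothetical, segment numbers are passed in directly, and a slot LSN of 0 stands in for InvalidXLogRecPtr (folding in the max_replication_slots check).

```c
#include <assert.h>
#include <stdint.h>

#define INVALID_LSN UINT64_C(0)     /* stands in for InvalidXLogRecPtr */

/*
 * Return the oldest segment number that must be kept.
 *   candidate    : segment of the checkpoint's redo pointer (*logSegNo on entry)
 *   cur_segno    : segment of the current WAL position (recptr)
 *   keep_segs    : value of wal_keep_segments
 *   slot_min_lsn : oldest LSN required by any replication slot, or INVALID_LSN
 */
uint64_t
oldest_segment_to_keep(uint64_t candidate, uint64_t cur_segno,
                       uint64_t keep_segs, uint64_t slot_min_lsn,
                       uint64_t wal_segsz_bytes)
{
    uint64_t segno = cur_segno;

    assert(wal_segsz_bytes > 0);

    /* wal_keep_segments limit, never going below segment 1 */
    if (keep_segs > 0)
        segno = (segno <= keep_segs) ? 1 : segno - keep_segs;

    /* replication slots may push the limit further back */
    if (slot_min_lsn != INVALID_LSN)
    {
        uint64_t slot_segno = slot_min_lsn / wal_segsz_bytes;

        if (slot_segno == 0)
            segno = 1;
        else if (slot_segno < segno)
            segno = slot_segno;
    }

    /* only ever move the limit backward, never forward */
    return (segno < candidate) ? segno : candidate;
}
```

With the redo pointer in segment 100 and the current position in segment 102, the result is 100 when nothing else applies, 92 with wal_keep_segments = 10, and 50 if a slot still needs data from segment 50. CreateCheckPoint then decrements the result, so the last removable segment is one less than the value returned here.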
Then the pg_xlog directory is scanned, and RemoveOldXlogFiles and RemoveXlogFile remove or recycle the files:
    static void
    RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr RedoRecPtr, XLogRecPtr endptr)
    {
        DIR        *xldir;
        struct dirent *xlde;
        char        lastoff[MAXFNAMELEN];
        ...
        ...
            /*
             * We ignore the timeline part of the XLOG segment identifiers in
             * deciding whether a segment is still needed.  This ensures that we
             * won't prematurely remove a segment from a parent timeline. We could
             * probably be a little more proactive about removing segments of
             * non-parent timelines, but that would be a whole lot more
             * complicated.
             *
             * We use the alphanumeric sorting property of the filenames to decide
             * which ones are earlier than the lastoff segment.
             */
            if (strcmp(xlde->d_name + 8, lastoff + 8) <= 0)
            {
                if (XLogArchiveCheckDone(xlde->d_name))
                {
                    /* Update the last removed location in shared memory first */
                    UpdateLastRemovedPtr(xlde->d_name);

                    RemoveXlogFile(xlde->d_name, RedoRecPtr, endptr);
                }
            }
    }
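The strcmp call above skips the first 8 characters of each name, which is the timeline ID, and compares only the 16 hex digits of the segment number. A small sketch of that comparison follows. The function name is hypothetical, and the real code additionally requires XLogArchiveCheckDone to succeed before a file is touched.

```c
#include <assert.h>
#include <string.h>

/*
 * Return 1 if the WAL file `name` is at or before the cutoff file
 * `lastoff`, ignoring the timeline (the first 8 characters of each name).
 * Both arguments are assumed to be 24-character WAL segment file names.
 */
int
segment_is_old(const char *name, const char *lastoff)
{
    assert(strlen(name) == 24 && strlen(lastoff) == 24);

    /* fixed-width uppercase hex sorts the same way as the numbers */
    return strcmp(name + 8, lastoff + 8) <= 0;
}
```

So 000000010000000000000003 counts as old relative to a cutoff of 000000020000000000000005 even though the timelines differ, while 000000020000000000000006 does not count as old relative to 000000010000000000000005.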
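RemoveXlogFile computes recycleSegNo, the highest future segment number that a recycled file may be renamed to. If this is the first checkpoint (RedoRecPtr is InvalidXLogRecPtr), recycleSegNo = endlogSegNo + 10, where endlogSegNo is the current segment number. Otherwise it is computed by XLOGfileslop.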
    static void
    RemoveXlogFile(const char *segname, XLogRecPtr RedoRecPtr, XLogRecPtr endptr)
    {
        char        path[MAXPGPATH];
    #ifdef WIN32
        char        newpath[MAXPGPATH];
    #endif
        struct stat statbuf;
        XLogSegNo   endlogSegNo;
        XLogSegNo   recycleSegNo;

        if (wal_recycle)
        {
            /*
             * Initialize info about where to try to recycle to.
             */
            XLByteToSeg(endptr, endlogSegNo, wal_segment_size);
            if (RedoRecPtr == InvalidXLogRecPtr)
                recycleSegNo = endlogSegNo + 10;
            else
                recycleSegNo = XLOGfileslop(RedoRecPtr);
        }
This is also where the wal_recycle parameter comes into play. The official documentation describes it as follows: "If set to on (the default), this option causes WAL files to be recycled by renaming them, avoiding the need to create new ones. On COW file systems, it may be faster to create new ones, so the option is given to disable this behavior." It defaults to on. In PostgreSQL 12 it can be set to off so that WAL files are not recycled.
The function XLOGfileslop is as follows:
    static XLogSegNo
    XLOGfileslop(XLogRecPtr RedoRecPtr)
    {
        XLogSegNo   minSegNo;
        XLogSegNo   maxSegNo;
        double      distance;
        XLogSegNo   recycleSegNo;

        /*
         * Calculate the segment numbers that min_wal_size_mb and max_wal_size_mb
         * correspond to. Always recycle enough segments to meet the minimum, and
         * remove enough segments to stay below the maximum.
         */
        minSegNo = RedoRecPtr / wal_segment_size +
            ConvertToXSegs(min_wal_size_mb, wal_segment_size) - 1;
        maxSegNo = RedoRecPtr / wal_segment_size +
            ConvertToXSegs(max_wal_size_mb, wal_segment_size) - 1;

        /*
         * Between those limits, recycle enough segments to get us through to the
         * estimated end of next checkpoint.
         *
         * To estimate where the next checkpoint will finish, assume that the
         * system runs steadily consuming CheckPointDistanceEstimate bytes between
         * every checkpoint.
         */
        distance = (1.0 + CheckPointCompletionTarget) * CheckPointDistanceEstimate;
        /* add 10% for good measure. */
        distance *= 1.10;

        recycleSegNo = (XLogSegNo) ceil(((double) RedoRecPtr + distance) /
                                        wal_segment_size);

        if (recycleSegNo < minSegNo)
            recycleSegNo = minSegNo;
        if (recycleSegNo > maxSegNo)
            recycleSegNo = maxSegNo;

        return recycleSegNo;
    }
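To see how the clamping between min_wal_size and max_wal_size plays out numerically, the following sketch re-implements the calculation as a pure function. All names are hypothetical, every input is passed as a parameter instead of being read from globals, and `mb_to_segs` assumes that ConvertToXSegs simply divides the megabyte value by the segment size in megabytes. The ceiling is computed by hand so that the example does not depend on the math library.

```c
#include <assert.h>
#include <stdint.h>

/* Number of segments in `mb` megabytes (assumed ConvertToXSegs behavior). */
uint64_t
mb_to_segs(uint64_t mb, uint64_t wal_segsz_bytes)
{
    assert(wal_segsz_bytes >= 1024 * 1024);
    return mb / (wal_segsz_bytes / (1024 * 1024));
}

/* Highest segment number worth recycling into, per the logic above. */
uint64_t
recycle_limit(uint64_t redo_lsn, double distance_estimate,
              double completion_target,
              uint64_t min_wal_size_mb, uint64_t max_wal_size_mb,
              uint64_t wal_segsz_bytes)
{
    uint64_t redo_segno = redo_lsn / wal_segsz_bytes;
    uint64_t min_segno = redo_segno + mb_to_segs(min_wal_size_mb, wal_segsz_bytes) - 1;
    uint64_t max_segno = redo_segno + mb_to_segs(max_wal_size_mb, wal_segsz_bytes) - 1;

    /* expected WAL until the next checkpoint finishes, plus 10% */
    double distance = (1.0 + completion_target) * distance_estimate * 1.10;
    double segs = ((double) redo_lsn + distance) / (double) wal_segsz_bytes;

    /* ceiling of segs without math.h */
    uint64_t recycle_segno = (uint64_t) segs;
    if ((double) recycle_segno < segs)
        recycle_segno++;

    if (recycle_segno < min_segno)
        recycle_segno = min_segno;
    if (recycle_segno > max_segno)
        recycle_segno = max_segno;
    return recycle_segno;
}
```

Take 16 MB segments, a redo point at the start of segment 100, min_wal_size = 80MB (5 segments) and max_wal_size = 1GB (64 segments), so the limits are segments 104 and 163. With an estimate of 10 segments per cycle and a completion target of 0.5, the distance is 1.5 * 10 * 1.1 = 16.5 segments, giving 117. An idle system is clamped up to 104, and an estimate of 100 segments per cycle is clamped down to 163.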
If the current segment number endlogSegNo <= recycleSegNo, InstallXLogFileSegment is called to recycle the file:
    if (wal_recycle &&
        endlogSegNo <= recycleSegNo &&
        lstat(path, &statbuf) == 0 && S_ISREG(statbuf.st_mode) &&
        InstallXLogFileSegment(&endlogSegNo, path,
                               true, recycleSegNo, true))
    {
        ereport(DEBUG2,
                (errmsg("recycled write-ahead log file \"%s\"",
                        segname)));
        CheckpointStats.ckpt_segs_recycled++;
        /* Needn't recheck that slot on future iterations */
        endlogSegNo++;
    }
    else
    {
        /* No need for any more future segments... */
        int         rc;

        ereport(DEBUG2,
                (errmsg("removing write-ahead log file \"%s\"",
                        segname)));
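1) Look for a free slot number between endlogSegNo and recycleSegNo, that is, a segment number for which no xlog file exists yet
2) Rename the WAL file that was going to be deleted to the file name for that free slot
3) If no free slot is found, delete the WAL file outright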
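The free-slot search in step 1 can be sketched as follows. This is a simplified model, not the actual InstallXLogFileSegment code: the real function checks for existing files with stat() on disk, whereas here a hypothetical boolean array `exists` says which segment numbers already have a file, and any number beyond the array is treated as free.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Starting at *segno, advance to the first segment number that has no
 * file yet. Returns false if every number up to max_segno is taken, in
 * which case the caller removes the old file instead of renaming it.
 */
bool
find_free_slot(const bool *exists, uint64_t nsegs,
               uint64_t *segno, uint64_t max_segno)
{
    assert(exists != NULL && segno != NULL);

    while (*segno < nsegs && exists[*segno])
    {
        if (*segno >= max_segno)
            return false;       /* no free slot within the allowed range */
        (*segno)++;
    }
    return true;                /* *segno now names a free slot */
}
```

If segments 0 and 1 exist, segment 2 is missing, and the limit is 3, the search starting at 0 stops at 2 and the old file would be renamed to segment 2. If segments 0 through 2 all exist and the limit is 2, the search fails and the old file is removed.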
The WAL recycling mechanism is shown below:
Figure 3.2
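Whenever a checkpoint starts, PostgreSQL estimates and prepares the WAL segment files needed for the next checkpoint cycle. The estimate is based on the amount of WAL consumed in previous checkpoint cycles, counted from the segment file containing the prior redo point. This is the job of the UpdateCheckPointDistanceEstimate function discussed above. The resulting number of files is kept between min_wal_size and max_wal_size. When a checkpoint starts, the necessary segment files are kept or recycled, and the unnecessary ones are removed.
The comments in the source code also note that the number of WAL files adapts automatically to server activity. If the volume of WAL being written keeps growing, the estimated number of WAL segment files and the total size of the WAL files grow with it, and vice versa.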
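PostgreSQL 11 removed the secondary checkpoint, that is, the prior redo point. For details, see:
https://paquier.xyz/postgresql-2/postgres-11-secondary-checkpoint/
https://www.postgresql-archive.org/Remove-secondary-checkpoint-tt5989050.html
As a result, WAL is retained for only the latest checkpoint instead of the last two, so the set of retained WAL files differs from Figure 3.2, as shown below:
Figure 4.1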
In PostgreSQL 9.1, the pg_control file contains the following, including a Prior checkpoint location:
[xiongcancan@localhost ~]$ pg_controldata pg_data/ | grep checkpoint
Latest checkpoint location: 0/183D1A8
Prior checkpoint location: 0/183D140
Latest checkpoint's REDO location: 0/183D1A8
Latest checkpoint's REDO WAL file: 000000010000000000000001
Latest checkpoint's TimeLineID: 1
Latest checkpoint's PrevTimeLineID: 1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 0/1821
Latest checkpoint's NextOID: 24576
Latest checkpoint's NextMultiXactId: 1
Latest checkpoint's NextMultiOffset: 0
Latest checkpoint's oldestXID: 1808
Latest checkpoint's oldestXID's DB: 1
Latest checkpoint's oldestActiveXID: 0
Latest checkpoint's oldestMultiXid: 1
Latest checkpoint's oldestMulti's DB: 1
Time of latest checkpoint: Wed 12 Feb 2020 10:04:37 AM PST
In PostgreSQL 12.1, the pg_control file contains the following, with no Prior checkpoint location stored:
-bash-4.1$ pg_controldata -D ~/12/data/ | grep checkpoint
Latest checkpoint location: 0/1823C88
Latest checkpoint's REDO location: 0/1823C88
Latest checkpoint's REDO WAL file: 000000010000000000000001
Latest checkpoint's TimeLineID: 1
Latest checkpoint's PrevTimeLineID: 1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 0:538
Latest checkpoint's NextOID: 65536
Latest checkpoint's NextMultiXactId: 1
Latest checkpoint's NextMultiOffset: 0
Latest checkpoint's oldestXID: 479
Latest checkpoint's oldestXID's DB: 1
Latest checkpoint's oldestActiveXID: 0
Latest checkpoint's oldestMultiXid: 1
Latest checkpoint's oldestMulti's DB: 13836
Latest checkpoint's oldestCommitTsXid:0
Latest checkpoint's newestCommitTsXid:0
Time of latest checkpoint: Wed 12 Feb 2020 06:17:07 AM PST