作为一个DBA难免不了会遇到性能问题,那么我们遇到性能问题该如何进行排查呢?例如我们在高并发的业务下,出现业务响应慢,处理时间长我们又该如何入手进行排查,本片文章将分析io高的情况下如何分析及定位。
Shell> sysbench --test=/usr/local/share/sysbench/oltp_insert.lua --mysql-host=XXX --mysql-port=3306 --mysql-user=pcms --mysql-password=abc123 --mysql-db=sysbench --percentile=99 --table-size=2000000000 --tables=5 --threads=1000 prepare
使用sysbench进行模拟高并发
shell> sysbench --test=/usr/local/share/sysbench/oltp_write_only.lua --mysql-host=xxx --mysql-port=3306 --mysql-user=pcms --mysql-password=abc123 --mysql-db=sysbench --percentile=99 --table-size=2000000000 --tables=5 --threads=1000 --max-time=60000 --report-interval=1 --threads=1000 --max-requests=0 --mysql-ignore-errors=all run
执行笛卡尔积sql语句
mysql> select SQL_NO_CACHE b.id,a.k from sbtest_a a left join sbtest_b b on a.id=b.id group by a.k order by b.c desc;
shell> top top - 19:49:05 up 10 days, 8:16, 2 users, load average: 72.56, 40.21, 17.08 Tasks: 1288 total, 1 running, 586 sleeping, 0 stopped, 0 zombie %Cpu(s): 19.7 us, 4.2 sy, 0.0 ni, 75.9 id, 1.0 wa, 0.0 hi, 0.2 si, 0.0 st KiB Mem : 53542118+total, 23667507+free, 22735366+used, 71392448 buff/cache KiB Swap: 0 total, 0 free, 0 used. 23128006+avail Mem
由上可知:目前一分钟负载为72.56,且呈上升趋势,并且存在io压力
shell> iostat -m -x 1 Linux 4.14.0-115.el7a.0.1.aarch64 (mysql-4) 01/08/2022 _aarch64_ (128 CPU) Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0.00 0.00 14213.00 27430.00 222.08 465.15 33.80 5.39 0.13 0.14 0.12 0.02 86.00 sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
由上可知:目前有多块物理磁盘,sda磁盘的io压力较大
shell> iostat -d /dev/sda -m -x 1 Linux 4.14.0-115.el7a.0.1.aarch64 (mysql-4) 01/08/2022 _aarch64_ (128 CPU) Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0.00 0.17 7.86 81.23 0.29 3.96 97.88 0.23 2.53 0.22 2.76 0.04 0.33 Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0.00 0.00 21109.00 42839.00 329.81 710.90 33.33 19.47 0.30 0.16 0.37 0.02 96.00 Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0.00 0.00 19287.00 41404.00 301.36 692.29 33.53 15.73 0.26 0.18 0.30 0.02 93.00 Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0.00 0.00 22135.00 43044.00 345.86 1165.18 47.48 100.87 1.55 0.20 2.24 0.01 97.00
由上可知:目前sda磁盘的压力比较大,每秒写入比每秒读差距较大,证明目前有大量的io写入
shell> pidstat -d 1 Linux 4.14.0-115.el7a.0.1.aarch64 (mysql-4) 01/08/2022 _aarch64_ (128 CPU) 08:01:43 PM UID PID kB_rd/s kB_wr/s kB_ccwr/s Command 08:01:44 PM 1000 73739 62018.35 171346.79 0.00 mysqld 08:01:44 PM UID PID kB_rd/s kB_wr/s kB_ccwr/s Command 08:01:45 PM 1000 73739 145328.00 435216.00 0.00 mysqld 08:01:45 PM UID PID kB_rd/s kB_wr/s kB_ccwr/s Command 08:01:46 PM 1000 73739 141488.00 433232.00 0.00 mysqld
由上可知:占用io高的应用程序是mysql,且pid为73739
shell> pidstat -dt -p 73739 1 执行两三秒即可 Average: 1000 - 73823 0.00 233133.98 0.00 |__mysqld Average: 1000 - 74674 0.00 174291.26 0.00 |__mysqld 11:56:18 PM 1000 - 74770 124928.00 74688.00 0.00 |__mysqld 11:56:17 PM 1000 - 74770 124603.77 73358.49 0.00 |__mysqld Average: 1000 - 74770 124761.17 74003.88 0.00 |__mysqld
由上可知:74770这个线程占用的io比较高
Shell> perf trace -t 74770 -o /tmp/tmp_aa.pstrace Shell> cat /tmp/tmp_aa.pstrace 2850.656 ( 1.915 ms): futex(uaddr: 0x653ae9c4, op: WAIT|PRIVATE_FLAG, val: 1) = 0 2852.572 ( 0.001 ms): futex(uaddr: 0x653ae990, op: WAKE|PRIVATE_FLAG, val: 1) = 0 2852.601 ( 0.001 ms): clock_gettime(which_clock: MONOTONIC, tp: 0xfff7bd470f68) = 0 2852.690 ( 0.040 ms): write(fd: 159, buf: 0xd7a30020, count: 65536) = 65536 2852.796 ( 0.001 ms): clock_gettime(which_clock: MONOTONIC, tp: 0xfff7bd470f68) = 0 2852.798 ( 0.001 ms): clock_gettime(which_clock: MONOTONIC, tp: 0xfff7bd470f58) = 0 2852.939 ( 0.001 ms): clock_gettime(which_clock: MONOTONIC, tp: 0xfff7bd470f38) = 0 2852.950 ( 0.001 ms): clock_gettime(which_clock: MONOTONIC, tp: 0xfff7bd470f68) = 0 2852.977 ( 0.001 ms): clock_gettime(which_clock: MONOTONIC, tp: 0xfff7bd470f68) = 0 2853.029 ( 0.035 ms): write(fd: 64, buf: 0xcd51e020, count: 65536) = 65536 2853.164 ( 0.001 ms): clock_gettime(which_clock: MONOTONIC, tp: 0xfff7bd470f68) = 0 2853.167 ( 0.001 ms): clock_gettime(which_clock: MONOTONIC, tp: 0xfff7bd470f58) = 0 2853.302 ( 0.001 ms): clock_gettime(which_clock: MONOTONIC, tp: 0xfff7bd470f38) = 0
由上可知:目前这个线程在写入多个文件,fd为文件句柄,文件句柄号有64、159
shell> lsof -p 73739|grep 159u mysqld 73739 mysql 159u REG 8,0 212143246 7046482357 /mysql/mysqldata/16320fff-5fd5-4c47-889a-a9e1a8591d0d/tmp/#7046482357 (deleted) [root@mysql-4 ~]# lsof -p 73739|grep 64u mysqld 73739 mysql 64u REG 8,0 211872724 6979323031 /mysql/mysqldata/16320fff-5fd5-4c47-889a-a9e1a8591d0d/tmp/#6979323031 (deleted)
由上可知:这个线程在大量的写入临时文件
mysql> select * from information_schema.processlist where command !='sleep'; | 9 | pcms | 172.16.76.12:57596 | sysbench | Query | 67 | executing | select SQL_NO_CACHE b.id,a.k from sbtest_a a left join sbtest_b b on a.id=b.id group by a.k order by b.c desc | 66477 | 0 | 0 |
由上可知:目前这个sql已经执行了67s,且此sql使用了group by和order by,必然会产生io
mysql> select * from threads where thread_os_id=74770\G; *************************** 1. row *************************** THREAD_ID: 95 NAME: thread/sql/one_connection TYPE: FOREGROUND PROCESSLIST_ID: 9 PROCESSLIST_USER: pcms PROCESSLIST_HOST: 172.16.76.12 PROCESSLIST_DB: sysbench PROCESSLIST_COMMAND: Query PROCESSLIST_TIME: 91 PROCESSLIST_STATE: NULL PROCESSLIST_INFO: select SQL_NO_CACHE b.id,a.k from sbtest_a a left join sbtest_b b on a.id=b.id group by a.k order by b.c desc PARENT_THREAD_ID: 1 ROLE: NULL INSTRUMENTED: YES HISTORY: YES CONNECTION_TYPE: TCP/IP THREAD_OS_ID: 74770 RESOURCE_GROUP: USR_default 1 row in set (0.00 sec)
由上可知:通过查询threads表可以进行验证,该线程在频繁创建临时表的原因就来源于此sql
mysql> explain select SQL_NO_CACHE b.id,a.k from sbtest_a a left join sbtest_b b on a.id=b.id group by a.k order by b.c desc\G; *************************** 1. row *************************** id: 1 select_type: SIMPLE table: a partitions: NULL type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 1 filtered: 100.00 Extra: Using temporary; Using filesort *************************** 2. row *************************** id: 1 select_type: SIMPLE table: b partitions: NULL type: eq_ref possible_keys: PRIMARY key: PRIMARY key_len: 4 ref: sysbench.a.id rows: 1 filtered: 100.00 Extra: NULL 2 rows in set, 2 warnings (0.00 sec)
由上可知:该sql的执行计划用到了临时表及临时文件,符合
mysql> show global status like '%tmp%'; +-------------------------+-------+ | Variable_name | Value | +-------------------------+-------+ | Created_tmp_disk_tables | 3 | | Created_tmp_files | 22 | | Created_tmp_tables | 8 | +-------------------------+-------+
多执行几次,可以看出tmp_files和tmp_disk_tables的值在增长,证明在大量的创建临时文件及磁盘临时表,符合该线程的行为
通过上述的一系列排查,我们已经分析出来,目前sda磁盘的io使用率最高,且mysqld程序占用的最多,通过排查有一个线程在频繁的创建临时表或临时文件且通过登录mysql排查会话及线程视图可以找到是由某一个慢sql导致的,查看此慢sql的执行计划也会创建临时表和临时文件符合我们之前排查的预期。
此时我们就需要针对此慢sql进行优化,优化步骤由DBA进行处理,此处进行忽略。慢sql优化完成后可以进行io的继续观察,看io是否有下降
我们可以使用pstack进行跟踪线程号,获取当前的线程堆栈信息。切记pstack会调用gdb进行debug调试
shell> pstack 74770 >/tmp/74770.pstack Thread 1 (process 74770): #0 ha_innobase::general_fetch (this=0xea654228, buf=0xea662028 "\212t\317\030\002", direction=1, match_mode=0 ) at /builds/naiwei.fang/percona-server/storage/innobase/handler/ha_innodb.cc:11159 #1 0x0000000000d9913c in handler::ha_rnd_next (this=0xea654228, buf=0xea662028 "\212t\317\030\002") at /build s/naiwei.fang/percona-server/sql/handler.cc:3173 #2 0x0000000000f77db0 in TableScanIterator::Read (this=0xd256d5e8) at /builds/naiwei.fang/percona-server/sql/ row_iterator.h:208 #3 0x000000000124c714 in WriteRowsToChunks (xxhash_seed=899339, write_to_build_chunk=true, write_rows_with_nu ll_in_join_key=false, join_key_buffer=0xd01fdb98, tables_to_get_rowid_for=0, chunks=0xd01fdb58, join_condition s=..., tables=..., iterator=0xd256d5e8, thd=0xdb888000) at /builds/naiwei.fang/percona-server/sql/hash_join_it erator.cc:282 #4 HashJoinIterator::BuildHashTable (this=this@entry=0xd01fd028) at /builds/naiwei.fang/percona-server/sql/ha sh_join_iterator.cc:495 #5 0x000000000124c8ac in Init (this=0xd01fd028) at /builds/naiwei.fang/percona-server/sql/hash_join_iterator. cc:203 #6 HashJoinIterator::Init (this=0xd01fd028) at /builds/naiwei.fang/percona-server/sql/hash_join_iterator.cc:1 45 #7 0x00000000010eca14 in Query_expression::ExecuteIteratorQuery (this=0xdec3a8b8, thd=thd@entry=0xdb888000) a t /builds/naiwei.fang/percona-server/sql/sql_union.cc:1224 #8 0x00000000010ecccc in Query_expression::execute (this=this@entry=0xdec3a8b8, thd=thd@entry=0xdb888000) at /builds/naiwei.fang/percona-server/sql/sql_union.cc:1284 #9 0x0000000001083db0 in Sql_cmd_dml::execute_inner (this=0xd256bcb0, thd=0xdb888000) at /builds/naiwei.fang/ percona-server/sql/sql_select.cc:791 #10 0x000000000108cac8 in Sql_cmd_dml::execute (this=0xd256bcb0, thd=0xdb888000) at /builds/naiwei.fang/percon a-server/sql/sql_select.cc:575 #11 0x00000000010384e8 in mysql_execute_command (thd=thd@entry=0xdb888000, first_level=first_level@entry=true) at /builds/naiwei.fang/percona-server/sql/sql_parse.cc:4677 #12 0x000000000103b314 in dispatch_sql_command (thd=thd@entry=0xdb888000, parser_state=parser_state@entry=0xff f7bd4735b0, update_userstat=update_userstat@entry=false) at /builds/naiwei.fang/percona-server/sql/sql_parse.c c:5273 #13 0x000000000103ccf0 in dispatch_command (thd=thd@entry=0xdb888000, com_data=0xffffb467c4d0, com_data@entry= 0xfff7bd474640, command=COM_QUERY) at /builds/naiwei.fang/percona-server/sql/sql_parse.cc:1938 #14 0x000000000103da40 in do_command (thd=thd@entry=0xdb888000) at /builds/naiwei.fang/percona-server/sql/sql_ parse.cc:1386 #15 0x0000000001152ca8 in handle_connection (arg=arg@entry=0xda53ab10) at /builds/naiwei.fang/percona-server/s ql/conn_handler/connection_handler_per_thread.cc:307 #16 0x00000000022bc3ec in pfs_spawn_thread (arg=<optimized out>) at /builds/naiwei.fang/percona-server/storage /perfschema/pfs.cc:2899 #17 0x0000ffffb43c7c48 in start_thread () from /lib64/libpthread.so.0 #18 0x0000ffffb3c0f600 in thread_start () from /lib64/libc.so.6
Enjoy GreatSQL :)