日常开发中,除了开辟新项目,业务需求开发,一般还要做负责系统的日常运维。比如线上告警了,出bug了,必须及时修复。这天,运维反馈mysql cpu告警了,然后抓了该时间节点的慢sql日志,要开发分析解决。
拿到的慢sql日志:
# Query 1: 1.16 QPS, 1.96x concurrency, ID 0x338A0AEE1CFE3C1D at byte 7687104 # This item is included in the report because it matches --limit. # Scores: V/M = 0.02 # Time range: 2022-08-12T16:30:00 to 2022-08-12T17:11:32 # Attribute pct total min max avg 95% stddev median # ============ === ======= ======= ======= ======= ======= ======= ======= # Count 99 2880 # Exec time 99 4893s 1s 2s 2s 2s 172ms 2s # Lock time 99 187ms 52us 343us 64us 84us 11us 60us # Rows sent 97 248 0 1 0.09 0.99 0.28 0 # Rows examine 96 871.46M 308.56k 311.13k 309.85k 298.06k 0 298.06k # Query size 99 812.81k 289 289 289 289 0 289 # String: # Hosts 10.22.9.183 (742/25%), 10.26.9.126 (730/25%)... 2 more # Users order # Query_time distribution # 1us # 10us # 100us # 1ms # 10ms # 100ms # 1s ################################################################ # 10s+ # Tables # SHOW TABLE STATUS LIKE 'serial_number_store'\G # SHOW CREATE TABLE `serial_number_store`\G # EXPLAIN /*!50100 PARTITIONS*/ select * from serial_number_store sn where 1=1 and company_code = '8511378117' and warehouse_code = '851' and sku_no = '6902952880' and (serial_no = '5007894' or sub_serial_no = 'v')\G
查询数据库定义,发现定义了几个index
PRIMARY KEY (`ID`), KEY `IDX_SERIAL_NUMBER_2` (`WAREHOUSE_CODE`), KEY `IDX_SERIAL_NUMBER_3` (`SKU_NO`), KEY `IDX_SERIAL_NUMBER_4` (`SERIAL_NO`), KEY `IDX_SERIAL_NUMBER_5` (`SUB_SERIAL_NO`), KEY `IDX_SERIAL_NUMBER_6` (`SKU_NAME`), KEY `IDX_SERIAL_NUMBER_1` (`COMPANY_CODE`,`WAREHOUSE_CODE`,`SKU_NO`,`SERIAL_NO`) USING BTREE
按最左匹配原则,这条sql应该只会命中一个索引。因为or的另一半无法match。
explain发现实际执行计划:
key: IDX_SERIAL_NUMBER_3 key_len: 259 ref: const rows: 45864 filtered: 0.95 Extra: Using where
表总数量: 13658763
or的优化技巧之一就是拆成2个可以命中索引的sql, 然后union all.
优化为union all
explain select * from serial_number_store sn where company_code = '9311046897' and warehouse_code = '931DCA' and sku_no = '6935117818696' and serial_no = '862517054251459' union all select * from serial_number_store sn where company_code = '9311046897' and warehouse_code = '931DCA' and sku_no = '6935117818696' and sub_serial_no = '862517054251459';
最终explain
key: IDX_SERIAL_NUMBER_4 IDX_SERIAL_NUMBER_5 ref: const const rows: 1 1 filtered: 5.0 5.0 extra: using where
正常到这里,找到解决方案,就算完事了。但作为线上问题的处理,你得分析为啥以前没事,现在出问题了。
查询对应的链路追踪情况:
和猜测一致,短时间内批量查询。几乎每条sql2s多耗时。虽然是后台任务,但数据量太大导致cpu 100%.
定位实际的代码,mybatis是这么写:
<sql id="servialNumberStoreEntityParams"> <if test="id!=null and id!=''"> and ID = #{id}</if> <if test="companyCode!=null and companyCode!=''"> and company_code = #{companyCode}</if> <if test="warehouseCode!=null and warehouseCode!=''"> and warehouse_code = #{warehouseCode}</if> <if test="sku!=null and sku!=''"> and sku_no = #{sku}</if> <if test="serialNo!=null and serialNo!=''"> and (serial_no = #{serialNo} or sub_serial_no = #{serialNo})</if> <if test="lotNum!=null and lotNum!=''"> and lot_num = #{lotNum}</if> </sql>
这个查询片段有多个sql引用了。比如
select * from serial_number_store sn where 1=1 <include refid="servialNumberStoreEntityParams" />
改造成union也不是不行,比如
select * from serial_number_store sn where 1=1 <include refid="servialNumberStoreEntityParams" /> <if test="serialNo!=null and serialNo!=''"> and serial_no = #{serialNo} union all select * from cwsp_tb_serial_number_store sn where 1=1 <include refid="servialNumberStoreEntityParams" /> and sub_serial_no = #{serialNo} </if>
但前面说了多个片段引用了,对应多个sql查询方法,然后这多个sql查询方法又会对应多个业务调用。那问题来了,如果改完要测的话,业务场景该怎么测?一时犹豫了,要不要再花额外的时间去搞回归测试,验证。
和运维小哥说,反正是个后台任务,先不改吧。运维看没影响到业务(没人投诉)也就不管了。
然后第二天上班又收到了告警。逃不掉了。
定位代码的时候,发现有个update
<update id="update"> update serial_number_store <set> <if test="companyCode!=null and companyCode!=''"> COMPANY_CODE = #{companyCode},</if> <if test="warehouseCode!=null and warehouseCode!=''"> WAREHOUSE_CODE = #{warehouseCode},</if> <if test="sku!=null and sku!=''"> SKU_NO = #{sku},</if> <if test="serialNo!=null and serialNo!=''"> SERIAL_NO = #{serialNo},</if> <if test="subSerialNo!=null and subSerialNo!=''"> SUB_SERIAL_NO = #{subSerialNo},</if> <if test="erpno!=null and erpno!=''"> ERP_ORDER = #{erpno},</if> <if test="docType!=null and docType!=''"> DOCTYPE = #{docType},</if> <if test="editTime!=null and editTime!=''"> EDITTM = #{editTime},</if> <if test="editWho!=null and editWho!=''"> EDITEMP = #{editWho},</if> </set> where 1=1 <include refid="servialNumberStoreEntityParams" /> </update>
这种sql,假如参数没传,岂不是全表被覆盖? 当然,也能改。前提是梳理调用链路,把这些sql引用的业务场景梳理一遍,确定入参场景,然后修改,然后再模拟场景做测试。想想整个流程,1天不知道搞不搞的定,测试上线等等,还有更长的流程。
这种在设计之初就应该做好优化设计而不是出了问题再改,但当接手古老系统的时候,开发可能换了一波又一波了,这时候除了吐槽之外,只能填坑。与此同时,自己所开发的代码,在若干时间后,也许会被另外一个人吐槽(如果自己发现的坑是自己挖的,自然不会吐槽自己)