author:sufei
版本:xtrabackup 2.4.12
说明:生产使用备份工具xtrabackup造成OOM,本文作为一个记录,并对相关显现进行分享说明。
一、现象说明
生产环境中,在使用xtrabaxkup进行备份时,触发了死锁,这个死锁的原因这里就不多说了,有很多相关分析,可以看截图如下:
xtrabaxkup是卡在ftwrl中的commit lock等待中。
本来这是一个很正常的现象,但是这次在出现的hang时,但是xtrabackup进程会一直在申请内存,最后把系统内存给吃光了,产生OOM,最后触发xtrabackup进程被kill。下面是相关的系统日志:
xxx-crm-db-043 telegraf: 2020/09/24 09:17:38 authenticated: id=176476686980599134, timeout=40000
xxx-crm-db-043 telegraf: 2020/09/24 09:17:38 re-submitting `0` credentials after reconnect
xxx-crm-db-043 systemd-logind: Failed to save session data /run/systemd/sessions/13683961: No space left on device
xxx-crm-db-043 systemd-logind: Failed to save user data /run/systemd/users/1006: No space left on device
xxx-crm-db-043 systemd-logind: Removed session 13683961.
xxx-crm-db-043 systemd-logind: Failed to save user data /run/systemd/users/1006: No space left on device
xxx-crm-db-043 kernel: kthreadd invoked oom-killer: gfp_mask=0x3000d0, order=2, oom_score_adj=0
xxx-crm-db-043 kernel: kthreadd cpuset=/ mems_allowed=0
xxx-crm-db-043 kernel: CPU: 2 PID: 2 Comm: kthreadd Not tainted 3.10.0-693.el7.x86_64 #1
xxx-crm-db-043 kernel: Hardware name: Huawei 2288H V5/BC11SPSCB0, BIOS 1.09 01/31/2019
xxx-crm-db-043 kernel: ffff880127360fd0 00000000ec0267f4 ffff880127383a20 ffffffff816a3d91
xxx-crm-db-043 kernel: ffff880127383ab0 ffffffff8169f186 ffffffff810e939c ffff880127133210
xxx-crm-db-043 kernel: ffff880127133228 0000000000000202 0000000000000202 ffff880127383aa0
xxx-crm-db-043 kernel: Call Trace:
xxx-crm-db-043 kernel: [<ffffffff816a3d91>] dump_stack+0x19/0x1b
xxx-crm-db-043 kernel: [<ffffffff8169f186>] dump_header+0x90/0x229
xxx-crm-db-043 kernel: [<ffffffff810e939c>] ? ktime_get_ts64+0x4c/0xf0
xxx-crm-db-043 kernel: [<ffffffff8113d36f>] ? delayacct_end+0x8f/0xb0
xxx-crm-db-043 kernel: [<ffffffff81186394>] oom_kill_process+0x254/0x3d0
xxx-crm-db-043 kernel: [<ffffffff81185e3d>] ? oom_unkillable_task+0xcd/0x120
xxx-crm-db-043 kernel: [<ffffffff81185ee6>] ? find_lock_task_mm+0x56/0xc0
xxx-crm-db-043 kernel: [<ffffffff81186bd6>] out_of_memory+0x4b6/0x4f0
xxx-crm-db-043 kernel: [<ffffffff8169fc8a>] __alloc_pages_slowpath+0x5d6/0x724
xxx-crm-db-043 kernel: [<ffffffff8118cd85>] __alloc_pages_nodemask+0x405/0x420
xxx-crm-db-043 kernel: [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
xxx-crm-db-043 kernel: [<ffffffff8108510d>] copy_process+0x1dd/0x1970
xxx-crm-db-043 kernel: [<ffffffff81029557>] ? __switch_to+0xd7/0x510
xxx-crm-db-043 kernel: [<ffffffff81086a51>] do_fork+0x91/0x320
xxx-crm-db-043 kernel: [<ffffffff816a8f8d>] ? __schedule+0x39d/0x8b0
xxx-crm-db-043 kernel: [<ffffffff81086d06>] kernel_thread+0x26/0x30
xxx-crm-db-043 kernel: [<ffffffff810b1341>] kthreadd+0x2c1/0x300
xxx-crm-db-043 kernel: [<ffffffff810b1080>] ? kthread_create_on_cpu+0x60/0x60
xxx-crm-db-043 kernel: [<ffffffff816b4f18>] ret_from_fork+0x58/0x90
xxx-crm-db-043 kernel: [<ffffffff810b1080>] ? kthread_create_on_cpu+0x60/0x60
xxx-crm-db-043 kernel: Mem-Info:
xxx-crm-db-043 kernel: active_anon:62418042 inactive_anon:1310674 isolated_anon:0#012 active_file:144 inactive_file:214 isolated_file:10#012 unevictable:0 dirty:16 writeback:4 unstable:0#012 slab_reclaimable:104852 slab_unreclaimable:67281#012 mapped:3721 shmem:709624 pagetables:188395 bounce:0#012 free:1283802 free_pcp:321 free_cma:0
xxx-crm-db-043 kernel: Node 0 DMA free:15896kB min:244kB low:304kB high:364kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15984kB managed:15900kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:4kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
xxx-crm-db-043 kernel: lowmem_reserve[]: 0 1379 257354 257354
xxx-crm-db-043 kernel: Node 0 DMA32 free:1045880kB min:21988kB low:27484kB high:32980kB active_anon:154856kB inactive_anon:161936kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):40kB present:1675200kB managed:1414856kB mlocked:0kB dirty:56kB writeback:0kB mapped:0kB shmem:8kB slab_reclaimable:15908kB slab_unreclaimable:2400kB kernel_stack:384kB pagetables:1616kB unstable:0kB bounce:0kB free_pcp:1268kB local_pcp:120kB free_cma:0kB writeback_tmp:0kB pages_scanned:16 all_unreclaimable? yes
xxx-crm-db-043 kernel: lowmem_reserve[]: 0 0 255974 255974
xxx-crm-db-043 kernel: Node 0 Normal free:4073432kB min:4073760kB low:5092200kB high:6110640kB active_anon:249517312kB inactive_anon:5080760kB active_file:916kB inactive_file:1884kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:266338304kB managed:262117464kB mlocked:0kB dirty:8kB writeback:16kB mapped:15012kB shmem:2838488kB slab_reclaimable:403500kB slab_unreclaimable:266720kB kernel_stack:35728kB pagetables:751964kB unstable:0kB bounce:0kB free_pcp:16kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:80302 all_unreclaimable? yes
xxx-crm-db-043 kernel: lowmem_reserve[]: 0 0 0 0
xxx-crm-db-043 kernel: Node 0 DMA: 0*4kB 1*8kB (U) 1*16kB (U) 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15896kB
xxx-crm-db-043 kernel: Node 0 DMA32: 358*4kB (UEM) 340*8kB (UEM) 202*16kB (UEM) 571*32kB (UEM) 437*64kB (UEM) 1014*128kB (UEM) 602*256kB (UEM) 402*512kB (UEM) 260*1024kB (UEM) 48*2048kB (UEM) 34*4096kB (UM) = 1047160kB
xxx-crm-db-043 kernel: Node 0 Normal: 8290*4kB (UEM) 3627*8kB (UEM) 1813*16kB (UEM) 7945*32kB (UEM) 4505*64kB (UEM) 2935*128kB (UEM) 1497*256kB (UEM) 533*512kB (UEM) 244*1024kB (EM) 877*2048kB (EM) 90*4096kB (M) = 4080144kB
xxx-crm-db-043 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
xxx-crm-db-043 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
xxx-crm-db-043 kernel: 718080 total pagecache pages
xxx-crm-db-043 kernel: 7571 pages in swap cache
xxx-crm-db-043 kernel: Swap cache stats: add 43936925, delete 43929354, find 255985525/257994660
xxx-crm-db-043 kernel: Free swap = 0kB
xxx-crm-db-043 kernel: Total swap = 67129340kB
xxx-crm-db-043 kernel: 67007372 pages RAM
xxx-crm-db-043 kernel: 0 pages HighMem/MovableOnly
xxx-crm-db-043 kernel: 1120317 pages reserved
xxx-crm-db-043 kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
xxx-crm-db-043 kernel: [ 770] 0 770 27716 528 58 62 0 systemd-journal
xxx-crm-db-043 kernel: [ 796] 0 796 10959 2 23 135 -1000 systemd-udevd
xxx-crm-db-043 kernel: [ 1052] 0 1052 13863 31 28 82 -1000 auditd
xxx-crm-db-043 kernel: [ 1085] 0 1085 5640 244 16 67 0 irqbalance
xxx-crm-db-043 kernel: [ 1087] 999 1087 133565 104 56 1831 0 polkitd
xxx-crm-db-043 kernel: [ 1090] 0 1090 6163 50 16 150 0 systemd-logind
xxx-crm-db-043 kernel: [ 1092] 81 1092 6134 76 16 46 -900 dbus-daemon
xxx-crm-db-043 kernel: [ 1103] 0 1103 31567 27 21 141 0 crond
xxx-crm-db-043 kernel: [ 1137] 0 1137 27511 1 10 29 0 agetty
xxx-crm-db-043 kernel: [ 1547] 0 1547 140597 82 93 3141 0 tuned
xxx-crm-db-043 kernel: [ 1554] 0 1554 3878 7 12 37 0 ctcm_agentd
xxx-crm-db-043 kernel: [ 1555] 0 1555 3878 592 12 25 0 ctcm_agentd
xxx-crm-db-043 kernel: [ 1556] 0 1556 3878 61 12 18 0 ctcm_agentd
xxx-crm-db-043 kernel: [ 1557] 0 1557 3878 6 11 38 0 ctcm_agentd
xxx-crm-db-043 kernel: [ 1597] 0 1597 10892 6 23 129 0 ctcss-execd
xxx-crm-db-043 kernel: [ 1609] 1000 1609 28750 56 26 134 0 ctcss-agentd
xxx-crm-db-043 kernel: [ 1617] 0 1617 10913 21 24 121 0 ctcss-logcollec
xxx-crm-db-043 kernel: [ 1629] 0 1629 10661 18 23 383 0 ctcss-syscheckd
xxx-crm-db-043 kernel: [ 4092] 0 4092 3878 6 11 36 0 ctcm_agentd
xxx-crm-db-043 kernel: [ 4094] 0 4094 3878 594 12 27 0 ctcm_agentd
xxx-crm-db-043 kernel: [ 4095] 0 4095 3878 32 10 26 0 ctcm_agentd
xxx-crm-db-043 kernel: [ 4128] 0 4128 3878 6 13 36 0 ctcm_agentd
xxx-crm-db-043 kernel: [ 4129] 0 4129 3878 592 14 26 0 ctcm_agentd
xxx-crm-db-043 kernel: [ 4130] 0 4130 3878 32 12 26 0 ctcm_agentd
xxx-crm-db-043 kernel: [ 3249] 0 3249 11038 4 23 136 0 keepalived
xxx-crm-db-043 kernel: [ 3250] 0 3250 11038 19 23 144 0 keepalived
xxx-crm-db-043 kernel: [ 664] 0 664 28848 1 12 102 0 bash
xxx-crm-db-043 kernel: [ 668] 0 668 1328640 72048 438 16566 0 filebeat
xxx-crm-db-043 kernel: [ 2433] 0 2433 1305720 74938 490 15311 0 filebeat
xxx-crm-db-043 kernel: [187163] 1005 187163 20434 10 40 193 0 zabbix_agentd
xxx-crm-db-043 kernel: [187164] 1005 187164 20434 585 42 182 0 zabbix_agentd
xxx-crm-db-043 kernel: [187165] 1005 187165 20434 16 40 204 0 zabbix_agentd
xxx-crm-db-043 kernel: [187166] 1005 187166 20434 17 40 204 0 zabbix_agentd
xxx-crm-db-043 kernel: [187167] 1005 187167 20434 16 40 204 0 zabbix_agentd
xxx-crm-db-043 kernel: [187168] 1005 187168 20434 60 40 181 0 zabbix_agentd
xxx-crm-db-043 kernel: [ 7329] 1008 7329 37234 2 29 88 0 ps
xxx-crm-db-043 kernel: [130090] 0 130090 26499 34 54 211 -1000 sshd
xxx-crm-db-043 kernel: [33976] 1008 33976 37234 0 29 90 0 ps
xxx-crm-db-043 kernel: [112227] 1008 112227 37234 2 29 88 0 ps
xxx-crm-db-043 kernel: [126338] 1008 126338 37234 2 29 88 0 ps
xxx-crm-db-043 kernel: [179285] 1008 179285 37234 0 28 90 0 ps
xxx-crm-db-043 kernel: [38321] 1008 38321 37234 2 29 88 0 ps
xxx-crm-db-043 kernel: [117415] 1008 117415 37234 0 30 89 0 ps
xxx-crm-db-043 kernel: [212839] 1003 212839 28315 2 13 87 0 mysqld_safe
xxx-crm-db-043 kernel: [214370] 1003 214370 24301176 9279471 44191 8368726 0 mysqld
xxx-crm-db-043 kernel: [190668] 1003 190668 28315 2 12 86 0 mysqld_safe
xxx-crm-db-043 kernel: [192347] 1003 192347 19795602 10438845 37189 6971474 0 mysqld
xxx-crm-db-043 kernel: [130629] 0 130629 684679 18330 178 4169 0 filebeat
xxx-crm-db-043 kernel: [107361] 0 107361 804691 2447 139 2890 0 dkron
xxx-crm-db-043 kernel: [107368] 0 107368 133765 125 37 1648 0 dkron-executor-
xxx-crm-db-043 kernel: [107383] 0 107383 119408 92 36 1148 0 dkron-executor-
xxx-crm-db-043 kernel: [107396] 0 107396 135824 91 39 2187 0 dkron-executor-
xxx-crm-db-043 kernel: [107417] 0 107417 102684 106 35 642 0 dkron-executor-
xxx-crm-db-043 kernel: [77160] 38 77160 6417 35 19 111 0 ntpd
xxx-crm-db-043 kernel: [ 9166] 0 9166 185146 1052 244 344 0 rsyslogd
xxx-crm-db-043 kernel: [141268] 1009 141268 5023 6 13 41 0 zabbix_agentd
xxx-crm-db-043 kernel: [141269] 1009 141269 5023 596 14 32 0 zabbix_agentd
xxx-crm-db-043 kernel: [141270] 1009 141270 5023 50 13 23 0 zabbix_agentd
xxx-crm-db-043 kernel: [141271] 1009 141271 5023 50 13 24 0 zabbix_agentd
xxx-crm-db-043 kernel: [141272] 1009 141272 5023 49 13 26 0 zabbix_agentd
xxx-crm-db-043 kernel: [141273] 1009 141273 5023 28 12 30 0 zabbix_agentd
xxx-crm-db-043 kernel: [ 9724] 1008 9724 37234 2 30 88 0 ps
xxx-crm-db-043 kernel: [ 9901] 1008 9901 177200 2555 92 3360 0 python
xxx-crm-db-043 kernel: [225017] 0 225017 147896 12602 119 10106 0 java
xxx-crm-db-043 kernel: [ 4265] 0 4265 142776 730 97 9462 0 java
xxx-crm-db-043 kernel: [121182] 1004 121182 3530288 319081 1408 285577 0 java
xxx-crm-db-043 kernel: [121671] 1004 121671 3497006 134740 1395 455433 0 java
xxx-crm-db-043 kernel: [220928] 1004 220928 2327527 157952 641 17454 0 java
xxx-crm-db-043 kernel: [222409] 1004 222409 2327527 145042 650 21089 0 java
xxx-crm-db-043 kernel: [120981] 1007 120981 897218 6506 156 18 0 telegraf
xxx-crm-db-043 kernel: [14208] 1003 14208 14969 182 31 0 0 mysqlreport
xxx-crm-db-043 kernel: [14209] 1003 14209 202664 429 57 1 0 mysqlreport
xxx-crm-db-043 kernel: [164882] 0 164882 37079 325 73 0 0 sshd
xxx-crm-db-043 kernel: [164884] 1006 164884 37079 323 72 0 0 sshd
xxx-crm-db-043 kernel: [164885] 1006 164885 28936 180 13 0 0 bash
xxx-crm-db-043 kernel: [165192] 0 165192 37079 324 75 0 0 sshd
xxx-crm-db-043 kernel: [165194] 1006 165194 37079 322 73 0 0 sshd
xxx-crm-db-043 kernel: [165195] 1006 165195 28912 180 15 0 0 bash
xxx-crm-db-043 kernel: [155589] 0 155589 37078 324 75 0 0 sshd
xxx-crm-db-043 kernel: [155591] 0 155591 37078 323 77 0 0 sshd
xxx-crm-db-043 kernel: [155592] 1003 155592 37078 330 73 0 0 sshd
xxx-crm-db-043 kernel: [155594] 1003 155594 37078 323 75 0 0 sshd
xxx-crm-db-043 kernel: [155595] 1003 155595 12777 133 29 0 0 sftp-server
xxx-crm-db-043 kernel: [155606] 1003 155606 28283 61 13 0 0 bash
xxx-crm-db-043 kernel: [155617] 1003 155617 50329587 42324156 97694 248269 0 innobackupex
xxx-crm-db-043 kernel: [155618] 1003 155618 1607 28 9 0 0 sshpass
xxx-crm-db-043 kernel: [155619] 1003 155619 44504 685 42 0 0 ssh
xxx-crm-db-043 kernel: [159336] 0 159336 37079 324 75 0 0 sshd
xxx-crm-db-043 kernel: [159338] 1006 159338 37079 333 73 0 0 sshd
xxx-crm-db-043 kernel: [159340] 1006 159340 28904 144 16 0 0 bash
xxx-crm-db-043 kernel: [110630] 0 110630 37079 325 75 0 0 sshd
xxx-crm-db-043 kernel: [110633] 1003 110633 37079 335 73 0 0 sshd
xxx-crm-db-043 kernel: [110634] 1003 110634 28847 100 14 0 0 bash
xxx-crm-db-043 kernel: [111511] 1003 111511 33443 171 22 0 0 mysql
xxx-crm-db-043 kernel: [ 4741] 0 4741 37079 324 77 0 0 sshd
xxx-crm-db-043 kernel: [ 4752] 1003 4752 37079 338 73 0 0 sshd
xxx-crm-db-043 kernel: [ 4753] 1003 4753 28847 103 14 0 0 bash
xxx-crm-db-043 kernel: [24250] 0 24250 37079 324 76 0 0 sshd
xxx-crm-db-043 kernel: [24252] 1003 24252 37079 332 74 0 0 sshd
xxx-crm-db-043 kernel: [24253] 1003 24253 28847 104 15 0 0 bash
xxx-crm-db-043 kernel: [31919] 0 31919 37079 324 73 0 0 sshd
xxx-crm-db-043 kernel: [31921] 1004 31921 37079 329 71 0 0 sshd
xxx-crm-db-043 kernel: [31922] 1004 31922 28847 92 13 0 0 bash
xxx-crm-db-043 kernel: [32836] 0 32836 48893 172 53 0 0 sudo
xxx-crm-db-043 kernel: [32837] 0 32837 47397 145 48 0 0 su
xxx-crm-db-043 kernel: [32838] 0 32838 28847 99 14 0 0 bash
xxx-crm-db-043 kernel: [33133] 0 33133 47397 144 52 0 0 su
xxx-crm-db-043 kernel: [33134] 1003 33134 28848 99 12 0 0 bash
xxx-crm-db-043 kernel: [35883] 1003 35883 33443 161 21 0 0 mysql
xxx-crm-db-043 kernel: Out of memory: Kill process 155617 (innobackupex) score 516 or sacrifice child
xxx-crm-db-043 kernel: Killed process 155617 (innobackupex) total-vm:201318348kB, anon-rss:169296624kB, file-rss:0kB, shmem-rss:0kB
xxx-crm-db-043 telegraf: 2020/09/24 09:17:54 recv loop terminated: err=EOF
xxx-crm-db-043 telegraf: 2020/09/24 09:17:54 send loop terminated: err=<nil>
xxx-crm-db-043 systemd-logind: New session 13705934 of user db.
xxx-crm-db-043 systemd: Started Session 13705934 of user db.
可以看到,在被kill之前,xtrabackup已经使用了200多G内存了。这就不是正常应该出现的情况了。并且在监控平台可以看到:
可以看到基本是在死锁发生之后,xtrabackup一直在申请内存,而没有释放,造成了内存泄漏。
而对应的xtrabackup日志则是:
## 备份参数
start to run innobackup,cmd is xtrabackup --defaults-file=/app/db/mysql/mysql_8804/etc/mysql8804.cnf --backup --ftwrl-wait-threshold=10 --ftwrl-wait-query-type=all --ftwrl-wait-timeout=20 --kill-long-query-type=all --kill-long-queries-timeout=60 --stream=xbstream --parallel=4 --compress-threads=4 --user=agent --password='*****' --socket=/tmp/mysql_8804.sock --target-dir=/tmp | sshpass -p'passwd' ssh -p22 -oPreferredAuthentications=password -o StrictHostKeyChecking=no -o GSSAPIAuthentication=no db@134.84.46.9 'xbstream -x -C /app/db/mysql-bak/set_6798804/134.84.46.6_8804/backup_0000000079/2020-11-12_00-00-10'
## 相关日志
201112 03:08:21 >> log scanned up to (1895593797155)
201112 03:08:22 >> log scanned up to (1895593857713)
201112 03:08:23 [01] ...done
201112 03:08:23 Waiting 20 seconds for queries running longer than 10 seconds to finish
201112 03:08:23 Executing FLUSH TABLES WITH READ LOCK...
201112 03:08:23 Kill query timeout 60 seconds.
201112 03:08:23 >> log scanned up to (1895595105480)
201112 03:08:24 >> log scanned up to (1895595108000)
201112 03:08:25 >> log scanned up to (1895595121941)
201112 03:08:26 >> log scanned up to (1895595135882)
……
201112 05:39:49 >> log scanned up to (1895595179845)
201112 05:39:50 >> log scanned up to (1895595179845)
201112 05:39:51 >> log scanned up to (1895595179845)
[2020/11/12 05:40:12 CST] [EROR] (db/pkg/backuprecovery/backup.(*nationalBackupManager).dataInnobackup:72) [set_2348801]: backup failed
[2020/11/12 05:40:12 CST] [EROR] (db/pkg/backuprecovery/backup.(*basicBackupManager).logErrorAndReport:161) Xtrabackup run backup failed!, can't find completed OK! in last half ret
二、分析
2.1 xtrabackup备份逻辑
这一部分就不过多说明,简单需要了解的就是:xtrabackup进行mysql备份时,首先会开启一个辅助线程,进行redo log拷贝;主线程则首先并行的进行ibd文件拷贝;然后执行ftwrl得到一致性位点;最后进行frm等非事务文件的拷贝。主线程通知redo log拷贝线程停止工作,并释放锁退出。
从上面的日志可以看出:
- xtrabackup已经完成了ibd文件的拷贝,正在进行ftwrl,发生死锁
- 此时只有主线程,redolog复制线程;
- 可以看到我们在xtrabackup中开启了kill-long-queries-timeout选项,所以,在执行ftwrl时,会开启一个辅助线程,从Kill query timeout 60 seconds.日志也可以看出。
可以看到,发生内存泄漏只可能发生在上述三个线程中,而主线程此时应该阻塞在ftwrl中,而redo log复制线程由于死锁,没有redolog,应该也不会存在内存申请的过程。那可以发生泄漏就只有kill-long-queries-timeout辅助线程了,下面是相关源码分析(基于2.4.12)。
2.1 源码分析
首先,我们查看执行ftwrl附近相关源代码
msg_ts("Executing FLUSH TABLES WITH READ LOCK...\n"); // 打印正在执行ftwrl日志
if (opt_kill_long_queries_timeout) { // 如果开启了kill-long-queries-timeout选项,则开启辅助线程,内部通过os_event进行唤醒退出
start_query_killer();
}
……
xb_mysql_query(connection, "FLUSH TABLES WITH READ LOCK", false); // 执行ftwrl
if (opt_kill_long_queries_timeout) { // 执行完ftwrl后,通过os_event通知辅助线程退出
stop_query_killer();
}
tables_locked = true;
而复杂线程的执行代码如下:
kill_query_thread(
/*===============*/
void *arg __attribute__((unused)))
{
MYSQL *mysql;
time_t start_time;
start_time = time(NULL);
os_event_set(kill_query_thread_started);
msg_ts("Kill query timeout %d seconds.\n", // 打印日志
opt_kill_long_queries_timeout);
while (time(NULL) - start_time <
(time_t)opt_kill_long_queries_timeout) {
if (os_event_wait_time(kill_query_thread_stop, 1000) !=
OS_SYNC_TIME_EXCEEDED) {
goto stop_thread;
}
}
if ((mysql = xb_mysql_connect()) == NULL) {
msg("Error: kill query thread failed\n");
goto stop_thread;
}
while (true) { // 循环执行kill操作,kill超时的语句,直到遇到kill_query_thread_stop被唤醒为止
kill_long_queries(mysql, time(NULL) - start_time);
if (os_event_wait_time(kill_query_thread_stop, 1000) !=
OS_SYNC_TIME_EXCEEDED) {
break;
}
}
mysql_close(mysql);
stop_thread:
msg_ts("Kill query thread stopped\n");
os_event_set(kill_query_thread_stopped);
os_thread_exit();
OS_THREAD_DUMMY_RETURN;
}
从上述逻辑看,并没有什么问题,但是当我们查看kill_long_queries函数时,发现了内存泄漏,由于在不断的while循环之中,最终导致OOM,下面时kill_long_queries函数具体实现
static
void
kill_long_queries(MYSQL *connection, uint timeout)
{
MYSQL_RES *result;
MYSQL_ROW row;
bool all_queries;
char kill_stmt[100];
result = xb_mysql_query(connection, "SHOW FULL PROCESSLIST", true); // 查看mysql现有线程执行情况,结果保存在result中
all_queries = (opt_kill_long_query_type == QUERY_TYPE_ALL);
while ((row = mysql_fetch_row(result)) != NULL) {
const char *info = row[7];
int duration = atoi(row[5]);
char *id = row[0];
if (info != NULL && // 只有info不为null,才能被kill
duration >= (int)timeout &&
((all_queries && is_query(info)) ||
is_select_query(info))) {
msg_ts("Killing query %s (duration %d sec): %s\n",
id, duration, info);
ut_snprintf(kill_stmt, sizeof(kill_stmt),
"KILL %s", id);
xb_mysql_query(connection, kill_stmt, false, false);
}
}
}
上述处理过程存在两个问题:
- info != NULL判断条件,我们在死锁界面可以看到,所有线程的info都为null,所以kill_long_queries并不能解决死锁问题,造成kill_long_queries会不断执行;
- xb_mysql_query返回结果result没有释放,造成内存泄漏。这一点是坚决不被允许的,在mysql客户端开发手册中明确指出
三、规避与修复
指导了上述问题原因,相应的应对措施也就可以对症下药了。
-
规避
关闭kill_long_queries选项
-
修复
我们可以直接在kill_long_queries中进行修复,即在函数结尾调用mysql_free_result(result)。我们也查到在xtrabackup官方直到2019年7月5号进行了修复。相关的提交如下https://github.com/percona/percona-xtrabackup/commit/b1a35193732d276d107887ecd5d2e03f3e7d30c8
2.4.15之后的版本都进行了修复。