背景
近期,兄弟部门反馈关于我们产品自带的MYSQL数据库的问题。他们指出,数据库在每天凌晨12点会出现宕机现象。这一现象也是最新一段才出现的,这段时间只是增加了一些自动化程序。频繁宕机的问题影响了兄弟部门的工作流程,也产生了一定的负面影响。为了解决这一问题,我们需要深入调查并给出解决措施。
排查过程
确认MYSQL所在的服务器资源情况
我们了解到这段时间做的变更中提到增加了一些自动化的程序,是不是自动化程序中涉及到操作数据库,造成系统资源不足,被系统杀掉?于是查看各类系统资源。无论是磁盘还是内存都有较大剩余,也查看了部分系统日志,均没有因为资源不足杀进程的日志,所以应该不是此原因导致宕机。
结论:不是因为系统资源导致MYSQL宕机
排查MYSQL错误日志
在my.ini中指定当mysql发生故障时的的错误日志地址,具体配置方式如下:
# start error log
log-error=mysql.err
mysql.err内容如下
2024-01-08 9:30:46 0 [Note] InnoDB: Uses event mutexes
2024-01-08 9:30:46 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
2024-01-08 9:30:46 0 [Note] InnoDB: Number of pools: 1
2024-01-08 9:30:46 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions
2024-01-08 9:30:46 0 [Note] mysqld: O_TMPFILE is not supported on /tmp (disabling future attempts)
2024-01-08 9:30:46 0 [Note] InnoDB: Using Linux native AIO
2024-01-08 9:30:46 0 [Note] InnoDB: Initializing buffer pool, total size = 536870912, chunk size = 134217728
2024-01-08 9:30:46 0 [Note] InnoDB: Completed initialization of buffer pool
2024-01-08 9:30:46 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=296938092837,296938109566
2024-01-08 9:30:47 0 [Note] InnoDB: Transaction 57863035 was in the XA prepared state.
2024-01-08 9:30:47 0 [Note] InnoDB: 1 transaction(s) which must be rolled back or cleaned up in total 0 row operations to undo
2024-01-08 9:30:47 0 [Note] InnoDB: Trx id counter is 57863036
2024-01-08 9:30:47 0 [Note] InnoDB: Starting final batch to recover 54726 pages from redo log.
2024-01-08 9:30:49 0 [Note] InnoDB: Last binlog file './mysql-bin.000078', position 111348635
2024-01-08 9:30:49 0 [Note] InnoDB: 128 rollback segments are active.
2024-01-08 9:30:49 0 [Note] InnoDB: Starting in background the rollback of recovered transactions
2024-01-08 9:30:49 0 [Note] InnoDB: Rollback of non-prepared transactions completed
2024-01-08 9:30:49 0 [Note] InnoDB: Removed temporary tablespace data file: "ibtmp1"
2024-01-08 9:30:49 0 [Note] InnoDB: Creating shared tablespace for temporary tables
2024-01-08 9:30:49 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
2024-01-08 9:30:49 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
2024-01-08 9:30:49 0 [Note] InnoDB: 10.5.11 started; log sequence number 297037605262; transaction id 57863037
2024-01-08 9:30:49 0 [Note] InnoDB: Loading buffer pool(s) from /home/hsrisk/disk01/see/mysql/data/ib_buffer_pool
2024-01-08 9:30:49 0 [Note] Plugin 'FEEDBACK' is disabled.
2024-01-08 9:30:49 0 [Note] Recovering after a crash using mysql-bin
2024-01-08 9:30:49 0 [Note] InnoDB: Buffer pool(s) load completed at 240108 9:30:49
2024-01-08 9:30:49 0 [Note] Starting crash recovery...
2024-01-08 9:30:49 0 [Note] InnoDB: Starting recovery for XA transactions...
2024-01-08 9:30:49 0 [Note] InnoDB: Transaction 57863035 in prepared state after recovery
2024-01-08 9:30:49 0 [Note] InnoDB: Transaction contains changes to 1 rows
2024-01-08 9:30:49 0 [Note] InnoDB: 1 transactions in prepared state after recovery
2024-01-08 9:30:49 0 [Note] Found 1 prepared transaction(s) in InnoDB
2024-01-08 9:30:49 0 [Note] Crash recovery finished.
2024-01-08 9:30:49 0 [Note] Server socket created on IP: '::'.
2024-01-08 9:30:49 0 [Note] Reading of all Master_info entries succeeded
2024-01-08 9:30:49 0 [Note] Added new Master_info '' to hash table
2024-01-08 9:30:49 0 [Note] ./mysqld: ready for connections.
Version: '10.5.11-MariaDB-log' socket: '/tmp/see20/mysql.sock' port: 3306 MariaDB Server
这段日志是客户因为数据库宕机后手动重启的日志。这个日志中发现有Recovering after a crash using mysql-bin,Starting crash recovery,Crash recovery finished 等字样,表示数据库之前确实是异常关闭的,但是这种异常是MYSQL SERVER预料之外的。预料之外的含义是这种异常数据库未定义,如果是意料之中的异常的话,日志中会留下更多的信息,例如下图,虽然没写明是什么原因导致的数据库崩溃,但是他清楚的告诉了报错的地点以及保留了帮助定位的信息。
2023-09-17 21:36:12 0x7f384413a700 InnoDB: Assertion failure in file /home/buildbot/buildbot/build/storage/innobase/lock/lock0lock.cc line 7014
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: https://mariadb.com/kb/en/library/xtradbinnodb-recovery-modes/
InnoDB: about forcing recovery.
230917 21:36:12 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Server version: 10.3.10-MariaDB-log
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=322
max_threads=6002
thread_count=261
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 13325602 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
Thread pointer: 0x7f35980009a8
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7f3844139e30 thread_stack 0x49000
mysys/stacktrace.c:270(my_print_stacktrace)[0x556c2c650b4b]
sql/signal_handler.cc:168(handle_fatal_signal)[0x556c2c13bba7]
sigaction.c:0(__restore_rt)[0x7f3849863680]
:0(__GI_raise)[0x7f384866f207]
:0(__GI_abort)[0x7f38486708f8]
ut/ut0rbt.cc:461(rbt_eject_node(ib_rbt_node_t*, ib_rbt_node_t*) [clone .part.4])[0x556c2bea97b9]
lock/lock0lock.cc:7014(DeadlockChecker::check_and_resolve(ib_lock_t const*, trx_t*))[0x556c2c319ec2]
lock/lock0lock.cc:1741(lock_rec_enqueue_waiting(ib_lock_t*, unsigned long, buf_block_t const*, unsigned long, dict_index_t*, que_thr_t*, lock_prdt*))[0x556c2c31aaf4]
lock/lock0lock.cc:1990(lock_rec_lock(bool, unsigned long, buf_block_t const*, unsigned long, dict_index_t*, que_thr_t*))[0x556c2c31df48]
row/row0sel.cc:1284(sel_set_rec_lock(btr_pcur_t*, unsigned char const*, dict_index_t*, unsigned long const*, unsigned long, unsigned long, que_thr_t*, mtr_t*))[0x556c2c3b783c]
row/row0sel.cc:5047(row_search_mvcc(unsigned char*, page_cur_mode_t, row_prebuilt_t*, unsigned long, unsigned long))[0x556c2c3b8a10]
handler/ha_innodb.cc:9264(ha_innobase::index_read(unsigned char*, unsigned char const*, unsigned int, ha_rkey_function))[0x556c2c2e11fc]
sql/handler.cc:2798(handler::ha_rnd_pos(unsigned char*, unsigned char*))[0x556c2c140c82]
sql/handler.h:3568(handler::rnd_pos_by_record(unsigned char*))[0x556c2c14855b]
sql/sql_class.h:6384(handler::ha_rnd_pos_by_record(unsigned char*))[0x556c2c219170]
sql/log_event.cc:14265(Update_rows_log_event::do_exec_row(rpl_group_info*))[0x556c2c2194f4]
sql/log_event.cc:11514(Rows_log_event::do_apply_event(rpl_group_info*))[0x556c2c20ddcc]
sql/log_event.h:5159(Log_event::apply_event(rpl_group_info*))[0x556c2bef5c5c]
sql/slave.cc:3920(apply_event_and_update_pos_apply(Log_event*, THD*, rpl_group_info*, int))[0x556c2bee83db]
sql/mysqld.h:808(thread_safe_increment64)[0x556c2c09073e]
sql/rpl_parallel.cc:1300(handle_rpl_parallel_thread)[0x556c2c0931fb]
pthread_create.c:0(start_thread)[0x7f384985bdd5]
*** stack smashing detected ***: /home/hundsun/mysql/mysql/bin/mysqld terminated
结论:是异常宕机,并且该异常不是系统定义,无法被MYSQL SERVER所捕获。
排查MYSQL BINLOG日志
在my.ini中指定当开启binlog日志,具体配置方式如下:
log-bin=mysql-bin #开启binlog
重启MYSQL后会创建如下格式的binlog文件
mysql-bin.000001
mysql-bin.000002
MYSQL生成的binlog是二进制文件,无法直接查看。可以使用mysqlbinlog工具将binlog二进制数据导出进行查看,例如
./mysqlbinlog ../data/mysql-bin.000004 --start-datetime='2020-02-12 23:30:00' --stop-datetime='2025-02-12:23:50:00' --base64-output=DECODE-ROWS > 4.txt
可以按照时间维度将数据导出到文件中进行查看。下面是2个binlog的导出数据
mysql-bin.000041这个文件有203M,最后的修改时间是2023-12-24 23:59:15。此binlog导出文件的末尾数据如下
#231224 23:59:15 server id 1 end_log_pos 83409660 CRC32 0x595aa790 Table_map: `acm`.`tb_lock` mapped to number 58
# at 83409660
#231224 23:59:15 server id 1 end_log_pos 83410156 CRC32 0xed3ad18a Update_rows: table id 58 flags: STMT_END_F
# Number of rows: 1
# at 83410156
#231224 23:59:15 server id 1 end_log_pos 83410187 CRC32 0x539c3bfa Xid = 1904727
COMMIT/*!*/;
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;
mysql-bin.000042这个文件有1G,最后的修改时间是2023-12-25 19:03:03。此binlog导出文件的末尾数据如下