一、背景
- 在业务庞大,运维流程长且复杂的场景下,为了提高运维效率,减小系统对业务工人的依赖(耦合),系统引入消息队列
- 系统基于MySql实现的消息队列 oxen-queue
- 在高并发的情况下,消息队列由于设计原因,有死锁情况出现
- 造成的影响:入队操作失败,导致单据创建失败;重新入队不及时,导致业务处理慢
二、环境
- 建表语句
create table Queue ( FId bigint unsigned auto_increment primary key, FBatchId bigint unsigned null, FJobType varchar(256) not null, FCreatedTs datetime default CURRENT_TIMESTAMP not null comment '创建时间', FStartedTs datetime null comment '开始时间', FBody varchar(8096) null, FStatus varchar(256) default 'waiting' null, FResult text null, FRecovered int default 0 null, FRunningTime int null, FUniqueKey bigint null, FPriority bigint null, FRetryCount bigint default 0 not null, FCreateTime datetime default CURRENT_TIMESTAMP not null, constraint i_u_FUniqueKey unique (FUniqueKey) ) charset = utf8; create index i_FBatchId_FPriority on devops.Queue (FBatchId, FPriority); create index i_FCreatedTs on devops.Queue (FCreatedTs); create index i_FJobType_FBatchId_FStatus_FPriority on devops.Queue (FJobType, FBatchId, FStatus, FPriority); create index i_FStartedTs_FJobType_FStatus on devops.Queue (FStartedTs, FJobType, FStatus); create index i_FStatus on devops.Queue (FStatus); INSERT INTO `Queue` (`FBody`, `FCreateTime`, `FCreatedTs`, `FJobType`, `FPriority`, `FRetryCount`, `FStartedTs`, `FUniqueKey`) VALUES ('body', '2022-08-22 10:59:59', '2022-08-23 14:30:00', 'syncChangeStatusQueue', 1661137199157, 0, '2022-08-22 10:59:59', NULL);
- Queue相关DB操作
// 入队 INSERT INTO `Queue` (`FBody`, `FCreateTime`, `FCreatedTs`, `FJobType`, `FPriority`, `FRetryCount`, `FStartedTs`, `FUniqueKey`) VALUES ('body', '2022-08-22 10:59:59', '2022-08-23 14:30:00', 'syncChangeStatusQueue', 1661137199157, 0, '2022-08-22 10:59:59', NULL); // 重新入队 UPDATE `Queue` SET `FStatus` = 'waiting', `FCreatedTs` = '2022-09-19 09:37:46.766', `FStartedTs` = '2022-09-19 09:37:46.766', `FBatchId` = NULL, `FRetryCount` = 1, `FBody` = 'body' WHERE `FId` = 1; // 超时任务处理 UPDATE T_DevOpsQueue SET FStatus='waiting', FBatchId = NULL, FStartedTs = NULL, FRecovered = 1 WHERE FStatus = 'processing' AND FStartedTs < (NOW() - INTERVAL 60 SECOND) AND `FJobType` = 'syncChangeStatusQueue';
三、问题定位
起因
用户反馈单据创建异常,通过相关信息,定位到异常产生的原因是消息队列死锁导致-
系统异常日志
INSERT INTO `Queue` (`FBody`, `FCreateTime`, `FCreatedTs`, `FJobType`, `FPriority`, `FRetryCount`, `FStartedTs`, `FUniqueKey`) VALUES ('body', '2022-08-22 10:59:59', '2022-08-23 14:30:00', 'syncChangeStatusQueue', 1661137199157, 0, '2022-08-22 10:59:59', NULL) ER_LOCK_DEADLOCK: Deadlock found when trying to get lock; try restarting transaction
首次入队失败,由于没有回退流程,导致数据单据数据不一致
第二条错误日志,这里是我通过在日志中发现的, -
统计系统死锁(以
ER_LOCK_DEADLOCK
为关键字)
通过ER_LOCK_DEADLOCK
查找,发现不只是Insert, 觉大部分为UpdateUPDATE `Queue` SET `FStatus` = 'waiting', `FCreatedTs` = '2022-09-19 09:37:46.766', `FStartedTs` = '2022-09-19 09:37:46.766', `FBatchId` = NULL, `FRetryCount` = 1, `FBody`= 'body' WHERE `FId` = 1 ER_LOCK_DEADLOCK: Deadlock found when trying to get lock; try restarting transaction
问:为什么异常发生频率这么高,却一直没有被发现
答:Insert会引起业务异常(首次入队失败,由于没有回退流程,导致数据单据数据不一致),更容易关注到,Updata(是重新入队)会被后台Worker补偿,所以没有业务影响 初步分析-业务层面
系统错误日志已经明确,试图从业务层面去分区死锁并解决,
猜测:由于业务处理复杂,整个流程全部在一个事务中处理(包含多次网络请求),初步认为是事务使用不当导致死锁
解决:拆解事务,网络请求等耗时操作移出事务
结果:并没有改善,所以不是这里引起的(事后分析,入队操作并没有使用事务,所以一定不是这里引起的)
四、死锁分析与处理
-
死锁分析(一)
-
找DBA拉取死锁日志(
show engine innodb status
)// 事物一 *** (1) TRANSACTION: TRANSACTION 264237645, ACTIVE 2 sec starting index read mysql tables in use 2, locked 2 LOCK WAIT 128 lock struct(s), heap size 24784, 265 row lock(s) // 事物ID:53640626,client host: x.x.x.177 MySQL thread id 53640626, OS thread handle 139684660885248, query id 1839137253 x.x.x.177 x.x.x.177 db Sending data // 死锁SQL UPDATE Queue AS main INNER JOIN ( SELECT FId FROM Queue FORCE INDEX (i_FJobType_FBatchId_FStatus_FPriority) WHERE FBatchId IS NULL AND FStatus = "waiting" AND `FJobType` = 'syncChangeStatusQueue' AND FCreatedTs <= NOW() ORDER BY FPriority ASC LIMIT 10 ) sub ON sub.FId = main.FId SET `FBatchId` = 6835010934527921, FStatus = "processing", FStartedTs = NOW() // 事物一,等待的锁 *** (1) WAITING FOR THIS LOCK TO BE GRANTED: // 等待主键的记录X锁(排他锁) RECORD LOCKS space id 217 page no 346233 n bits 96 index PRIMARY of table `db`.`Queue` trx id 264237645 lock_mode X locks rec but not gap waiting Record lock, heap no 25 PHYSICAL RECORD: n_fields 16; compact format; info bits 0 // 事物二 *** (2) TRANSACTION: TRANSACTION 264237646, ACTIVE 1 sec starting index read mysql tables in use 2, locked 2 128 lock struct(s), heap size 24784, 265 row lock(s) // 事物ID:53950278,client host: x.x.x.198 MySQL thread id 53950278, OS thread handle 139684121917184, query id 1839137525 x.x.x.198 x.x.x.198 db Sending data // 死锁SQL UPDATE Queue AS main INNER JOIN ( SELECT FId FROM Queue FORCE INDEX (i_FJobType_FBatchId_FStatus_FPriority) WHERE FBatchId IS NULL AND FStatus = "waiting" AND `FJobType` = 'syncChangeStatusQueue' AND FCreatedTs <= NOW() ORDER BY FPriority ASC LIMIT 10 ) sub ON sub.FId = main.FId SET `FBatchId` = 2026729855303369, FStatus = "processing", FStartedTs = NOW() // 事物二,持有的锁 *** (2) HOLDS THE LOCK(S): // 等待主键的记录S锁(共享锁) RECORD LOCKS space id 217 page no 346233 n bits 96 index PRIMARY of table `db`.`Queue` trx id 264237646 lock mode S locks rec but not gap Record lock, heap no 25 PHYSICAL RECORD: n_fields 16; compact format; info bits 0 // 事物二,等待的锁 *** (2) WAITING FOR THIS LOCK TO BE GRANTED: // 等待主键的记录X锁(排他锁) RECORD LOCKS space id 217 page no 346233 n bits 96 index PRIMARY of table `db`.`Queue` trx id 264237646 lock_mode X locks rec but not gap waiting Record lock, heap no 25 PHYSICAL RECORD: n_fields 16; compact format; info bits 0 // 事物二,回滚 *** WE ROLL BACK TRANSACTION (2)
-
日志分析
疑惑:为什么业务系统异常日志信息和MySql死锁日志信息不一致?
推测: MySql中引起死锁的Sql会影响业务系统中Sql,导致的异常;不止一个死锁先解决MySql死锁日志中的: 死锁语句是同一个,但是是不同的Client执行(不同的Host),系统的服务,部署在不同的机器(集群)并发引起的
- 死锁SQL
UPDATE Queue AS main INNER JOIN ( SELECT FId FROM Queue FORCE INDEX (i_FJobType_FBatchId_FStatus_FPriority) WHERE FBatchId IS NULL AND FStatus = "waiting" AND `FJobType` = 'syncChangeStatusQueue' AND FCreatedTs <= NOW() ORDER BY FPriority ASC LIMIT 10 ) sub ON sub.FId = main.FId SET `FBatchId` = 2026729855303369, FStatus = "processing", FStartedTs = NOW()
-
通过日志,分析得到,俩个事务持有、等待锁的情况
事务1 事务2 持有锁 ... PRIMARY的记录S锁 等待锁 PRIMARY的记录X锁 PRIMARY的记录X锁 -
虽然日志中没有记录事务1持有锁的情况,但是不难推断出,是PRIMARY的记录S锁
-
分析一句Sql是怎么形成上面这样的死锁的
-
首先分析这条Sql会加哪些锁(
select * from performance_schema.data_locks
)
sql锁情况
- Sql语句中有子查询,并强制使用i_FJobType_FBatchId_FStatus_FPriority索引,所以这里获取了二级索引(i_FJobType_FBatchId_FStatus_FPriority)的Next-Key共享锁,
- 由于索引的机制,这里会回表,也就是通过二级索引,会获取聚簇索引的记录(
S,REC_NOT_GAP
)共享锁, - 通过主键Update,所以再获取聚簇索引的记录(
X,REC_NOT_GAP
)排他锁
-
真实的死锁原因
先加了聚簇索引的记录共享锁,然后再获取聚簇索引的记录排他锁,在并发的情况下,同时持有共享锁,再获取排他锁时,相互等待,形成死锁
-
-
如何解决
- 问题的根因已经明确,这里主要共享锁相互兼容,导致相同的Sql语句同时持有共享锁导致,所以我们这里通过将共享锁升级为排他锁,解决,添加
FOR UPDATE
UPDATE Queue AS main INNER JOIN ( SELECT FId FROM Queue FORCE INDEX (i_FJobType_FBatchId_FStatus_FPriority) WHERE FBatchId IS NULL AND FStatus = "waiting" AND `FJobType` = 'syncChangeStatusQueue' AND FCreatedTs <= NOW() ORDER BY FPriority ASC LIMIT 10 FOR UPDATE ) sub ON sub.FId = main.FId SET `FBatchId` = 2026729855303369, FStatus = "processing", FStartedTs = NOW();
调整后Sql锁情况
通过FOR UPDATE将共享锁升级为排他锁,顺序加锁,解决问题
- 问题的根因已经明确,这里主要共享锁相互兼容,导致相同的Sql语句同时持有共享锁导致,所以我们这里通过将共享锁升级为排他锁,解决,添加
-
解决效果
通过统计观察指定异常,发现这里的异常已经消除
-
-
死锁分析(二)
上面的异常是消除了,但是并没有解决掉我发现系统日志的异常信息,也就侧面验证了我们在上面的疑惑,系统不止一个死锁所以这里调整了日志统计信息,看所有的死锁问题
show engine innodb status
只显示最近一次发生的死锁日志,如果死锁频繁,可以通过系统全局变量innodb_print_all_deadlocks
置为ON,通过日志可以看到所有的死锁信息-
找DBA拉取死锁日志(
show engine innodb status
),发现上面的死锁确实换成了另外一个*** (1) TRANSACTION: TRANSACTION 278046801, ACTIVE 0 sec starting index read mysql tables in use 1, locked 1 LOCK WAIT 5 lock struct(s), heap size 1136, 3 row lock(s) MySQL thread id 54720832, OS thread handle 139687080482560, query id 1895336202 x.x.x.177 x.x.x.177 mmpaydevopsweb Searching rows for update UPDATE Queue SET FStatus="waiting", FBatchId = NULL, FStartedTs = NULL, FRecovered = 1 WHERE FStatus="processing" AND FStartedTs < (NOW() - INTERVAL 60 SECOND) AND `FJobType` = 'syncChangeGrayTimeQueue' *** (1) WAITING FOR THIS LOCK TO BE GRANTED: RECORD LOCKS space id 217 page no 325711 n bits 104 index PRIMARY of table `db`.`Queue` trx id 278046801 lock_mode X locks rec but not gap waiting Record lock, heap no 31 PHYSICAL RECORD: n_fields 16; compact format; info bits 0 *** (2) TRANSACTION: TRANSACTION 278046800, ACTIVE 0 sec updating or deleting mysql tables in use 1, locked 1 3 lock struct(s), heap size 1136, 2 row lock(s), undo log entries 1 MySQL thread id 55659707, OS thread handle 139684629968640, query id 1895336203 x.x.x.62 x.x.x.62 mmpaydevopsweb updating update `Queue` set `FStatus` = 'waiting', `FCreatedTs` = '2022-09-19 09:37:46.766', `FStartedTs` = '2022-09-19 09:37:46.766', `FBatchId` = NULL, `FRetryCount` = 26846, `FBody` = '{\"job_type\":\"syncChangeStatusJobType\",\"payload\":{\"deploymentId\":41582},\"start_time\":10000}' where `FId` = 7446396 *** (2) HOLDS THE LOCK(S): RECORD LOCKS space id 217 page no 325711 n bits 104 index PRIMARY of table `db`.`Queue` trx id 278046800 lock_mode X locks rec but not gap Record lock, heap no 31 PHYSICAL RECORD: n_fields 16; compact format; info bits 0 *** (2) WAITING FOR THIS LOCK TO BE GRANTED: RECORD LOCKS space id 217 page no 435270 n bits 864 index i_FStatus of table `db`.`Queue` trx id 278046800 lock_mode X locks rec but not gap waiting Record lock, heap no 767 PHYSICAL RECORD: n_fields 2; compact format; info bits 0 *** WE ROLL BACK TRANSACTION (2)
-
日志分析
-
死锁SQL
============================= 事物一 ====================================== UPDATE Queue SET FStatus="waiting", FBatchId = NULL, FStartedTs = NULL, FRecovered = 1 WHERE FStatus="processing" AND FStartedTs < (NOW() - INTERVAL 60 SECOND) AND `FJobType` = 'syncChangeGrayTimeQueue' ============================= 事物二 ====================================== UPDATE `Queue` SET `FStatus` = 'waiting', `FCreatedTs` = '2022-09-19 09:37:46.766', `FStartedTs` = '2022-09-19 09:37:46.766', `FBatchId` = NULL, `FRetryCount` = 1, `FBody` = 'body' WHERE `FId` = 150; ```
通过日志,分析得到,俩个事务持有、等待锁的情况
事务1 事务2 持有锁 ... PRIMARY的记录X锁 等待锁 PRIMARY的记录X锁 二级索引FStatus的记录X锁 -
-
分析Sql是怎么形成上面这样的死锁的
- 首先分析这条Sql会加哪些锁(
select * from performance_schema.data_locks
)
sql锁情况
Sql-1
Sql-2
这里Sql2好像并没有二级索引FStatus的X锁,难道推断出了问题?并不是,该语句更新了FStatus,而FStatus是一个二级索引(二级索引所在的列产生实际变化的更新),所以在更新前,需要对FStatus的二级索引对应的记录加锁(隐式锁)
所以实际的加锁情况是
这种情况主要是不同的sql,加锁的顺序不一致,更新的条件是主键、二级索引,很容易产生死锁
- 首先分析这条Sql会加哪些锁(
-
如何解决
俩个语句如果加锁顺序一致,就可以避免这样的问题发生// 先查询 SELECT FId FROM ${this.db_table} WHERE FStatus="processing" AND FStartedTs < (NOW() - INTERVAL ${this.job_timeout_seconds} SECOND) AND ? LIMIT 10 // where条件加上主键 UPDATE Queue SET FStatus="waiting", FBatchId = NULL, FStartedTs = NULL, FRecovered = 1 WHERE FId IN(1,2) //上面的ID FStatus="processing" AND FStartedTs < (NOW() - INTERVAL 60 SECOND) AND `FJobType` = 'jobType'
这里where条件为什么不去掉后面的条件,只通过主键修改?原因是可能形成数据覆盖
-
解决效果
-
五、疑惑
-
上面并没有分析关于引发系统异常的Insert语句的死锁,具体Insert为什么也会有死锁的问题呢?
-
处理超时消息
-
首次入队(Insert)
-
真实执行锁情况
这里可能有疑问,事物二(Insert)为什么会有二级索引的X锁,这个就要说说隐式锁
为了减少锁的数量,提高Mysql的性能,在一些情况下,可以不不加锁,等判断可能发生冲突时才加锁,也就是延迟加锁, -
结果:俩次解决效果
热点行锁,影响系统性能
六、总结
排查分析的过程比较曲折,刚开始只通过系统日志中的一句异常信息,从系统上分析死锁的原因,结果浪费了很多时间,系统日志与MySql死锁日志不一致也就纠结了好久,最后发现,死锁问题不单单只有一个,所以最后采用逐个击破的方式,最后解决了该问题
记录主要注意以下五点,在开发中可以很大程度的避免死锁,
- 事务时间要短,原则是耗时操作不要放到事务中处理,最好将影响并发度的锁,放到时候最后
- 遇到死锁也不要慌,通过死锁日志分析是最有效的,但是要先了解清楚Mysql的锁机制
- 一个表加太多索引,很容易导致死锁,加索引前要仔细思考是否有必要,不能无脑加,同时在写Sql的时候也要考虑,是不是存在死锁的问题,和那些会形成
- 死锁有时候在并发度高的情况才会出现,对业务压测还是很有必要的
- 通过mysql工具可以清晰的分析Sql锁的情况,很好的帮助我们定位问题