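A Case Study of a Deadlock Caused by Concurrent Inserts

Introduction

Yesterday I ran into another deadlock.

As I recall, it was also in January last year that I hit deadlocks several times within a single month. I spent quite a bit of time learning from our DBAs back then, and the following year was uneventful, until yesterday. Maybe January and deadlocks just go well together?

Let's start with the deadlock log we captured: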

LATEST DETECTED DEADLOCK
------------------------
2020-01-17 10:54:43 0x7e65e2a9a700
*** (1) TRANSACTION:
TRANSACTION 2680591430, ACTIVE 0 sec inserting
mysql tables in use 1, locked 1
LOCK WAIT 7 lock struct(s), heap size 1136, 6 row lock(s), undo log entries 2
MySQL thread id 5886047, OS thread handle 138974565865216, query id 899537643   db_user update
INSERT INTO t VALUES  (1,12,' aaaaaa',1,1579229683,1579229683)
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 2528547 page no 15403 n bits 792 index  idx_a_b of table ` db`.`t` trx id 2680591430 lock_mode X locks gap before rec insert intention waiting

*** (2) TRANSACTION:
TRANSACTION 2680591431, ACTIVE 0 sec inserting
mysql tables in use 1, locked 1
7 lock struct(s), heap size 1136, 6 row lock(s), undo log entries 2
MySQL thread id 5886128, OS thread handle 138976059565824, query id 899537680  db_user update
INSERT INTO t VALUES  (2, 23,' aaaaaa',1,1579229683,1579229683)
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 2528547 page no 15403 n bits 792 index  idx_a_b of table ` db`.`t` trx id 2680591431 lock_mode X locks gap before rec
*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 2528547 page no 15403 n bits 792 index  idx_a_b of table ` db`.`t` trx id 2680591431 lock_mode X locks gap before rec insert intention waiting
*** WE ROLL BACK TRANSACTION (2)

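The log shows that the deadlock was triggered by concurrent INSERTs: both transactions hold a gap lock on idx_a_b, and each then requests an insert intention lock on the same gap. This is a typical deadlock caused by the interaction of gap locks and insert intention locks.

A DBA at my previous company once told me: 99.99% of deadlocks can be reproduced by constructing the right data, and if you cannot reproduce one, your analysis is most likely wrong.

So, to verify the guess, let's reproduce the deadlock based on the business scenario in which it occurred. Before that, a quick look at the basic concepts of gap locks and insert intention locks.

Gap Locks

The MySQL documentation (section "Gap Locks") describes gap locks as follows: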

A gap lock is a lock on a gap between index records, or a lock on the gap before the first or after the last index record.

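Put simply, a gap lock locks an interval between index records, or the interval before the first record or after the last record. It does not cover the records themselves. Its main purpose is to prevent phantom reads.

A gap lock plus a record lock forms a so-called next-key lock, which is the basic locking unit under the RR (REPEATABLE READ) isolation level.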
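
To make "the gap between index records" concrete, here is a minimal Python sketch that finds which gap a value falls into. It is only an illustration under simplifying assumptions: the index is treated as a sorted list of single integer keys (the values 1 and 10 are chosen for illustration), and the function name gap_containing is hypothetical, not anything from MySQL.

```python
from bisect import bisect_left

def gap_containing(index_keys, value):
    """Return the open interval (low, high) of the gap that `value` falls into.

    None as a bound stands for "before the first record" or "after the
    last record". Returns None if `value` already exists in the index,
    because then the value sits on a record, not inside a gap.
    """
    keys = sorted(index_keys)
    pos = bisect_left(keys, value)
    if pos < len(keys) and keys[pos] == value:
        return None
    low = keys[pos - 1] if pos > 0 else None
    high = keys[pos] if pos < len(keys) else None
    return (low, high)

# Assumed index contents: two records with keys 1 and 10.
print(gap_containing([1, 10], 5))   # (1, 10)
print(gap_containing([1, 10], 6))   # (1, 10)
print(gap_containing([1, 10], 20))  # (10, None)
```

Note that 5 and 6 land in the same gap, so statements that search for either value end up locking the same interval.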

Insert Intention Locks

Again, here is how the MySQL documentation (section "Insert Intention Locks") describes insert intention locks:

An insert intention lock is a type of gap lock set by INSERT operations prior to row insertion. This lock signals the intent to insert in such a way that multiple transactions inserting into the same index gap need not wait for each other if they are not inserting at the same position within the gap.

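In other words, an insert intention lock is itself a kind of gap lock, taken by an INSERT before the row is inserted. Its main purpose is to let multiple transactions insert different index values into the same gap without waiting for each other, which improves insert throughput.

One more note: there is another kind of lock called an intention lock, which is a table-level lock. An insert intention lock, however, is essentially a gap lock, not an intention lock, so it is a row-level lock rather than a table-level lock. Don't be misled by the name.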
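
The interaction between the two lock types can be summarized in a small Python sketch. This is a simplified model for illustration only, not InnoDB's actual implementation: it considers just plain gap locks and insert intention locks on a single gap, and the names GAP, INSERT_INTENTION and must_wait are hypothetical.

```python
GAP = "gap"
INSERT_INTENTION = "insert_intention"

def must_wait(requested, held_by_others):
    """Return True if a lock request on a gap has to wait.

    requested:      the lock type being requested on the gap
    held_by_others: lock types OTHER transactions already hold on that gap
    """
    if requested == GAP:
        # Gap locks are purely inhibitive: a gap lock request never waits.
        return False
    if requested == INSERT_INTENTION:
        # An insert intention lock waits for gap locks held by other
        # transactions, but not for other insert intention locks.
        return GAP in held_by_others
    raise ValueError(f"unknown lock type: {requested}")

print(must_wait(GAP, [GAP]))                            # False
print(must_wait(INSERT_INTENTION, [INSERT_INTENTION]))  # False
print(must_wait(INSERT_INTENTION, [GAP]))               # True
```

The last case is the one that matters for the deadlock analyzed here: two transactions can both hold a gap lock on the same gap, yet neither can then insert into it.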

Reproducing the Scenario

Version: MySQL 5.7

Isolation level: RR (REPEATABLE READ)

Table definition and initial data:

CREATE TABLE `t1` (
  `id` int(10) NOT NULL AUTO_INCREMENT,
  `a` int(10) DEFAULT NULL,
  `b` int(10) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_a_b` (`a`,`b`)
) ENGINE=InnoDB AUTO_INCREMENT=52 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

insert into t1(a, b) values (1, 1);
insert into t1(a, b) values (10, 1);

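Simplified, the business scenario looks roughly like this (the first column is the point in time; T1 and T2 are the two transactions):

Time  T1                                      T2
t1    begin;                                  begin;
t2    update t1 set b = 0 where a = 5;
t3                                            update t1 set b = 0 where a = 6;
t4    insert into t1(a, b) values (5, 1);
      (blocked)
t5                                            insert into t1(a, b) values (6, 1);
                                              (deadlock)
t6    commit;                                 commit;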

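Executing the statements in the order shown above produced a deadlock, as expected:

[Screenshot omitted: image.png]

The deadlock log is as follows: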

------------------------
LATEST DETECTED DEADLOCK
------------------------
2020-01-18 11:57:45 0x70000ca6a000
*** (1) TRANSACTION:
TRANSACTION 430276, ACTIVE 56 sec inserting
mysql tables in use 1, locked 1
LOCK WAIT 3 lock struct(s), heap size 1136, 2 row lock(s), undo log entries 1
MySQL thread id 25, OS thread handle 123145515114496, query id 439 localhost root update
insert into t1(a, b) values (5, 1)
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 27 page no 4 n bits 880 index idx_a_b of table `test`.`t1` trx id 430276 lock_mode X locks gap before rec insert intention waiting
Record lock, heap no 583 PHYSICAL RECORD: n_fields 3; compact format; info bits 0
 0: len 4; hex 8000000a; asc     ;;
 1: len 4; hex 80000001; asc     ;;
 2: len 4; hex 80000029; asc    );;

*** (2) TRANSACTION:
TRANSACTION 430277, ACTIVE 51 sec inserting
mysql tables in use 1, locked 1
3 lock struct(s), heap size 1136, 2 row lock(s), undo log entries 1
MySQL thread id 26, OS thread handle 123145514557440, query id 440 localhost root update
insert into t1(a, b) values (6, 1)
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 27 page no 4 n bits 880 index idx_a_b of table `test`.`t1` trx id 430277 lock_mode X locks gap before rec
Record lock, heap no 583 PHYSICAL RECORD: n_fields 3; compact format; info bits 0
 0: len 4; hex 8000000a; asc     ;;
 1: len 4; hex 80000001; asc     ;;
 2: len 4; hex 80000029; asc    );;

*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 27 page no 4 n bits 880 index idx_a_b of table `test`.`t1` trx id 430277 lock_mode X locks gap before rec insert intention waiting
Record lock, heap no 583 PHYSICAL RECORD: n_fields 3; compact format; info bits 0
 0: len 4; hex 8000000a; asc     ;;
 1: len 4; hex 80000001; asc     ;;
 2: len 4; hex 80000029; asc    );;

*** WE ROLL BACK TRANSACTION (2)

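The deadlock log from the reproduction shows the same lock pattern as the one from production: each transaction holds an X gap lock on idx_a_b and waits for an insert intention lock on that same gap. This indicates that our guess about the cause was correct. Let's walk through the transactions step by step.

At t2: transaction T1 tries to update a record that does not exist. As noted earlier, the basic locking unit under RR is the next-key lock. Because no record matches a = 5, the lock covers only the gap before the next index record (the log shows "locks gap before rec" on the record with a = 10), so after t2 completes T1 holds the gap lock (1, 10).

At t3: gap locks do not block each other, so after t3 completes T2 also holds the gap lock (1, 10).

At t4: T1 tries to insert a record with a = 5. Before inserting, it must acquire an insert intention lock on the gap the new record falls into, that is, the gap before the record a = 10. But T2 holds the gap lock (1, 10), and a gap lock held by another transaction blocks insert intention locks, so T1 blocks and waits for T2 to release its gap lock (1, 10).

At t5: likewise, T2 requests an insert intention lock on the same gap in order to insert a = 6, and is blocked for the same reason, waiting for T1 to release its gap lock (1, 10).

As a result, T1 and T2 each wait for the other to release its gap lock. This circular wait is a deadlock, which InnoDB detects and resolves by rolling back one of the transactions (T2 in the log above).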
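
The step-by-step analysis above can be replayed as a small wait-for graph in Python. This is a toy model under stated assumptions, not how InnoDB implements deadlock detection: there is a single gap, each blocked transaction waits for exactly one other transaction, and the names find_cycle and replay are hypothetical.

```python
def find_cycle(waits_for):
    """Return the transactions forming a wait cycle, or None.

    waits_for maps a blocked transaction to the transaction it waits for.
    """
    for start in waits_for:
        path = [start]
        current = start
        while current in waits_for:
            current = waits_for[current]
            if current == start:
                return path
            if current in path:
                break  # a cycle that does not pass through `start`
            path.append(current)
    return None

def replay():
    """Replay t2..t5 of the timeline and report the cycle after t4 and t5."""
    gap_holders = set()  # transactions holding the gap lock (1, 10)
    waits_for = {}

    def request_insert(txn):
        # An insert intention lock waits for gap locks held by OTHER txns.
        blockers = gap_holders - {txn}
        if blockers:
            waits_for[txn] = sorted(blockers)[0]
        return find_cycle(waits_for)

    gap_holders.add("T1")              # t2: update ... where a = 5
    gap_holders.add("T2")              # t3: update ... where a = 6
    after_t4 = request_insert("T1")    # t4: insert a = 5, blocked by T2
    after_t5 = request_insert("T2")    # t5: insert a = 6, blocked by T1
    return after_t4, after_t5

print(replay())  # (None, ['T1', 'T2'])
```

After t4 there is only a plain wait (T1 waits for T2). The cycle appears at t5, which is the moment the deadlock is reported.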

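Back to the Real Scenario

In our actual business scenario, the inserts at t4 and t5 are in reality massive concurrent inserts: each transaction may perform over a hundred thousand, or even several hundred thousand, inserts. So we had already violated one of the best practices for transactions:

Avoid large transactions whenever possible.

The fix, then, is to split the huge transaction into smaller ones so that each transaction holds its locks for a shorter time.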
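
One possible way to split a large insert into smaller transactions is sketched below. It is only an illustration under loud assumptions: Python's built-in sqlite3 module stands in for the real MySQL connection so the example runs on its own (a MySQL driver would typically use %s placeholders instead of ?), the batch size of 1000 is arbitrary, and the helper names chunked and insert_in_batches are hypothetical.

```python
import sqlite3
from itertools import islice

def chunked(rows, size):
    """Yield successive lists of at most `size` rows."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

def insert_in_batches(conn, rows, batch_size=1000):
    """Insert rows, committing after every batch. Returns the commit count."""
    commits = 0
    for batch in chunked(rows, batch_size):
        conn.executemany("INSERT INTO t1(a, b) VALUES (?, ?)", batch)
        conn.commit()  # ends the transaction, releasing its locks
        commits += 1
    return commits

# Stand-in database for the sketch; the real table lives in MySQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t1 (id INTEGER PRIMARY KEY AUTOINCREMENT, a INTEGER, b INTEGER)"
)
rows = [(a, 1) for a in range(2500)]
print(insert_in_batches(conn, rows, batch_size=1000))         # 3
print(conn.execute("SELECT COUNT(*) FROM t1").fetchone()[0])  # 2500
```

The trade-off is that the work is no longer all-or-nothing: if a later batch fails, earlier batches are already committed, so the caller has to be able to resume or compensate.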

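That said, in our business scenario concurrent transactions are actually rare, and the transaction runs inside an asynchronous operation, so a simpler and more practical option is to serialize the work with a distributed lock. This is only a second-best solution, though: inserting hundreds of thousands of rows in a single transaction is still strongly discouraged.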