从库记录存在却报错找不到记录(HA_ERR_KEY_NOT_FOUND)？

问题描述和总结

最近遇到这么一个问题，当然这个问题也是遇到很多次了，不管在5.7和8.0中都遇到过，大概报错如下，

2025-02-24T14:12:49.141553+08:00 26 [ERROR] [MY-010584] [Repl] Slave SQL for channel '': Worker 1 failed executing transaction '83e0a90c-faa2-11ed-9b64-000c29ebe88e:39' at master log mysql-bin.000002, end_log_pos 4536; Could not execute Delete_rows event on table tt0224.t3; Can't find record in 't3', Error_code: 1032; Can't find record in 't3', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 4536, Error_code: MY-001032
2025-02-24T14:12:49.153781+08:00 25 [Warning] [MY-010584] [Repl] Slave SQL for channel '': ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details). Error_code: MY-001756

不管是MTS还是单线程的主从都可能遇到，然后查看数据是存在的，而且重启主从能够自愈，查找BUG发现在8027中已经修复了这个BUG，

Replication: The replication applier (SQL) thread overrode retryable errors (such as deadlocks and wait timeouts) from storage engines with a key not found error, causing replication to stop without retrying the transaction. These errors are no longer overridden. (Bug #33107663)

然后仔细进行了查看，虽然BUG只修复了8.0但是也分析了一下5.7代码也有可能遇到，因此总结如下，供参考，我们这里分析以常用的slave_rows_search_algorithms参数设置为例子，及slave_rows_search_algorithms='INDEX_SCAN,HASH_SCAN'或者slave_rows_search_algorithms='INDEX_SCAN,TABLE_SCAN'

如果表中没有主键和索引，也就是表中没有任何索引，这种情况下不管table scan和hash scan都不会触发本BUG
如果表中没有主键只有索引，这种情况下肯定不会用到table scan，但是index scan(5.7不会触发)和hash scan(5.7和8027以下都会触发) 都可能触发本BUG
如果表中有主键，这种情况下只会用到index_scan，不会触发本BUG

因此我们发现对于MySQL而言如果表有主键，确实能够避免很多不必要的BUG更能够提交主从的性能，强烈推荐。
当然这里也不清楚是否分析完了所有场景，如果有遇到报错找不到记录(HA_ERR_KEY_NOT_FOUND)重启后主从自愈的情况也不用过于奇怪。

二、分析过程

实际上使用的不同的数据扫描算法我们提到会用到3个函数分别为，

hash scan Rows_log_event::do_hash_scan_and_update
index scan Rows_log_event::do_index_scan_and_update
table scan Rows_log_event::do_table_scan_and_update

这个BUG主要存在于函数Rows_log_event::next_record_scan中，而这个函数hash scan和index scan都有调用，当innodb层返回错误的时候，在这个函数中某些逻辑被完全屏蔽了直接全部返回HA_ERR_KEY_NOT_FOUND的错误，比如常见的行锁堵塞报错应该是Lock wait timeout exceeded而不应该是HA_ERR_KEY_NOT_FOUND，屏蔽的场景包括了如下，

8.0 中index scan二级索引首次定位数据的时候完全屏蔽完全屏蔽全部返回HA_ERR_KEY_NOT_FOUND
8.0 中hash scan通过event distinct key访问下一行数据的时候完全屏蔽全部返回HA_ERR_KEY_NOT_FOUND
5.7 中hash scan通过event distinct key访问下一行数据的时候完全屏蔽全部返回HA_ERR_KEY_NOT_FOUND

这里贴出8.0.23的代码如下，5.7可自行查看，

image.png

三、测试

这里分别测试5.7.40和8.0.23 作为从库的情况下的报错，我们采用一种5.7和8.0必现的方式，也就是如果表中没有主键只有索引，且使用hash scan的扫描方式，并且构造出通过event distinct key访问下一行数据的时候被innodb 行锁堵塞的情况，这种情况是5.7和8.0都命中BUG的。

构造如下，

注意考虑使用RC隔离级别
set global transaction_isolation=1;


slave:start slave;
master:CREATE TABLE t2 (a INT, b INT, key(a),key(b)) ENGINE=innodb;INSERT INTO t2 VALUES (10, 110), (10, 120),(20, 130), (20, 140),(30, 150), (30, 160),(40, 170), (40, 180),(50, 190), (50, 200);
slave:stop slave;set global slave_transaction_retries=4;set global innodb_lock_wait_timeout = 2;set global slave_parallel_workers=0;SET GLOBAL slave_rows_search_algorithms = 'INDEX_SCAN,HASH_SCAN'; 
master:BEGIN;DELETE FROM t2 where a=40;COMMIT;
slave:BEGIN;SELECT * FROM t2 WHERE a = 40 and b=180 FOR UPDATE; (模拟从库行锁的情况)
slave connect 2:start slave;

5.7.40 测试结果

2025-02-26T16:38:16.508921+08:00 63 [ERROR] Slave SQL for channel '': Could not execute Delete_rows event on table test0226.t2; Can't find record in 't2', Error_code: 1032; Can't find record in 't2', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000001, end_log_pos 7222, Error_code: 1032
2025-02-26T16:38:16.508981+08:00 63 [Warning] Slave: Can't find record in 't2' Error_code: 1032
2025-02-26T16:38:16.508995+08:00 63 [Warning] Slave: Can't find record in 't2' Error_code: 1032
2025-02-26T16:38:16.509018+08:00 63 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000001' position 6929.

Last_SQL_Error: Could not execute Delete_rows event on table test0226.t2; Can't find record in 't2', Error_code: 1032; Can't find record in 't2', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000001, end_log_pos 7222

8.0.23 测试结果

2025-02-26T16:46:59.276072+08:00 13 [ERROR] [MY-010584] [Repl] Slave SQL for channel '': Could not execute Delete_rows event on table tt0224.t2; Can't find record in 't2', Error_code: 1032; Can't find record in 't2', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000018, end_log_pos 1455, Error_code: MY-001032
2025-02-26T16:46:59.276221+08:00 13 [Warning] [MY-010584] [Repl] Slave: Can't find record in 't2' Error_code: MY-001032
2025-02-26T16:46:59.276249+08:00 13 [Warning] [MY-010584] [Repl] Slave: Can't find record in 't2' Error_code: MY-001032
2025-02-26T16:46:59.276267+08:00 13 [ERROR] [MY-010586] [Repl] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000018' position 1146

Last_SQL_Error: Could not execute Delete_rows event on table tt0224.t2; Can't find record in 't2', Error_code: 1032; Can't find record in 't2', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000018, end_log_pos 1455

理论上这种报错应该是Lock wait timeout exceeded如下，

2025-02-26T17:01:42.969940+08:00 19 [Warning] [MY-010897] [Repl] Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.
2025-02-26T17:02:10.475053+08:00 20 [Warning] [MY-010584] [Repl] Slave SQL for channel '': Could not execute Delete_rows event on table tt0224.t2; Lock wait timeout exceeded; try restarting transaction, Error_code: 1205; handler error HA_ERR_LOCK_WAIT_TIMEOUT; the event's master log mysql-bin.000018, end_log_pos 2750, Error_code: MY-001205

四、其他

Rows_log_event::do_index_scan_and_update  这里是index scan的逻辑
   这里值从库可以从event中获取整行记录，来获取主键/索引的相关字段用于查找，所以使用的pos相关的调用
  ->(m_table->file->ha_table_flags() & HA_PRIMARY_KEY_REQUIRED_FOR_POSITION)
    先考虑主键定位逻辑
  ->handler::rnd_pos_by_record
       ->ha_innobase::position
        从记录中获取主键/索引需要的field
          ->key_copy(ref, (uchar *)record, key_info, key_info->key_length)
           如果是主键这里就是获取主键的值
       ->handler::ha_rnd_pos
          ->ha_innobase::rnd_pos 
             主键定位数据
             ->ha_innobase::index_read
              最终调用index read
                ->row_search_mvcc
     如果不是主键定位，则考虑二级索引
     INDEX_SCAN:
     ->if ((error = open_record_scan()))
        对于index scan来讲这里就是准备 m_key
    ->error = next_record_scan(true)
      ....
        
        
              

Rows_log_event::do_hash_scan_and_update
     ->if (int error = this->do_hash_row(rli))
        扫描一个event，将event的BI放入到HASH 中
     ->this->do_scan_and_update(rli)
        通过hash扫描表，
        Rows_log_event::do_scan_and_update    
        ->(error = open_record_scan())
           如果是hash scan则制定m_key为 hash m_distinct_keys中的数据
           如果是index scan 就使用行中关于 索引的部分
           如果为主键扫描则修改active index为primary 
           ->if (m_key_index < MAX_KEY) { 
             如果是索引扫描
             ->if (m_rows_lookup_algorithm == ROW_LOOKUP_HASH_SCAN)
              如果是hash scan
                ->m_itr = m_distinct_keys.begin(); 
              event中存储的关于索引字段的不同的值
                  ->m_key = *m_itr
                 获取第一个key，也就是需要定位的数据
                  ->m_itr++;
                  迭代器+1
             ->else 
              如果是index scan
              key_copy(m_key, m_table->record[0], m_key_info, 0);
           ->else
              如果是全表扫描方式
               if ((error = table->file->ha_rnd_init(true)))
               change_active_index(table_share->primary_key)
               更改扫描索引为主键
        -> do {   
           开启循环
           ->next_record_scan(i == 0)
              初次定位数据，这里根据的是hash里面的m_key进行数据扫描
           .....

从库记录存在却报错找不到记录(HA_ERR_KEY_NOT_FOUND)？