写在前面
       有同学会问,为什么用了英文标题呢?真的不知道怎么用中文来明确的表述这个问题,思来想去,还是用英文来表达更准确、清晰。本来想使用这个中文标题的---“Go语言中数据库事务挂起问题分析”,感觉说的没有那么清楚,不如英文表达的简洁明了。哗众取宠了,见笑见笑!
       闲言少叙,直奔主题,这是笔者一贯的风格。首先也得声明一下,笔者并非Go语言的资深开发者,使用Go语言开发还不到3个月的时间,只是个初级的入门者,跟着一帮写Go的小兄弟们学习,照猫画虎,依葫芦画瓢而已!
问题描述
       先交代一下项目背景,目前在做的一个项目是电商业务平台系统,使用Mysql 5.6与Go语言开发。其中有个order_info表是订单表,为整个业务系统的核心表,很多业务操作都与该表相关,或者插入记录到该表中,或者更新该表中的记录。前几日负责增加了一个功能模块,团购订单的团长免单功能,在原有团购功能的基础上修改。由于团长是无需付款的,所以缺少相应的支付信息,包括支付交易号、支付金额、支付时间等。这里先埋个伏笔,问题就出在这里,且看详细分解。
       问题的现象表现为很多涉及到更新order_info表的业务操作都无法成功(从前端应用来测试),分析问题第一时间得看log,看看log中的错误信息,log中出现很多的Lock wait timeout错误。
time="2019-10-25T21:20:52+08:00" level=error msg="Error 1205: Lock wait timeout exceeded; try restarting transaction"
产生了锁等待超时错误,但这应该只是结果,而不是问题的原因。原因肯定是某个数据库表被锁定了,导致后续很多的操作拿不到锁,因而产生锁等待超时的错误。继续深挖log文件,又发现如下错误:
2019/10/25 18:29:02 cron: panic running job: runtime error: invalid memory address or nil pointer dereference
goroutine 100 [running]:
github.com/robfig/cron.(*Cron).runWithRecovery.func1(0xc0002f0190)
    /Users/weichengli/go/src/awesomeProject/pkg/mod/github.com/robfig/cron@v0.0.0-20180505203441-b41be1df6967/cron.go:161 +0x9e
panic(0x4afcf60, 0x556c130)
    /usr/local/go/src/runtime/panic.go:522 +0x1b5
histore/order/service/order.(*Service).pushOrderToMiddle(0xc0001eb6b8, 0xc0004305b0, 0x4765e, 0xc000306900, 0x16, 0xc000348586, 0x7, 0xc000348590, 0x9, 0x3ff0000000000000, ...)
    /Users/weichengli/go/src/nov/order-module/service/order/order.go:314 +0x23e
histore/order/service/order.(*Service).updateOrderStatus(0xc0001eb6b8, 0xc0004305b0, 0x4765e, 0xc000306900, 0x16, 0xc000348586, 0x7, 0xc000348590, 0x9, 0x3ff0000000000000, ...)
    /Users/weichengli/go/src/nov/order-module/service/order/order.go:102 +0xdf7
histore/order/service/order.(*Service).JobUpdateOrder(0xc0003dd6b8, 0xc000039320, 0x16, 0x4c61c41, 0x2, 0x0, 0xc0000ec9c0)
    /Users/weichengli/go/src/nov/order-module/service/order/order.go:206 +0x1e3
histore/order/job/task.AutoSixty()
    /Users/weichengli/go/src/nov/order-module/job/task/order.go:98 +0x4cb
github.com/robfig/cron.FuncJob.Run(0x4caba08)
    /Users/weichengli/go/src/awesomeProject/pkg/mod/github.com/robfig/cron@v0.0.0-20180505203441-b41be1df6967/cron.go:92 +0x25
github.com/robfig/cron.(*Cron).runWithRecovery(0xc0002f0190, 0x4dfec00, 0x4caba08)
    /Users/weichengli/go/src/awesomeProject/pkg/mod/github.com/robfig/cron@v0.0.0-20180505203441-b41be1df6967/cron.go:165 +0x57
created by github.com/robfig/cron.(*Cron).run
    /Users/weichengli/go/src/awesomeProject/pkg/mod/github.com/robfig/cron@v0.0.0-20180505203441-b41be1df6967/cron.go:199 +0x749
panic是Go抛出运行时异常的方式,类似于java中的throw。“invalid memory address or nil pointer dereference”非法内存地址或者空指针引用肯定是有不对的地方,但目前还无法确定是不是该问题的本质原因。
问题分析
开头已经提到过,我也是Go的初学者,并不资深,我也是做中学,一边开发,一边学习。到这里我觉得有必要先讲一讲Go中的错误处理与异常处理,这将对理解本文分析的问题有所帮助。
这里插播Go的错误与异常处理机制相关的内容。
       错误和异常从Go机制上讲,就是error和panic。Go中引入error接口类型作为错误处理的标准模式,如果函数要返回错误,则返回值类型列表中肯定包含error。error处理过程类似于C语言中的错误代码,可逐层返回,直到被处理。
       Go中引入两个内置函数panic和recover来触发和终止异常处理流程,同时引入关键字defer来延迟执行defer后面的函数。一直等到包含defer语句的函数执行完毕时,延迟函数后面的语句才会被执行,而不管包含defer语句的函数是通过return的正常结束,还是由于panic导致的异常结束。你可以在一个函数中执行多条defer语句,它们的执行顺序与声明顺序相反。因为当Go的代码执行时,如果遇到defer的闭包调用,则压入堆栈。当函数返回时,会按照“后进先出”的顺序调用闭包。
       当程序运行时,如果遇到引用空指针、下标越界或显式调用panic函数等情况,则先触发panic函数的执行,然后调用延迟函数。调用者继续传递panic,因此该过程一直在调用栈中重复发生。如果一路在延迟函数中没有recover函数的调用,则会到达该例程(Go中的routine)的起点,该例程结束,然后终止其他所有例程,包括主例程(类似于C语言中的主线程,该例程 ID为1)。
       Goroutine如果panic了,并且没有recover,那么整个Go进程就会异常退出。在Go当中不存在tye ... catch异常处理逻辑,在Go当中使用defer,panic和recover来控制程序执行流程,以此来达到处理异常的目的。recover函数可以在任何地方执行,但只有在defer代码块中执行才能真正达到处理异常的目的。
Go的错误和异常是可以互相转换的:
- 错误转异常,比如程序逻辑上尝试请求某个URL,最多尝试三次,尝试三次的过程中请求失败是错误,尝试完第三次还不成功的话,失败就被提升为异常了;
- 异常转错误,比如panic触发的异常被recover恢复后,将返回值中error类型的变量进行赋值,以便上层函数继续走错误处理流程;
- panic 异常处理机制不会自动将错误信息传递给error。
关于Go中的错误与异常就讲到这里,咱们回过头来继续探讨本文的问题。
继续追踪那个panic异常,根据提示的代码行找到相关的代码处,出问题的代码指示到下面这一行。
"payTime":         payment.PayTime.Unix() * 1000,
       通过分析,原来是团长免单的订单,因为缺失了付款信息,所以payment.PayTime是nil,当调用payment.PayTime.Unix()方法时,抛出了空指针异常。
       再来分析一下整个操作的代码结构,出现问题的代码结构如下:
    tx := db.Db().Begin()
    err = s.updateOrderStatus(tx, orderInfo, orderStatus, "job", map[string]interface{}{})
    if err != nil {
        tx.Rollback()
        return err
    }
    tx.Commit()
    if err = tx.Error; err != nil {
        tx.Rollback()
    }
出错的地方位于s.updateOrderStatus()方法中:
    data := map[string]interface{}{
        "orderId":         o.OrderId,
        "payTime":         payment.PayTime.Unix() * 1000,
        // 其他部分省略
    }
先开始一个数据库事务(transaction),然后更新订单状态(s.updateOrderStatus),如果有错误就回滚(tx.Rollback),没有错误就提交(tx.Commit)。整个流程都是按照错误的方式来处理的,但问题是panic抛出的是一个异常,并没有被处理(没有defer和recover)。怀疑是这个原因导致routine卡在那里不动(hang),事务(transaction)无法提交也无法回滚。为了确定这个原因,再从数据库的角度来做些分析。
数据库分析
首先打开语句事件记录表,这些表记录了语句事件信息,当前语句事件表events_statements_current、历史语句事件表events_statements_history和长语句历史事件表events_statements_history_long。
update performance_schema.setup_consumers set ENABLED = 'YES' where NAME='events_statements_history'; 
update performance_schema.setup_consumers set ENABLED = 'YES' where NAME='events_statements_history_long'; 
查看数据库中正在运行的transactions,查询Mysql系统表:SELECT * FROM information_schema.INNODB_TRX;
trx_id|trx_state|trx_started|trx_requested_lock_id|trx_wait_started|trx_weight|trx_mysql_thread_id|trx_query|trx_operation_state|trx_tables_in_use|trx_tables_locked|trx_lock_structs|trx_lock_memory_bytes|trx_rows_locked|trx_rows_modified|trx_concurrency_tickets|trx_isolation_level|trx_unique_checks|trx_foreign_key_checks|trx_last_foreign_key_error|trx_adaptive_hash_latched|trx_adaptive_hash_timeout|trx_is_read_only|trx_autocommit_non_locking
:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:
122483689|LOCK WAIT|2019-10-23 02:44:23|122483689:1176:5312:244|2019-10-23 02:44:23|2|2668220|UPDATE `order_info` SET `order_status` = ?, `update_time` = ?, `updater` = ?, `version` = ?  WHERE (order_id = ? and version = ?)|starting index read|1|1|2|360|1|0|0|REPEATABLE READ|1|1|(NULL)|0|10000|0|0
122457223|RUNNING|2019-10-23 02:12:55|(NULL)|(NULL)|17|2656633|(NULL)|(NULL)|0|0|11|2936|6|6|0|REPEATABLE READ|1|1|(NULL)|0|10000|0|0
122457217|RUNNING|2019-10-23 02:12:55|(NULL)|(NULL)|5|2670233|(NULL)|(NULL)|0|0|3|360|1|2|0|REPEATABLE READ|1|1|(NULL)|0|9978|0|0
122344697|RUNNING|2019-10-22 23:43:45|(NULL)|(NULL)|4|2664334|(NULL)|(NULL)|0|0|3|1184|2|1|0|REPEATABLE READ|1|1|(NULL)|0|10000|0|0
122343017|RUNNING|2019-10-22 23:42:34|(NULL)|(NULL)|4|2660164|(NULL)|(NULL)|0|0|3|1184|2|1|0|REPEATABLE READ|1|1|(NULL)|0|10000|0|0
122340864|RUNNING|2019-10-22 23:41:23|(NULL)|(NULL)|4|2660447|(NULL)|(NULL)|0|0|3|1184|2|1|0|REPEATABLE READ|1|1|(NULL)|0|10000|0|0
122340699|RUNNING|2019-10-22 23:40:12|(NULL)|(NULL)|4|2665307|(NULL)|(NULL)|0|0|3|1184|2|1|0|REPEATABLE READ|1|1|(NULL)|0|10000|0|0
122340522|RUNNING|2019-10-22 23:39:01|(NULL)|(NULL)|4|2656119|(NULL)|(NULL)|0|0|3|1184|2|1|0|REPEATABLE READ|1|1|(NULL)|0|10000|0|0
122158559|RUNNING|2019-10-22 20:00:02|(NULL)|(NULL)|4|2660446|(NULL)|(NULL)|0|0|3|1184|2|1|0|REPEATABLE READ|1|1|(NULL)|0|10000|0|0
122149815|RUNNING|2019-10-22 19:49:07|(NULL)|(NULL)|4|2660197|(NULL)|(NULL)|0|0|3|1184|2|1|0|REPEATABLE READ|1|1|(NULL)|0|10000|0|0
122136918|RUNNING|2019-10-22 19:33:56|(NULL)|(NULL)|4|2656543|(NULL)|(NULL)|0|0|3|1184|2|1|0|REPEATABLE READ|1|1|(NULL)|0|10000|0|0
       从上面可以看到有很多长时间处于RUNNING状态的事务(transaction),显然不正常了,主要是后面的8个事务执行时间过长。
       接下来看看那些地方发生了锁等待,主要查看information_schema.innodb_lock_waits表。
SELECT r.trx_state wating_trx_state, r.trx_id waiting_trx_id, r.trx_mysql_thread_Id waiting_thread,r.trx_query waiting_query,b.trx_state blocking_trx_state,b.trx_id blocking_trx_id,b.trx_mysql_thread_id blocking_thread,b.trx_query blocking_query FROM information_schema.innodb_lock_waits w INNER JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id INNER JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id;
返回结果如下:
wating_trx_state|waiting_trx_id|waiting_thread|waiting_query|blocking_trx_state|blocking_trx_id|blocking_thread|blocking_query
:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:
LOCK WAIT|122483937|2656383|UPDATE `order_info` SET `order_status` = ?, `update_time` = ?, `updater` = ?, `version` = ?  WHERE (order_id = ? and version = ?)|RUNNING|122340522|2656119|(NULL)
LOCK WAIT|122483935|2667970|UPDATE `order_info` SET `order_status` = ?, `update_time` = ?, `updater` = ?, `version` = ?  WHERE (order_id = ? and version = ?)|RUNNING|122457223|2656633|(NULL)
LOCK WAIT|122483785|2667953|UPDATE `order_info` SET `order_status` = ?, `update_time` = ?, `updater` = ?, `version` = ?  WHERE (order_id = ? and version = ?)|RUNNING|122136918|2656543|(NULL)
       处于LOCK WAIT状态的事务基本都是被后面8个事务阻塞了。
       查看RUNNING状态的事务(transaction)的sql执行情况:
select tmp.*,c.SQL_Text blocking_sql_text,p.Host blocking_host from (SELECT r.trx_state wating_trx_state, r.trx_id waiting_trx_id, r.trx_mysql_thread_Id waiting_thread,r.trx_query waiting_query,b.trx_state blocking_trx_state,b.trx_id blocking_trx_id,b.trx_mysql_thread_id blocking_thread,b.trx_query blocking_query FROM information_schema.innodb_lock_waits w INNER JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id INNER JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id ) tmp ,information_schema.processlist p, performance_schema.events_statements_current c,performance_schema.threads t where tmp.blocking_thread=p.id and  t.thread_id= c.THREAD_ID and t.PROCESSLIST_ID=p.id;
返回结果如下:
wating_trx_state|waiting_trx_id|waiting_thread|waiting_query|blocking_trx_state|blocking_trx_id|blocking_thread|blocking_query|blocking_sql_text|blocking_host
:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:
LOCK WAIT|122485624|2667940|UPDATE `order_info` SET `order_status` = ?, `update_time` = ?, `updater` = ?, `version` = ?  WHERE (order_id = ? and version = ?)|RUNNING|122344697|2664334|(NULL)|insert into `order_info_his` (`id`,`order_id`,`source`,`user_id`,`order_type`,...) select `id`,`order_id`,`source`,`user_id`,`order_type`,... from `order_info` where `order_id` = '13196300'|10.236.233.76:36978
       blocking_sql_text字段显示了处于RUNNING状态的事务最后一条执行的sql,与代码逻辑吻合,应该能够确定就是这个原因了。
       接下来再查看一下当前数据库的运行状态,使用这个命令。
SHOW ENGINE INNODB STATUS;
这里提醒一下,如果你在navicat里面执行SHOW ENGINE INNODB STATUS;命令时,不要被它的输出迷惑了,如下图:

貌似没有任何输出,其实不然,选中Status列、全选、copy一下,粘贴到文本编辑器里面发现会有好多的内容哦!
Type    Name    Status
InnoDB      
=====================================
2019-10-23 02:46:21 7f7f339e7700 INNODB MONITOR OUTPUT
=====================================
Per second averages calculated from the last 6 seconds
-----------------
BACKGROUND THREAD
-----------------
srv_master_thread loops: 2231560 srv_active, 0 srv_shutdown, 5577242 srv_idle
srv_master_thread log flush and writes: 7808787
----------
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 529681608
OS WAIT ARRAY INFO: signal count 2732585540
Mutex spin waits 4358516747, rounds 5446414263, OS waits 32829842
RW-shared spins 1173175449, rounds 17999381512, OS waits 286852602
RW-excl spins 501021259, rounds 13539177896, OS waits 165436827
Spin rounds per wait: 85.70 mutex, 15.34 RW-shared, 27.02 RW-excl
------------------------
LATEST DETECTED DEADLOCK
------------------------
------------
TRANSACTIONS
------------
Trx id counter 122486240
Purge done for trx's n:o < 122136919 undo n:o < 0 state: running but idle
History list length 106505
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 122486229, not started
MySQL thread id 2670922, OS thread handle 0x7f7f33924700, query id 234691154 10.236.50.95 root
---TRANSACTION 122486197, not started
MySQL thread id 2670916, OS thread handle 0x7f7f3265a700, query id 234691058 10.236.213.35 root
---TRANSACTION 122485687, not started
MySQL thread id 2670915, OS thread handle 0x7f7f3275e700, query id 234690177 10.236.213.35 root
---TRANSACTION 122483931, not started
MySQL thread id 2670906, OS thread handle 0x7f7f33df7700, query id 234686929 10.236.62.246 root
---TRANSACTION 122486195, not started
MySQL thread id 2670903, OS thread handle 0x7f7f33efb700, query id 234691052 10.236.50.98 root
---TRANSACTION 122486063, not started
MySQL thread id 2670855, OS thread handle 0x7f8030363700, query id 234690724 10.236.231.109 root
---TRANSACTION 122485715, not started
MySQL thread id 2670854, OS thread handle 0x7f7f33c30700, query id 234690250 10.236.231.109 root
---TRANSACTION 0, not started
MySQL thread id 2670743, OS thread handle 0x7f7f32862700, query id 234666710 10.133.68.182 root
---TRANSACTION 122482088, not started
MySQL thread id 2670677, OS thread handle 0x7f802852d700, query id 234680517 10.133.68.182 root
---TRANSACTION 122471519, not started
MySQL thread id 2670646, OS thread handle 0x7f7f32966700, query id 234654632 10.133.68.182 root
---TRANSACTION 122485183, not started
MySQL thread id 2670629, OS thread handle 0x7f7f31a2a700, query id 234689190 10.236.50.98 root
---TRANSACTION 122470639, not started
MySQL thread id 2670575, OS thread handle 0x7f7f32925700, query id 234653396 10.236.50.118 root
---TRANSACTION 122472334, not started
MySQL thread id 2670347, OS thread handle 0x7f7f339e7700, query id 234691205 10.133.68.182 root init
SHOW ENGINE INNODB STATUS
---TRANSACTION 122451854, not started
MySQL thread id 2670281, OS thread handle 0x7f7f33618700, query id 234604917 10.133.68.182 root
---TRANSACTION 122456793, not started
MySQL thread id 2670256, OS thread handle 0x7f80303e5700, query id 234617401 10.236.50.98 root
---TRANSACTION 122486230, not started
MySQL thread id 2670250, OS thread handle 0x7f8028672700, query id 234691155 10.236.50.95 root
---TRANSACTION 122483942, not started
MySQL thread id 2670162, OS thread handle 0x7f7f33f7d700, query id 234686979 10.236.62.246 root
---TRANSACTION 122457038, not started
MySQL thread id 2670161, OS thread handle 0x7f7f32cb3700, query id 234618217 10.236.50.98 root
---TRANSACTION 122426396, not started
MySQL thread id 2669677, OS thread handle 0x7f7f3279f700, query id 234538919 10.236.50.98 root
---TRANSACTION 122486211, not started
MySQL thread id 2669191, OS thread handle 0x7f7f33e38700, query id 234691096 10.236.50.151 root
---TRANSACTION 122407031, not started
MySQL thread id 2669190, OS thread handle 0x7f7f33eba700, query id 234490202 10.236.50.98 root
---TRANSACTION 122486067, not started
MySQL thread id 2668220, OS thread handle 0x7f802809b700, query id 234691007 10.236.233.76 root
---TRANSACTION 122486138, not started
MySQL thread id 2668219, OS thread handle 0x7f7f33082700, query id 234690939 10.236.233.76 root
---TRANSACTION 122485706, not started
MySQL thread id 2667970, OS thread handle 0x7f80283e8700, query id 234690732 10.236.233.76 root
---TRANSACTION 122485712, not started
MySQL thread id 2667953, OS thread handle 0x7f7f3369a700, query id 234690366 10.236.233.76 root
---TRANSACTION 122485756, not started
MySQL thread id 2667940, OS thread handle 0x7f80285f0700, query id 234690374 10.236.233.76 root
---TRANSACTION 122486237, not started
MySQL thread id 2666279, OS thread handle 0x7f80304a8700, query id 234691190 10.236.50.151 root
---TRANSACTION 122486233, not started
MySQL thread id 2665405, OS thread handle 0x7f7f31f7f700, query id 234691165 10.236.50.95 root
---TRANSACTION 122486235, not started
MySQL thread id 2665308, OS thread handle 0x7f7f32db7700, query id 234691178 10.236.50.151 root
---TRANSACTION 122471583, not started
MySQL thread id 2665305, OS thread handle 0x7f7f33104700, query id 234654958 10.236.62.157 root
---TRANSACTION 122162846, not started
MySQL thread id 2664416, OS thread handle 0x7f7f33000700, query id 233872473 10.236.50.69 root
---TRANSACTION 122485758, not started
MySQL thread id 2664333, OS thread handle 0x7f8028631700, query id 234690934 10.236.233.76 root
---TRANSACTION 122486231, not started
MySQL thread id 2662683, OS thread handle 0x7f7f32d35700, query id 234691158 10.236.50.95 root
---TRANSACTION 122486228, not started
MySQL thread id 2662666, OS thread handle 0x7f7f3228b700, query id 234691149 10.236.50.95 root
---TRANSACTION 122486126, not started
MySQL thread id 2660451, OS thread handle 0x7f7f32e39700, query id 234690889 10.236.50.151 root
---TRANSACTION 122486129, not started
MySQL thread id 2660452, OS thread handle 0x7f8028429700, query id 234690907 10.236.50.151 root
---TRANSACTION 122486232, not started
MySQL thread id 2660449, OS thread handle 0x7f803062e700, query id 234691166 10.236.50.151 root
---TRANSACTION 122486238, not started
MySQL thread id 2660448, OS thread handle 0x7f7f32556700, query id 234691196 10.236.50.151 root
---TRANSACTION 122486236, not started
MySQL thread id 2660450, OS thread handle 0x7f8030732700, query id 234691184 10.236.50.151 root
---TRANSACTION 122486066, not started
MySQL thread id 2660180, OS thread handle 0x7f8031b50700, query id 234690735 10.236.233.76 root
---TRANSACTION 122486212, not started
MySQL thread id 2659957, OS thread handle 0x7f8030467700, query id 234691159 10.236.50.151 root
---TRANSACTION 122486130, not started
MySQL thread id 2659954, OS thread handle 0x7f7f32209700, query id 234690914 10.236.50.151 root
---TRANSACTION 122486132, not started
MySQL thread id 2659956, OS thread handle 0x7f7f31ebc700, query id 234690926 10.236.50.151 root
---TRANSACTION 122486234, not started
MySQL thread id 2659955, OS thread handle 0x7f7f332cb700, query id 234691172 10.236.50.151 root
---TRANSACTION 122486128, not started
MySQL thread id 2659708, OS thread handle 0x7f7f32aec700, query id 234690901 10.236.50.151 root
---TRANSACTION 122486127, not started
MySQL thread id 2659707, OS thread handle 0x7f7f3375d700, query id 234690895 10.236.50.151 root
---TRANSACTION 122486137, not started
MySQL thread id 2659468, OS thread handle 0x7f7f321c8700, query id 234690932 10.236.50.151 root
---TRANSACTION 122485755, not started
MySQL thread id 2656383, OS thread handle 0x7f7f3328a700, query id 234690372 10.236.233.76 root
---TRANSACTION 122485757, not started
MySQL thread id 2656069, OS thread handle 0x7f7f31db8700, query id 234690936 10.236.233.76 root
---TRANSACTION 122485722, not started
MySQL thread id 2656039, OS thread handle 0x7f7f33555700, query id 234690365 10.236.233.76 root
---TRANSACTION 122486210, not started
MySQL thread id 2655524, OS thread handle 0x7f80284ec700, query id 234691095 10.236.213.35 root
---TRANSACTION 122483988, not started
MySQL thread id 2648887, OS thread handle 0x7f80285af700, query id 234687087 10.236.50.98 root
---TRANSACTION 122483879, not started
MySQL thread id 2602946, OS thread handle 0x7f8031a4c700, query id 234686798 10.236.50.98 root
---TRANSACTION 122485092, not started
MySQL thread id 2416178, OS thread handle 0x7f80300a4700, query id 234688898 10.236.50.121 root
---TRANSACTION 122485741, not started
MySQL thread id 1170262, OS thread handle 0x7f7f33041700, query id 234690312 10.236.62.247 root
---TRANSACTION 122485742, not started
MySQL thread id 1170263, OS thread handle 0x7f8031b0f700, query id 234690313 10.236.62.247 root
---TRANSACTION 122486239, ACTIVE 0 sec
MySQL thread id 2669192, OS thread handle 0x7f7f335d7700, query id 234691201 10.236.50.151 root
Trx read view will not see trx with id >= 122486240, sees < 122136918
---TRANSACTION 122486139, ACTIVE 9 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 2 lock struct(s), heap size 360, 1 row lock(s)
MySQL thread id 2656135, OS thread handle 0x7f803019c700, query id 234690945 10.236.233.76 root updating
UPDATE `order_info` SET `order_status` = ?, `update_time` = ?, `updater` = ?, `version` = ?  WHERE (order_id = ? and version = ?)
Trx read view will not see trx with id >= 122486140, sees < 122136918
------- TRX HAS BEEN WAITING 9 SEC FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 1176 page no 5312 n bits 376 index `order_id` of table `order_v2`.`order_info` trx id 122486139 lock_mode X locks rec but not gap waiting
Record lock, heap no 234 PHYSICAL RECORD: n_fields 2; compact format; info bits 0
 0: len 22; hex 31313139313031383135303630373539343034333030; asc 1119101815060759404300;;
 1: len 8; hex 80000000000474f6; asc       t ;;
------------------
---TRANSACTION 122457223, ACTIVE 2006 sec
11 lock struct(s), heap size 2936, 6 row lock(s), undo log entries 6
MySQL thread id 2656633, OS thread handle 0x7f803056b700, query id 234618846 10.236.213.35 root
Trx read view will not see trx with id >= 122457224, sees < 122136918
---TRANSACTION 122457217, ACTIVE 2006 sec
3 lock struct(s), heap size 360, 1 row lock(s), undo log entries 2
MySQL thread id 2670233, OS thread handle 0x7f7f32187700, query id 234618816 10.236.50.95 root
---TRANSACTION 122344697, ACTIVE 10956 sec
3 lock struct(s), heap size 1184, 2 row lock(s), undo log entries 1
MySQL thread id 2664334, OS thread handle 0x7f8031a8d700, query id 234332014 10.236.233.76 root
Trx read view will not see trx with id >= 122344698, sees < 121945996
---TRANSACTION 122343017, ACTIVE 11027 sec
3 lock struct(s), heap size 1184, 2 row lock(s), undo log entries 1
MySQL thread id 2660164, OS thread handle 0x7f7f31aac700, query id 234327626 10.236.233.76 root
Trx read view will not see trx with id >= 122343018, sees < 121944883
---TRANSACTION 122340864, ACTIVE 11098 sec
3 lock struct(s), heap size 1184, 2 row lock(s), undo log entries 1
MySQL thread id 2660447, OS thread handle 0x7f7f32ebb700, query id 234323482 10.236.233.76 root
Trx read view will not see trx with id >= 122340865, sees < 121943543
---TRANSACTION 122340699, ACTIVE 11169 sec
3 lock struct(s), heap size 1184, 2 row lock(s), undo log entries 1
MySQL thread id 2665307, OS thread handle 0x7f7f319a8700, query id 234322680 10.236.233.76 root
Trx read view will not see trx with id >= 122340700, sees < 121943543
---TRANSACTION 122340522, ACTIVE 11240 sec
3 lock struct(s), heap size 1184, 2 row lock(s), undo log entries 1
MySQL thread id 2656119, OS thread handle 0x7f8030322700, query id 234321849 10.236.233.76 root
Trx read view will not see trx with id >= 122340523, sees < 121941814
---TRANSACTION 122158559, ACTIVE 24379 sec
3 lock struct(s), heap size 1184, 2 row lock(s), undo log entries 1
MySQL thread id 2660446, OS thread handle 0x7f7f33a69700, query id 233861810 10.236.233.76 root
Trx read view will not see trx with id >= 122158560, sees < 121766131
---TRANSACTION 122149815, ACTIVE 25034 sec
3 lock struct(s), heap size 1184, 2 row lock(s), undo log entries 1
MySQL thread id 2660197, OS thread handle 0x7f7f33db6700, query id 233839703 10.236.233.76 root
Trx read view will not see trx with id >= 122149816, sees < 121758313
---TRANSACTION 122136918, ACTIVE 25945 sec
3 lock struct(s), heap size 1184, 2 row lock(s), undo log entries 1
MySQL thread id 2656543, OS thread handle 0x7f80300e5700, query id 233807568 10.236.233.76 root
Trx read view will not see trx with id >= 122136919, sees < 121746243
--------
FILE I/O
--------
I/O thread 0 state: waiting for completed aio requests (insert buffer thread)
I/O thread 1 state: waiting for completed aio requests (log thread)
I/O thread 2 state: waiting for completed aio requests (read thread)
I/O thread 3 state: waiting for completed aio requests (read thread)
I/O thread 4 state: waiting for completed aio requests (read thread)
I/O thread 5 state: waiting for completed aio requests (read thread)
I/O thread 6 state: waiting for completed aio requests (write thread)
I/O thread 7 state: waiting for completed aio requests (write thread)
I/O thread 8 state: waiting for completed aio requests (write thread)
I/O thread 9 state: waiting for completed aio requests (write thread)
Pending normal aio reads: 0 [0, 0, 0, 0] , aio writes: 0 [0, 0, 0, 0] ,
 ibuf aio reads: 0, log i/o's: 0, sync i/o's: 0
Pending flushes (fsync) log: 0; buffer pool: 0
9083086695 OS file reads, 101938183 OS file writes, 8769098 OS fsyncs
12.83 reads/s, 16384 avg bytes/read, 0.17 writes/s, 0.17 fsyncs/s
-------------------------------------
INSERT BUFFER AND ADAPTIVE HASH INDEX
-------------------------------------
Ibuf: size 1, free list len 3090, seg size 3092, 14256403 merges
merged operations:
 insert 32898420, delete mark 6919605, delete 1066026
discarded operations:
 insert 0, delete mark 0, delete 0
Hash table size 276671, node heap has 620 buffer(s)
1920.01 hash searches/s, 1921.35 non-hash searches/s
---
LOG
---
Log sequence number 79861148875
Log flushed up to   79861148827
Pages flushed up to 79861148576
Last checkpoint at  79861148576
0 pending log writes, 0 pending chkp writes
2756069 log i/o's done, 0.17 log i/o's/second
----------------------
BUFFER POOL AND MEMORY
----------------------
Total memory allocated 137363456; in additional pool allocated 0
Dictionary memory allocated 4661075
Buffer pool size   8191
Free buffers       1024
Database pages     6547
Old database pages 2396
Modified db pages  5
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 169199422, not young 573165874082
46.66 youngs/s, 101.98 non-youngs/s
Pages read 9083101188, created 2691203, written 96587546
12.83 reads/s, 0.00 creates/s, 0.00 writes/s
Buffer pool hit rate 999 / 1000, young-making rate 4 / 1000 not 9 / 1000
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 6547, unzip_LRU len: 0
I/O sum[37609]:cur[0], unzip sum[0]:cur[0]
--------------
ROW OPERATIONS
--------------
0 queries inside InnoDB, 0 queries in queue
11 read views open inside InnoDB
Main thread process no. 1, id 140188462343936, state: sleeping
Number of rows inserted 57396241, updated 46798608, deleted 5486731, read 330712507867
0.17 inserts/s, 0.00 updates/s, 0.00 deletes/s, 2313.28 reads/s
----------------------------
END OF INNODB MONITOR OUTPUT
============================
上面Status的内容太多,不再一一解释说明了,拿出一条作为例子说明一下(从后面的8个事务中取一条):
---TRANSACTION 122136918, ACTIVE 25945 sec
3 lock struct(s), heap size 1184, 2 row lock(s), undo log entries 1
MySQL thread id 2656543, OS thread handle 0x7f80300e5700, query id 233807568 10.236.233.76 root
Trx read view will not see trx with id >= 122136919, sees < 121746243
       从上面信息可知,该事务占用了2个行锁,undo log entries为1,说明没有回滚也没有提交,与本文的问题表现吻合。
       也可以使用如下的语句查看更详细的sql执行信息:
SELECT ps.id as processlist_id,
             trx_started, trx_isolation_level,
             esh.EVENT_ID,
             esh.TIMER_WAIT,
             esh.event_name as EVENT_NAME,
             esh.sql_text as SQL_text,
             esh.RETURNED_SQLSTATE, esh.MYSQL_ERRNO, esh.MESSAGE_TEXT, esh.ERRORS, esh.WARNINGS   
FROM information_schema.innodb_trx trx
JOIN information_schema.processlist ps ON trx.trx_mysql_thread_id = ps.id
LEFT JOIN performance_schema.threads th ON th.processlist_id = trx.trx_mysql_thread_id
LEFT JOIN performance_schema.events_statements_history esh ON esh.thread_id = th.thread_id
WHERE trx.trx_started < CURRENT_TIME - INTERVAL 10 second
  AND ps.USER != 'SYSTEM_USER' ORDER BY processlist_id, EVENT_ID;
返回结果如下所示:
processlist_id|trx_started|trx_isolation_level|EVENT_ID|TIMER_WAIT|EVENT_NAME|SQL_text|RETURNED_SQLSTATE|MYSQL_ERRNO|MESSAGE_TEXT|ERRORS|WARNINGS
:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:
50099|2019-10-29 06:50:01|REPEATABLE READ|37122|141286000|statement/com/Prepare|(NULL)|(NULL)|0|(NULL)|0|0
50099|2019-10-29 06:50:01|REPEATABLE READ|37123|413441000|statement/com/Execute|(NULL)|(NULL)|0|(NULL)|0|0
50099|2019-10-29 06:50:01|REPEATABLE READ|37124|9923000|statement/com/Close stmt|(NULL)|(NULL)|0|(NULL)|0|0
50099|2019-10-29 06:50:01|REPEATABLE READ|37125|85238000|statement/com/Prepare|(NULL)|(NULL)|0|(NULL)|0|0
50099|2019-10-29 06:50:01|REPEATABLE READ|37126|1011588309000|statement/com/Execute|(NULL)|00000|0|Rows matched: 1  Changed: 1  Warnings: 0|0|0
50099|2019-10-29 06:50:01|REPEATABLE READ|37127|13898000|statement/com/Close stmt|(NULL)|(NULL)|0|(NULL)|0|0
50099|2019-10-29 06:50:01|REPEATABLE READ|37128|444551000|statement/sql/insert_select|insert into `order_info_his` (`id`,`order_id`,`source`,`user_id`,`order_type`,...) select `id`,`order_id`,`source`,`user_id`,`order_type`,... from `order_info` where `order_id` = '11191564'|00000|0|Records: 1  Duplicates: 0  Warnings: 0|0|0
50099|2019-10-29 06:50:01|REPEATABLE READ|37129|103372000|statement/com/Prepare|(NULL)|(NULL)|0|(NULL)|0|0
50099|2019-10-29 06:50:01|REPEATABLE READ|37130|813128000|statement/com/Execute|(NULL)|(NULL)|0|(NULL)|0|0
50099|2019-10-29 06:50:01|REPEATABLE READ|37131|5560000|statement/com/Close stmt|(NULL)|(NULL)|0|(NULL)|0|0
96605|2019-10-29 06:55:52|REPEATABLE READ|12533|173214000|statement/com/Prepare|(NULL)|(NULL)|0|(NULL)|0|0
96605|2019-10-29 06:55:52|REPEATABLE READ|12534|621881000|statement/com/Execute|(NULL)|(NULL)|0|(NULL)|0|0
96605|2019-10-29 06:55:52|REPEATABLE READ|12535|12524000|statement/com/Close stmt|(NULL)|(NULL)|0|(NULL)|0|0
96605|2019-10-29 06:55:52|REPEATABLE READ|12536|100068000|statement/com/Prepare|(NULL)|(NULL)|0|(NULL)|0|0
96605|2019-10-29 06:55:52|REPEATABLE READ|12537|921222563000|statement/com/Execute|(NULL)|00000|0|Rows matched: 1  Changed: 1  Warnings: 0|0|0
96605|2019-10-29 06:55:52|REPEATABLE READ|12538|30867000|statement/com/Close stmt|(NULL)|(NULL)|0|(NULL)|0|0
96605|2019-10-29 06:55:52|REPEATABLE READ|12539|337314000|statement/sql/insert_select|insert into `order_info_his` (`id`,`order_id`,`source`,`user_id`,`order_type`,...) select `id`,`order_id`,`source`,`user_id`,`order_type`,... from `order_info` where `order_id` = '11196379'|00000|0|Records: 1  Duplicates: 0  Warnings: 0|0|0
96605|2019-10-29 06:55:52|REPEATABLE READ|12540|87862000|statement/com/Prepare|(NULL)|(NULL)|0|(NULL)|0|0
96605|2019-10-29 06:55:52|REPEATABLE READ|12541|482402000|statement/com/Execute|(NULL)|(NULL)|0|(NULL)|0|0
96605|2019-10-29 06:55:52|REPEATABLE READ|12542|5916000|statement/com/Close stmt|(NULL)|(NULL)|0|(NULL)|0|0
       通过以上信息可以印证一直在RUNNING的事务执行的最后一条sql语句。
       到此通过对数据库的分析,已经可以明确问题的本质原因了,于是可以在代码结构上进行修改、优化。
问题修复
优化后的代码结构如下:
    tx := db.Db().Begin()
    defer func() {
        if err := recover(); err != nil {
            fmt.Println("Panic error: ", err)
            debug.PrintStack()  // 打印堆栈信息
            tx.Rollback()
        }
    }()
    err = s.updateOrderStatus(tx, orderInfo, orderStatus, "job", map[string]interface{}{})
    if err != nil {
        tx.Rollback()
        return err
    }
    tx.Commit()
    if err = tx.Error; err != nil {
        tx.Rollback()
    }
增加defer和recover来处理异常情况的发生,不至于让事务hang在那里不动。
我们在调用recover的延迟函数中以合理的方式响应该异常:
- 打印堆栈的异常调用信息和关键的业务信息,以便这些问题保留可见;
- 用recover函数捕获异常,并回滚数据库事务。
最后总结一下分析问题的基本思路:
- 首先查看log文件,检查其中是否有Error、Exception、Panic(Go语言)等;
- 根据错误提示查看源代码;
- 其次查看数据库状态,SHOW ENGINE INNODB STATUS;
- 根据3中的内容,查看事务运行状态、锁的状态、锁等待的状态;
- 最后确认问题并修改代码;
遗留问题
最后还有一个遗留问题:既然panic应该会按照调用顺序一直往上返回,那么我们的程序为什么没有呢?貌似我们的程序就一直hang在那里!我们的调用路径中确实没有使用defer和recover,照此panic应该一直返回到main函数里面,然后终止整个进程。但在我们的问题中表现并非如此,Go进程并未结束,却一直hang在那里。
下面把我们的使用场景详细描述一下:
- 这个引起问题的操作在定时任务中执行,我们使用github.com/robfig/cron包来实现crontab的功能;
- 这个操作被包含在数据库事务(transaction)中;
- 貌似一直hang在出问题的语句行,虽然log中已经打印了panic异常;
- 其他的job还能正常启动运行;
- ORM工具用的是gorm;
目前虽然解决了hang的问题,但是暂时还是没有弄清楚为什么会出现这种状况。莫非是Go语言的bug?或者是github.com/robfig/cron包的bug?还是另有其他的原因?大家可以继续探讨,也请Go的高手指点迷津!
2019年11月1日星期五 于北京通州家中