1. Problem Description
The production database has a table, account_recharge_transaction, that is hash-partitioned with account_id as the partition key. Recently one API endpoint began timing out, with roughly a 2% failure rate. The SQL it executes is:
SELECT * FROM "account_recharge_transaction" WHERE "status"=2 AND "account_id"=111795163172472852 ORDER BY "id" DESC LIMIT 1;
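For context, here is a minimal sketch of how such a table is typically defined. The exact production DDL was not preserved, so the column types and the modulus here are assumptions (the modulus is based on the ~100 partitions mentioned later):

-- Hypothetical reconstruction of the problematic schema.
CREATE TABLE account_recharge_transaction (
    id         bigint NOT NULL,
    account_id bigint NOT NULL,
    status     int    NOT NULL,
    -- ... other columns (created_at, amount, detail, ...) omitted ...
    PRIMARY KEY (id, account_id)      -- note the column order; it matters later
) PARTITION BY HASH (account_id);

-- One of the ~100 hash partitions:
CREATE TABLE account_recharge_transaction_060
    PARTITION OF account_recharge_transaction
    FOR VALUES WITH (MODULUS 100, REMAINDER 60);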
2. Lines of Investigation
- Is the timeout really caused by the SQL itself?
Tracing with otel confirmed the time was indeed spent on the SQL query. It wasn't that upstream business logic had eaten into the time budget, leaving too little for the SQL, nor that other business code running before the query consumed too much time.
- Is the connection pool exhausted?
No. The service is deployed on k8s with 5 pods, each allowing up to 30 connections, which is plenty, and the connection-pool metrics showed very low utilization.
- Is the SQL itself the problem?
Running PostgreSQL's EXPLAIN showed the query does use an index. The index is defined as:
account_recharge_transaction_pkey PRIMARY KEY (id, account_id)
explain analyse SELECT "id","account_id","created_at","client_member_id","subsidy_id","amount","batch_id","status","transaction_id","sub_account_id","from","to","type","updated_at","detail","operator","member_id" FROM "account_recharge_transaction"
WHERE "account_id"=111332578555887692 AND "status"=2 ORDER BY "id" DESC LIMIT 1;
Limit (cost=0.43..103.72 rows=1 width=123) (actual time=38.996..38.997 rows=1 loops=1)
-> Index Scan Backward using account_recharge_transaction_060_pkey on account_recharge_transaction_060 account_recharge_transaction (cost=0.43..24066.54 rows=233 width=123) (actual time=38.995..38.995 rows=1 loops=1)
Index Cond: (account_id = '111332578555887692'::bigint)
Filter: (status = 2)
Planning Time: 0.160 ms
Execution Time: 39.035 ms
- Rules out "the index isn't used".
- Rules out cross-partition queries: only a single partition is scanned.
- Rules out "too much data": the query matches only a handful of rows, and a manual run took 39.035 ms, which isn't slow. Yet in production this exact query really does time out.
- Is it database load?
Production runs on AWS Aurora PostgreSQL with a fairly high-spec instance; there were no signs of CPU or memory exhaustion.
- Did the plan go wrong at actual execution time, skipping partition pruning and falling back to a full table scan?
With auto_explain configured, we found that even for the timed-out executions the plan still hit a single partition via the primary key index.
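For reference, a minimal auto_explain setup sketch. These are the extension's standard parameters; on Aurora they are applied through the DB parameter group rather than a session-level LOAD:

LOAD 'auto_explain';                      -- self-managed PG (superuser); on Aurora use the parameter group
SET auto_explain.log_min_duration = '1s'; -- log plans for statements slower than 1s
SET auto_explain.log_analyze = on;        -- include actual row counts and timings
SET auto_explain.log_buffers = on;        -- include buffer usage (requires log_analyze)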
3. Digging Deeper
The analysis above still hadn't explained the intermittent timeouts, so we turned on Performance Insights in AWS, which exposes more database-level data, including the AAS (Average Active Sessions) metric. Breaking AAS down by wait event showed this SQL was waiting on IO. At first glance that seemed impossible: the data volume is small (each partition holds only about 2 million rows) and the query returns very little. Running the statement manually against the read replica, we could occasionally reproduce the slow case.
Next, we ran EXPLAIN with more options:
EXPLAIN (ANALYZE, VERBOSE, COSTS, BUFFERS)
SELECT "id","account_id","created_at","client_member_id","subsidy_id","amount","batch_id","status","transaction_id","sub_account_id","from","to","type","updated_at","detail","operator","member_id"
FROM "account_recharge_transaction"
WHERE "account_id"=111441242101419219 AND "status"=2 ORDER BY "id" DESC LIMIT 1;
When the query was fast, the output looked like this:
Limit (cost=0.43..101.12 rows=1 width=123) (actual time=38.298..38.299 rows=0 loops=1)
  Output: account_recharge_transaction.id, account_recharge_transaction.account_id, account_recharge_transaction.created_at, account_recharge_transaction.client_member_id, account_recharge_transaction.subsidy_id, account_recharge_transaction.amount, account_recharge_transaction.batch_id, account_recharge_transaction.status, account_recharge_transaction.transaction_id, account_recharge_transaction.sub_account_id, account_recharge_transaction."from", account_recharge_transaction."to", account_recharge_transaction.type, account_recharge_transaction.updated_at, account_recharge_transaction.detail, account_recharge_transaction.operator, account_recharge_transaction.member_id
  Buffers: shared hit=8948
  ->  Index Scan Backward using account_recharge_transaction_082_pkey on public.account_recharge_transaction_082 account_recharge_transaction (cost=0.43..23462.20 rows=233 width=123) (actual time=38.297..38.297 rows=0 loops=1)
        Output: account_recharge_transaction.id, account_recharge_transaction.account_id, account_recharge_transaction.created_at, account_recharge_transaction.client_member_id, account_recharge_transaction.subsidy_id, account_recharge_transaction.amount, account_recharge_transaction.batch_id, account_recharge_transaction.status, account_recharge_transaction.transaction_id, account_recharge_transaction.sub_account_id, account_recharge_transaction."from", account_recharge_transaction."to", account_recharge_transaction.type, account_recharge_transaction.updated_at, account_recharge_transaction.detail, account_recharge_transaction.operator, account_recharge_transaction.member_id
        Index Cond: (account_recharge_transaction.account_id = '111441242101419222'::bigint)
        Filter: (account_recharge_transaction.status = 2)
        Buffers: shared hit=8948
Query Identifier: -7739276633008834702
Planning Time: 0.165 ms
Execution Time: 38.329 ms
And when it was slow:
Limit (cost=0.43..101.12 rows=1 width=123) (actual time=8940.396..8940.396 rows=0 loops=1)
  Output: account_recharge_transaction.id, account_recharge_transaction.account_id, account_recharge_transaction.created_at, account_recharge_transaction.client_member_id, account_recharge_transaction.subsidy_id, account_recharge_transaction.amount, account_recharge_transaction.batch_id, account_recharge_transaction.status, account_recharge_transaction.transaction_id, account_recharge_transaction.sub_account_id, account_recharge_transaction."from", account_recharge_transaction."to", account_recharge_transaction.type, account_recharge_transaction.updated_at, account_recharge_transaction.detail, account_recharge_transaction.operator, account_recharge_transaction.member_id
  Buffers: shared hit=296 read=8652
  I/O Timings: shared/local read=8873.460
  ->  Index Scan Backward using account_recharge_transaction_082_pkey on public.account_recharge_transaction_082 account_recharge_transaction (cost=0.43..23462.20 rows=233 width=123) (actual time=8940.394..8940.394 rows=0 loops=1)
        Output: account_recharge_transaction.id, account_recharge_transaction.account_id, account_recharge_transaction.created_at, account_recharge_transaction.client_member_id, account_recharge_transaction.subsidy_id, account_recharge_transaction.amount, account_recharge_transaction.batch_id, account_recharge_transaction.status, account_recharge_transaction.transaction_id, account_recharge_transaction.sub_account_id, account_recharge_transaction."from", account_recharge_transaction."to", account_recharge_transaction.type, account_recharge_transaction.updated_at, account_recharge_transaction.detail, account_recharge_transaction.operator, account_recharge_transaction.member_id
        Index Cond: (account_recharge_transaction.account_id = '111441242101419222'::bigint)
        Filter: (account_recharge_transaction.status = 2)
        Buffers: shared hit=296 read=8652
        I/O Timings: shared/local read=8873.460
Query Identifier: -7739276633008834702
Planning:
  Buffers: shared hit=117
Planning Time: 0.611 ms
Execution Time: 8940.421 ms
The difference is the buffer hit rate. In the slow run, 8652 blocks had to be read from disk and only 296 were found in shared buffers (296 + 8652 = 8948, the same total the fast run served entirely from memory, so both runs touched the same pages). This is standard PostgreSQL behavior: pages are looked up in shared buffers (memory) first, and only on a miss are they read from disk, which is far slower.
New question: disk reads are slow, but surely not this slow. What exactly was being read?
Shared buffers are measured in blocks; the block size can be queried and is 8 KB:
SELECT current_setting('block_size') AS block_size; -- 8192
So 8948 blocks * 8 KB is roughly 70 MB.
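As a quick sanity check, PostgreSQL can do the arithmetic itself:

SELECT pg_size_pretty(8948::bigint * current_setting('block_size')::bigint); -- 70 MB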
Then we queried the size of the index and found it is 72 MB:
SELECT pg_size_pretty(pg_relation_size('account_recharge_transaction_030_pkey'))
AS index_size_in_mb;
It looked as if this one query had pulled the entire partition index into shared buffers. How big are shared buffers in production? Roughly as follows; at this size they should, in theory, be sufficient.
SHOW shared_buffers; -- reported in blocks: 5492159 * 8 KB ≈ 42 GB, on an instance with 64 GB of total memory
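Equivalently, a small query that converts the setting into a human-readable size (pg_settings stores shared_buffers in units of 8 KB blocks):

SELECT pg_size_pretty(setting::bigint * 8192) AS shared_buffers
FROM pg_settings
WHERE name = 'shared_buffers';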
We couldn't install extensions directly in production to verify that those ~70 MB in shared buffers were mostly index pages, so we tested in a sandbox instead:
CREATE EXTENSION pg_buffercache;
SELECT
c.relname,
count(*) AS buffers,
pg_size_pretty(count(*) * 8192) AS size
FROM
pg_buffercache b
JOIN pg_class c ON b.relfilenode = c.relfilenode
JOIN pg_database d ON (b.reldatabase = d.oid AND d.datname = current_database())
WHERE
c.relname = 'account_recharge_transaction_082'
OR c.relname = 'account_recharge_transaction_082_pkey'
GROUP BY
c.relname
ORDER BY
buffers DESC;
Running this showed the buffer cache did indeed hold every page of account_recharge_transaction_082_pkey, confirming that the query depends on the entire index.
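In production, where installing pg_buffercache was not an option, the built-in statistics views at least give a rough per-relation picture. A sketch against one partition (the counters are cumulative since the last stats reset):

SELECT relname,
       idx_blks_hit,
       idx_blks_read,
       round(idx_blks_hit::numeric / nullif(idx_blks_hit + idx_blks_read, 0), 4) AS idx_hit_ratio
FROM pg_statio_user_tables
WHERE relname = 'account_recharge_transaction_082';
-- a low idx_hit_ratio here means index pages are repeatedly read from disk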
4. The Root Cause
Why does this query depend on the whole index? Because the column order in the partitioned table's primary key is wrong. PostgreSQL requires the primary key of a hash-partitioned table to include the partition key, and the key was created as account_recharge_transaction_pkey PRIMARY KEY (id, account_id). It should have been (account_id, id), with the partition key account_id first and id second. With account_id only in the trailing position, the B-tree cannot seek to it; the ORDER BY id DESC LIMIT 1 plan walks the index backward from the end, checking account_id on every entry. This explains the whole chain of puzzles.
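For contrast, a sketch of what the key should have looked like (same assumed DDL as the earlier sketch):

CREATE TABLE account_recharge_transaction (
    id         bigint NOT NULL,
    account_id bigint NOT NULL,
    status     int    NOT NULL,
    PRIMARY KEY (account_id, id)   -- partition key first: the B-tree can descend
                                   -- straight to one account_id and return its
                                   -- entries already ordered by id
) PARTITION BY HASH (account_id);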
- Why did the first EXPLAIN look normal?
A query on account_id always prunes to exactly one partition; that part works as designed. Within the partition, the planner decided that scanning account_recharge_transaction_pkey was cheaper than a sequential scan, so it used the index. But the scan is not efficient: with account_id as the trailing key column, the B-tree cannot seek to it, so in the worst case the entire index is loaded into memory and scanned. A bare EXPLAIN doesn't reveal this; on the contrary, it creates the illusion that the index is being used well. Lesson: always run EXPLAIN with additional options and weigh them together (see the sandbox sketch after this list).
- Why does it only fail sometimes?
With 100 partitions at roughly 70 MB of index each, that's about 7 GB. account_ids are queried roughly uniformly, so serving every lookup from memory would require dedicating about 7 GB of shared buffers to this table's indexes. Production hosts other tables too, so that can't be guaranteed: once buffers fill up, these index pages get evicted, and the next query has to read them back from disk.
- Why are the disk reads so slow?
Because the scan has to load a partition's entire index, and ~70 MB per query is simply too much.
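The effect is easy to reproduce in a sandbox. A minimal sketch (the table, index names, and data shape here are hypothetical, not from production):

-- ~2M rows, ~20 rows per account_id, roughly one partition's shape.
CREATE TABLE t (id bigserial, account_id bigint, status int);
INSERT INTO t (account_id, status)
SELECT (random() * 100000)::bigint, 2
FROM generate_series(1, 2000000);
ANALYZE t;

CREATE INDEX t_id_account ON t (id, account_id);   -- the problematic order
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM t WHERE account_id = 42 AND status = 2 ORDER BY id DESC LIMIT 1;
-- expect: Index Scan Backward over t_id_account touching a large share of the index

CREATE INDEX t_account_id ON t (account_id, id);   -- partition key first
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM t WHERE account_id = 42 AND status = 2 ORDER BY id DESC LIMIT 1;
-- expect: only a handful of buffers, since the scan seeks directly to account_id = 42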
5. The Fix
Ideal fix: drop the current index and rebuild the primary key as (account_id, id). But with the amount of data in production, this would lock the table for far too long, so it's not viable.
Practical fix: an index covering account_id and created_at already existed in production, so the query was rewritten to order by created_at DESC, id DESC, as sketched below.
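A sketch of the rewritten query, assuming the existing index covers (account_id, created_at):

SELECT *
FROM account_recharge_transaction
WHERE account_id = 111795163172472852 AND status = 2
ORDER BY created_at DESC, id DESC
LIMIT 1;
-- the (account_id, created_at) index lets the scan start at this account's
-- newest created_at; id DESC only breaks ties between equal timestamps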
Comparing block counts after the optimization shows a dramatic drop: from around 9000 blocks loaded per query to mostly around 10.
Summary:
- Run PostgreSQL's EXPLAIN with extra options (ANALYZE, BUFFERS, VERBOSE) to get the full picture.
- Analyze problems quantitatively, not by gut feeling. "The data is small, it shouldn't take this long" is not an answer; find a way to see what is actually happening.
- When creating a primary key on a PostgreSQL hash-partitioned table, always put the partition key first.