1. Problem Description
The production database has a table, account_recharge_transaction, that is hash-partitioned with account_id as the partition key. Recently one API endpoint began timing out, with roughly a 2% failure rate. The SQL it executes is:
SELECT * FROM "account_recharge_transaction" WHERE "status"=2 AND "account_id"=111795163172472852 ORDER BY "id" DESC LIMIT 1;
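For context, here is a minimal sketch of how such a table is typically defined. The exact production DDL was not preserved, so the column types and the modulus here are assumptions (the modulus is based on the ~100 partitions mentioned later):

-- Hypothetical reconstruction of the problematic schema.
CREATE TABLE account_recharge_transaction (
    id         bigint NOT NULL,
    account_id bigint NOT NULL,
    status     int    NOT NULL,
    -- ... other columns (created_at, amount, detail, ...) omitted ...
    PRIMARY KEY (id, account_id)      -- note the column order; it matters later
) PARTITION BY HASH (account_id);

-- One of the ~100 hash partitions:
CREATE TABLE account_recharge_transaction_060
    PARTITION OF account_recharge_transaction
    FOR VALUES WITH (MODULUS 100, REMAINDER 60);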
2. Lines of Investigation
- Is the timeout really caused by the SQL itself?
Tracing with otel confirmed the time was indeed spent on the SQL query. It wasn't that upstream business logic had eaten into the time budget, leaving too little for the SQL, nor that other business code running before the query consumed too much time.
- Is the connection pool exhausted?
No. The service is deployed on k8s with 5 pods, each allowing up to 30 connections, which is plenty, and the connection-pool metrics showed very low utilization.
- Is the SQL itself the problem?
Running PostgreSQL's EXPLAIN showed the query does use an index. The index is defined as:
account_recharge_transaction_pkey PRIMARY KEY (id, account_id)
explain analyse SELECT "id","account_id","created_at","client_member_id","subsidy_id","amount","batch_id","status","transaction_id","sub_account_id","from","to","type","updated_at","detail","operator","member_id" FROM "account_recharge_transaction"
WHERE "account_id"=111332578555887692 AND "status"=2 ORDER BY "id" DESC LIMIT 1;
Limit (cost=0.43..103.72 rows=1 width=123) (actual time=38.996..38.997 rows=1 loops=1)
-> Index Scan Backward using account_recharge_transaction_060_pkey on account_recharge_transaction_060 account_recharge_transaction (cost=0.43..24066.54 rows=233 width=123) (actual time=38.995..38.995 rows=1 loops=1)
Index Cond: (account_id = '111332578555887692'::bigint)
Filter: (status = 2)
Planning Time: 0.160 ms
Execution Time: 39.035 ms
- Rules out "the index isn't used".
- Rules out cross-partition queries: only a single partition is scanned.
- Rules out "too much data": the query matches only a handful of rows, and a manual run took 39.035 ms, which isn't slow. Yet in production this exact query really does time out.
- Is it database load?
Production runs on AWS Aurora PostgreSQL with a fairly high-spec instance; there were no signs of CPU or memory exhaustion.
- Did the plan go wrong at actual execution time, skipping partition pruning and falling back to a full table scan?
With auto_explain configured, we found that even for the timed-out executions the plan still hit a single partition via the primary key index.
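For reference, a minimal auto_explain setup sketch. These are the extension's standard parameters; on Aurora they are applied through the DB parameter group rather than a session-level LOAD:

LOAD 'auto_explain';                      -- self-managed PG (superuser); on Aurora use the parameter group
SET auto_explain.log_min_duration = '1s'; -- log plans for statements slower than 1s
SET auto_explain.log_analyze = on;        -- include actual row counts and timings
SET auto_explain.log_buffers = on;        -- include buffer usage (requires log_analyze)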
3. Digging Deeper
The analysis above still hadn't explained the intermittent timeouts, so we turned on Performance Insights in AWS, which exposes more database-level data, including the AAS (Average Active Sessions) metric. Breaking AAS down by wait event showed this SQL was waiting on IO. At first glance that seemed impossible: the data volume is small (each partition holds only about 2 million rows) and the query returns very little. Running the statement manually against the read replica, we could occasionally reproduce the slow case.
Next, we ran EXPLAIN with more options:
EXPLAIN (ANALYZE, VERBOSE, COSTS, BUFFERS)
SELECT "id","account_id","created_at","client_member_id","subsidy_id","amount","batch_id","status","transaction_id","sub_account_id","from","to","type","updated_at","detail","operator","member_id"
FROM "account_recharge_transaction"
WHERE "account_id"=111441242101419219 AND "status"=2 ORDER BY "id" DESC LIMIT 1;
When the query was fast, the output looked like this:
Limit (cost=0.43..101.12 rows=1 width=123) (actual time=38.298..38.299 rows=0 loops=1)
  Output: account_recharge_transaction.id, account_recharge_transaction.account_id, account_recharge_transaction.created_at, account_recharge_transaction.client_member_id, account_recharge_transaction.subsidy_id, account_recharge_transaction.amount, account_recharge_transaction.batch_id, account_recharge_transaction.status, account_recharge_transaction.transaction_id, account_recharge_transaction.sub_account_id, account_recharge_transaction."from", account_recharge_transaction."to", account_recharge_transaction.type, account_recharge_transaction.updated_at, account_recharge_transaction.detail, account_recharge_transaction.operator, account_recharge_transaction.member_id
  Buffers: shared hit=8948
  ->  Index Scan Backward using account_recharge_transaction_082_pkey on public.account_recharge_transaction_082 account_recharge_transaction (cost=0.43..23462.20 rows=233 width=123) (actual time=38.297..38.297 rows=0 loops=1)
        Output: account_recharge_transaction.id, account_recharge_transaction.account_id, account_recharge_transaction.created_at, account_recharge_transaction.client_member_id, account_recharge_transaction.subsidy_id, account_recharge_transaction.amount, account_recharge_transaction.batch_id, account_recharge_transaction.status, account_recharge_transaction.transaction_id, account_recharge_transaction.sub_account_id, account_recharge_transaction."from", account_recharge_transaction."to", account_recharge_transaction.type, account_recharge_transaction.updated_at, account_recharge_transaction.detail, account_recharge_transaction.operator, account_recharge_transaction.member_id
        Index Cond: (account_recharge_transaction.account_id = '111441242101419222'::bigint)
        Filter: (account_recharge_transaction.status = 2)
        Buffers: shared hit=8948
Query Identifier: -7739276633008834702
Planning Time: 0.165 ms
Execution Time: 38.329 ms
And when it was slow:
Limit (cost=0.43..101.12 rows=1 width=123) (actual time=8940.396..8940.396 rows=0 loops=1)
  Output: account_recharge_transaction.id, account_recharge_transaction.account_id, account_recharge_transaction.created_at, account_recharge_transaction.client_member_id, account_recharge_transaction.subsidy_id, account_recharge_transaction.amount, account_recharge_transaction.batch_id, account_recharge_transaction.status, account_recharge_transaction.transaction_id, account_recharge_transaction.sub_account_id, account_recharge_transaction."from", account_recharge_transaction."to", account_recharge_transaction.type, account_recharge_transaction.updated_at, account_recharge_transaction.detail, account_recharge_transaction.operator, account_recharge_transaction.member_id
  Buffers: shared hit=296 read=8652
  I/O Timings: shared/local read=8873.460
  ->  Index Scan Backward using account_recharge_transaction_082_pkey on public.account_recharge_transaction_082 account_recharge_transaction (cost=0.43..23462.20 rows=233 width=123) (actual time=8940.394..8940.394 rows=0 loops=1)
        Output: account_recharge_transaction.id, account_recharge_transaction.account_id, account_recharge_transaction.created_at, account_recharge_transaction.client_member_id, account_recharge_transaction.subsidy_id, account_recharge_transaction.amount, account_recharge_transaction.batch_id, account_recharge_transaction.status, account_recharge_transaction.transaction_id, account_recharge_transaction.sub_account_id, account_recharge_transaction."from", account_recharge_transaction."to", account_recharge_transaction.type, account_recharge_transaction.updated_at, account_recharge_transaction.detail, account_recharge_transaction.operator, account_recharge_transaction.member_id
        Index Cond: (account_recharge_transaction.account_id = '111441242101419222'::bigint)
        Filter: (account_recharge_transaction.status = 2)
        Buffers: shared hit=296 read=8652
        I/O Timings: shared/local read=8873.460
Query Identifier: -7739276633008834702
Planning:
  Buffers: shared hit=117
Planning Time: 0.611 ms
Execution Time: 8940.421 ms
The difference is the buffer hit rate. In the slow run, 8652 blocks had to be read from disk and only 296 were found in shared buffers (296 + 8652 = 8948, the same total the fast run served entirely from memory, so both runs touched the same pages). This is standard PostgreSQL behavior: pages are looked up in shared buffers (memory) first, and only on a miss are they read from disk, which is far slower.
New question: disk reads are slow, but surely not this slow. What exactly was being read?
Shared buffers are measured in blocks; the block size can be queried and is 8 KB:
SELECT current_setting('block_size') AS block_size; -- 8192
So 8948 blocks * 8 KB is roughly 70 MB.
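As a quick sanity check, PostgreSQL can do the arithmetic itself:

SELECT pg_size_pretty(8948::bigint * current_setting('block_size')::bigint); -- 70 MB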
Then we queried the size of the index and found it is 72 MB:
SELECT pg_size_pretty(pg_relation_size('account_recharge_transaction_030_pkey'))
AS index_size_in_mb;
It looked as if this one query had pulled the entire partition index into shared buffers. How big are shared buffers in production? Roughly as follows; at this size they should, in theory, be sufficient.
SHOW shared_buffers; -- reported in blocks: 5492159 * 8 KB ≈ 42 GB, on an instance with 64 GB of total memory
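Equivalently, a small query that converts the setting into a human-readable size (pg_settings stores shared_buffers in units of 8 KB blocks):

SELECT pg_size_pretty(setting::bigint * 8192) AS shared_buffers
FROM pg_settings
WHERE name = 'shared_buffers';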
We couldn't install extensions directly in production to verify that those ~70 MB in shared buffers were mostly index pages, so we tested in a sandbox instead:
CREATE EXTENSION pg_buffercache;
SELECT
c.relname,
count(*) AS buffers,
pg_size_pretty(count(*) * 8192) AS size
FROM
pg_buffercache b
JOIN pg_class c ON b.relfilenode = c.relfilenode
JOIN pg_database d ON (b.reldatabase = d.oid AND d.datname = current_database())
WHERE
c.relname = 'account_recharge_transaction_082'
OR c.relname = 'account_recharge_transaction_082_pkey'
GROUP BY
c.relname
ORDER BY
buffers DESC;
Running this showed the buffer cache did indeed hold every page of account_recharge_transaction_082_pkey, confirming that the query depends on the entire index.
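In production, where installing pg_buffercache was not an option, the built-in statistics views at least give a rough per-relation picture. A sketch against one partition (the counters are cumulative since the last stats reset):

SELECT relname,
       idx_blks_hit,
       idx_blks_read,
       round(idx_blks_hit::numeric / nullif(idx_blks_hit + idx_blks_read, 0), 4) AS idx_hit_ratio
FROM pg_statio_user_tables
WHERE relname = 'account_recharge_transaction_082';
-- a low idx_hit_ratio here means index pages are repeatedly read from disk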
4. The Root Cause
Why does this query depend on the whole index? Because the column order in the partitioned table's primary key is wrong. PostgreSQL requires the primary key of a hash-partitioned table to include the partition key, and the key was created as account_recharge_transaction_pkey PRIMARY KEY (id, account_id). It should have been (account_id, id), with the partition key account_id first and id second. With account_id only in the trailing position, the B-tree cannot seek to it; the ORDER BY id DESC LIMIT 1 plan walks the index backward from the end, checking account_id on every entry. This explains the whole chain of puzzles.
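For contrast, a sketch of what the key should have looked like (same assumed DDL as the earlier sketch):

CREATE TABLE account_recharge_transaction (
    id         bigint NOT NULL,
    account_id bigint NOT NULL,
    status     int    NOT NULL,
    PRIMARY KEY (account_id, id)   -- partition key first: the B-tree can descend
                                   -- straight to one account_id and return its
                                   -- entries already ordered by id
) PARTITION BY HASH (account_id);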
- Why did the first EXPLAIN look normal?
A query on account_id always prunes to exactly one partition; that part works as designed. Within the partition, the planner decided that scanning account_recharge_transaction_pkey was cheaper than a sequential scan, so it used the index. But the scan is not efficient: with account_id as the trailing key column, the B-tree cannot seek to it, so in the worst case the entire index is loaded into memory and scanned. A bare EXPLAIN doesn't reveal this; on the contrary, it creates the illusion that the index is being used well. Lesson: always run EXPLAIN with additional options and weigh them together (see the sandbox sketch after this list).
- Why does it only fail sometimes?
With 100 partitions at roughly 70 MB of index each, that's about 7 GB. account_ids are queried roughly uniformly, so serving every lookup from memory would require dedicating about 7 GB of shared buffers to this table's indexes. Production hosts other tables too, so that can't be guaranteed: once buffers fill up, these index pages get evicted, and the next query has to read them back from disk.
- Why are the disk reads so slow?
Because the scan has to load a partition's entire index, and ~70 MB per query is simply too much.
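The effect is easy to reproduce in a sandbox. A minimal sketch (the table, index names, and data shape here are hypothetical, not from production):

-- ~2M rows, ~20 rows per account_id, roughly one partition's shape.
CREATE TABLE t (id bigserial, account_id bigint, status int);
INSERT INTO t (account_id, status)
SELECT (random() * 100000)::bigint, 2
FROM generate_series(1, 2000000);
ANALYZE t;

CREATE INDEX t_id_account ON t (id, account_id);   -- the problematic order
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM t WHERE account_id = 42 AND status = 2 ORDER BY id DESC LIMIT 1;
-- expect: Index Scan Backward over t_id_account touching a large share of the index

CREATE INDEX t_account_id ON t (account_id, id);   -- partition key first
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM t WHERE account_id = 42 AND status = 2 ORDER BY id DESC LIMIT 1;
-- expect: only a handful of buffers, since the scan seeks directly to account_id = 42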
5. The Fix
Ideal fix: drop the current index and rebuild the primary key as (account_id, id). But with the amount of data in production, this would lock the table for far too long, so it's not viable.
Practical fix: an index covering account_id and created_at already existed in production, so the query was rewritten to order by created_at DESC, id DESC, as sketched below.
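A sketch of the rewritten query, assuming the existing index covers (account_id, created_at):

SELECT *
FROM account_recharge_transaction
WHERE account_id = 111795163172472852 AND status = 2
ORDER BY created_at DESC, id DESC
LIMIT 1;
-- the (account_id, created_at) index lets the scan start at this account's
-- newest created_at; id DESC only breaks ties between equal timestamps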
Comparing block counts after the optimization shows a dramatic drop: from around 9000 blocks loaded per query to mostly around 10.
Summary:
- Run PostgreSQL's EXPLAIN with extra options (ANALYZE, BUFFERS, VERBOSE) to get the full picture.
- Analyze problems quantitatively, not by gut feeling. "The data is small, it shouldn't take this long" is not an answer; find a way to see what is actually happening.
- When creating a primary key on a PostgreSQL hash-partitioned table, always put the partition key first.