环境说明
一主两从:
角色 | 主机 | 组件 |
---|---|---|
pa-pg-1 | 172.22.138.220 | etcd、patroni、postgresql |
pa-pg-2 | 172.22.138.219 | etcd、patroni、postgresql |
pa-pg-3 | 172.22.138.218 | etcd、patroni、postgresql |
安装部署好Patroni高可用后,主库pa-pg-1和从库pa-pg-3均启动正常,而pa-pg-2的patroni日志一直输出以下报错:
...
Aug 15 10:09:56 DB-test2 patroni: 2024-08-15 11:09:56,419 INFO: Lock owner: pa-pg-1; I am pa-pg-2
Aug 15 10:09:56 DB-test2 patroni: 2024-08-15 11:09:56,421 INFO: Local timeline=3 lsn=0/DB000110
Aug 15 10:09:56 DB-test2 patroni: 2024-08-15 11:09:56,422 ERROR: Exception when working with leader
Aug 15 10:09:56 DB-test2 patroni: Traceback (most recent call last):
Aug 15 10:09:56 DB-test2 patroni: File "/usr/local/lib/python3.6/site-packages/patroni/postgresql/rewind.py", line 80, in check_leader_is_not_in_recovery
Aug 15 10:09:56 DB-test2 patroni: with get_connection_cursor(connect_timeout=3, options='-c statement_timeout=2000', **conn_kwargs) as cur:
Aug 15 10:09:56 DB-test2 patroni: File "/usr/lib64/python3.6/contextlib.py", line 81, in __enter__
Aug 15 10:09:56 DB-test2 patroni: return next(self.gen)
Aug 15 10:09:56 DB-test2 patroni: File "/usr/local/lib/python3.6/site-packages/patroni/postgresql/connection.py", line 157, in get_connection_cursor
Aug 15 10:09:56 DB-test2 patroni: conn = psycopg.connect(**kwargs)
Aug 15 10:09:56 DB-test2 patroni: File "/usr/local/lib/python3.6/site-packages/patroni/psycopg.py", line 104, in connect
Aug 15 10:09:56 DB-test2 patroni: ret = _connect(*args, **kwargs)
Aug 15 10:09:56 DB-test2 patroni: File "/usr/local/lib64/python3.6/site-packages/psycopg2/__init__.py", line 126, in connect
Aug 15 10:09:56 DB-test2 patroni: File "/usr/local/lib64/python3.6/site-packages/psycopg2/extensions.py", line 175, in make_dsn
Aug 15 10:09:56 DB-test2 patroni: psycopg2.ProgrammingError: invalid dsn: invalid connection option "target_session_attrs"
Aug 15 10:09:56 DB-test2 patroni: 2024-08-15 11:09:56,425 INFO: no action. I am (pa-pg-2), a secondary, and following a leader (pa-pg-1)
...
排查思路
1.首先一直关注的报错是"ERROR: Exception when working with leader"所以按照这个报错,首先去排查了patroni各节点的状态,但是通过节点状态并没有发现什么明显异常。
postgres # patronictl -c /etc/patroni.yml list
+ Cluster: pgsql (7400977424401590005) ----------+----+-----------+-----------------+-----------------------------+
| Member | Host | Role | State | TL | Lag in MB | Pending restart | Pending restart reason |
+---------+----------------+---------+-----------+----+-----------+-----------------+-----------------------------+
| pa-pg-1 | 172.22.138.220 | Leader | running | 3 | | | |
| pa-pg-2 | 172.22.138.219 | Replica | streaming | 3 | 0 | | |
| pa-pg-3 | 172.22.138.218 | Replica | streaming | 3 | 0 | * | max_connections: 10000->100 |
+---------+----------------+---------+-----------+----+-----------+-----------------+-----------------------------+
2.排查主库的pg_stat_replication和从库的pg_wal_stat_reciver,检查后未见明显异常
postgres=# select * from pg_stat_replication;
+-------+----------+---------+------------------+----------------+-----------------+-------------+-------------------------------+--------------+-----------+------------+------------+------------+------------+-----------+-----------+------------+---------------+------------+-------------------------------+
| pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | state | sent_lsn | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state | reply_time |
+-------+----------+---------+------------------+----------------+-----------------+-------------+-------------------------------+--------------+-----------+------------+------------+------------+------------+-----------+-----------+------------+---------------+------------+-------------------------------+
| 12368 | 16384 | replica | pa-pg-2 | 172.22.138.219 | | 47170 | 2024-08-15 10:23:36.860733+08 | | streaming | 0/DC000000 | 0/DC000000 | 0/DC000000 | 0/DC000000 | | | | 0 | async | 2024-08-15 11:10:18.855529+08 |
| 11101 | 16384 | replica | pa-pg-3 | 172.22.138.218 | | 55800 | 2024-08-15 09:57:31.073827+08 | | streaming | 0/DC000000 | 0/DC000000 | 0/DC000000 | 0/DC000000 | | | | 0 | async | 2024-08-15 11:10:19.992588+08 |
+-------+----------+---------+------------------+----------------+-----------------+-------------+-------------------------------+--------------+-----------+------------+------------+------------+------------+-----------+-----------+------------+---------------+------------+-------------------------------+
...
postgres=# select * from pg_stat_wal_receiver;
-[ RECORD 1 ]---------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pid | 18720
status | streaming
receive_start_lsn | 0/DB000000
receive_start_tli | 3
written_lsn | 0/DC000000
flushed_lsn | 0/DC000000
received_tli | 3
last_msg_send_time | 2024-08-15 11:11:08.623733+08
last_msg_receipt_time | 2024-08-15 11:11:08.623328+08
latest_end_lsn | 0/DC000000
latest_end_time | 2024-08-15 11:09:52.14472+08
slot_name | pa_pg_2
sender_host | 172.22.138.220
sender_port | 5432
conninfo | user=replica passfile=/home/postgres/pgpass channel_binding=prefer dbname=replication host=172.22.138.220 port=5432 application_name=pa-pg-2 fallback_application_name=pgsql sslmode=prefer sslcompression=0 sslsni=1 ssl_min_protocol_version=TLSv1.2 gssencmode=prefer krbsrvname=postgres target_session_attrs=any
3.查看postgresql的server日志,更是一点信息都没有。
2024-08-15 11:07:18 CST--- :LOG: redo starts at 0/DA0025B8
2024-08-15 11:07:18 CST--- :LOG: consistent recovery state reached at 0/DB000110
2024-08-15 11:07:18 CST--- :LOG: invalid record length at 0/DB000110: wanted 24, got 0
2024-08-15 11:07:18 CST--- :LOG: database system is ready to accept read-only connections
2024-08-15 11:07:18 CST--- :LOG: started streaming WAL from primary at 0/DB000000 on timeline 3
2024-08-15 11:10:06 CST--- :LOG: received fast shutdown request
2024-08-15 11:10:06 CST--- :LOG: aborting any active transactions
2024-08-15 11:10:06 CST-postgres-127.0.0.1-Patroni heartbeat :FATAL: terminating connection due to administrator command
2024-08-15 11:10:06 CST--- :FATAL: terminating walreceiver process due to administrator command
2024-08-15 11:10:06 CST--- :LOG: shutting down
2024-08-15 11:10:06 CST--- :LOG: database system is shut down
2024-08-15 11:10:10 CST--- :LOG: starting PostgreSQL 14.13 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44), 64-bit
2024-08-15 11:10:10 CST--- :LOG: listening on IPv4 address "0.0.0.0", port 5432
2024-08-15 11:10:10 CST--- :LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2024-08-15 11:10:10 CST--- :LOG: database system was shut down in recovery at 2024-08-15 11:10:06 CST
2024-08-15 11:10:10 CST--- :LOG: entering standby mode
2024-08-15 11:10:10 CST--- :LOG: redo starts at 0/DA0025B8
2024-08-15 11:10:10 CST--- :LOG: consistent recovery state reached at 0/DB000110
2024-08-15 11:10:10 CST--- :LOG: invalid record length at 0/DB000110: wanted 24, got 0
2024-08-15 11:10:10 CST--- :LOG: database system is ready to accept read-only connections
2024-08-15 11:10:10 CST--- :LOG: started streaming WAL from primary at 0/DB000000 on timeline 3
4.随后寻求外援,通过官方github的issues,有类似报错,建议为复制用户replica添加postgres库的connect权限,司马当活马医了。去添加了相关的权限。
postgres=# grant CONNECT on database postgres to replica;
GRANT
5.重启pa-pg-2服务,查看patroni的日志发现,报错是一点没变。此时把报错的重点放在了这条上面"psycopg2.ProgrammingError: invalid dsn: invalid connection option "target_session_attrs""
Aug 15 11:09:56 DB-test2 patroni: 2024-08-15 11:09:56,422 ERROR: Exception when working with leader
Aug 15 11:09:56 DB-test2 patroni: Traceback (most recent call last):
Aug 15 11:09:56 DB-test2 patroni: File "/usr/local/lib/python3.6/site-packages/patroni/postgresql/rewind.py", line 80, in check_leader_is_not_in_recovery
Aug 15 11:09:56 DB-test2 patroni: with get_connection_cursor(connect_timeout=3, options='-c statement_timeout=2000', **conn_kwargs) as cur:
Aug 15 11:09:56 DB-test2 patroni: File "/usr/lib64/python3.6/contextlib.py", line 81, in __enter__
Aug 15 11:09:56 DB-test2 patroni: return next(self.gen)
Aug 15 11:09:56 DB-test2 patroni: File "/usr/local/lib/python3.6/site-packages/patroni/postgresql/connection.py", line 157, in get_connection_cursor
Aug 15 11:09:56 DB-test2 patroni: conn = psycopg.connect(**kwargs)
Aug 15 11:09:56 DB-test2 patroni: File "/usr/local/lib/python3.6/site-packages/patroni/psycopg.py", line 104, in connect
Aug 15 11:09:56 DB-test2 patroni: ret = _connect(*args, **kwargs)
Aug 15 11:09:56 DB-test2 patroni: File "/usr/local/lib64/python3.6/site-packages/psycopg2/__init__.py", line 126, in connect
Aug 15 11:09:56 DB-test2 patroni: File "/usr/local/lib64/python3.6/site-packages/psycopg2/extensions.py", line 175, in make_dsn
Aug 15 11:09:56 DB-test2 patroni: psycopg2.ProgrammingError: invalid dsn: invalid connection option "target_session_attrs"
Aug 15 11:09:56 DB-test2 patroni: 2024-08-15 11:09:56,425 INFO: no action. I am (pa-pg-2), a secondary, and following a leader (pa-pg-1)
6.通过官方github的issues,建议将psycopg2模块的版本修改为psycopg2==2.8.6,但是我的机服务器上,没有这个模块,正常启动的两台也没有这个模块。但是当我检查这个相关模块时发现了问题。
#pa-pg-1和pa-pg-3
root # pip3 list|grep psycopg2
psycopg2-binary 2.9.8
#pa-pg-2
psycopg2-binary 2.9.5
7.通过上方发现patroni的所需要的模块psycopg2-binary,三台节点唯有启动报错的pa-pg-2版本与其他两台不一致。所以决定在pa-pg-2节点重装这个模块。重装后pa-pg-2的patroni日志不再输出报错。
root # pip3 uninstall psycopg2-binary
root # pip3 install psycopg2-binary==2.9.8
...
Aug 15 11:23:14 DB-test2 systemd: Started patroni - a high-availability PostgreSQL.
Aug 15 11:23:14 DB-test2 systemd-logind: New session 886 of user root.
Aug 15 11:23:14 DB-test2 systemd: Started Session 886 of user root.
Aug 15 11:23:14 DB-test2 systemd-logind: Removed session 886.
Aug 15 11:23:15 DB-test2 patroni: postgresql parameter listen_addresses=0.0.0.0 failed validation, defaulting to None
Aug 15 11:23:15 DB-test2 patroni: postgresql parameter port=5432 failed validation, defaulting to None
Aug 15 11:23:15 DB-test2 patroni: 2024-08-15 11:23:15,225 INFO: No PostgreSQL configuration items changed, nothing to reload.
Aug 15 11:23:15 DB-test2 patroni: localhost:5432 - accepting connections
Aug 15 11:23:15 DB-test2 patroni: 2024-08-15 11:23:15,233 INFO: establishing a new patroni heartbeat connection to postgres
Aug 15 11:23:15 DB-test2 patroni: 2024-08-15 11:23:15,254 INFO: no action. I am (pa-pg-2), a secondary, and following a leader (pa-pg-1)
Aug 15 11:23:16 DB-test2 systemd-logind: New session 887 of user root.
Aug 15 11:23:16 DB-test2 systemd: Started Session 887 of user root.
Aug 15 11:23:16 DB-test2 systemd-logind: Removed session 887.
Aug 15 11:23:16 DB-test2 patroni: 2024-08-15 11:23:16,423 INFO: no action. I am (pa-pg-2), a secondary, and following a leader (pa-pg-1)
总结
在进行集群相关的应用部署时,只是检查好相关模块安装成功与否还不够,还需要更细致的检查相关的版本号是否一致。