1 year ago
#261777
Ozan Başkan
Pgpool standby server's status is "down" some time after starting replication with online recovery
I have setup a 2 server cluster system using this configuration examples. They are in the same LAN. Master is node 0 on 192.168.1.31
, slave is node 1 on 192.168.1.32
.
The problem is that approximately 10 minutes after I started streaming replication with running pcp_recovery_node
, status of standby by node is changed to down, slave continues replication but all other postgresql
nodes get shutdown:
It gets back up when I run "pcp_attach_node -h 192.168.1.55 -p 9898 -U pgpool -n 1" and same process repeats.
I have found a similar issue here and read further about unexpected EOF on standby connection but was not able to come up with a solution. Not sure if it is the only problem here.
Here are the logs:
Slave pgpool:
2022-03-07 16:29:28: pid 11035: LOG: signal_user1_to_parent_with_reason(2)
2022-03-07 16:29:28: pid 11030: LOG: Pgpool-II parent process received SIGUSR1
2022-03-07 16:29:28: pid 11030: LOG: Pgpool-II parent process received sync backend signal from watchdog
2022-03-07 16:29:28: pid 11035: LOG: new IPC connection received
2022-03-07 16:29:28: pid 11030: LOG: leader watchdog has performed failover
2022-03-07 16:29:28: pid 11030: DETAIL: syncing the backend states from the LEADER watchdog node
2022-03-07 16:29:28: pid 11035: LOG: new IPC connection received
2022-03-07 16:29:28: pid 11035: LOG: received the get data request from local pgpool-II on IPC interface
2022-03-07 16:29:28: pid 11035: LOG: get data request from local pgpool-II node received on IPC interface is forwarded to leader watchdog node "192.168.1.31:9999 Linux localhost.localdomain"
2022-03-07 16:29:28: pid 11035: DETAIL: waiting for the reply...
2022-03-07 16:29:28: pid 11030: LOG: leader watchdog node "192.168.1.31:9999 Linux localhost.localdomain" returned status for 2 backend nodes
2022-03-07 16:29:28: pid 11030: LOG: backend:1 is set to down status
2022-03-07 16:29:28: pid 11030: DETAIL: backend:1 is DOWN on cluster leader "192.168.1.31:9999 Linux localhost.localdomain"
2022-03-07 16:29:28: pid 11030: LOG: 1 backend node(s) were detached because of backend status sync from "192.168.1.31:9999 Linux localhost.localdomain"
2022-03-07 16:29:28: pid 11030: DETAIL: restarting the children processes
2022-03-07 16:29:28: pid 11030: LOG: Node 0 is not down (status: 1)
2022-03-07 16:29:28: pid 26533: LOG: worker process received restart request
2022-03-07 16:29:28: pid 11030: LOG: worker child process with pid: 26533 exits with status 256
2022-03-07 16:29:28: pid 11030: LOG: fork a new worker child process with pid: 27025
2022-03-07 16:29:28: pid 27025: LOG: process started
Master pgpool:
2022-03-07 16:29:10: pid 330084: ERROR: unable to read data from frontend
2022-03-07 16:29:10: pid 330084: DETAIL: EOF encountered with frontend
2022-03-07 16:29:11: pid 330081: LOG: reading message length
2022-03-07 16:29:11: pid 330081: DETAIL: message length (22) in slot 1 does not match with slot 0(23)
2022-03-07 16:29:19: pid 15432: LOG: new IPC connection received
2022-03-07 16:29:27: pid 330082: LOG: received degenerate backend request for node_id: 1 from pid [330082]
2022-03-07 16:29:27: pid 15432: LOG: new IPC connection received
2022-03-07 16:29:27: pid 15432: LOG: watchdog received the failover command from local pgpool-II on IPC interface
2022-03-07 16:29:27: pid 15432: LOG: watchdog is processing the failover command [DEGENERATE_BACKEND_REQUEST] received from local pgpool-II on IPC interface
2022-03-07 16:29:27: pid 15432: LOG: we have got the consensus to perform the failover
2022-03-07 16:29:27: pid 15432: DETAIL: 1 node(s) voted in the favor
2022-03-07 16:29:27: pid 330082: LOG: signal_user1_to_parent_with_reason(0)
2022-03-07 16:29:27: pid 15429: LOG: Pgpool-II parent process received SIGUSR1
2022-03-07 16:29:27: pid 15429: LOG: Pgpool-II parent process has received failover request
2022-03-07 16:29:27: pid 330082: WARNING: write on backend 1 failed with error :"Broken pipe"
2022-03-07 16:29:27: pid 330082: DETAIL: while trying to write data from offset: 0 wlen: 5
2022-03-07 16:29:27: pid 15432: LOG: new IPC connection received
2022-03-07 16:29:27: pid 15432: LOG: received the failover indication from Pgpool-II on IPC interface
2022-03-07 16:29:27: pid 15432: LOG: watchdog is informed of failover start by the main process
2022-03-07 16:29:27: pid 15429: LOG: starting degeneration. shutdown host 192.168.1.32(5432)
2022-03-07 16:29:27: pid 15429: LOG: Do not restart children because we are switching over node id 1 host: 192.168.1.32 port: 5432 and we are in streaming replication mode
2022-03-07 16:29:27: pid 15429: LOG: child pid 330081 needs to restart because pool 1 uses backend 1
2022-03-07 16:29:27: pid 15429: LOG: execute command: /etc/pgpool-II/failover.sh 1 192.168.1.32 5432 /var/lib/pgsql/14/data 0 192.168.1.31 0 0 5432 /var/lib/pgsql/14/data 192.168.1.31 5432
+ FAILED_NODE_ID=1
+ FAILED_NODE_HOST=192.168.1.32
+ FAILED_NODE_PORT=5432
+ FAILED_NODE_PGDATA=/var/lib/pgsql/14/data
+ NEW_MAIN_NODE_ID=0
+ NEW_MAIN_NODE_HOST=192.168.1.31
+ OLD_MAIN_NODE_ID=0
+ OLD_PRIMARY_NODE_ID=0
+ NEW_MAIN_NODE_PORT=5432
+ NEW_MAIN_NODE_PGDATA=/var/lib/pgsql/14/data
+ OLD_PRIMARY_NODE_HOST=192.168.1.31
+ OLD_PRIMARY_NODE_PORT=5432
+ PGHOME=/usr/pgsql-14
+ REPL_SLOT_NAME=192_168_1_32
+ echo failover.sh: start: failed_node_id=1 failed_host=192.168.1.32 old_primary_node_id=0 new_main_node_id=0 new_main_host=192.168.1.31
failover.sh: start: failed_node_id=1 failed_host=192.168.1.32 old_primary_node_id=0 new_main_node_id=0 new_main_host=192.168.1.31
+ '[' 0 -lt 0 ']'
+ ssh -T -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null postgres@192.168.1.31 -i /var/lib/pgsql/.ssh/id_rsa_pgpool ls /tmp
Warning: Permanently added '192.168.1.31' (ECDSA) to the list of known hosts.
+ '[' 0 -ne 0 ']'
+ '[' 1 -ne 0 ']'
+ /usr/pgsql-14/bin/psql -h 192.168.1.31 -p 5432 -c 'SELECT pg_drop_replication_slot('\''192_168_1_32'\'');'
+ '[' 1 -ne 0 ']'
+ echo ERROR: failover.sh: drop replication slot '"192_168_1_32"' failed. You may need to drop replication slot manually.
ERROR: failover.sh: drop replication slot "192_168_1_32" failed. You may need to drop replication slot manually.
+ echo failover.sh: end: standby node is down. Skipping failover.
failover.sh: end: standby node is down. Skipping failover.
+ exit 0
2022-03-07 16:29:28: pid 15429: LOG: failover: set new primary node: 0
2022-03-07 16:29:28: pid 15429: LOG: failover: set new main node: 0
2022-03-07 16:29:28: pid 15429: LOG: child pid 330081 needs to restart because pool 1 uses backend 1
2022-03-07 16:29:28: pid 330070: LOG: worker process received restart request
2022-03-07 16:29:28: pid 15432: LOG: new IPC connection received
2022-03-07 16:29:28: pid 15432: LOG: received the failover indication from Pgpool-II on IPC interface
2022-03-07 16:29:28: pid 15432: LOG: watchdog is informed of failover end by the main process
failover done. shutdown host 192.168.1.32(5432)2022-03-07 16:29:28: pid 15429: LOG: failover done. shutdown host 192.168.1.32(5432)
2022-03-07 16:29:28: pid 330093: LOG: failover or failback event detected
2022-03-07 16:29:28: pid 330093: DETAIL: restarting myself
2022-03-07 16:29:28: pid 330089: LOG: failover or failback event detected
2022-03-07 16:29:28: pid 330089: DETAIL: restarting myself
2022-03-07 16:29:28: pid 329708: LOG: failover or failback event detected
2022-03-07 16:29:28: pid 329708: DETAIL: restarting myself
2022-03-07 16:29:28: pid 330085: LOG: failover or failback event detected
2022-03-07 16:29:28: pid 330085: DETAIL: restarting myself
2022-03-07 16:29:28: pid 330104: LOG: failover or failback event detected
2022-03-07 16:29:28: pid 330104: DETAIL: restarting myself
2022-03-07 16:29:28: pid 330075: LOG: failover or failback event detected
2022-03-07 16:29:28: pid 330075: DETAIL: restarting myself
2022-03-07 16:29:28: pid 330106: LOG: failover or failback event detected
2022-03-07 16:29:28: pid 330106: DETAIL: restarting myself
2022-03-07 16:29:28: pid 330092: LOG: failover or failback event detected
2022-03-07 16:29:28: pid 330092: DETAIL: restarting myself
2022-03-07 16:29:28: pid 330079: LOG: failover or failback event detected
2022-03-07 16:29:28: pid 330079: DETAIL: restarting myself
cluster-computing
master-slave
pgpool
postgresql-14
0 Answers
Your Answer