
2 years ago

#261777

Ozan Başkan

Pgpool standby server's status is "down" some time after starting replication with online recovery

I have set up a two-server cluster using this configuration example. Both servers are on the same LAN. The master is node 0 on 192.168.1.31, and the slave is node 1 on 192.168.1.32.

The problem is that approximately 10 minutes after I start streaming replication by running pcp_recovery_node, the status of the standby node changes to down. The slave continues replicating, but all other PostgreSQL nodes get shut down.

The standby comes back up when I run "pcp_attach_node -h 192.168.1.55 -p 9898 -U pgpool -n 1", and then the same process repeats.
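For reference, the check-and-reattach cycle can be wrapped in two small shell helpers. This is only a sketch: it assumes the same PCP endpoint as the command above (192.168.1.55:9898, user pgpool) and a ~/.pcppass entry so that -w can skip the password prompt. The function names are hypothetical.

```shell
# Assumed PCP endpoint, taken from the pcp_attach_node command above.
PCP_HOST=192.168.1.55
PCP_PORT=9898
PCP_USER=pgpool

# Build the common PCP options for a backend node id (hypothetical helper).
# -w means "never prompt for a password"; it relies on ~/.pcppass.
pcp_opts() {
    printf '%s' "-h $PCP_HOST -p $PCP_PORT -U $PCP_USER -w -n $1"
}

# Print pgpool's view of the given backend node.
node_status() {
    # Word splitting of the option string is intentional here.
    pcp_node_info $(pcp_opts "$1")
}

# Re-attach the given backend node after it was marked down.
reattach_node() {
    pcp_attach_node $(pcp_opts "$1")
}
```

Running `node_status 1` before and after `reattach_node 1` shows whether pgpool has flipped the standby back to down.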

I found a similar issue here and read further about the unexpected EOF on the standby connection, but I was not able to come up with a solution. I am not sure whether that is the only problem here.

Here are the logs:

Slave pgpool:

2022-03-07 16:29:28: pid 11035: LOG:  signal_user1_to_parent_with_reason(2)
2022-03-07 16:29:28: pid 11030: LOG:  Pgpool-II parent process received SIGUSR1
2022-03-07 16:29:28: pid 11030: LOG:  Pgpool-II parent process received sync backend signal from watchdog
2022-03-07 16:29:28: pid 11035: LOG:  new IPC connection received
2022-03-07 16:29:28: pid 11030: LOG:  leader watchdog has performed failover
2022-03-07 16:29:28: pid 11030: DETAIL:  syncing the backend states from the LEADER watchdog node
2022-03-07 16:29:28: pid 11035: LOG:  new IPC connection received
2022-03-07 16:29:28: pid 11035: LOG:  received the get data request from local pgpool-II on IPC interface
2022-03-07 16:29:28: pid 11035: LOG:  get data request from local pgpool-II node received on IPC interface is forwarded to leader watchdog node "192.168.1.31:9999 Linux localhost.localdomain"
2022-03-07 16:29:28: pid 11035: DETAIL:  waiting for the reply...
2022-03-07 16:29:28: pid 11030: LOG:  leader watchdog node "192.168.1.31:9999 Linux localhost.localdomain" returned status for 2 backend nodes
2022-03-07 16:29:28: pid 11030: LOG:  backend:1 is set to down status
2022-03-07 16:29:28: pid 11030: DETAIL:  backend:1 is DOWN on cluster leader "192.168.1.31:9999 Linux localhost.localdomain"
2022-03-07 16:29:28: pid 11030: LOG:  1 backend node(s) were detached because of backend status sync from "192.168.1.31:9999 Linux localhost.localdomain"
2022-03-07 16:29:28: pid 11030: DETAIL:  restarting the children processes
2022-03-07 16:29:28: pid 11030: LOG:  Node 0 is not down (status: 1)
2022-03-07 16:29:28: pid 26533: LOG:  worker process received restart request
2022-03-07 16:29:28: pid 11030: LOG:  worker child process with pid: 26533 exits with status 256
2022-03-07 16:29:28: pid 11030: LOG:  fork a new worker child process with pid: 27025
2022-03-07 16:29:28: pid 27025: LOG:  process started

Master pgpool:

2022-03-07 16:29:10: pid 330084: ERROR:  unable to read data from frontend
2022-03-07 16:29:10: pid 330084: DETAIL:  EOF encountered with frontend
2022-03-07 16:29:11: pid 330081: LOG:  reading message length
2022-03-07 16:29:11: pid 330081: DETAIL:  message length (22) in slot 1 does not match with slot 0(23)
2022-03-07 16:29:19: pid 15432: LOG:  new IPC connection received
2022-03-07 16:29:27: pid 330082: LOG:  received degenerate backend request for node_id: 1 from pid [330082]
2022-03-07 16:29:27: pid 15432: LOG:  new IPC connection received
2022-03-07 16:29:27: pid 15432: LOG:  watchdog received the failover command from local pgpool-II on IPC interface
2022-03-07 16:29:27: pid 15432: LOG:  watchdog is processing the failover command [DEGENERATE_BACKEND_REQUEST] received from local pgpool-II on IPC interface
2022-03-07 16:29:27: pid 15432: LOG:  we have got the consensus to perform the failover
2022-03-07 16:29:27: pid 15432: DETAIL:  1 node(s) voted in the favor
2022-03-07 16:29:27: pid 330082: LOG:  signal_user1_to_parent_with_reason(0)
2022-03-07 16:29:27: pid 15429: LOG:  Pgpool-II parent process received SIGUSR1
2022-03-07 16:29:27: pid 15429: LOG:  Pgpool-II parent process has received failover request
2022-03-07 16:29:27: pid 330082: WARNING:  write on backend 1 failed with error :"Broken pipe"
2022-03-07 16:29:27: pid 330082: DETAIL:  while trying to write data from offset: 0 wlen: 5
2022-03-07 16:29:27: pid 15432: LOG:  new IPC connection received
2022-03-07 16:29:27: pid 15432: LOG:  received the failover indication from Pgpool-II on IPC interface
2022-03-07 16:29:27: pid 15432: LOG:  watchdog is informed of failover start by the main process
2022-03-07 16:29:27: pid 15429: LOG:  starting degeneration. shutdown host 192.168.1.32(5432)
2022-03-07 16:29:27: pid 15429: LOG:  Do not restart children because we are switching over node id 1 host: 192.168.1.32 port: 5432 and we are in streaming replication mode
2022-03-07 16:29:27: pid 15429: LOG:  child pid 330081 needs to restart because pool 1 uses backend 1
2022-03-07 16:29:27: pid 15429: LOG:  execute command: /etc/pgpool-II/failover.sh 1 192.168.1.32 5432 /var/lib/pgsql/14/data 0 192.168.1.31 0 0 5432 /var/lib/pgsql/14/data 192.168.1.31 5432
+ FAILED_NODE_ID=1
+ FAILED_NODE_HOST=192.168.1.32
+ FAILED_NODE_PORT=5432
+ FAILED_NODE_PGDATA=/var/lib/pgsql/14/data
+ NEW_MAIN_NODE_ID=0
+ NEW_MAIN_NODE_HOST=192.168.1.31
+ OLD_MAIN_NODE_ID=0
+ OLD_PRIMARY_NODE_ID=0
+ NEW_MAIN_NODE_PORT=5432
+ NEW_MAIN_NODE_PGDATA=/var/lib/pgsql/14/data
+ OLD_PRIMARY_NODE_HOST=192.168.1.31
+ OLD_PRIMARY_NODE_PORT=5432
+ PGHOME=/usr/pgsql-14
+ REPL_SLOT_NAME=192_168_1_32
+ echo failover.sh: start: failed_node_id=1 failed_host=192.168.1.32 old_primary_node_id=0 new_main_node_id=0 new_main_host=192.168.1.31
failover.sh: start: failed_node_id=1 failed_host=192.168.1.32 old_primary_node_id=0 new_main_node_id=0 new_main_host=192.168.1.31
+ '[' 0 -lt 0 ']'
+ ssh -T -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null postgres@192.168.1.31 -i /var/lib/pgsql/.ssh/id_rsa_pgpool ls /tmp
Warning: Permanently added '192.168.1.31' (ECDSA) to the list of known hosts.
+ '[' 0 -ne 0 ']'
+ '[' 1 -ne 0 ']'
+ /usr/pgsql-14/bin/psql -h 192.168.1.31 -p 5432 -c 'SELECT pg_drop_replication_slot('\''192_168_1_32'\'');'
+ '[' 1 -ne 0 ']'
+ echo ERROR: failover.sh: drop replication slot '"192_168_1_32"' failed. You may need to drop replication slot manually.
ERROR: failover.sh: drop replication slot "192_168_1_32" failed. You may need to drop replication slot manually.
+ echo failover.sh: end: standby node is down. Skipping failover.
failover.sh: end: standby node is down. Skipping failover.
+ exit 0
2022-03-07 16:29:28: pid 15429: LOG:  failover: set new primary node: 0
2022-03-07 16:29:28: pid 15429: LOG:  failover: set new main node: 0
2022-03-07 16:29:28: pid 15429: LOG:  child pid 330081 needs to restart because pool 1 uses backend 1
2022-03-07 16:29:28: pid 330070: LOG:  worker process received restart request
2022-03-07 16:29:28: pid 15432: LOG:  new IPC connection received
2022-03-07 16:29:28: pid 15432: LOG:  received the failover indication from Pgpool-II on IPC interface
2022-03-07 16:29:28: pid 15432: LOG:  watchdog is informed of failover end by the main process
failover done. shutdown host 192.168.1.32(5432)2022-03-07 16:29:28: pid 15429: LOG:  failover done. shutdown host 192.168.1.32(5432)
2022-03-07 16:29:28: pid 330093: LOG:  failover or failback event detected
2022-03-07 16:29:28: pid 330093: DETAIL:  restarting myself
2022-03-07 16:29:28: pid 330089: LOG:  failover or failback event detected
2022-03-07 16:29:28: pid 330089: DETAIL:  restarting myself
2022-03-07 16:29:28: pid 329708: LOG:  failover or failback event detected
2022-03-07 16:29:28: pid 329708: DETAIL:  restarting myself
2022-03-07 16:29:28: pid 330085: LOG:  failover or failback event detected
2022-03-07 16:29:28: pid 330085: DETAIL:  restarting myself
2022-03-07 16:29:28: pid 330104: LOG:  failover or failback event detected
2022-03-07 16:29:28: pid 330104: DETAIL:  restarting myself
2022-03-07 16:29:28: pid 330075: LOG:  failover or failback event detected
2022-03-07 16:29:28: pid 330075: DETAIL:  restarting myself
2022-03-07 16:29:28: pid 330106: LOG:  failover or failback event detected
2022-03-07 16:29:28: pid 330106: DETAIL:  restarting myself
2022-03-07 16:29:28: pid 330092: LOG:  failover or failback event detected
2022-03-07 16:29:28: pid 330092: DETAIL:  restarting myself
2022-03-07 16:29:28: pid 330079: LOG:  failover or failback event detected
2022-03-07 16:29:28: pid 330079: DETAIL:  restarting myself
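
To cut a log like the one above down to the lines around the detach, a filter along these lines may help. It is a sketch only: the patterns are copied from the messages in this excerpt, the function name is hypothetical, and the log file path depends on how pgpool logging is configured.

```shell
# Print only the pgpool log lines that relate to a backend being detached.
# The patterns come from the log excerpt above; extend them as needed.
detach_events() {
    # $1: path to a pgpool log file (location depends on your logging setup)
    grep -E 'received degenerate backend request|write on backend [0-9]+ failed|starting degeneration|does not match with slot|EOF encountered' "$1"
}
```

In the master log above, this would keep the EOF error at 16:29:10, the message length mismatch at 16:29:11, and the degenerate backend request at 16:29:27, which are the events leading up to node 1 being detached.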

cluster-computing

master-slave

pgpool

postgresql-14

0 Answers
