
Patroni taking too long for failover #1145

Open
@BenchmarkingBuffalo

Please answer some short questions to help us understand your problem or question better:

  • Which image of the operator are you using? registry.opensource.zalan.do/acid/postgres-operator:latest
  • Where do you run it - cloud or metal? Bare Metal K8s
  • Are you running Postgres Operator in production? no
  • Type of issue? Question

Hi there,
I am trying out the postgres-operator to deploy an HA Postgres cluster on Kubernetes.
I am using this manifest:
apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: acid-minimal-cluster
  namespace: default
spec:
  teamId: "acid"
  volume:
    size: 1Gi
  numberOfInstances: 2
  users:
    postgres:  # database owner
    - superuser
    - createdb
    foo_user: []  # role for application foo
  databases:
    postgres: postgres  # dbname: owner
  preparedDatabases:
    bar: {}
  postgresql:
    version: "12"
  enableMasterLoadBalancer: true
  allowedSourceRanges:  # load balancers' source ranges for both master and replica services
  - 0.0.0.0/24
  patroni:
    ttl: 2
    loop_wait: 1
    retry_timeout: 0
    master_start_timeout: 0
    synchronous_mode: false
    synchronous_mode_strict: false
    maximum_lag_on_failover: 33554432
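
As a sanity check on the three timing values in the patroni section above, here is a minimal sketch of the relation that, as far as I understand the Patroni documentation, should hold between them: loop_wait + 2 * retry_timeout <= ttl. The function name is hypothetical and not part of Patroni or the operator. The check only covers this one relation and says nothing about whether such small values are accepted or sensible.

```python
def satisfies_timing_rule(ttl: int, loop_wait: int, retry_timeout: int) -> bool:
    """Check loop_wait + 2 * retry_timeout <= ttl (all values in seconds).

    Assumption: this is the relation described in the Patroni documentation
    for the dynamic configuration settings; the helper itself is illustrative.
    """
    return loop_wait + 2 * retry_timeout <= ttl


# Values taken from the patroni section of the manifest above.
manifest_values = {"ttl": 2, "loop_wait": 1, "retry_timeout": 0}
print(satisfies_timing_rule(**manifest_values))
```

With the values from the manifest the relation holds (1 + 2 * 0 <= 2), so this rule alone does not explain the delay.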

I want to see what happens when I disconnect the network from the worker node that the master is running on.
This is what I observed.
Logs from the former standby:

2020-09-22 13:32:58,868 INFO: does not have lock
2020-09-22 13:32:58,871 INFO: no action.  i am a secondary and i am following a leader
2020-09-22 13:32:59,873 INFO: Lock owner: acid-minimal-cluster-0; I am acid-minimal-cluster-1
2020-09-22 13:32:59,873 INFO: does not have lock
2020-09-22 13:32:59,879 INFO: no action.  i am a secondary and i am following a leader
2020-09-22 13:33:00,862 INFO: Lock owner: acid-minimal-cluster-0; I am acid-minimal-cluster-1
2020-09-22 13:33:00,862 INFO: does not have lock
2020-09-22 13:33:00,864 INFO: no action.  i am a secondary and i am following a leader
2020-09-22 13:33:58,493 ERROR: Can not fetch local timeline and lsn from replication connection
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/patroni/postgresql/__init__.py", line 735, in get_replica_timeline
    with self.get_replication_connection_cursor(**self.config.local_replication_address) as cur:
  File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.6/dist-packages/patroni/postgresql/__init__.py", line 730, in get_replication_connection_cursor
    with get_connection_cursor(**conn_kwargs) as cur:
  File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.6/dist-packages/patroni/postgresql/connection.py", line 43, in get_connection_cursor
    with psycopg2.connect(**kwargs) as conn:
  File "/usr/lib/python3/dist-packages/psycopg2/__init__.py", line 127, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: timeout expired
could not connect to server: Network is unreachable
        Is the server running on host "localhost" (::1) and accepting
        TCP/IP connections on port 5432?

2020-09-22 13:34:23,786 INFO: Lock owner: acid-minimal-cluster-0; I am acid-minimal-cluster-1
2020-09-22 13:34:23,786 INFO: does not have lock
2020-09-22 13:34:23,860 INFO: no action.  i am a secondary and i am following a leader
2020-09-22 13:34:23,862 WARNING: Loop time exceeded, rescheduling immediately.
2020-09-22 13:34:25,386 WARNING: Request failed to acid-minimal-cluster-0: GET http://10.36.0.1:8008/patroni (HTTPConnectionPool(host='10.36.0.1', port=8008): Max retries exceeded with url: /patroni (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fe1b8fe3828>: Failed to establish a new connection: [Errno 113] No route to host',)))
2020-09-22 13:34:25,520 WARNING: Could not activate Linux watchdog device: "Can't open watchdog device: [Errno 2] No such file or directory: '/dev/watchdog'"
2020-09-22 13:34:25,582 INFO: promoted self to leader by acquiring session lock
2020-09-22 13:34:25,584 WARNING: Loop time exceeded, rescheduling immediately.
2020-09-22 13:34:25,584 INFO: Lock owner: acid-minimal-cluster-1; I am acid-minimal-cluster-1
2020-09-22 13:34:25,634 INFO: updated leader lock during promote
server promoting
2020-09-22 13:34:25,671 INFO: cleared rewind state after becoming the leader

As you can see, I disconnected the network at 13:33:01, and the standby logged nothing for almost a minute.
Then a connection timeout was reached at 13:33:58 (I don't know how to shorten this timeout).
About 25 seconds later, the node started to promote itself.
Is there a way to reduce this delay?
What I basically want is for the former standby to promote itself to master as soon as the master stops renewing its lock.
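
To make the timeline concrete, here is a small sketch that computes the gaps between the key log lines of the standby. The timestamps are copied from the log above; the event labels are my own shorthand, not Patroni terminology.

```python
from datetime import datetime

# Timestamp format used by the Patroni log lines (comma before milliseconds).
LOG_FORMAT = "%Y-%m-%d %H:%M:%S,%f"

# Timestamps copied from the standby log; labels are informal descriptions.
EVENTS = [
    ("last regular heartbeat line", "2020-09-22 13:33:00,864"),
    ("replication connection timeout", "2020-09-22 13:33:58,493"),
    ("heartbeat loop resumes", "2020-09-22 13:34:23,786"),
    ("promoted self to leader", "2020-09-22 13:34:25,582"),
]


def gaps_in_seconds(events):
    """Return (label, seconds since the previous event) for each event after the first."""
    times = [datetime.strptime(stamp, LOG_FORMAT) for _, stamp in events]
    return [
        (events[i][0], round((times[i] - times[i - 1]).total_seconds(), 3))
        for i in range(1, len(events))
    ]


for label, seconds in gaps_in_seconds(EVENTS):
    print(f"{label}: +{seconds:.3f} s")

total = sum(seconds for _, seconds in gaps_in_seconds(EVENTS))
print(f"total: {total:.3f} s")
```

This gives roughly 57.6 s of silence, another 25.3 s until the loop resumes, and 1.8 s until the promotion, so about 85 s in total between the last regular log line and the promotion.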
