Please answer a few short questions to help us understand your problem or question better:
- Which image of the operator are you using? registry.opensource.zalan.do/acid/postgres-operator:latest
- Where do you run it - cloud or metal? Bare Metal K8s
- Are you running Postgres Operator in production? no
- Type of issue? Question
Hi there,
I am trying out the postgres-operator to deploy a highly available (HA) Postgres cluster on Kubernetes.
I am using this manifest:
apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: acid-minimal-cluster
  namespace: default
spec:
  teamId: "acid"
  volume:
    size: 1Gi
  numberOfInstances: 2
  users:
    postgres:  # database owner
    - superuser
    - createdb
    foo_user: []  # role for application foo
  databases:
    postgres: postgres  # dbname: owner
  preparedDatabases:
    bar: {}
  postgresql:
    version: "12"
  enableMasterLoadBalancer: true
  allowedSourceRanges:  # load balancers' source ranges for both master and replica services
  - 0.0.0.0/24
  patroni:
    ttl: 2
    loop_wait: 1
    retry_timeout: 0
    master_start_timeout: 0
    synchronous_mode: false
    synchronous_mode_strict: false
    maximum_lag_on_failover: 33554432
I want to check what happens when I disconnect the network from the worker node that the master is running on.
This is what I saw:
Logs from former standby:
2020-09-22 13:32:58,868 INFO: does not have lock
2020-09-22 13:32:58,871 INFO: no action. i am a secondary and i am following a leader
2020-09-22 13:32:59,873 INFO: Lock owner: acid-minimal-cluster-0; I am acid-minimal-cluster-1
2020-09-22 13:32:59,873 INFO: does not have lock
2020-09-22 13:32:59,879 INFO: no action. i am a secondary and i am following a leader
2020-09-22 13:33:00,862 INFO: Lock owner: acid-minimal-cluster-0; I am acid-minimal-cluster-1
2020-09-22 13:33:00,862 INFO: does not have lock
2020-09-22 13:33:00,864 INFO: no action. i am a secondary and i am following a leader
2020-09-22 13:33:58,493 ERROR: Can not fetch local timeline and lsn from replication connection
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/patroni/postgresql/__init__.py", line 735, in get_replica_timeline
    with self.get_replication_connection_cursor(**self.config.local_replication_address) as cur:
  File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.6/dist-packages/patroni/postgresql/__init__.py", line 730, in get_replication_connection_cursor
    with get_connection_cursor(**conn_kwargs) as cur:
  File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.6/dist-packages/patroni/postgresql/connection.py", line 43, in get_connection_cursor
    with psycopg2.connect(**kwargs) as conn:
  File "/usr/lib/python3/dist-packages/psycopg2/__init__.py", line 127, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: timeout expired
could not connect to server: Network is unreachable
    Is the server running on host "localhost" (::1) and accepting
    TCP/IP connections on port 5432?
2020-09-22 13:34:23,786 INFO: Lock owner: acid-minimal-cluster-0; I am acid-minimal-cluster-1
2020-09-22 13:34:23,786 INFO: does not have lock
2020-09-22 13:34:23,860 INFO: no action. i am a secondary and i am following a leader
2020-09-22 13:34:23,862 WARNING: Loop time exceeded, rescheduling immediately.
2020-09-22 13:34:25,386 WARNING: Request failed to acid-minimal-cluster-0: GET http://10.36.0.1:8008/patroni (HTTPConnectionPool(host='10.36.0.1', port=8008): Max retries exceeded with url: /patroni (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fe1b8fe3828>: Failed to establish a new connection: [Errno 113] No route to host',)))
2020-09-22 13:34:25,520 WARNING: Could not activate Linux watchdog device: "Can't open watchdog device: [Errno 2] No such file or directory: '/dev/watchdog'"
2020-09-22 13:34:25,582 INFO: promoted self to leader by acquiring session lock
2020-09-22 13:34:25,584 WARNING: Loop time exceeded, rescheduling immediately.
2020-09-22 13:34:25,584 INFO: Lock owner: acid-minimal-cluster-1; I am acid-minimal-cluster-1
2020-09-22 13:34:25,634 INFO: updated leader lock during promote
server promoting
2020-09-22 13:34:25,671 INFO: cleared rewind state after becoming the leader
As you can see, I disconnected the network at 13:33:01, and the standby logged nothing for almost a minute.
Then a timeout was reached (I don't know how to shorten this timeout).
Roughly 25 seconds after that, the node started to promote itself.
Is there a way to reduce this delay?
Basically, I want the former standby to promote itself to master as soon as the master fails to renew its lock.
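For reference, the delays described above can be worked out from the log timestamps. This is only a minimal Python sketch: it assumes the disconnect happened at exactly 13:33:01 (the time stated above, not a logged value), and the variable names are purely illustrative.

```python
from datetime import datetime

# Timestamp layout used by the Patroni log lines quoted above.
LOG_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def parse(timestamp: str) -> datetime:
    """Parse one log timestamp such as '2020-09-22 13:33:58,493'."""
    return datetime.strptime(timestamp, LOG_FORMAT)


# Assumption: the disconnect time is the approximate value given in the post.
disconnected = parse("2020-09-22 13:33:01,000")
# "ERROR: Can not fetch local timeline and lsn from replication connection"
timeout_error = parse("2020-09-22 13:33:58,493")
# "INFO: promoted self to leader by acquiring session lock"
promoted = parse("2020-09-22 13:34:25,582")

silent_gap = (timeout_error - disconnected).total_seconds()
after_timeout = (promoted - timeout_error).total_seconds()
total = (promoted - disconnected).total_seconds()

print(f"disconnect -> timeout error: {silent_gap:.1f} s")
print(f"timeout error -> promotion:  {after_timeout:.1f} s")
print(f"disconnect -> promotion:     {total:.1f} s")
```

Under that assumption, the standby needed about 85 seconds in total to take over, and most of that time passed before the first error was logged.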