Description
Is your feature request related to a problem? Please describe.
When a primary node failover on PostgreSQL cluster specifically Patroni in my case, pgcat needs to be configured again to correctly route queries.
[pools.simple_db]
pool_mode = "transaction"
default_role = "any"
query_parser_enabled = true
primary_reads_enabled = false
sharding_function = "pg_bigint_hash"
[pools.simple_db.users.0]
username = "dba"
password = "password"
pool_size = 100
statement_timeout = 0
[pools.simple_db.shards.0]
servers = [
[ "10.85.246.169", 5432, "replica"],
[ "10.85.241.153", 5432, "primary"], # this node failover and another node becomes primary
[ "10.85.245.26", 5432, "replica"]
]
database = "simple_db"
When this happens, it bans that node correctly, but it can not detect another node that becomes primary on its own. As a result of this, write queries get errors.
[2022-11-10T12:59:00.125226Z ERROR pgcat::pool] Banning instance Address { id: 1, host: "10.85.241.153", port: 5432, shard: 0, database: "simple_db", role: Primary, replica_number: 1, address_index: 1, username: "dba", pool_name: "simple_db" }, error: TimedOut
[2022-11-10T12:59:00.125272Z ERROR pgcat::pool] Banning Address { id: 1, host: "10.85.241.153", port: 5432, shard: 0, database: "simple_db", role: Primary, replica_number: 1, address_index: 1, username: "dba", pool_name: "simple_db" }
[2022-11-10T12:59:00.125340Z ERROR pgcat::client] Could not get connection from pool: { pool_name: "simple_db", username: "dba", shard: 0, role: "Some(Primary)", error: "AllServersDown" }
Describe the solution you'd like
I think this behaviour can be improved by checking nodes with a query like this to see if they are primary or replica.
postgres=# select pg_is_in_recovery();
pg_is_in_recovery
-------------------
t
(1 row)
If this query returns true
pgcat can mark that node replica, if it's false
then it knows that's a primary node.
Describe alternatives you've considered
Another alternative can be creating a test table when pgcat starts, and it can run INSERT/DELETE/UPDATE on that table to see if it runs successfully.
What do you think? Is this a viable solution to this problem?