
issuer-db-1 is "healthy" but logs show FATAL on all DB, then issuer agent fails to start (no DB) #1173

Closed
@incubiq


Is this a regression?

No

Description

No idea whether this specific error is a regression, but the command line that used to start the agents no longer works, and the DB logs FATAL errors despite being reported as "healthy".

I run this command to load all agents:

DOCKERHOST=host.docker.internal \
ADMIN_TOKEN=my-admin-token \
API_KEY_ENABLED=true \
API_KEY_AUTO_PROVISIONING=false \
API_KEY_AUTHENTICATE_AS_DEFAULT_USER=false \
DEFAULT_WALLET_ENABLED=false \
PORT=8100 \
PRISM_AGENT_VERSION=1.28.0 \
PRISM_NODE_VERSION=2.2.1 \
VAULT_DEV_ROOT_TOKEN_ID=root \
PG_PORT=5432 \
docker compose -p "issuer" -f ./infrastructure/shared/docker-compose.yml up --wait

Note that I am using an older (safer?) version, 1.28.0, which used to work well.

When I run the command, I get this log for issuer-db-1 (see the last three lines), even though the container is reported as "healthy":

2024-06-11 08:16:21.593 UTC [1] LOG: starting PostgreSQL 13.15 (Debian 13.15-1.pgdg120+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
2024-06-11 08:16:21.594 UTC [1] LOG: listening on IPv4 address "0.0.0.0", port 5432
2024-06-11 08:16:21.594 UTC [1] LOG: listening on IPv6 address "::", port 5432
2024-06-11 08:16:21.620 UTC [1] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2024-06-11 08:16:21.632 UTC [27] LOG: database system was interrupted; last known up at 2024-06-11 08:14:18 UTC
2024-06-11 08:16:23.839 UTC [27] LOG: database system was not properly shut down; automatic recovery in progress
2024-06-11 08:16:23.843 UTC [27] LOG: invalid record length at 0/15D2DF0: wanted 24, got 0
2024-06-11 08:16:23.843 UTC [27] LOG: redo is not required
2024-06-11 08:16:23.861 UTC [1] LOG: database system is ready to accept connections
2024-06-11 08:16:31.622 UTC [40] FATAL: database "agent" does not exist
2024-06-11 08:16:35.666 UTC [41] FATAL: database "pollux" does not exist
2024-06-11 08:16:35.784 UTC [42] FATAL: database "node_db" does not exist
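The three FATAL lines name the databases the stack expects (agent, pollux, node_db) but which were never created in this container. As a minimal sketch, the missing names can be pulled straight out of a saved log excerpt; the sed pattern below matches the Postgres error text verbatim, and the log lines are copied from above:

```shell
# Extract the names of the missing databases from a captured Postgres log.
# The excerpt is hard-coded here for illustration; in practice it would come
# from `docker logs issuer-db-1` or similar.
log='2024-06-11 08:16:31.622 UTC [40] FATAL: database "agent" does not exist
2024-06-11 08:16:35.666 UTC [41] FATAL: database "pollux" does not exist
2024-06-11 08:16:35.784 UTC [42] FATAL: database "node_db" does not exist'

# Keep only the quoted database name from each FATAL line, one per line.
missing=$(printf '%s\n' "$log" | sed -n 's/.*FATAL: *database "\([^"]*\)" does not exist.*/\1/p')
echo "$missing"
# prints:
# agent
# pollux
# node_db
```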

Then issuer-prism-agent-1 fails because "the DB does not exist": Failed to initialize pool: FATAL: database "pollux" does not exist

So I have two problems here:
1/ It used to work; what changed?
2/ The container should not be reported as healthy when its logs contain FATAL errors.
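On point 2: a typical Postgres healthcheck only verifies that the server accepts connections, which is true here even though the expected databases are missing. A minimal sketch of the stricter semantics being asked for, operating on a captured log excerpt rather than a live container (a real compose healthcheck would combine something like pg_isready with a log scan):

```shell
# Treat the instance as unhealthy if its log contains FATAL lines, even
# though the server itself reports "ready to accept connections".
# The excerpt is hard-coded for illustration.
logs='2024-06-11 08:16:23.861 UTC [1] LOG: database system is ready to accept connections
2024-06-11 08:16:31.622 UTC [40] FATAL: database "agent" does not exist'

if printf '%s\n' "$logs" | grep -q 'FATAL'; then
  status=unhealthy
else
  status=healthy
fi
echo "$status"
# prints: unhealthy
```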

Please provide the exception or error you saw

See above, plus these logs from issuer-prism-agent-1:

2024-06-11_08:54:16.426 INFO  i.i.a.a.s.MainApp@L169:[zio-default-blocking-1] {} - zio-fiber-4 running pre-migration steps.
2024-06-11_08:54:17.015 INFO  c.z.h.HikariDataSource@L80:[ZScheduler-Worker-2] {} - DBPool - Starting...
2024-06-11_08:54:18.097 ERROR c.z.h.p.HikariPool@L594:[ZScheduler-Worker-2] {} - DBPool - Exception during pool initialization.org.postgresql.util.PSQLException: FATAL: database "pollux" does not exist
        at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2674)
        at org.postgresql.core.v3.QueryExecutorImpl.readStartupMessages(QueryExecutorImpl.java:2786)
        at org.postgresql.core.v3.QueryExecutorImpl.<init>(QueryExecutorImpl.java:172)
        at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:271)
        at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:51)
        at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:225)
        at org.postgresql.Driver.makeConnection(Driver.java:466)
        at org.postgresql.Driver.connect(Driver.java:265)
        at com.zaxxer.hikari.util.DriverDataSource.getConnection(DriverDataSource.java:138)
        at com.zaxxer.hikari.pool.PoolBase.newConnection(PoolBase.java:364)
        at com.zaxxer.hikari.pool.PoolBase.newPoolEntry(PoolBase.java:206)
        at com.zaxxer.hikari.pool.HikariPool.createPoolEntry(HikariPool.java:476)
        at com.zaxxer.hikari.pool.HikariPool.checkFailFast(HikariPool.java:561)
        at com.zaxxer.hikari.pool.HikariPool.<init>(HikariPool.java:115)
        at com.zaxxer.hikari.HikariDataSource.<init>(HikariDataSource.java:81)
        at doobie.hikari.HikariTransactor$.fromHikariConfig$$anonfun$1(HikariTransactor.scala:79)
        at zio.interop.ZioAsync.delay$$anonfun$1(ZioAsync.scala:32)
        at zio.ZIOCompanionVersionSpecific.attempt$$anonfun$1(ZIOCompanionVersionSpecific.scala:103)
        at zio.ZIO$.suspendSucceed$$anonfun$1(ZIO.scala:4638)
        at zio.UnsafeVersionSpecific.implicitFunctionIsFunction$$anonfun$1(UnsafeVersionSpecific.scala:27)
        at zio.Unsafe$.unsafe(Unsafe.scala:37)
        at zio.ZIOCompanionVersionSpecific.succeed$$anonfun$1(ZIOCompanionVersionSpecific.scala:185)
        at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:904)
        at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:890)
        at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:890)
        at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:890)
        at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:1029)
        at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:890)
        at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:967)
        at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:890)
        at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:1029)
        at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:967)
        at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:1029)
        at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:1029)
        at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:967)
        at zio.internal.FiberRuntime.evaluateEffect(FiberRuntime.scala:381)
        at zio.internal.FiberRuntime.evaluateMessageWhileSuspended(FiberRuntime.scala:504)
        at zio.internal.FiberRuntime.drainQueueOnCurrentThread(FiberRuntime.scala:220)
        at zio.internal.FiberRuntime.run(FiberRuntime.scala:139)
        at zio.internal.ZScheduler$$anon$4.run(ZScheduler.scala:478)

Please provide the environment you discovered this bug in

- using prism agent 1.28.0 (which used to work before)
- moved to a new machine (Ubuntu 22.04.4), but ran the same preparation/config steps that worked on the previous machine
- cannot get past this error to run the agents

Anything else?

No response
