Usage/Usage-ingestor container cant deal with restart of 'broker' #3808

Open
rickbijkerk opened this issue Jan 16, 2024 · 4 comments
Labels
enhancement New feature or request that adds new things or value to Hive help wanted Extra attention is needed

Comments

@rickbijkerk
Contributor

rickbijkerk commented Jan 16, 2024

We're self-hosting Hive on Kubernetes. In Kubernetes, each pod (container) can be restarted at any time.
The concept of 'depends_on:' as defined in the docker-compose file doesn't exist in the Kubernetes landscape. Each container should be able to deal with a restart of its dependencies. This works nicely for all containers except the aforementioned usage/usage-ingestor.

For usage this leads to the following logs:

"msg":"[503] (::ffff:127.0.0.6) POST / (reqId=f74caede-fb53-4c47-a0fb-d2c7c720b657)"}
"msg":"Not ready to collect report (token=989••••••••••••••••••••••••••fd7)"}

For the usage-ingestor pod the logs look a bit more extensive:

"broker":"hive-broker-svce:29092","clientId":"usage-ingestor","error":"The group coordinator is not available","correlationId":4,"size":55,"msg":"[Connection] Response GroupCoordinator(key: 10, version: 2)","time":1705401577124}
"broker":"hive-broker-svce:29092","clientId":"usage-ingestor","error":"The group coordinator is not available","correlationId":5,"size":55,"msg":"[Connection] Response GroupCoordinator(key: 10, version: 2)","time":1705401577507}
"broker":"hive-broker-svce:29092","clientId":"usage-ingestor","error":"The group coordinator is not available","correlationId":6,"size":55,"msg":"[Connection] Response GroupCoordinator(key: 10, version: 2)","time":1705401578130}
"broker":"hive-broker-svce:29092","clientId":"usage-ingestor","error":"The group coordinator is not available","correlationId":7,"size":55,"msg":"[Connection] Response GroupCoordinator(key: 10, version: 2)","time":1705401579163}
"broker":"hive-broker-svce:29092","clientId":"usage-ingestor","error":"The group coordinator is not available","correlationId":8,"size":55,"msg":"[Connection] Response GroupCoordinator(key: 10, version: 2)","time":1705401581205}
"broker":"hive-broker-svce:29092","clientId":"usage-ingestor","error":"The group coordinator is not available","correlationId":9,"size":55,"msg":"[Connection] Response GroupCoordinator(key: 10, version: 2)","time":1705401584505}
"groupId":"usage-ingestor-v2","stack":"KafkaJSGroupCoordinatorNotFound: Failed to find group coordinator\n    at Cluster.findGroupCoordinatorMetadata (file:///usr/src/app/index.js:74709:15)\n    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at async file:///usr/src/app/index.js:74644:37\n    at async [private:ConsumerGroup:join] (file:///usr/src/app/index.js:76901:28)\n    at async file:///usr/src/app/index.js:77034:13\n    at async Runner.start (file:///usr/src/app/index.js:77694:11)\n    at async start (file:///usr/src/app/in.js:78348:11)","msg":"[Consumer] Crash: KafkaJSGroupCoordinatorNotFound: Failed to find group coordinator","time":1705401584506}

Consumer stopped
[Consumer] Stopped","time":170540158450
Consumer disconnected
Consumer crashed (restart=false, error=KafkaJSGroupCoordinatorNotFound: Failed to find group coordinator)
Restarting consumer...
Starting Usage Ingestor...
Connecting Kafka Consumer
Subscribing to Kafka topic: usage_reports_v2

broker":"hive-broker-svce:29092","clientId":"usage-ingestor","stack":"Error [ERR_STREAM_WRITE_AFTER_END]: write after end\n    at new NodeError (node:internal/errors:405:5)\n    at _write (node:internal/streams/writable:322:11)\n    at Writable.write (node:internal/streams/writable:337:10)\n    at Object.sendRequest (file:///usr/src/app/index.js:74088:31)\n    at SocketRequest.send [as sendRequest] (file:///usr/src/app/index.js:72825:27)\n    at SocketRequest.send (file:///usr/src/app/index.js:72644:14)\n    at RequestQueue.sendSocketRequest (file:///usr/src/app/index.js:72865:23)\n    at RequestQueue.push (file:///usr/src/app/index.js:72849:16)\n    at file:///usr/src/app/index.js:74083:33\n    at new Promise (<anonymous>)","msg":"[Connection] Connection error: write after end","time":1705401584509}

{"level":50,"time":1705401584510,"pid":14,"hostname":"hive-usage-deployment-6d75fc64b-7qqzz","logger":"kafkajs","eventName":"consumer.crash","stack":"KafkaJSConnectionError: Connection error: write after end\n    at Socket.onError (file:///usr/src/app/index.js:73919:27)\n    at Socket.emit (node:events:517:28)\n    at Socket.emit (node:domain:489:12)\n    at emitErrorNT (node:internal/streams/destroy:151:8)\n    at emitErrorCloseNT (node:internal/streams/destroy:116:3)\n    at process.processTicksAndRejections (node:internal/process/task_queues:82:21)","msg":"[Consumer] Failed to execute listener: Connection error: write after end","time":1705401584510}

Reproduction:

  • Start up all of the containers
  • Stop/start the broker container
  • Check the logs of usage/usage-ingestor to find the errors above

Expected solution:

  • The usage/usage-ingestor containers should be able to cope with a restart of the broker at any time, without their processing/ingestion stopping
@n1ru4l n1ru4l added enhancement New feature or request that adds new things or value to Hive help wanted Extra attention is needed labels Jan 24, 2024
@kamilkisiela
Collaborator

The depends_on is really only for the docker-compose setup.
We do deploy Hive on k8s and point startupProbe, livenessProbe and readinessProbe to 2 endpoints.

  • /_health - I'm alive!
  • /_readiness - I can do work!

These two services are independent of each other. You can set up self-hosted Hive in a similar fashion and it should work for you as well.

@rickbijkerk
Contributor Author

I have done some more playing around with it, and after configuring the probes I still ran into some issues.
Specifically: "KafkaJSNonRetriableError" errors straight after startup in both usage and usage-ingestor.

What worked for me in the end was pointing the livenessProbe to /_readiness, which made sense after I looked into the code and found out that only the readiness endpoint looks at the state of the Kafka connection.
And since the livenessProbe is the one that Kubernetes uses to determine whether or not to restart a container, this fixed it.
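
To illustrate the difference between the two endpoints described above, here is a minimal sketch. It is not taken from the Hive codebase: `KafkaState`, `ProbeResponse` and `handleProbe` are hypothetical names, and the only behaviour assumed is what this thread states, namely that /_health reports the process as alive while only /_readiness looks at the Kafka connection.

```typescript
// Hypothetical sketch: these names are illustrative, not Hive's actual code.
type KafkaState = "connecting" | "ready" | "crashed";

interface ProbeResponse {
  status: number;
  body: string;
}

// Map a probe path to a response, given the current Kafka connection state.
function handleProbe(path: string, kafka: KafkaState): ProbeResponse {
  switch (path) {
    case "/_health":
      // Liveness: the process is up and serving HTTP, regardless of Kafka.
      return { status: 200, body: "OK" };
    case "/_readiness":
      // Readiness: only report ready when the Kafka connection is usable.
      return kafka === "ready"
        ? { status: 200, body: "OK" }
        : { status: 503, body: "Kafka not ready" };
    default:
      return { status: 404, body: "Not found" };
  }
}
```

Under these assumptions, a crashed consumer still answers 200 on /_health, so a livenessProbe pointed there never triggers a restart, while /_readiness answers 503. That is why pointing the livenessProbe at /_readiness makes Kubernetes restart the container.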

Now when all the pods/containers start, the usage/usage-ingestor will start, fail to become 'live', and then do a single restart, which does work because by then the broker/zookeeper pods have started.

And last but not least, I think that since you're using Pulumi you might not run into this, as Pulumi has the concept of dependencies, which we sadly don't.

@Elyytscha

Hit the same issue today. Configuring the liveness probe to listen on the readiness endpoint does not seem like the best idea, imo.

@saihaj
Collaborator

saihaj commented Oct 1, 2024

Chatted with @Elyytscha to get more logs and understand what is going on. What I think we can do is add a retry limit here, so that once we exceed the limit it kills the service; this way the orchestrator can re-create the pods.

https://github.com/kamilkisiela/graphql-hive/blob/a0ee93f884c97b7f02fddc5395b9bfbc1d3f860a/packages/services/usage-ingestor/src/ingestor.ts#L114-L124

Right now the logic we have can keep retrying forever, but what I learned from the logs is that Kafka has a limit on how much you can retry.
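
A rough sketch of what such a retry limit could look like is below. It is not the code in ingestor.ts: `decideOnCrash`, `createCrashGuard`, `maxRestarts`, `restart` and `exit` are all hypothetical names, and the sketch is kept free of any Kafka client so it stands on its own. In the real service, `onCrash` would presumably be called from the consumer's crash handler and `exit` would wrap something like `process.exit`.

```typescript
// Hypothetical sketch: names and the limit are illustrative, not from ingestor.ts.
type CrashDecision = "restart" | "exit";

// Pure decision: restart while under the limit, otherwise give up.
function decideOnCrash(restartsSoFar: number, maxRestarts: number): CrashDecision {
  return restartsSoFar < maxRestarts ? "restart" : "exit";
}

interface CrashGuardOptions {
  maxRestarts: number;          // assumed limit, pick a value that fits your setup
  restart: () => Promise<void>; // e.g. reconnect and resubscribe the consumer
  exit: (code: number) => void; // e.g. (code) => process.exit(code)
}

function createCrashGuard(opts: CrashGuardOptions) {
  let restarts = 0;
  return {
    // Call this whenever the consumer reports a crash.
    async onCrash(): Promise<void> {
      while (decideOnCrash(restarts, opts.maxRestarts) === "restart") {
        restarts += 1;
        try {
          await opts.restart();
          return; // the consumer is running again
        } catch {
          // The restart itself failed; loop and spend another attempt.
        }
      }
      // Limit exceeded: exit so the orchestrator can re-create the pod.
      opts.exit(1);
    },
    // Call this once the consumer has been healthy for a while.
    reset(): void {
      restarts = 0;
    },
    attempts(): number {
      return restarts;
    },
  };
}
```

The counter persists across crashes until `reset()` is called, so a consumer that keeps crashing eventually exits with a non-zero code instead of retrying forever. That lets Kubernetes restart the pod without pointing the livenessProbe at /_readiness.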

Internal slack reference: https://guild-oss.slack.com/archives/C040PLJJJ02/p1727793746436379

@saihaj saihaj reopened this Oct 1, 2024