
Exceptions in application logs with larger number of requests #469

Open

Description

@eero-t

Setup

These errors originally happened with a v0.7 ChatQnA Xeon installation [1]; updating e.g. the TEI services from version 1.2-cpu to the latest 1.5-cpu and the TGI service from version 1.4 to the latest 2.2 did not help, and they still happen nearly 1 year later.

[1] https://github.com/opea-project/GenAIExamples/tree/v0.7/ChatQnA/kubernetes/manifests

Use-case

Constantly stress the ChatQnA chaqna-xeon-backend-server-svc service endpoint by sending it a large[2] number of queries in parallel.

Or do something else that causes responses to slow down: #1936

[2] Large compared to the actual capacity of the service, e.g. 8 queries in parallel for a service running on an Ice Lake Xeon.
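
For reference, this kind of parallel stress run can be done with a small client script like the sketch below. The endpoint URL, port and request payload are assumptions that depend on the deployment (they are not from this issue), and a real run would loop the queries continuously rather than send a single batch.

```python
# Hypothetical client-side stress sketch; URL, port and payload are assumptions
# that need to be adjusted to the actual ChatQnA deployment.
import concurrent.futures

import requests

URL = "http://chaqna-xeon-backend-server-svc:8888/v1/chatqna"  # assumed endpoint
PAYLOAD = {"messages": "What is deep learning?"}               # assumed request body
PARALLEL = 8  # [2] roughly the capacity of the Ice Lake Xeon setup mentioned above

def query(i: int) -> str:
    try:
        # stream=True so that a premature EOF shows up while reading the body,
        # not only when the connection is opened
        with requests.post(URL, json=PAYLOAD, stream=True, timeout=300) as resp:
            body = b"".join(resp.iter_content(chunk_size=None))
            return f"query {i}: HTTP {resp.status_code}, {len(body)} bytes"
    except requests.exceptions.ChunkedEncodingError as e:
        return f"query {i}: reply ended early: {e}"  # the 'unexpected EOF' case
    except requests.exceptions.RequestException as e:
        return f"query {i}: request failed: {e}"

with concurrent.futures.ThreadPoolExecutor(max_workers=PARALLEL) as pool:
    for result in pool.map(query, range(PARALLEL)):
        print(result)
```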

Actual outcome

Occasionally:

  • The process sending the queries gets an unexpected EOF error (i.e. the service reply ends before the specified Content-Length)
  • Exceptions in the logs of a few of the service pods

For the exception details, see the attachments.

Logs for the pods of the embedding-svc, redis-vector-db, retriever-svc, tei-embedding-svc, tei-reranking-svc and tgi-svc services did not show any exceptions or other errors.

The unexpected EOF error can happen before chaqna-xeon-backend-server-svc replies with the first token, or after it has already provided 100-200 tokens of the reply.

Expected outcome

  • Services handle common exceptions gracefully: briefly log the error and tell the caller that now is not a good time for queries (return e.g. 503 Service Unavailable), instead of spamming the log with the exception and "crashing" the reply connection (a minimal sketch of this is right after this list)

  • Have some kind of rate-limiting for chaqna-xeon-backend-server-svc, so that if it gets too many requests before earlier ones have been processed promptly enough, it starts replying 503 pre-emptively[2] (instead of making the situation worse by trying to process all requests although it currently has no capacity for them, and then failing in the middle); a sketch of this follows the notes below
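
As a rough illustration of the first point, assuming the backend server is a FastAPI application: registering a handler for the exception types that currently escape into the logs turns each failure into one short log line plus a 503 reply, instead of a traceback and a dropped connection. All names below are illustrative (UpstreamServiceError is just a stand-in for whatever exceptions the attached logs actually show, e.g. connection or read errors towards the TEI/TGI services), not something from the actual code:

```python
# Minimal sketch only, not actual ChatQnA code.
import logging

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

logger = logging.getLogger("chatqna-backend")
app = FastAPI()

class UpstreamServiceError(Exception):
    """Placeholder for the exception types currently escaping into the logs."""

@app.exception_handler(UpstreamServiceError)
async def upstream_error_handler(request: Request, exc: UpstreamServiceError) -> JSONResponse:
    # One short log line instead of an unhandled traceback per failed request ...
    logger.error("request %s failed: %s", request.url.path, exc)
    # ... and tell the caller to retry later instead of dropping the connection.
    return JSONResponse(
        status_code=503,
        content={"detail": "Service temporarily overloaded, please retry"},
        headers={"Retry-After": "10"},
    )
```

Note that once a streaming reply has already started (the 100-200 token case above), the HTTP status cannot be changed any more, so those mid-stream failures can really only be avoided by not accepting more work than there is capacity for, i.e. the second point.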

Note: Rate-limiting helps when service scale-up is slow (a TGI pod may take minutes from startup until it is ready to respond), and once the service has been scaled up as far as it can go. But it needs to be done so that it does not prevent scale-up, or cause too much fluctuation in it.

(No comment on whether that should be implemented in chaqna-xeon-backend-server-svc itself, or in some load-balancer in front of it.)
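
If it were done in the backend itself, a minimal sketch could look like the following: load shedding based on a concurrency limit, which matches the "too many requests in flight" failure mode more directly than a pure request-rate limit. Again this assumes a FastAPI/Starlette app; MAX_CONCURRENT, the class name and the reply details are made up, and the same effect could instead come from concurrency/rate limits in an ingress or load-balancer in front of the service.

```python
# Illustrative load-shedding sketch, not the actual implementation.
import asyncio

from fastapi import FastAPI
from fastapi.responses import JSONResponse

class LoadShedMiddleware:
    """Reply 503 while the configured number of request slots is already in use."""

    def __init__(self, app, max_concurrent: int):
        self.app = app
        self.slots = asyncio.Semaphore(max_concurrent)

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return
        if self.slots.locked():
            # All slots busy: fail fast instead of queuing work there is no
            # capacity for and then dropping the connection in the middle.
            reject = JSONResponse(
                status_code=503,
                content={"detail": "At capacity, please retry"},
                headers={"Retry-After": "5"},
            )
            await reject(scope, receive, send)
            return
        async with self.slots:
            # The slot is held until the downstream app has sent the whole
            # (possibly streaming) reply, so in-flight requests stay bounded.
            await self.app(scope, receive, send)

app = FastAPI()
app.add_middleware(LoadShedMiddleware, max_concurrent=8)  # tune to measured capacity
```

A per-replica limit like this should not prevent scale-up: each new replica adds its own slots, and the 503s stop once enough capacity is available.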
