Description
Setup
These errors originally happened with the v0.7 ChatQnA Xeon installation [1], and e.g. updating the TEI services from the `1.2-cpu` version to the latest `1.5-cpu`, and the TGI service from the `1.4` version to the latest `2.2`, did not help; they still happen, nearly 1 year later.
[1] https://github.com/opea-project/GenAIExamples/tree/v0.7/ChatQnA/kubernetes/manifests
Use-case
Constantly stress the ChatQnA `chaqna-xeon-backend-server-svc` service endpoint by sending it a large[2] number of queries in parallel (see the load-generation sketch below).
Or do something else that causes responses to slow down: #1936
[2] compared to the actual capacity of the service, e.g. 8 queries in parallel for a service running on IceLake Xeon.
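To make the reproduction step concrete, below is a minimal load-generation sketch. The service URL/port, the `/v1/chatqna` endpoint path and the `messages` payload field are assumptions about a default ChatQnA deployment, and the concurrency of 8 matches the IceLake sizing in footnote [2]; adjust them to the actual setup.

```python
# Hypothetical load generator: flood the ChatQnA backend with parallel queries
# to reproduce the "unexpected EOF" failures. URL and payload are assumptions.
import asyncio
import aiohttp

URL = "http://chaqna-xeon-backend-server-svc:8888/v1/chatqna"  # assumed service URL/port
PAYLOAD = {"messages": "What is deep learning?"}                # assumed request format
PARALLEL = 8      # roughly the capacity noted in [2]
TOTAL = 200       # keep the load up long enough for errors to show

async def one_query(session: aiohttp.ClientSession, i: int) -> None:
    try:
        async with session.post(URL, json=PAYLOAD) as resp:
            body = await resp.read()   # fails if the body ends before Content-Length
            print(f"query {i}: HTTP {resp.status}, {len(body)} bytes")
    except aiohttp.ClientPayloadError as err:
        # Client-side symptom of a truncated reply ("unexpected EOF")
        print(f"query {i}: truncated reply: {err}")
    except aiohttp.ClientError as err:
        print(f"query {i}: request failed: {err}")

async def main() -> None:
    sem = asyncio.Semaphore(PARALLEL)
    timeout = aiohttp.ClientTimeout(total=600)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        async def bounded(i: int) -> None:
            async with sem:
                await one_query(session, i)
        await asyncio.gather(*(bounded(i) for i in range(TOTAL)))

if __name__ == "__main__":
    asyncio.run(main())
```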
Actual outcome
Occasionally:
- The process sending the queries gets an `unexpected EOF` error (i.e. the service reply ended before the specified `Content-Length`)
- Exceptions in the logs of a few of the service pods
For the exception details, see the attachments:
Logs for the pods of the `embedding-svc`, `redis-vector-db`, `retriever-svc`, `tei-embedding-svc`, `tei-reranking-svc` and `tgi-svc` services did not show any exceptions or other errors.
The `unexpected EOF` error can happen before `chaqna-xeon-backend-server-svc` replies with the first token, or after it has already provided 100-200 tokens of the reply.
Expected outcome
- Services handle common exceptions gracefully: briefly log the error and tell the caller that now is not a good time for queries (return e.g. `503 Service Unavailable`), instead of spamming the log with an exception and "crashing" the reply connection
- Have some kind of rate-limiting for `chaqna-xeon-backend-server-svc`, so that if it gets too many requests before the earlier ones have been processed promptly enough, it starts replying `503` pre-emptively[2], instead of making the situation worse by trying to process all requests although it currently has no capacity for them, and then failing in the middle (see the middleware sketch at the end of this section)
Note: Rate-limiting helps when service scale-up is slow (a TGI pod may take minutes from startup until it's ready to respond), and once the service has been scaled up as far as it can go. But it needs to be done so that it does not prevent scale-up, or cause too much fluctuation in it.
(No comment on whether that should be implemented in `chaqna-xeon-backend-server-svc` itself, or in some load-balancer in front of it.)
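As a rough illustration of both expected-outcome points, assuming the backend is a FastAPI app, something like the following could cap concurrent requests and turn unhandled exceptions into a short log line plus a `503`, instead of a reply that breaks mid-connection. The `MAX_CONCURRENT` value and all names here are made up for the sketch, and the same capping could equally live in a load-balancer in front of the service.

```python
# Sketch only: reject excess load with 503 up front, and turn unhandled
# exceptions into a brief log line plus a 503 for the caller.
# MAX_CONCURRENT and all names are illustrative, not the actual implementation.
import logging

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

logger = logging.getLogger("chatqna-backend")
app = FastAPI()

MAX_CONCURRENT = 8   # assumed capacity, cf. footnote [2] above
_active = 0          # in-flight request count (updated without awaits in between)

@app.middleware("http")
async def reject_when_overloaded(request: Request, call_next):
    global _active
    if _active >= MAX_CONCURRENT:
        # Pre-emptive rate limiting: refuse work up front instead of accepting
        # requests the service has no capacity for and failing them mid-reply.
        return JSONResponse(
            status_code=503,
            content={"detail": "Service overloaded, please retry later"},
            headers={"Retry-After": "5"},
        )
    _active += 1
    try:
        return await call_next(request)
    finally:
        _active -= 1

@app.exception_handler(Exception)
async def log_and_return_503(request: Request, exc: Exception):
    # Graceful failure: log the error briefly and give the caller a clear
    # error status instead of a reply that ends before its Content-Length.
    logger.error("request to %s failed: %s", request.url.path, exc)
    return JSONResponse(status_code=503, content={"detail": "Temporarily unavailable"})
```

Note that converting an exception into a `503` only works for failures that happen before the response starts; once tokens are already streaming, the connection can still break mid-reply, which is why limiting the accepted load in the first place matters more.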