Python's nature brings a lot of challenges when dealing with blocking IO. The Hugging Face SDK doesn't provide an out-of-the-box solution for running model inference in threads, although the lower-level frameworks (PyTorch and TensorFlow) provide the necessary tooling. The HF docs suggest using a multi-threaded web server, but my attempts to apply the same snippet didn't work out.

As I urgently needed a PoC of a multi-tenant service (more than one user using LLM capabilities at once), I decided to build one that follows the workers concept: any number of `worker` instances can be started alongside a `backend` to provide a multi-tenant API for LLM inference.
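To make the idea concrete, here is a minimal sketch of the dispatch side of that concept. It is not the actual `backend.py`: the endpoint names, payload shapes, and round-robin selection are assumptions; only the port (8080) is taken from the usage example further below.

```python
# Minimal sketch of a dispatching backend: keep a registry of worker URLs and
# forward each inference request to one of them. Endpoint names and payloads
# are assumptions, not the real backend.py API.
import itertools

import requests
from flask import Flask, jsonify, request

app = Flask(__name__)
workers: list[str] = []      # base URLs of registered worker instances
_next = itertools.count()    # round-robin counter for picking a worker

@app.route("/register", methods=["POST"])
def register():
    # A worker announces itself, e.g. {"url": "http://127.0.0.1:9001"}
    workers.append(request.get_json()["url"])
    return jsonify({"registered": len(workers)})

@app.route("/", methods=["POST"])
def infer():
    if not workers:
        return jsonify({"error": "no workers registered"}), 503
    target = workers[next(_next) % len(workers)]  # pick the next worker
    resp = requests.post(target, json=request.get_json(), timeout=600)
    return resp.json(), resp.status_code

if __name__ == "__main__":
    app.run(host="127.0.0.1", port=8080)
```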
This specific demo runs the `falcon-40b-instruct` model in conversational mode. It allows users to provide a knowledge source, `article`, and ask a `question`, and the LLM answers it as if its only knowledge were the `article`.
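The exact prompt template of the PoC isn't shown here, so the following is only a sketch of how an `article`-grounded prompt could be built and fed to a `transformers` text-generation pipeline. The prompt wording, generation parameters, and choice of the `text-generation` task are assumptions; only the model name comes from the PoC.

```python
# Sketch only: how an article-grounded prompt could be built and answered.
# Prompt wording and generation parameters are assumptions.
from transformers import pipeline

def build_prompt(article: str, question: str) -> str:
    return (
        "Answer the question using only the information in the article below.\n\n"
        f"Article:\n{article}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Model id as published on the Hugging Face Hub; loading it needs a lot of GPU memory.
generator = pipeline("text-generation", model="tiiuae/falcon-40b-instruct",
                     device_map="auto", trust_remote_code=True)

def answer(article: str, question: str) -> str:
    out = generator(build_prompt(article, question),
                    max_new_tokens=128, return_full_text=False)
    return out[0]["generated_text"].strip()
```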
To use this PoC:
- Create a `venv` and activate it:

```sh
python3 -m venv venv   # or a specific minor version, e.g. python3.11
source venv/bin/activate
```
- Install runtime dependencies:

```sh
pip install .
```
- Start the `backend`:

```sh
python backend.py
```
- Start one or multiple `worker` instances. This is done by starting a new shell, sourcing the `venv` created earlier, and starting the `worker` instance (a minimal sketch of what a worker does follows these steps):

```sh
# New shell, working directory is this project
source venv/bin/activate
python worker.py
```
- Make a request to the `backend`:

```sh
curl -X POST -H "Content-type: application/json" -d '{"article": "Today is Wed. 21st. Jun 2023. The weather is hot. I am currently not at home, but at office. I am working on implementing multi-threading for the LLM backend", "question":"What date is it?"}' 'http://127.0.0.1:8080'
# The date mentioned in the article is 21st. June...
```
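For reference, the sketch below shows roughly what a `worker` instance might do: announce itself to the `backend` and serve inference requests. The port, endpoint names, payload shape, and registration call are assumptions about the PoC, not the actual `worker.py`; the model call is stubbed out so the sketch stays runnable.

```python
# Sketch of a worker: register with the backend at startup, then serve
# inference requests. Port, endpoints, and payloads are assumptions.
import requests
from flask import Flask, jsonify, request

PORT = 9001
BACKEND = "http://127.0.0.1:8080"

app = Flask(__name__)

def answer(article: str, question: str) -> str:
    # Placeholder: plug in the falcon-40b-instruct call from the earlier sketch.
    return f"(stub) I would answer {question!r} using only the given article."

@app.route("/", methods=["POST"])
def infer():
    payload = request.get_json()
    return jsonify({"answer": answer(payload["article"], payload["question"])})

if __name__ == "__main__":
    # Announce this worker to the backend's (assumed) registration endpoint.
    requests.post(f"{BACKEND}/register", json={"url": f"http://127.0.0.1:{PORT}"})
    app.run(host="127.0.0.1", port=PORT)
```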
To convert this into an MVP, the following points should be tackled:
- Add a health check to the `backend`: if a `worker` instance stays unreachable over a number of retries, it should be removed from the `backend`'s registered workers (see the sketch after this list).
- Add access control on the `backend` endpoints: the `backend` endpoints for registering and de-registering `worker` instances should be scoped down to prevent misuse.
- Return inference time with requests, for analytics.
- Containerize the PoC to run with `docker-compose`.
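As a starting point for the first item, here is a sketch of a periodic health check the `backend` could run; the `/health` endpoint, retry count, and interval are assumptions.

```python
# Sketch of a backend-side health check: ping each registered worker and drop
# it from the registry after a number of consecutive failures. The /health
# endpoint, retry count, and interval are assumptions.
import time

import requests

workers = ["http://127.0.0.1:9001"]   # registry kept by the backend
failures: dict[str, int] = {}
MAX_RETRIES = 3
INTERVAL_SECONDS = 30

def health_check_round() -> None:
    for url in list(workers):
        try:
            requests.get(f"{url}/health", timeout=5).raise_for_status()
            failures[url] = 0
        except requests.RequestException:
            failures[url] = failures.get(url, 0) + 1
            if failures[url] >= MAX_RETRIES:
                workers.remove(url)   # de-register the unreachable worker

if __name__ == "__main__":
    while True:
        health_check_round()
        time.sleep(INTERVAL_SECONDS)
```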