A fast CPU-based API for Qwen 2.5, hosted on Hugging Face Spaces. To achieve faster inference, we use CTranslate2 as our inference engine.
Simply cURL the endpoint as in the following example.
```bash
curl -N 'https://winstxnhdw-llm-api.hf.space/api/v1/chat/stream' \
  -H 'Content-Type: application/json' \
  -d \
  '{
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```
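If you prefer a programmatic client, the stream can be consumed with any HTTP library. Below is a minimal Python sketch using `requests`; the `chat_stream` helper is hypothetical, and printing raw text chunks is an assumption about the stream's framing.

```python
import requests


def chat_stream(base_url: str, content: str) -> None:
    """Stream a chat response from llm-api and print chunks as they arrive.

    NOTE: printing raw text chunks is an assumption about the stream's
    framing; adapt the loop if the server emits a different format.
    """
    response = requests.post(
        f"{base_url}/api/v1/chat/stream",
        json={"messages": [{"role": "user", "content": content}]},
        stream=True,
        timeout=60,
    )
    response.raise_for_status()
    for chunk in response.iter_content(chunk_size=None):
        print(chunk.decode("utf-8", errors="replace"), end="", flush=True)


if __name__ == "__main__":
    chat_stream("https://winstxnhdw-llm-api.hf.space", "What is the capital of France?")
```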
There are a few ways to run llm-api locally for development.
If you spin up the server using uv, you may access the Swagger UI at localhost:49494/schema/swagger.
```bash
uv run llm-api
```
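To verify the server came up, you can fetch the Swagger UI page mentioned above; a minimal sketch, assuming the uv-run server listens on port 49494 as noted:

```python
import requests

# Fetch the Swagger UI page served by the local llm-api instance.
# Port 49494 is the port noted above when running via uv.
response = requests.get("http://localhost:49494/schema/swagger", timeout=10)
response.raise_for_status()
print(f"Swagger UI reachable: HTTP {response.status_code}")
```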
You can access the Swagger UI at localhost:7860/schema/swagger after spinning the server up with Docker.
```bash
docker build -f Dockerfile.build -t llm-api .
docker run --rm --init -e SERVER_PORT=7860 -p 7860:7860 llm-api
```
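Once the container is up, the same streaming request from earlier works against the local port; a minimal sketch, assuming the local route matches the hosted Space and SERVER_PORT=7860 as above:

```python
import requests

# Smoke-test the containerised server (SERVER_PORT=7860, as in the run command above).
response = requests.post(
    "http://localhost:7860/api/v1/chat/stream",
    json={"messages": [{"role": "user", "content": "What is the capital of France?"}]},
    stream=True,
    timeout=60,
)
response.raise_for_status()
for chunk in response.iter_content(chunk_size=None):
    print(chunk.decode("utf-8", errors="replace"), end="", flush=True)
```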
You can enable CUDA support by building the image with the following --build-arg flag.
```bash
docker build -f Dockerfile.build -t llm-api --build-arg USE_CUDA=1 .
docker run --rm --init --gpus all -e SERVER_PORT=7860 -p 7860:7860 llm-api
```
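If the GPU does not seem to be used, it can help to first confirm that CTranslate2 can see a CUDA device at all; a minimal sketch, assuming the ctranslate2 package with CUDA support is installed on the host (not part of this repository's instructions):

```python
import ctranslate2

# Report how many CUDA devices CTranslate2 can see on this machine.
# A count of 0 points to a driver/toolkit issue rather than a problem
# with the llm-api image itself.
device_count = ctranslate2.get_cuda_device_count()
print(f"CUDA devices visible to CTranslate2: {device_count}")
```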