The gateway can be started locally, after installing the requirements, with:
python3 -m app --api_base=http://localhost:9000/v1
--service_name : Sets the service name. Defaults to "LLM_Gateway".
--api_base : Sets the OpenAI API base URL. Defaults to "http://localhost:9000/v1".
--api_key : Sets the OpenAI API token. Defaults to "EMPTY".
--service_port : Sets the service port. Defaults to 8000.
--workers : Sets the number of Gunicorn workers. Defaults to 2.
--timeout : Sets the request timeout. Defaults to 60 seconds.
--swagger_url : Sets the Swagger interface URL. Defaults to "/docs".
--swagger_prefix : Sets the Swagger prefix. Defaults to an empty string.
--swagger_path : Sets the Swagger file path. Defaults to "../document/swagger_llm_gateway.yml".
--debug : Enables debug logs if provided.
--db_path : Sets the path to the result database. Defaults to "./results.sqlite".
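For example, to run on a different port with debug logging and a custom results database (an illustrative invocation only; the values below are placeholders, not recommended settings):

# every flag used here is one of the options listed above
python3 -m app \
    --api_base=http://localhost:9000/v1 \
    --service_port=8080 \
    --workers=4 \
    --timeout=120 \
    --db_path=/data/results.sqlite \
    --debug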
Tests would run with:
python3 -m tests
but none exist yet. The chunker seems to work fine, at least.
Next, head to host:port/docs for the Swagger interface.
docker compose up starts vLLM (with Vigostral) and llm-gateway on the same network.
Note: any modification to a servicename.json triggers a hot reload of the /services route. The prompt template (servicename.txt) is reloaded upon each usage request.
Note: mount your service manifests folder (./services here) as /usr/src/services.
Note: always use a string value for OPENAI_API_TOKEN.
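Once the stack is up, a quick way to check that the gateway is reachable and that the manifests were picked up (assuming the gateway is published on localhost:8000; the exact shape of the /services response is defined by the Swagger file, not by this example):

# Swagger UI
curl http://localhost:8000/docs
# currently loaded service manifests (hot-reloaded when a servicename.json changes)
curl http://localhost:8000/services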
PYTHONUNBUFFERED=1 : Ensures that Python output is sent straight to the terminal (unbuffered), making Python output, including tracebacks, immediately visible.
SERVICE_NAME=LLM_Gateway : Sets the service name.
OPENAI_API_BASE=http://vllm-backend:8000/v1 : Sets the OpenAI API base URL.
OPENAI_API_TOKEN=EMPTY : Sets the OpenAI API token.
HTTP_PORT=8000 : Sets the service port.
CONCURRENCY=2 : Sets the number of Gunicorn workers.
TIMEOUT=60 : Sets the request timeout.
SWAGGER_PREFIX= : Sets the Swagger prefix.
SWAGGER_PATH=../document/swagger_llm_gateway.yml : Sets the Swagger file path.
RESULT_DB_PATH=./results.sqlite : Sets the path to the result database.
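As a sketch, the same variables can be passed when running the gateway image by hand (the image name llm-gateway and the attachable net_llm_services network are assumptions here; adapt them to your build and to how your vLLM backend is reachable):

docker run \
    -e SERVICE_NAME=LLM_Gateway \
    -e OPENAI_API_BASE=http://vllm-backend:8000/v1 \
    -e OPENAI_API_TOKEN=EMPTY \
    -e HTTP_PORT=8000 \
    -e CONCURRENCY=2 \
    -e TIMEOUT=60 \
    -v $(pwd)/services:/usr/src/services \
    --network net_llm_services \
    -p 8000:8000 \
    llm-gateway  # assumed image name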
docker run --gpus=all -v ~/.cache/huggingface:/root/.cache/huggingface -p 9000:8000 --ipc=host vllm/vllm-openai:latest --model TheBloke/Vigostral-7B-Chat-AWQ --quantization awq
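With the container above running, the vLLM OpenAI-compatible endpoint can be sanity-checked from the host (port 9000 as published above; /v1/models is part of the OpenAI-compatible API that vLLM serves):

# should list TheBloke/Vigostral-7B-Chat-AWQ once the model has finished loading
curl http://localhost:9000/v1/models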
docker service create \
--name vllm-service \
--network net_llm_services \
--mount type=bind,source=/home/linagora/shared_mount/models/,target=/root/.cache/huggingface \
vllm/vllm-openai:latest \
--model TheBloke/Instruct_Mixtral-8x7B-v0.1_Dolly15K-AWQ \
--quantization awq \
--gpu-memory-utilization 0.5
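For the Swarm deployment, the usual Docker service commands help confirm that the model server came up (loading a Mixtral AWQ model can take a while):

# task state and vLLM startup logs
docker service ps vllm-service
docker service logs -f vllm-service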
A new service is created using a servicename.json file, which acts as a manifest of its parameters. It is associated with a servicename.txt file that holds the prompt template.
See the notes below.
{
    "type": "summary",
    "fields": 2,
    "name": "cra", // name of the service (route)
    "description": {
        "fr": "Compte Rendu Analytique"
    },
    "backend": "vLLM", // only one supported for now; more can be added easily
    "flavor": [
        {
            "name": "mixtral", // the name of the flavor to use in requests
            "modelName": "TheBloke/Instruct_Mixtral-8x7B-v0.1_Dolly15K-AWQ", // ensure this model is running on the vLLM server or requests will crash
            "totalContextLength": 32768, // Max Context = Prompt + User Prompt + generated Tokens
            "maxGenerationLength": 2048, // limits the output from the model; keep this fairly high
            "tokenizerClass": "LlamaTokenizer",
            "createNewTurnAfter": 178, // forces the chunker to create a new "virtual turn" whenever a turn reaches this number of tokens
            "summaryTurns": 2, // 2 previously summarized turns are injected into the template
            "maxNewTurns": 6, // at most 6 turns are processed per request; falls back to fewer when the token count gets close to the total context length
            "temperature": 0.1, // 0-1: creativity; keep close to zero since we want accurate summaries
            "top_p": 0.8 // 0-1: e.g. 0.5 only considers words that together add up to at least 50% of the total probability, leaving out the less likely ones; 0.9 includes many more words, allowing for more variety and originality
        }
    ]
}
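Since editing a manifest triggers the hot reload, it can be worth validating the JSON before dropping it into the mounted folder (a quick sketch assuming jq is installed and the manifest is saved as ./services/cra.json; the // annotations above are for this README only, a real manifest must be plain JSON):

# prints the manifest if it parses, fails with an error otherwise
jq . ./services/cra.json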