Following the instructions at https://github.com/ray-project/ray-llm#how-do-i-deploy-multiple-models-at-once, I'm trying to host two models on a single A100 80GB.
Two bundles are generated for the placement group:
{0: {'accelerator_type:A100': 0.1, 'CPU': 1.0}}
{1: {'accelerator_type:A100': 0.1, 'GPU': 1.0, 'CPU': 1.0}}
Bundle 0 is generated correctly with my configured CPU and accelerator-type resources.
Bundle 1, however, picks up an extra GPU: 1.0 requirement that I didn't configure.
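For what it's worth, this is roughly how I captured those bundle specs (a minimal sketch; it assumes you're attached to the already-running cluster, and the exact keys in the table may vary slightly across Ray versions):

```python
import ray
from ray.util import placement_group_table

# Attach to the running cluster rather than starting a new local one.
ray.init(address="auto")

# Dump every placement group's state and bundle spec; the unexpected
# GPU: 1.0 entry shows up in bundle 1 of each model's group.
for pg_id, info in placement_group_table().items():
    print(pg_id, info["state"], info["bundles"])
```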
Whichever model is listed first in the multi-model config always boots, and the second one never does; swapping the order of the models just swaps which one fails. My guess is that the superfluous GPU:1 entry is the cause, since two placement groups each demanding a full GPU can't both fit on a single A100 (I think).
This always produces log entries like the following for the second model:
deployment_state.py:1974 - Deployment 'VLLMDeployment:TheBloke--Mistral-7b-OpenOrca-AWQ' in application 'ray-llm' has 1 replicas that have taken more than 30s to be scheduled. This may be due to waiting for the cluster to auto-scale or for a runtime environment to be installed. Resources required for each replica: [{"accelerator_type:A100": 0.1, "CPU": 1.0}, {"accelerator_type:A100": 0.1, "CPU": 1.0, "GPU": 1.0}], total resources available: {}. Use `ray status` for more details.
Trying to work out if this is a bug or a misunderstanding on my part?
Happy to provide further details as needed :)
So far I've tried both the provided containers and building from source.