
[Bugfix] Allow CUDA_VISIBLE_DEVICES='' in Platform.device_id_to_physical_device_id #18979


Merged
merged 22 commits into vllm-project:main on Jun 26, 2025

Conversation

eicherseiji
Contributor

@eicherseiji eicherseiji commented May 30, 2025

Since #15977, EngineCoreProc triggers the following call stack:
-> VllmConfig.__post_init__
-> CudaPlatformBase.check_and_update_config
-> is_flashmla_supported
-> NvmlCudaPlatform.get_device_capability

This causes a crash when an AsyncLLMEngine is created on a CPU-only head node with GPU worker nodes, since CUDA_VISIBLE_DEVICES="" on the head.

I think this is a regression, because in 0.8.5 it was not assumed that EngineCoreProc knew the device capability of its workers.

In a Ray setup, it's expected that the device control environment variable, e.g. CUDA_VISIBLE_DEVICES, is set to the empty string when launching the vLLM engine. The process will still be colocated with the GPU workers on a GPU node (thus current_platform is set correctly), but the front-end process should not consume any GPU resources.

Thus, if the CUDA platform is detected, we should make it possible to query general device capability, e.g. current_platform.get_device_capability() (which has the default kwarg device_id=0), even if the devices aren't exposed via the environment variable.

This addresses the FlashMLA capability check regression because current_platform.get_device_capability() no longer returns None in check_and_update_config.
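For context, here is a minimal sketch of the kind of guard this description implies, assuming device_id_to_physical_device_id maps a logical device index through CUDA_VISIBLE_DEVICES. The function body below is illustrative only and not the exact code merged in this PR.

# Hedged sketch, not the merged implementation: treat an empty
# CUDA_VISIBLE_DEVICES the same as an unset one, so that NVML-backed
# queries such as current_platform.get_device_capability(device_id=0)
# can still resolve logical device 0 to a physical GPU on the node.
import os

def device_id_to_physical_device_id(device_id: int) -> int:
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    if visible:  # set and non-empty: honor the usual remapping
        return int(visible.split(",")[device_id])
    # Unset or empty string (e.g. the Ray front-end process): fall back
    # to the physical index directly.
    return device_id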

The change adds two tests:

- Ensure that EngineCoreProc can be instantiated with CUDA_VISIBLE_DEVICES=''
- Ensure that configs created with CUDA_VISIBLE_DEVICES='' and with CUDA_VISIBLE_DEVICES set normally are identical (a sketch of this kind of check follows below)
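A hypothetical sketch of the second check (not the PR's actual test code; it assumes EngineArgs.create_engine_config() is available and that the resulting config supports dataclass equality):

# Hypothetical test sketch; the model id and the equality comparison are
# illustrative assumptions, not taken from this PR.
from vllm.engine.arg_utils import EngineArgs

def test_config_identical_with_empty_cuda_visible_devices(monkeypatch):
    monkeypatch.setenv("CUDA_VISIBLE_DEVICES", "0")
    config_with_devices = EngineArgs(
        model="facebook/opt-125m").create_engine_config()

    monkeypatch.setenv("CUDA_VISIBLE_DEVICES", "")
    config_without_devices = EngineArgs(
        model="facebook/opt-125m").create_engine_config()

    # The engine config should not depend on whether any GPUs are exposed
    # to the front-end process.
    assert config_with_devices == config_without_devices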
Reproducer (python serve.py):

# serve.py
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="deepseek-ai/DeepSeek-V2-Lite",
        model_source="deepseek-ai/DeepSeek-V2-Lite",
    ),
    runtime_env=dict(
        env_vars={"VLLM_USE_V1": "1"}
    ),
    deployment_config=dict(
        autoscaling_config=dict(min_replicas=1, max_replicas=1),
    ),
    engine_kwargs=dict(
        tensor_parallel_size=2,
        pipeline_parallel_size=2,
        gpu_memory_utilization=0.92,
        dtype="auto",
        max_num_seqs=40,
        max_model_len=16384,
        enable_chunked_prefill=True,
        enable_prefix_caching=True,
        trust_remote_code=True,
    ),
    log_engine_metrics=True
)

app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, blocking=True)

Logs:

(ServeController pid=39860) INFO 2025-05-30 15:02:57,347 controller 39860 -- Adding 1 replica to Deployment(name='LLMDeploymentdeepseek-ai--DeepSeek-V2-Lite', app='default').
(ServeController pid=39860) INFO 2025-05-30 15:02:57,460 controller 39860 -- Replica(id='wiyjgw4t', deployment='LLMDeploymentdeepseek-ai--DeepSeek-V2-Lite', app='default') is stopped.
(pid=gcs_server) {"asctime":"2025-05-30 15:02:57,527","levelname":"E","message":"Failed to kill actor 801904a2be71c63b35128ed702000000, status: RpcError: RPC Error message: Socket closed; RPC Error details:  rpc_code: 14","component":"gcs_server","filename":"gcs_actor_manager.cc","lineno":1788}
(ServeReplica:default:LLMDeploymentdeepseek-ai--DeepSeek-V2-Lite pid=40777) INFO 05-30 15:03:01 [__init__.py:243] Automatically detected platform cuda.
(ServeReplica:default:LLMDeploymentdeepseek-ai--DeepSeek-V2-Lite pid=40777) INFO 2025-05-30 15:03:04,828 default_LLMDeploymentdeepseek-ai--DeepSeek-V2-Lite eanzaa1s -- Running tasks to download model files on worker nodes
(ServeController pid=39860) Traceback (most recent call last):
(ServeController pid=39860)   File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/v1/engine/core_client.py", line 418, in __init__ [repeated 4x across cluster]
(download_model_files pid=40926) No cloud storage mirror configured
(ServeReplica:default:LLMDeploymentdeepseek-ai--DeepSeek-V2-Lite pid=40777) You are using a model of type deepseek_v2 to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
(ServeReplica:default:LLMDeploymentdeepseek-ai--DeepSeek-V2-Lite pid=40777) You are using a model of type deepseek_v2 to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
(download_model_files pid=40926) INFO 05-30 15:03:10 [__init__.py:243] Automatically detected platform cuda.
(_get_vllm_engine_config pid=40926) INFO 05-30 15:03:13 [__init__.py:31] Available plugins for group vllm.general_plugins:
(_get_vllm_engine_config pid=40926) INFO 05-30 15:03:13 [__init__.py:33] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
(_get_vllm_engine_config pid=40926) INFO 05-30 15:03:13 [__init__.py:36] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
(_get_vllm_engine_config pid=40926) INFO 05-30 15:03:13 [config.py:213] Replacing legacy 'type' key with 'rope_type'
(ServeController pid=39860) WARNING 2025-05-30 15:03:16,852 controller 39860 -- Deployment 'LLMRouter' in application 'default' has 2 replicas that have taken more than 30s to initialize.
(ServeController pid=39860) This may be caused by a slow __init__ or reconfigure method.
(_get_vllm_engine_config pid=40926) INFO 05-30 15:03:21 [config.py:793] This model supports multiple tasks: {'embed', 'reward', 'classify', 'generate', 'score'}. Defaulting to 'generate'.
(_get_vllm_engine_config pid=40926) INFO 05-30 15:03:21 [config.py:2118] Chunked prefill is enabled with max_num_batched_tokens=2048.
(ServeReplica:default:LLMDeploymentdeepseek-ai--DeepSeek-V2-Lite pid=40777) INFO 2025-05-30 15:03:21,252 default_LLMDeploymentdeepseek-ai--DeepSeek-V2-Lite eanzaa1s -- Using executor class: <class 'vllm.v1.executor.ray_distributed_executor.RayDistributedExecutor'>
(ServeReplica:default:LLMDeploymentdeepseek-ai--DeepSeek-V2-Lite pid=40777) WARNING 05-30 15:03:21 [utils.py:2531] We must use the `spawn` multiprocessing start method. Overriding VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See https://docs.vllm.ai/en/latest/usage/troubleshooting.html#python-multiprocessing for more information. Reason: In a Ray actor and can only be spawned
(ServeReplica:default:LLMDeploymentdeepseek-ai--DeepSeek-V2-Lite pid=40777) INFO 05-30 15:03:24 [__init__.py:243] Automatically detected platform cuda.
(ServeController pid=39860) WARNING 2025-05-30 15:03:27,445 controller 39860 -- Deployment 'LLMDeploymentdeepseek-ai--DeepSeek-V2-Lite' in application 'default' has 1 replicas that have taken more than 30s to initialize.
(ServeController pid=39860) This may be caused by a slow __init__ or reconfigure method.
(ServeReplica:default:LLMDeploymentdeepseek-ai--DeepSeek-V2-Lite pid=40777) INFO 05-30 15:03:27 [core.py:438] Waiting for init message from front-end.
(ServeReplica:default:LLMDeploymentdeepseek-ai--DeepSeek-V2-Lite pid=40777) ERROR 05-30 15:03:34 [core.py:500] EngineCore failed to start.
(ServeReplica:default:LLMDeploymentdeepseek-ai--DeepSeek-V2-Lite pid=40777) ERROR 05-30 15:03:34 [core.py:500] Traceback (most recent call last):
(ServeReplica:default:LLMDeploymentdeepseek-ai--DeepSeek-V2-Lite pid=40777) ERROR 05-30 15:03:34 [core.py:500]   File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 491, in run_engine_core
(ServeReplica:default:LLMDeploymentdeepseek-ai--DeepSeek-V2-Lite pid=40777) ERROR 05-30 15:03:34 [core.py:500]     engine_core = EngineCoreProc(*args, **kwargs)
(ServeReplica:default:LLMDeploymentdeepseek-ai--DeepSeek-V2-Lite pid=40777) ERROR 05-30 15:03:34 [core.py:500]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(ServeReplica:default:LLMDeploymentdeepseek-ai--DeepSeek-V2-Lite pid=40777) ERROR 05-30 15:03:34 [core.py:500]   File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 384, in __init__
(ServeReplica:default:LLMDeploymentdeepseek-ai--DeepSeek-V2-Lite pid=40777) ERROR 05-30 15:03:34 [core.py:500]     vllm_config.__post_init__()
(ServeReplica:default:LLMDeploymentdeepseek-ai--DeepSeek-V2-Lite pid=40777) ERROR 05-30 15:03:34 [core.py:500]   File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/config.py", line 4364, in __post_init__
(ServeReplica:default:LLMDeploymentdeepseek-ai--DeepSeek-V2-Lite pid=40777) ERROR 05-30 15:03:34 [core.py:500]     current_platform.check_and_update_config(self)
(ServeReplica:default:LLMDeploymentdeepseek-ai--DeepSeek-V2-Lite pid=40777) ERROR 05-30 15:03:34 [core.py:500]   File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/platforms/cuda.py", line 148, in check_and_update_config
(ServeReplica:default:LLMDeploymentdeepseek-ai--DeepSeek-V2-Lite pid=40777) ERROR 05-30 15:03:34 [core.py:500]     if use_flashmla and is_flashmla_supported()[0] \
(ServeReplica:default:LLMDeploymentdeepseek-ai--DeepSeek-V2-Lite pid=40777) ERROR 05-30 15:03:34 [core.py:500]                         ^^^^^^^^^^^^^^^^^^^^^^^
(ServeReplica:default:LLMDeploymentdeepseek-ai--DeepSeek-V2-Lite pid=40777) ERROR 05-30 15:03:34 [core.py:500]   File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/attention/ops/flashmla.py", line 28, in is_flashmla_supported
(ServeReplica:default:LLMDeploymentdeepseek-ai--DeepSeek-V2-Lite pid=40777) ERROR 05-30 15:03:34 [core.py:500]     if current_platform.get_device_capability()[0] != 9:
(ServeReplica:default:LLMDeploymentdeepseek-ai--DeepSeek-V2-Lite pid=40777) ERROR 05-30 15:03:34 [core.py:500]        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^
(ServeReplica:default:LLMDeploymentdeepseek-ai--DeepSeek-V2-Lite pid=40777) ERROR 05-30 15:03:34 [core.py:500] TypeError: 'NoneType' object is not subscriptable
(ServeReplica:default:LLMDeploymentdeepseek-ai--DeepSeek-V2-Lite pid=40777) Process EngineCore_0:
(ServeReplica:default:LLMDeploymentdeepseek-ai--DeepSeek-V2-Lite pid=40777) Traceback (most recent call last):
(ServeReplica:default:LLMDeploymentdeepseek-ai--DeepSeek-V2-Lite pid=40777)   File "/home/ray/anaconda3/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
(ServeReplica:default:LLMDeploymentdeepseek-ai--DeepSeek-V2-Lite pid=40777)     self.run()
(ServeReplica:default:LLMDeploymentdeepseek-ai--DeepSeek-V2-Lite pid=40777)   File "/home/ray/anaconda3/lib/python3.11/multiprocessing/process.py", line 108, in run
(ServeReplica:default:LLMDeploymentdeepseek-ai--DeepSeek-V2-Lite pid=40777)     self._target(*self._args, **self._kwargs)
(ServeReplica:default:LLMDeploymentdeepseek-ai--DeepSeek-V2-Lite pid=40777)   File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 504, in run_engine_core
(ServeReplica:default:LLMDeploymentdeepseek-ai--DeepSeek-V2-Lite pid=40777)     raise e
(ServeReplica:default:LLMDeploymentdeepseek-ai--DeepSeek-V2-Lite pid=40777)   File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 491, in run_engine_core
(ServeReplica:default:LLMDeploymentdeepseek-ai--DeepSeek-V2-Lite pid=40777)     engine_core = EngineCoreProc(*args, **kwargs)
(ServeReplica:default:LLMDeploymentdeepseek-ai--DeepSeek-V2-Lite pid=40777)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(ServeReplica:default:LLMDeploymentdeepseek-ai--DeepSeek-V2-Lite pid=40777)   File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 384, in __init__
(ServeReplica:default:LLMDeploymentdeepseek-ai--DeepSeek-V2-Lite pid=40777)     vllm_config.__post_init__()
(ServeReplica:default:LLMDeploymentdeepseek-ai--DeepSeek-V2-Lite pid=40777)   File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/config.py", line 4364, in __post_init__
(ServeReplica:default:LLMDeploymentdeepseek-ai--DeepSeek-V2-Lite pid=40777)     current_platform.check_and_update_config(self)
(ServeReplica:default:LLMDeploymentdeepseek-ai--DeepSeek-V2-Lite pid=40777)   File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/platforms/cuda.py", line 148, in check_and_update_config
(ServeReplica:default:LLMDeploymentdeepseek-ai--DeepSeek-V2-Lite pid=40777)     if use_flashmla and is_flashmla_supported()[0] \
(ServeReplica:default:LLMDeploymentdeepseek-ai--DeepSeek-V2-Lite pid=40777)                         ^^^^^^^^^^^^^^^^^^^^^^^
(ServeReplica:default:LLMDeploymentdeepseek-ai--DeepSeek-V2-Lite pid=40777)   File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/attention/ops/flashmla.py", line 28, in is_flashmla_supported
(ServeReplica:default:LLMDeploymentdeepseek-ai--DeepSeek-V2-Lite pid=40777)     if current_platform.get_device_capability()[0] != 9:
(ServeReplica:default:LLMDeploymentdeepseek-ai--DeepSeek-V2-Lite pid=40777)        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^
(ServeReplica:default:LLMDeploymentdeepseek-ai--DeepSeek-V2-Lite pid=40777) TypeError: 'NoneType' object is not subscriptable


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@eicherseiji eicherseiji changed the title Fix FlashMLA detection in ray environment Avoid a crash in is_flashmla_supported() by handling Platform.get_device_capability()'s optional return value Jun 3, 2025
@mergify mergify bot added the v1 label Jun 4, 2025

mergify bot commented Jun 4, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @eicherseiji.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@eicherseiji eicherseiji changed the title Avoid a crash in is_flashmla_supported() by handling Platform.get_device_capability()'s optional return value Avoid a crash in is_flashmla_supported() by moving block_size fixup to GPU worker Jun 4, 2025
@mergify mergify bot added the needs-rebase label Jun 4, 2025
@mergify mergify bot removed the needs-rebase label Jun 4, 2025
@eicherseiji eicherseiji changed the title Avoid a crash in is_flashmla_supported() by moving block_size fixup to GPU worker [Regression][Bugfix] Avoid a crash in is_flashmla_supported() by moving block_size fixup to GPU worker Jun 4, 2025
@eicherseiji eicherseiji changed the title [Regression][Bugfix] Avoid a crash in is_flashmla_supported() by moving block_size fixup to GPU worker [Regression][Bugfix] Avoid a crash in is_flashmla_supported() by moving FlashMLA block_size fixup to GPU worker Jun 4, 2025
@eicherseiji
Contributor Author

@njhill your feedback on this would be greatly appreciated. Thanks!

@kouroshHakha
Collaborator

@eicherseiji Let's add a unit test that prevents this from regressing again down the line. Basically, EngineCoreProc / AsyncLLMEngine should be instantiable on a CPU-only head node.

@ProExpertProg
Collaborator

@LucasWilkinson can you take a look?

@njhill njhill requested a review from LucasWilkinson June 4, 2025 23:23
Member

@njhill njhill left a comment


Thanks @eicherseiji! Would be good for @LucasWilkinson to check this too, and I agree with @kouroshHakha that having a test to cover this case would be good.

@LucasWilkinson
Collaborator

Sorry, still OOO so doing this on my phone (responses may be delayed / I may miss things). My main comment would be to check V0; this logic is shared between V0 and V1 (in the original implementation).

@eicherseiji
Contributor Author

eicherseiji commented Jun 6, 2025

Thanks @LucasWilkinson, @njhill, @kouroshHakha for the review!

I added a test, and moved another recent change that introduced platform dependency (dtype: 'auto' resolution, #18751) to the GPU worker as well.

The existing paths work fine for V0 engines on Ray, so I'm leaving those be with TODOs to remove after V0 deprecation.

Please let me know any comments or concerns. Thanks :)

@eicherseiji
Contributor Author

Reviewing CI failures

@eicherseiji
Contributor Author

Rebased but latest nightly is pretty red: https://buildkite.com/vllm/ci/builds/22564#_

@vllm-bot vllm-bot merged commit 65397e4 into vllm-project:main Jun 26, 2025
67 of 69 checks passed
Labels: ci/build, documentation, frontend, multi-modality, ready, tool-calling, v1
Projects
Status: Done
8 participants