[Misc] allow pulling vllm in Ray runtime environment #21143


Open · wants to merge 3 commits into main

Conversation


@eric-higgins-ai commented Jul 17, 2025

Purpose

The engine runs in a spawned subprocess, which Ray treats as a new job with its own runtime environment. As a result, vLLM can't be pulled in through the Ray runtime environment, because the original job's runtime env is not passed through to the subprocess.

This issue was reported here.
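
For context, a minimal sketch of what pulling vLLM through the Ray runtime environment looks like from the user side; the address value is a placeholder and no version pin is implied by this PR. Before this change, the package installed this way was visible to the submitting job but not to the engine subprocess, because the subprocess started with a fresh runtime env.

import ray

# Ask Ray to install vLLM into the job's runtime environment instead of
# baking it into the cluster image. The engine subprocess spawned by vLLM
# needs to inherit this env for the install to be visible there.
ray.init(
    address="auto",                 # connect to an existing cluster (placeholder)
    runtime_env={"pip": ["vllm"]},  # pip-based runtime env
)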

Test Plan

Ran a Ray job with the following code

import ray
from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

vision_processor_config = vLLMEngineProcessorConfig(
    model="Qwen/Qwen2.5-VL-32B-Instruct",
    engine_kwargs=dict(
        tensor_parallel_size=1,
        pipeline_parallel_size=NUMBER_OF_GPUS,
        max_model_len=4096,
        enable_chunked_prefill=True,
        max_num_batched_tokens=2048,
        distributed_executor_backend="ray",
        device="cuda",
    ),
    # Override Ray's runtime env to include the Hugging Face token. Ray Data
    # uses Ray under the hood to orchestrate the inference pipeline.
    runtime_env=dict(
        env_vars=dict(
            HF_TOKEN="<token>",
            VLLM_USE_V1="1",
        ),
    ),
    batch_size=1,
    concurrency=1,
    has_image=False,
)

# Build the processor.
processor = build_llm_processor(
    vision_processor_config,
    preprocess=lambda row: dict(
        messages=[
            {"role": "system", "content": "You are a bot that responds with haikus."},
            {"role": "user", "content": row["item"]},
        ],
        sampling_params=dict(
            temperature=0.3,
            max_tokens=250,
        ),
    ),
    postprocess=lambda row: dict(
        answer=row["generated_text"],
        **row,  # This returns all the original columns in the dataset.
    ),
)

# Create the dataset and run the pipeline.
ds = ray.data.from_items(["Start of the haiku is: Complete this for me..."])
ds = processor(ds)
ds.show(limit=1)

Test Result

I checked in the Ray dashboard that the launched job has the runtime env provided in the processor config above.


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀


@gemini-code-assist bot left a comment


Code Review

This pull request enables the propagation of a Ray runtime environment to vLLM's distributed workers. This is a useful feature when vLLM is used as a component within a larger Ray application that defines a specific runtime environment.

The changes are well-targeted:

  1. The ParallelConfig is extended to hold an optional runtime_env.
  2. When creating the engine configuration inside a Ray actor, the current runtime_env is fetched from the Ray context and stored in the ParallelConfig.
  3. When the Ray executor initializes the Ray cluster, it now passes this runtime_env to ray.init(), ensuring that subsequently created workers inherit the correct environment.

I've reviewed the implementation, and the logic appears sound and correctly handles the cases where Ray is already initialized versus when vLLM needs to initialize it. The changes are constrained to the Ray execution path and should not affect other backends. Overall, this is a good addition to improve vLLM's integration with the Ray ecosystem.
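
For illustration, a minimal sketch of the flow described above, assuming the field and call sites look roughly like this; vLLM's actual ParallelConfig and Ray executor carry many more fields, and the function bodies here are illustrative rather than the PR's diff.

from dataclasses import dataclass
from typing import Optional

import ray

@dataclass
class ParallelConfig:                   # heavily trimmed, illustrative only
    runtime_env: Optional[dict] = None  # the optional field from item 1

def capture_runtime_env(config: ParallelConfig) -> None:
    # Item 2: inside a Ray actor, record the current job's runtime env.
    if ray.is_initialized():
        config.runtime_env = dict(ray.get_runtime_context().runtime_env)

def initialize_ray_cluster(config: ParallelConfig) -> None:
    # Item 3: when vLLM has to call ray.init() itself, forward the env so
    # workers created afterwards inherit it.
    if not ray.is_initialized():
        ray.init(runtime_env=config.runtime_env)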

Signed-off-by: eric-higgins-ai <erichiggins@applied.co>
Signed-off-by: eric-higgins-ai <erichiggins@applied.co>
Signed-off-by: eric-higgins-ai <erichiggins@applied.co>