
[ray.llm][Batch] Support LoRA #50804

Merged
13 commits merged into ray-project:master on Feb 24, 2025
Conversation

comaniac (Collaborator)

Why are these changes needed?

  1. Support using LoRA in the vLLM v0 engine.
  2. Users have to enable LoRA mode in the vLLM engine config and specify the LoRA adapter in the input request; the adapter is then dynamically loaded into the vLLM engine.
  3. The LoRA adapter has to be available on the HF Hub, or it must be downloaded to the local file system in advance.

Example usage:

import ray
from ray.data.llm import build_llm_processor, vLLMEngineProcessorConfig

processor_config = vLLMEngineProcessorConfig(
    model="meta-llama/Llama-3.2-1B-Instruct",
    engine_kwargs=dict(
        # Enable LoRA in the vLLM engine.
        enable_lora=True,
        max_lora_rank=32,
    ),
    batch_size=16,
    concurrency=1,
)

processor = build_llm_processor(
    processor_config,
    preprocess=lambda row: dict(
        # Specify LoRA adapter path in the request.
        model="EdBergJr/Llama32_Baha_3",
        messages=[
            {"role": "system", "content": "You are a calculator"},
            {"role": "user", "content": f"{row['id']} ** 3 = ?"},
        ],
        sampling_params=dict(
            temperature=0.3,
            max_tokens=50,
            detokenize=False,
        ),
    ),
    postprocess=lambda row: {
        "resp": row["generated_text"],
    },
)

ds = ray.data.range(60)
ds = ds.map(
    lambda x: {"id": x["id"], "val": x["id"] + 5}
)
ds = processor(ds)
ds = ds.materialize()
for out in ds.take_all():
    print(out)
    print("==========")

cc @gvspraveen @kouroshHakha

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
@comaniac comaniac requested a review from a team as a code owner February 21, 2025 20:13
@comaniac comaniac added the go add ONLY when ready to merge, run all tests label Feb 21, 2025
gvspraveen (Contributor)

Awesome!!

Do you want to update the docs/guides in the same PR for LoRA?

@@ -183,6 +191,27 @@ def _prepare_llm_request(self, row: Dict[str, Any]) -> vLLMEngineRequest:
        else:
            image = []

        # The request is for LoRA.
        lora_request = None
        if "model" in row and row["model"] != self.model:
Contributor
Add a tiny comment here that says the model is a LoRA adapter if and only if the model id present in the row is different from the one in engine_kwargs.

We also need to document in our docs that this is how the preprocessor should be constructed for LoRA.
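For illustration only, a minimal standalone sketch of that rule; the function name resolve_lora_name and the example values are not part of this PR (in the PR the check lives inside _prepare_llm_request as shown in the diff above):

from typing import Any, Dict, Optional

def resolve_lora_name(row: Dict[str, Any], base_model: str) -> Optional[str]:
    # A row targets a LoRA adapter if and only if it carries a "model" field
    # that differs from the base model configured in engine_kwargs.
    model = row.get("model")
    if model is not None and model != base_model:
        return model
    return None

# This row targets a LoRA adapter because its model id differs from the base model.
assert resolve_lora_name(
    {"model": "EdBergJr/Llama32_Baha_3"}, "meta-llama/Llama-3.2-1B-Instruct"
) == "EdBergJr/Llama32_Baha_3"
# A row that names the base model (or omits "model") is a plain request, not LoRA.
assert resolve_lora_name(
    {"model": "meta-llama/Llama-3.2-1B-Instruct"}, "meta-llama/Llama-3.2-1B-Instruct"
) is None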

comaniac (Collaborator, Author)
Yes we should document it somewhere later.

Contributor
Yes, especially if it's going to be a local path.

with self.lora_lock:
    # Make sure no other thread has loaded the same LoRA adapter.
    if lora_name not in self.lora_name_to_request:
        lora_request = vllm.lora.request.LoRARequest(
Contributor
Big question: who actually downloads the LoRA adapter weights? Are those stored in S3? Locally? That's the part of the LoRA support I don't understand.

Contributor

For the base model, the vLLM engine does the downloading, usually from Hugging Face. What about LoRA here?

comaniac (Collaborator, Author)

You can download LoRA adapters from HF as well. For example: https://huggingface.co/EdBergJr/Llama32_Baha_3. If the LoRA weights are not available on HF, users have to download them first and specify a local path (in a shared file system). Later we could support loading LoRA weights from S3 directly in vLLM.
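For the local-path case, a hedged sketch assuming the adapter was already downloaded to a shared filesystem; the path below is purely hypothetical and the config mirrors the example in the PR description:

from ray.data.llm import build_llm_processor, vLLMEngineProcessorConfig

# Hypothetical shared-filesystem location of a pre-downloaded LoRA adapter;
# it must be visible to every worker node (e.g. NFS or a mounted volume).
LOCAL_LORA_PATH = "/mnt/shared/loras/my_adapter"

config = vLLMEngineProcessorConfig(
    model="meta-llama/Llama-3.2-1B-Instruct",
    engine_kwargs=dict(enable_lora=True, max_lora_rank=32),
    batch_size=16,
    concurrency=1,
)

processor = build_llm_processor(
    config,
    preprocess=lambda row: dict(
        # A local adapter directory is used the same way as an HF repo id:
        # any "model" different from the base model is treated as a LoRA adapter.
        model=LOCAL_LORA_PATH,
        messages=[{"role": "user", "content": f"{row['id']} ** 3 = ?"}],
        sampling_params=dict(temperature=0.3, max_tokens=50),
    ),
    postprocess=lambda row: {"resp": row["generated_text"]},
)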

Contributor
We should document these in the follow-up. I think there are so many gotchas here. I know because we have solved all of these issues for online serving and it's non-trivial, especially around loading stuff from S3.

In online serving we have to have a LoRA-aware router (which we already do through the Serve multiplex router). Here in Ray Data, I don't know what that looks like yet. Maybe we should do a thread-safe LoRA download per node, or something similar. It needs a bit more thinking (non-blocking to this PR, obviously).
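For context, a minimal sketch of the kind of thread-safe caching being discussed: double-checked locking around a name-to-request map, with the actual request construction (and any download) left to a caller-supplied function, since LoRARequest constructor arguments vary across vLLM versions. This is an illustrative pattern, not the PR's implementation:

import threading
from typing import Any, Callable, Dict

class LoRARequestCache:
    # Creates each LoRA request at most once per process, even when many
    # threads race on the same adapter name (double-checked locking).
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._name_to_request: Dict[str, Any] = {}

    def get_or_create(self, lora_name: str, create_fn: Callable[[str], Any]) -> Any:
        # Fast path: reuse an already-created request without taking the lock.
        request = self._name_to_request.get(lora_name)
        if request is not None:
            return request
        with self._lock:
            # Re-check under the lock: another thread may have won the race.
            if lora_name not in self._name_to_request:
                # create_fn would construct e.g. a vLLM LoRARequest and may
                # download adapter weights; doing it under the lock prevents
                # concurrent downloads of the same adapter in this process.
                self._name_to_request[lora_name] = create_fn(lora_name)
            return self._name_to_request[lora_name]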

@comaniac comaniac requested review from a team as code owners February 22, 2025 01:28
richardliaw (Contributor)
Looks great, can you also link the example back into this section: https://anyscale-ray--50804.com.readthedocs.build/en/50804/data/working-with-llms.html#configure-vllm-for-llm-inference

pcmoritz (Contributor) commented Feb 24, 2025 via email

@richardliaw richardliaw enabled auto-merge (squash) February 24, 2025 22:14
comaniac (Collaborator, Author)

@kouroshHakha @GeneDer could any of you stamp?

@richardliaw richardliaw merged commit e49a260 into ray-project:master Feb 24, 2025
6 checks passed
@comaniac comaniac deleted the lora branch February 24, 2025 22:48