
[Frontend] Dynamic RoPE scaling #4638

Merged: 5 commits merged into vllm-project:main on May 22, 2024

Conversation

@sasha0552 (Contributor) commented May 7, 2024

In #555, @WoosukKwon removed dynamic specifying of RoPE scaling with the comment:

As we discussed offline, I removed rope_scaling from ModelConfig and EngineArgs. Now rope_scaling is always read from the model's config.json.

I don't understand why this feature was removed, so this PR brings it back. Specifying RoPE scaling on the command line is very useful: otherwise we would have to manually modify a config.json that may be managed on Hugging Face, meaning each model would have to be forked just to set a different RoPE scaling (see the sketch below).

FIX #4334
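For illustration, with this change the override can also be passed programmatically; a minimal sketch using vLLM's offline LLM API, assuming the restored rope_scaling engine argument accepts a dict in the same format config.json uses (the command-line equivalent would be e.g. --rope-scaling '{"type": "linear", "factor": 2.0}'):

```python
from vllm import LLM, SamplingParams

# Override the model's RoPE scaling at load time instead of editing
# config.json. The `rope_scaling` keyword is assumed to be forwarded
# to EngineArgs; the dict mirrors the format used in config.json.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    rope_scaling={"type": "linear", "factor": 2.0},
    max_model_len=16384,  # 2x the native 8192-token context
)

outputs = llm.generate(
    ["Summarize the following article: ..."],
    SamplingParams(max_tokens=256),
)
print(outputs[0].outputs[0].text)
```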

RoPE scaling allows using a longer context than the model was trained for, without further fine-tuning.

Summarization using meta-llama/Meta-Llama-3-8B-Instruct (which has a native context of 8192 tokens) with type = linear and factor = 2.0:

ChatCompletion(id='cmpl-11ec2bc71a734a1eba13eaca5e0c54d9', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content="The article provides information about CUDA, a parallel computing platform and programming model created by NVIDIA. It allows software developers to use graphics processing units (GPUs) for general-purpose computing, rather than just graphics processing graphics. CUDA is a proprietary, but has been adopted by many developers. The article covers CUDA's history, features, technical specifications, capabilities, and usage.\n\nCUDA's features and capabilities include:\n\n* Parallel programming model: CUDA allows for parallel processing of tasks on multiple threads and blocks\n* Memory management: CUDA's memory hierarchy includes registers, shared memory, and global memory\n* Execution: CUDA code can be executed on multiple GPUs and CPUs\n* Interoperability: CUDA supports various programming languages, including C, C++, Fortran, and Python\n* APIs: CUDA has a low-level API (CUDA Driver) and high-level (CUDA) API, with libraries and runtime\n\nCUDA has several advantages, including:\n\n* Higher performance, especially for computationally intensive tasks\n* Larger memory bandwidth and storage\n* Better power efficiency\n\nCUDA's compute capabilities include:\n\n* Wide range of compute capabilities (1.0 to 11.1)\n* Different memory types (shared, global, texture, and constant)\n* Instructions (e.g., ALU, INT, FP, and FP16)\n* Number of AL lanes, texture mapping units, and scheduling\n* Warp size and block sizes\n\nCUDA is used for various applications, including:\n\n* Accelerated rendering, video, encryption, and decryption\nBioinformatics, medical simulations, machine learning\nNeural network, proteins, cryptography, and more\n\nCUDA competes with other GPU computing stacks, such as Intel's OneAPI and AMD's ROCm.", role='assistant', function_call=None, tool_calls=None), stop_reason=128009)], created=1715044970, model='meta-llama/Meta-Llama-3-8B-Instruct', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=344, prompt_tokens=12648, total_tokens=12992))
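For reference, a response like the one above can be obtained from a vLLM OpenAI-compatible server; the file name and prompt below are illustrative placeholders, not taken from the original run:

```python
from openai import OpenAI

# Talk to a locally running vLLM OpenAI-compatible server started with
# the RoPE scaling override described above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Placeholder for the ~12k-token CUDA article summarized in the output above.
article = open("cuda_article.txt").read()

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": f"Summarize this article:\n\n{article}"},
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```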

@sasha0552 sasha0552 changed the title Dynamic RoPE scaling [Frontend] Dynamic RoPE scaling May 7, 2024
Review comments on vllm/engine/arg_utils.py (outdated, resolved)
Review comments on vllm/transformers_utils/config.py (outdated, resolved)
@sasha0552 sasha0552 requested a review from mgoin May 14, 2024 20:31
@sasha0552 sasha0552 mentioned this pull request May 18, 2024
@tom-doerr commented
Could this get merged? Especially since Llama 3 is very tolerant of RoPE scaling, this would be very useful.

@mgoin (Member) commented May 21, 2024

Sure, this would be great to get in. Currently we don't have a test for it, though, and testing with a real model might be too intensive, so I would like to see at least a unit test implemented. @sasha0552 could you add a test to tests/test_config.py?
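A minimal sketch of the kind of test being requested; the test name and the exact ModelConfig constructor arguments here are assumptions, not necessarily what was merged:

```python
from vllm.config import ModelConfig

TEST_ROPE_SCALING = {"type": "dynamic", "factor": 2.0}

def test_rope_scaling_override():
    # Build a ModelConfig with an explicit rope_scaling override and check
    # that it is propagated into the underlying HF config, taking precedence
    # over whatever the model's config.json specifies.
    model_config = ModelConfig(
        "meta-llama/Meta-Llama-3-8B-Instruct",
        tokenizer="meta-llama/Meta-Llama-3-8B-Instruct",
        tokenizer_mode="auto",
        trust_remote_code=False,
        dtype="float16",
        seed=0,
        rope_scaling=TEST_ROPE_SCALING,
    )
    assert model_config.hf_config.rope_scaling == TEST_ROPE_SCALING
```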

@sasha0552 (Contributor, Author) commented
@mgoin test added. Can you review? The failures are not related to this PR.

@mgoin (Member) left a comment

Thank you @sasha0552! This is great. We just merged a fix for the failing tests (#4944), so please rebase and the tests should pass.

@mgoin mgoin enabled auto-merge (squash) May 21, 2024 19:39
auto-merge was automatically disabled May 21, 2024 19:45 (head branch was pushed to by a user without write access)

@sasha0552 (Contributor, Author) commented
@mgoin can you merge? All tests passed.

@mgoin mgoin merged commit 9b9a10d into vllm-project:main May 22, 2024
62 checks passed
@sasha0552 sasha0552 deleted the dynamic-rope branch May 22, 2024 07:07
dtrifiro pushed a commit to opendatahub-io/vllm that referenced this pull request May 31, 2024
robertgshaw2-neuralmagic pushed a commit to neuralmagic/nm-vllm that referenced this pull request Jun 8, 2024
joerunde pushed a commit to joerunde/vllm that referenced this pull request Jun 17, 2024
robertgshaw2-neuralmagic pushed a commit to neuralmagic/nm-vllm that referenced this pull request Jul 14, 2024
Temirulan pushed a commit to Temirulan/vllm-whisper that referenced this pull request Sep 6, 2024

Successfully merging this pull request may close these issues.

[Feature]: Set RoPE scaling parameters dynamically (#4334)