[Model] Adds Phi-3 support #4298
Conversation
Interested to see whether
A torch.compile-compatible version is here without scaling. @caiom would you mind synchronizing with huggingface/transformers#30423 and @gugarosa to make sure we have a similar format?
I've updated the naming to match the one used in HF, along with the config changes. Let's wait until the HF PR is merged.
```python
long_prompt_offset = (torch.any(positions > k).float() *
                      torch.full_like(positions, k)).long()
```
I have a question on this line: in a case where there are prompts longer than `k` and others shorter than `k`, it will have a uniform offset (choosing the long cos_sin_cache), so the shorter prompts will also refer to the long cos_sin_cache, and the outputs will differ from when the short prompts are processed alone. Is this expected? I think the short cos_sin_cache should be used in the shorter-prompt case.
Hi @mindest, that's correct: when a batch contains both long and short sequences, it will always use the long factor, even for the short samples. Currently we don't support such mixed batches.
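For illustration, here is a small self-contained sketch of how the offset in the quoted snippet behaves for a short-only batch versus a mixed batch. The value of `k` below is made up; in the real code it is presumably the threshold between the short and long rope factors (the original max position embeddings).

```python
import torch

k = 4096  # made-up threshold between the short and long rope factors

short_positions = torch.arange(0, 16)                   # a short prompt alone
mixed_positions = torch.cat([short_positions,
                             torch.arange(0, k + 16)])  # short + long prompt batched

for positions in (short_positions, mixed_positions):
    long_prompt_offset = (torch.any(positions > k).float() *
                          torch.full_like(positions, k)).long()
    # All zeros for the short-only batch, but uniformly k for the mixed batch,
    # i.e. the short prompt is also routed to the long cos_sin_cache entries.
    print(long_prompt_offset.unique().tolist())
```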
```diff
@@ -349,16 +469,17 @@ def get_rope(
     rope_scaling: Optional[Dict[str, Any]] = None,
 ) -> RotaryEmbedding:
     key = (head_size, rotary_dim, max_position, base, is_neox_style,
-           tuple(rope_scaling.items()) if rope_scaling is not None else None)
+           (v for v in rope_scaling.items()
+            if not isinstance(v, list)) if rope_scaling is not None else None)
```
Maybe instead of treating the `list` as None, consider turning it into a `tuple` instead? That would be more accurate.
Hi! I fixed the code and disentangled it to be clearer. I also verified that the layer caching system is working.
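As a rough sketch of the direction suggested above (hypothetical helper name, not the exact code in this PR): list values in `rope_scaling`, such as Phi-3's `short_factor`/`long_factor` arrays, can be turned into tuples so they stay part of the `get_rope` cache key instead of being dropped.

```python
from typing import Any, Dict, Optional


def _hashable_rope_scaling(rope_scaling: Optional[Dict[str, Any]]):
    """Turn rope_scaling into something hashable for the get_rope cache key."""
    if rope_scaling is None:
        return None
    return tuple(
        (name, tuple(value) if isinstance(value, list) else value)
        for name, value in rope_scaling.items()
    )


# Two configs that differ only inside a list now produce different cache keys,
# and the result is hashable, unlike a dict containing lists.
key_a = _hashable_rope_scaling({"type": "su", "long_factor": [1.0, 1.1]})
key_b = _hashable_rope_scaling({"type": "su", "long_factor": [1.0, 1.2]})
assert key_a != key_b
hash(key_a)
```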
The HF PR is merged: huggingface/transformers#30423
About torch.compile: Phi3SuScaledRotaryEmbedding is fully compatible with it and there are no graph breaks. However, I noticed a longer startup time and some recompilation warnings; maybe someone with more torch.compile experience can add this feature later? The speedup is between 2-5x depending on the sequence length.
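For context, a generic timing sketch (toy function and shapes, not the benchmark behind the numbers above) of how eager vs. compiled rotary-style code can be compared:

```python
import time
import torch


def rotate(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # Toy rotary-style elementwise computation, purely for illustration.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat([x1 * cos - x2 * sin, x2 * cos + x1 * sin], dim=-1)


compiled_rotate = torch.compile(rotate)

x = torch.randn(4096, 128)
cos = torch.randn(4096, 64)
sin = torch.randn(4096, 64)

for fn, name in [(rotate, "eager"), (compiled_rotate, "compiled")]:
    fn(x, cos, sin)  # warm-up; triggers compilation once for the compiled path
    start = time.perf_counter()
    for _ in range(100):
        fn(x, cos, sin)
    print(name, time.perf_counter() - start)
```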
("qkv_proj", "v_proj", "v"), | ||
("gate_up_proj", "gate_proj", 0), | ||
("gate_up_proj", "up_proj", 1), | ||
(".qkv_proj", ".q_proj", "q"), |
A quick question: why add the dot here?
Does Phi have a different set of mappings? Does this not break the existing Llama executor?
This ensures that we replace the full layer name instead of a partial layer name.
Phi-3 uses the vLLM layer naming convention (no replacement necessary), so it has a layer named gate_up_proj; without the dots, this layer would be incorrectly renamed to gate_gate_up_proj due to line 389.
The dot in ".up_proj" makes sure that we only rename a layer whose name component is exactly "up_proj", rather than any layer that merely contains "up_proj" somewhere in its name.
This change is compatible with the Llama and Mistral models.
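To make that concrete, here is a small standalone illustration (hypothetical weight name, simplified replace loop) of the difference the leading dot makes:

```python
stacked_params_mapping = [
    # (param_name, weight_name, shard_id)
    (".qkv_proj", ".q_proj", "q"),
    (".qkv_proj", ".k_proj", "k"),
    (".qkv_proj", ".v_proj", "v"),
    (".gate_up_proj", ".gate_proj", 0),
    (".gate_up_proj", ".up_proj", 1),
]

checkpoint_name = "model.layers.0.mlp.gate_up_proj.weight"  # already fused in Phi-3

# Without the dots, "up_proj" would also match inside "gate_up_proj" and the
# replace would produce "gate_gate_up_proj". With the dots, nothing matches and
# the fused weight keeps its name.
for param_name, weight_name, _ in stacked_params_mapping:
    if weight_name in checkpoint_name:
        checkpoint_name = checkpoint_name.replace(weight_name, param_name)
print(checkpoint_name)  # unchanged: model.layers.0.mlp.gate_up_proj.weight
```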
Thanks for the clarification! Makes sense.
LGTM! Could you help add this model to the documentation and the README file?
Thanks for the review. I've added Phi-3 to the README and to the Supported Models doc.
Hi @caiom, Phi-3 with this is relatively slow for a 3B model. Is there a way to speed it up? I get about 73 tokens/sec on an H100 with no in-context stuff, and about 35 tokens/sec with the context filled, but that's with streaming. I get better performance, about 125 tokens/sec with no in-context stuff, when not streaming.
```diff
@@ -46,6 +46,7 @@
     "OPTForCausalLM": ("opt", "OPTForCausalLM"),
     "OrionForCausalLM": ("orion", "OrionForCausalLM"),
     "PhiForCausalLM": ("phi", "PhiForCausalLM"),
+    "Phi3ForCausalLM": ("llama", "LlamaForCausalLM"),
```
Retrospectively, a quick question (more for my own understanding): why are these `llama` and `LlamaForCausalLM`? Isn't it defined as Phi3ForCausalLM in the Hugging Face transformers library? https://huggingface.co/docs/transformers/main/en/model_doc/phi3. It seems like the module and class name are later used to import from vLLM's /models directory, and we have not yet added a phi3.py there or created its own file.
I think Phi-3, while on paper very similar to Llama 2, does have a few modifications that make it different from Llama 2, like the fused QKV and MLP. Should we make Phi-3 its own file, or do you think it is OK to just keep inheriting from Llama this way? (Or I guess, based on the rest of the PR, you may have already done this by just changing llama.py, so maybe that just works, haha.)
The model is very similar to llama, hence it is better to reuse the existing code base. The same is true for Mistral.
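Roughly speaking (this is an illustrative sketch of the registry lookup, not vLLM's exact loader code), the tuple means "load class LlamaForCausalLM from the llama module", so Phi-3 checkpoints are served by the existing Llama implementation:

```python
import importlib

_MODELS = {
    "PhiForCausalLM": ("phi", "PhiForCausalLM"),
    "Phi3ForCausalLM": ("llama", "LlamaForCausalLM"),
}


def resolve_model_cls(architecture: str):
    # First element picks the module under vllm/model_executor/models/,
    # second element picks the class inside it.
    module_name, cls_name = _MODELS[architecture]
    module = importlib.import_module(
        f"vllm.model_executor.models.{module_name}")
    return getattr(module, cls_name)


# resolve_model_cls("Phi3ForCausalLM") returns LlamaForCausalLM, which is why
# no separate phi3.py is needed as long as llama.py can express the
# differences (fused QKV/MLP, rope scaling, etc.).
```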
Ah, I saw your changes to the other files and they make sense now. Ty!
Will be working on that.
Hi, sorry, it's me again. I am actually getting an error with Phi-3 after installing from source.
The sliding window is 2047, and that seems correct based on the Phi-3 code. Block size 16 is also correct...? Am I missing something? I am currently working on a custom quant for Phi-3, which may be related. Should I open a separate issue? Also, I am not sure how this was tested, but Phi-3 mini 4k has a sliding window of 2047, while I think the 128k context-length version has the sliding window disabled or set to something different?
Hi @davidgxue, sorry about that! Have a look here: #4380
This PR adds support for microsoft/Phi-3-mini-4k-instruct and microsoft/Phi-3-mini-128k-instruct.
Changes:
Models tested before this PR:
BEFORE SUBMITTING, PLEASE READ THE CHECKLIST BELOW AND FILL IN THE DESCRIPTION ABOVE
PR Checklist (Click to Expand)
Thank you for your contribution to vLLM! Before submitting the pull request, please ensure the PR meets the following criteria. This helps vLLM maintain the code quality and improve the efficiency of the review process.
PR Title and Classification
Only specific types of PRs will be reviewed. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:
- `[Bugfix]` for bug fixes.
- `[CI/Build]` for build or continuous integration improvements.
- `[Doc]` for documentation fixes and improvements.
- `[Model]` for adding a new model or improving an existing model. Model name should appear in the title.
- `[Frontend]` for changes on the vLLM frontend (e.g., OpenAI API server, `LLM` class, etc.)
- `[Kernel]` for changes affecting CUDA kernels or other compute kernels.
- `[Core]` for changes in the core vLLM logic (e.g., `LLMEngine`, `AsyncLLMEngine`, `Scheduler`, etc.)
- `[Hardware][Vendor]` for hardware-specific changes. Vendor name should appear in the prefix (e.g., `[Hardware][AMD]`).
- `[Misc]` for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.
Code Quality
The PR needs to meet the following code quality standards:
- Use `format.sh` to format your code.
- Add documentation to `docs/source/` if the PR modifies the user-facing behaviors of vLLM. It helps vLLM users understand and utilize the new features or changes.

Notes for Large Changes
Please keep the changes as concise as possible. For major architectural changes (>500 LOC excluding kernel/data/config/test), we would expect a GitHub issue (RFC) discussing the technical design and justification. Otherwise, we will tag it with `rfc-required` and might not go through the PR.

What to Expect for the Reviews
The goal of the vLLM team is to be a transparent reviewing machine. We would like to make the review process transparent and efficient and make sure no contributor feels confused or frustrated. However, the vLLM team is small, so we need to prioritize some PRs over others. Here is what you can expect from the review process:
- We will put an `action-required` label on the PR if there are changes required. The contributor should address the comments and ping the reviewer to re-review the PR.
Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM. Your contributions make vLLM a great tool for everyone!