[Model] Adds Phi-3 support #4298
@@ -46,6 +46,7 @@
     "OPTForCausalLM": ("opt", "OPTForCausalLM"),
     "OrionForCausalLM": ("orion", "OrionForCausalLM"),
     "PhiForCausalLM": ("phi", "PhiForCausalLM"),
+    "Phi3ForCausalLM": ("llama", "LlamaForCausalLM"),
     "QWenLMHeadModel": ("qwen", "QWenLMHeadModel"),
     "Qwen2ForCausalLM": ("qwen2", "Qwen2ForCausalLM"),
     "Qwen2MoeForCausalLM": ("qwen2_moe", "Qwen2MoeForCausalLM"),

Review comment on the Phi3ForCausalLM entry:
Retrospectively, a quick question (more for my own understanding): why does this map to ("llama", "LlamaForCausalLM")? It seems like the module and class name are later used to import the model implementation from vLLM. Phi-3, while on paper very similar to Llama 2, does have a few modifications that make it different from Llama 2, such as the fused qkv and mlp projections. Should we give Phi-3 its own file, or do you think it is OK to keep inheriting from llama this way? (Or, based on the rest of the PR, you may have already done this by just changing llama.py, so maybe that just works.)

Reply:
The model is very similar to Llama, hence it is better to reuse the existing code base. The same is true for Mistral.

Reply:
Ah, I saw your changes to the other files and they make sense now. Thank you!
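For readers unfamiliar with this registry, here is a minimal sketch of how a (module name, class name) entry like the one added above can be resolved into a model class. The package path and the helper name are assumptions for illustration, not vLLM's exact API:

```python
import importlib
from typing import Tuple, Type

# Illustrative registry mirroring the diff above: architecture name -> (module, class).
_MODELS: dict[str, Tuple[str, str]] = {
    "PhiForCausalLM": ("phi", "PhiForCausalLM"),
    "Phi3ForCausalLM": ("llama", "LlamaForCausalLM"),  # Phi-3 reuses the Llama code path.
}


def resolve_model_cls(architecture: str) -> Type:
    # Hypothetical resolver: look up (module, class) and import the module lazily.
    # The package prefix below is an assumption about where the model files live.
    module_name, cls_name = _MODELS[architecture]
    module = importlib.import_module(f"vllm.model_executor.models.{module_name}")
    return getattr(module, cls_name)


# Usage: resolve_model_cls("Phi3ForCausalLM") would hand back the Llama class,
# which is why the Llama implementation itself has to understand Phi-3's
# config quirks (see the llama.py changes below).
```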
The remaining hunks are in the Llama implementation (llama.py), which Phi-3 reuses per the discussion above:
@@ -180,6 +180,10 @@ def __init__(
         self.hidden_size = config.hidden_size
         rope_theta = getattr(config, "rope_theta", 10000)
         rope_scaling = getattr(config, "rope_scaling", None)
+        if rope_scaling is not None and getattr(
+                config, "original_max_position_embeddings", None):
+            rope_scaling["original_max_position_embeddings"] = (
+                config.original_max_position_embeddings)
         max_position_embeddings = getattr(config, "max_position_embeddings",
                                           8192)
         sliding_window = getattr(config, "sliding_window", None)
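To see why this copy is needed: the HF config keeps original_max_position_embeddings at the top level, while only the rope_scaling dict is handed on to the rotary-embedding setup, so the value is folded into that dict. A small sketch with an illustrative Phi-3-style config (the concrete values are assumptions, not taken from a real checkpoint):

```python
from types import SimpleNamespace

# Illustrative config values (assumptions, not copied from a specific checkpoint).
config = SimpleNamespace(
    rope_theta=10000,
    max_position_embeddings=131072,
    original_max_position_embeddings=4096,  # context length before the long-context extension
    rope_scaling={"type": "longrope",
                  "short_factor": [1.0] * 48,
                  "long_factor": [2.0] * 48},
)

rope_scaling = getattr(config, "rope_scaling", None)
# Same pattern as the hunk above: copy the top-level field into the dict
# that the rotary-embedding code will actually receive.
if rope_scaling is not None and getattr(
        config, "original_max_position_embeddings", None):
    rope_scaling["original_max_position_embeddings"] = (
        config.original_max_position_embeddings)

assert rope_scaling["original_max_position_embeddings"] == 4096
```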
@@ -378,11 +382,11 @@ def sample(
     def load_weights(self, weights: Iterable[Tuple[str, torch.Tensor]]):
         stacked_params_mapping = [
             # (param_name, shard_name, shard_id)
-            ("qkv_proj", "q_proj", "q"),
-            ("qkv_proj", "k_proj", "k"),
-            ("qkv_proj", "v_proj", "v"),
-            ("gate_up_proj", "gate_proj", 0),
-            ("gate_up_proj", "up_proj", 1),
+            (".qkv_proj", ".q_proj", "q"),
+            (".qkv_proj", ".k_proj", "k"),
+            (".qkv_proj", ".v_proj", "v"),
+            (".gate_up_proj", ".gate_proj", 0),
+            (".gate_up_proj", ".up_proj", 1),
         ]
         params_dict = dict(self.named_parameters())
         for name, loaded_weight in weights:

Review comment on the ".qkv_proj" lines:
A quick question: why add the dot here?

Review comment:
Does Phi have a different set of mappings?

Reply:
This ensures that we replace the full layer name instead of a partial layer name. Phi-3 already uses the vLLM layer naming convention (no replacement necessary), so it has a layer named gate_up_proj; without the dots, that layer would be incorrectly renamed to gate_gate_up_proj due to line 389. The dot in ".up_proj" makes sure we only rename a layer when the checkpoint name contains ".up_proj" as a full component, rather than "up_proj" appearing anywhere in the name. This change remains compatible with the Llama and Mistral models.

Reply:
Thanks for the clarification, makes sense!
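To make the explanation concrete, here is a standalone sketch of the substring replacement behind this mapping (the parameter names are illustrative, and this is not the actual load_weights loop):

```python
# Why the leading dot matters: the mapping works by substring replacement
# on checkpoint parameter names.

# Case 1: Llama-style checkpoint, separate up_proj that must map into the fused param.
llama_name = "model.layers.0.mlp.up_proj.weight"
print(llama_name.replace(".up_proj", ".gate_up_proj"))
# -> model.layers.0.mlp.gate_up_proj.weight   (correct)

# Case 2: Phi-3-style checkpoint, already-fused gate_up_proj.
phi3_name = "model.layers.0.mlp.gate_up_proj.weight"

# Old entry ("gate_up_proj", "up_proj", 1): "up_proj" matches inside
# "gate_up_proj" and mangles the name.
print(phi3_name.replace("up_proj", "gate_up_proj"))
# -> model.layers.0.mlp.gate_gate_up_proj.weight   (wrong)

# New entry (".gate_up_proj", ".up_proj", 1): ".up_proj" does not match,
# so the already-correct name is left untouched.
print(".up_proj" in phi3_name)  # False
print(phi3_name.replace(".up_proj", ".gate_up_proj"))
# -> model.layers.0.mlp.gate_up_proj.weight   (unchanged)
```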
Review comment:
I have a question on this line: in a case where some prompts are longer than k and others are shorter than k, the batch gets a uniform offset (choosing the long cos_sin_cache), so the shorter prompts will also use the long cos_sin_cache, and their outputs will differ from when they are processed alone. Is this expected? I think the short cos_sin_cache should be used for the shorter prompts.

Reply:
Hi @mindest, that's correct: when a batch contains both long and short sequences, the long factor is always used, even for the short samples. Currently we don't support such mixed batches.
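For intuition, a hedged sketch of the batch-level behavior described here; this illustrates the semantics from the discussion, not vLLM's actual rotary-embedding code:

```python
import torch


def select_cos_sin_cache(seq_lens, original_max_pos, short_cache, long_cache):
    """Illustrative per-batch selection (an assumption mirroring the discussion).

    If any sequence in the batch is longer than the original context length,
    the long-factor cache is applied to the whole batch, short sequences included.
    """
    if max(seq_lens) > original_max_pos:
        return long_cache
    return short_cache


# Illustrative caches: original 4096 positions vs. an extended range.
short_cache = torch.randn(4096, 64)
long_cache = torch.randn(8192, 64)

# A mixed batch: the 5000-token prompt forces the long cache for everyone, so
# the 100-token prompt sees different cos/sin values than it would if it were
# processed alone.
cache = select_cos_sin_cache([100, 5000], 4096, short_cache, long_cache)
assert cache is long_cache
```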