Add support for Arcee AI's upcoming AFM model #14185
Conversation
LGTM!
NULL, NULL, NULL,
model.layers[il].ffn_down, NULL, NULL,
NULL,
LLM_FFN_RELU_SQR, LLM_FFN_SEQ, il);
Seems like the only difference between AFM and llama is this activation function.
Not sure if in the future we can abstract out this activation definition per-model (maybe as a hparam or a variable inside `struct llm_build_llama`?) to avoid too much duplicated code. WDYT @ggerganov ?
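For example, a toy self-contained sketch of that idea (all names below are made up for illustration, not actual llama.cpp identifiers): keep a single builder and make the activation a per-model member/hparam, so AFM could reuse the LLaMA graph with a different FFN activation.

```cpp
// Illustrative sketch only -- not llama.cpp code. The point is that the
// activation becomes data (a per-model "hparam") instead of a separate
// duplicated build function per architecture.
#include <cmath>
#include <vector>

enum class ffn_act { SILU, RELU_SQR };

struct ffn_builder {
    ffn_act act; // SILU for LLaMA-style models, RELU_SQR for AFM

    std::vector<float> activate(std::vector<float> x) const {
        for (float & v : x) {
            if (act == ffn_act::SILU) {
                v = v / (1.0f + std::exp(-v));   // silu(x) = x * sigmoid(x)
            } else {
                v = v > 0.0f ? v * v : 0.0f;     // relu_sqr(x) = max(0, x)^2
            }
        }
        return x;
    }
};

int main() {
    const std::vector<float> h = {-1.0f, 0.5f, 2.0f};
    ffn_builder llama_ffn{ffn_act::SILU};
    ffn_builder afm_ffn{ffn_act::RELU_SQR};
    auto a = llama_ffn.activate(h);
    auto b = afm_ffn.activate(h);
    (void)a; (void)b;
    return 0;
}
```

In llama.cpp terms that would roughly mean passing the FFN op type (e.g. LLM_FFN_RELU_SQR vs LLM_FFN_SILU) as a per-model value into a shared build path, rather than duplicating the whole builder.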
It also lacks the FFN gate, but maybe that could also be abstracted?
If the gate is not present, its value will be `nullptr`, and `build_ffn` will skip the `nullptr` value, so no further modification is required.
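A minimal standalone sketch of that behaviour (not the actual `build_ffn` implementation, just the pattern described above): when no gate tensor is supplied, the gating multiply is simply skipped.

```cpp
// Illustrative sketch of "no gate tensor => skip the gating step".
#include <cstddef>
#include <vector>

static std::vector<float> ffn_forward(const std::vector<float> & up_out,
                                      const std::vector<float> * gate_out /* may be nullptr */) {
    std::vector<float> cur = up_out;
    if (gate_out != nullptr) {
        // gated path (e.g. LLaMA): elementwise multiply by the gate branch
        for (std::size_t i = 0; i < cur.size(); ++i) {
            cur[i] *= (*gate_out)[i];
        }
    }
    // with gate_out == nullptr (e.g. AFM) the multiply never happens
    return cur;
}

int main() {
    const std::vector<float> up   = {1.0f, 2.0f};
    const std::vector<float> gate = {0.5f, 0.25f};
    auto gated   = ffn_forward(up, &gate);   // uses the gate
    auto ungated = ffn_forward(up, nullptr); // gate skipped, no changes needed
    (void)gated; (void)ungated;
    return 0;
}
```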
Ah right, that makes sense! Yeah, definitely seems worth considering some extra abstraction here then.
btw I'm just bringing this up for further discussion. Feel free to merge the current PR without that.
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
Lmk when you're ready to merge this
Ready!
@@ -128,6 +128,7 @@ class TOKENIZER_TYPE(IntEnum):
{"name": "llama4", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct", },
{"name": "pixtral", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/mistral-community/pixtral-12b", },
{"name": "seed-coder", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/ByteDance-Seed/Seed-Coder-8B-Base", },
{"name": "arcee", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/arcee-ai/AFM-4.5B", }, # TODO confirm final URL
Either this shouldn't have been added, or you forgot to add the new hash.
addressed in #14207
This adds support for the upcoming Arcee model architecture, currently codenamed the Arcee Foundation Model (AFM).
Uses ReLU² (ReLU-squared) activation in the MLP blocks (see the short sketch below)
Have tested performance of the quantized model and it seems to perform as expected, but keeping this a draft until it can be lightly reviewed and we confirm it's accurate
Transformers PR reference: huggingface/transformers#38621
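For reference, ReLU² here means ReLU followed by an elementwise square, i.e. relu_sqr(x) = max(0, x)²; a tiny standalone sketch (not the ggml implementation):

```cpp
// relu_sqr(x) = max(0, x)^2 -- the activation used in AFM's MLP blocks.
#include <algorithm>

static float relu_sqr(float x) {
    const float r = std::max(0.0f, x);
    return r * r;
}

int main() {
    // relu_sqr(-2) == 0, relu_sqr(3) == 9
    return (relu_sqr(-2.0f) == 0.0f && relu_sqr(3.0f) == 9.0f) ? 0 : 1;
}
```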