llama: add initial support for Falcon-H1 model family #14534
base: master
Conversation
Is the old GGUF still valid, or must a new one be generated?
@jacekpoplawski
Could you upload the new GGUF to Hugging Face?
def get_tensors(self) -> Iterator[tuple[str, Tensor]]:
    for name, tensor in super().get_tensors():
If possible, modify_tensors should be overridden instead of get_tensors to rename and/or insert tensors.
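For what it's worth, a minimal sketch of what that override could look like, following the usual convert_hf_to_gguf.py pattern; the tensor-name rule below is a placeholder, not the actual Falcon-H1 mapping:

# Sketch only - this would live on the Falcon-H1 model class in convert_hf_to_gguf.py,
# where Tensor/Iterable and map_tensor_name are already available.
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
    # rename or insert tensors here instead of overriding get_tensors()
    if ".mamba." in name:  # hypothetical source naming, for illustration only
        name = name.replace(".mamba.", ".ssm.")
    return [(self.map_tensor_name(name), data_torch)]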
def add_attn_head_count(self, count: int) -> None:
    self.add_uint32(Keys.Attention.HEAD_COUNT.format(arch=self.arch), count)

def add_key_value_head_count(self, count: int) -> None:
    self.add_uint32(Keys.Attention.HEAD_COUNT_KV.format(arch=self.arch), count)
Don't these already exist?
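If I remember correctly, GGUFWriter already exposes setters that write these same keys, so the convert script can call them directly; a quick usage sketch (method names recalled from gguf-py/gguf/gguf_writer.py and hparam keys are the usual HF config names - worth double-checking for Falcon-H1):

# Existing writer API, as far as I recall; the hparam keys are assumptions:
self.gguf_writer.add_head_count(self.hparams["num_attention_heads"])
self.gguf_writer.add_head_count_kv(self.hparams["num_key_value_heads"])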
@@ -172,6 +172,7 @@ class SSM:
    TIME_STEP_RANK = "{arch}.ssm.time_step_rank"
    GROUP_COUNT    = "{arch}.ssm.group_count"
    DT_B_C_RMS     = "{arch}.ssm.dt_b_c_rms"
    HEAD_DIM       = "{arch}.ssm.head_dim"
The head dimension in Mamba-2 is also the time step rank. I guess it could be clearer to use a more appropriate name like this, though. I'm not against it; this is only to let you know.
{ LLM_KV_FINAL_LOGIT_SOFTCAPPING, "%s.final_logit_softcapping" },
{ LLM_KV_FINAL_LOGIT_SOFTCAPPING, "%s.final_logit_softcapping" },
Spurious extra space?
// try {
    auto * ctx = new llama_context(*model, params);
    return ctx;
// } catch (const std::exception & err) {
//     LLAMA_LOG_ERROR("%s: failed to initialize the context: %s\n", __func__, err.what());
// }

// return nullptr;
Seems like debugging modifications made it into a commit
{ LLM_KV_SSM_CONV_KERNEL,    "%s.ssm.conv_kernel" },
{ LLM_KV_SSM_INNER_SIZE,     "%s.ssm.inner_size" },
{ LLM_KV_SSM_STATE_SIZE,     "%s.ssm.state_size" },
{ LLM_KV_SSM_TIME_STEP_RANK, "%s.ssm.time_step_rank" },
{ LLM_KV_SSM_DT_B_C_RMS,     "%s.ssm.dt_b_c_rms" },
{ LLM_KV_SSM_GROUP_COUNT,    "%s.ssm.group_count" },
{ LLM_KV_SSM_HEAD_DIM,       "%s.ssm.head_dim" },
{ LLM_KV_MAMBA_D_SSM,        "%s.ssm.mamba_d_ssm" },
I feel like these are redundant with the existing metadata used for Mamba. Any reason to duplicate keys with the exact same meaning? Note that the %s at the beginning will turn into the arch name anyway (e.g. falcon_h1). Where it makes sense, try to re-use existing keys.
Yes, the LLM_KV_FALCON_H1_ prefix should be avoided in the enum values - it's too specific. Reuse existing keys and only add new generic keys if needed.
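On the converter side, most of these should already be covered by the generic SSM setters in GGUFWriter; a sketch from memory (the Falcon-H1 hparam names here are guesses and need to be checked against the actual config):

# Reuse the existing generic SSM keys instead of adding arch-specific ones.
# Writer method names recalled from gguf-py; the hparam keys are hypothetical.
self.gguf_writer.add_ssm_conv_kernel(self.find_hparam(["conv_kernel", "mamba_d_conv"]))
self.gguf_writer.add_ssm_state_size(self.find_hparam(["state_size", "mamba_d_state"]))
self.gguf_writer.add_ssm_group_count(self.hparams.get("mamba_n_groups", 1))
self.gguf_writer.add_ssm_inner_size(d_inner)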
} else if (name.find("ssm_in.weight") != std::string::npos) {
    // For mamba-based models it's better to not quantize the ssm-proj layers
    if (ftype == LLAMA_FTYPE_MOSTLY_Q3_K_S) {
        new_type = GGML_TYPE_BF16;
    }
They can be quantized for pure Mamba models, although maybe for hybrid models it makes sense to keep this bigger in mixes. Why only for Q3_K_S, though?
{ LLM_KV_FALCON_H1_KEY_MULTIPLIER,       "%s.key_multiplier" },
{ LLM_KV_FALCON_H1_LM_HEAD_MULTIPLIER,   "%s.lm_head_multiplier" },
{ LLM_KV_FALCON_H1_EMBEDDING_MULTIPLIER, "%s.embedding_multiplier" },
{ LLM_KV_FALCON_H1_MAMBA_CHUNK_SIZE,     "%s.ssm.mamba_chunk_size" },
Is the mamba_chunk_size used anywhere? I thought it was meant to be a runtime parameter for the semi-structured matrices implementation of Mamba-2 (which is not yet implemented in llama.cpp).
d_inner = self.find_hparam(["intermediate_size", "d_inner"], optional=True) or 2 * d_model
n_group = self.hparams.get("n_groups", 1)
architectures = self.hparams.get("architectures")
if architectures is not None and architectures[0] == "FalconH1ForCausalLM":
    # FalconH1F has a different d_inner
    d_inner = self.hparams.get("mamba_d_ssm")
Another (maybe simpler) approach would be to add "mamba_d_ssm" to the find_hparam call for d_inner above.
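Roughly like this (a sketch of that suggestion, assuming find_hparam keeps returning the first key that is present):

# Let find_hparam pick up the Falcon-H1 key directly instead of special-casing the architecture:
d_inner = self.find_hparam(["mamba_d_ssm", "intermediate_size", "d_inner"], optional=True) or 2 * d_model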
# TODO: does this really matter?
assert d_inner == 2 * d_model
assert d_inner % head_dim == 0
# skip the assertion for FalconH1 Model
architectures = self.hparams.get("architectures")
if architectures is None or architectures[0] != "FalconH1ForCausalLM":
    assert d_inner == 2 * d_model
    assert d_inner % head_dim == 0
That check isn't strictly necessary anyway, and probably should be removed altogether (and also in src/llama-model.cpp). If you want to keep it for now, but not for Falcon-H1, does self.model_arch correspond to gguf.MODEL_ARCH.FALCON_H1 when it's that arch? (Might be simpler than reading hparams.)
Many thanks @compilade for the review; tomorrow I'll address everything that was mentioned.
if (arch == LLM_ARCH_FALCON_H1) {
    cur = ggml_scale(ctx0, cur, hparams.mlp_down_multiplier);
}
Make this more generic. Suggested change:

-    if (arch == LLM_ARCH_FALCON_H1) {
-        cur = ggml_scale(ctx0, cur, hparams.mlp_down_multiplier);
-    }
+    if (hparams.mlp_down_multiplier != 1.0f) {
+        cur = ggml_scale(ctx0, cur, hparams.mlp_down_multiplier);
+    }
if (il == n_layer - 1) {
    // skip computing output for unused tokens
    ggml_tensor * inp_out_ids = build_inp_out_ids();
    cur = ggml_get_rows(ctx0, cur, inp_out_ids);
Move the creation of inp_out_ids to the front, before the layer loop, as in other models.
Fixes: #13681
Summary
• Adds initial support for the Falcon-H1 model family.
• Implements model loading, basic inference, and tokenizer integration for Falcon-H1.
• Updates the build scripts and relevant documentation for Falcon-H1 compatibility.
Details
• Adapted model architecture and layer mapping to match Falcon-H1.
• Integrated Falcon-H1 tokenizer with automatic fallback if tokenizer files are missing.
• Added new test cases to verify Falcon-H1 model loading and inference.
• Cleaned up redundant code from previous Falcon integration attempts.
Notes
• The Falcon-H1 integration follows the same approach as other model families (see llama and Mamba support).
• This supersedes #14238 with a cleaner and more modular implementation.
• Refer to the Falcon-H1 repo for model weights and tokenizer files.