
llama: add initial support for Falcon-H1 model family #14534

Draft: wants to merge 26 commits into base: master

Conversation


@ibrahimkhadraoui commented on Jul 4, 2025

Fixes: #13681
Summary
• Add initial support for the Falcon-H1 model family.
• Implement model loading, basic inference, and tokenizer integration for Falcon-H1.
• Update the build scripts and relevant documentation for Falcon-H1 compatibility.

Details
• Adapted model architecture and layer mapping to match Falcon-H1.
• Integrated Falcon-H1 tokenizer with automatic fallback if tokenizer files are missing.
• Added new test cases to verify Falcon-H1 model loading and inference.
• Cleaned up redundant code from previous Falcon integration attempts.

Notes
• The Falcon-H1 integration follows the same approach as other model families (see llama and Mamba support).
• This supersedes #14238 with a cleaner and more modular implementation.
• Refer to the Falcon-H1 repo for model weights and tokenizer files.

The github-actions bot added the python (python script changes) label on Jul 4, 2025
@jacekpoplawski

Is the old GGUF still valid, or does a new one need to be generated?

@ibrahimkhadraoui
Author

@jacekpoplawski
You need to re-convert with convert_hf_to_gguf.py; we made some changes to the data types.

@jacekpoplawski

> @jacekpoplawski
> You need to re-convert with convert_hf_to_gguf.py; we made some changes to the data types.

Could you upload a new GGUF to Hugging Face?

Comment on lines +6608 to +6609
def get_tensors(self) -> Iterator[tuple[str, Tensor]]:
for name, tensor in super().get_tensors():
Collaborator

If possible, modify_tensors should be overridden instead of get_tensors to rename and/or insert tensors.
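
For illustration, here is a minimal sketch of such an override, assuming the usual convert_hf_to_gguf.py context (Model subclasses, gguf, Tensor and Iterable already in scope there). The FalconH1Model class name, its Mamba2Model base, and the renamed prefix are hypothetical, not taken from this PR:

class FalconH1Model(Mamba2Model):
    model_arch = gguf.MODEL_ARCH.FALCON_H1

    def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
        # Rename (or insert) tensors here and let the base class perform the usual
        # HF-to-GGUF name mapping, instead of re-implementing get_tensors().
        if name.startswith("model.mamba."):  # hypothetical source prefix
            name = name.replace("model.mamba.", "model.", 1)
        return super().modify_tensors(data_torch, name, bid)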

Comment on lines +873 to +878
def add_attn_head_count(self, count: int) -> None:
self.add_uint32(Keys.Attention.HEAD_COUNT.format(arch=self.arch), count)

def add_key_value_head_count(self, count: int) -> None:
self.add_uint32(Keys.Attention.HEAD_COUNT_KV.format(arch=self.arch), count)

Collaborator

Don't these already exist?
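
For context, a sketch of how a converter class would reuse the existing writer helpers in its set_gguf_parameters method, assuming gguf-py's GGUFWriter already exposes add_head_count / add_head_count_kv and the HF config uses the usual num_attention_heads / num_key_value_heads names:

def set_gguf_parameters(self):
    super().set_gguf_parameters()
    # Reuse the existing attention head-count keys instead of adding new wrappers.
    self.gguf_writer.add_head_count(self.hparams["num_attention_heads"])
    self.gguf_writer.add_head_count_kv(self.hparams["num_key_value_heads"])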

@@ -172,6 +172,7 @@ class SSM:
TIME_STEP_RANK = "{arch}.ssm.time_step_rank"
GROUP_COUNT = "{arch}.ssm.group_count"
DT_B_C_RMS = "{arch}.ssm.dt_b_c_rms"
HEAD_DIM = "{arch}.ssm.head_dim"
Collaborator

The head dimension in Mamba-2 is also the time step rank.

I guess it could be clearer to use a more appropriate name like this, though.

I'm not against it; this is only to let you know.
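
For comparison, reusing the existing key on the converter side would just be a call to the existing setter (a sketch, assuming the gguf-py helper is named add_ssm_time_step_rank and head_dim holds the Mamba-2 head dimension):

# Writes "{arch}.ssm.time_step_rank", which for Mamba-2 doubles as the head dimension.
self.gguf_writer.add_ssm_time_step_rank(head_dim)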

Comment on lines -121 to +122
{ LLM_KV_FINAL_LOGIT_SOFTCAPPING, "%s.final_logit_softcapping" },
{ LLM_KV_FINAL_LOGIT_SOFTCAPPING, "%s.final_logit_softcapping" },
Collaborator

Spurious extra space?

Comment on lines +2223 to +2230
// try {
auto * ctx = new llama_context(*model, params);
return ctx;
// } catch (const std::exception & err) {
// LLAMA_LOG_ERROR("%s: failed to initialize the context: %s\n", __func__, err.what());
// }

// return nullptr;
Collaborator

Seems like debugging modifications made it into a commit

Comment on lines +221 to +228
{ LLM_KV_SSM_CONV_KERNEL, "%s.ssm.conv_kernel" },
{ LLM_KV_SSM_INNER_SIZE, "%s.ssm.inner_size" },
{ LLM_KV_SSM_STATE_SIZE, "%s.ssm.state_size" },
{ LLM_KV_SSM_TIME_STEP_RANK, "%s.ssm.time_step_rank" },
{ LLM_KV_SSM_DT_B_C_RMS, "%s.ssm.dt_b_c_rms" },
{ LLM_KV_SSM_GROUP_COUNT, "%s.ssm.group_count" },
{ LLM_KV_SSM_HEAD_DIM, "%s.ssm.head_dim" },
{ LLM_KV_MAMBA_D_SSM, "%s.ssm.mamba_d_ssm" },
Collaborator

I feel like these are redundant with the existing metadata used for Mamba.

Any reason to duplicate keys with the exact same meaning?

Note that the %s at the beginning will turn into the arch name anyway (e.g. falcon_h1).

Where it makes sense, try to re-use existing keys.

Member

Yes, the LLM_KV_FALCON_H1_ prefix should be avoided in the enum values - it's too specific. Reuse existing keys and only add new generic keys if needed.

Comment on lines +318 to +322
} else if (name.find("ssm_in.weight") != std::string::npos) {
// For mamba-based models it's better to not quantize the ssm-proj layers
if (ftype == LLAMA_FTYPE_MOSTLY_Q3_K_S) {
new_type = GGML_TYPE_BF16;
}
Collaborator

They can be quantized for pure Mamba models, although for hybrid models it may make sense to keep this tensor at a larger type in the quantization mixes.

Why only for Q3_K_S, though?

{ LLM_KV_FALCON_H1_KEY_MULTIPLIER, "%s.key_multiplier" },
{ LLM_KV_FALCON_H1_LM_HEAD_MULTIPLIER, "%s.lm_head_multiplier" },
{ LLM_KV_FALCON_H1_EMBEDDING_MULTIPLIER, "%s.embedding_multiplier" },
{ LLM_KV_FALCON_H1_MAMBA_CHUNK_SIZE, "%s.ssm.mamba_chunk_size" },
Collaborator

Is the mamba_chunk_size used anywhere? I thought it was meant to be a runtime parameter for the semi-structured matrices implementation of Mamba-2 (which is not yet implemented in llama.cpp).

Comment on lines 4955 to +4960
d_inner = self.find_hparam(["intermediate_size", "d_inner"], optional=True) or 2 * d_model
n_group = self.hparams.get("n_groups", 1)
architectures = self.hparams.get("architectures")
if architectures is not None and architectures[0] == "FalconH1ForCausalLM":
# FalconH1 has a different d_inner
d_inner = self.hparams.get("mamba_d_ssm")
Collaborator

Another (maybe simpler) approach would be to add "mamba_d_ssm" to the find_hparam call for d_inner above.
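
Concretely, that could look like the line below (a sketch against the quoted code above, assuming find_hparam tries the listed keys in order):

# Prefer FalconH1's mamba_d_ssm when present, otherwise fall back to the existing names.
d_inner = self.find_hparam(["mamba_d_ssm", "intermediate_size", "d_inner"], optional=True) or 2 * d_model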

Comment on lines 4913 to +4918
# TODO: does this really matter?
assert d_inner == 2 * d_model
assert d_inner % head_dim == 0
# skip the assertion for FalconH1 Model
architectures = self.hparams.get("architectures")
if architectures is None or architectures[0] != "FalconH1ForCausalLM":
assert d_inner == 2 * d_model
assert d_inner % head_dim == 0
Collaborator

That check isn't strictly necessary anyway, and probably should be removed altogether (and also in src/llama-model.cpp).

If you want to keep it for now, but not for Falcon-H1, does self.model_arch correspond to gguf.MODEL_ARCH.FALCON_H1 when it's that arch? (Might be simpler than reading hparams)
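
If the check is kept for now, gating it on the model arch could look like this (a sketch, assuming FALCON_H1 is the enum value this PR adds to gguf.MODEL_ARCH):

# Skip the shape assertions only for Falcon-H1, without reading hparams["architectures"].
if self.model_arch != gguf.MODEL_ARCH.FALCON_H1:
    assert d_inner == 2 * d_model
    assert d_inner % head_dim == 0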

@ibrahimkhadraoui
Author

Many thanks @compilade for the review; I'll address everything that was mentioned tomorrow.

Comment on lines +638 to +640
if (arch == LLM_ARCH_FALCON_H1) {
cur = ggml_scale(ctx0, cur, hparams.mlp_down_multiplier);
}
Member

Make this more generic:

Suggested change:
-    if (arch == LLM_ARCH_FALCON_H1) {
+    if (hparams.mlp_down_multiplier != 1.0f) {
         cur = ggml_scale(ctx0, cur, hparams.mlp_down_multiplier);
     }

Comment on lines +14742 to +14745
if (il == n_layer - 1) {
// skip computing output for unused tokens
ggml_tensor * inp_out_ids = build_inp_out_ids();
cur = ggml_get_rows(ctx0, cur, inp_out_ids);
Member

Move the creation of inp_out_ids at the front before the layer loop as in other models.

Labels: python (python script changes)

Successfully merging this pull request may close these issues:
• Feature Request: Falcon-H1

5 participants