Granite Four #13550
Draft: gabe-l-hart wants to merge 72 commits into ggml-org:master from gabe-l-hart:GraniteFour
* ggml : improve ggml_mul speed when masking recurrent states
* ggml : make the ggml_mul fast broadcast path more consistently formatted
The tokenizer.json of Mamba-Codestral-7B-v0.1 otherwise requires workarounds to work correctly.
The max index is 31, so trimming the arguments is necessary.
Whoops, this is needed for the offset in the concatenated output.
This was initially added because states were masked with ggml_mul, but this is no longer done and so this "optimisation" is no longer necessary, or at least not worth the additional code complexity.
This makes the weight buft detection in src/llama.cpp simpler.
* convert : transpose Mamba-2 A, D and reshape SSM_NORM
This breaks existing conversions of Mamba-2 models to avoid some reshapes. Not sure if it's a good idea, but it makes the graph slightly cleaner.
* llama : more appropriate SSM_SCAN and SSM_CONV buft support checks
And also fix multi-user inference for recurrent models by using cell_id instead of i as the kv cell index when populating s_copy.
This includes a slight architectural change where create_memory now only uses model architectures in the switch statement if their required cache type is not handled by llm_arch_is_[recurrent|hybrid]. Branch: HybridCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
The implementation of the hybrid cache intentionally does not specify the types of the child caches, so there was a naming mismatch with these predicate functions that used "hybrid" to imply "hybrid recurrent." Branch: HybridCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: HybridCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: HybridCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: HybridCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* origin/compilade/mamba2: (24 commits)
  kv-cache : allow context shift for recurrent models
  convert : avoid AutoConfig for Mamba and Mamba2 hparams
  kv-cache : remove const_cast when setting inputs for s_copy
  metal : single-user mamba2 inference works
  metal : add missing args for nb references in ssm_scan_f32_group
  metal : fix confusion between ; and ,
  convert : fix flake8 lint
  ggml : avoid multiply by D in GGML_OP_SSM_SCAN
  ggml : remove unused fast broadcast path in GGML_MUL
  metal : fix wrong number of tokens per sequence in SSM_SCAN
  metal : fix SSM_SCAN state head offset
  metal : add back n_seqs to SSM_SCAN args
  metal : remove unused arguments for SSM_SCAN
  metal : use log and exp instead of log1pf and expf in SSM_SCAN
  metal : fix SSM_SCAN pipeline scope
  metal : attempt to adapt SSM_SCAN for Mamba-2
  llama : avoid redundant state copy for Mamba 1 and 2
  convert_hf : prefer SentencePiece tokenizer for Mamba-2 when present
  llama : add missing break
  llama : remove unused variable
  ...
This is borrowed and adapted from the original implementation ggml-org#10810 Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
This is a manual copy from my draft branch https://github.com/gabe-l-hart/llama.cpp/blob/GraniteFourDraft/convert_hf_to_gguf.py#L5076 Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
…_t> for layer index arr in hparams Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
…d_attn_inp_kv_unified Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
This allows other architectures like bamba and granitemoehybrid to use mamba2 without a growing architecture `if` statement inside the mamba implementation. Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
… methods This will allow these layer-builder methods to be used from other build structs without complex inheritance. Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Also no need to pass in kv cache since it's already in the inp_attn Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
It generates (garbage) tokens! Still lots of debugging to do. Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
…n hybrid Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
This is helpful for hybrid models that want to do gguf param setting by calling multiple parent classes without needing to make those parent classes try/except on every attempt to set a gguf value. Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
This re-uses the Bamba code paths heavily and simply adds the missing parts for loading MoE and the shared expert. Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
…base Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Labels
* Apple Metal: https://en.wikipedia.org/wiki/Metal_(API)
* ggml: changes relating to the ggml tensor library for machine learning
* python: python script changes
* testing: Everything test related
Description
This PR is the end-point for architecture support for Granite 4.0 (#13269). It incorporates a number of changes from other in-flight branches that will need to be merged first:
Additionally, this PR replaces some work done on other PRs / branches:
* `Bamba` support: Bamba architecture #10810
* `Bamba` support: https://github.com/gabe-l-hart/llama.cpp/tree/BambaArchitectureRefactor
* `Granite 4.0` support: https://github.com/gabe-l-hart/llama.cpp/tree/GraniteFourDraft
* [...] `Bamba` work, this will also be abandoned in favor of this PR
* `Jamba`: llama : support Jamba hybrid Transformer-Mamba models #7531
* [...] `master`.
* [...] `Jamba` support in this branch, but on further inspection, it looks like the `Jamba` architecture has some additional bells-and-whistles (eg sliding-window-attention) that would need further work, so my plan is to leave `Jamba` off for now and possibly tackle it later (hopefully it's much easier than the original branch!)

Outstanding Questions
Besides the upstream PRs, there are a few questions to answer before this PR is merge ready:
* [...] `llama-kv-cache` beyond those in feat: Hybrid unified/recurrent cache #13276, but they depend on the addition of `hparams.recurrent_layer_arr`, which is only populated correctly if there is a valid model architecture to check against. Should I move all of these changes to the hybrid cache PR or keep them here where the model architectures become real?
* [...] `hparams.recurrent_layer_arr`? Using a max-layer-size `std::array` doesn't feel quite right.
* [...] `Bamba` and `granite-4.0-tiny-shared-preview` on this branch vs the respective draft branches, so I need to determine if this is due to changes in the attention implementation (ie "working as expected") or a bug somewhere.
* [...] `dynamic_cast` to get the right cache type could be expensive (though it's likely negligible relative to the tensor math). Should we do something more clever to handle different cache types in `llama-graph`?
* [...] `switch` statement for determining the type of KV cache to allocate in `llama-model.cpp` seems redundant with `llama_model_is_recurrent` and `llama_model_is_hybrid`. Should we use those functions instead and eliminate the duplicate logic and additional place to tweak for new recurrent / hybrid models?

Testing
To test out this branch, I've been using the following models:
* `granite-4.0-tiny-preview`: https://huggingface.co/ibm-granite/granite-4.0-tiny-preview
* `Bamba-9B-v1`: https://huggingface.co/ibm-ai-platform/Bamba-9B-v1
* `mamba2-370m-hf`: https://huggingface.co/AntonV/mamba2-370m-hf

Details
This PR has a lot of changes in it, some of which are isolated in the prereq-PRs above. In addition to the general `mamba2` and `llama_kv_cache_hybrid` changes, this PR does the following:

python side
* [...] `BambaForCausalLM` and `GraniteMoeHybridForCausalLM`
* [...] `gguf_writer.py` that allows duplicate key/value pairs through `add_key_value` if (and only if) they match both value and type with the existing key. This is a convenience for hybrid models so that the converter doesn't need to rewrite the hparam conversion from multiple parents.
* [...] `HybridAttention` section under `Keys` in `constants.py` to hold `attention.layer_indices`. OPEN QUESTION: Should this just go under `Attention`?

c++ side
* [...] `llama_model_is_hybrid` akin to `llama_model_is_recurrent`
* [...] `llama_model_is_recurrent` into `llm_arch_is_*` implemented in `llama-arch.*` and `llama_model_is_*` implemented in `llama-model.*`. This was done so that they could be used during model initialization before the model itself can be passed as the argument, specifically to determine how to populate `hparams.recurrent_layer_arr` (see below).
* [...] `hparams.recurrent_layer_arr` and support parsing it
* [...] `hparams.n_embd_k_s` / `hparams.n_embd_v_s` [...] `0`. This should be fine since none of those places interact with the hybrid caching.
* [...] `hparams.recurrent_layer(uint32_t)` to check whether a given layer is recurrent
* [...] `bamba` and `granitemoeshared` in `llama-arch.*` (the boring part!)
* [...] `hparams` as an additional argument to the `llama_model.create_memory` method
* [...] `llama-graph`, anywhere that a specific cache type needs to be fetched, it is grabbed using new methods `get_recurrent_cache` / `get_unified_cache`. These methods use `dynamic_cast` to handle both non-hybrid caches and hybrid caches.
* [...] `llama-model.cpp`
* [...] `bamba` and `granitemoehybrid` in `llama-model`
* [...] `build_mamba_layer` / `build_mamba2_layer` from `llm_build_mamba` and `build_attention_layer` / `build_layer_ffn` from `llm_build_granite` into `static` methods on their respective classes. This makes for some gross function signatures where member data needs to be explicitly passed, but it allows the hybrid model architecture(s) to use these methods without complex inheritance.
intostatic
methods on their respective classes. This makes for some gross function signatures where member data needs to be explicitly passed, but it allows the hybrid model architecture(s) to use these methods without complex inheritance.