Hybrid recurrent cache #13979
base: master
Conversation
src/llama-graph.cpp
Outdated
const llama_kv_cache_unified_state * llm_graph_context::get_state_unified() const {
    const auto * umstate = dynamic_cast<const llama_kv_cache_unified_state *>(mstate);
    if (!umstate) {
        const auto hmstate = dynamic_cast<const llama_kv_cache_hybrid_recurrent_state *>(mstate);
        if (hmstate) {
            umstate = hmstate->get_state_attn();
        }
    }
    GGML_ASSERT(umstate);
    return umstate;
}

const llama_kv_cache_recurrent_state * llm_graph_context::get_state_recurrent() const {
    const auto * rmstate = dynamic_cast<const llama_kv_cache_recurrent_state *>(mstate);
    if (!rmstate) {
        const auto hmstate = dynamic_cast<const llama_kv_cache_hybrid_recurrent_state *>(mstate);
        if (hmstate) {
            rmstate = hmstate->get_state_recurrent();
        }
    }
    GGML_ASSERT(rmstate);
    return rmstate;
}
These dynamic casts should not be necessary. Instead you need a new llm_graph_context::build_attn_inp_kv_hybrid_recurrent() method, similar to llm_graph_context::build_attn_inp_kv_unified_iswa().
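For reference, such a builder might look roughly like the following. This is only a sketch mirroring the existing build_attn_inp_kv_unified() code: the method name, the get_state_attn() accessor, and the member names are assumptions taken from this thread, not the final implementation.

// sketch only: assumed names, modeled on build_attn_inp_kv_unified()
llm_graph_input_attn_kv_unified * llm_graph_context::build_attn_inp_kv_hybrid_recurrent() const {
    // cast the hybrid memory state once, here, instead of inside generic getters
    const auto * mstate_hyb = static_cast<const llama_kv_cache_hybrid_recurrent_state *>(mstate);
    const auto * state_attn = mstate_hyb->get_state_attn();

    auto inp = std::make_unique<llm_graph_input_attn_kv_unified>(hparams, cparams, state_attn);

    const auto n_kv = state_attn->get_n_kv();

    // same mask setup as the unified builder
    inp->self_kq_mask = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, n_kv, GGML_PAD(n_tokens, GGML_KQ_MASK_PAD));
    ggml_set_input(inp->self_kq_mask);
    inp->self_kq_mask_cnv = cparams.flash_attn ? ggml_cast(ctx0, inp->self_kq_mask, GGML_TYPE_F16) : inp->self_kq_mask;

    return (llm_graph_input_attn_kv_unified *) res->add_input(std::move(inp));
}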
I'm working through this now and a couple of questions are coming up:

1. Would it be best to combine build_inp_s_copy with build_attn_inp_kv for hybrid so that models call just one "build inputs" function, or keep them separate for simplicity?
2. For the build_attn methods, each has a corresponding llm_graph_input_attn_* class. The build_inp_s_* methods don't have this pattern, which would make it a bit harder to reuse code. Are there plans to refactor that further @compilade?
3. In the mamba2 branch, s_mask seems to be totally removed. I'd prefer not to do all of the boilerplate of duplicating build_inp_s_mask for the hybrid recurrent case if that's definitely going away. Is there any reason it might stick around past the merge of mamba2?
Answering out of order, but it should still make sense:

2. For the build_attn methods, each has a corresponding llm_graph_input_attn_* class. The build_inp_s_* methods don't have this pattern

They do follow this pattern, see line 190 in 7e00e60 (on the current master):

class llm_graph_input_s_copy : public llm_graph_input_i {

I think you might mean that the build_attn_* methods also return instances of llm_graph_input_attn_*? That seems to be directly related to llm_graph_context::build_attn(), which has multiple implementations that differ by the type of the first argument (e.g. llm_graph_input_attn_kv_unified, llm_graph_input_attn_no_cache, etc.).

Are there plans to refactor that further @compilade?

Not really, outside of removing s_mask (and related functions and classes) as part of #13834.
1. Would it be best to combine build_inp_s_copy with build_attn_inp_kv for hybrid so that models call just one "build inputs" function, or keep them separate for simplicity?

Personally, I think it would be simpler to keep them separate, because they are fundamentally different (one is intended to be used by build_copy_mask_state (renamed to build_recurrent_state in #13834), while the other is used by build_attn), and they are pretty much independent, even in hybrid models (at least for Jamba, the recurrent and self-attention layers are mostly independent on that front). I don't see how build_attn would ever need s_copy.

build_inp_s_copy and build_inp_attn_kv_* are called once at the beginning of the graph, while build_attn and build_recurrent_state are called once per layer (where applicable, and so usually different layers for both).
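To make the once-per-graph versus once-per-layer distinction concrete, here is a small self-contained sketch; the helper names are placeholders standing in for the real llm_graph_context builders, not llama.cpp API.

#include <vector>

// placeholder stand-ins for the real builders; shows call frequency only
struct graph_build_order_sketch {
    void build(int n_layer, const std::vector<bool> & is_recurrent_layer) {
        void * inp_attn = build_attn_inp(); // built once, at the start of the graph
        void * inp_rs   = build_rs_inp();   // built once, at the start of the graph

        for (int il = 0; il < n_layer; ++il) {
            if (is_recurrent_layer[il]) {
                build_recurrent_state(inp_rs, il); // once per recurrent layer
            } else {
                build_attn(inp_attn, il);          // once per attention layer
            }
        }
    }

    // stubs standing in for the real llm_graph_context methods
    void * build_attn_inp()                   { return this; }
    void * build_rs_inp()                     { return this; }
    void   build_recurrent_state(void *, int) {}
    void   build_attn(void *, int)            {}
};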
3. Is there any reason [s_mask] might stick around past the merge of mamba2?

No reason to keep it; s_mask will be removed. Its functionality is redundant with s_copy, and it otherwise prevents minimizing unnecessary state copies. It was used to clear the states, but the same can be done through inp_s_copy by copying a zero-ed state (which is the rs_z'th state in the mamba2 branch and in #13834).
Thanks, that's super helpful! I was missing the distinction between build_attn_inp and build_attn, which makes perfect sense.

"Personally, I think it would be simpler to keep them separate"

I agree on my personal gut feeling, so I'll go with this and see how it feels once complete.
Ok, I think this feels a lot cleaner now. For build_inp_s_copy, I opted to add an optional parameter so that the caller can take ownership of casting the cache state rather than duplicating the function into build_inp_s_copy_hybrid. That felt a little cleaner w.r.t. code reuse, but I'm happy to do a separate method if that's preferred.
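Something along these lines, where the optional parameter defaults to the current cast of mstate; the exact signature, parameter name, and body are assumptions for illustration, not the actual patch.

// sketch only: optional recurrent-state argument with the old cast as the fallback
ggml_tensor * llm_graph_context::build_inp_s_copy(
        const llama_kv_cache_recurrent_state * kv_state /* = nullptr */) const {
    if (kv_state == nullptr) {
        // previous behavior: assume the graph's memory state is the recurrent cache state
        kv_state = static_cast<const llama_kv_cache_recurrent_state *>(mstate);
    }

    auto inp = std::make_unique<llm_graph_input_s_copy>(kv_state);

    const auto n_kv = kv_state->get_n_kv();

    ggml_tensor * cur = ggml_new_tensor_1d(ctx0, GGML_TYPE_I32, n_kv);
    ggml_set_input(cur);
    inp->s_copy = cur;

    res->add_input(std::move(inp));

    return cur;
}

A hybrid caller would then pass hmstate->get_state_recurrent() explicitly instead of relying on the default cast.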
It looks like there's one more place that will need changing in build_copy_mask_state (renamed to build_recurrent_state on mamba2). Similar to build_inp_s_copy, I think the cleanest way to do this for code reuse is to add an optional parameter that, if unset, will use the current logic of casting mstate.
// TODO: will the recurrent cache be in an undefined state at this point?
LLAMA_LOG_ERROR("%s: failed to prepare recurrent ubatches\n", __func__);
Yes, but that will be fixed in #13834
(Noting here in case this gets merged first so that I don't forget to update the comment)
Should I strip out this TODO at this point?
Force-pushed from ab918bb to 60ca3ba
@ggerganov I've noticed that the
This is an attempt to handle race conditions between /health returning OK and the other endpoints not returning successfully. ggml-org#13979 (comment) Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
I've tried adding retry logic for all requests in 39a93b3 to work around the race between
The changes to the server tests should not be needed. Let's revert the commit for now and I'll investigate.
Force-pushed from 39a93b3 to 60ca3ba
@ggerganov Thanks, it looks like those changes didn't fix the failures anyway, so definitely not the right fix. I've reset them out and can open an issue with details of what I see locally.
Issue for follow up on
Force-pushed from 7958d84 to 3669876
I've rebased on #13834. Drafting for now until it's merged.
Force-pushed from 3669876 to 8c59841
That was quick! Undrafting now that #13834 is merged.
Also, split llama_model_is_recurrent into llm_arch_is_recurrent in llama-arch, with llama_model_is_recurrent delegating to llm_arch_is_recurrent. The same split is done for hybrid. This is needed because there are places where the llama_model has not yet been initialized but we need to check if the model is recurrent (specifically for the per-layer recurrent check array in hparams). Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
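A minimal sketch of that split; the architecture list shown is an illustrative subset only, not the real one.

// llama-arch: pure arch-level predicate, usable before llama_model is initialized
bool llm_arch_is_recurrent(const llm_arch arch) {
    switch (arch) {
        case LLM_ARCH_MAMBA:
        case LLM_ARCH_RWKV6: // illustrative subset only
            return true;
        default:
            return false;
    }
}

// llama-model: public API delegates to the arch-level check
bool llama_model_is_recurrent(const llama_model * model) {
    return llm_arch_is_recurrent(model->arch);
}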
Branch: GraniteFour
…s in hparams Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
…l is recurrent Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
The implementation of the hybrid cache intentionally does not specify the types of the child caches, so there was a naming mismatch with these predicate functions that used "hybrid" to imply "hybrid recurrent." Branch: HybridCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: HybridCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
This follows the pattern in iswa where the two child caches are held explicitly to support the case where a model requires a single attention cache and a single recurrent cache where each layer uses exactly one of the caches. This is a rewrite of the more generic approach in the original hybrid cache PR: ggml-org#13276 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
This includes a refactor of the create_memory logic to avoid needing to use the arch enum explicitly unless a model needs explicit cache instantiation logic beyond the standard logic for recurrent, hybrid, unified, and iswa. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
NOTE: I intentionally did not add support for s_mask since it will be going away soon Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
…he interface Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
…empt Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
No longer needed now that unified isn't also supporting recurrent ggml-org#13979 (comment) Branch: HybridRecurrentCache
Now that it's not used at all in the unified cache, we don't need to use the layer index to zero it out for attention layers. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
This is no longer needed now that there are separate implementations ggml-org#13979 (comment) Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
This should help support architectures like Falcon H1 where there is overlap between layers that need attention and recurrent caches. ggml-org#13979 (comment) Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
ggml-org#13979 (comment) Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Force-pushed from 1a7e23d to b3c948a
…y state Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
…ntion pattern https://github.com/ggml-org/llama.cpp/pull/13979/files#r2141701738 This is a big overhaul to bring consistency between how inputs and per-layer components are created for attention layers and recurrent layers. The main changes are:
- Rename class llm_graph_input_s_copy -> llm_graph_input_rs
- Add a corresponding llm_graph_input_rs_hybrid_recurrent
- Rename build_inp_s_copy -> build_rs_inp_recurrent
- Add a corresponding build_rs_inp_hybrid_recurrent
- Rename build_recurrent_state -> build_rs to match build_attn w/ llm_graph_input_rs * as the first input
- Add a corresponding overload of build_rs w/ llm_graph_input_rs_hybrid_recurrent * as the first input
- Add a llm_graph_input_attn_kv_hybrid_recurrent analogous to llm_graph_input_attn_kv_unified
- Add a build_attn override that takes llm_graph_input_attn_kv_hybrid_recurrent * as the first input
This makes the two paradigms fully consistent. The main drawback is the code duplication in the build_attn and build_rs implementations where the only difference between implementations is how they cast the memory state. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
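Paraphrasing the reworked surface area as declarations; the parameter lists are abbreviated approximations based on the commit message above, not the exact headers.

// sketch: approximate declarations, parameter lists abbreviated
class llm_graph_input_rs;                       // was llm_graph_input_s_copy
class llm_graph_input_rs_hybrid_recurrent;      // hybrid counterpart
class llm_graph_input_attn_kv_hybrid_recurrent; // analogous to llm_graph_input_attn_kv_unified

struct ggml_tensor;

struct llm_graph_context_sketch {
    // input builders, called once per graph
    llm_graph_input_rs *                       build_rs_inp_recurrent();
    llm_graph_input_rs_hybrid_recurrent *      build_rs_inp_hybrid_recurrent();
    llm_graph_input_attn_kv_hybrid_recurrent * build_attn_inp_kv_hybrid_recurrent();

    // per-layer builders; the input object's type selects how the memory state is cast
    ggml_tensor * build_rs  (llm_graph_input_rs * inp, int il);
    ggml_tensor * build_rs  (llm_graph_input_rs_hybrid_recurrent * inp, int il);
    ggml_tensor * build_attn(llm_graph_input_attn_kv_hybrid_recurrent * inp, int il);
};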
        n_seq_max
    )) {}

llama_memory_state_ptr llama_kv_cache_hybrid_recurrent::init_batch(const llama_batch & batch, uint32_t n_ubatch, bool embd_pooled) {
I think I've tracked the mysterious break in the Granite 4 rebase chain to something in this method. What I've found is that for the attention layers, the kq_mask is different between my (old) working branch and my rebased broken one. That, in turn, led me to llama_kv_cache_unified::set_input_kq_mask, which was last modified in #14130; that PR also had some changes in init_batch. I'm still fumbling around understanding how all the batching works, but it feels like something I'm doing here is probably incorrect for how the unified cache expects its ubatches to show up.

@compilade @ggerganov any chance anything jumps out at you? I think it might be related to #13746 (comment) from several PRs ago?
I also see #14119 touching the same areas, so it's also possibly related
I've tried running a non-hybrid model with the unified cache and attempted to refactor the unified cache to use split equal. The model still works, so I don't think it's a split_simple vs split_equal problem.
I've debugged the bad output to this line. The previous op (here) has matching inputs before/after the breakage, but the output of the ggml_soft_max_ext op is different.

I've stepped through the op in CPU mode and I'm pretty confident I've verified that the tensor data in kq_mask is different. In the not-broken run, the only cells with 0 instead of -Inf are in the first position of each row, whereas in the broken run there are 0s in other positions (e.g. both 4096 and 4097, whereas 4097 is -Inf in the not-broken run).

This leaves me pretty sure that something changed in how the kq_mask is initialized. The "broken" version is https://github.com/gabe-l-hart/llama.cpp/blob/broken/GraniteFour/src/llama-graph.cpp (based off of ed52f36 on master), while the "working" version is https://github.com/gabe-l-hart/llama.cpp/blob/GraniteFour/src/llama-graph.cpp (based off of f470bc3 on master).
bool llama_kv_cache_hybrid_recurrent_state::apply() {
    assert(status == LLAMA_MEMORY_STATUS_SUCCESS);

    kv->get_kv_attn() ->apply_ubatch(heads_attn[i_next], ubatches[i_next]);
    kv->get_kv_recurrent()->find_slot(ubatches[i_next]);

    return true;
}
@gabe-l-hart There might be a problem here regarding the attention state: n_kv and head are not really set here (while in llama_kv_cache_unified_state::apply() they normally are, but that method isn't called here).

But it's not as simple as calling the apply() methods of the child states here, because of how they are initialized by the hybrid cache (minimally, which makes the update() step of the unified cache not work). Maybe that initialization needs to be revisited.
That's correct. At least the llama_kv_cache_unified_state has to be initialized in a similar way to how it is done in llama_kv_cache_unified_iswa_state:

llama.cpp/src/llama-kv-cache-unified-iswa.cpp, lines 216 to 232 in fb85a28:
llama_kv_cache_unified_iswa_state::llama_kv_cache_unified_iswa_state(
        llama_kv_cache_unified_iswa * kv,
        llama_sbatch sbatch,
        std::vector<uint32_t> heads_base,
        std::vector<uint32_t> heads_swa,
        std::vector<llama_ubatch> ubatches)
    : status(LLAMA_MEMORY_STATUS_SUCCESS),
      sbatch(std::move(sbatch)),
      ubatches(std::move(ubatches)) {
    // note: here we copy the ubatches. not sure if this is ideal
    state_base.reset(new llama_kv_cache_unified_state(kv->get_base(), {}, std::move(heads_base), this->ubatches));
    state_swa .reset(new llama_kv_cache_unified_state(kv->get_swa (), {}, std::move(heads_swa), this->ubatches));

    status = llama_memory_status_combine(state_base->get_status(), state_swa->get_status());
}
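Following that pattern, the hybrid state constructor could fully construct both child states up front. The sketch below is only a possible shape based on this thread: the member names and, in particular, the child-state constructor signatures are assumptions, not the final code.

// sketch only: member names and child-state constructors are assumed
llama_kv_cache_hybrid_recurrent_state::llama_kv_cache_hybrid_recurrent_state(
        llama_kv_cache_hybrid_recurrent * kv,
        llama_sbatch sbatch,
        std::vector<uint32_t> heads_attn,
        std::vector<llama_ubatch> ubatches)
    : status(LLAMA_MEMORY_STATUS_SUCCESS),
      sbatch(std::move(sbatch)),
      ubatches(std::move(ubatches)) {
    // construct the child states fully so that n_kv/head are set and update() works,
    // instead of only calling apply_ubatch()/find_slot() from apply()
    state_attn     .reset(new llama_kv_cache_unified_state  (kv->get_kv_attn(), {}, std::move(heads_attn), this->ubatches));
    state_recurrent.reset(new llama_kv_cache_recurrent_state(LLAMA_MEMORY_STATUS_SUCCESS, kv->get_kv_recurrent(), {}, this->ubatches));

    status = llama_memory_status_combine(state_attn->get_status(), state_recurrent->get_status());
}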
This is a re-opened version of #13904 after #13746 was merged.

Description

This PR introduces the llama_kv_cache_hybrid_recurrent cache implementation. It follows the pattern of llama_kv_cache_unified_iswa by holding two child cache instances and implementing the interface logic such that it manages both correctly for the appropriate layers.

Changes

The main change in this PR is the addition of llama_kv_cache_hybrid_recurrent in llama-kv-cache-hybrid-recurrent.*. In addition to this, the PR does the following:

- Adds the llama_model_is_hybrid_recurrent public API (akin to llama_model_is_recurrent)
- Adds LLM_KV_ATTENTION_LAYER_INDICES as an hparam to hold the indices of the layers that should use attention (versus recurrent). This is similar to iswa, but that mechanism also isn't particularly extensible. It might be more appropriate to have a generic mechanism for indicating the type of caching to use for each layer, but that would start to approach the generic hybrid implementation that I originally attempted, which ended up being too abstract (feat: Hybrid unified/recurrent cache #13276).
- Updates the methods in llm_graph_context that need a specific type of cache to use getters (get_state_unified / get_state_recurrent) that will properly handle llama_kv_cache_hybrid_recurrent
- Makes n_embd_k_s / n_embd_v_s layer-dependent and uses layer indices when calling them in the existing cache implementations (llama_kv_cache_recurrent)
- Updates llama_model::create_memory to use llm_arch_is_recurrent and llm_arch_is_hybrid_recurrent rather than relying on adding models to the switch statement, which was redundant with the implementation of these functions
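As an illustration of the last point, the dispatch amounts to selecting the cache type from the arch-level predicates instead of a per-model switch. The sketch below is heavily simplified: the real create_memory also handles SWA, constructor arguments, and model-specific cases.

// sketch only: which cache type create_memory would instantiate for a given arch
enum class cache_kind { recurrent, hybrid_recurrent, unified };

cache_kind select_cache_kind(llm_arch arch) {
    if (llm_arch_is_recurrent(arch)) {
        return cache_kind::recurrent;        // llama_kv_cache_recurrent
    }
    if (llm_arch_is_hybrid_recurrent(arch)) {
        return cache_kind::hybrid_recurrent; // llama_kv_cache_hybrid_recurrent
    }
    // default: unified (or iswa when sliding-window attention is enabled)
    return cache_kind::unified;              // llama_kv_cache_unified / _iswa
}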