llama : refactor sampling v2 #9294
Conversation
Not sure it's the right place or time to talk about this, but in another issue someone had the idea that "if the grammar says the character/word can only be "xxx" and nothing else, don't bother asking the LLM what to say for the X next tokens". As there is a refactoring going on, maybe it's the right time to implement it.
It's not in the scope of this change, but also the grammar never fits only one token. For example, all three tokens "x", "xx" and "xxx" would fit the grammar in that case. One way would be to use the longest token. Another way would be to tokenize "xxx" and use the resulting tokens. Not sure.
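To make the idea concrete, here is a rough sketch with hypothetical helpers (none of these functions exist in llama.cpp): when the grammar admits exactly one continuation string, tokenize that string and emit the resulting tokens without running the model.

```cpp
#include <iostream>
#include <string>
#include <vector>

// Everything below is a hypothetical stand-in, only to illustrate the idea.
using token = int;

// Pretend grammar state: if the grammar forces exactly one continuation,
// return it; otherwise return an empty string.
std::string grammar_forced_continuation() {
    return "xxx"; // e.g. the grammar only accepts "xxx" at this position
}

// Stand-in tokenizer: one token per character, just for the demo.
std::vector<token> tokenize(const std::string & text) {
    return std::vector<token>(text.begin(), text.end());
}

// Stand-in for the normal (expensive) model + sampler path.
token sample_with_model() {
    return 0;
}

std::vector<token> next_tokens() {
    const std::string forced = grammar_forced_continuation();
    if (!forced.empty()) {
        // The grammar allows exactly one string here, so skip the model and
        // emit the tokenization of that string directly. Tokenizing the whole
        // forced string also sidesteps the "x" vs "xx" vs "xxx" ambiguity
        // mentioned above.
        return tokenize(forced);
    }
    return { sample_with_model() };
}

int main() {
    for (token t : next_tokens()) {
        std::cout << t << ' ';
    }
    std::cout << '\n';
    return 0;
}
```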
This is getting close to ready. Later today I will add a detailed description of the changes and some comments in the code, and do a bit more testing. @slaren PTAL - any comments and suggestions are welcome.
include/llama.h
Outdated
LLAMA_API struct llama_constraint * llama_constraint_init_top_k (int32_t k, int32_t min_keep);
LLAMA_API struct llama_constraint * llama_constraint_init_top_p (float p, int32_t min_keep);
LLAMA_API struct llama_constraint * llama_constraint_init_min_p (float p, int32_t min_keep);
LLAMA_API struct llama_constraint * llama_constraint_init_tail_free (float z, int32_t min_keep);
LLAMA_API struct llama_constraint * llama_constraint_init_typical (float p, int32_t min_keep);
I don't know the history of the `min_keep` parameter in all of these samplers. From what I can tell the parameter is not used in the examples except by the server, but it seems very suspect to me.
Edit: looks like it has been there since the beginning (#1126), and there was never any explanation of why it is needed.
Removed `min_keep` from the `top_k` sampler as it didn't make sense. For the `p`-based samplers, I think it makes sense to guarantee a minimum number of candidate results, regardless of the `p` value.
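For context, a simplified sketch of what the `min_keep` guarantee means in a p-based sampler such as top-p (illustrative code, not the llama.cpp implementation): candidates are sorted by probability and cut at cumulative probability `p`, but never below `min_keep` entries.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct candidate {
    int   id;
    float p; // probability, assumed already normalized
};

// Keep the smallest prefix whose cumulative probability reaches top_p,
// but never fewer than min_keep candidates.
void top_p_filter(std::vector<candidate> & cands, float top_p, size_t min_keep) {
    std::sort(cands.begin(), cands.end(),
              [](const candidate & a, const candidate & b) { return a.p > b.p; });

    float  cum  = 0.0f;
    size_t keep = cands.size();
    for (size_t i = 0; i < cands.size(); ++i) {
        cum += cands[i].p;
        if (cum >= top_p && i + 1 >= min_keep) {
            keep = i + 1;
            break;
        }
    }
    cands.resize(keep);
}

int main() {
    std::vector<candidate> cands = {{0, 0.70f}, {1, 0.20f}, {2, 0.06f}, {3, 0.04f}};
    // without min_keep this cut would leave a single candidate; min_keep = 2 keeps two
    top_p_filter(cands, 0.5f, 2);
    return 0;
}
```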
Thanks for the review. I got sidetracked a bit with a bug in the …
src/llama-grammar.h
Outdated
// be positioned at a character range (see `llama_grammar_advance_stack`), and
// produces the N possible stacks if the given char is accepted at those
// positions
llama_grammar_stacks llama_grammar_accept(
Hello! Why does `llama_grammar_accept` return the stacks? It was previously passed by reference.
Thanks for noticing. I changed it because I thought it improved the signature of the function, but I missed that it would lead to extra memory allocations. So I restored the original signature: f9762c6
Edit: though after one more look, I think it does not matter since we move `stacks_new` either way.
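As a side note, a generic C++ illustration of the trade-off discussed here (not the actual `llama_grammar_accept` code): filling an output parameter versus returning by value, where the returned container is moved, so the difference is mostly about whether the caller can reuse existing storage.

```cpp
#include <vector>

using stacks = std::vector<std::vector<int>>;

// Style A: output parameter — the caller's container can be reused across
// calls, so the signature itself does not force a fresh allocation.
void accept_out(const stacks & cur, stacks & out) {
    out.clear();
    for (const auto & s : cur) {
        out.push_back(s); // keep the stacks that accept the character
    }
}

// Style B: return by value — cleaner signature; the result is moved (or the
// move is elided), so no extra copy of the outer vector is made either way.
stacks accept_ret(const stacks & cur) {
    stacks out;
    for (const auto & s : cur) {
        out.push_back(s);
    }
    return out; // NRVO / move
}

int main() {
    stacks cur = {{1, 2}, {3}};
    stacks next;
    accept_out(cur, next);  // style A: reuses `next`'s storage on repeated calls
    next = accept_ret(cur); // style B: move-assigned into `next`
    return 0;
}
```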
Looks good.
In the future we should probably simplify `llama_token_data` (or remove it entirely) to keep only one value per token, and add a flag to `llama_token_data_array` to indicate whether the current values are probabilities (i.e. normalized to sum to 1) or not, so that samplers that can only operate with probabilities know whether they need to call softmax. Having two values per token is very confusing to me, because some samplers operate on one or the other, and this can lead to situations where a sampler modifies the probabilities and the next one calls softmax, which discards all the changes to the probabilities and computes them from the logits again. I cannot tell if there are already situations like that which end with some samplers that operate on probabilities being ignored.
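For illustration, a hypothetical sketch of the simplification described above (the names are made up, not the actual llama.cpp types): one value per token plus an `is_probs` flag on the array, so a probability-only sampler can tell whether softmax has already been applied.

```cpp
#include <cstddef>
#include <cstdint>

typedef int32_t llama_token_t;

// One value per token; its meaning is determined by the array's flag.
struct token_data {
    llama_token_t id;
    float         value; // logit or probability, depending on the array flag
};

struct token_data_array {
    token_data * data;
    size_t       size;
    bool         is_probs; // true if `value` holds probabilities that sum to 1
};

// A sampler that needs probabilities would check the flag instead of
// unconditionally re-running softmax (which would discard any changes a
// previous sampler made to the probabilities):
//
//   if (!arr.is_probs) { softmax(arr); arr.is_probs = true; }

int main() { return 0; }
```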
Is there a way to provide …
Somehow after this change I see breakage in llava sampling - I'm still OOO so I didn't deep dive yet; see mudler/LocalAI#3497 for reference on how it breaks in LocalAI. It would be really appreciated if anyone has an idea/pointers about what's going on! Thank you!
Which commit are you using? I think you need to update to 1b28061
mmh, tried with the latest commit (e6b7801) but still crashing with:
That's not related to the sampling changes; the only difference is that get_rows operations are bounds-checked in all builds now, while previously they were only checked in debug builds. The clip implementation is broken and needs to be fixed: #9066 (comment)
Thanks for that bit, totally missed it. What's weird is that now for me it's a 100% hit since I started pinning a new version of llama.cpp - as I have test suites running vision tests, this never popped up until now. It's not sporadic at all, but really consistent, and I can't get a single run passing. Commit still working here: 815b1fb
You would need to run the test suite with a debug build to be able to hit the assert. If you want the previous behavior you can revert #9354 in your build, but that still does not make it any less broken, it just hides the issue.
Thanks for the hints @slaren - I actually tried commenting out the assert as well to double check, but as you suggested that just "hides" it, and it crashes in the same way. Another datapoint from my side: the suggestion in the comment, to edit …, actually "fixes" the issue here - probably it's not ideal, but at least it seems something is indeed off in the clip implementation when loading images. Sorry for making noise here - is there already an issue open for this, or shall I open one? I can't find one for this specific issue.
I don't think there is an open issue about this currently. I know that it was briefly discussed in #9066, but that one is already closed.
Maybe worth noticing: the signature of

    llama_token llama_sampling_sample(
            struct llama_sampling_context * ctx_sampling,
            struct llama_context * ctx_main,
            struct llama_context * ctx_cfg,
            int idx = -1);

is now

    llama_token gpt_sampler_sample(
            struct gpt_sampler * gsmpl,
            struct llama_context * ctx,
            int idx,
            bool grammar_first = false);

The compiler may not throw any error or warning because values set to …
Hm, it will generate an error. I don't think there is an issue.
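To illustrate the reply above, a minimal sketch (shortened names, not the real llama.cpp declarations): because the replacement function takes `idx` without a default, a call written against the old argument list no longer compiles instead of silently misbehaving.

```cpp
// Hypothetical stand-ins for the two APIs being compared.
struct sampler;
struct context;

// old API (for reference): idx had a default, so callers could omit it
// int old_sample(sampler * s, context * main_ctx, context * cfg_ctx, int idx = -1);

// new API: idx is required; only grammar_first has a default
int new_sample(sampler * s, context * c, int idx, bool grammar_first = false);

int new_sample(sampler *, context *, int, bool) { return 0; }

int main() {
    sampler * s = nullptr;
    context * c = nullptr;
    // new_sample(s, c);      // error: too few arguments — the compiler catches it
    return new_sample(s, c, -1); // callers must now pass idx explicitly
}
```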
@ggerganov Hi, I'm a bit confused. Why was Classifier-Free Guidance removed? Are there any issues with it?
The implementation does not need to be part of …
The changes here reflect the changes made in the big llama.cpp sampling PR ggerganov/llama.cpp#9294. The sampling functionality is now broken into the base interface (llama_sampler) and the generation implementation (gpt_sampler). The changes here reflect that. Since the sampling.h/sampling.cpp code uses C++ STL headers, the sampling_ext.[h|cpp] wrapper is maintained to allow Go to access a pure-C interface.
Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(ext_server): Port llama.cpp sampling refactors to ext_server. This was a fairly large changeset. I closely followed the changes here: ggerganov/llama.cpp@df270ef
* fix(server.cpp): Refactor server.cpp logging for llama.cpp overhaul
* feat: Bump llama.cpp to the latest master with `granite` support. This does not yet have granite MoE support, but that can come in a follow up PR
* fix(patches): Update all patches (except solar-pro) to work with bumped llama.cpp
* fix(solar): Update solar patch for llama.cpp bump
* feat(llama.cpp): Bump llama.cpp for granitemoe support
* feat(llama.cpp): Bump llama.cpp for granitemoe support
* fix(solar): Update the solar-pro patch for latest llama.cpp bump
* feat(llama.cpp): Bump to the latest master of llama.cpp
* fix(patches): Update all patches for latest bump
* feat(llama): Always run sync.sh from the right directory
* fix(llama/patches): Update llama patches
* feat(llama)!: Rough sync with llama.cpp submodule. There are a number of changes that will need to be propagated to llama.go before any of this works!
* fix(llama/patches): Add a patch and update for missing ggml-impl.h include. This include is where the ggml_cgraph struct is defined. It is included in many of the .c files to define the forward declaration in ggml.h. It seems that with the subset of code included here, the import was somehow lost (or out-of-order) when building, so adding this include to llama.cpp fixes the missing definition.
* fix(llama/sync): Add missing ggml-cpu-impl.h copy-over in sync.sh
* fix(llama): Add missing log.cpp. This was added as part of the logging overhaul done in llama.cpp
* fix(llama): Overhaul use of sampling module for llama.cpp changes. The changes here reflect the changes made in the big llama.cpp sampling PR ggerganov/llama.cpp#9294. The sampling functionality is now broken into the base interface (llama_sampler) and the generation implementation (gpt_sampler). The changes here reflect that. Since the sampling.h/sampling.cpp code uses C++ STL headers, the sampling_ext.[h|cpp] wrapper is maintained to allow Go to access a pure-C interface.
* fix(llama): Fix the impl of SampleTokenGreedy for new sampling. I don't think this method is currently used, so it could probably just be removed so that all sampling goes through the GPT interface, but in the interest of doing no harm, this should keep the method working as expected.
* fix(llama): Remove unused SampleTokenGreedy
* fix(sync): Remove bash-specific change to sync.sh
* chore(gofumpt): Format on llama.go to pass linting
* fix(llm): Fix missing <thread> include in ext_server
* fix(llama): Remove TODO about grammar_first. This feature was not used/needed previously so should be fine without plumbing it through now.
* fix(llama): Better naming for sampling wrapper and args
* fix(llama): Fix patch 05 to use new wrapper api and re-sync
* runner: Flush pending responses before returning. If there are any pending responses (such as from potential stop tokens) then we should send them back before ending the sequence. Otherwise, we can be missing tokens at the end of a response. Fixes #6707
* fix(llama/sampling): Use gpt_sampler with a forward declaration
* fix(llama): Remove unnecessary patch for gguf impl header. This was caused by an earlier mistake in the embeddings patch that was dereferencing the pointer instead of using the wrapper API.
* fix(llm): Remove use of deprecated --log-disable flag

Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
- Add `struct llama_sampler` and `struct llama_sampler_i`
- Add `llama_sampler_` API
- Add `llama_sampler_chain_` API for chaining multiple samplers
- Remove `LLAMA_API_INTERNAL`
- Add `llama_perf_` API and remove old `llama_print_timings` and `llama_reset_timings`
alt: #8643
ref: #5214
Overview
- Replace `llama_sampling_` and `llama_grammar_` with new `llama_sampler_`
- Rework `common: struct llama_sampling_context` in `common: struct gpt_sampler`
- Support custom samplers via the `struct llama_sampler_i` interface
API Changes
- Add `struct llama_sampler` and `struct llama_sampler_i`
- Add `llama_sampler_` API
- Add `llama_sampler_chain_` API for chaining multiple samplers
- Remove `LLAMA_API_INTERNAL`
- Add `llama_perf_` API and remove old `llama_print_timings` and `llama_reset_timings`
Implementation details
- Move `common/grammar-parser` in `src/llama-grammar`
- `llama_context` no longer comes with a built-in sampling context. The user code is responsible for creating, using, saving and loading the `llama_sampler` objects as needed. As a consequence, the `llama_state` no longer serializes the RNG
- The samplers in `llama-sampling.cpp` can be used as examples for implementing custom samplers in user code
Example
Comparison of user sampling code before and after:
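The rendered comparison is not reproduced here; as a stand-in, below is a rough sketch of what the "after" side looks like with the chained samplers, based on the API names listed above (exact names and signatures may differ from the final headers).

```cpp
#include "llama.h"

// Build a small sampler chain and pick the next token from the logits of the
// last decoded position. Illustrative only.
static llama_token sample_next(llama_context * ctx) {
    llama_sampler_chain_params params = llama_sampler_chain_default_params();
    llama_sampler * chain = llama_sampler_chain_init(params);

    // samplers are applied in the order they are added to the chain
    llama_sampler_chain_add(chain, llama_sampler_init_top_k(40));
    llama_sampler_chain_add(chain, llama_sampler_init_temp(0.8f));
    llama_sampler_chain_add(chain, llama_sampler_init_dist(1234)); // final (seeded) token pick

    const llama_token id = llama_sampler_sample(chain, ctx, -1);

    llama_sampler_free(chain);
    return id;
}
```

In practice the chain would be created once and reused across decode calls rather than rebuilt for every token, which also matches the point above that user code now owns the sampler's lifetime and state.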
Future plan and ideas
- Combine decoding and sampling in a single call (e.g. `llama_decode_with_sampler()`)
- Extend the `llama_decode_` API to support multiple decoding runs (e.g. `llama_decode_n()`)
- `llama-sampling.cpp` could be split into separate source files
- Expose `struct llama_vocab` through the public API and change calls that currently use `struct llama_model` to use it when appropriate
- Deduplicate the `ring_buffer` code by implementing `ggml_ring_buffer` for fixed-size objects
- Simplify `llama_token_data` (llama : refactor sampling v2 #9294 (review))