lookup: complement data from context with general text statistics #5479
Conversation
I think intuitively speaking drafting should focus on longer phrases, sequences, etc. that are deterministically predictable in most cases. For example, “state-of-the-art” being split into multiple tokens is pretty redundant, but once that pattern is established it's trivial to predict the last half in contexts where it appears. I wonder if you can explicitly target continuations like this by avoiding drafting when the next token isn't a continuation of a word / sequence and is instead the start of a new sentence (less predictable), for example?
I tried explicitly checking for word boundaries but the results were not good. But of course it's always possible that I just did it wrong. In any case, it would also be useful to have a generic implementation since you can (in principle) use transformers for more than just text.
examples/lookup/lookup.cpp
Outdated
const int static_ngram_start = inp_size-2 + draft.size()-1;
uint64_t static_ngram = get_token(inp, draft, static_ngram_start);
for (int j = static_ngram_start; j < static_ngram_start + 2; ++j) {
    const uint64_t ngram_part = get_token(inp, draft, j);
    static_ngram <<= 16;
This seems incorrect - it's calling get_token(inp, draft, static_ngram_start); twice: once before the loop on line 285 and once in the first loop iteration on line 287.
Also, printing the resulting static_ngram, the major 16 bits are always 0:
static_ngram: 0000000000000000 0010010110101010 0010010110101010 0111010011000100
their
static_ngram: 0000000000000000 0111010011000100 0111010011000100 0000001111110001
love
static_ngram: 0000000000000000 0000001111110001 0000001111110001 0001010011110000
for
static_ngram: 0000000000000000 0001010011110000 0001010011110000 0000000101101011
each
static_ngram: 0000000000000000 0000000101101011 0000000101101011 0000010011110101
other
static_ngram: 0000000000000000 0000010011110101 0000010011110101 0000001110010100
You see, the trick is that I did it wrong, but because I did it consistently wrong the code still works. The static n-grams always have a size of 2 (1 is underfitting, 3 is overfitting with wikitext-103), so in actuality the major 32 bits are supposed to always be 0. The impact of the bug should only affect n-grams of size 4, but those are not very useful anyway because 3 already overfits.
Ok I see
The impact of the bug should only affect n-grams of size 4 but those are not very useful anyways because 3 already overfits.
My intuitive understanding is that the larger the n-gram that we can match in the corpus - the better. For example, for code completion use cases this is most certainly true due to the repetitive nature of code
For an infinitely large corpus this is definitely true. But you essentially have a bias <-> variance tradeoff. Smaller n-grams have more bias because they cannot model the text distribution as well as larger n-grams. But larger n-grams have more variance because the number of possible token combinations increases exponentially with n-gram size. Since the tokens for a continuation follow a multinomial distribution, you can estimate that for a vocabulary size of 32000 an increase of 1 in the n-gram size increases the error on the empirical token probabilities by a large factor.
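As a rough back-of-the-envelope estimate of that factor (my own calculation, assuming the number of occurrences N of each n-gram prefix in the corpus shrinks by roughly the vocabulary size V for every additional token): the standard error of an empirical multinomial probability estimated from N observations scales as sqrt(p(1-p)/N), so increasing the n-gram size by 1 multiplies the relative error by about sqrt(N_n / N_{n+1}) ≈ sqrt(V) = sqrt(32000) ≈ 179.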
When it comes to the lookup from the context 3 vs. 4 also makes little difference because there would need to be token sequences where three tokens are the same but those immediately before and afterwards are different. And I think there simply aren't that many cases where this happens.
When it comes to generating code or highly structured data (e.g. JSON) or summarizing/extracting information, large n-gram matches are much more frequent compared to free-form text completion. My main desire here is that the potential implementation for n-gram caches that we will implement in llama will be general enough and not have limitations on the n-gram size.
I mean, the main reason I'm encoding the tokens as 64 bit integers is that uint64_t is a hashable type. If we were to write a custom hashing function we could go up to whatever length we want.
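For illustration, a minimal sketch of the packing idea; pack_ngram is a hypothetical helper, not the PR's actual code, and it assumes token ids fit into 16 bits:

#include <cstdint>
#include <vector>

typedef int32_t llama_token;

// Pack up to 4 token ids into a single uint64_t, 16 bits per token.
// The result is trivially hashable, so it can be used directly as an
// std::unordered_map key without writing a custom hash function.
static uint64_t pack_ngram(const std::vector<llama_token> & tokens) {
    uint64_t ngram = 0;
    for (size_t i = 0; i < tokens.size() && i < 4; ++i) {
        ngram <<= 16;
        ngram |= (uint64_t) (uint16_t) tokens[i];
    }
    return ngram;
}

The 16-bit slots are also exactly where the vocabulary-size concern raised further down comes from.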
I fixed the off-by-one issue with n-gram size and uploaded a fixed file. These are my results for the fixed version:
I would say they are worse than the results I got before, but because the output has changed and I only tested a single prompt/seed the result is inconclusive. I think it would be useful to have something like a dedicated evaluation tool for this.
I tested this PR on my RTX 3090 using Yi 34b 200k RPMerge q4_K_M. The tokenizer is different, so a different static lookup cache is needed.
I get these results:
Compared to Miqu the generations had fewer repetitions, which is good in terms of output quality but of course also means that there is less potential for speedup from lookup decoding. So even though the t/s as a function of batch size scales much better on an RTX 3090 compared to a P40, the speedup is still low. But again, with a sample size of 1 it is very difficult to make conclusive statements.
I've been thinking, maybe you could synthetically generate the data to use for the static lookup. Basically, just generate a bunch of outputs for different seeds and use them as input for the static lookup cache. I would also want to try just using a lot more data for the static lookup in order to make an n-gram size of 3 usable. Unfortunately The Pile was taken down due to copyright issues. But maybe there are other publicly available datasets that are suitable; I'll investigate.
I ran some synthetic data generation as a test and at a large sample size the rate of data generation for Mixtral q8_0 with 3xP40 is ~7.9 MiB / 24 hours. It would still take months to get even a single gigabyte of data but it seems my earlier estimate was too pessimistic.
It should be possible to generate multiple sequences simultaneously with the batch API; that should be a lot faster.
Only for dense models. For MoE models the amount of data that you need to load increases by a factor of 4 for batched generation. Or if you can fit the entire model onto a single GPU you could use job scheduling software like SLURM to more efficiently parallelize the generation. Or just use better hardware than P40s.
I moved the n-gram cache code to common with a C++-compliant interface. I added a CLI argument for the static lookup cache.
common/common.h
Outdated
void llama_ngram_cache_draft(
    std::vector<llama_token> & inp, std::vector<llama_token> & draft, int n_draft,
    std::vector<llama_ngram_cache> & ncs_t1, int ngram_min, llama_ngram_cache & nc_t2);
What is the purpose of ncs_t1 and nc_t2?
I forgot to add a comment; "t1" means tier 1 and "t2" means tier 2. With lookup decoding there is a tradeoff between relevance (e.g. tokens in the context) and sample size (e.g. Wikitext). So I think the best approach will be to first try to use tokens with high relevance (tier 1) and to then broaden the search to sources with less relevance but a higher likelihood of containing the n-gram in question (tier 2, tier 3). I think the best solution (for regular text) would be to use the context as tier 1, all past user generations as tier 2, and a general text corpus as tier 3.
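A minimal sketch of that tier ordering (my reading of the idea, not the actual llama_ngram_cache_draft implementation; draft_one_token and the per-tier "most frequent continuation" rule are assumptions for illustration):

#include <cstdint>
#include <initializer_list>
#include <unordered_map>

typedef int32_t llama_token;
typedef std::unordered_map<llama_token, int32_t> llama_ngram_cache_part;        // token -> count
typedef std::unordered_map<uint64_t, llama_ngram_cache_part> llama_ngram_cache; // packed n-gram -> continuations

// Try the most relevant source first, then broaden the search to larger but
// less relevant sources. Returns -1 if no tier has seen the n-gram.
static llama_token draft_one_token(
        uint64_t ngram,
        const llama_ngram_cache & nc_context,   // tier 1: current context
        const llama_ngram_cache & nc_dynamic,   // tier 2: previous user generations
        const llama_ngram_cache & nc_static) {  // tier 3: general text corpus
    for (const llama_ngram_cache * nc : {&nc_context, &nc_dynamic, &nc_static}) {
        const auto it = nc->find(ngram);
        if (it == nc->end()) {
            continue; // n-gram unseen in this tier -> fall back to the next one
        }
        llama_token best_token = -1;
        int32_t     best_count = 0;
        for (const auto & tc : it->second) { // pick the most frequent continuation
            if (tc.second > best_count) {
                best_token = tc.first;
                best_count = tc.second;
            }
        }
        return best_token;
    }
    return -1;
}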
I generated ~8 MiB of text and used it to create a lookup cache. The results seem to be improving as more data is added, but very slowly. With a month of compute the results could become usable, but I don't really want to wait that long. I'm currently downloading Project Gutenberg, which will hopefully yield better results.
I tried scaling up the amount of data for the static ngram cache with Project Gutenberg. The ngram size is 2. The acceptance rate improves with any static ngram cache, but there seems to be no indication that for 2-grams there is scaling beyond ~700 MiB of input data. Ngram caches obtained from the Wikitext train split perform better than the caches from Project Gutenberg, but this may be the result of overfitting on formatting: Wikitext contains formatting that can be much better predicted with a Wikitext cache, and Project Gutenberg contains repeating text that I maybe did not sufficiently filter out.
When subjectively evaluating lookup performance on actual model outputs the trend seems to be model outputs > Wikitext > Project Gutenberg. It may be because (I think) Wikitext only contains English while Project Gutenberg also includes other languages. Or I just did the data preparation wrong.
I've done some more testing. Data generated with LLaMA 2 q8_0 also seems to work for ngram caches. This type of data production is much more scalable than with a big model like Mixtral, especially because I can now use as many GPUs as I want without even needing these GPUs to run the same backends or to be on the same machines. It would also be possible to generate the data with a batch size >1 since LLaMA 2 7b is a dense model.
Today at LIPS I talked with Francois Charton from Meta AI. He suggested that, in order to get more information per generated token, I should not just generate text and take the frequencies of the tokens that were actually generated, but instead build the ngram cache based on the logits or token probabilities put out by the model. The idea is sound and should make synthetic data generation more efficient.
I added a CLI argument for a dynamic lookup cache that gets updated by user generations; it seems to help a bit. The idea of using token probability distributions instead of the actually generated tokens did not seem to result in an improvement (but it made the code slower and more complicated).
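For concreteness, a sketch of what updating such a dynamic cache with freshly accepted tokens could look like; ngram_cache_update is a hypothetical name and the 16-bit packing matches the earlier sketch:

#include <cstdint>
#include <unordered_map>
#include <vector>

typedef int32_t llama_token;
typedef std::unordered_map<llama_token, int32_t> llama_ngram_cache_part;
typedef std::unordered_map<uint64_t, llama_ngram_cache_part> llama_ngram_cache;

// Slide a window of ngram_size tokens over the newly generated text and
// count which token follows each n-gram.
static void ngram_cache_update(llama_ngram_cache & nc, const std::vector<llama_token> & tokens, int ngram_size) {
    for (size_t i = 0; i + ngram_size < tokens.size(); ++i) {
        uint64_t ngram = 0;
        for (int j = 0; j < ngram_size; ++j) {
            ngram <<= 16;                                  // 16 bits per token
            ngram |= (uint64_t) (uint16_t) tokens[i + j];
        }
        nc[ngram][tokens[i + ngram_size]]++;               // count the observed continuation
    }
}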
This is the algorithm that I have settled on:
For the last week I've been generating more data for static lookup caches using 1x RTX 3090, 3x P40, and 1x RX 6800. I analyzed the scaling for up to 51.2 million tokens. The acceptance rate for 2-gram and 3-gram caches decreases at first due to overfitting, but they still show scaling. 2-gram caches should be able to reach at least the same acceptance rate as 1-gram caches, so the scaling should continue for at least ~10x more data. The rate of accepted tokens per generated token is also still scaling: roughly 1 in 6 tokens can be drafted correctly. With the current code and the biggest 2-gram cache I also looked at the performance when repeatedly generating tokens for the same prompt. The command that I used is something like this:
Lookup from the context, a static text corpus, and previous generations all improve the overall generation speed. The top speedup is ~1.4x. I will now focus on getting this PR into a state where it can be merged.
examples/lookup/CMakeLists.txt
Outdated
set(TARGET lookup-create)
add_executable(${TARGET} lookup.cpp)
install(TARGETS ${TARGET} RUNTIME)
target_link_libraries(${TARGET} PRIVATE common llama ${CMAKE_THREAD_LIBS_INIT})
target_compile_features(${TARGET} PRIVATE cxx_std_11)

set(TARGET lookup-merge)
add_executable(${TARGET} lookup.cpp)
install(TARGETS ${TARGET} RUNTIME)
target_link_libraries(${TARGET} PRIVATE common llama ${CMAKE_THREAD_LIBS_INIT})
target_compile_features(${TARGET} PRIVATE cxx_std_11)

set(TARGET lookup-stats)
add_executable(${TARGET} lookup.cpp)
install(TARGETS ${TARGET} RUNTIME)
target_link_libraries(${TARGET} PRIVATE common llama ${CMAKE_THREAD_LIBS_INIT})
target_compile_features(${TARGET} PRIVATE cxx_std_11)
All of these have the same source lookup.cpp.
Thanks, at some point I refactored this and forgot to adjust those lines.
common/common.h
Outdated
typedef uint64_t llama_ngram; // Each of the 4 16 bit sections represents a token id.
typedef std::unordered_map<llama_token, int32_t> llama_ngram_cache_part; // token -> number of times token has been seen
If this is an array of 4 16-bit tokens, why not define it as such instead of using bit manipulation? However, 16 bits do not seem to be enough to store a token id; we already have models with n_vocab far above 65535, e.g. baichuan2 has 125696.
The issue is that an array is not a hashable type. Because of this you cannot use it as the key in std::unordered_map without also defining e.g. a corresponding hashing function. I would have preferred not to do this since it adds complexity, but if we already have models with larger vocabularies it cannot be helped.
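For reference, a minimal sketch of what such a custom hash could look like with full 32-bit token ids; the key type, names, and the hash-combine constant are illustrative assumptions, not what this PR ships:

#include <array>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>

typedef int32_t llama_token;

// Hypothetical n-gram key: a fixed-size array of full 32-bit token ids,
// so there is no 16-bit limit on the vocabulary size.
typedef std::array<llama_token, 4> ngram_key;

// Custom hash so that ngram_key can be used as an std::unordered_map key.
struct ngram_key_hash {
    size_t operator()(const ngram_key & ngram) const {
        size_t h = 0;
        for (const llama_token t : ngram) {
            // boost-style hash combine
            h ^= std::hash<llama_token>()(t) + 0x9e3779b9 + (h << 6) + (h >> 2);
        }
        return h;
    }
};

typedef std::unordered_map<llama_token, int32_t> ngram_cache_part;
typedef std::unordered_map<ngram_key, ngram_cache_part, ngram_key_hash> ngram_cache;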
I like the implementation. Regarding the C-style API - we will figure this out later. For now we can use the proposed C++ interface and try to create wider application of the speculative sampling approach.
One minor thing to consider is to move the llama_ngram_cache functions into separate common/ngram-cache.h/.cpp files (similar to common/sampling.h/.cpp).
I rebased on top of master and moved the new code to dedicated files. Since several things have changed and I have generated additional data, I will re-run my workflow for the static ngram cache soon. If I encounter no issues I will then merge.
The goal of this PR is a proper implementation of #5398. It builds on top of #5462. The changes relative to #5462 are that lookup decoding now uses information not only from the context but also from a static text corpus.
More specifically, the code first tries to find a suitable draft token in the current context. Those draft tokens are ranked by the product of the frequencies with which the token appears as a continuation in the context and in the text corpus. If no suitable token can be drafted from the context, the code instead tries to draft a token from the text corpus only.
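A hedged sketch of that ranking step, with hypothetical helper names rather than the PR's actual functions:

#include <cstdint>
#include <unordered_map>

typedef int32_t llama_token;
typedef std::unordered_map<llama_token, int32_t> llama_ngram_cache_part; // token -> count

static int32_t get_count(const llama_ngram_cache_part & part, llama_token token) {
    const auto it = part.find(token);
    return it == part.end() ? 0 : it->second;
}

// Score each continuation seen in the context by the product of its count in the
// context and its count in the static corpus; return the best one, or -1 if no
// candidate is supported by both sources.
static llama_token rank_draft_candidates(
        const llama_ngram_cache_part & part_context,
        const llama_ngram_cache_part & part_static) {
    llama_token best_token = -1;
    int64_t     best_score = 0;
    for (const auto & tc : part_context) {
        const int64_t score = (int64_t) tc.second * get_count(part_static, tc.first);
        if (score > best_score) {
            best_token = tc.first;
            best_score = score;
        }
    }
    return best_token;
}

Using raw counts instead of normalized frequencies does not change the ranking, since the normalization factors are the same for every candidate.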
I use commands such as this for testing:
On my system with 3x P40 I currently get the following results for Miqu q5_K_M:
Compared to master more tokens are drafted and the acceptance rate is also higher. The end-to-end speedup compared to regular decoding is up to ~1.11. The drop when going from --draft 3 to --draft 4 comes from the generation output changing (if I remember correctly, rejection sampling does not guarantee bit-for-bit identical results). In any case, due to the low amount of compute on P40s they scale relatively poorly with batch size; I expect the speedup on e.g. 3x RTX 3090 to be much larger. I currently don't have a good instruct model for 1x RTX 3090; I'll post numbers once I do.
To use this PR, you need a lookup.bin file. On Linux it can be generated with these commands:
I've also uploaded the file that I generated here. Put the file into the llama.cpp root directory and run the lookup example as normal. The lookup.bin file is based on 541 MiB of text and is 352 MiB in size. However, this scaling is not linear: with a smaller 11 MiB input file, lookup.bin is 20 MiB in size.
What is still to do in this PR is to figure out the best way to integrate this feature with the rest of the llama.cpp code. Currently the static lookup information is simply stored as a binary file (because I'm not familiar with how GGUF works). The creation of the file also takes a long time and is not really scalable. For the user-facing interface I imagine something like this:
I'm not sure about the llama.cpp API. Currently the data structures are C++ hashmaps. My understanding is that the API in llama.h is supposed to be C compatible so I don't think they can simply be exposed. The hashmaps can be serialized to uint64_t and int32_t (which is how they're stored in the binary) but the conversion does take some time. Each n-gram size needs a separate hashmap. The context hashmaps exist for n-gram lengths 1-4, the static hashmaps only exist for n-gram length 2 (1 is too unspecific, not enough input data for 3). The methods needed would essentially be saving/loading the hashmaps to/from files as well as methods to initialize a hashmap and to fill it up with tokens. Feedback would be very welcome.