Layer skipping/self-speculation demo #3565
base: master
Conversation
Here's a bit more information about the results:
In friendlier table format:
If we were planning on skipping half the layers, skipping the ones in the list up to and including 22 looks promising. I haven't done any experiments with skipping multiple layers yet. Also, this is just a sketch of the process, not necessarily actual usable results, since it's only running 20 chunks of perplexity to come up with that, which isn't necessarily representative.
I am very interested in generating the same table for LLaMA v1 7B and LLaMA v2 7B.
I asked the authors of the self-speculation paper about which layers they found to be best for skipping (that's with a 13B LLaMA2 model, which has 80 layers if you count the attention and MLP blocks separately). This is what they said: "[...] You can use attention layers [3, 5, 6, 8, 10, 11, 14, 15, 18, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 33, 34, 35, 36, 37] and MLP layers [6, 9, 10, 11, 15, 24, 25, 27, 28, 35] as a starting point [...]" Talking about skipping attention vs MLP layers separately is probably a good indication that my current approach isn't fine-grained enough. I really don't know how to skip those parts separately yet, rather than just skipping the entire layer. It doesn't seem like the skip-everything approach can get anything close to reasonable results when skipping more than 10-20% of layers.
This isn't exactly what you asked for, but I ran some tests with a 7B Mistral OpenOrca model (32 layers).
First column is the number of skipped layers, the second is the last skip layer added, and the last column is the perplexity difference compared to not skipping any layers. This is also just running 15 chunks. Just for example, at ... edit: I made some changes to the ...
If you mean the first layer (my stuff prints out the layer index, so that would be layer 0 here) then yes, definitely. Skipping it increases perplexity by like 3000. Layers at indexes 0, 1 have a massive impact (especially 0). The last layer doesn't seem very skippable either; it'll basically double perplexity.
Here's the full output for the first pass through the Mistral 7B model:
Interestingly, the first pass ranks the layers 9, 13, 14, 16, 15, 8, 3, 7, 25, 10, 12, 4, 11, 5, 17, 23, 6, 24, 21, 27, 22, 20, 18, 26, 29, 2, 28, 19, 30, 31, 1, 0 from least impactful to most. However, incrementally skipping the least impactful layer results in the order 9, 14, 13, 25, 22, 10, 27, 24, 23, 8, 20, 15, 12. So the layers you've already skipped can affect the impact of other layers.
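For anyone who wants to reproduce that incremental search outside the hacked perplexity tool, here's a rough sketch of the procedure in C++. The `perplexity_with_skips` helper is hypothetical (a stub stands in for actually running perplexity with a given skip list), so treat this as an outline rather than working tooling:

```cpp
#include <cstdio>
#include <set>

// Hypothetical stand-in for "run a short perplexity pass with these layers skipped".
// In the PR this is done by the hacked perplexity tool; the stub below just returns
// a made-up number so the example compiles and runs.
static double perplexity_with_skips(const std::set<int> & skips) {
    return 5.0 + 0.01 * (double) skips.size(); // placeholder value
}

int main() {
    const int n_layers = 32;            // e.g. a 7B model
    std::set<int> skips;                // layers chosen so far
    const double base_ppl = perplexity_with_skips(skips);

    // Each pass: try skipping every remaining layer on top of the ones already
    // chosen, then permanently skip whichever hurt perplexity the least.
    for (int pass = 0; pass < n_layers; ++pass) {
        int    best_layer = -1;
        double best_ppl   = 1e30;
        for (int il = 0; il < n_layers; ++il) {
            if (skips.count(il)) {
                continue;
            }
            std::set<int> candidate = skips;
            candidate.insert(il);
            const double ppl = perplexity_with_skips(candidate);
            if (ppl < best_ppl) {
                best_ppl   = ppl;
                best_layer = il;
            }
        }
        skips.insert(best_layer);
        printf("pass %2d: skip layer %2d, ppl %.4f (%+.4f vs base)\n",
               pass, best_layer, best_ppl, best_ppl - base_ppl);
    }
    return 0;
}
```

Note this is on the order of n_layers² perplexity runs, which is why the pruning and early-abort tricks mentioned later in the thread matter so much.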
It seems pretty interesting. According to the Locating and Editing Factual Associations in GPT paper, MLPs seem to store factual information and act as a sort of key-value store, so it would make sense that skipping them takes a perplexity hit. If we look at attention as "recall" of data stored in MLPs, then it seems reasonable why losing just attention wouldn't matter as much. The knowledge would still be there; it would just contain more noise that could be "filtered" by later layers, as opposed to what happens when we skip whole layers.
Any guidance on running this? I only see the ETA message output when running the perplexity tool. For 70B Q2_K, is 80 the right value for `n_layers`?
Sure, I can try to help. Just keep in mind right now I'm a lot less confident that the approach of just skipping whole layers is going to work. You probably can skip a few (also, to actually use this in inference or whatever you'd have to hack that in too). Basically, this isn't really useful for anything except research right now. 80 should be the right number of layers for LLaMA2 70B. If you scroll down a bit past where you changed `n_layers` ...
This is probably because running perplexity on a 70B model is going to be quite slow, and right now the progress output is disabled. So you won't see anything until it's completed 15 chunks (you can also try adjusting that part if you want). Q2_K already impacts the model quality a decent amount, so it's possible that even if you can skip some layers you might just be better off going down to a 30B. You're certainly welcome to try (and I'd be interested in the output). Hmm, also, this actually won't immediately help with memory issues because all it does is skip evaluating those layers; they still get loaded. They also still get used for some stuff, like if a RoPE shift occurs.
Makes sense, not that I'm really qualified to have an opinion. I also found this: https://www.lesswrong.com/posts/j84JhErNezMxyK4dH/llm-modularity-the-separability-of-capabilities-in-large
I'm currently running the 70B test with the GPU enabled (no layers offloaded); the numbers are a bit different from the CPU ones. But I ran the early stage twice and the numbers are deterministic on my system.
Even on GPU it's going to take a loooooong time. edit: By the way, you might be able to speed it up some by offloading layers. Even though it's purely prompt processing, offloading layers still seems to help. Also if you have most of the model offloaded, you'll usually want to set the number of threads fairly low.
I just had an idea: what about offloading specific layers to the GPU, let's say attention layers only? Would that be quicker than just loading them in sequence?
I'm not completely sure what you mean. Generally speaking, the more stuff you can put on the GPU the better performance will be. Also, the result from running attention is needed for the MLP/FFN layer so if you run only one of MLP or attention on the GPU you'll have to arrange to copy the data back and forth. So I would guess you probably don't want to be in a situation where you only put the attention layers on GPU but not MLP or vice versa.
@KerfuffleV2 oops. I think I botched recording the full output. It's still running though, btw. The output has exceeded the terminal window, and I'm not sure I added ... But here is my recent output with https://huggingface.co/firelzrd/Xwin-LM-70B-V0.1-GGUF/blob/main/Xwin-LM-70B-V0.1.Q2_K.gguf: https://pastebin.com/zYcGydkP
Thanks! That includes at least the first few passes through the layers; I'd say the first 1-2 are probably what's most interesting here. You can probably just stop it now. :) I'm looking into how to skip attention and MLP separately, but I got kind of distracted trying to refactor the graph building to be more understandable. Might just fall into the rabbit hole of trying to close #3382
Sorry, I have to confess that's exactly what I was naively thinking 😂. Changing the loading sequence under limited VRAM is what I mainly had in mind (like starting at layer 2 instead of the beginning).
That's pretty much how it works, except the attention/FFN layers are handled together (rather than being able to skip one of those parts). When you set ...
Memory question: if you do replicate the draft scheme they showed, are you left with a model only 57.5% the size of the original f16 (skipping 34 of the 80 attention/MLP blocks for the 13B)? I reran the test a bit with https://huggingface.co/TheBloke/Llama-2-70B-GGUF/blob/main/llama-2-70b.Q2_K.gguf so we can see the numbers with the base model:
Thinking more about this: what about using the layers that are on the GPU to do the speculation process? Then you wouldn't need anything else, just the one huge model, and you'd get a big speedup!
The first thing is that this stuff is just talking about skipping evaluation of those layers; they still get loaded into memory and everything. In the self-speculation stuff, they'd also eventually be used in cases where the draft doesn't get accepted. You could potentially just skip even loading those layers and try to run it like a standalone pared-down model. Not really what I'm doing here though. Also, they said: "[...] You can use attention layers [3, 5, 6, 8, 10, 11, 14, 15, 18, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 33, 34, 35, 36, 37] and MLP layers [6, 9, 10, 11, 15, 24, 25, 27, 28, 35] as a starting point [...]" In a model, a "layer" is attention + MLP (AKA FFN). So skipping just attention or just MLP is basically only skipping half a layer (assuming the size of the tensors involved is about the same, I don't know that for a fact). If that assumption is correct, skipping a total of 34 MLP or attention layers would be more like skipping 17 complete layers.
You mean arrange so the layers used for drafting get loaded onto the GPU even if they're not contiguous in the full model? That's definitely an idea, but I don't know if the overhead of having to copy stuff around would be worth it. (Also, implementing that sort of stuff is beyond my ability at the moment.)
Then I guess there's an easier way: load the full model on the CPU, and load the demo layers on the GPU 😆
Full Q2_K 70B base model results: results.txt Maybe it will be useful for chimera models or smaller quantizations later on. |
Looks like the people that wrote the self-speculation paper now released code and also some examples of MLP/attention layers to skip: https://github.com/dilab-zju/self-speculative-decoding/blob/main/skip_layers.json (I didn't get a chance to look closely at the code yet.) |
I finally figured out how to skip MLP and attention layers separately. One weird thing is that if skipping MLP or attention (but not both) on the very last layer evaluated, it runs out of space without my hack to force a skip on the last layer when the allocator is in measure mode. The way ... If I get a chance I want to see if I can implement self-speculation on top of GG's tree speculation stuff; I think it might not be too hard with this backend stuff in place. (If anyone else wants to try this, please don't hold back on my account. I'd love for someone to steal my thunder here.)
I like the exploration spirit here :) It should be straightforward to demonstrate self-speculation with what you have by adapting the `speculative` example.
Fun fact: Running this is now twice as slow. Yay. This also confirms that skipping attention usually works better than skipping MLP layers. At least from what I've seen so far, you can skip a bunch of attention layers even in a really small model like a 3B before skipping an MLP layer is better. I wanted to modify the `speculative` example ... The simplest way to test that is just to load the same model two times and skip stuff in the draft one; not memory efficient, but it should demonstrate self-speculation. Once I have some data about good layers to skip for small models like my 3B or a 7B, I'll see about testing that.
It should work with using 2 separate ...
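For reference, assuming the suggestion here is two separate `llama_context` instances sharing one loaded `llama_model`, a minimal sketch against the llama.cpp C API from around the time of this PR (the model path and parameters are placeholders):

```cpp
#include <cstdio>

#include "llama.h"

int main() {
    llama_backend_init(false); // no NUMA

    // Load the weights once...
    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_load_model_from_file("model.gguf", mparams); // placeholder path
    if (model == NULL) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    // ...then create two contexts over the same model: one for normal decoding,
    // one for layer-skipped drafting. The weights (including any offloaded GPU
    // layers) are shared; only the KV caches are separate.
    llama_context_params cparams = llama_context_default_params();
    llama_context * ctx_target = llama_new_context_with_model(model, cparams);
    llama_context * ctx_draft  = llama_new_context_with_model(model, cparams);

    // ... decode with ctx_target, draft with ctx_draft and a skip list ...

    llama_free(ctx_draft);
    llama_free(ctx_target);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```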
I improved the logic for the perplexity skip-layer searching stuff. It'll prune the worst results each pass. As far as I can see, the end result is the same except we get there much, much faster. It'll also abort a test early if the results are absurd. This last push also adds hacked-in support for skipping layers on the draft model in the `speculative` example. By the way, don't try to skip layers on the draft at prompt evaluation time unless you like seeing a 1% acceptance rate. Results for just speculating against exactly the same model with no skips:
Only 65% accepted for the exact same model is kind of disappointing. Like even if we could cut around half of the whole model and get exactly the same drafting accuracy it still would barely break even. Can this really be right? Anyway, with skipping:
It actually does outperform running speculation with an identical draft model and no skips, but it still is worse than just not using speculation at all, as far as I can see.

edit: By the way, if you want to try using the results at https://github.com/dilab-zju/self-speculative-decoding/blob/main/skip_layers.json you can generate the skips in Python:

```python
a = tuple([10, 11, 13, 14, 16, 18, 19, 20, 21, 22, 25, 27, 28, 29, 30, 31, 35, 41, 43, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 60, 61, 62, 63, 65, 66, 67, 68, 69, 70, 74, 75, 78, 79])
m = tuple([2, 4, 9, 10, 13, 14, 16, 20, 21, 22, 24, 25, 26, 27, 28, 29, 31, 34, 37, 41, 47, 48, 49, 50, 53, 54, 55, 57, 58, 60, 62, 63, 66, 67, 68, 70, 76])
n_layers = 80

# Populate batch.run_layers with this:
print(tuple((1 if x in a else 0) + (2 if x in m else 0) for x in range(n_layers)))
```
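For reference, here's a sketch of how flag values composed like that could be consumed on the C++ side when building the graph. The enum and helper names are made up for illustration, not the PR's actual code; only the bit meanings follow from the snippet above (1 = the layer's attention block is in the skip list, 2 = its MLP block is):

```cpp
#include <cstdint>
#include <vector>

// Illustrative flag values matching how the Python snippet composes them:
// bit 0 set -> skip the attention block of that layer,
// bit 1 set -> skip the MLP/FFN block of that layer.
enum layer_skip_flags : int32_t {
    LAYER_SKIP_NONE = 0,
    LAYER_SKIP_ATTN = 1 << 0,
    LAYER_SKIP_MLP  = 1 << 1,
};

// During graph build, each layer il would consult its flag before adding the
// attention and/or FFN sub-graphs.
static bool skip_attn(const std::vector<int32_t> & run_layers, int il) {
    return il < (int) run_layers.size() && (run_layers[il] & LAYER_SKIP_ATTN) != 0;
}

static bool skip_mlp(const std::vector<int32_t> & run_layers, int il) {
    return il < (int) run_layers.size() && (run_layers[il] & LAYER_SKIP_MLP) != 0;
}
```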
Right, it will work, but it won't reuse GPU layers or anything. So ideally you could use the same context for both when doing self-speculation. It should be possible by just having the draft use a different sequence id, right? (But there's complexity relating to managing the logits after eval so they're available when they need to be.)
Here's a comparison with 2 7B f16 base models (on master, using the same model as draft)
"Once upon a time," 66.142% If the draft model is Q4_0: Q4_K_M: So, smaller models can have higher scores? |
What sampling parameters do you use? You will get the most benefit with greedy sampling, and in this scenario of using the same model for drafting it will result in a 100% acceptance rate.
I actually was using greedy sampling; it seems like the default repetition penalty settings mess it up though. Setting ...

I tweaked the speculation algorithm a bit:

```cpp
const float skip_scale = 1.50f + std::min(2.0f, 0.75f * float(i)); // 61.76
if (cur_p.data[0].p < skip_scale*cur_p.data[1].p) {
    LOG("stopping drafting, probability too low: %.3f < %.3f * %.3f\n", cur_p.data[0].p, skip_scale, cur_p.data[1].p);
    break;
}
```

Was able to get the prediction rate up to 60% when skipping half the 70B draft model layers:
Interestingly, it only drops to 57.5% with the repetition penalty on.
Oops, I actually didn't mean to add the ... Anyway, as for layer skipping, I messed with the perplexity tool to allow an "anti mode": start with all but the first/last layers disabled and add them back gradually based on which ones seem most important. I haven't done much testing with this, but maybe it's a way to assess the most important ones in a more time-efficient way.
Here's something kind of interesting. When using a 70B and the recommended layer skips, where would you expect perplexity to end up? This is the reference without any skips:
If you said "Maybe double, or at most triple the reference?" you'd be thinking pretty much the same as I did. This is the actual result for running perplexity with those skips:
I find it really surprising that, with the model so severely compromised, it can still do anything usable, but it actually does seem to work pretty well as a draft model with those skips.
Could adding layers produce better results? (In some cases we know models are overfitting, and adding noise in training can increase performance, so I wildly guess that might work here too.)
I don't know if there's really anything much worthwhile in these changes to ... When using self-speculation, it seems to help to copy some of the latest KV entries from the full model into the sequences the cut-down one is using (this is possible since only one context is used in both cases). Note: it detects that it's running in self-speculation mode just by comparing the ... I added an approach where I normalize and run top-K on the logits for both the draft and main model, which allows stuff like enabling greedy sampling, and also means the logic internally can look at the values the models return in a more objective way. I also use a different approach to sampling from the draft where I save the logits, run the normal sampling function, and then set the probability of picking any token ids that already got picked to ... As for results, I've found some strategies that work a little better than
Those stats look pretty good but still:
vs not using speculation:
I feel like there's enough info in the log file for someone smarter than me to figure out a better strategy.
When sampling from the target:
"Shoulda picked" - we also check and log for candidates the draft suggested but that we didn't actually pick. When the target picks one, you'll see that log message ( I had to rewrite the KV cache shuffling stuff to make it work with a shared context. I think there might be something wrong there, even though the model produces coherent results. These are the sampling settings I used:
Model used was ... (The fact that I saw it pick token id 1 once is what makes me think my KV cache shuffling could have an issue.)
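As a side note, here's a minimal sketch of the "don't re-pick token ids that already got drafted" idea described above. The exact value used got cut off in the comment, so this just assumes the banned candidates are pushed to negative infinity; the helper itself is hypothetical:

```cpp
#include <cmath>
#include <unordered_set>

#include "llama.h"

// Hypothetical helper: before sampling from the draft's candidate array again,
// push any token ids that were already picked for this position to -infinity /
// zero probability so they can't be drafted twice.
static void ban_already_picked(llama_token_data_array * cur_p,
                               const std::unordered_set<llama_token> & picked) {
    for (size_t i = 0; i < cur_p->size; ++i) {
        if (picked.count(cur_p->data[i].id) > 0) {
            cur_p->data[i].logit = -INFINITY;
            cur_p->data[i].p     = 0.0f;
        }
    }
    // The array is no longer sorted/normalized by probability after this.
    cur_p->sorted = false;
}
```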
I'm not completely sure what you mean. Do you mean the "add layers back" mode in the perplexity tool? Also, it might seem unintuitive, but adding layers can sometimes make perplexity worse (much worse!). Removing layers can also make it better (even for the full model), though usually not by much.
Since this is an interesting demonstration, I'll reopen this for visibility. |
This is a demo of skipping (or potentially repeating/reordering) layer evaluation when running a model.
This might sound like a weird thing to want to do, but it's mainly to enable research. For example, k-quants try to devote more bits to some layers compared to others, since quantizing some layers affects the accuracy of the output more than others.
Another potential use is self-speculative decoding (#3435) - basically speculative decoding where the helper model isn't a separate smaller model but the same one run with some layers skipped. But the first thing you need to figure out to be able to do that is which layers you can skip and still get a result accurate enough for it to be used for speculation.
From the llama.cpp side, this is just a sketch of what the API could look like. It's also only implemented for LLaMA models right now. The list of layer indexes to run is supplied in the batch. If it's NULL then all layers run like normal. When set, there aren't really any requirements for what you can put in there, so you could use it to run the first layer 10 times if you wanted, or whatever.
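To make that concrete, here's a hedged sketch of building such a list of layer indexes on the application side. The helper is made up for illustration; how the list is attached to the batch and how its length is communicated are specific to this PR (the thread refers to the field as `batch.run_layers`, and later pushes change the encoding to per-layer flags, see the Python snippet above):

```cpp
#include <cstdint>
#include <initializer_list>
#include <vector>

// Build the "list of layer indexes to run" described above, excluding a few layers.
static std::vector<int32_t> make_run_layers(int n_layers, std::initializer_list<int> skips) {
    std::vector<int32_t> run_layers;
    for (int il = 0; il < n_layers; ++il) {
        bool skip = false;
        for (int s : skips) {
            if (s == il) { skip = true; break; }
        }
        if (!skip) {
            run_layers.push_back(il);   // layers are evaluated in the order listed
        }
    }
    return run_layers;
}

// Usage (sketch): run a 32-layer model with layers 9, 13 and 14 skipped.
//   std::vector<int32_t> run_layers = make_run_layers(32, {9, 13, 14});
//   batch.run_layers = run_layers.data();   // NULL would mean "run all layers"
```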
Also included is a hacked version of `perplexity` that just runs the first 10 chunks, then skips a layer. I.e. the first 10 chunks skip no layers, then it repeats with layer 0 skipped, then with layer 1 skipped, etc. Apps can't query how many layers exist in a model yet as far as I know, so this is hardcoded to 26 (the number of layers in the model I was testing with). If you want to try it with a different model, just set `n_layers` to the correct value.

Example output with a 3B OpenOrca model: