Speculative decoding potential for running big LLMs on consumer grade GPUs efficiently #10466

steampunque · 2024-11-23T17:57:53Z

steampunque
Nov 23, 2024

I recently added an efficient greedy-only spec decode to my downstream server patch (a completely different implementation than the current spec decode PR). I then evaluated tg performance for two cases : 1) Solve the first humaneval problem with coding model and 2) solve the goldcoin problem with general model. I used Qwen 14B for the target and 0.5B, 1.5B, and 3B for the drafts. I evaluated tg vs. draft token length on a 4070 fully offloaded with the target and draft weights where target is IQ4_XS quant and draft is Q6_K quant.

HUMANEVAL first problem:

TARGET Qwen2.5-Coder-14B-Instruct
DRAFTS Qwen2.5-Coder-0.5B-Instruct, Qwen2.5-Coder-1.5B-Instruct, Qwen2.5-Coder-3B-Instruct

TPS vs draft tokens:

draft tokens	0.5B	1.5B	3.0B
0	46.77	46.95	46.59
1	70.08	63.15	56.42
2	78.03	64.85	54.13
3	87.90	68.79	57.96
4	102.74	76.35	62.57
5	102.72	70.51	56.40
6	104.03	70.53	52.77
7	101.85	65.51	51.03
8	111.00	67.08	49.78
9	113.12	66.72	51.08
10	115.41	64.04	47.44
11	113.78	62.30	45.55
12	116.84	61.83	43.77
13	112.34	60.47	42.43
14	111.70	59.16	40.06
15	112.07	54.89	39.10
16	106.33	54.94	37.82
32	93.19	37.49	24.67

GOLDCOIN

I have 10 apples. I find 3 gold coins in the bottom of a river. The river runs near a big city that has something to do with what I can spend the coins on. I then lose 4 apples but gain a gold coin. Three birds run into my path and drop 6 apples each. I play an online game and win 6 gold coins but I have to share them equally with my 2 teammates. I buy apples for all the coins I have. The price of an apple is 0.5 coins. How many apples do I have? And where is the river? Use step-by-step reasoning to solve this problem.

TARGET Qwen2.5-14B-Instruct
DRAFTS Qwen2.5-0.5B-Instruct, Qwen2.5-1.5B-Instruct, Qwen2.5-3B-Instruct

TPS vs draft tokens:

draft tokens	0.5B	1.5B	3.0B
0	46.55	46.51	46.44
1	59.88	55.88	50.23
2	61.99	54.40	46.58
3	63.41	53.76	45.47
4	64.98	54.78	43.87
5	58.75	48.39	38.10
6	55.50	44.24	35.20
7	51.99	42.61	32.03
8	51.42	41.10	31.08
9	49.93	39.42	29.37
10	49.90	37.63	27.47
11	46.82	35.99	26.29
12	45.86	34.62	25.31
13	42.83	33.13	23.57
14	41.71	31.78	22.65
15	39.79	30.85	21.50
16	38.45	28.98	20.42
32	23.99	17.12	11.71

TARGET Llama 3.1 8B Instruct
DRAFT Llama 3.2 1B Instruct

TPS vs draft tokens:

draft tokens	1B
0	62.21
1	89.76
2	101.40
3	108.34
4	112.90
5	113.97
6	111.72
7	105.54
8	109.59

TARGET Gemma 2 9B it IQ4_XS
DRAFT Gemma 2 2B it IQ4_XS

TPS vs draft tokens:

draft tokens	1B
0	62.96
1	62.83
2	54.86
3	50.25
4	50.61
5	44.68
6	40.52
7	36.26
8	31.46

Results Summary:

Coding shows a max speedup of 2.5x tg at 10 draft tokens speculated using 0.5B model. At 1.5B draft the max speedup is 1.63x at 4 draft tokens. At 3B draft the max speedup is 1.33 at 4 draft tokens. The efficiency crossover (where draft+target is the same as no draft) is >32 draft tokens for 0.5B, >16 draft tokens for 1.5B, and 11 draft tokens for 3B.

Goldcoin shows a max speedup of 1.4x tg at 4 draft tokens speculated using 0.5B model. at 1.5 draft the max speedup is 1.17x at 4 draft tokens. At 3B draft the max speedup is 1.08 at 1 draft token. The efficiency crossover (where draft+target is the same as no draft) is 12 tokens for 0.5B, 6 tokens for 1.5B, and 3 tokens for 3B.

With Llama 3.18B instruct drafted by Llama 3.2 1B instruct a speedup in token gen of 1.83x is found at draft tokens of 5.

With Gemma2 9B it drafted by Gemma2 2B it there is never any speculative decoding speedup. Guess 2B not distilled from 9B at all but was trained on a completey different data set.

Conclusions and potential for running big LLMs on consumer grade GPUs:

Small draft model is needed (sine qua non). 0.5B size seems to work well. Any model in the range of 8G or above can benefit by distilling a 0.5B draft and speculating the model. Returns fall off rapidly as draft gets bigger, already questionable at 1.5B and not really useful at 3B draft. Coding is far more efficient than general text gen with speculation. Qwen 2.5 series is perfect for exploiting the potential of speculation.

For running big LLMs on consumer grade GPUs with limited memory it is desired to avoid the need to store all model weights and output layer in VRAM because there is not enough room. Most of the model weights are sitting there doing nothing most of the time, i.e. a 32 layer model has 31 dead weights sitting there occupying VRAM doing nothing 31/32 of the time. To get around this problem it is necessary to dynamically swap the layers into VRAM as they are needed from CPU RAM which is normally much higher capacity. If the draft size at the efficiency crossover is big enough, there may be (emphasis on may, it needs to be investigated for feasibility) enough time to compute the target batch (say 8 to 10 samples) and simultaneously transfer the next layer into the GPU. The GPU capacity needs 1 working layer allocation and one transfer allocation (two total model layers which are ping ponged between compute and transfer) + a fully offloaded speculator. KV for speculator and target should also both be in GPU mem. Even if it is needed to go above the efficiency crossover, it can still be more efficient to do dynamic layer loading to GPU because offloading to CPU is an immediate 10X or higher slowdown due to memory BW limits.

jukofyork · 2024-11-30T07:49:53Z

jukofyork
Nov 30, 2024
Collaborator

Would there be any benefit in pruning down a 0.5B model to be even smaller? From your examples above it looks like the speculative models' size reduction has the biggest effect?

You could prune the later layers like this: https://arxiv.org/abs/2403.17887

but with a calibration dataset you could probably prune down the width of the MLP hidden state quite significantly too... The imatrix code already does a kind of "soft" version of this for quantisation.

I think you could even apply L1-regularisation during fine-tuning to spasify the weights and then remove all those close to zero, but the effectiveness of this would depend on whether the induced sparseness was evenly distributed for the corresponding tensors in each layer (which from the paper above; I doubt is the case).

It would be interesting to see where the balance point is between "tiny and fast/dumb" vs "small but slower/less-dumb" actually is.

If using greedy speculation then it won't make any difference, but if you have to actually apply the softmax (instead of just finding the maximum logit), then for stuff like coding using only English; it would be perfectly valid to remove a lot (most) of the tokens and prune down the input_embedding and lm_head tensors due to softmax (aka multinomial logistic regression) having the IIA property. I'm not sure what happens though if the model encounters a token you have pruned away like this?

5 replies

steampunque Nov 30, 2024
Author

Would there be any benefit in pruning down a 0.5B model to be even smaller? From your examples above it looks like the speculative models' size reduction has the biggest effect?

@jukofyork I tried the AMD-Llama-135m on Llama2 and AMD-Llama-135m-code on CodeLlama and didn't see as much benefit. Just based on pure hunch I think the "magic" of speculation is exposed with distilled models using the full cross entropy loss training metric on a large data set so the model doesn't lose any particular part of its knowledge. It gives a "fuzzy" version of the bigger model tending to follow the same patterns as the big model. Thats why I think Qwen2.5 series and Llama3.1 paired with Llama3.2 work so well.... I think (hypothesize) all are using distilling to make the smaller versions of the models. In theory any created model could be distilled but its compute intensive (similar to training a model from scratch as far as I understand) so need to rely or ask the model creators to make distilled versions of the models they release. I think 0.5B is a reasonable size but could not explore anything smaller since thats where Qwen 2.5 stopped.

I thought more about dynamic layer offload speedup potential too and the numbers I am pushing around seem like the idea can be viable on a 4070 with gen4 PCI (I think a single 4070 might be able to push 5tps gen on Llama 70B which is close to its compute bound potential), but most likely not on a 4090 as the ratio of compute to PCI BW is about 2X on the 4090, it would need gen5 PCI to be viable. Still 2X reduction from peak is a lot better than 10X when trying to compute on CPU with its x10 less or more memory BW. I was thinking about feeding some ggml and the vulkan backend into Qwen 2.5 32B and see if it can suggest how to wedge in dynamic offload concept efficiently.

jukofyork Nov 30, 2024
Collaborator

Just based on pure hunch I think the "magic" of speculation is exposed with distilled models using the full cross entropy loss training metric on a large data set so the model doesn't lose any particular part of its knowledge. It gives a "fuzzy" version of the bigger model tending to follow the same patterns as the big model. Thats why I think Qwen2.5 series and Llama3.1 paired with Llama3.2 work so well.... I think (hypothesize) all are using distilling to make the smaller versions of the models. In theory any created model could be distilled but its compute intensive (similar to training a model from scratch as far as I understand) so need to rely or ask the model creators to make distilled versions of the models they release. I think 0.5B is a reasonable size but could not explore anything smaller since thats where Qwen 2.5 stopped.

Interesting!

I think you're right that "distilled" training is likely to be pretty hard to do as you'll need the full outputs to use cross-entropy loss on (I assume they do it this way?).

If you are only interested in greedy sampling then one interesting alternative would be to use (smoothed) hinge loss on the "all vs one" targets:

Cross-entropy loss gets heavily penalised for the correct class having a low probability output (due to the -log(y) ), but if we only care about greedy sampling then we don't really neen these "well-calibrated" probability estimates at all...
Hinge loss on the other hand only cares about getting the output correct or not, and has no notion of calibration.

Hinge loss has other potential benefits too:

It doesn't have to pass the logits through the softmax function, and essentially works on the raw logits themselves (ie: no calls to exp() or summation needed).
The "distillation" data generation would simply be a case of running the parent model with temperature=0 to generate training data as usual, so no fiddling about with full probability distributions as targets, etc.

The only downside is it can sometimes be a bit tricky to get working with gradient descent.

steampunque Nov 30, 2024
Author

Interesting!

I think you're right that "distilled" training is likely to be pretty hard to do as you'll need the full outputs to use cross-entropy loss on (I assume they do it this way?).

I dont have backround in ML but my basic understanding is teacher student approaches to build models do this. Particularly Llama 3.1 I believe they trained 405B then distilled the smaller models with subsets of data corpus off 405B (pretty sure about that). I am speculating Qwen did the same with their 2.5 series and they wound up with quite performant small models. So its essentially a full train from nothing using cross entropy loss to minize the error between target and teacher on a desired training corpus. Way beyond the compute potential of users must be done by model creators on GPU farms.

If you are only interested in greedy sampling then one interesting alternative would be to use (smoothed) hinge loss on the "all vs one" targets:

I dont know enough about ML to understand the implications here, though based on my comm theory background it seems analagous to a hard decision decode vs a soft decision decode which can often results in a fairly significant entropy loss in communication systems.

* Cross-entropy loss gets heavily penalised for the correct class having a low probability output (due to the `-log(y) `), but if we only care about greedy sampling then we don't really neen these "well-calibrated" probability estimates at all...

Still reserved about possible entropy loss in the training of the draft....

* Hinge loss on the other hand only cares about getting the output correct or not, and has no notion of calibration.

Hinge loss has other potential benefits too:

* It doesn't have to pass the logits through the softmax function, and essentially works on the raw logits themselves (ie: no calls to `exp()` or summation needed).

* The "distillation" data generation would simply be a case of running the parent model with `temperature=0` to generate training data as usual, so no fiddling about with full probability distributions as targets, etc.

The only downside is it can sometimes be a bit tricky to get working with gradient descent.

Thanks for interesting comments! Hopefully more model creators will be creating distilled edge versions of their bigger models which can be easily leveraged to run the bigger models far more effiiciently. It seems like such a no brainer but I dont think it has been done intentionally to date (llama 3.2 series were created for edge apps, not drafting. I also believe smaller Qwen 2.5 were also targeting edge devices and drafting bigger models on heavy lift platforms is just a nice side benefit).

jukofyork Nov 30, 2024
Collaborator

If you are only interested in greedy sampling then one interesting alternative would be to use (smoothed) hinge loss on the "all vs one" targets:

I dont know enough about ML to understand the implications here, though based on my comm theory background it seems analagous to a hard decision decode vs a soft decision decode which can often results in a fairly significant entropy loss in communication systems.

Based on a quick search:

https://www.gaussianwaves.com/2009/12/hard-and-soft-decision-decoding-2/

This would actually be analogous to "0-1 loss" which isn't used in machine learning due not being differentiable. It's the black "square" loss on this graph:

The red line is (binary) cross-entropy loss (aka log loss).
The green line is least-squares loss.
The blue line is hinge-loss.

They are all really a surrogate for 0-1 loss, but with different properties.

The main appeal of cross-entropy loss is that when it's paired with the binary or multinomial logistic function (aka "softmax"), produces well-calibrated probability estimates.

Nobody would consider using least-squares loss for classification now, but back in the 80s and 90s it was very common.

The use of hinge-loss was only popularised in the early-mid 2000s for use with Support Vector Machines (and usually solved via the dual quadratic programming problem and using a non-linear kernel).

But there isn't any reason you can't solve the primal problem without a kernel (but as I said above you would likely want to use one of the smoothed/Huberised variants due to it being trickier to solve via gradient descent - it tends to get "stuck" easily due to the zero-derivatives in a similar way to stock ReLU can get "dead nodes" [if not initialsed properly] and why ML tends to use smoothed versions of ReLU now instead...).

The key difference is if you don't care about the probability estimates then you can use hinge-loss directly on the dot-products of the lm_head matrix with the hidden state, with the goal being to drive the dot-product of the actual target vector to be larger than the dot product of all the other vectors. It's also called "maximum margin loss" for this reason, eg:

The "margin" coming from the little triangle formed by the blue line between 0 and 1 on the previous diagram.

Cross-entropy loss gets heavily penalised for the correct class having a low probability output (due to the -log(y) ), but if we only care about greedy sampling then we don't really neen these "well-calibrated" probability estimates at all...

Still reserved about possible entropy loss in the training of the draft....

Hinge loss on the other hand only cares about getting the output correct or not, and has no notion of calibration.

Again, looking back at the first diagram, you can see that cross-entropy loss naturally wants to drive the output corresponding to the true class further and further away (or looking at the second diagram: once it's crossed the margin - it wants to keep pushing it away).

What you can't see in the first diagram (due to the reduced x-axis) is the use of -log(y) causes cross-entropy loss to massively penalise a single low-probability output for the true class.

Neither of these properties are beneficial if all you care about is predicting the most likely next token. In actual fact, if it were possible the loss we would really like to use here is... The 0-1 loss! This would be the loss that most agrees with our goal of predicting the correct output most often, but since this isn't possible; the use of hinge-loss is likely to be the next best choice!

Hopefully this makes sense, and I'm fairly sure I could easily adapt the Unsloth CE-loss kernel to calculate smoothed/Huberised hinge-loss:

https://github.com/unslothai/unsloth/blob/main/unsloth/kernels/cross_entropy_loss.py

Then it would just be a case of running the larger model over a corpus of stuff you want to predict, but saving the greedy prediction of the larger model at every step as the new ("one-hot") training data for the tiny-models' fine-tuning dataset (I think - not thought about this part too hard).

EDIT: This would potentially open up the possibility of training these tiny models on just a subset of the possible tasks you want to perform, eg: just on coding data or coding related data, etc. Likely meaning an even smaller/faster/more accurate tiny model that specialises in just predicting this subset of the larger models' abilities.

steampunque Dec 1, 2024
Author

EDIT: This would potentially open up the possibility of training these tiny models on just a subset of the possible tasks you want to perform, eg: just on coding data or coding related data, etc. Likely meaning an even smaller/faster/more accurate tiny model that specialises in just predicting this subset of the larger models' abilities.

I think it would be interesting to explore both hinge loss with greedy and cross entropy loss with full distribution match error criteria. Small draft models may be within fine-tune training viability even on consumer grade hardware as they can be loaded F16 or F32 so back prop gradient training can work. The spec decode server already gives the needed platform for teacher-student setup. Needed steps might be

Create a starting draft model from either an existing checkpoint or a pruned checkpoint (most likely /2 prune max feasible)
load starting draft in F16 or F32 so it can be trained
load quantized teacher ("target")
Run an alignment dataset, such as HumanEvalX CPP for improving C++ prediction, or LAMBADA for improving language prediction
For hinge loss: Select N draft tokens >1 so probability of getting a miss is good. At the misses, compute the error feedback and backprop.
For cross entropy loss: Select 1 draft token. Compute cross entropy loss between target and draft logits and backprop for every generated token.

It would be an interesting experiment to start with Gemma 2 2B it checkpoint which can't speculate Gemma 2 9B right now and see if it can be improved for use as a speculator.

jukofyork · 2024-12-07T10:11:56Z

jukofyork
Dec 7, 2024
Collaborator

Just thinking about this some more and wondered how feasible it would be:

To have the draft model perform Beam Search?
To have the large model test multiple sequences in parallel instead of just a single sequence?

I'm thinking along the lines of using the draft model to create a tree (with probabilies on the edges and tokens in the nodes), and then use it to decide on a set of batches for the larger model to generate in parallel.

If we constrain the branching factor to a fixed k, then we can again use Hinge Loss to try to pick the top-k using k-vs-all.

I don't have a good idea of how the cost of batch processing grows though and it all depends on this.

8 replies

steampunque Dec 7, 2024
Author

The idea was to use Beam Search using the small model to generate a truncated tree of possibilities:

* The cost of this will be O(k^n) if done naively (k = branching factor / beam width, n = tree depth / sequence length).

* If you have access to lots of parallel compute this can be brought down to ~O(kn) as is the case for parallelised decision tree induction, etc.

The k sequence wavefronts in beam search need to be time aligned. At any time t each iteration of a width k beam search will have k beams all of which must be validated by the target. You could in theory independently advance the k beams forward a block of N sample time steps each, N>1, using the draft, but none of this advance has been validated by the target. Once you start validating each advance with the target (in parallel blocks of N), the first miss in each beam will define how far forward you can advance at that iteration. If one of the beams misses at the first token, which becomes more probable as k increases, the advance for all candidate sequences is limited to one token for all k beams. If this happens on the last beam evaluated, you are forced to throw out all the computation done on the previous beams to result in advancing one step for all beams at that iteration, including all the computation for the length N drafts of k beams and all the parallel validation of that advance by the target on all previous beams.

The algorithm I am talking about (and which I use) is O(k*n). At each time step it computes only the k most probable sequences in the search. I am not familiar with any O(k^n) beam search algorithm.

jukofyork Dec 7, 2024
Collaborator

The idea was to use Beam Search using the small model to generate a truncated tree of possibilities:
* The cost of this will be O(k^n) if done naively (k = branching factor / beam width, n = tree depth / sequence length).

* If you have access to lots of parallel compute this can be brought down to ~O(kn) as is the case for parallelised decision tree induction, etc.
The k sequence wavefronts in beam search need to be time aligned. At any time t each iteration of a width k beam search will have k beams all of which must be validated by the target. You could in theory independently advance the k beams forward a block of N sample time steps each, N>1, using the draft, but none of this advance has been validated by the target. Once you start validating each advance with the target (in parallel blocks of N), the first miss in each beam will define how far forward you can advance at that iteration. If one of the beams misses at the first token, which becomes more probable as k increases, the advance for all candidate sequences is limited to one token for all k beams. If this happens on the last beam evaluated, you are forced to throw out all the computation done on the previous beams to result in advancing one step for all beams at that iteration, including all the computation for the length N drafts of k beams and all the parallel validation of that advance by the target on all previous beams.

The algorithm I am talking about (and which I use) is O(k*n). At each time step it computes only the k most probable sequences in the search. I am not familiar with any O(k^n) beam search algorithm.

Yeah, you're correct: I was thinking about k-child breath-first search for decision tree induction which would be O(k^n), but if you have enough threads can be performed for all nodes of the same depth in parallel (so actually O(n) regardless of k).

I've probably done a horrible job explaining it, but I was thinking about a much simpler method and not really necessarily needing to use Beam Search:

Do all the expensive compute on the draft model to get a bunch of candidate sequences to test (such is the speculative : add tree-based sampling example #3624 PR).
Run all the sequences through the large model in parallel, storing the top logit for each token in each sequence.
Find the longest prefix where the top logit agrees and advance to here (and if all sequences don't match just advance 1 token from the large model's output).
Return to step 1.

You could use Beam Search, heuristic truncation as #3624, Best-first Search, or whatever to generate the tree in (1).

The only reason for wanting to do k-child breath-first search is that it would be just as easy to train a tiny model using hinge loss for predicting top-k as top-1, and in theory each depth / sequence length could be done in parallel (so O(n) if you can do all sequences at a certain depth in a batch on the smaller model), but my post kinda mixed up these two and I can see where the confusion comes from :)

jukofyork Dec 7, 2024
Collaborator

This does a much better job of explaining what I'm thinking:

https://catboost.ai/news/catboost-enables-fast-gradient-boosting-on-decision-trees-using-gpus

I don't really know the scaling laws about increasing the batch size for the tiny model like this, but if it were free (which it obviously isn't) then you could compute a tree of all k^n sequences for the same cost as a single auto-regressive generation of n tokens...

Using hinge loss might produce better top-k predictions at each node, but possibly harder to rank the sequences against each other.
Using cross-entropy loss might produce less optimal top-k predictions (for all the reasons I outlined above regarding calibration), but will make it easier to rank the sequences (ie: via the product of probabilities).

Once you have chosen your set of candidate sequences, it's really just a case of seeing if batch processing more than 1 sequence for the larger model in parallel actually gains anything (and using the code in #3624 might make some of the overlapping sequences much cheaper to compute?).

steampunque Dec 7, 2024
Author

This does a much better job of explaining what I'm thinking:

https://catboost.ai/news/catboost-enables-fast-gradient-boosting-on-decision-trees-using-gpus

Yeah thats a different beast than a straightforward beam search for inference. I don't understand all the details of what they are talking about but they do seem to need exponential gradient tree expansion (no leaf pruning) which is a beast. I just feel the target is going to blow out every one of those drafted trees as invalid though for even small values of depth "n" in speculation, particularly on a non pruned exponential tree. Hunch is spec is a complete bust for this type of algo.

You got me thinking though. I think it is possible to leverage spec in either 1) adaptive beam search (where decode stays single beam until it hits a low prob token, then transitions into beam search until the most likely path gets a token with prob above adapt threshold at which point it continues single beam again) or 2) front load beam search, where only N tokens at the start of inference are done with k beams, followed by selection of the best beam and single beam from thenon. In both of those case spec can be used effectively any time the algorithm is in single beam mode. Adaptive beam search might be able to assist in training by keeping the target on a higher probability path so the training updates are done with improved data.

jukofyork Dec 7, 2024
Collaborator

Yeah, there seem to be endless possibilities, but it all comes down to the relative cost of batching.

steampunque · 2024-12-08T18:22:00Z

steampunque
Dec 8, 2024
Author

Testing my server rebase for regressions after all the recent changes along with a few new "LRM" (Marco-o1 and QwQ) models and RPC mode also. The spec algo I implemented is greedy match with fixed size draft block and no probs computes.

Hardware: RTX4070
32B models use RPC to another 4070 rig with draft 100% on local GPU and target fully offload to both 4070s.

GOLDCOIN:

DRAFT	TARGET	draft tokens	TG	X	Note
Qwen2.5-0.5B-32k-Instruct IQ4_XS	Qwen2.5-7B-Instruct Q6_K	0	65.7	1
"	"	4	92.32	1.41
"	Marco-o1 Q6_K	0	65.6	1	wrong answer
"	"	4	84.4	1.29	wrong answer
"	Qwen2.5-14B-Instruct IQ4_XS	0	46.6	1
"	"	4	78.0	1.67
"	Qwen2.5-32B-Instruct IQ4_XS	0	17.8	1	RPC
"	"	4	32.9	1.84	RPC
"	QwQ-32B-Preview IQ4_XS	0	17.8	1	RPC
"	"	4	27.4	1.54	RPC
Llama-3.2-1B-Instruct IQ4_XS	Llama-3.1-8B-Instruct Q6_K	0	62.4	1
"	"	4	123.32	1.97
gemma-2-2b-it IQ4_XS	gemma-2-9b-it IQ4_XS	0	63.05	1
"	"	1	59.4	0.94

HUMANEVAL 1ST PROBLEM

DRAFT	TARGET	draft tokens	TG	X	Note
Qwen2.5-Coder-0.5B-32k-Instruct Q6_K	Qwen2.5-Coder-7B-Instruct Q6_K	0	65.0	1
"	"	8	115.0	1.75
"	Qwen2.5-Coder-14B-Instruct IQ4_XS	0	46.9	1
"	"	8	111.05	2.36
"	Qwen2.5-Coder-32B-Instruct IQ4_XS	0	13.1	1	RPC
"	"	8	37.7	2.9	RPC

3 replies

ggerganov Dec 8, 2024
Maintainer

How does your implementation compare to the spec approach on master? You can do --draft-max 4 --draft-min 0 --draft-p-min 0.0 to get a fixed draft size of 4.

steampunque Dec 8, 2024
Author

Unfortunately I can't run the master spec algorithm due to a large number of other changes I require in the server to support my shell based model loading and inference platform. I expect I will be slightly faster as I don't need any probs computed. Also I made sure to leverage sampling the next token from the target for free if I get hits on all the drafted tokens as the deepmind guys suggested, so I think I am at max theoretical performance for a simple fixed block length spec algorithm.

Draft quality along with generation content is going to dominate the spec performance in practice. Just a simple example of how huge this is showing 3X speedup on predicable content and no speedup on unpredictable content. I don't think unpredictable content will be typical in practice though unless forced to be pathologically unpredictable with the prompt.

Llama 3.1 8B drafted by Llama 3.2 1B:

speceasy.txt: generate the integers from 0 to 100 separated by spaces

TIME=1 NDRAFT=8 lm speceasy.txt
Here are the integers from 0 to 100 separated by spaces:

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
PP=315.31273290145043 TG=176.53274968555104

TIME=1 NDRAFT=0 lm speceasy.txt
Here are the integers from 0 to 100 separated by spaces:

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
PP=324.03452440569123 TG=61.82976605162269

spechard.txt: generate fifty random integers from 0 to 100 separated by spaces

TIME=1 NDRAFT=8 lm spechard.txt
Here are 50 random integers from 0 to 100 separated by spaces:

14 73 28 42 91 19 67 85 31 46 13 59 75 22 88 49 62 11 98 35 76 29 43 81 54 17 93 24 69 58 41 95 27 52 79 18 65 39 82 47 21 94 33 60 71 25 89 38 56 97 48 23 64 44 90 16 72 50 86 32 55 83 26 68 37 92 20 57 74 40 63 51 80 34 61 78 30 96 15 99 6 8 4 2 0 1 5 7 3 9 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80 82 84 86 88 90 92 94 96 98 100
PP=277.91875105729963 TG=63.09788256213333

TIME=1 NDRAFT=0 lm spechard.txt
Here are 50 random integers from 0 to 100 separated by spaces:

14 73 28 42 91 19 67 85 31 46 13 59 75 22 88 49 62 11 98 35 76 29 43 81 55 24 93 18 69 52 97 38 65 16 82 41 58 94 25 72 48 21 89 33 60 79 15 44 90 27 63 86 51 39 95 20 74 57 83 32 66 47 17 92 40 61 80 26 54 68 36 84 10 45 96 23 71 50 78 34 87 56 64 12 70 53 77 30 92 8 99 6 58 98 4 92 5 91 3 97 2 94 1 96 7 93 9
PP=505.0283255017347 TG=61.961273665165585

ggerganov Dec 8, 2024
Maintainer

Unfortunately I can't run the master spec algorithm due to a large number of other changes

Not sure about the Goldcoin and Humaneval tests, but you can run the integer prompts on master like this:

# start the server without SD
./bin/llama-server \
    -m ../models/qwen2.5-7b-coder-instruct/ggml-model-q8_0.gguf \
    --port 8011 --ctx-size 8192 -ngl 99 -fa

# start the server with SD
./bin/llama-server \
    -m ../models/qwen2.5-7b-coder-instruct/ggml-model-q8_0.gguf \
    -md ../models/qwen2.5-0.5b-coder-instruct/ggml-model-q4_0.gguf \
    --port 8011 --ctx-size 8192 -ngl 99 -ngld 99 -fa \
    --draft-max 4 --draft-min 0 --draft-p-min 0.0

# send request
curl --request POST --url http://localhost:8011/v1/chat/completions -H "Content-Type: application/json" -H "Authorization: Bearer no-key" -d '{ "messages": [{ "role": "system", "content": "You are a helpful assistant." }, { "role": "user", "content": "generate the integers from 0 to 100 separated by spaces" }], "top_k": 1, "samplers": ["top-k"]}' | jq

Draft quality along with generation content is going to dominate the spec performance in practice.

Yes, I agree.

Thomas-MMJ · 2025-02-06T01:03:18Z

Thomas-MMJ
Feb 6, 2025

@steampunque any update on this?

1 reply

steampunque Feb 6, 2025
Author

@steampunque any update on this?

I havent had a chance to dig into it yet but I still think its viable particularly for the new 5000 series from NV with PCI gen 5 combined with a supporting motherboard (though I think gen4 x16 is also viable, i.e. I think 4070 or any other x16 gen 4 card can potentially work). Originally I planned to prototype idea in vulkan due to simpler backend and I have also done some graphics related vulkan API programming but I found spec decode does not speed up vulkan at all. In order for the idea to work, decode a batch of 4 needs to be just slightly more time than decode a batch of 1 to net out a max possible speed up factor of ~3. I still do plan to dig into it but decided to hold off due in part to large amount of churn in ggml backend over last many months which I dont feel like rebasing patches to every other day.

Since the layer decoding is a serial process by nature all that is really needed is enough VRAM to hold one active compute layer and one transfer layer which are ping ponged. Hidden state and KV for each layer stay also in VRAM. Output layer may want to stay in VRAM also. Should completely obsolete need for RPC on models which can be speculated as long as host has enough main memory. Llama and Qwen both speculate extremely well but Gemma is an example which I found cannot speculate well so it is not universally applicable.

Deepseek R1 spec with latest version looks good so I'm still encouraged. I'm pretty sure the speculated Deepseek R1 32B is inferencing correctly since I recently benched it at 93.0 AVG on Hendryks math levels 1 to 5 using my spec algo also with the R1 1.5B draft as compared to the 94.3 published by Deepseek which is most likely not quantized at all.

Spec algo : greedy match with fixed size draft block and no probs computes.

Hardware: RTX4070
32B models use RPC to another 4070 rig with draft 100% on local GPU and target fully offload to both 4070s.

GOLDCOIN:

DRAFT	TARGET	draft tokens	TG	X	Note
Deepseek-R1-Distill-Qwen-1.5B IQ4_XS	Deepseek-R1-Distill-Qwen-32B IQ4_XS	0	17.9	1	RPC
"	"	4	39.6	2.21
"	Deepseek-R1-Distill-Qwen-14B IQ4_XS	0	46.3	1
"	"	4	61.4	1.33
"	Deepseek-R1-Distill-Qwen-7B Q6_K	0	65.2	1
"	"	4	76.6	1.17

jukofyork · 2025-04-08T19:31:36Z

jukofyork
Apr 8, 2025
Collaborator

What's the common wisdom on quantising the speculative model?

I can see one argument for not quantising it as the errors will accumulate over the sequence, but there is also the argument that a quantised model will generate tokens faster (due to being memory bound) and add less latency?

6 replies

steampunque Apr 8, 2025
Author

I run IQ4_XS because its both smaller and better than Q4_0 and doesn't seem to have much speed hit in the aggregate on cuda backend (there have been comments in past that its much slower on CPU dont know if that was ever resolved. If still true then Q4_0 would be my recommendation). The draft is typically 10x or more smaller than target so the time it spends on its token by token eval should be negligible compared to the time the target needs to eval the drafted batch of tokens.

There should never be an issue with cumulative noise for draft. The target keeps the draft history 100% accurate. If draft token is good, KV is still 100% accurate for next draft token. If draft token is bad, doesn't matter anyway since its getting tossed.

jukofyork Apr 8, 2025
Collaborator

I run IQ4_XS because its both smaller and better than Q4_0 and doesn't seem to have much speed hit in the aggregate on cuda backend (there have been comments in past that its much slower on CPU dont know if that was ever resolved. If still true then Q4_0 would be my recommendation).

Yeah, this was my thinking and what I thought was probably the recommended quant to use for CUDA:

https://huggingface.co/jukofyork/DeepSeek-R1-DRAFT-0.5B-v1.0-GGUF

but now I'm not so sure... :/

There should never be an issue with cumulative noise for draft. The target keeps the draft history 100% accurate. If draft token is good, KV is still 100% accurate for next draft token. If draft token is bad, doesn't matter anyway since its getting tossed.

It's the forward noise that accumulates, eg: misestimating the token at t can effect t+1 which is turn effects t+2 and so on. This probably isn't a huge problem if you are just sampling a fixed sequence of n tokens like your method here, but the probability thresholded version in llama.cpp may be more problematic due to the cumulative effect of these errors.

jukofyork Apr 8, 2025
Collaborator

Good question. After some experimenting, on a Mac, I settled on 32B Q8_0 target + 1.5B Q4_0. Cannot say confidently that this is the best setting, but I think it is pretty good. Would be interesting to do some quantitative evaluations though.

The Mac doesn't handle IQ4_XS all that well IIRC? What about the Q4_1, Q5_0 and Q5_1 quants?

I was surprised how much these improved over Q4_0:

https://huggingface.co/jukofyork/DeepSeek-R1-DRAFT-0.5B-v1.0-GGUF

especially when you used an imatrix for them!?

jukofyork Apr 8, 2025
Collaborator

Here's the full table:

Without `imatrix`

Link	Type	PPL	PPL vs BF16
DeepSeek-R1-DRAFT-0.5B-BF16.gguf	BF16	11.0267 ± 0.08658	---
DeepSeek-R1-DRAFT-0.5B-F16.gguf	F16	11.0294 ± 0.08660	+0.02%
DeepSeek-R1-DRAFT-0.5B-Q8_0.gguf	Q8_0	11.0450 ± 0.08675	+0.17%
DeepSeek-R1-DRAFT-0.5B-Q6_K.gguf	Q6_K	11.1231 ± 0.08732	+0.87%
DeepSeek-R1-DRAFT-0.5B-Q5_K_M.gguf	Q5_K_M	11.2727 ± 0.08902	+2.23%
DeepSeek-R1-DRAFT-0.5B-Q5_K_S.gguf	Q5_K_S	11.2803 ± 0.08888	+2.30%
DeepSeek-R1-DRAFT-0.5B-Q4_K_M.gguf	Q4_K_M	11.8171 ± 0.09319	+7.17%
DeepSeek-R1-DRAFT-0.5B-Q4_K_S.gguf	Q4_K_S	11.9379 ± 0.09380	+8.26%
DeepSeek-R1-DRAFT-0.5B-IQ4_NL.gguf	IQ4_NL	11.8497 ± 0.09445	+7.46%
DeepSeek-R1-DRAFT-0.5B-IQ4_XS.gguf	IQ4_XS	11.8600 ± 0.09464	+7.56%
DeepSeek-R1-DRAFT-0.5B-Q5_1.gguf	Q5_1	11.3624 ± 0.08926	+3.05%
DeepSeek-R1-DRAFT-0.5B-Q5_0.gguf	Q5_0	11.5217 ± 0.09124	+4.49%
DeepSeek-R1-DRAFT-0.5B-Q4_1.gguf	Q4_1	12.3107 ± 0.09765	+11.64%
DeepSeek-R1-DRAFT-0.5B-Q4_0.gguf	Q4_0	12.6168 ± 0.10021	+14.42%

With `imatrix`

Link	Type	PPL	PPL vs BF16
DeepSeek-R1-DRAFT-0.5B-iQ6_K.gguf	Q6_K	11.0940 ± 0.08714	+0.61%
DeepSeek-R1-DRAFT-0.5B-iQ5_K_M.gguf	Q5_K_M	11.2333 ± 0.08819	+1.87%
DeepSeek-R1-DRAFT-0.5B-iQ5_K_S.gguf	Q5_K_S	11.2238 ± 0.08798	+1.79%
DeepSeek-R1-DRAFT-0.5B-iQ4_K_M.gguf	Q4_K_M	11.6273 ± 0.09165	+5.45%
DeepSeek-R1-DRAFT-0.5B-iQ4_K_S.gguf	Q4_K_S	11.7004 ± 0.09225	+6.11%
DeepSeek-R1-DRAFT-0.5B-iIQ4_NL.gguf	IQ4_NL	11.6495 ± 0.09192	+5.65%
DeepSeek-R1-DRAFT-0.5B-iIQ4_XS.gguf	IQ4_XS	11.6924 ± 0.09246	+6.04%
DeepSeek-R1-DRAFT-0.5B-iQ5_1.gguf	Q5_1	11.2001 ± 0.08792	+1.57%
DeepSeek-R1-DRAFT-0.5B-iQ5_0.gguf	Q5_0	11.3579 ± 0.08961	+3.00%
DeepSeek-R1-DRAFT-0.5B-iQ4_1.gguf	Q4_1	11.7469 ± 0.09250	+6.53%
DeepSeek-R1-DRAFT-0.5B-iQ4_0.gguf	Q4_0	12.1546 ± 0.09619	+10.23%

jukofyork Apr 8, 2025
Collaborator

If IQ4_XS runs badly on Mac, then Q5_0 looks like a very competitive choice here?

Saying that Q6_K looks way more appealing unless that also performs a lot worse.

jukofyork · 2025-04-08T21:47:04Z

jukofyork
Apr 8, 2025
Collaborator

This problem seems to be a perfect target for Bayesian Filtering.

I also wonder if we should have two probability thresholds:

Decides to bother to do a batch at all (this seems to have quite a high cost associated with it).
If we have decided to run a batch, what the probability should drop below to decide it's not worth adding more to the batch (the cost associated with additional tokens seems much lower compared to the overhead of starting a batch at all).

At least for r1, there seems to be clear regime shifts where the tiny model starts to become much more "confidently wrong" (especially when writing code) and drafts long sequences of incorrect tokens. The Softmax Bottleneck likely makes it quite hard for the smaller model to accurately predict sequence probabilities for different regimes (eg: the reasoning section, natural language sections and code sections, will all have very different properties), due to having a much smaller hidden_dim:n_vocab ratio than the target model. Filtering would likely be able to catch this sort of thing reasonably quickly.

Finally, I think the estimation errors and batch costs may not be static throughout the generation:

At the start there is much less to go on for the predictions.
As the context increases there is much more for the draft model to use to help its predictions.
There may also be additional costs associated with a bad draft as the context increases (eg: larger KV-cache may start to cost more and more for a miss).

Again, filtering could use some second order terms to predict something akin to acceleration here.

This seems a useful collection:

https://github.com/hemingkx/SpeculativeDecodingPapers

0 replies

jukofyork · 2025-04-09T05:46:43Z

jukofyork
Apr 9, 2025
Collaborator

Just remembered this post:

https://old.reddit.com/r/LocalLLaMA/comments/1iu8f7s/speculative_decoding_can_identify_broken_quants/

and it uses a quant of the same model for speculative decoding:

and interestingly, this shows little difference between the levels of quants (ignoring the problems with Q3 he's trying to highlight).

1 reply

steampunque Apr 9, 2025
Author

I think it makes sense. Draft only needs to get two tokens right on average for doubling the speed of inference (with a draft size of 4) given a 100% correct context history, which is the majority of benefit that can be expected. However for running longer drafts such as 8 on code models I would stick with Q4 or above. In fact I would stick with Q4 or above in all cases anyway unless I really wanted to get the memory footprint of the draft smaller to open up space for weights and KV.

Djip007 · 2025-04-09T20:31:26Z

Djip007
Apr 9, 2025

Do you think it is possible to compute the 'KL Divergence'' of DRAFT model again TARGET one?

2 replies

steampunque Apr 9, 2025
Author

Cross entropy loss is easier and more relevant with speculation since its just the difference between the probability of the token decoded in the target with that same token in the draft i.e. loss=E{p_target(T)-p_draft(T)}) as an average where T is decoded token. If a model is self speculated that loss comes out zero by definition. If low entropy quants are used in draft a self speculated cross entropy loss would give a numeric measure of the loss from the quant. The ratio of the number of draft "hits" to total drafted tokens is another useful loss metric which can be computed for essentially free and is essentially what one is looking for in a good draft model so is probably the best metric to use to assess a DRAFT for use with TARGET.

Djip007 Apr 18, 2025

The ratio of the number of draft "hits" to total drafted tokens is another useful loss metric

sure it may be the best.
My idea was if it is possible is to use the actual llama-perplexity with something like:

llama-perplexity --kl-divergence-base <target.kld> --kl-divergence -m draft.gguf.

(I am not sure it work...)
If it can I expect that we at least can compare draft models and quantisation impact. As you say if this is the same model the KL Divergence will be 0.

I release that for now only the llama-server have draft support, not llama-bench / llama-cli, don't know how hard it is to add on this.

jukofyork · 2025-06-10T11:40:54Z

jukofyork
Jun 10, 2025
Collaborator

I've been looking into this the last couple of days and have identified --draft-p-min as the main cause of problems:

It's impossible to find a suitable value that works for a range of "draftability" extremes at the same time (eg: coding and normal use).
The optimisations for things like Flash Attention,ggml_mul_mat_id(), etc cause a very "jumpy" set of optimal thresholds.
It isn't considering the sequence of probabilities and treats 1*1*1*1*0.75 the same as 0.75*0.75*0.75*0.75*0.75, etc.

So as a proof of concept to see if we can improve on this, I've written this hacky code to test the potential for improvements:

First you have to run this script to calculate the sequence probability thresholds where a draft is +EV:

#!/bin/bash

max_pp=64
num_repeats=10

# Generate comma-separated PP and NPL lists
pp_list=$(seq -s ',' 1 $max_pp)
npl_list=$(printf '1%.0s,' $(seq 1 $num_repeats) | sed 's/,$//')

# Turn off NUMA balancing
echo 0 | sudo tee /proc/sys/kernel/numa_balancing > /dev/null

# Ask for permission to drop caches
read -p "Do you want to drop caches? (y/n) " -n 1 -r
echo    # Move to a new line
if [[ $REPLY =~ ^[Yy]$ ]]
then
    echo "Dropping caches..."
    echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
fi

# Temporary file for JSONL output
temp_file=$(mktemp)
jsonl_file=$(mktemp)

# Run the benchmark and save full output
echo "Running benchmark..."
CUDA_VISIBLE_DEVICES=1 ~/llama.cpp/build/bin/llama-batched-bench \
    --model ~/models/gguf/deepseek-v3-0324-Q4_K_XL.gguf \
    --n-gpu-layers 99 \
    --numa distribute \
    --threads 80 \
    --override-tensor exps=CPU \
    --flash-attn \
    -c 2048 -b 2048 -ub 512 \
    -npp "$pp_list" \
    -ntg 0 \
    -npl "$npl_list" \
    --output-format jsonl | tee "$temp_file"

# NOTE: The first result always seems to be bogus, so skip over it.
echo -n "Extracting results..."
count=$(grep '^{' "$temp_file" | tail -n +2 | tee "$jsonl_file" | wc -l)
echo " Done ($count results extracted)"

# Process the extracted JSONL
jq -s --raw-output '
    # Calculate max_pp from actual data
    (map(.pp) | max) as $max_pp |

    # Create dictionary with {sum, count} for each PP
    reduce .[] as $item (
        {};
        ($item.pp | tostring) as $pp |
        .[$pp].sum = (.[$pp].sum + $item.speed) |
        .[$pp].count = (.[$pp].count + 1)
    ) |

    # Calculate averages in natural PP order
    [range(1; $max_pp + 1) as $pp |
        (.[($pp|tostring)]).sum / .[($pp|tostring)].count
    ] as $averages |

    # Normalize relative to PP=1
    $averages[0] as $base |
    [1] + [$averages[1:][] | $base / . ] |

    # Format with 3 decimal places
    map(. * 1000 | round | . / 1000) |
    "const std::vector<double> p_mins = { " + join(", ") + " };"
' "$jsonl_file"

# Clean up
rm "$temp_file" "$jsonl_file"

(obviously you will need to change the parameters you expect to use your own model with...)

which will output something that looks like this:

const std::vector<double> p_mins = { 1, 1.554, 1.046, 0.837, 0.702, 0.609, 0.548, 0.502, 0.471, 0.444, 0.413, 0.393, 0.378, 0.365, 0.352, 0.34, 0.333, 0.325, 0.316, 0.309, 0.303, 0.298, 0.291, 0.285, 0.283, 0.279, 0.274, 0.27, 0.269, 0.265, 0.262, 0.26, 0.258, 0.255, 0.252, 0.25, 0.245, 0.243, 0.241, 0.24, 0.238, 0.237, 0.235, 0.234, 0.232, 0.231, 0.231, 0.23, 0.229, 0.228, 0.227, 0.226, 0.225, 0.225, 0.224, 0.223, 0.222, 0.221, 0.221, 0.22, 0.219, 0.219, 0.219, 0.218 };

Any value of p_min[] which is greater than 1 means that this draft batch size is always -EV (eg: for my model we should set --draft-min = 3).
There are diminishing returns for using more and more draft tokens (mainly because my model is running the MoE experts on the CPU... For other models and fast GPUs this might not be the case). Since I have used max_pp = 64 here I will need to run with --draft-max 64.
It should be quite clear from this that there is no universal -draft-p-min that can possibly work for all use cases! By using a fixed value of say 0.75 you are hoping that the sequence decays something like 0.75^n, but it's clearly nothing at all like this...

So after the code above has been run, you need to replace the code in speculative.cpp with this, using the p_mins generated from the above script for your own model:

llama.cpp/common/speculative.cpp

Line 242 in f470bc3

// sample n_draft tokens from the draft model

    // ??? CAN THIS EVER BE ANYTHING BUT 0 HERE ???
    printf("%d ", (int) result.size());

    // calculated empirically using llama-batch-bench
    const std::vector<double> p_mins = { 1, 1.554, 1.046, 0.837, 0.702, 0.609, 0.548, 0.502, 0.471, 0.444, 0.413, 0.393, 0.378, 0.365, 0.352, 0.34, 0.333, 0.325, 0.316, 0.309, 0.303, 0.298, 0.291, 0.285, 0.283, 0.279, 0.274, 0.27, 0.269, 0.265, 0.262, 0.26, 0.258, 0.255, 0.252, 0.25, 0.245, 0.243, 0.241, 0.24, 0.238, 0.237, 0.235, 0.234, 0.232, 0.231, 0.231, 0.23, 0.229, 0.228, 0.227, 0.226, 0.225, 0.225, 0.224, 0.223, 0.222, 0.221, 0.221, 0.22, 0.219, 0.219, 0.219, 0.218 };

    // used to re-calibrate the probabilities if needed (ie: 1: none, <1: sharpen, >1: flatten)
    // note: also acts as a minimum edge threshold, so likely wants to be slightly >1 even for a well-calibrated draft model...
    const float recalibration_power = 1.02f;

    // this allows the heuristic to break earlier than testing all against p_mins.back()
    // note: assumes that strings of close to p=1.0 tokens occur rarely... can be optimised by looking for the largest gap printout with max_lookahead=MAX_INT
    const int max_lookahead = 5;

    int best_draft_size = 0;

    float sequence_p = 1.0;

    for (int i = 0; i < params.n_draft; ++i) {
        common_batch_clear(batch);

        common_sampler_sample(smpl, ctx, 0, true);

        const auto * cur_p = common_sampler_get_candidates(smpl);

        for (int k = 0; k < std::min(3, (int) cur_p->size); ++k) {
            LOG_DBG(" - draft candidate %3d, pos %3d: %6d (%8.3f) '%s'\n",
                    k, i, cur_p->data[k].id, cur_p->data[k].p, common_token_to_piece(ctx, cur_p->data[k].id).c_str());
        }

        // add drafted token for each sequence
        const llama_token id = cur_p->data[0].id;

        common_sampler_accept(smpl, id, true);

        result.push_back(id);

        if (params.n_draft <= (int) result.size()) {
            best_draft_size = result.size();
            break;
        }

        // re-calibrate if necessary
        sequence_p *= pow(cur_p->data[0].p, recalibration_power);

        // only collect draft tokens with positive expected values
        if (sequence_p >= p_mins[(int) result.size()]) {
            best_draft_size = result.size();
        }

        // break as soon as we are fairly confident we can't improve on the best found so far
        if (sequence_p < p_mins[std::min(best_draft_size + max_lookahead, (int) p_mins.size() - 1)]) {
            break;
        }

        common_batch_add(batch, id, n_past + i + 1, { 0 }, true);

        // evaluate the drafted tokens on the draft model
        llama_decode(ctx, batch);

        prompt.push_back(id);
    }

    printf("%d %d [%d] %.3f\n", (int) result.size(), best_draft_size, (best_draft_size > 0 ? (int) result.size() - best_draft_size : 0), sequence_p);

    // note: this truncates to the token *after* the last that was seen to be +EV!
    result.resize(best_draft_size);

    return result;

Then test on different extremes of prompts:

Low "draftability" prompts like "Explain the rules of chess to me".
Medium "draftability" prompts like "Write me some code in C++ using GSL to optimise multinomial logistic regression".
High "draftability" prompts like "Refactor the god-awful hacky script from above: <script>".

and so on...

It seems the above method now works on all these different extremes, and so long as the draft model is cheap to run (eg: 0.5B or so):

Speculative sampling never increases the tokens/s like is does for a badly chosen --draft-p-min with the current algorithm.
Seems to be universally better (eg: I can now get nearly 12 tokens/s for high "draftability" prompts compared to 8-9 tops since the recent MLA flash attention changes that caused that big "jump" in the small-batch timings).

So the question now is:

Can we compute the p_mins[] vector at runtime? Using either an simple (or exponential) moving average or some more complex Bayesian method like a Kalman filter, etc.
Can we adjust recalibration_power on the fly? Again using something simple like online Platt scaling or some other simple Bayesian method.

If we don't want to calculate on the fly, then perhaps we could make it so that --draft-p-min can take a vector of parameters like my p_mins[] and if needed repeat the final value for when |p_mins| < max_pp, etc.

Again, sorry the code is such a hacky mess, but I just wanted to see if there was any potential in this method before proceeding... It does appear to be quite a significant improvement and a lot less complex than the old PR that was removed... I am still not sure how the "reuse" code works above or if it can ever get to the // ??? CAN THIS EVER BE ANYTHING BUT 0 HERE ??? section - if this is the case then it will probably need some changes to account for this, as the code assumes this never happens and thus initialises float sequence_p = 1.0, etc.

@ggerganov @steampunque Is this worth trying to tidy up and make a proper PR out of? Does it universally improve on the existing algorithm for other models and models run on other back-ends like the Mac?

0 replies

jukofyork · 2025-06-10T15:25:28Z

jukofyork
Jun 10, 2025
Collaborator

It's no better or worse for qwen-2.5-coder:32b:

[1.0, 0.525, 0.351, 0.275, 0.239, 0.236, 0.224, 0.216, 0.137, 0.123, 0.111, 0.102, 0.095, 0.088, 0.082, 0.077, 0.075, 0.071, 0.067, 0.064, 0.061, 0.058, 0.056, 0.054, 0.051, 0.049, 0.047, 0.046, 0.044, 0.043, 0.041, 0.04, 0.041, 0.04, 0.039, 0.037, 0.037, 0.036, 0.035, 0.034, 0.033, 0.033, 0.032, 0.031, 0.031, 0.03, 0.03, 0.029, 0.031, 0.03, 0.03, 0.029, 0.029, 0.029, 0.028, 0.028, 0.027, 0.027, 0.027, 0.026, 0.026, 0.026, 0.025, 0.025]

but interestingly it gets the same performance (when drafted by qwen-2.5-coder:0.5b) as the old algorithm, but using a far lower draft acceptance rate, so there is potentially some gains to be had by altering the recalibration_power, etc.

For the deepseek V3/R1 models, I've managed to fit a near perfect rational approximation:

but nothing will fit the qwen model due to those strange steps caused by flash-attention kernels...

It's certainly interesting and I think probably worth looking into more.

0 replies

jukofyork · 2025-06-10T15:37:37Z

jukofyork
Jun 10, 2025
Collaborator

Here is the version that uses the rational approximation for V3/R1 and can (in theory) use any value of --draft-max:

    // ??? CAN THIS EVER BE ANYTHING BUT 0 HERE ???
    printf("%d ", (int) result.size());

    static constexpr auto rationalFit = [](int x, double a = 2.6288, double b = 3.996, double c = 0.1761) {
        return (x < 3) ? 1.0 : (a / (static_cast<double>(x - 3) + b) + c);
    };
        
    // used to re-calibrate the probabilities if needed (ie: 1: none, <1: sharpen, >1: flatten)
    // note: also acts as a minimum edge threshold, so likely wants to be slightly >1 even for a well-calibrated draft model...
    const float recalibration_power = 1.02f;
            
    // this allows the heuristic to break earlier than testing all against p_mins.back()
    // note: assumes that strings of close to p=1.0 tokens occur rarely... can be optimised by looking for the largest gap printout with max_lookahead=MAX_INT
    const int max_lookahead = 5;
        
    int best_draft_size = 0;
        
    float sequence_p = 1.0;
        
    for (int i = 0; i < params.n_draft; ++i) {
        common_batch_clear(batch);
        
        common_sampler_sample(smpl, ctx, 0, true);
        
        const auto * cur_p = common_sampler_get_candidates(smpl);

        for (int k = 0; k < std::min(3, (int) cur_p->size); ++k) {
            LOG_DBG(" - draft candidate %3d, pos %3d: %6d (%8.3f) '%s'\n",
                    k, i, cur_p->data[k].id, cur_p->data[k].p, common_token_to_piece(ctx, cur_p->data[k].id).c_str());
        }

        // add drafted token for each sequence
        const llama_token id = cur_p->data[0].id;

        common_sampler_accept(smpl, id, true);

        result.push_back(id);

        if (params.n_draft <= (int) result.size()) {
            best_draft_size = result.size();
            break;
        }

        // re-calibrate if necessary
        sequence_p *= pow(cur_p->data[0].p, recalibration_power);

        // only collect draft tokens with positive expected values
        if (sequence_p >= rationalFit((int) result.size())) {
            best_draft_size = result.size();
        }

        // break as soon as we are fairly confident we can't improve on the best found so far
        if (sequence_p < rationalFit(best_draft_size + max_lookahead)) {
            break;
        }

        common_batch_add(batch, id, n_past + i + 1, { 0 }, true);

        // evaluate the drafted tokens on the draft model
        llama_decode(ctx, batch);

        prompt.push_back(id);

and some hacky python code to try to fit the 3 different classes of approximations:

import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

# Your data points
y_data = np.array([0.837, 0.702, 0.609, 0.548, 0.502, 0.471, 0.444,
                   0.413, 0.393, 0.378, 0.365, 0.352, 0.34, 0.333, 0.325, 0.316, 0.309,
                   0.303, 0.298, 0.291, 0.285, 0.283, 0.279, 0.274, 0.27, 0.269, 0.265,
                   0.262, 0.26, 0.258, 0.255, 0.252, 0.25, 0.245, 0.243, 0.241, 0.24,
                   0.238, 0.237, 0.235, 0.234, 0.232, 0.231, 0.231, 0.23, 0.229, 0.228,
                   0.227, 0.226, 0.225, 0.225, 0.224, 0.223, 0.222, 0.221, 0.221, 0.22,
                   0.219, 0.219, 0.219, 0.218])

x_data = np.arange(len(y_data))

# Define some candidate functions
def exp_decay(x, a, b, c):
    return a * np.exp(-b * x) + c

def power_decay(x, a, b, c):
    return a * (x + 1)**(-b) + c

def rational(x, a, b, c):
    return a / (x + b) + c

# Fit each function
try:
    popt_exp, _ = curve_fit(exp_decay, x_data, y_data, p0=[0.5, 0.1, 0.2])
    popt_power, _ = curve_fit(power_decay, x_data, y_data, p0=[0.5, 0.5, 0.2])
    popt_rat, _ = curve_fit(rational, x_data, y_data, p0=[0.5, 1, 0.2])
    
    # Plot results
    plt.figure(figsize=(10, 6))
    plt.scatter(x_data, y_data, label='Data')
    plt.plot(x_data, exp_decay(x_data, *popt_exp), label=f'Exponential: {popt_exp.round(4)}')
    plt.plot(x_data, power_decay(x_data, *popt_power), label=f'Power: {popt_power.round(4)}')
    plt.plot(x_data, rational(x_data, *popt_rat), label=f'Rational: {popt_rat.round(4)}')
    plt.legend()
    plt.xlabel('Index')
    plt.ylabel('Value')
    plt.title('Function Fitting Comparison')
    plt.show()
    
    # Calculate and print RMSE for each fit
    def rmse(y_true, y_pred):
        return np.sqrt(np.mean((y_true - y_pred)**2))
    
    print("RMSE for exponential fit:", rmse(y_data, exp_decay(x_data, *popt_exp)))
    print("RMSE for power fit:", rmse(y_data, power_decay(x_data, *popt_power)))
    print("RMSE for rational fit:", rmse(y_data, rational(x_data, *popt_rat)))
    
except Exception as e:
    print("Error during fitting:", e)

(you will possibly need to fudge the initial values that are >1 to get a good fit like I did here...)

2 replies

steampunque Jun 10, 2025
Author

I think there is a hidden intractable problem here, which is all generations even on the same model are not alike. Approximating generation classes to 3 categorites of simple, average, and hard. Spec is not going to work on hard no matter what, the draft is always going to be getting maybe 1 token right if its lucky and often 0. On simple and average spec can work but they are going to have way different statistics. For my use cases I decided I am most often not going to be either easy or hard and then its just simpler to run with a fixed block size of 4 for the average gen case on most general models and a block size of 8 on code models. I avoid the need to compute any probs on draft and I know I am always running with a certain set of batch processing kernels and never have to worry about a bigger batch kernel suddenly kicking in and going backwards in efficiency with an adapted block length.

jukofyork Jun 11, 2025
Collaborator

I think there is a hidden intractable problem here, which is all generations even on the same model are not alike. Approximating generation classes to 3 categorites of simple, average, and hard. Spec is not going to work on hard no matter what, the draft is always going to be getting maybe 1 token right if its lucky and often 0.

Yeah, it would be nice not have to unload the model and reduce the cost of these failed drafts down though. I wonder if some much simpler model could predict this that might have almost no cost?

On simple and average spec can work but they are going to have way different statistics. For my use cases I decided I am most often not going to be either easy or hard and then its just simpler to run with a fixed block size of 4 for the average gen case on most general models and a block size of 8 on code models. I avoid the need to compute any probs on draft and I know I am always running with a certain set of batch processing kernels and never have to worry about a bigger batch kernel suddenly kicking in and going backwards in efficiency with an adapted block length.

My use case often seems to flip between hard/medium and ridiculously easy as I make the LLM rewrite the whole response after it creates some baffling "// Old code here" web, or asking it to refactor things, etc. Without the dynamic draft length for this I think it would hardly be worth using the draft model at all :/

jukofyork · 2025-06-11T11:46:10Z

jukofyork
Jun 11, 2025
Collaborator

After sleeping on this, then I think it's basically gonna be far too much hassle to try to implement the ideas from these tests, but the key point they shows is that the marginal cost of adding 1 more token to a batch should somehow be taken into account.

0 replies

jukofyork · 2025-06-11T18:05:07Z

jukofyork
Jun 11, 2025
Collaborator

~~I tidied it up a little bit and have made a draft PR for others to try:~~

~~14132~~

0 replies

jukofyork · 2025-06-13T19:38:03Z

jukofyork
Jun 13, 2025
Collaborator

I'm getting some really good results now:

I've added code to try to account for the extra PP overhead rather than just start with PP=0 (doesn't seem to really make a lot of difference using 1024 and 2048 sized prompts though...).
I've added the actual speed ratio of the target and draft model to the cost calculation.
Hopefully it should be clearer what I am doing now that I have spelled out all the steps in the code to calculate the EVs instead of just generating the break-even sequence probabilities...
The only tunable parameters are: MAX_BATCH_SIZE (which then has to become --draft-max) and max_lookahead in the C++ code (but 5 seems fairly optimal on the different models I have tried so far).

To use this, first you have to run this script to generate the timing data:

#!/bin/bash

# Environment variables
export CUDA_VISIBLE_DEVICES=0

# Configuration variables
BENCH_EXE="/home/juk/llama.cpp/build/bin/llama-batched-bench"

MODEL_PATH="/home/juk/models/gguf/qwen-2.5-coder-Q6_K.gguf"
#MODEL_PATH="/Users/juk/models/gguf/qwen-2.5-coder-Q6_K.gguf"
#MODEL_PATH="/home/juk/models/gguf/draft_models/Qwen2.5-Coder-DRAFT-0.6B-Q4_0.gguf"
#MODEL_PATH="/Users/juk/models/gguf/draft_models/Qwen2.5-Coder-DRAFT-0.6B-Q4_0.gguf"

# Benchmark parameters
PROMPT_SIZE=1024
MAX_BATCH_SIZE=32
NUM_SAMPLES=5

# Model-specific parameters
MODEL_PARAMS="--n-gpu-layers 99 \
              --flash-attn"

# Generate comma-separated PROMPT_SIZE and BATCH_SIZE lists (NOTE: Process 2x before 1x to help with warmup)
PROMPT_SIZE_LIST="$((PROMPT_SIZE * 2)),${PROMPT_SIZE}"
BATCH_SIZE_LIST=$(printf "%s," $(for i in $(seq 1 $NUM_SAMPLES); do seq 1 $MAX_BATCH_SIZE; done) | sed 's/,$//')

# Output files
LOG_FILE="benchmark_results.log"
JSONL_PP1X="results_pp1x.jsonl"
JSONL_PP2X="results_pp2x.jsonl"

# Clean previous files
rm -f "$LOG_FILE" "$JSONL_PP1X" "$JSONL_PP2X"

# Run the benchmark
echo "- Running benchmark..."
$BENCH_EXE \
    --model "$MODEL_PATH" \
    $MODEL_PARAMS \
    --ctx_size "$((PROMPT_SIZE * 2 + MAX_BATCH_SIZE))" \
    -pps \
    -npp "$PROMPT_SIZE_LIST" \
    -npl "$BATCH_SIZE_LIST" \
    -ntg 1 \
    --output-format jsonl | tee "$LOG_FILE"

# Extract JSONL lines from the log (NOTE: Skip first set of samples as seems to need a warmup to get accurate stats)
echo -n "- Extracting results..."
grep '^{' "$LOG_FILE" | tail -n "+$((NUM_SAMPLES + 1))" | grep "\"pp\": ${PROMPT_SIZE}" > "$JSONL_PP1X"
grep '^{' "$LOG_FILE" | tail -n "+$((NUM_SAMPLES + 1))"| grep "\"pp\": $((PROMPT_SIZE * 2))" > "$JSONL_PP2X"
COUNT1=$(wc -l < "$JSONL_PP1X")
COUNT2=$(wc -l < "$JSONL_PP2X")
echo " Done ($COUNT1 1xPP results, $COUNT2 2xPP results)"

# Function to extract values as bash array
extract_values() {
    local jsonl_file=$1
    
    jq -s --raw-output '
        (map(.pl) | max) as $max_pl |
        reduce .[] as $item (
            {};
            ($item.pl | tostring) as $pl |
            .[$pl].sum = (.[$pl].sum + $item.speed_tg) |
            .[$pl].count = (.[$pl].count + 1)
        ) |
        [range(1; $max_pl + 1) as $pl |
            (.[($pl|tostring)]).sum / .[($pl|tostring)].count
        ] |
        map(. * 1000 | round | . / 1000) |
        join(" ")
    ' "$jsonl_file"
}

# Function to process JSONL and output python/C++ vectors
process_results() {
    local jsonl_file_1x=$1
    local jsonl_file_2x=$2
    
    # Extract values as arrays
    local values_1x=($(extract_values "$jsonl_file_1x"))
    local values_2x=($(extract_values "$jsonl_file_2x"))
       
    # Solve equations: speed_tg = 2*v1x - v2x, speed_pp = v2x - v1x
    local speed_tg=()
    local speed_pp=()
    
    for i in "${!values_1x[@]}"; do
        if [[ -n "${values_2x[i]}" ]]; then
            local v1x="${values_1x[i]}"
            local v2x="${values_2x[i]}"
            
            # Calculate using awk for floating point arithmetic
            local tg=$(awk "BEGIN {printf \"%.3f\", 2 * $v1x - $v2x}")
            local pp=$(awk "BEGIN {printf \"%.3f\", $v1x - $v2x}")
            
            speed_tg+=("$tg")
            speed_pp+=("$pp")
        fi
    done

    # Output raw vectors
    local raw_1x_str=$(IFS=', '; echo "${values_1x[*]}")
    local raw_2x_str=$(IFS=', '; echo "${values_2x[*]}")
    echo "----------------------------------------"
    echo "tg_at_pp_1x = np.array([$raw_1x_str])"
    echo "tg_at_pp_2x = np.array([$raw_2x_str])"

    # Output solution vectors
    local tg_str=$(IFS=', '; echo "${speed_tg[*]}")
    local pp_str=$(IFS=', '; echo "${speed_pp[*]}")
    echo "pp_overhead = np.array([$pp_str])"
    echo "tg_at_pp_0 = np.array([$tg_str])"
    echo "----------------------------------------"
    echo "const std::vector<float> model_batch_speeds = { $tg_str };"
    echo "----------------------------------------"
}

# Process the extracted JSONL
echo "- Generating data vectors:"
process_results "$JSONL_PP1X" "$JSONL_PP2X"

# Clean up log, but leave JSONL files
rm "$LOG_FILE"

Then paste the generated vector of model_batch_speeds in this code:

    // *****************************************************
    // *** The main model's tokens/s for each batch size ***
    // *****************************************************

    // - RTX 5000 Ada
    /*
    const std::vector<float> model_batch_speeds = {
            17.956,36.097,53.640,69.224,78.671,81.642,81.894,84.374,
            134.441,148.602,162.406,176.664,190.502,204.359,218.614,232.686,
            238.893,251.720,265.238,278.409,292.450,305.247,318.849,331.799,
            347.519,361.104,373.258,386.453,399.033,411.275,422.836,438.136
    };
    */

    // - M1 Ultra 64GB
    const std::vector<float> model_batch_speeds = {
            14.178,15.084,15.667,26.851,28.725,29.529,28.106,31.060,
            23.468,26.027,28.514,31.074,33.542,36.004,38.530,41.095,
            43.549,45.945,48.379,52.055,54.575,57.107,59.433,61.969,
            64.455,67.023,69.414,71.868,74.377,76.834,79.221,81.995
    };

    // ******************************************************
    // *** The draft model's tokens/s for batch size of 1 ***
    // ******************************************************

    // - RTX 5000 Ada
    /*
    const float draft_tg_speed = 353.540;
    */
    
    // - M1 Ultra 64GB
    const float draft_tg_speed = 231.948;
    
    // ==========================================================================================================
        
    // The estimated lookahead cost per token, in terms of the main model's token generation speed
    const float lookahead_cost_estimate = model_batch_speeds[0] / draft_tg_speed;
    
    // The maximum lookahead relative to the best we have seen so far
    const int max_lookahead = 5;
    
    // ==========================================================================================================
    
    // The best draft size and its associated expected value so far (ie: init to the the main model's TG speed)
    int best_draft_size = 0;
    float best_draft_ev = model_batch_speeds[0];
    
    // The current sequence probability, as predicted by the draft model
    float current_sequence_p = 1.0;
   
    GGML_ASSERT((int) model_batch_speeds.size() == params.n_draft);

    for (int i = 0; i < params.n_draft; ++i) {
        
        // Sample a draft token
        common_batch_clear(batch);
        common_sampler_sample(smpl, ctx, 0, true);
        const auto * cur_p = common_sampler_get_candidates(smpl);
        const llama_token id = cur_p->data[0].id;
        common_sampler_accept(smpl, id, true);

        // Save the sampled token id
        result.push_back(id);
        
        // Get the current draft size we are looking at
        const int current_draft_size = result.size();

        // If we have enough tokens already, then stop
        if (current_draft_size >= params.n_draft) {
            best_draft_size = result.size();
            break;
        }

        // Update the sequence probability using the sampled token's predicted probability
        current_sequence_p *= cur_p->data[0].p;

        // Calculate the expected value (in terms of the main model's tokens/s) for this sequence
        const float current_sequence_ev = current_sequence_p * model_batch_speeds[current_draft_size];
        
        // Is this token clearly +EV compared to what we have so far?
        if (current_sequence_ev > best_draft_ev) {
            best_draft_size = current_draft_size;
            best_draft_ev = current_sequence_ev;
        }

        // Otherwise we have to decide if we might see a +EV draft in the future, or stop now
        else {
            
            bool stop_now = true;
            
            for (int j = current_draft_size + 1; j < (int) model_batch_speeds.size(); j++) {

                // Don't bother looking too far relative to the best we have seen so far
                if (j - best_draft_size > max_lookahead) {
                    break;
                }
                
                // This approximates the cost of the lookahead in terms of the model's tokens/s.
                const float lookahead_cost = lookahead_cost_estimate * (float) (j - current_draft_size);

                // Calculate the discounted potential EV of this lookahead depth
                // NOTE: This assumes the worst case of the draft predicting p=1.0 for all future tokens...
                const float potential_ev = current_sequence_p * (model_batch_speeds[j] - lookahead_cost);

                // Is this lookahead depth potentially +EV compared to the best we have so far?
                if (potential_ev > best_draft_ev) {
                    stop_now = false;
                    break;
                }
            }
            
            // If no chance to improve the EV, then stop now
            if (stop_now) {
                break;
            }

        }

        common_batch_add(batch, id, n_past + i + 1, { 0 }, true);

        // evaluate the drafted tokens on the draft model
        llama_decode(ctx, batch);

        prompt.push_back(id);
    }   
    
    // NOTE: This truncates to the token *after* the last that was seen to be +EV!
    //       - The reason for this is because the main model will generate all tokens
    //         at this final position, and if we get there successfully; we can use
    //         whichever it finds regardless (ie: this last token isn't part of the draft).
    result.resize(best_draft_size);

    return result;

and also find the draft model's token-generation speed (using the above script or just running it on its own) and set draft_tg_speed to this (or if you can't be bothered finding this; you can just guess how much faster the draft model is and fudge the lookahead_cost_estimate variable...).

This code then needs to replace all the code below the // sample n_draft tokens from the draft model here in speculative.cpp:

llama.cpp/common/speculative.cpp

Line 242 in f470bc3

// sample n_draft tokens from the draft model

Then make sure to run llama-server with:

--draft-min 1 (or more if you have a graph like my deepseek-v3 one [see previous discussions above]).
--draft-max 32 (or whatever you set MAX_BATCH_SIZE in the shell script).

On the RTX 5000 Ada and qwen-2.5-coder-Q6_K.gguf drafted by Qwen2.5-Coder-DRAFT-0.6B-Q4_0.gguf, I'm getting over 4x the tokens per second for "high draftability" refactoring prompts (~80 tokens/s vs 18 tokens/s undrafted).

On the M1 Ultra 64GB and qwen-2.5-coder-Q6_K.gguf drafted by Qwen2.5-Coder-DRAFT-0.6B-Q4_0.gguf, I'm getting just under 2x the tokens per second for "high draftability" refactoring prompts (~24 tokens/s vs 14 tokens/s undrafted).

For "low draftability" prompts like "Tell me the rules of chess" I'm still getting ~1.3x for the RTX 5000 Ada and still 1x for the M1 Ultra 64GB!

To put this in perceptive: The absolute very best hand-tuned settings using the existing algorithm on the RTX 5000 Ada with qwen-2.5-coder gives around 55 tokens/s for "high draftability" refactoring prompts, but these same settings are useless for "low draftability" prompts and the model has to be unloaded/reloaded with different/no draft settings...

It will take me all night to generate the data for deepseek-v3, but will report back tomorrow when it's done...

I really think this sort of "profile guided drafting" could be huge! I've no idea if I can make it into a proper PR yet though...

0 replies

steampunque · 2025-06-13T22:31:42Z

steampunque
Jun 13, 2025
Author

I reran some spec benches on Qwen2.5 32B coder. It looks like something in the backend got a lot faster than it used to be (most likely combined RPC + CUDA optimizations). Also I think there is potentially a simple heuristic which can be used to adapt spec block length without computing any probs as I will discuss after the results.

HW: 4070 + 1 RPC 4070

HUMANEVAL 1ST PROBLEM:

generate python code for the described function header:

from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """ Check if in given list of numbers, are any two numbers closer to each other than
    given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """

DEFS : DN = drafted totkens DA = accepted tokens TG= Token gen t/s

DRAFT	TARGET	draft tokens	DN	DA	DA/DN	TG	X	Note
Qwen2.5-Coder-0.5B-32k-Instruct IQ4_XS	Qwen2.5-Coder-32B-Instruct IQ4_XS	0	0	0	-	18.7	1	RPC
"	"	8	512	348	0.68	64.6	3.5	"
"	"	10	560	356	0.63	68.8	3.7	"
"	"	12	624	360	0.58	70.4	3.8	"
"	"	14	686	363	0.53	71.1	3.8	"
"	"	15	705	365	0.52	71.1	3.8	"
"	"	16	752	360	0.48	66.6	3.6	"

The interesting result from this table is the DA/DN ratio. When this ratio is >>0.5, it suggests the block size is too short and higher speed can be obtained by increasing it. As the block size is increased, DA/DN monotonically decreases. At the critical point of below DA/DN = 0.5, diminishing returns is found. This suggests a very simple heuristic of monitoring DA/DA during gen and boosting spec block length until its just above 0.5 to get optimal spec speed. No probs compute required, just monitoring the draft accept ratio.

Now test a harder spec on the chess prompt with the same model:

Explain the rules of chess to me

DRAFT	TARGET	draft tokens	DN	DA	DA/DN	TG	X	Note
Qwen2.5-Coder-0.5B-32k-Instruct IQ4_XS	Qwen2.5-Coder-32B-Instruct IQ4_XS	0	0	0	-	18.5	1	RPC
"	"	2	814	374	0.46	26.3	1.4	"
"	"	3	1086	419	0.39	26.7	1.4	"
"	"	4	1276	419	0.33	29.0	1.6	"
"	"	5	1550	428	0.28	25.6	1.4	"
"	"	6	1824	434	0.24	23.5	1.3	"

Here diminishing returns occurs at DA/DN < 0.33. So a threshold of DA/DN 0.5 would not work here since it would never increase block length above 2, and some kind of scheme which modifies the threshold as a function of the block length is needed (simple: go from 0.3 up to 0.5 as block length varies from 1 to 8). However I would not trust such a heuristic in practice and still feel more comfortable running with either fixed 4 or 8. I am only 5 t/s below max at block length 8 on code and I think block length 4 is a good general purpose value for my particular spec algorithm. Since I would not be running Qwen coder on the chess prompt but on a general model I don't have to worry about specifying the draft length since it defaults to 4 for me on all general models.

8 replies

jukofyork Jun 15, 2025
Collaborator

I wonder if it's the RPC stuff has improved? Last time I tried using RPC it had so much latency between the nodes (even using 10gbit ethernet), that it never gained anything. I might have to try it again now.

steampunque Jun 15, 2025
Author

I wonder if it's the RPC stuff has improved? Last time I tried using RPC it had so much latency between the nodes (even using 10gbit ethernet), that it never gained anything. I might have to try it again now.

@jukofyork Yes, for sure it has improved. There were changes made to avoid unecessary waits. I am only using 1GbE. I believe part of the "magic" going on is that with large speculation blocks a ton of tokens get processed in parallel with only one hidden state transfer across the network per layer. Thus the hidden state transfer overhead gets reduced by the size of the spec block. The speculator itself does not go over RPC, must be local.

I'm still a little bit uncomfortable with what I'm seeing with the timing in the sense of too good to be true. The RTX5000 should have 2X memory BW and 2X compute of 4070 so should in theory be getting 2X performance. But I am seeing about the same performance with unspeculated and speculated (18ts non speculated, 70ts speculated) that you reported. Something not making sense here. Old numbers would have made more sense (13ts non speculated 40ts speculated) compared to RTX5000 but suddenly I see all this speed seemingly out of nowhere. I could understand 10-20% boost based on RPC/cuda optimizations but 2x boost is harder to understand.

jukofyork Jun 15, 2025
Collaborator

I'm still a little bit uncomfortable with what I'm seeing with the timing in the sense of too good to be true. The RTX5000 should have 2X memory BW and 2X compute of 4070 so should in theory be getting 2X performance. But I am seeing about the same performance with unspeculated and speculated (18ts non speculated, 70ts speculated) that you reported. Something not making sense here. Old numbers would have made more sense (13ts non speculated 40ts speculated) compared to RTX5000 but suddenly I see all this speed seemingly out of nowhere. I could understand 10-20% boost based on RPC/cuda optimizations but 2x boost is harder to understand.

I'm pretty sure it can't be anything to do with the change in #14104, as all it does is move the counter increment to after the code that tests against slot.params.speculative.n_min here:

https://github.com/jukofyork/llama.cpp/blob/0bda7877b4ec4610ebe7b0f83937642b54adfc1f/tools/server/server.cpp#L3559

The only reason for making this change was that I was getting a massively under-reported acceptance rate when using with --draft-min 3 for deepseek on when offloading the MoE tensors to CPU/RAM... It was basically calculating accepted / drafted rather than accepted / tested.

If you are not using the --draft-min option then it shouldn't have any effect at all AFAIK?

There could be a bug in the calculation from before this PR though, as this thread predates by several weeks/months the draft acceptance rate = X ( Y accepted / Z generated) output being added to the server timings printout?

steampunque Jun 15, 2025
Author

As I mentioned, no changes in master affect the speculation in my code. I have a completely different spec algorithm and maintain draft accept counters for it myself.

One change suddenly jumped out though before I was using Q6_K quant of the speculator and in new run I am using IQ4_XS. I think IQ4_XS is very fast on cuda and maybe for some reason its doing a better job speculating the IQ4_XS quant of the target. So a lot of potentially small things all going in the same direction (RPC improvements, cuda improvements, new attention, improved speculator model etc.) to impact the final results. Too many variables going on to keep track of.

jukofyork Jun 15, 2025
Collaborator

I still think there might be an off by one error here.

Consider if you set --draft-max 1 (= params.n_draft in the code below), which will essentially not do any speculative decoding at all:

    // sample n_draft tokens from the draft model
    for (int i = 0; i < params.n_draft; ++i) {
        common_batch_clear(batch);

        common_sampler_sample(smpl, ctx, 0, true);

        const auto * cur_p = common_sampler_get_candidates(smpl);

        for (int k = 0; k < std::min(3, (int) cur_p->size); ++k) {
            LOG_DBG(" - draft candidate %3d, pos %3d: %6d (%8.3f) '%s'\n",
                    k, i, cur_p->data[k].id, cur_p->data[k].p, common_token_to_piece(ctx, cur_p->data[k].id).c_str());
        }

        // add drafted token for each sequence
        const llama_token id = cur_p->data[0].id;

        common_sampler_accept(smpl, id, true);

        result.push_back(id);

        if (params.n_draft <= (int) result.size()) {
            break;
        }

        // only collect very high-confidence draft tokens
        if (cur_p->data[0].p < params.p_min) {
            break;
        }

        common_batch_add(batch, id, n_past + i + 1, { 0 }, true);

        // evaluate the drafted tokens on the draft model
        llama_decode(ctx, batch);

        prompt.push_back(id);
    }

    return result;

You are going to to come into this loop once, result.push_back(id) then params.n_draft <= (int) result.size() will trigger and then we will return result which will have a single token_id in.

Then assuming the default --draft-min 0 (= slot.params.speculative.n_min in the code below):

                llama_tokens draft = common_speculative_gen_draft(slot.spec, params_spec, cached_text_tokens, id);

                // ignore small drafts
                if (slot.params.speculative.n_min > (int) draft.size()) {
                    SLT_DBG(slot, "ignoring small draft: %d < %d\n", (int) draft.size(), slot.params.speculative.n_min);

                    continue;
                }

                // keep track of total number of drafted tokens tested
                slot.n_draft_total += draft.size();

The slot.params.speculative.n_min > (int) draft.size() clause will be skipped over and then the slot.n_draft_total variable will now be incremented by 1 - but this isn't really a draft at all and will just perform standard un-drafted token generation (I think?).

Then further down we do:

                // update how many tokens out of those tested were accepted
                slot.n_draft_accepted += ids.size() - 1;

I think you can make the same argument by induction that this will always end up being (#accepted + 1) / (#tested + 1) (or possibly (#accepted + 1) / #tested due to that += ids.size() - 1), but the code has lots of other optimisations and corner cases regarding the reuse so I'm not 100% sure...

jukofyork · 2025-06-15T09:22:41Z

jukofyork
Jun 15, 2025
Collaborator

adapt spec block length without computing any probs

I think sadly this is doomed to fail - check out these graphs:

Adaptive block length would work well here as the cost for each block size scales almost linearly.

This single down-tick at the start for the CUDA MLA kernel can still be avoided by using the --draft-min option:

But then you get stuff like this from the Metal flash-attention kernels:

We should never take the next batch size above for all the cases where there is a downward step, as even if the draft model were 100% correct; it would still be -EV to do so!

The huge uptick between sizes 3 and 4 mean we should also accept a way lower probability compared to the sizes before and after!

1 reply

jukofyork Jun 15, 2025
Collaborator

But then you get stuff like this from the Metal flash-attention kernels:

We should never take the next batch size above for all the cases where there is a downward step, as even if the draft model were 100% correct; it would still be -EV to do so!

The huge uptick between sizes 3 and 4 mean we should also accept a way lower probability compared to the sizes before and after!

I think if you are using the Metal flash-attention kernels (at least for this particular head dimension), then just using a fixed draft-size of 4 (which I think can be done by setting --draft-min 4 --draft-max 4 --draft-min-p 0.0?) likely gives you most of the potential performance improvements!

I'm not sure if the newer M2/M3/M4 Ultras have better compute though, so that might be a factor to look at.

jukofyork · 2025-06-15T10:02:19Z

jukofyork
Jun 15, 2025
Collaborator

Looking at (the gradient of) all these together gives the best overall picture I think:

If you are compute bound (like the M1 Ultra or having to offload the MoE tensors to CPU), then there is only so much you can gain; however good your probability estimates are!
If you have excess compute, then you can get absolutely massive gains and the current algorithms' fixed probability threshold is likely giving up a lot of potential performance.

The "Deepseek-V3-0324 with MoE offloaded to CPU" example shows pretty much the worst case:

Downward tick(s) right at the start; meaning you often have to reject drafts using the --draft-min option.
A rapid flattening of the tokens/s for batch sizes of 16 onwards.

The "Qwen3-235B-A22B with MoE offloaded to CPU" example doesn't have the downward ticks right at the start, but the tokens/s tails off even quicker; making drafts of more than 12+ tokens nearly always -EV.

I've tried to tidy up and explain the code a little better (and fixed a bug to do with the best_draft_size + 1 at the end):

    // ??? CAN THIS EVER BE ANYTHING BUT 0 HERE ???
    //printf("[%d", (int) result.size());
    //GGML_ASSERT((int) result.size() == 0);

    // RTX 5000 Ada: Qwen2.5-Coder-32B-Instruct-Q6_K.gguf + Qwen2.5-Coder-0.5B-Instruct-Q4_0.gguf
    const std::vector<float> main_batch_speeds = { 18.39, 37.40, 55.68, 73.54, 90.26, 106.11, 114.92, 117.37, 140.73, 156.29, 171.05, 186.18, 201.16, 215.75, 230.52, 245.79, 254.85, 268.25, 282.39, 296.24, 311.07, 323.96, 338.24, 351.62, 374.63, 387.71, 401.52, 415.38, 429.15, 443.62, 455.55, 470.03 };
    const std::vector<float> draft_batch_speeds = { 324.59, 756.20, 1108.57, 1435.62, 1663.25, 1934.04, 2153.58, 2235.32, 1978.88, 2126.60, 2309.37, 2492.45, 2678.03, 2839.02, 3026.49, 3183.50, 3398.03, 3395.50, 3567.67, 3692.83, 3857.95, 4010.42, 4153.14, 4285.82, 4452.40, 4428.92, 4555.89, 4716.58, 4800.15, 4920.78, 5048.93, 5162.57 };

    // ==========================================================================================================
        
    // The estimated lookahead cost per token, in terms of the main model's token generation speed
    const float lookahead_cost_estimate = main_batch_speeds[0] / draft_batch_speeds[0];
    
    // The maximum lookahead relative to the best we have seen so far
    const int max_lookahead = 5;
    
    // ==========================================================================================================
    
    // The best draft size and its associated expected value so far (ie: init to the the main model's TG speed)
    int best_draft_size = 0;
    float best_draft_ev = main_batch_speeds[0];
    
    // The current sequence probability, as predicted by the draft model
    float current_sequence_p = 1.0;
   
    GGML_ASSERT((int) main_batch_speeds.size() == params.n_draft);
    GGML_ASSERT((int) draft_batch_speeds.size() == params.n_draft);

    for (int i = 0; i < params.n_draft; ++i) {
        
        // Sample a draft token
        common_batch_clear(batch);
        common_sampler_sample(smpl, ctx, 0, true);
        const auto * cur_p = common_sampler_get_candidates(smpl);
        const llama_token id = cur_p->data[0].id;
        common_sampler_accept(smpl, id, true);

        // Save the sampled token id
        result.push_back(id);
        
        // Get the current draft size we are looking at
        const int current_draft_size = result.size();
        
        // If we have enough tokens already, then stop
        if (current_draft_size >= params.n_draft) {
            break;
        }

        // Update the sequence probability using the sampled token's predicted probability
        current_sequence_p *= cur_p->data[0].p;

        // Calculate the expected value (in terms of the main model's tokens/s) for this sequence
        const float current_sequence_ev = current_sequence_p * main_batch_speeds[current_draft_size];
        
        // Is this token clearly +EV compared to what we have so far?
        if (current_sequence_ev > best_draft_ev) {
            best_draft_size = current_draft_size;
            best_draft_ev = current_sequence_ev;
        }

        // Otherwise we have to decide if we might see a +EV draft in the future, or stop now
        else {
            
            bool stop_now = true;
            
            for (int j = current_draft_size + 1; j < (int) main_batch_speeds.size(); j++) {

                // Don't bother looking too far relative to the best we have seen so far
                if (j - best_draft_size > max_lookahead) {
                    break;
                }
                
                // This approximates the cost of the lookahead in terms of the main model's tokens/s.
                const float lookahead_cost = lookahead_cost_estimate * (float) (j - current_draft_size);

                // Calculate the discounted potential EV of this lookahead depth
                // NOTE: This assumes the worst case of the draft predicting p=1.0 for all future tokens...
                const float potential_ev = current_sequence_p * (main_batch_speeds[j] - lookahead_cost);

                // Is this lookahead depth potentially +EV compared to the best we have so far?
                if (potential_ev > best_draft_ev) {
                    stop_now = false;
                    break;
                }
            }
            
            // If no chance to improve the EV, then stop now
            if (stop_now) {
                break;
            }

        }

        common_batch_add(batch, id, n_past + i + 1, { 0 }, true);

        // evaluate the drafted tokens on the draft model
        llama_decode(ctx, batch);

        prompt.push_back(id);
    }   

    //printf(", %d] {%d} %d %.2f (+%.2f)\n",(int) result.size(), (best_draft_size > 0 ? (int) result.size() - best_draft_size : 0), best_draft_size + 1, best_draft_ev, best_draft_ev - main_batch_speeds[0]);

    // NOTE: The main model should also generate the next token after the most +EV size we found.
    //       This is because if we successfully get to this token, the main model will see the
    //       full distribution and cannot be wrong (ie: essentially it's free if we get to it).
    result.resize(best_draft_size + 1);

    return result;

but I don't really see much potential for large improvements and don't want to complicate it any more.

The code I used to generate the main_batch_speeds and draft_batch_speeds C++ vectors to paste in is here:

#!/bin/bash

# Environment variables
export CUDA_VISIBLE_DEVICES=0

# Configuration variables
BATCHED_BENCH_EXE="~/llama.cpp/build/bin/llama-batched-bench"

MAIN_MODEL_PATH="~/models/gguf/Qwen2.5-Coder-32B-Instruct-Q6_K.gguf"
#MAIN_MODEL_PATH="~/models/gguf/Deepseek-V3-0324-Q4_K_XL.gguf"

DRAFT_MODEL_PATH="~/models/gguf/draft_models/Qwen2.5-Coder-0.5B-Instruct-Q4_0.gguf"
#DRAFT_MODEL_PATH="~/models/gguf/draft_models/DeepSeek-V3-0324-CODER-DRAFT-0.6B-Q4_0.gguf"

# Benchmark parameters
PROMPT_SIZE=512
MAX_DRAFT_SIZE=32
NUM_SAMPLES=5

# Model-specific parameters
MODEL_PARAMS="--n-gpu-layers 99 \
              --flash-attn"

#MODEL_PARAMS="--n-gpu-layers 99 \
#              --flash-attn \
#              --numa distribute \
#              --threads 80 \
#              --override-tensor exps=CPU"

# Generate PROMPT_SIZE_LIST and BATCH_SIZE_LIST (NOTE: Process 2x before 1x to help with warmup)
PROMPT_SIZE_LIST="$((PROMPT_SIZE * 2)),${PROMPT_SIZE}"
BATCH_SIZE_LIST=$(printf "%s," $(for i in $(seq 1 $NUM_SAMPLES); do seq 1 $MAX_DRAFT_SIZE; done) | sed 's/,$//')

# Function to run benchmark for a model
run_benchmark() {
    local model_path=$1
    local log_file=$2
    
    echo "- Running benchmark for $(basename "$model_path")..."
    $BATCHED_BENCH_EXE \
        --model "$model_path" \
        $MODEL_PARAMS \
        --ctx_size "$((PROMPT_SIZE * 2 + MAX_DRAFT_SIZE))" \
        -pps \
        -npp "$PROMPT_SIZE_LIST" \
        -npl "$BATCH_SIZE_LIST" \
        -ntg 1 \
        --output-format jsonl | tee "$log_file"
}

# Function to extract and process results
extract_and_process() {
    local log_file=$1
    local model_name=$2
    local jsonl_pp1x="${model_name}_pp1x.jsonl"
    local jsonl_pp2x="${model_name}_pp2x.jsonl"
    
    # Extract JSONL lines from the log (NOTE: Skip first set of samples as seems to need a warmup to get accurate stats)
    echo -n "- Extracting results for $model_name..." >&2
    grep '^{' "$log_file" | tail -n "+$((NUM_SAMPLES + 1))" | grep "\"pp\": ${PROMPT_SIZE}" > "$jsonl_pp1x"
    grep '^{' "$log_file" | tail -n "+$((NUM_SAMPLES + 1))"| grep "\"pp\": $((PROMPT_SIZE * 2))" > "$jsonl_pp2x"
    COUNT1=$(wc -l < "$jsonl_pp1x")
    COUNT2=$(wc -l < "$jsonl_pp2x")
    echo " Done ($COUNT1 1xPP results, $COUNT2 2xPP results)" >&2
    
    # Process results and capture output
    process_results "$jsonl_pp1x" "$jsonl_pp2x" "$model_name"
    
    # Clean up
    rm "$log_file" "$jsonl_pp1x" "$jsonl_pp2x"
}

# Function to extract values as bash array
extract_values() {
    local jsonl_file=$1
    
    jq -s --raw-output '
        (map(.pl) | max) as $max_pl |
        reduce .[] as $item (
            {};
            ($item.pl | tostring) as $pl |
            .[$pl].sum = (.[$pl].sum + $item.speed_tg) |
            .[$pl].count = (.[$pl].count + 1)
        ) |
        [range(1; $max_pl + 1) as $pl |
            (.[($pl|tostring)]).sum / .[($pl|tostring)].count
        ] |
        map(. * 1000 | round | . / 1000) |
        join(" ")
    ' "$jsonl_file"
}

# Function to process JSONL and output C++ vector we will need
process_results() {
    local jsonl_file_1x=$1
    local jsonl_file_2x=$2
    local model_name=$3
    
    # Extract values as arrays
    local values_1x=($(extract_values "$jsonl_file_1x"))
    local values_2x=($(extract_values "$jsonl_file_2x"))
       
    # Solve the pair of simultaneous equations:
    # - net_batch_pl = 2*v1x - v2x  (ie: Batch speed without PP overhead)
    # - overhead_pp = v2x - v1x     (ie: Extra PP overhead)
    local net_batch_pl=()
    local overhead_pp=()
    
    for i in "${!values_1x[@]}"; do
        if [[ -n "${values_2x[i]}" ]]; then
            local v1x="${values_1x[i]}"
            local v2x="${values_2x[i]}"
            
            # Calculate using awk for floating point arithmetic
            local tg=$(awk "BEGIN {printf \"%.2f\", 2 * $v1x - $v2x}")
            local pp=$(awk "BEGIN {printf \"%.2f\", $v1x - $v2x}")
            
            net_batch_pl+=("$tg")
            overhead_pp+=("$pp")
        fi
    done

    # Return the C++ formatted vector
    local batch_speeds_str=$(printf '%s, ' "${net_batch_pl[@]}")
    batch_speeds_str=${batch_speeds_str%, }  # Remove trailing ", "
    echo "const std::vector<float> ${model_name}_batch_speeds = { $batch_speeds_str };"
}

# Clean previous files
rm -f main_*.log main_*.jsonl draft_*.log draft_*.jsonl

# Run benchmarks for both models and capture C++ output
run_benchmark "$MAIN_MODEL_PATH" "main_benchmark.log"
main_output=$(extract_and_process "main_benchmark.log" "main")

run_benchmark "$DRAFT_MODEL_PATH" "draft_benchmark.log"
draft_output=$(extract_and_process "draft_benchmark.log" "draft")

# Output both C++ vectors at the end
echo "----------------------------------------"
echo "$main_output"
echo "$draft_output"
echo "----------------------------------------"

but I'm not convinced the use of PROMPT_SIZE is all that worthwhile:

I takes way longer to run.
The pair of simultaneous equations that it solves assume you are always working between PROMPT_SIZE and 2 * PROMPT_SIZE.
Examining overhead_pp doesn't really seem to show any significant residuals.

I've left it in though, as it could just be my setup isn't making it useful, or perhaps with more patience it could be run with PROMPT_SIZE much higher, etc.

You could even run more than 2 and then fit a linear regression line to try to predict overhead_pp, which could then be used inside the C++, etc.

But I really just want to keep this minimal so hopefully it's understandable, and clearly shows the problem it solves related to the existing --draft-p-min method...

I doubt I'll be able to do much more for this, but hopefully somebody found it interesting! :)

I'm not sure how easy it would be to add as a proper PR either... It would probably need to export the vectors as JSON and then import them somehow (the "jumpiness" phenomenon killed any idea of parametrising the lines sadly).

0 replies

jukofyork · 2025-06-15T13:46:59Z

jukofyork
Jun 15, 2025
Collaborator

I've tracked down what is causing this:

llama.cpp/src/llama-graph.cpp

Line 1093 in 5fce5f9

if (v_mla) {

        if (v_mla) {
#if 0
            // v_mla can be applied as a matrix-vector multiplication with broadcasting across dimension 3 == n_tokens.
            // However, the code is optimized for dimensions 0 and 1 being large, so this is ineffient.
            cur = ggml_reshape_4d(ctx0, cur, v_mla->ne[0], 1, n_head, n_tokens);
            cur = ggml_mul_mat(ctx0, v_mla, cur);
#else
            // It's preferable to do the calculation as a matrix-matrix multiplication with n_tokens in dimension 1.
            // The permutations are noops and only change how the tensor data is interpreted.
            cur = ggml_permute(ctx0, cur, 0, 2, 1, 3);
            cur = ggml_mul_mat(ctx0, v_mla, cur);
            cur = ggml_permute(ctx0, cur, 0, 2, 1, 3);
            cur = ggml_cont(ctx0, cur); // Needed because ggml_reshape_2d expects contiguous inputs.
#endif
        }

using @fairydreaming's original method of only permuting if n_tokens > n_head:

        if (v_mla) {
            if (n_tokens <= n_head) {
                // v_mla can be applied as a matrix-vector multiplication with broadcasting across dimension 3 == n_tokens.
                // However, the code is optimized for dimensions 0 and 1 being large, so this is ineffient.
                cur = ggml_reshape_4d(ctx0, cur, v_mla->ne[0], 1, n_head, n_tokens);
                cur = ggml_mul_mat(ctx0, v_mla, cur);
            } else {
                // It's preferable to do the calculation as a matrix-matrix multiplication with n_tokens in dimension 1.
                // The permutations are noops and only change how the tensor data is interpreted.
                cur = ggml_permute(ctx0, cur, 0, 2, 1, 3);
                cur = ggml_mul_mat(ctx0, v_mla, cur);
                cur = ggml_permute(ctx0, cur, 0, 2, 1, 3);
                cur = ggml_cont(ctx0, cur); // Needed because ggml_reshape_2d expects contiguous inputs.
            } 
        }

and the nasty jump is gone:

(rerunning with the Linux caches dropped for NUMA and with the full PROMPT_SIZE=512 to get an exact comparison graph now...)

@JohannesGaessler It looks here like the crossover point may be n_tokens >= 32 rather than n_tokens >= n_head? Is there anything in the CUDA backend that has a 32 threshold perhaps? My v_mla is stored as BF16.

6 replies

jukofyork Jun 15, 2025
Collaborator

Yeah, that could just be an artifact of my low sample size here.

Possibly n_tokens < 32 is the correct threshold then. I haven't tested it yet, but I think that cur = ggml_cont_2d(ctx0, cur, cur->ne[0]*n_head, n_tokens) there may be better than separate ggml_cont and ggml_reshape_2d operations, eg:

        if (v_mla) {
            if (n_tokens < 32) {
                cur = ggml_reshape_4d(ctx0, cur, v_mla->ne[0], 1, n_head, n_tokens);
                cur = ggml_mul_mat(ctx0, v_mla, cur);
                cur = ggml_reshape_2d(ctx0, cur, cur->ne[0]*n_head, n_tokens);
            } else {
                cur = ggml_permute(ctx0, cur, 0, 2, 1, 3);
                cur = ggml_mul_mat(ctx0, v_mla, cur);
                cur = ggml_permute(ctx0, cur, 0, 2, 1, 3);
                cur = ggml_cont_2d(ctx0, cur, cur->ne[0]*n_head, n_tokens);
            } 
        } else {
            cur = ggml_reshape_2d(ctx0, cur, cur->ne[0]*n_head, n_tokens);
        }

Also sorry for not catching this during development and thank you for debugging it.

It might only show up so clearly on the full-sized deepseek-v3 and you used the smaller v2-lite version for development IIRC (likely with less than the 128 heads this has?).

jukofyork Jun 15, 2025
Collaborator

In my opinion it's fine to just do a comparison against 32 with a comment stating that this is what was empirically found to be the break-even point.

Ninja edited as copied the wrong code in then 😁

Do you want me to put a PR in for this for you to test on the smaller v2-lite (which I don't have at hand)?

jukofyork Jun 15, 2025
Collaborator

I've got to prepare for a meal so will let my "Linux caches dropped for NUMA and with the full PROMPT_SIZE=512" complete, and then if I can get away for a few minutes; will run with ggml_cont_2d to compare that too, and report back,

JohannesGaessler Jun 15, 2025
Collaborator

You can also just make a PR and I can then try to reproduce your results using my hardware (I have Deepseek V2 Lite ready).

jukofyork Jun 15, 2025
Collaborator

#14198

Speculative decoding potential for running big LLMs on consumer grade GPUs efficiently #10466

Uh oh!

Uh oh!

steampunque Nov 23, 2024

Replies: 18 comments · 43 replies

Uh oh!

Uh oh!

jukofyork Nov 30, 2024 Collaborator

Uh oh!

Uh oh!

steampunque Nov 30, 2024 Author

Uh oh!

Uh oh!

jukofyork Nov 30, 2024 Collaborator

Uh oh!

steampunque Nov 30, 2024 Author

Uh oh!

Uh oh!

jukofyork Nov 30, 2024 Collaborator

Uh oh!

steampunque Dec 1, 2024 Author

Uh oh!

jukofyork Dec 7, 2024 Collaborator

Uh oh!

steampunque Dec 7, 2024 Author

Uh oh!

Uh oh!

jukofyork Dec 7, 2024 Collaborator

Uh oh!

jukofyork Dec 7, 2024 Collaborator

Uh oh!

steampunque Dec 7, 2024 Author

Uh oh!

jukofyork Dec 7, 2024 Collaborator

Uh oh!

Uh oh!

steampunque Dec 8, 2024 Author

Uh oh!

ggerganov Dec 8, 2024 Maintainer

Uh oh!

steampunque Dec 8, 2024 Author

Uh oh!

ggerganov Dec 8, 2024 Maintainer

Uh oh!

Thomas-MMJ Feb 6, 2025

Uh oh!

steampunque Feb 6, 2025 Author

Uh oh!

Uh oh!

jukofyork Apr 8, 2025 Collaborator

Uh oh!

steampunque Apr 8, 2025 Author

Uh oh!

Uh oh!

jukofyork Apr 8, 2025 Collaborator

Uh oh!

jukofyork Apr 8, 2025 Collaborator

Uh oh!

jukofyork Apr 8, 2025 Collaborator

Without imatrix

With imatrix

Uh oh!

Uh oh!

jukofyork Apr 8, 2025 Collaborator

steampunque
Nov 23, 2024

Replies: 18 comments 43 replies

jukofyork
Nov 30, 2024
Collaborator

steampunque Nov 30, 2024
Author

jukofyork Nov 30, 2024
Collaborator

steampunque Nov 30, 2024
Author

jukofyork Nov 30, 2024
Collaborator

steampunque Dec 1, 2024
Author

jukofyork
Dec 7, 2024
Collaborator

steampunque Dec 7, 2024
Author

jukofyork Dec 7, 2024
Collaborator

jukofyork Dec 7, 2024
Collaborator

steampunque Dec 7, 2024
Author

jukofyork Dec 7, 2024
Collaborator

steampunque
Dec 8, 2024
Author

ggerganov Dec 8, 2024
Maintainer

steampunque Dec 8, 2024
Author

ggerganov Dec 8, 2024
Maintainer

Thomas-MMJ
Feb 6, 2025

steampunque Feb 6, 2025
Author

jukofyork
Apr 8, 2025
Collaborator

steampunque Apr 8, 2025
Author

jukofyork Apr 8, 2025
Collaborator

jukofyork Apr 8, 2025
Collaborator

jukofyork Apr 8, 2025
Collaborator

Without `imatrix`

With `imatrix`

jukofyork Apr 8, 2025
Collaborator