Implement stochastic speculative sampling #5625
Conversation
I recently worked with these files and should be able to review. However, I'm currently attending a scientific conference and will only be available next week.
From a quick look, I believe the authors are selecting random child nodes from the draft tree at each depth of the tree, while in the proposed implementation I think you are testing the drafted tokens sequentially from a list. In other words, randomly selecting the child nodes makes sense only for n_seq_dft > 1. Does that make sense?
@ggerganov Yes, that is correct. I was referring to cases where n_seq_dft > 1.
I see now. I think the proposed approach is equivalent to the one in the paper, but I could be missing something.
The … so this part can remain the same as it is. Regarding …, IIUC we no longer have to draft the tokens greedily, but instead apply the described strategy. In this case, I think the …
@ggerganov Understood. I wasn't certain whether using a top-1 or top-k (or p_split) drafting method would affect the equivalence of the output distribution to the target model distribution (I'm still unsure about the mathematical implications), but based on the information from the SpecInfer paper, it seems we can keep it as it is for now. In the meantime, I believe replacing p_accept with a parameter representing the number of draft tokens to sample per decoding step is a sensible decision, given its critical role in influencing the efficiency and performance of the decoding process. I'll make this update and remove the (WIP) tag.

edit: Turns out that we already have such a parameter.
I read the paper and I do not understand how their proposed sampling method can be better than what they call "naive sampling". Fundamentally, if the probability distribution of the sampled tokens is constant, then the probability of the sampled sequence being in the draft tree is also constant. So it doesn't matter what tricks you use for acceptance/rejection, it's not going to make any difference whatsoever. Is there any evidence that quantitatively shows the method from the paper being superior to naive sampling? In #5479 I implemented some code that tests lookup decoding in terms of e.g. acceptance rate on a large text corpus in order to get sufficient statistics. It may make sense to implement something like this for speculative decoding as well.
In order to preserve the probability distribution of the LLM outputs, the order in which the draft sequences are traversed must be random. This is because the algorithm stops at the first accepted continuation. If the order is non-random, this therefore biases the drafting towards those tokens with a lower index (with the exact bias being implementation-specific).
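For reference, a minimal single-sequence Python sketch of the accept/resample rule under discussion (illustrative only, not the PR's code; `p_tgt` and `p_dft` are assumed per-position target and draft distributions):

```python
import numpy as np

rng = np.random.default_rng(0)

def verify_draft(draft_tokens, p_tgt, p_dft):
    """Accept drafted tokens left to right; on the first rejection, resample
    from the residual distribution max(0, p_tgt - p_dft), renormalized."""
    out = []
    for t, tok in enumerate(draft_tokens):
        p, q = p_tgt[t], p_dft[t]                 # target/draft distributions at position t
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)                       # accepted: keep the drafted token
        else:
            residual = np.maximum(0.0, p - q)
            residual /= residual.sum()            # renormalize the residual distribution
            out.append(int(rng.choice(len(p), p=residual)))
            break                                 # stop at the first rejection
    # if every draft token was accepted, one extra token is normally sampled
    # from the target distribution at the next position (omitted here)
    return out
```

Sampling this way leaves the accepted-or-resampled tokens distributed according to `p_tgt`, which is the property the stochastic method is meant to preserve.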
It looks like this is caused by the randomly selected value r at …

Regarding the quality, it might be from the fact that the output distribution is not equal to the target distribution due to top-1 drafting.

One question about the seed: given the same seed, should speculative output the same sequence as main? I'm not sure if that's possible or not at the moment.
I tried a simple Python script to check whether the method works:

```python
#!/usr/bin/env python3
import numpy as np
import random
SAMPLE_SIZE = 10000
VOCAB_SIZE = 10
BRANCHING_RATIO = 4
TREE_DEPTH = 4
X = np.arange(VOCAB_SIZE)
P_LLM = np.exp(-X)
P_LLM /= np.sum(P_LLM)
P_SSM = 1 / (100 - X)
P_SSM /= np.sum(P_SSM)
def sample(probs):
x = np.random.rand()
cumsum = 0.0
for i, p_i in enumerate(probs):
cumsum += p_i
if x < cumsum:
return i
assert False
n_accept_naive = 0
n_accept_spec_infer = 0
for _ in range(SAMPLE_SIZE):
trees = [[0]]
for _ in range(TREE_DEPTH):
trees_new = []
for tree in trees:
sampled_tokens = []
while len(sampled_tokens) < BRANCHING_RATIO:
token = sample(P_SSM)
if token in sampled_tokens:
continue
trees_new.append(tree + [token])
sampled_tokens.append(token)
trees = trees_new
sequence_llm = [0] + [sample(P_LLM) for _ in range(TREE_DEPTH)]
max_match = 0
for tree in trees:
for depth in range(TREE_DEPTH):
if tree[1:depth+1] == sequence_llm[1:depth+1]:
max_match = max(max_match, depth)
n_accept_naive += max_match
n_accept_i = 0
norm = 1.0
while trees and n_accept_i < TREE_DEPTH:
random.shuffle(trees)
token = trees[0][n_accept_i]
if np.random.rand() < (P_LLM[token] / norm) / P_SSM[token]:
trees = list(filter(lambda t: t[n_accept_i] == token, trees))
n_accept_i += 1
norm = 1.0
else:
trees = list(filter(lambda t: t[n_accept_i] != token, trees))
norm -= P_LLM[token]
n_accept_spec_infer += n_accept_i
print(f"naive: {n_accept_naive / SAMPLE_SIZE}")
print(f"SpecInfer: {n_accept_spec_infer / SAMPLE_SIZE}") Edit: the above script has a bug! The results are:
So, assuming my code is correct, the method does seem to work. What I think is happening is that even though the ultimate probability distribution of the LLM does not change, the conditional probability distribution of the LLM given the SSM results does change. In essence, because the probabilities of those tokens in the tree get scaled up by dividing them by the probabilities that they end up in the tree in the first place, there is a correlation between the tokens sampled by the SSM and the tokens sampled by the LLM. So even though the output distribution doesn't change, the probability of the draft being correct does. It's similar to how the Metropolis-Hastings algorithm exploits autocorrelation to increase the rate of convergence over simple Monte Carlo methods.
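One way to make the per-token intuition precise (a sketch of the argument only, writing p for the LLM/target distribution and q for the SSM/draft distribution of a single drafted token):

```latex
P^{\text{naive}}_{\text{match}} = \sum_x p(x)\,q(x),
\qquad
P^{\text{stochastic}}_{\text{accept}} = \sum_x q(x)\,\min\!\left(1, \frac{p(x)}{q(x)}\right) = \sum_x \min\bigl(p(x),\, q(x)\bigr)
```

Since p(x), q(x) ≤ 1, we have min(p(x), q(x)) ≥ p(x) q(x) for every token, so the stochastic acceptance probability is at least the naive match probability, while the accept-or-resample output is still distributed according to p.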
Related discussion: flexflow/FlexFlow#1302

I also noticed that my Python script had a bug regarding the normalization. This version should be fixed:

```python
#!/usr/bin/env python3
import numpy as np
import random
SAMPLE_SIZE = 10000
VOCAB_SIZE = 10
BRANCHING_RATIO = 4
TREE_DEPTH = 4
X = np.arange(VOCAB_SIZE)
P_LLM = np.exp(-X)
P_LLM /= np.sum(P_LLM)
P_SSM = 1 / (100 - X)
P_SSM /= np.sum(P_SSM)
def sample(probs):
x = np.random.rand()
cumsum = 0.0
for i, p_i in enumerate(probs):
cumsum += p_i
if x < cumsum:
return i
assert False
n_accept_naive = 0
n_accept_spec_infer = 0
for _ in range(SAMPLE_SIZE):
trees = [[0]]
for _ in range(TREE_DEPTH):
trees_new = []
for tree in trees:
sampled_tokens = []
while len(sampled_tokens) < BRANCHING_RATIO:
token = sample(P_SSM)
if token in sampled_tokens:
continue
trees_new.append(tree + [token])
sampled_tokens.append(token)
trees = trees_new
sequence_llm = [0] + [sample(P_LLM) for _ in range(TREE_DEPTH)]
max_match = 0
for tree in trees:
for depth in range(TREE_DEPTH):
if tree[1:depth+1] == sequence_llm[1:depth+1]:
max_match = max(max_match, depth)
n_accept_naive += max_match
n_accept_i = 0
p_llm = np.array(P_LLM)
while trees and n_accept_i < TREE_DEPTH:
random.shuffle(trees)
token = trees[0][n_accept_i]
if np.random.rand() < p_llm[token] / P_SSM[token]:
trees = list(filter(lambda t: t[n_accept_i] == token, trees))
n_accept_i += 1
p_llm = np.array(P_LLM)
else:
trees = list(filter(lambda t: t[n_accept_i] != token, trees))
p_llm = np.maximum(0.0, p_llm - P_SSM)
p_llm /= np.sum(p_llm)
n_accept_spec_infer += n_accept_i
print(f"naive: {n_accept_naive / SAMPLE_SIZE}")
print(f"SpecInfer: {n_accept_spec_infer / SAMPLE_SIZE}") Edit: These are the fixed results:
examples/speculative/speculative.cpp (outdated)

```cpp
const std::string token_str = llama_token_to_piece(ctx_tgt, id);
// GGML_ASSERT(dist_tgt.size() == dist_dft.size());
for (int s = 0; s < n_seq_dft; ++s) {
```
As I said before, unless my math is wrong, the order in which the drafts are iterated over must be random. But in any case, given how tricky this implementation is, I think we should not make any changes to the algorithm unless we can confirm that the results stay the same.
I agree. One way to implement this would be to randomly shuffle the order of sequences to evaluate, if there is a computationally cheap method to do so.
Or we could just select a random sequence index to check for each iteration.
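For illustration, a minimal Python sketch of that idea (assumed names, not the PR's C++ code):

```python
import random

# ids of the draft sequences that have not been rejected yet (illustrative)
active_seqs = [0, 1, 2]

while active_seqs:
    # pick a random still-active sequence to verify next
    s = active_seqs[random.randrange(len(active_seqs))]
    # ... verify the next drafted token of sequence s against the target here ...
    # placeholder: drop the sequence so the loop terminates in this sketch
    active_seqs.remove(s)
```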
Description of what I did on commit [94f6256]:
@JohannesGaessler I've implemented random selection of sequences to verify in commit [2ad3f7c].

make -j && ./speculative -m models/llama-2-13b.Q5_K_M.gguf --model-draft models/llama-2-13b.Q5_K_M.gguf -p "Simple python quicksort function:\n\`\`\`python\n" -n 200 -e --color --log-enable --temp 1 -np 3

The logs show that the sequence to verify at each step is selected randomly.
I've been doing some experiments on A100 with CodeLlama instruct. The PR seems to produce correct results and there seems to be some performance gain for temp > 0.0 compared to master, though my tests are not very extensive.
Some observations from the experiments:

- We continue to gain the most from speculative decoding when using an F16 target model. For example, with a 34B F16 target + 7B Q4_0 draft, we can get a speedup of up to x3 using `--draft 16`
- Using tree-based speculative decoding (`--np` > 1) seems to slightly improve the performance for F16 target models, but does not help much for quantum models
- With quantum target models, using 34B Q8_0 + 7B Q4_0 we can get up to x1.5 speedup with `--draft 4`. And for 34B Q4_K + 7B Q4_0 I didn't observe significant speed-up from speculative decoding for different `--draft` values

To determine the optimal `--draft` value, run the following command and pick the largest value for which the speed scales mostly linearly:
LLAMA_CUBLAS=1 make -j llama-bench && ./llama-bench \
-m models/codellama-34b-instruct/ggml-model-q4_0.gguf \
-m models/codellama-34b-instruct/ggml-model-q4_k.gguf \
-m models/codellama-34b-instruct/ggml-model-q8_0.gguf \
-m models/codellama-34b-instruct/ggml-model-f16.gguf \
-ngl 99 -p 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,32,512
Device 0: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
| llama 34B Q4_0 | 17.74 GiB | 33.74 B | CUDA | 99 | pp 1 | 44.15 ± 15.32 |
| llama 34B Q4_0 | 17.74 GiB | 33.74 B | CUDA | 99 | pp 2 | 97.81 ± 0.39 |
| llama 34B Q4_0 | 17.74 GiB | 33.74 B | CUDA | 99 | pp 3 | 124.42 ± 1.91 |
| llama 34B Q4_0 | 17.74 GiB | 33.74 B | CUDA | 99 | pp 4 | 146.06 ± 1.11 |
| llama 34B Q4_0 | 17.74 GiB | 33.74 B | CUDA | 99 | pp 5 | 162.86 ± 0.96 |
| llama 34B Q4_0 | 17.74 GiB | 33.74 B | CUDA | 99 | pp 6 | 182.57 ± 1.10 |
| llama 34B Q4_0 | 17.74 GiB | 33.74 B | CUDA | 99 | pp 7 | 195.39 ± 0.64 |
| llama 34B Q4_0 | 17.74 GiB | 33.74 B | CUDA | 99 | pp 8 | 202.36 ± 0.50 |
| llama 34B Q4_0 | 17.74 GiB | 33.74 B | CUDA | 99 | pp 9 | 126.62 ± 0.27 |
| llama 34B Q4_0 | 17.74 GiB | 33.74 B | CUDA | 99 | pp 10 | 140.13 ± 0.45 |
| llama 34B Q4_0 | 17.74 GiB | 33.74 B | CUDA | 99 | pp 11 | 153.46 ± 0.40 |
| llama 34B Q4_0 | 17.74 GiB | 33.74 B | CUDA | 99 | pp 12 | 166.44 ± 0.38 |
| llama 34B Q4_0 | 17.74 GiB | 33.74 B | CUDA | 99 | pp 13 | 149.57 ± 0.15 |
| llama 34B Q4_0 | 17.74 GiB | 33.74 B | CUDA | 99 | pp 14 | 160.48 ± 0.16 |
| llama 34B Q4_0 | 17.74 GiB | 33.74 B | CUDA | 99 | pp 15 | 170.85 ± 0.35 |
| llama 34B Q4_0 | 17.74 GiB | 33.74 B | CUDA | 99 | pp 16 | 181.80 ± 0.40 |
| llama 34B Q4_0 | 17.74 GiB | 33.74 B | CUDA | 99 | pp 32 | 206.25 ± 0.65 |
| llama 34B Q4_0 | 17.74 GiB | 33.74 B | CUDA | 99 | pp 512 | 1638.98 ± 1.68 |
| llama 34B Q4_0 | 17.74 GiB | 33.74 B | CUDA | 99 | tg 128 | 52.20 ± 0.03 |
| llama 34B Q4_K - Medium | 18.83 GiB | 33.74 B | CUDA | 99 | pp 1 | 45.63 ± 0.48 |
| llama 34B Q4_K - Medium | 18.83 GiB | 33.74 B | CUDA | 99 | pp 2 | 73.43 ± 0.25 |
| llama 34B Q4_K - Medium | 18.83 GiB | 33.74 B | CUDA | 99 | pp 3 | 89.17 ± 0.34 |
| llama 34B Q4_K - Medium | 18.83 GiB | 33.74 B | CUDA | 99 | pp 4 | 97.59 ± 0.26 |
| llama 34B Q4_K - Medium | 18.83 GiB | 33.74 B | CUDA | 99 | pp 5 | 100.93 ± 0.27 |
| llama 34B Q4_K - Medium | 18.83 GiB | 33.74 B | CUDA | 99 | pp 6 | 106.41 ± 0.30 |
| llama 34B Q4_K - Medium | 18.83 GiB | 33.74 B | CUDA | 99 | pp 7 | 113.68 ± 0.21 |
| llama 34B Q4_K - Medium | 18.83 GiB | 33.74 B | CUDA | 99 | pp 8 | 105.84 ± 0.16 |
| llama 34B Q4_K - Medium | 18.83 GiB | 33.74 B | CUDA | 99 | pp 9 | 90.76 ± 0.19 |
| llama 34B Q4_K - Medium | 18.83 GiB | 33.74 B | CUDA | 99 | pp 10 | 100.53 ± 0.13 |
| llama 34B Q4_K - Medium | 18.83 GiB | 33.74 B | CUDA | 99 | pp 11 | 110.19 ± 0.13 |
| llama 34B Q4_K - Medium | 18.83 GiB | 33.74 B | CUDA | 99 | pp 12 | 119.58 ± 0.36 |
| llama 34B Q4_K - Medium | 18.83 GiB | 33.74 B | CUDA | 99 | pp 13 | 105.82 ± 0.15 |
| llama 34B Q4_K - Medium | 18.83 GiB | 33.74 B | CUDA | 99 | pp 14 | 113.56 ± 0.17 |
| llama 34B Q4_K - Medium | 18.83 GiB | 33.74 B | CUDA | 99 | pp 15 | 121.41 ± 0.21 |
| llama 34B Q4_K - Medium | 18.83 GiB | 33.74 B | CUDA | 99 | pp 16 | 128.87 ± 0.24 |
| llama 34B Q4_K - Medium | 18.83 GiB | 33.74 B | CUDA | 99 | pp 32 | 148.65 ± 0.15 |
| llama 34B Q4_K - Medium | 18.83 GiB | 33.74 B | CUDA | 99 | pp 512 | 1653.67 ± 4.24 |
| llama 34B Q4_K - Medium | 18.83 GiB | 33.74 B | CUDA | 99 | tg 128 | 45.81 ± 0.03 |
| llama 34B Q8_0 | 33.39 GiB | 33.74 B | CUDA | 99 | pp 1 | 34.67 ± 0.28 |
| llama 34B Q8_0 | 33.39 GiB | 33.74 B | CUDA | 99 | pp 2 | 64.67 ± 0.16 |
| llama 34B Q8_0 | 33.39 GiB | 33.74 B | CUDA | 99 | pp 3 | 92.92 ± 0.37 |
| llama 34B Q8_0 | 33.39 GiB | 33.74 B | CUDA | 99 | pp 4 | 108.26 ± 0.47 |
| llama 34B Q8_0 | 33.39 GiB | 33.74 B | CUDA | 99 | pp 5 | 135.18 ± 0.34 |
| llama 34B Q8_0 | 33.39 GiB | 33.74 B | CUDA | 99 | pp 6 | 134.88 ± 0.75 |
| llama 34B Q8_0 | 33.39 GiB | 33.74 B | CUDA | 99 | pp 7 | 150.72 ± 0.70 |
| llama 34B Q8_0 | 33.39 GiB | 33.74 B | CUDA | 99 | pp 8 | 160.81 ± 0.50 |
| llama 34B Q8_0 | 33.39 GiB | 33.74 B | CUDA | 99 | pp 9 | 80.38 ± 0.24 |
| llama 34B Q8_0 | 33.39 GiB | 33.74 B | CUDA | 99 | pp 10 | 89.21 ± 0.17 |
| llama 34B Q8_0 | 33.39 GiB | 33.74 B | CUDA | 99 | pp 11 | 97.64 ± 0.20 |
| llama 34B Q8_0 | 33.39 GiB | 33.74 B | CUDA | 99 | pp 12 | 106.23 ± 0.21 |
| llama 34B Q8_0 | 33.39 GiB | 33.74 B | CUDA | 99 | pp 13 | 94.00 ± 0.11 |
| llama 34B Q8_0 | 33.39 GiB | 33.74 B | CUDA | 99 | pp 14 | 100.91 ± 0.18 |
| llama 34B Q8_0 | 33.39 GiB | 33.74 B | CUDA | 99 | pp 15 | 107.95 ± 0.13 |
| llama 34B Q8_0 | 33.39 GiB | 33.74 B | CUDA | 99 | pp 16 | 114.81 ± 0.10 |
| llama 34B Q8_0 | 33.39 GiB | 33.74 B | CUDA | 99 | pp 32 | 129.96 ± 0.10 |
| llama 34B Q8_0 | 33.39 GiB | 33.74 B | CUDA | 99 | pp 512 | 1870.12 ± 4.22 |
| llama 34B Q8_0 | 33.39 GiB | 33.74 B | CUDA | 99 | tg 128 | 34.71 ± 0.02 |
| llama 34B F16 | 62.85 GiB | 33.74 B | CUDA | 99 | pp 1 | 19.20 ± 0.71 |
| llama 34B F16 | 62.85 GiB | 33.74 B | CUDA | 99 | pp 2 | 41.40 ± 0.03 |
| llama 34B F16 | 62.85 GiB | 33.74 B | CUDA | 99 | pp 3 | 61.37 ± 0.41 |
| llama 34B F16 | 62.85 GiB | 33.74 B | CUDA | 99 | pp 4 | 81.70 ± 0.30 |
| llama 34B F16 | 62.85 GiB | 33.74 B | CUDA | 99 | pp 5 | 101.00 ± 0.74 |
| llama 34B F16 | 62.85 GiB | 33.74 B | CUDA | 99 | pp 6 | 121.58 ± 0.29 |
| llama 34B F16 | 62.85 GiB | 33.74 B | CUDA | 99 | pp 7 | 140.73 ± 0.56 |
| llama 34B F16 | 62.85 GiB | 33.74 B | CUDA | 99 | pp 8 | 160.49 ± 0.68 |
| llama 34B F16 | 62.85 GiB | 33.74 B | CUDA | 99 | pp 9 | 178.98 ± 0.52 |
| llama 34B F16 | 62.85 GiB | 33.74 B | CUDA | 99 | pp 10 | 198.23 ± 1.23 |
| llama 34B F16 | 62.85 GiB | 33.74 B | CUDA | 99 | pp 11 | 217.93 ± 0.80 |
| llama 34B F16 | 62.85 GiB | 33.74 B | CUDA | 99 | pp 12 | 237.09 ± 0.66 |
| llama 34B F16 | 62.85 GiB | 33.74 B | CUDA | 99 | pp 13 | 256.82 ± 0.78 |
| llama 34B F16 | 62.85 GiB | 33.74 B | CUDA | 99 | pp 14 | 274.49 ± 1.05 |
| llama 34B F16 | 62.85 GiB | 33.74 B | CUDA | 99 | pp 15 | 294.11 ± 1.28 |
| llama 34B F16 | 62.85 GiB | 33.74 B | CUDA | 99 | pp 16 | 312.92 ± 1.07 |
| llama 34B F16 | 62.85 GiB | 33.74 B | CUDA | 99 | pp 32 | 610.32 ± 2.21 |
| llama 34B F16 | 62.85 GiB | 33.74 B | CUDA | 99 | pp 512 | 2429.29 ± 10.59 |
| llama 34B F16 | 62.85 GiB | 33.74 B | CUDA | 99 | tg 128 | 20.04 ± 0.01 |
build: c7613540 (2234)
For example, for the F16 model we can go up to `--draft 16`, since the speed for `-p 16` is almost 16 times faster than the speed for `-p 1`. However, for `Q4_K` there is no point in using `--draft` larger than 2 because the speed does not scale so well, hence the poor speculative decoding results for that model.
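As an illustration of that heuristic, a small Python sketch (my own, not part of the thread; the throughputs are the F16 `pp` numbers from the table above, and the 0.8 efficiency threshold is an arbitrary choice):

```python
# pp throughput (t/s) of the F16 model at batch sizes 1..16, from the table above
tps = {1: 19.20, 2: 41.40, 3: 61.37, 4: 81.70, 5: 101.00, 6: 121.58, 7: 140.73,
       8: 160.49, 9: 178.98, 10: 198.23, 11: 217.93, 12: 237.09, 13: 256.82,
       14: 274.49, 15: 294.11, 16: 312.92}

def largest_linear_batch(tps, threshold=0.8):
    """Return the largest batch size n whose per-token efficiency
    tps[n] / (n * tps[1]) is still above the threshold, i.e. the
    throughput still scales roughly linearly with the batch size."""
    best = 1
    for n in sorted(tps):
        if tps[n] / (n * tps[1]) >= threshold:
            best = n
    return best

print(largest_linear_batch(tps))  # 16 for these numbers, matching --draft 16 above
```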
Would be nice if more people give this branch a try and report any issues and/or results
```cpp
    llama_sample_softmax(ctx_main, &cur_p);

    return cur_p;
}
```
Is there some way to reuse the code from `llama_sampling_sample` and avoid the duplication? Also, this function does not take into account the grammar - is this correct?
A solution would be to have `llama_sampling_sample` utilize `llama_sample_probability_distribution` internally. However, this approach appeared too intrusive to the existing codebase at the time, which led to duplicating the code instead.

Yes, it seems like I should apply grammar constraints here.
* (WIP) Implement stochastic speculative decoding
* sample from residual distribution on draft accept failure
* fix ggerganov#5657: force greedy sampling with probs when temp is 0
* remove p_accept parameter
* fix style
* remove unused variables
* add srand() in speculative.cpp
* replace use of rand() with mt19937 sampling
* fixes based on review (@JohannesGaessler)
* fix r random generation
* randomly select next sequence to verify + fix bug in memory freeing
* fix bug in active_seqs sync
* fix uniform int distribution initialization
* remove warnings from comparison between int and size_t
* check grammar in `llama_sample_probability_distribution_impl`
* remove malloc code by utilizing vectors
* add PR link to README
Closes #5384
The implementation deviates slightly from the specified method by traversing draft sequences sequentially (from 0 to ~) instead of randomly selecting them (𝑠 ∼ rand(ℋ)). I would appreciate your feedback on whether this alteration should be corrected.
Additionally, a new method `llama_sampling_probability_distribution` has been introduced in sampling.h to retrieve the probability distribution of the target model for use in the residual distribution calculations. While it's noted that the distribution could be obtained through `ctx_sampling->cur`, it's important to maintain consistency with the distribution used when executed through ./main, considering factors such as penalties.

To achieve a comprehensive implementation of stochastic speculative decoding, it's essential to incorporate stochastic drafting for sampling drafts. Feedback is welcome on how existing parameters like `p_split` and `p_accept` should be integrated with stochastic drafting. Once this is clarified, I will refine the drafting code and remove the (WIP) from the PR title.

Seeking validation on the implementation. Kindly provide feedback on any identified issues or concerns. Thank you.