server: Fixed speculative decoding stats to use #accepted / #tested rather than #accepted / #drafted #14104


Merged: 1 commit merged on Jun 10, 2025

Conversation

jukofyork
Collaborator

Fixes #14048

The existing printout of "draft acceptance rate" is misleading, as it counts tokens from drafts that were skipped due to the --draft-min setting:

                // keep track of total number of tokens generated in the draft
                slot.n_draft_total += draft.size();

                // ignore small drafts
                if (slot.params.speculative.n_min > (int) draft.size()) {
                    SLT_DBG(slot, "ignoring small draft: %d < %d\n", (int) draft.size(), slot.params.speculative.n_min);

                    continue;
                }

This PR just moves the count increment to after this check, so that "draft acceptance rate" becomes #accepted / #tested rather than #accepted / #drafted.

As I said in the other thread, we could add a separate #accepted / #drafted stat, but this too would be misleading, as there is code in speculative.cpp to reuse the drafted tokens...

@jukofyork jukofyork requested a review from ngxson as a code owner June 10, 2025 13:02
@jukofyork jukofyork added the Review Complexity : Low Trivial changes to code that most beginner devs (or those who want a break) can tackle. e.g. UI fix label Jun 10, 2025
@jukofyork jukofyork merged commit 3a12db2 into ggml-org:master Jun 10, 2025
80 of 88 checks passed
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Jun 10, 2025
* origin/master:
llama : support GEGLU for jina-bert-v2 (ggml-org#14090)
vulkan: force device 0 in CI (ggml-org#14106)
Fixed spec timings to: accepted/tested instead of accepted/drafted (ggml-org#14104)
sync : ggml
ggml : fix weak alias win32 (whisper/0)
Vulkan: Don't default to CPU device (like llvmpipe), even if no other device is available, to allow fallback to CPU backend (ggml-org#14099)
rpc : nicer error messages for RPC server crash (ggml-org#14076)
sync : ggml
Add in-build ggml::ggml ALIAS library (ggml/1260)
metal : use less stack memory in FA kernel (ggml-org#14088)
kv-cache : fix shift and defrag logic (ggml-org#14081)
llama : allow building all tests on windows when not using shared libs (ggml-org#13980)
Labels
examples, Review Complexity : Low, server

Successfully merging this pull request may close these issues.

Feature Request: Speculative Decoding "acceptance rate" should not count drafts that were skipped via the "ignore small drafts" clause