Kimi-K2-Vendor-Verifier: llama.cpp, Q8_0

Moonshot AI recently released K2-Vendor-Verifier, a solution for verifying Kimi K2 implementations.

A sample dataset of chats is supplied (samples.jsonl) with measured results for Moonshot AI's reference platform.

Similarity is based on frequency of next-turn completion stop-reasons, JSON decode errors, schema checks, etc.

Result: For the n=2,000 dataset and original formula, the similarity is 95.49%

Model

Model is Kimi-K2-Instruct-0905, HF commit 94a4053, 2025-10-22.

Curiously, K2VV's contributors include a "-preview" suffix when referring to the model.

Test setup

The Git history for K2VV is chaotic. The benchmark values (README.md) and the dataset (samples.jsonl) are not always in sync; which pairing you get depends on the commit.

Change in analysis

The authors have described two different methods for computing the similarity value.

  1. Until commit 91a154a, the README describes a formula using Euclidean distance from the reference platform's measurements (sketched below).
  2. From commit 91a154a, the README describes a method using the mean value of various frequency ratios.
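
For reference, the original (Euclidean) method, as implemented later in this README, amounts to:

  similarity = 1 - sqrt( sum_k (reference_k - measured_k)^2 ) / N

where k ranges over the measured categories (finish_stop, finish_tool_calls, finish_others, schema_validation_error_count, successful_tool_call_count) and N is the number of samples (2,000 here).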

Dataset / Reference commit badness

The initial commit up to 1cbe205 gives reference values for an n=2000 sample dataset, but the included samples.jsonl is incomplete. In commit f0e5198, README.md is updated with an n=4000 sample result, but samples.jsonl is updated to (presumably) the n=2000 dataset. In commit 91a154a, the README is updated to the newer method, and the n=4000 sample set is posted.

Additionally, the n=4000 set describes a change in the tool-call invocations.

I happened to clone K2VV at f0e5198, and did not notice these issues until later.

The results here are based on the original Euclidean-distance formula and the n=2000 dataset.

  • Dataset (samples.jsonl) is from f0e5198
  • Reference values (README.md) are from 1cbe205

Curiously, the 2000-sample dataset contains duplicate rows:

% wc -l samples.jsonl
2000 samples.jsonl
% sort samples.jsonl | uniq | wc -l
1979

My output results.jsonl includes 1,985 rows. K2VV doesn't provide any commentary with respect to the duplicates, but it does include a method of matching results with input rows by value, so I speculate the authors are aware of this.
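
To see which rows repeat, and how often, standard coreutils suffice, e.g. (sketch only, no output shown):

% sort samples.jsonl | uniq -d | wc -l            # distinct rows that occur more than once
% sort samples.jsonl | uniq -c | sort -rn | head  # most-repeated rows and their counts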

Implementation details

Conversion to GGUF

1. HF/safetensors to GGUF (BF16)

Conversion to GGUF was done with llama.cpp (851553e) convert_hf_to_gguf.py, patched to support FP8. The FP8 support is taken from evshiron's fork, as promoted in various quantizing threads (e.g. ikawrakow/ik_llama.cpp#258).
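
A sketch of the conversion invocation (paths are placeholders; the actual command line may have differed):

# sketch only: run the FP8-patched convert script noted above; paths are placeholders
python convert_hf_to_gguf.py /path/to/Kimi-K2-Instruct-0905 \
    --outtype bf16 \
    --outfile /path/to/Kimi-K2-0905-BF16.gguf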

Conversion gives a BF16 GGUF image, ~2TB.

2. GGUF to GGUF (Q8_0)

Quantization was done with ik_llama's llama-quantize, using plain Q8_0 throughout (no custom tensor overrides, etc.).

This gives a ~1TB GGUF image, which I split into chunks w/llama-gguf-split.
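
For reference, a sketch of the two commands (paths and the split size are placeholders, not the exact invocation used):

# sketch only: paths and split size are placeholders
./llama-quantize /path/to/Kimi-K2-0905-BF16.gguf /path/to/Kimi-K2-0905-Q8_0.gguf Q8_0
./llama-gguf-split --split --split-max-size 48G \
    /path/to/Kimi-K2-0905-Q8_0.gguf /path/to/Kimi-K2-0905-Q8_0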

3. Chat template trouble.

The latest chat_template.jinja from HF is not compatible with llama.cpp. Tools are stamped out with tojson using a parameter that llama.cpp's template engine does not support. I patched chat_template.jinja to remove the problematic parameter. It's unclear what this parameter is for (I can't find it in current Jinja/Minja docs), but I speculate it either has no effect (',' and ':' are the standard separator characters after all) or merely removes whitespace (': ' -> ':').

--- chat_template.jinja 2025-10-28 14:32:49.869564619 -0500
+++ ../chat_template_fixed.jinja        2025-10-29 09:18:41.848487773 -0500
@@ -15,7 +15,7 @@


 {%- if tools -%}
-  <|im_system|>tool_declare<|im_middle|>{{ tools | tojson(separators=(',', ':')) }}<|im_end|>
+  <|im_system|>tool_declare<|im_middle|>{{ tools | tojson }}<|im_end|>
 {%- endif -%}
 {% for message in messages %}
   {%- if loop.first and messages[0]['role'] != 'system' -%}

Hardware

Hardware: 2x EPYC 9115, NPS1, 1.5TB DDR5 (12 DIMMs/socket)
GPU offload: 1x NVIDIA RTX 6000 Pro, CUDA 13.0

llama.cpp prep

llama.cpp commit 31c511a96 (2025-10-31)

build:

cmake -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DGGML_NATIVE=ON \
  -DGGML_CCACHE=OFF \
  -DGGML_CUDA=ON \
  -DBUILD_SHARED_LIBS=OFF \
  -DGGML_SCHED_MAX_COPIES=1 \
  -DGGML_VULKAN=OFF \
  -DGGML_OPENMP=ON \
  -DLLAMA_CURL=OFF

cmake --build build --config Release -j16 --target llama-server

output exe:

% ./build/bin/llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, compute capability 12.0, VMM: yes
version: 6904 (31c511a96)
built with cc (Debian 12.2.0-14+deb12u1) 12.2.0 for x86_64-linux-gnu

invocation:

echo 3 | sudo tee /proc/sys/vm/drop_caches  # numa distribute protocol

LLAMA_SET_ROWS=1 ~/llama.cpp/build/bin/llama-server \
    -t 28 \
    -c 131072 \
    -np 1 \
    -b 4096 -ub 4096 \
    -fa on \
    --numa distribute \
    -ngl 999 --cpu-moe \
    --cache-ram 200000 \
    --host 0.0.0.0 \
    --port 9090 \
    --jinja \
    --chat-template-file /path/to/Kimi-K2-Instruct-0905/chat_template_fixed.jinja \
    -m /path/to/Kimi-K2-Instruct-0905/Kimi-K2-0905-Q8_0-00001-of-00023.gguf

example prompt run:

slot launch_slot_: id  0 | task 352475 | processing task
slot update_slots: id  0 | task 352475 | new prompt, n_ctx_slot = 131072, n_keep = 0, task.n_tokens = 1171
slot update_slots: id  0 | task 352475 | n_tokens = 184, memory_seq_rm [184, end)
slot update_slots: id  0 | task 352475 | prompt processing progress, n_tokens = 1171, batch.n_tokens = 987, progress = 1.000000
slot update_slots: id  0 | task 352475 | prompt done, n_tokens = 1171, batch.n_tokens = 987
slot print_timing: id  0 | task 352475 |
prompt eval time =   29386.74 ms /   987 tokens (   29.77 ms per token,    33.59 tokens per second)
       eval time =    8047.66 ms /   104 tokens (   77.38 ms per token,    12.92 tokens per second)
      total time =   37434.40 ms /  1091 tokens
srv  log_server_r: request: POST /chat/completions 10.3.0.107 200
slot      release: id  0 | task 352475 | stop processing: n_tokens = 1274, truncated = 0
srv  update_slots: all slots are idle

Execution

python tool_calls_eval.py samples.jsonl \
  --model kimi-k2-0905-preview \
  --base-url "http://x.y.z.w:9090" \
  --api-key x \
  --concurrency 1 \
  --output results_q8_llama.jsonl \
  --summary summary.json \
  --incremental

I ran the set in various batches over time. Total compute time was not tracked, but it's roughly 50s/sample or just over 24 hours. I modified the test tool as described in Pull #16.

Results

Baseline results (README.md from 1cbe205)

baseline = {
    "finish_stop": 1437,
    "finish_tool_calls": 522,
    "finish_others": 41,
    "schema_validation_error_count": 0,
    "successful_tool_call_count": 522,
}

Results (summary.json computed by K2VV):

{
  "model": "kimi-k2-0905-preview",
  "success_count": 2000,
  "failure_count": 0,
  "finish_stop": 1387,
  "finish_tool_calls": 575,
  "finish_others": 38,
  "finish_others_detail": {
    "length": 38
  },
  "schema_validation_error_count": 0,
  "successful_tool_call_count": 575
}
# my code; the formula is described in the K2VV README but not implemented by the tool
import math

# `baseline` is the dict above; `summary` is the dict loaded from summary.json
similarity = 1 - math.sqrt(sum((baseline[k] - summary[k]) ** 2
                               for k in baseline)) / 2000.0
# similarity = 0.9549
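
Plugging in the counts above:

sqrt((1437-1387)^2 + (522-575)^2 + (41-38)^2 + (0-0)^2 + (522-575)^2) = sqrt(8127) ≈ 90.15

similarity = 1 - 90.15/2000 ≈ 0.9549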

Bonus: partial results for ik_llama + Q4,Q4,Q6 MoE

I started by running my usual setup, Q8 offload + gate/up/down=Q4/Q4/Q6 with ik_llama. The preliminary results gave a very dim outlook, since I had not yet noticed the README.md/samples.jsonl discrepancies described above. Given the apparently poor results, I switched to Q8_0 to establish a baseline before testing any smaller quants. I also switched to mainline llama.cpp, since I intended to compare llama.cpp implementations as well.

Here is n=419 vs. linearly scaled-down reference values.

{
  "model": "kimi-k2-0905-preview",
  "success_count": 419,
  "failure_count": 0,
  "finish_stop": 298,
  "finish_tool_calls": 121,
  "finish_others": 0,
  "finish_others_detail": {},
  "schema_validation_error_count": 1,
  "successful_tool_call_count": 120
}
similarity score: 99.09

Additional note about K2 GGUFs

When K2 was new, at least two different people had concurrent works-in-progress on conversion scripts and llama.cpp support. I recall much of the noise during that development was around the tokenizer, but some of the details are fuzzy now. Comparing some of the common GGUFs available, there is variation in the tokens, both in the string/number table and in the special-token assignments. I first noticed this when trying to use jukofyork's DRAFT models, which would not bind with e.g. ubergarm's quants due to special-token mismatches.

Unsloth's GGUFs include dubious "chat-template fixes," which AFAIK are not described in detail, and are perhaps inconsistent with trying to reproduce the reference platform.

Additionally, the OEM (Moonshot AI) has updated the tokenizer and chat template at various times since the model's initial release. I don't get the impression that GGUF vendors produce updates in response to these upstream changes.

The build-from-source approach described above is a response to this situation.
