Moonshot AI recently released K2-Vendor-Verifier, a solution for verifying Kimi K2 implementations.
A sample dataset of chats is supplied (samples.jsonl) with measured results for Moonshot AI's reference platform.
Similarity is based on frequency of next-turn completion stop-reasons, JSON decode errors, schema checks, etc.
Result: for the n=2,000 dataset and the original formula, the similarity is 95.49%.
Model is Kimi-K2-Instruct-0905, HF commit 94a4053, 2025-10-22.
Curiously, K2VV's contributors include a "-preview" suffix when referring to the model.
The Git history for K2VV is chaotic: depending on the commit, the benchmark values (README.md) and the dataset (samples.jsonl) are out of sync with each other.
The authors have given two different methods to compute the similarity values.
- Until commit 91a154a, the README describes a formula using the Euclidean distance from the reference platform's measurements.
- From commit 91a154a, the README describes a method using the mean value of various frequency ratios.
From the initial commit up to 1cbe205, the README gives reference values for an n=2000 sample dataset, but the included samples.jsonl is incomplete. In commit f0e5198, README.md is updated to an n=4000 sample result, while samples.jsonl is updated to (presumably) the n=2000 dataset. In commit 91a154a, the README is updated to the newer method, and the n=4000 sample set is posted.
Additionally, the n=4000 set describes a change in the tool-call invocations.
I happened to clone K2VV at f0e5198, and did not notice these issues until later.
The results here are based on the original Euclidean-distance formula and the n=2000 dataset.
- Dataset (samples.jsonl) is from f0e5198
- Reference values (README.md) are from 1cbe205
Curiously, the 2000-sample dataset contains duplicate rows:
% wc -l samples.jsonl
2000 samples.jsonl
% sort samples.jsonl | uniq | wc -l
1979
My output results.jsonl includes 1,985 rows. K2VV doesn't provide any commentary regarding the duplicates, but it does include a method of matching results with input rows by value, so I speculate the authors are aware of this.
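For illustration, here is a minimal sketch of what value-based matching could look like. This is not K2VV's actual code, and the field names (messages, tools) and file names are assumptions about the JSONL schemas. It also shows why duplicate inputs collapse to a single entry:

import hashlib
import json

def row_key(row: dict) -> str:
    # Key a sample by its request payload rather than by line number.
    payload = {"messages": row.get("messages"), "tools": row.get("tools")}
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

with open("samples.jsonl") as f:
    samples = [json.loads(line) for line in f]

results = {}
with open("results_q8_llama.jsonl") as f:
    for line in f:
        row = json.loads(line)
        results[row_key(row)] = row  # duplicate payloads overwrite, leaving one entry

print(len(samples), len({row_key(s) for s in samples}), len(results))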
Conversion to GGUF was done with llama.cpp (851553e) convert_hf_to_gguf.py, patched to support FP8. The FP8 support was taken from evshiron's fork, as promoted in various quantizing threads (e.g. ikawrakow/ik_llama.cpp#258).
Conversion gives a BF16 GGUF image of ~2TB.
Quantization to Q8_0 was done with ik_llama's llama-quantize, using plain Q8_0 throughout. (No custom tensor overrides etc.)
This gives a ~1TB GGUF image, which I split into chunks with llama-gguf-split.
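For reference, a hypothetical sketch of the pipeline as a script; the flags, paths, and split size below are illustrative assumptions, not the exact commands I ran:

import subprocess

HF_DIR = "/path/to/Kimi-K2-Instruct-0905"      # FP8 HF checkpoint
BF16 = "/scratch/Kimi-K2-0905-BF16.gguf"       # ~2TB intermediate
Q8 = "/scratch/Kimi-K2-0905-Q8_0.gguf"         # ~1TB quantized

# 1. FP8 HF checkpoint -> BF16 GGUF with the patched convert_hf_to_gguf.py
subprocess.run(["python", "convert_hf_to_gguf.py", HF_DIR,
                "--outtype", "bf16", "--outfile", BF16], check=True)

# 2. BF16 -> Q8_0 with ik_llama's llama-quantize (no tensor overrides)
subprocess.run(["./llama-quantize", BF16, Q8, "Q8_0"], check=True)

# 3. Split into shards with llama-gguf-split (split size is arbitrary here)
subprocess.run(["./llama-gguf-split", "--split", "--split-max-size", "48G",
                Q8, "/models/Kimi-K2-0905-Q8_0"], check=True)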
The latest chat_template.jinja from HF is not compatible with llama.cpp: tools are stamped out with tojson using a separators parameter that is not supported. I patched chat_template.jinja to remove the problematic parameter. It's unclear what this parameter is for (I can't find it in the current Jinja/Minja docs), but it matches Python json.dumps's separators argument, so I speculate it either has no effect (',' and ':' are the standard separator chars after all) or merely drops whitespace (': ' -> ':').
--- chat_template.jinja 2025-10-28 14:32:49.869564619 -0500
+++ ../chat_template_fixed.jinja 2025-10-29 09:18:41.848487773 -0500
@@ -15,7 +15,7 @@
{%- if tools -%}
- <|im_system|>tool_declare<|im_middle|>{{ tools | tojson(separators=(',', ':')) }}<|im_end|>
+ <|im_system|>tool_declare<|im_middle|>{{ tools | tojson }}<|im_end|>
{%- endif -%}
{% for message in messages %}
{%- if loop.first and messages[0]['role'] != 'system' -%}
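To illustrate the whitespace speculation: in Python's json.dumps, which is presumably what the template's tojson wraps, separators=(',', ':') only removes the spaces after the delimiters. The tool list here is purely for demonstration:

import json

tools = [{"type": "function", "function": {"name": "get_weather"}}]

# Default separators are (', ', ': ') when indent is None.
print(json.dumps(tools))
# [{"type": "function", "function": {"name": "get_weather"}}]

# separators=(',', ':') only drops the spaces after ',' and ':'.
print(json.dumps(tools, separators=(",", ":")))
# [{"type":"function","function":{"name":"get_weather"}}]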
Hardware: 2x EPYC 9115, NPS1, 1.5TB DDR5 (12 DIMMs/socket)
GPU offload: 1x NVIDIA RTX 6000 Pro, CUDA 13.0
llama.cpp commit 31c511a96 (2025-10-31)
build:
cmake -B build \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_NATIVE=ON \
-DGGML_CCACHE=OFF \
-DGGML_CUDA=ON \
-DBUILD_SHARED_LIBS=OFF \
-DGGML_SCHED_MAX_COPIES=1 \
-DGGML_VULKAN=OFF \
-DGGML_OPENMP=ON \
-DLLAMA_CURL=OFF
cmake --build build --config Release -j16 --target llama-server
output executable:
% ./build/bin/llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, compute capability 12.0, VMM: yes
version: 6904 (31c511a96)
built with cc (Debian 12.2.0-14+deb12u1) 12.2.0 for x86_64-linux-gnu
invocation:
echo 3 | sudo tee /proc/sys/vm/drop_caches # numa distribute protocol
LLAMA_SET_ROWS=1 ~/llama.cpp/build/bin/llama-server \
-t 28 \
-c 131072 \
-np 1 \
-b 4096 -ub 4096 \
-fa on \
--numa distribute \
-ngl 999 --cpu-moe \
--cache-ram 200000 \
--host 0.0.0.0 \
--port 9090 \
--jinja \
--chat-template-file /path/to/Kimi-K2-Instruct-0905/chat_template_fixed.jinja \
-m /path/to/Kimi-K2-Instruct-0905/Kimi-K2-0905-Q8_0-00001-of-00023.gguf
example prompt run:
slot launch_slot_: id 0 | task 352475 | processing task
slot update_slots: id 0 | task 352475 | new prompt, n_ctx_slot = 131072, n_keep = 0, task.n_tokens = 1171
slot update_slots: id 0 | task 352475 | n_tokens = 184, memory_seq_rm [184, end)
slot update_slots: id 0 | task 352475 | prompt processing progress, n_tokens = 1171, batch.n_tokens = 987, progress = 1.000000
slot update_slots: id 0 | task 352475 | prompt done, n_tokens = 1171, batch.n_tokens = 987
slot print_timing: id 0 | task 352475 |
prompt eval time = 29386.74 ms / 987 tokens ( 29.77 ms per token, 33.59 tokens per second)
eval time = 8047.66 ms / 104 tokens ( 77.38 ms per token, 12.92 tokens per second)
total time = 37434.40 ms / 1091 tokens
srv log_server_r: request: POST /chat/completions 10.3.0.107 200
slot release: id 0 | task 352475 | stop processing: n_tokens = 1274, truncated = 0
srv update_slots: all slots are idle
python tool_calls_eval.py samples.jsonl \
--model kimi-k2-0905-preview \
--base-url "http://x.y.z.w:9090" \
--api-key x \
--concurrency 1 \
--output results_q8_llama.jsonl \
--summary summary.json \
--incremental
I ran the set in various batches over time. Total compute time was not tracked, but it's roughly 50s/sample or just over 24 hours. I modified the test tool as described in Pull #16.
Baseline results (README.md from 1cbe205):
baseline = {
"finish_stop": 1437,
"finish_tool_calls": 522,
"finish_others": 41,
"schema_validation_error_count": 0,
"successful_tool_call_count": 522,
}
Results (summary.json computed by K2VV):
{
"model": "kimi-k2-0905-preview",
"success_count": 2000,
"failure_count": 0,
"finish_stop": 1387,
"finish_tool_calls": 575,
"finish_others": 38,
"finish_others_detail": {
"length": 38
},
"schema_validation_error_count": 0,
"successful_tool_call_count": 575
}
# My code; the formula is described in the README but not provided by the tool.
import json
import math

summary = json.load(open("summary.json"))
similarity = 1 - math.sqrt(sum((baseline[k] - summary[k]) ** 2
                               for k in baseline)) / 2000.0
# similarity = 0.9549
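Worked out with the counts above: the per-key differences are 50, 53, 3, 0, and 53; the sum of squares is 8127; sqrt(8127) ≈ 90.15; and 1 - 90.15/2000 ≈ 0.9549.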
I originally started with my usual setup: Q8 offload plus gate/up/down = Q4/Q4/Q6 with ik_llama. The preliminary results looked very dim, since I had not yet noticed the README.md/samples.jsonl discrepancies described above. Given the poor results, I switched to Q8_0 to establish a baseline before testing any smaller quants, and I also switched to mainline llama.cpp since I intended to compare llama.cpp implementations as well.
Here is an n=419 partial run vs. linearly scaled-down reference values:
{
"model": "kimi-k2-0905-preview",
"success_count": 419,
"failure_count": 0,
"finish_stop": 298,
"finish_tool_calls": 121,
"finish_others": 0,
"finish_others_detail": {},
"schema_validation_error_count": 1,
"successful_tool_call_count": 120
}
similarity score: 99.09
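Reproducing that figure: scaling the baseline counts by 419/2000 and keeping the original formula's 2000-sample denominator gives the number as stated. A quick check, reusing the baseline dict from above:

import math

scale = 419 / 2000.0
subset = {
    "finish_stop": 298,
    "finish_tool_calls": 121,
    "finish_others": 0,
    "schema_validation_error_count": 1,
    "successful_tool_call_count": 120,
}
similarity = 1 - math.sqrt(sum((baseline[k] * scale - subset[k]) ** 2
                               for k in baseline)) / 2000.0
print(f"{similarity:.2%}")  # 99.09%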
When K2 was new, at least two different people had concurrent work-in-progress on conversion scripts and llama.cpp support. I recall that much of the noise in that development was around the tokenizer, but some of the details are fuzzy now. Comparing some of the common GGUFs available, there is variation in the tokens, both in the string/number table and in the special-token assignments. I first noticed this when trying to use jukofyork's DRAFT models, which would not work with e.g. ubergarm's quants due to special-token mismatches.
Unsloth's GGUFs include dubious "chat-template fixes," which AFAIK are not described in detail, and are perhaps inconsistent with trying to reproduce the reference platform.
Additionally, the OEM (Moonshot AI) has updated the tokenizer and chat template at various times since the model's initial release. I don't get the impression that GGUF vendors produce updates in response to upstream changes.
The build-from-source approach described above is a response to this situation.