Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion .github/validate-models.bat
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@ REM GGUF files start with magic bytes: 0x47 0x47 0x55 0x46 ("GGUF")

setlocal enabledelayedexpansion

set "MODELS=models\codellama-7b.Q2_K.gguf" "models\jina-reranker-v1-tiny-en-Q4_0.gguf" "models\AMD-Llama-135m-code.Q2_K.gguf"
set "MODELS=models\codellama-7b.Q2_K.gguf" "models\jina-reranker-v1-tiny-en-Q4_0.gguf" "models\AMD-Llama-135m-code.Q2_K.gguf" "models\Qwen3-0.6B-Q4_K_M.gguf"

echo Validating model files...
for %%M in (%MODELS%) do (
Expand Down
1 change: 1 addition & 0 deletions .github/validate-models.sh
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@ MODELS=(
"models/codellama-7b.Q2_K.gguf"
"models/jina-reranker-v1-tiny-en-Q4_0.gguf"
"models/AMD-Llama-135m-code.Q2_K.gguf"
"models/Qwen3-0.6B-Q4_K_M.gguf"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good: Model validation array updated consistently with Windows batch file. Script will properly detect missing or corrupt model files.

)

echo "Validating model files..."
Expand Down
12 changes: 12 additions & 0 deletions .github/workflows/publish.yml
Original file line number Diff line number Diff line change
Expand Up @@ -21,6 +21,8 @@ env:
RERANKING_MODEL_NAME: "jina-reranker-v1-tiny-en-Q4_0.gguf"
DRAFT_MODEL_URL: "https://huggingface.co/QuantFactory/AMD-Llama-135m-code-GGUF/resolve/main/AMD-Llama-135m-code.Q2_K.gguf"
DRAFT_MODEL_NAME: "AMD-Llama-135m-code.Q2_K.gguf"
REASONING_MODEL_URL: "https://huggingface.co/unsloth/Qwen3-0.6B-GGUF/resolve/main/Qwen3-0.6B-Q4_K_M.gguf"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good: Environment variables for reasoning model added consistently with the existing pattern (RERANKING_MODEL_URL, DRAFT_MODEL_URL). The download URL points to a trusted HuggingFace source.

REASONING_MODEL_NAME: "Qwen3-0.6B-Q4_K_M.gguf"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new reasoning model environment variables are correctly added to the top-level env block and will be available to all five CI jobs. Ensure each job's model download steps use the same curl pattern with proper error handling (--fail flag is present).

permissions:
contents: read
jobs:
Expand Down Expand Up @@ -308,6 +310,8 @@ jobs:
run: curl -L --fail ${RERANKING_MODEL_URL} --create-dirs -o models/${RERANKING_MODEL_NAME}
- name: Download draft model
run: curl -L --fail ${DRAFT_MODEL_URL} --create-dirs -o models/${DRAFT_MODEL_NAME}
- name: Download reasoning model
run: curl -L --fail ${REASONING_MODEL_URL} --create-dirs -o models/${REASONING_MODEL_NAME}
- name: List files in models directory
run: ls -l models/
- name: Validate model files
Expand Down Expand Up @@ -357,6 +361,8 @@ jobs:
run: curl -L --fail ${RERANKING_MODEL_URL} --create-dirs -o models/${RERANKING_MODEL_NAME}
- name: Download draft model
run: curl -L --fail ${DRAFT_MODEL_URL} --create-dirs -o models/${DRAFT_MODEL_NAME}
- name: Download reasoning model
run: curl -L --fail ${REASONING_MODEL_URL} --create-dirs -o models/${REASONING_MODEL_NAME}
- name: List files in models directory
run: ls -l models/
- name: Validate model files
Expand Down Expand Up @@ -417,6 +423,8 @@ jobs:
run: curl -L --fail ${RERANKING_MODEL_URL} --create-dirs -o models/${RERANKING_MODEL_NAME}
- name: Download draft model
run: curl -L --fail ${DRAFT_MODEL_URL} --create-dirs -o models/${DRAFT_MODEL_NAME}
- name: Download reasoning model
run: curl -L --fail ${REASONING_MODEL_URL} --create-dirs -o models/${REASONING_MODEL_NAME}
- name: List files in models directory
run: ls -l models/
- name: Validate model files
Expand Down Expand Up @@ -468,6 +476,8 @@ jobs:
run: curl -L --fail ${RERANKING_MODEL_URL} --create-dirs -o models/${RERANKING_MODEL_NAME}
- name: Download draft model
run: curl -L --fail ${DRAFT_MODEL_URL} --create-dirs -o models/${DRAFT_MODEL_NAME}
- name: Download reasoning model
run: curl -L --fail ${REASONING_MODEL_URL} --create-dirs -o models/${REASONING_MODEL_NAME}
- name: List files in models directory
run: ls -l models/
- name: Validate model files
Expand Down Expand Up @@ -522,6 +532,8 @@ jobs:
run: curl -L --fail $env:RERANKING_MODEL_URL --create-dirs -o models/$env:RERANKING_MODEL_NAME
- name: Download draft model
run: curl -L --fail $env:DRAFT_MODEL_URL --create-dirs -o models/$env:DRAFT_MODEL_NAME
- name: Download reasoning model
run: curl -L --fail $env:REASONING_MODEL_URL --create-dirs -o models/$env:REASONING_MODEL_NAME
- name: List files in models directory
run: ls -l models/
- name: Validate model files
Expand Down
225 changes: 225 additions & 0 deletions src/test/java/net/ladenthin/llama/ReasoningBudgetTest.java
Original file line number Diff line number Diff line change
@@ -0,0 +1,225 @@
package net.ladenthin.llama;

import java.io.File;
import java.util.Collections;

import net.ladenthin.llama.args.ReasoningFormat;
import net.ladenthin.llama.json.ChatResponseParser;
import org.junit.AfterClass;
import org.junit.Assert;
import org.junit.Assume;
import org.junit.BeforeClass;
import org.junit.Ignore;
import org.junit.Test;

/**
* Integration tests for thinking/reasoning mode using Qwen3-0.6B-Q4_K_M.
*
* <p>These tests require the Qwen3-0.6B-Q4_K_M model (downloaded by CI). The entire
* class is skipped when the model file is absent, matching the pattern used by all
* other model-dependent test classes.
*
* <h2>Confirmed behaviour (Qwen3-0.6B, llama.cpp b9151)</h2>
* <ol>
* <li><b>Thinking is active by default.</b> Qwen3's built-in chat template injects
* {@code <think>} into the prompt before generation starts. No extra kwarg is
* required; the model reasons on every request.</li>
* <li><b>DEEPSEEK reasoning format correctly extracts thinking tokens.</b> Setting
* {@code --reasoning-format deepseek} at model load time causes the server to
* strip the {@code <think>…</think>} block from the response body and surface it
* in {@code reasoning_content}.</li>
* <li><b>{@code reasoning_budget_tokens} is NOT enforced for any model when set
* per-request.</b> The root cause is a bug in
* {@code tools/server/server-common.cpp}, function
* {@code oaicompat_chat_params_parse}: the reasoning-budget block writes
* the model-level default ({@code opt.reasoning_budget}, typically &#x2212;1)
* into {@code llama_params["reasoning_budget_tokens"]} before the generic
* copy loop runs. The copy loop then skips the per-request value from the
* request body because the key already exists
* ({@code !llama_params.contains(item.key())} is false). Result: the
* reasoning-budget sampler is never created (it requires
* {@code reasoning_budget_tokens &#x2265; 0}), and any per-request budget
* has no effect. Parameter serialisation itself is correct — see
* {@code InferenceParametersTest} and the C++ unit tests.</li>
* </ol>
*/
@ClaudeGenerated(
purpose = "Integration tests for Qwen3 thinking-mode extraction and reasoning_budget_tokens " +
"parameter acceptance. Documents the known llama.cpp limitation that budget " +
"enforcement does not work for prompt-injected thinking models."
)
public class ReasoningBudgetTest {

/**
* Generous token budget: Qwen3-0.6B spends up to ~200 tokens thinking before answering.
* 500 is enough for thinking + a short answer on all tested platforms.
*/
private static final int N_PREDICT = 500;

private static LlamaModel model;
private final ChatResponseParser parser = new ChatResponseParser();

@BeforeClass
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good practice: @BeforeClass with Assume.assumeTrue() ensures the test class gracefully skips when the model is unavailable, matching the pattern used by ChatAdvancedTest and other model-dependent tests.

public static void setup() {
Assume.assumeTrue("Reasoning model not found, skipping ReasoningBudgetTest",
new File(TestConstants.REASONING_MODEL_PATH).exists());
int gpuLayers = Integer.getInteger(TestConstants.PROP_TEST_NGL, TestConstants.DEFAULT_TEST_NGL);
model = new LlamaModel(
new ModelParameters()
.setModel(TestConstants.REASONING_MODEL_PATH)
.setCtxSize(2048)
.setGpuLayers(gpuLayers)
.setFit(false)
.setReasoningFormat(ReasoningFormat.DEEPSEEK)
.enableLogTimestamps().enableLogPrefix()
);
}

@AfterClass
public static void tearDown() {
if (model != null) {
model.close();
}
}

/**
* Qwen3 enters thinking mode by default. With {@code reasoning_format=deepseek} set
* at model level, the thinking tokens must appear in {@code reasoning_content} and
* the final answer must appear in {@code content}.
*/
@Test
public void testThinkingDefault_reasoningContentAndAnswerPresent() {
InferenceParameters params = new InferenceParameters("")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good defensive check: setMessages(null, Collections.singletonList(...)) is correct usage per the API (first parameter is system message, which can be null). Clear and follows existing test patterns.

.setMessages(null, Collections.singletonList(new Pair<>("user", "What is 2+2?")))
.setNPredict(N_PREDICT);

String json = model.chatComplete(params);
String reasoningContent = parser.extractChoiceReasoningContent(json);
String content = parser.extractChoiceContent(json);

Assert.assertFalse(
"reasoning_content should be non-empty (Qwen3 thinks by default)",
reasoningContent == null || reasoningContent.trim().isEmpty());
Assert.assertFalse(
"content must not be empty (model must produce an answer after thinking)",
content == null || content.trim().isEmpty());
}

/**
* {@code reasoning_budget_tokens=0} is accepted by the API and the response
* completes without error, but the budget is NOT enforced.
*
* <p><b>Documents current (broken) behaviour.</b> The per-request value is
* silently discarded by a bug in {@code tools/server/server-common.cpp}
* ({@code oaicompat_chat_params_parse}): the reasoning-budget block writes the
* model-level default (&#x2212;1) to {@code llama_params["reasoning_budget_tokens"]}
* before the generic copy loop runs, and the copy loop then skips the user value
* because the key already exists. The reasoning-budget sampler is therefore never
* created, and {@code reasoning_content} remains non-empty.
*
* <p>This assertion will start <b>failing</b> once the llama.cpp bug is fixed —
* that is the signal to remove this test and enable
* {@link #testReasoningBudgetZero_expectedBehavior_suppressesThinking}.
*/
@Test
public void testReasoningBudgetZero_parameterAccepted_thinkingNotSuppressed() {
InferenceParameters params = new InferenceParameters("")
.setMessages(null, Collections.singletonList(new Pair<>("user", "What is 2+2?")))
.setReasoningBudgetTokens(0)
.setNPredict(N_PREDICT);

String json = model.chatComplete(params);

Assert.assertNotNull("Response JSON must not be null", json);

String reasoningContent = parser.extractChoiceReasoningContent(json);
Assert.assertFalse(
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The assertion logic here is intentional and well-documented: assertFalse on empty reasoning_content means the assertion passes when reasoning_content is non-empty. This documents the current (undesired) behavior where Qwen3's budget enforcement doesn't work due to prompt-injected thinking.

This is clear from the comment above, but worth noting that when llama.cpp adds support for prompt-prefilled thinking contexts (as mentioned in line 129), this assertion will start failing — and that's the signal to flip it to assertTrue.

"reasoning_content is expected to be present because the per-request " +
"budget is not applied (llama.cpp server-common.cpp copy-loop bug). " +
"If this assertion fails, the bug has been fixed — remove this test and " +
"enable testReasoningBudgetZero_expectedBehavior_suppressesThinking.",
reasoningContent == null || reasoningContent.trim().isEmpty());
}

/**
* Expected correct behaviour after the llama.cpp bug is fixed.
*
* <p><b>Bug:</b> In {@code tools/server/server-common.cpp},
* {@code oaicompat_chat_params_parse} sets
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent implementation: the multi-part Javadoc provides the exact bug location, symptoms, and the minimal 3-line fix. This @ignore test serves as a perfect regression detector — once the server-common.cpp bug is fixed upstream, this test will fail (signal to remove @Ignore and delete the sister test), preventing the fix from being undone.

* {@code llama_params["reasoning_budget_tokens"]} to the model-level default
* ({@code opt.reasoning_budget}, typically &#x2212;1) before the generic copy
* loop runs. The copy loop then skips the per-request value from the request
* body because the key already exists. Result: the sampler is never created
* ({@code reasoning_budget_tokens &#x2265; 0} is required), and budget=0
* has no effect.
*
* <p><b>Fix (server-common.cpp, reasoning budget block):</b>
* Read {@code reasoning_budget_tokens} from the request body <em>before</em>
* writing to {@code llama_params}:
* <pre>
* int reasoning_budget = opt.reasoning_budget;
* if (body.contains("reasoning_budget_tokens")) {
* reasoning_budget = json_value(body, "reasoning_budget_tokens", reasoning_budget);
* }
* if (reasoning_budget == -1 &amp;&amp; body.contains("thinking_budget_tokens")) {
* reasoning_budget = json_value(body, "thinking_budget_tokens", -1);
* }
* </pre>
*
* <p>Once this fix is applied: remove {@code @Ignore}, confirm this test passes,
* and remove
* {@link #testReasoningBudgetZero_parameterAccepted_thinkingNotSuppressed}.
*/
@Ignore("llama.cpp bug: per-request reasoning_budget_tokens is overwritten by model default " +
"in oaicompat_chat_params_parse (server-common.cpp). " +
"See Javadoc for exact fix location and code.")
@Test
public void testReasoningBudgetZero_expectedBehavior_suppressesThinking() {
InferenceParameters params = new InferenceParameters("")
.setMessages(null, Collections.singletonList(new Pair<>("user", "What is 2+2?")))
.setReasoningBudgetTokens(0)
.setNPredict(N_PREDICT);

String json = model.chatComplete(params);
Assert.assertNotNull("Response JSON must not be null", json);

String reasoningContent = parser.extractChoiceReasoningContent(json);
Assert.assertTrue(
"reasoning_content should be empty when budget=0 suppresses thinking, " +
"but was: " + reasoningContent,
reasoningContent == null || reasoningContent.trim().isEmpty());
}

/**
* A positive {@code reasoning_budget_tokens} value is accepted and the call completes
* without error.
*
* <p>The assertion checks that the model produced a non-empty response — either in
* {@code reasoning_content} or {@code content}. On slow or constrained hardware the
* model may exhaust the token budget inside the thinking block and emit an empty
* {@code content}; checking both fields makes the test robust to that behaviour.
*
* <p>See {@link #testReasoningBudgetZero_parameterAccepted_thinkingNotSuppressed} for
* the note on why the budget count itself is not asserted.
*/
@Test
public void testReasoningBudgetPositive_parameterAccepted() {
InferenceParameters params = new InferenceParameters("")
.setMessages(null, Collections.singletonList(
new Pair<>("user", "Think step by step: what is 3 times 7?")))
.setReasoningBudgetTokens(100)
.setNPredict(N_PREDICT);

String json = model.chatComplete(params);
Assert.assertNotNull("Response JSON must not be null", json);

String reasoningContent = parser.extractChoiceReasoningContent(json);
String content = parser.extractChoiceContent(json);
boolean hasReasoning = reasoningContent != null && !reasoningContent.trim().isEmpty();
boolean hasContent = content != null && !content.trim().isEmpty();
Assert.assertTrue(
"model must produce at least some output in reasoning_content or content, " +
"but both were empty",
hasReasoning || hasContent);
}
}
3 changes: 3 additions & 0 deletions src/test/java/net/ladenthin/llama/TestConstants.java
Original file line number Diff line number Diff line change
Expand Up @@ -13,4 +13,7 @@ class TestConstants {
/** Path to the draft model used for speculative decoding tests. */
static final String DRAFT_MODEL_PATH = "models/AMD-Llama-135m-code.Q2_K.gguf";

/** Path to the Qwen3 thinking model used for reasoning budget tests. */
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good — the REASONING_MODEL_PATH constant follows the same pattern as MODEL_PATH and DRAFT_MODEL_PATH. Verify the path "models/Qwen3-0.6B-Q4_K_M.gguf" matches exactly with the REASONING_MODEL_NAME in publish.yml (currently correct).

static final String REASONING_MODEL_PATH = "models/Qwen3-0.6B-Q4_K_M.gguf";

}
Loading