-
Notifications
You must be signed in to change notification settings - Fork 2
Add integration tests for reasoning budget token enforcement #140
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
9a040b5
267deb3
a799c2e
30b47fc
14d3120
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -21,6 +21,8 @@ env: | |
| RERANKING_MODEL_NAME: "jina-reranker-v1-tiny-en-Q4_0.gguf" | ||
| DRAFT_MODEL_URL: "https://huggingface.co/QuantFactory/AMD-Llama-135m-code-GGUF/resolve/main/AMD-Llama-135m-code.Q2_K.gguf" | ||
| DRAFT_MODEL_NAME: "AMD-Llama-135m-code.Q2_K.gguf" | ||
| REASONING_MODEL_URL: "https://huggingface.co/unsloth/Qwen3-0.6B-GGUF/resolve/main/Qwen3-0.6B-Q4_K_M.gguf" | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Good: Environment variables for reasoning model added consistently with the existing pattern (RERANKING_MODEL_URL, DRAFT_MODEL_URL). The download URL points to a trusted HuggingFace source. |
||
| REASONING_MODEL_NAME: "Qwen3-0.6B-Q4_K_M.gguf" | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The new reasoning model environment variables are correctly added to the top-level env block and will be available to all five CI jobs. Ensure each job's model download steps use the same curl pattern with proper error handling ( |
||
| permissions: | ||
| contents: read | ||
| jobs: | ||
|
|
@@ -308,6 +310,8 @@ jobs: | |
| run: curl -L --fail ${RERANKING_MODEL_URL} --create-dirs -o models/${RERANKING_MODEL_NAME} | ||
| - name: Download draft model | ||
| run: curl -L --fail ${DRAFT_MODEL_URL} --create-dirs -o models/${DRAFT_MODEL_NAME} | ||
| - name: Download reasoning model | ||
| run: curl -L --fail ${REASONING_MODEL_URL} --create-dirs -o models/${REASONING_MODEL_NAME} | ||
| - name: List files in models directory | ||
| run: ls -l models/ | ||
| - name: Validate model files | ||
|
|
@@ -357,6 +361,8 @@ jobs: | |
| run: curl -L --fail ${RERANKING_MODEL_URL} --create-dirs -o models/${RERANKING_MODEL_NAME} | ||
| - name: Download draft model | ||
| run: curl -L --fail ${DRAFT_MODEL_URL} --create-dirs -o models/${DRAFT_MODEL_NAME} | ||
| - name: Download reasoning model | ||
| run: curl -L --fail ${REASONING_MODEL_URL} --create-dirs -o models/${REASONING_MODEL_NAME} | ||
| - name: List files in models directory | ||
| run: ls -l models/ | ||
| - name: Validate model files | ||
|
|
@@ -417,6 +423,8 @@ jobs: | |
| run: curl -L --fail ${RERANKING_MODEL_URL} --create-dirs -o models/${RERANKING_MODEL_NAME} | ||
| - name: Download draft model | ||
| run: curl -L --fail ${DRAFT_MODEL_URL} --create-dirs -o models/${DRAFT_MODEL_NAME} | ||
| - name: Download reasoning model | ||
| run: curl -L --fail ${REASONING_MODEL_URL} --create-dirs -o models/${REASONING_MODEL_NAME} | ||
| - name: List files in models directory | ||
| run: ls -l models/ | ||
| - name: Validate model files | ||
|
|
@@ -468,6 +476,8 @@ jobs: | |
| run: curl -L --fail ${RERANKING_MODEL_URL} --create-dirs -o models/${RERANKING_MODEL_NAME} | ||
| - name: Download draft model | ||
| run: curl -L --fail ${DRAFT_MODEL_URL} --create-dirs -o models/${DRAFT_MODEL_NAME} | ||
| - name: Download reasoning model | ||
| run: curl -L --fail ${REASONING_MODEL_URL} --create-dirs -o models/${REASONING_MODEL_NAME} | ||
| - name: List files in models directory | ||
| run: ls -l models/ | ||
| - name: Validate model files | ||
|
|
@@ -522,6 +532,8 @@ jobs: | |
| run: curl -L --fail $env:RERANKING_MODEL_URL --create-dirs -o models/$env:RERANKING_MODEL_NAME | ||
| - name: Download draft model | ||
| run: curl -L --fail $env:DRAFT_MODEL_URL --create-dirs -o models/$env:DRAFT_MODEL_NAME | ||
| - name: Download reasoning model | ||
| run: curl -L --fail $env:REASONING_MODEL_URL --create-dirs -o models/$env:REASONING_MODEL_NAME | ||
| - name: List files in models directory | ||
| run: ls -l models/ | ||
| - name: Validate model files | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,225 @@ | ||
| package net.ladenthin.llama; | ||
|
|
||
| import java.io.File; | ||
| import java.util.Collections; | ||
|
|
||
| import net.ladenthin.llama.args.ReasoningFormat; | ||
| import net.ladenthin.llama.json.ChatResponseParser; | ||
| import org.junit.AfterClass; | ||
| import org.junit.Assert; | ||
| import org.junit.Assume; | ||
| import org.junit.BeforeClass; | ||
| import org.junit.Ignore; | ||
| import org.junit.Test; | ||
|
|
||
| /** | ||
| * Integration tests for thinking/reasoning mode using Qwen3-0.6B-Q4_K_M. | ||
| * | ||
| * <p>These tests require the Qwen3-0.6B-Q4_K_M model (downloaded by CI). The entire | ||
| * class is skipped when the model file is absent, matching the pattern used by all | ||
| * other model-dependent test classes. | ||
| * | ||
| * <h2>Confirmed behaviour (Qwen3-0.6B, llama.cpp b9151)</h2> | ||
| * <ol> | ||
| * <li><b>Thinking is active by default.</b> Qwen3's built-in chat template injects | ||
| * {@code <think>} into the prompt before generation starts. No extra kwarg is | ||
| * required; the model reasons on every request.</li> | ||
| * <li><b>DEEPSEEK reasoning format correctly extracts thinking tokens.</b> Setting | ||
| * {@code --reasoning-format deepseek} at model load time causes the server to | ||
| * strip the {@code <think>…</think>} block from the response body and surface it | ||
| * in {@code reasoning_content}.</li> | ||
| * <li><b>{@code reasoning_budget_tokens} is NOT enforced for any model when set | ||
| * per-request.</b> The root cause is a bug in | ||
| * {@code tools/server/server-common.cpp}, function | ||
| * {@code oaicompat_chat_params_parse}: the reasoning-budget block writes | ||
| * the model-level default ({@code opt.reasoning_budget}, typically −1) | ||
| * into {@code llama_params["reasoning_budget_tokens"]} before the generic | ||
| * copy loop runs. The copy loop then skips the per-request value from the | ||
| * request body because the key already exists | ||
| * ({@code !llama_params.contains(item.key())} is false). Result: the | ||
| * reasoning-budget sampler is never created (it requires | ||
| * {@code reasoning_budget_tokens ≥ 0}), and any per-request budget | ||
| * has no effect. Parameter serialisation itself is correct — see | ||
| * {@code InferenceParametersTest} and the C++ unit tests.</li> | ||
| * </ol> | ||
| */ | ||
| @ClaudeGenerated( | ||
| purpose = "Integration tests for Qwen3 thinking-mode extraction and reasoning_budget_tokens " + | ||
| "parameter acceptance. Documents the known llama.cpp limitation that budget " + | ||
| "enforcement does not work for prompt-injected thinking models." | ||
| ) | ||
| public class ReasoningBudgetTest { | ||
|
|
||
| /** | ||
| * Generous token budget: Qwen3-0.6B spends up to ~200 tokens thinking before answering. | ||
| * 500 is enough for thinking + a short answer on all tested platforms. | ||
| */ | ||
| private static final int N_PREDICT = 500; | ||
|
|
||
| private static LlamaModel model; | ||
| private final ChatResponseParser parser = new ChatResponseParser(); | ||
|
|
||
| @BeforeClass | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Good practice: @BeforeClass with Assume.assumeTrue() ensures the test class gracefully skips when the model is unavailable, matching the pattern used by ChatAdvancedTest and other model-dependent tests. |
||
| public static void setup() { | ||
| Assume.assumeTrue("Reasoning model not found, skipping ReasoningBudgetTest", | ||
| new File(TestConstants.REASONING_MODEL_PATH).exists()); | ||
| int gpuLayers = Integer.getInteger(TestConstants.PROP_TEST_NGL, TestConstants.DEFAULT_TEST_NGL); | ||
| model = new LlamaModel( | ||
| new ModelParameters() | ||
| .setModel(TestConstants.REASONING_MODEL_PATH) | ||
| .setCtxSize(2048) | ||
| .setGpuLayers(gpuLayers) | ||
| .setFit(false) | ||
| .setReasoningFormat(ReasoningFormat.DEEPSEEK) | ||
| .enableLogTimestamps().enableLogPrefix() | ||
| ); | ||
| } | ||
|
|
||
| @AfterClass | ||
| public static void tearDown() { | ||
| if (model != null) { | ||
| model.close(); | ||
| } | ||
| } | ||
|
|
||
| /** | ||
| * Qwen3 enters thinking mode by default. With {@code reasoning_format=deepseek} set | ||
| * at model level, the thinking tokens must appear in {@code reasoning_content} and | ||
| * the final answer must appear in {@code content}. | ||
| */ | ||
| @Test | ||
| public void testThinkingDefault_reasoningContentAndAnswerPresent() { | ||
| InferenceParameters params = new InferenceParameters("") | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Good defensive check: |
||
| .setMessages(null, Collections.singletonList(new Pair<>("user", "What is 2+2?"))) | ||
| .setNPredict(N_PREDICT); | ||
|
|
||
| String json = model.chatComplete(params); | ||
| String reasoningContent = parser.extractChoiceReasoningContent(json); | ||
| String content = parser.extractChoiceContent(json); | ||
|
|
||
| Assert.assertFalse( | ||
| "reasoning_content should be non-empty (Qwen3 thinks by default)", | ||
| reasoningContent == null || reasoningContent.trim().isEmpty()); | ||
| Assert.assertFalse( | ||
| "content must not be empty (model must produce an answer after thinking)", | ||
| content == null || content.trim().isEmpty()); | ||
| } | ||
|
|
||
| /** | ||
| * {@code reasoning_budget_tokens=0} is accepted by the API and the response | ||
| * completes without error, but the budget is NOT enforced. | ||
| * | ||
| * <p><b>Documents current (broken) behaviour.</b> The per-request value is | ||
| * silently discarded by a bug in {@code tools/server/server-common.cpp} | ||
| * ({@code oaicompat_chat_params_parse}): the reasoning-budget block writes the | ||
| * model-level default (−1) to {@code llama_params["reasoning_budget_tokens"]} | ||
| * before the generic copy loop runs, and the copy loop then skips the user value | ||
| * because the key already exists. The reasoning-budget sampler is therefore never | ||
| * created, and {@code reasoning_content} remains non-empty. | ||
| * | ||
| * <p>This assertion will start <b>failing</b> once the llama.cpp bug is fixed — | ||
| * that is the signal to remove this test and enable | ||
| * {@link #testReasoningBudgetZero_expectedBehavior_suppressesThinking}. | ||
| */ | ||
| @Test | ||
| public void testReasoningBudgetZero_parameterAccepted_thinkingNotSuppressed() { | ||
| InferenceParameters params = new InferenceParameters("") | ||
| .setMessages(null, Collections.singletonList(new Pair<>("user", "What is 2+2?"))) | ||
| .setReasoningBudgetTokens(0) | ||
| .setNPredict(N_PREDICT); | ||
|
|
||
| String json = model.chatComplete(params); | ||
|
|
||
| Assert.assertNotNull("Response JSON must not be null", json); | ||
|
|
||
| String reasoningContent = parser.extractChoiceReasoningContent(json); | ||
| Assert.assertFalse( | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The assertion logic here is intentional and well-documented: This is clear from the comment above, but worth noting that when llama.cpp adds support for prompt-prefilled thinking contexts (as mentioned in line 129), this assertion will start failing — and that's the signal to flip it to |
||
| "reasoning_content is expected to be present because the per-request " + | ||
| "budget is not applied (llama.cpp server-common.cpp copy-loop bug). " + | ||
| "If this assertion fails, the bug has been fixed — remove this test and " + | ||
| "enable testReasoningBudgetZero_expectedBehavior_suppressesThinking.", | ||
| reasoningContent == null || reasoningContent.trim().isEmpty()); | ||
| } | ||
|
|
||
| /** | ||
| * Expected correct behaviour after the llama.cpp bug is fixed. | ||
| * | ||
| * <p><b>Bug:</b> In {@code tools/server/server-common.cpp}, | ||
| * {@code oaicompat_chat_params_parse} sets | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Excellent implementation: the multi-part Javadoc provides the exact bug location, symptoms, and the minimal 3-line fix. This @ignore test serves as a perfect regression detector — once the server-common.cpp bug is fixed upstream, this test will fail (signal to remove |
||
| * {@code llama_params["reasoning_budget_tokens"]} to the model-level default | ||
| * ({@code opt.reasoning_budget}, typically −1) before the generic copy | ||
| * loop runs. The copy loop then skips the per-request value from the request | ||
| * body because the key already exists. Result: the sampler is never created | ||
| * ({@code reasoning_budget_tokens ≥ 0} is required), and budget=0 | ||
| * has no effect. | ||
| * | ||
| * <p><b>Fix (server-common.cpp, reasoning budget block):</b> | ||
| * Read {@code reasoning_budget_tokens} from the request body <em>before</em> | ||
| * writing to {@code llama_params}: | ||
| * <pre> | ||
| * int reasoning_budget = opt.reasoning_budget; | ||
| * if (body.contains("reasoning_budget_tokens")) { | ||
| * reasoning_budget = json_value(body, "reasoning_budget_tokens", reasoning_budget); | ||
| * } | ||
| * if (reasoning_budget == -1 && body.contains("thinking_budget_tokens")) { | ||
| * reasoning_budget = json_value(body, "thinking_budget_tokens", -1); | ||
| * } | ||
| * </pre> | ||
| * | ||
| * <p>Once this fix is applied: remove {@code @Ignore}, confirm this test passes, | ||
| * and remove | ||
| * {@link #testReasoningBudgetZero_parameterAccepted_thinkingNotSuppressed}. | ||
| */ | ||
| @Ignore("llama.cpp bug: per-request reasoning_budget_tokens is overwritten by model default " + | ||
| "in oaicompat_chat_params_parse (server-common.cpp). " + | ||
| "See Javadoc for exact fix location and code.") | ||
| @Test | ||
| public void testReasoningBudgetZero_expectedBehavior_suppressesThinking() { | ||
| InferenceParameters params = new InferenceParameters("") | ||
| .setMessages(null, Collections.singletonList(new Pair<>("user", "What is 2+2?"))) | ||
| .setReasoningBudgetTokens(0) | ||
| .setNPredict(N_PREDICT); | ||
|
|
||
| String json = model.chatComplete(params); | ||
| Assert.assertNotNull("Response JSON must not be null", json); | ||
|
|
||
| String reasoningContent = parser.extractChoiceReasoningContent(json); | ||
| Assert.assertTrue( | ||
| "reasoning_content should be empty when budget=0 suppresses thinking, " + | ||
| "but was: " + reasoningContent, | ||
| reasoningContent == null || reasoningContent.trim().isEmpty()); | ||
| } | ||
|
|
||
| /** | ||
| * A positive {@code reasoning_budget_tokens} value is accepted and the call completes | ||
| * without error. | ||
| * | ||
| * <p>The assertion checks that the model produced a non-empty response — either in | ||
| * {@code reasoning_content} or {@code content}. On slow or constrained hardware the | ||
| * model may exhaust the token budget inside the thinking block and emit an empty | ||
| * {@code content}; checking both fields makes the test robust to that behaviour. | ||
| * | ||
| * <p>See {@link #testReasoningBudgetZero_parameterAccepted_thinkingNotSuppressed} for | ||
| * the note on why the budget count itself is not asserted. | ||
| */ | ||
| @Test | ||
| public void testReasoningBudgetPositive_parameterAccepted() { | ||
| InferenceParameters params = new InferenceParameters("") | ||
| .setMessages(null, Collections.singletonList( | ||
| new Pair<>("user", "Think step by step: what is 3 times 7?"))) | ||
| .setReasoningBudgetTokens(100) | ||
| .setNPredict(N_PREDICT); | ||
|
|
||
| String json = model.chatComplete(params); | ||
| Assert.assertNotNull("Response JSON must not be null", json); | ||
|
|
||
| String reasoningContent = parser.extractChoiceReasoningContent(json); | ||
| String content = parser.extractChoiceContent(json); | ||
| boolean hasReasoning = reasoningContent != null && !reasoningContent.trim().isEmpty(); | ||
| boolean hasContent = content != null && !content.trim().isEmpty(); | ||
| Assert.assertTrue( | ||
| "model must produce at least some output in reasoning_content or content, " + | ||
| "but both were empty", | ||
| hasReasoning || hasContent); | ||
| } | ||
| } | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -13,4 +13,7 @@ class TestConstants { | |
| /** Path to the draft model used for speculative decoding tests. */ | ||
| static final String DRAFT_MODEL_PATH = "models/AMD-Llama-135m-code.Q2_K.gguf"; | ||
|
|
||
| /** Path to the Qwen3 thinking model used for reasoning budget tests. */ | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Good — the REASONING_MODEL_PATH constant follows the same pattern as MODEL_PATH and DRAFT_MODEL_PATH. Verify the path "models/Qwen3-0.6B-Q4_K_M.gguf" matches exactly with the REASONING_MODEL_NAME in publish.yml (currently correct). |
||
| static final String REASONING_MODEL_PATH = "models/Qwen3-0.6B-Q4_K_M.gguf"; | ||
|
|
||
| } | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good: Model validation array updated consistently with Windows batch file. Script will properly detect missing or corrupt model files.