
Conversation

@ggerganov (Member) commented Oct 28, 2025

The n_past counter is redundant and complicates the logic, so this PR removes it. The number of previous tokens in a sequence is already expressed by slot.prompt.n_tokens(), so there is no need for a second way of representing the same information.

Haven't tested yet how this affects mtmd use cases, but the plan is to incorporate the ideas from #15474 (comment) in order to have a correct positional representation. If additional token position tracking is needed for mrope, it should be implemented on top of the existing sequence-position logic.

HTTP API Changes

  • /metrics: Rename llamacpp:n_past_max -> llamacpp:n_tokens_max

@ngxson (Collaborator) commented Oct 28, 2025

> Haven't tested yet how this affects mtmd use cases

This should not affect mtmd, as most of the code paths for mtmd expect to use server_tokens::n_tokens(). But to be safe, I'll do some tests once this PR is ready to be reviewed.

Base automatically changed from gg/context-no-pad to master October 28, 2025 18:19
@ggerganov ggerganov force-pushed the gg/server-fix-n-past branch from 1709bfa to 545df93 on October 28, 2025 18:44
Comment on lines 3584 to 3598
-llama_memory_seq_add(llama_get_memory(ctx), slot.id, n_keep + n_discard, slot.n_past, -n_discard);

 // add generated tokens to cache
 {
     llama_tokens new_tokens = slot.prompt.tokens.get_text_tokens(); // copy
     for (size_t i = n_keep + n_discard; i < new_tokens.size(); i++) {
         new_tokens[i - n_discard] = new_tokens[i];
     }

     new_tokens.resize(slot.prompt.tokens.size() - n_discard);
     slot.prompt.tokens.clear();
     slot.prompt.tokens.insert(new_tokens);
 }
+llama_memory_seq_add(llama_get_memory(ctx), slot.id, n_keep + n_discard, slot.prompt.n_tokens(), -n_discard);

-slot.n_past -= n_discard;

@ggerganov (Member Author) Oct 29, 2025
Attention here. I think this is simply equivalent to keep_first(), but I'm not sure if there was a reason to implement it like this.

Collaborator:

Hmm I think it's more like "remove middle" rather than keep_first(). Example:

# n_keep=2, n_discard=3
 seq: 0 1 2 3 4 5 6 7 8 9
keep: x x
  rm:     x x x

new seq: 0 1 5 6 7 8 9
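
For reference, a minimal standalone sketch of the "remove middle" behaviour above, using a plain std::vector rather than the actual server_tokens type (names here are only for illustration):

    // remove n_discard tokens starting at index n_keep, keeping the prefix and the tail;
    // this is what the shift-and-resize loop in the diff above effectively does
    #include <cstdio>
    #include <vector>

    int main() {
        std::vector<int> seq = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
        const size_t n_keep    = 2;
        const size_t n_discard = 3;

        seq.erase(seq.begin() + n_keep, seq.begin() + n_keep + n_discard);

        for (int t : seq) {
            printf("%d ", t); // prints: 0 1 5 6 7 8 9
        }
        printf("\n");
        return 0;
    }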

@ngxson (Collaborator) Oct 29, 2025

Also please note that when migrating from std::vector to server_tokens, I tried to stick to the original code while polyfilling server_tokens with .clear() and .insert(), so that's why the code may look a bit strange. But yes, it's better to refactor this code.

Member Author:

Yes, good catch - will revisit this.

Member Author:

Fixed in 7e60d1c

Comment on lines 3913 to 3947
 if (slot.prompt.n_tokens() < slot.task->n_tokens() && input_tokens[slot.prompt.n_tokens()] == LLAMA_TOKEN_NULL) {
     // process the image
     int32_t new_n_past;
-    int32_t res = input_tokens.process_chunk(ctx, mctx, slot.n_past, slot.id, new_n_past);
+    int32_t res = input_tokens.process_chunk(ctx, mctx, slot.prompt.n_tokens(), slot.id, new_n_past);
     if (res != 0) {
         SLT_ERR(slot, "failed to process image, res = %d\n", res);
         send_error(slot, "failed to process image", ERROR_TYPE_SERVER);
         slot.release();
         continue;
     }

     slot.n_prompt_tokens_processed += new_n_past - slot.prompt.n_tokens();

     // add the image chunk to cache
     {
-        const auto & chunk = input_tokens.find_chunk(slot.n_past);
+        const auto & chunk = input_tokens.find_chunk(slot.prompt.n_tokens());
         slot.prompt.tokens.push_back(chunk.get()); // copy
     }

Member Author:

Attention here - I'm not sure I understand the logic here 100%, but I think the new version should behave the same as the old one.

This comment was marked as outdated.

@ngxson (Collaborator) Oct 29, 2025

Please ignore my last comment. Here, n_past is used for context shifting only. It is safe to replace it with prompt.n_tokens(), which should reflect exactly the number of positions (and not the number of cells in the KV cache). Actually, I recommend renaming n_tokens() --> n_pos() for this reason.

(Edit: no, we actually want n_tokens(), since we want to keep track of the cells in the KV cache)


While we're not using context shifting with mtmd, I think it's worth noting the difference here:

If context shifting takes place, for example when we have 10 tokens and we remove the first 5, n_past will remain 10 in this case, while prompt.n_tokens() will return 10-5=5.

Collaborator:

The alternative solution is to:

  • Make server_tokens::tokens reflect the actual number of tokens in the KV cache
  • Calculate the pos for the next token dynamically, using a combination of llama_memory_seq_pos_max() and mtmd_input_chunk_get_n_pos()

The main question is: do you think it's OK to call llama_memory_seq_pos_max() for each new token? Does it have a significant impact on performance?

If this approach seems OK, I will push a commit for it.
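
For concreteness, a rough sketch of that dynamic lookup (the helper is hypothetical; llama_memory_seq_pos_max() is the existing API named above, and the image case would additionally use mtmd_input_chunk_get_n_pos()):

    // sketch only: derive the next position from the current KV cache state for a
    // sequence, instead of tracking it inside server_tokens
    static llama_pos next_pos_for_seq(llama_context * ctx, llama_seq_id seq_id) {
        // position of the last cell currently stored for this sequence;
        // returns -1 for an empty sequence, so an empty sequence yields position 0
        return llama_memory_seq_pos_max(llama_get_memory(ctx), seq_id) + 1;
    }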

Collaborator:

Nevermind, I started working on the idea but it doesn't work, because sometimes the batch is not yet processed, so llama_memory_seq_pos_max() does not reflect the correct number of positions.

I think the more intuitive way now is to simply add a std::vector<llama_pos> to server_tokens. I'll push a commit to illustrate the idea (we can always revert it if it's not the desired solution).

Member Author:

Maybe we can add a new const method pos_next() to server_tokens that loops over all chunks and determines the next position based on the contents. If !mtmd then pos_next() simply returns n_tokens(). Otherwise, it has to analyze all mtmd chunks and take into account the mtmd_input_chunk_get_n_pos().

This way, we don't have to add extra state to server_tokens.

@ngxson (Collaborator) Oct 30, 2025

Please take a look --> 1e1b4af

(Tested with Gemma 3 and Qwen 2.5 VL)

@ggerganov (Member Author) Oct 30, 2025

I think we can avoid the pos state in server_tokens. The logic is quite simple:

  • If we don't have mtmd chunks then pos_next() returns n_tokens()
  • If we have a single mtmd chunk that occupies T tokens and uses P positions, then pos_next() returns n_tokens() - T + P
  • For N + 1 mtmd chunks - similar logic: n_tokens() - T0 - T1 ... - TN + P0 + P1 + ... + PN

This should significantly simplify the logic and state management in server_tokens.
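
A rough sketch of what such a stateless pos_next() could look like (the way the stored mtmd chunks are iterated is an assumption for illustration; mtmd_input_chunk_get_n_tokens() and mtmd_input_chunk_get_n_pos() are the mtmd APIs discussed above):

    // sketch only: start from the token count and, for every mtmd chunk, swap its
    // token count T for its position count P, giving n_tokens() - sum(T) + sum(P)
    llama_pos server_tokens::pos_next() const {
        llama_pos pos = (llama_pos) n_tokens();

        // hypothetical container of the stored mtmd chunks; the real member differs
        for (const auto & chunk : mtmd_chunks) {
            pos -= (llama_pos) mtmd_input_chunk_get_n_tokens(chunk.get());
            pos += mtmd_input_chunk_get_n_pos(chunk.get());
        }

        // text-only prompts have no mtmd chunks, so this reduces to n_tokens()
        return pos;
    }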

Member Author:

I implemented this idea - hopefully it works correctly. I think I understand better now the logic of the mtmd chunks. There are a few ideas to refactor and improve the implementation. Will do so in follow-up PRs.

@ggerganov ggerganov marked this pull request as ready for review October 29, 2025 12:09
@ggerganov ggerganov requested a review from ngxson as a code owner October 29, 2025 12:09
@ggerganov (Member Author)

@ngxson For mtmd, I mainly tested this with Gemma and LightOnOCR - it would be nice to do some extra testing.

@ngxson (Collaborator) commented Oct 30, 2025

Actually, I suggest this patch, which also fixes an issue with llama_memory_seq_rm.

I tested it with Qwen 2.5 VL; the test consists of 2 steps:

  • Step 1: submit a prompt with 2 images, A and B
  • Step 2: submit a prompt with only image A; we expect to remove the cache for image B only

If llama_memory_seq_rm removes the tokens by index instead of by pos, it will remove fewer tokens than we actually need.
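
A hedged sketch of the point being made (the wrapper and its parameters are made up for illustration; llama_memory_seq_rm() is the real API):

    // With mtmd chunks a token index and a position are not the same thing, so the
    // divergent part of the KV cache must be cleared starting from a *position*.
    // n_common_tokens / pos_common are hypothetical names for this example.
    static void clear_divergent_suffix(llama_context * ctx, llama_seq_id seq_id,
                                       size_t n_common_tokens, llama_pos pos_common) {
        // wrong with mtmd: treats the token index as a position; the two can differ,
        // so too few cells get removed and stale image cells survive
        //llama_memory_seq_rm(llama_get_memory(ctx), seq_id, (llama_pos) n_common_tokens, -1);

        // correct: remove everything from the first divergent position onwards
        // (p1 = -1 means "to the end of the sequence")
        llama_memory_seq_rm(llama_get_memory(ctx), seq_id, pos_common, -1);
    }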

@ggerganov (Member Author)

Yes, applying this change.

Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
@ggerganov ggerganov force-pushed the gg/server-fix-n-past branch from 80e3672 to ef1646e on October 30, 2025 14:38
@ggerganov (Member Author)

I think this should be good to merge now. Let me know if you spot any additional issues. If CI is green, will merge a bit later today.

@ngxson (Collaborator) left a comment

I did some more tests with Qwen 2.5 VL and Gemma 3. I can confirm that it's working correctly now. Thanks!

@ggerganov ggerganov merged commit b52edd2 into master Oct 30, 2025
68 of 70 checks passed
@ggerganov ggerganov deleted the gg/server-fix-n-past branch October 30, 2025 16:43