Commit ff3a7c8

Optimization (decode): treat KV slot exhaustion (code 1) as a recoverable return value

- Updated the `decode` wrapper to explicitly return `1` instead of raising a `RuntimeError` when `llama_decode` indicates no KV slots are available.
- Aligned Python API behavior with the underlying C++ contract, treating code 1 as a recoverable signal rather than a fatal crash.
- Enabled upper-level caller loops (like `eval`) to gracefully handle VRAM fragmentation via dynamic batch halving without relying on clumsy try-except block string parsing.
- Retained strict `RuntimeError` exceptions for truly fatal backend failures (e.g., codes -1, -2, -3).
- Added comprehensive docstrings detailing return codes and exception scenarios.

Signed-off-by: JamePeng <jame_peng@sina.com>

1 parent 1a5b3d6 commit ff3a7c8

File tree

1 file changed

+30
-6
lines changed


llama_cpp/_internals.py

Lines changed: 30 additions & 6 deletions
@@ -529,19 +529,43 @@ def encode(self, batch: LlamaBatch):
         if return_code != 0:
             raise RuntimeError(f"llama_encode returned {return_code}")
 
-    def decode(self, batch: LlamaBatch):
+    def decode(self, batch: 'LlamaBatch') -> int:
+        """
+        Evaluate the batch of tokens using the transformer model.
+
+        This method executes the forward pass. If the KV cache is heavily fragmented
+        or out of space, it may return 1, indicating the caller should try to reduce
+        the batch size or evict idle sequences.
+
+        Returns:
+            0: Success.
+            1: No KV slot available (Recoverable). The caller should implement a
+               fallback strategy, such as reducing the batch size and retrying.
+
+        Raises:
+            RuntimeError: If a fatal, non-recoverable error occurs during decoding
+                          (e.g., negative error codes or invalid batch structures).
+        """
         return_code = llama_cpp.llama_decode(self.ctx, batch.batch)
 
         if return_code == 0:
-            return
+            return 0
+
+        # 1 means "No KV slot available".
+        # We explicitly return 1 instead of raising an exception so that the caller
+        # can gracefully handle it via dynamic batch sizing (batch_size //= 2).
+        elif return_code == 1:
+            return 1
 
+        # Any other code indicates a fatal failure.
         error_map = {
-            1: "No KV slot available: try reducing batch size or increasing context window",
-            2: "Decoding aborted",
-            -1: "Invalid input batch",
+            2: "Decoding aborted by user callback",
+            -1: "Invalid input batch (e.g. n_tokens == 0 or exceeding capacity)",
+            -2: "Could not allocate space for the compute graph (VRAM exhausted)",
+            -3: "Graph computation failed internally",
         }
 
-        msg = error_map.get(return_code, "Fatal internal error")
+        msg = error_map.get(return_code, "Unknown fatal internal error")
         raise RuntimeError(f"llama_decode failed (code {return_code}): {msg}")
 
     def set_n_threads(self, n_threads: int, n_threads_batch: int):
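
The commit message mentions caller loops that halve the batch when `decode` returns `1`. The sketch below illustrates one way such a loop might look. It is not the actual `eval` implementation: `decode_with_backoff`, `decode_chunk`, and `fake_decode` are hypothetical names, and a plain callable over a token slice stands in for the real `decode(batch)` call, which takes a `LlamaBatch`.

```python
from typing import Callable, Sequence


def decode_with_backoff(
    decode_chunk: Callable[[Sequence[int]], int],
    tokens: Sequence[int],
    batch_size: int,
) -> int:
    """Feed `tokens` through `decode_chunk`, halving the chunk size on code 1.

    `decode_chunk` is assumed to follow the contract in this commit:
    return 0 on success, return 1 when no KV slot is available, and
    raise RuntimeError for fatal errors (which propagate unchanged).
    Returns the chunk size in use once every token has been decoded.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    pos = 0
    while pos < len(tokens):
        chunk = tokens[pos:pos + batch_size]
        code = decode_chunk(chunk)
        if code == 0:
            # Chunk accepted: advance past it.
            pos += len(chunk)
        elif code == 1:
            # Recoverable: retry the same position with a smaller chunk.
            if batch_size == 1:
                raise RuntimeError("No KV slot available even for a single token")
            batch_size //= 2
        else:
            raise RuntimeError(f"Unexpected return code {code}")
    return batch_size


# Hypothetical stand-in for the wrapper: pretends only 4 tokens fit at once.
def fake_decode(chunk: Sequence[int]) -> int:
    return 0 if len(chunk) <= 4 else 1


final_size = decode_with_backoff(fake_decode, list(range(20)), batch_size=16)
```

Because code 1 arrives as a return value, the loop can branch on an integer instead of catching `RuntimeError` and inspecting its message. A real caller would probably also cap the number of retries or evict idle sequences before giving up.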
