
Tags: RecursiveRabbit/koboldcpp

v1.0-attention-optimized

KoboldCpp Attention Extraction - Optimized Release v1.0

STABLE WORKING VERSION - Tested and verified on Qwen 7B

This release provides production-ready attention extraction for Halo Weave
with optimized bandwidth usage and zero server-side performance penalty.

KEY FEATURES:
✅ Unconditional attention extraction (hooked at process_ubatch core)
✅ Push-model architecture (tokens paired with attention atomically)
✅ Optimized bandwidth: 28KB per token (28x reduction from the original; see the sanity check below)
✅ No generation slowdown (sends Layer 27 once, not 28 times)
✅ Fast client-side aggregation (<1ms per token)
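
A quick sanity check on the bandwidth figures (an illustrative sketch; the
float32 encoding is our assumption, while the 28KB figure and the [1, 28, 256]
shape come from these notes):

# Python sketch: per-token payload size and the source of the 28x reduction
heads, positions, bytes_per_value = 28, 256, 4           # float32 assumed
per_token = heads * positions * bytes_per_value           # 28,672 bytes ~ 28KB
original  = 28 * per_token                                # ~784KB if Layer 27 were sent 28 times
print(per_token, original // per_token)                   # 28672, 28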

VERIFIED PERFORMANCE:
- Generation speed: 25.52 T/s (Qwen 7B Q8_0)
- Prompt processing: 2948 T/s
- Client processing: 0.58ms/token (400x faster than v0.1)
- Total overhead: ~180ms for 308 tokens

COMPATIBILITY:
- Works with Halo Weave frontend (no changes needed)
- SSE streaming via /api/extra/generate/stream
- Non-streaming via /api/v1/generate
- Antislop sampling compatible
- Format: "per_layer" with shape [1, 28, 256]

BUILD INSTRUCTIONS:
make LLAMA_CUBLAS=1 -j$(nproc)

API USAGE:
POST /api/extra/generate/stream
{
  "prompt": "Your prompt here",
  "max_length": 50,
  "output_attentions": true
}
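
A minimal streaming client sketch in Python. The endpoint, request body, and
the "per_layer" [1, 28, 256] format are taken from these notes; the port and
the per-event field names ("token", "attentions") are assumptions and may
differ in the actual Halo Weave integration.

# SSE client sketch for the streaming endpoint above (assumptions noted in comments)
import json
import requests

resp = requests.post(
    "http://localhost:5001/api/extra/generate/stream",    # default local KoboldCpp port assumed
    json={"prompt": "Your prompt here", "max_length": 50, "output_attentions": True},
    stream=True,
)

for line in resp.iter_lines():
    if not line or not line.startswith(b"data: "):
        continue
    event = json.loads(line[len(b"data: "):])
    token = event.get("token")                 # generated token text
    attn = event.get("attentions")             # assumed field name; shape [1, 28, 256]
    if attn is not None:
        layer = attn[0]                        # one layer: 28 heads x 256 context positions
    print(token, end="", flush=True)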

KNOWN CHARACTERISTICS:
- Captures Layer 27 attention patterns (repeated across all "layers")
- Client aggregates 28 heads to a single attention vector (see the sketch below)
- Suitable for brightness-based context pruning
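
A client-side aggregation sketch (the mean-over-heads reduction and the
brightness threshold are illustrative assumptions; only the head count,
payload shape, and the pruning use case come from these notes):

# Aggregate one token's [1, 28, 256] attention payload into a per-position vector
import numpy as np

def aggregate_heads(per_layer_attention):
    attn = np.asarray(per_layer_attention, dtype=np.float32)  # (1, 28, 256)
    return attn[0].mean(axis=0)                                # (256,) one value per context position

def prune_candidates(brightness, threshold=0.001):
    # Context positions whose aggregated attention ("brightness") stays below
    # the threshold are candidates for pruning.
    return [i for i, value in enumerate(brightness) if value < threshold]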

Tested on: Ubuntu 22.04, CUDA 12.1, RTX 4090
Date: 2025-12-18
Commit: c47f6f7