KoboldCpp Attention Extraction - Optimized Release v1.0
STABLE WORKING VERSION - Tested and verified on Qwen 7B
This release provides production-ready attention extraction for Halo Weave
with optimized bandwidth usage and zero server-side performance penalty.
KEY FEATURES:
✅ Unconditional attention extraction (hooked at process_ubatch core)
✅ Push-model architecture (tokens paired with attention atomically)
✅ Optimized bandwidth: 28KB per token (28x reduction from the original; see the sketch after this list)
✅ No generation slowdown (sends Layer 27 once, not 28 times)
✅ Fast client-side aggregation (<1ms per token)
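Where the per-token figures above come from, as a quick back-of-the-envelope sketch (assuming float32 attention values and the [1, 28, 256] per-token shape listed under COMPATIBILITY below):

# Per-token attention block: 1 layer x 28 heads x 256 context positions, float32.
heads, ctx, bytes_per_float = 28, 256, 4
per_token_bytes = heads * ctx * bytes_per_float   # 28,672 bytes ~= 28 KB per token
all_layers_bytes = 28 * per_token_bytes           # ~800 KB if all 28 layers were sent
print(per_token_bytes, all_layers_bytes)          # hence the ~28x reduction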
VERIFIED PERFORMANCE:
- Generation speed: 25.52 T/s (Qwen 7B Q8_0)
- Prompt processing: 2948 T/s
- Client processing: 0.58ms/token (400x faster than v0.1)
- Total overhead: ~180ms for 308 tokens
COMPATIBILITY:
- Works with Halo Weave frontend (no changes needed)
- SSE streaming via /api/extra/generate/stream
- Non-streaming via /api/v1/generate
- Antislop sampling compatible
- Format: "per_layer" with shape [1, 28, 256]
BUILD INSTRUCTIONS:
make LLAMA_CUBLAS=1 -j$(nproc)
API USAGE:
POST /api/extra/generate/stream
{
  "prompt": "Your prompt here",
  "max_length": 50,
  "output_attentions": true
}
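A minimal streaming client sketch, assuming a local server on the default KoboldCpp port (5001); the "data:" framing follows standard SSE, while the per-event field names ("token", "attentions") are assumptions rather than confirmed API details:

import json
import requests

URL = "http://localhost:5001/api/extra/generate/stream"   # default port assumed

payload = {
    "prompt": "Your prompt here",
    "max_length": 50,
    "output_attentions": True,
}

with requests.post(URL, json=payload, stream=True) as resp:
    resp.raise_for_status()
    for raw in resp.iter_lines(decode_unicode=True):
        # SSE frames arrive as "data: {...}" lines separated by blank lines.
        if not raw or not raw.startswith("data:"):
            continue
        event = json.loads(raw[len("data:"):].strip())
        # Each event pairs a generated token with its attention block
        # (field names such as "token" / "attentions" are assumptions).
        print(event.get("token", ""), end="", flush=True)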
KNOWN CHARACTERISTICS:
- Captures Layer 27 attention patterns (repeated across all "layers")
- Client aggregates the 28 heads into a single attention vector
- Suitable for brightness-based context pruning (see the sketch below)
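A minimal sketch of the kind of brightness-based pruning this enables; the mean-over-heads aggregation, the decay factor, and the threshold are illustrative choices, not values taken from the Halo Weave frontend:

import numpy as np

HEADS, CTX = 28, 256

def update_brightness(brightness: np.ndarray, attn: np.ndarray,
                      decay: float = 0.95) -> np.ndarray:
    """Fold one token's (HEADS, CTX) attention block into per-position brightness.

    The mean-over-heads aggregation and the decay factor are illustrative,
    not the exact scheme used by Halo Weave.
    """
    attn_vector = attn.mean(axis=0)           # 28 heads -> single (256,) attention vector
    return decay * brightness + attn_vector   # recently attended positions stay bright

def prune_mask(brightness: np.ndarray, threshold: float = 0.01) -> np.ndarray:
    """Context positions dim enough to be candidates for pruning."""
    return brightness < threshold

# Example usage with one token's (28, 256) attention block:
brightness = np.zeros(CTX, dtype=np.float32)
attn = np.full((HEADS, CTX), 1.0 / CTX, dtype=np.float32)   # uniform stand-in payload
brightness = update_brightness(brightness, attn)
print(prune_mask(brightness).shape)   # (256,) boolean mask over context positions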
Tested on: Ubuntu 22.04, CUDA 12.1, RTX 4090
Date: 2025-12-18
Commit: c47f6f7