Create a code completion model that runs locally on consumer hardware (16GB M4 MacBook Air) and rivals commercial products like Cursor.
Single developer, $1,000 budget.
Status: Eval harness and baseline evals done. VS Code extension built and validated. Latency analysis complete. Training next.
A small model doesn't need general knowledge, instruction following, tool use, or "helpful assistant" behavior. By specializing entirely on one narrow task -- code completions capped at 32 tokens -- it could theoretically match SOTA models 60x its size.
Models: Qwen2.5-Coder family (all sizes share the same 151K-token vocabulary)
| Model | Params | Role |
|---|---|---|
| Qwen2.5-Coder-0.5B | 500M | Primary inference model |
| Qwen2.5-Coder-1.5B | 1.5B | Quality upgrade (speculative decoding target) |
| Qwen2.5-Coder-32B | 32B | Teacher for distillation |
Originally planned to use Google's Gemma family, but discovered during testing that their tokenizer vocabularies are incompatible -- speculative decoding requires draft and target to share exact token IDs. Pivoted to Qwen2.5-Coder, where all model sizes share the same 151K-token vocabulary. Also allows us to skip continued pre-training since Qwen models are already code-specialized.
Built the full eval and IDE infrastructure before training to validate assumptions in real-world usage. Key findings:
- 2x quality headroom from 0.5B to 32B teacher on eval benchmarks, and 1.4x on real IDE acceptance rate. Training is worthwhile.
- Latency, not quality, is the primary bottleneck. Real-world latency averaged 2,092ms (vs 176ms in lab benchmarks) because prompt eval dominates at 82% of request time. Speculative decoding only helps the generation phase, which is already fast.
- The core quality problem is context following. The 0.5B model generates from memory instead of using the cross-file context already provided in the prompt. Training needs to teach context copying, and whether 0.5B has the capacity for this is the key open question.
Originally planned to use speculative decoding with 0.5B as draft and 1.5B as target. These findings shifted the strategy: optimize the 0.5B model first (3.5x faster prompt eval than 1.5B), and only add the 1.5B if needed.
| Model | Params | RepoBench EM | RepoBench ES | HumanEval P@1 | Role |
|---|---|---|---|---|---|
| Qwen2.5-Coder-0.5B | 500M | 0.175 | 0.397 | 0.232 | Primary |
| Qwen2.5-Coder-1.5B | 1.5B | 0.230 | 0.451 | 0.354 | Quality upgrade |
| Qwen2.5-Coder-32B-AWQ | 32B | 0.337 | 0.556 | 0.707 | Teacher / ceiling |
The 32B teacher is 1.9x better than the 0.5B on exact match -- aiming to close this gap with training.
Generation-only latency with short prompts (cache hot). Real-world latency is higher due to prompt eval -- see Latency Analysis below.
| Tokens | Spec Decoding | 1.5B Alone | Speedup |
|---|---|---|---|
| 16 | 109ms | 212ms | 1.94x |
| 32 | 213ms | 368ms | 1.73x |
| 64 | 460ms | 856ms | 1.86x |
| 128 | 622ms | 1670ms | 2.68x |
Built "Dash," a VS Code extension (~1500 lines) to validate the end-to-end experience:
- FIM prompt construction with cross-file context via the VS Code language server (see the sketch after this list)
- KV cache warmup on file switches and cursor jumps
- Per-completion telemetry with latency, confidence scores (logprobs), and accept/reject tracking
- Context-aware debouncing (100ms at completion points, 500ms during active typing)
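As a rough illustration of the prompt construction step: the special tokens below (`<|repo_name|>`, `<|file_sep|>`, `<|fim_prefix|>`, `<|fim_suffix|>`, `<|fim_middle|>`) are Qwen2.5-Coder's FIM vocabulary, but everything else -- the function names, the chars/4 token estimate, the exact repo-level layout, the placeholder repo name -- is an assumption for the sketch, not the actual Dash source.

```typescript
// Minimal sketch of FIM prompt assembly with a cross-file context budget.
interface ContextSnippet {
  filePath: string; // e.g. a resolved import target
  content: string;  // signatures/declarations pulled via the language server
}

const CHARS_PER_TOKEN = 4; // crude estimate, good enough for budgeting

function approxTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

function buildFimPrompt(
  prefix: string,
  suffix: string,
  snippets: ContextSnippet[],
  budgetTokens = 2048,
): string {
  // The current file (FIM core) always goes in; context snippets fill what's left.
  const core = `<|fim_prefix|>${prefix}<|fim_suffix|>${suffix}<|fim_middle|>`;
  let remaining = budgetTokens - approxTokens(core);

  const contextParts: string[] = [];
  for (const s of snippets) {
    const part = `<|file_sep|>${s.filePath}\n${s.content}\n`;
    const cost = approxTokens(part);
    if (cost > remaining) break;
    contextParts.push(part);
    remaining -= cost;
  }

  // "workspace" is a placeholder repo name.
  return `<|repo_name|>workspace\n${contextParts.join("")}${core}`;
}
```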
Tested first with the base 1.5B model (speculative decoding), then with the 32B teacher swapped in:
| Metric | Base 1.5B | Teacher 32B |
|---|---|---|
| Acceptance rate | 23.9% | 33.3% |
The 1.4x acceptance improvement from base to teacher affirms that training is worthwhile.
Initial benchmarks showed 176ms avg latency with 99% cache hit rate. Real-world usage told a different story:
| Metric | Lab benchmark | Real IDE usage |
|---|---|---|
| Avg latency | 176ms | 2,092ms |
| Cache hit rate | 99% | 28% |
| Prompt size | ~20 tokens | ~2,000 tokens |
Root causes identified:
- Prompt eval dominates (82% of latency). With 2,048-token FIM prompts, the 1.5B target model takes ~2.5s for prompt evaluation on cache miss. Generation (the part speculative decoding speeds up) is only ~560ms.
- Cache misses are frequent. Real coding involves jumping between files and scrolling -- each jump invalidates the KV cache. The 28% hit rate reflects natural coding behavior, not a bug.
- Warmup contention. Cache warmup requests and completion requests competed for a single llama-server slot, causing completions to queue behind warmups.
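A rough consistency check using only the numbers above: 0.28 × 176ms (cache hit) + 0.72 × (2,500ms prompt eval + 560ms generation) ≈ 2,250ms, within about 10% of the observed 2,092ms average.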
Mitigations applied:
- Increased server parallelism from 1 to 3 slots (~484 MB total, trivial on 16 GB)
- Increased warmup debounce from 300ms to 1.5s (avoids firing warmups during rapid navigation; see the sketch after this list)
- Estimated improvement: avg latency from ~2,092ms to ~1,500-1,700ms
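A minimal sketch of the debounced warmup, assuming llama.cpp's HTTP server API (`/completion` with `cache_prompt` and `n_predict: 0`); the port and function names are placeholders rather than the actual Dash code. With 3 parallel slots, these requests no longer block real completions.

```typescript
// Debounced KV-cache warmup (illustrative sketch, not the shipped Dash code).
const WARMUP_DEBOUNCE_MS = 1500; // raised from 300ms to skip rapid navigation
let warmupTimer: ReturnType<typeof setTimeout> | undefined;

function scheduleWarmup(prompt: string): void {
  if (warmupTimer !== undefined) clearTimeout(warmupTimer);
  warmupTimer = setTimeout(() => void warmUp(prompt), WARMUP_DEBOUNCE_MS);
}

async function warmUp(prompt: string): Promise<void> {
  // Ask the server to evaluate the prompt but generate nothing, so the
  // KV cache is hot when a real completion request arrives.
  await fetch("http://127.0.0.1:8080/completion", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt, n_predict: 0, cache_prompt: true }),
  }).catch(() => {
    /* warmup is best-effort; ignore failures */
  });
}
```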
Key insight: The 1.5B model's prompt eval is the fundamental bottleneck. Speculative decoding only helps the generation phase, which is already fast. The 0.5B model alone has ~3.5x faster prompt eval.
8.7% acceptance rate (6/69). Accepted completions had significantly higher confidence (avg logprob -0.035 vs -0.121 for rejected).
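The confidence gap suggests a simple client-side filter. A sketch, with a made-up threshold sitting between the two observed averages; how the per-token logprobs are extracted from the server response is left to the client.

```typescript
// Confidence filter sketch (threshold is illustrative, tune against telemetry).
const MIN_AVG_LOGPROB = -0.08; // between accepted (-0.035) and rejected (-0.121) averages

function shouldShowCompletion(tokenLogprobs: number[]): boolean {
  if (tokenLogprobs.length === 0) return false;
  const avg =
    tokenLogprobs.reduce((sum, lp) => sum + lp, 0) / tokenLogprobs.length;
  return avg >= MIN_AVG_LOGPROB;
}
```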
Rejection causes (excluding latency):
| Category | % | Description | Fix |
|---|---|---|---|
| Wrong content | 46% | Plausible but incorrect code (wrong model names, wrong API patterns) | Training (SFT + distillation) |
| TSX/TS struggle | 21% | 0% acceptance on TypeScript/TSX | Training (broader language coverage) |
| Redundant with suffix | 13% | Suggests code already present after cursor | Extension (better suffix dedup) |
| Trivial | 13% | Single tokens or obvious fragments not worth showing | Extension (minimum completion length) |
| Hallucinated API | 8% | Invents API calls that don't match actual library | Training (distillation from 32B) |
The core quality problem (46% wrong content + 8% hallucinated APIs) is that the base 0.5B model generates from memory instead of following context. The cross-file context system already provides the right information -- import resolution via the VS Code language server injects actual function signatures into the prompt. The model just ignores it. For example, it suggests client.completion() (old Anthropic SDK pattern memorized during pre-training) when both the suffix and cross-file context show the newer client.messages.create() API.
This means training needs to teach the model to copy from context rather than generate from memory. Whether a 0.5B model has enough capacity to do this reliably is the open question -- and exactly what the context matching experiment is designed to test. The redundant/trivial categories (26% combined) are solvable in the extension layer without training.
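A sketch of those two extension-layer fixes (suffix dedup and a minimum completion length); names and thresholds are illustrative, not the Dash source.

```typescript
// Extension-layer filters for the "redundant with suffix" and "trivial" categories.
const MIN_USEFUL_LENGTH = 3; // non-whitespace characters worth showing

function isRedundantWithSuffix(completion: string, suffix: string): boolean {
  // Drop the completion if the code it proposes already appears right after
  // the cursor (ignoring leading whitespace differences).
  const normalized = completion.trim();
  return normalized.length > 0 && suffix.trimStart().startsWith(normalized);
}

function isTrivial(completion: string): boolean {
  return completion.trim().length < MIN_USEFUL_LENGTH;
}

function passesExtensionFilters(completion: string, suffix: string): boolean {
  return !isTrivial(completion) && !isRedundantWithSuffix(completion, suffix);
}
```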
```
VS Code Extension (Dash)                  llama-server
+---------------------------+           +-------------------+
| FIM prompt construction   |   HTTP    | Qwen2.5-Coder     |
| Cross-file context (LSP)  |  ------>  | Q4_K_M quantized  |
| Cache warmup              |           | Metal GPU accel   |
| Telemetry + metrics       |  <------  | 3 parallel slots  |
| Confidence filtering      |           |                   |
+---------------------------+           +-------------------+
```
Shoot for the best possible performance using the 0.5B model. At 500M params with Q4_K_M quantization, prompt eval is fast (~700ms for 2K tokens vs ~2.5s for the 1.5B). If training closes the quality gap sufficiently, the 0.5B alone may be all we need.
Training pipeline:
- SFT on IDE-style completions (FIM format, cursor positions, short completions)
- Distillation from 32B teacher to transfer quality (see the sketch after this list)
- Smarter context -- don't greedily fill 2K tokens; use only what the completion needs
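One simple reading of the distillation step is sequence-level distillation: sample the 32B teacher on IDE-style FIM prompts and train the 0.5B on its outputs. A hedged sketch, assuming the teacher is served by llama-server; the port, sampling settings, and function names are placeholders.

```typescript
// Build (prompt, teacher_completion) pairs by sampling the 32B teacher.
interface TrainingPair {
  prompt: string;     // FIM prompt, same format Dash sends at inference time
  completion: string; // teacher's completion, capped at 32 tokens
}

async function teacherCompletion(prompt: string): Promise<string> {
  const res = await fetch("http://127.0.0.1:8080/completion", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt, n_predict: 32, temperature: 0.2 }),
  });
  const body = (await res.json()) as { content: string };
  return body.content;
}

async function buildDistillationSet(prompts: string[]): Promise<TrainingPair[]> {
  const pairs: TrainingPair[] = [];
  for (const prompt of prompts) {
    pairs.push({ prompt, completion: await teacherCompletion(prompt) });
  }
  return pairs;
}
```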
After training, compare the fine-tuned 0.5B against the base 1.5B on real IDE acceptance rate. If the trained 0.5B matches or exceeds the base 1.5B, skip speculative decoding entirely and ship the 0.5B alone.
If 1.5B quality is still required for complex completions, run both models as separate servers:
- Fast tier (0.5B only): Simple completions (after `.`, `:`, brackets). Sub-300ms.
- Quality tier (0.5B + 1.5B speculative): Complex completions (new lines, function bodies). 1-2s.
The extension routes based on trigger context. Most completions hit the fast tier.
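A minimal sketch of that routing, assuming each tier runs as its own llama-server instance; the trigger set and ports are placeholders, not the shipped Dash behavior.

```typescript
// Trigger-based routing between the fast and quality tiers.
type Tier = "fast" | "quality";

const FAST_TRIGGERS = new Set([".", ":", "(", "[", "{", ","]);

function pickTier(charBeforeCursor: string, lineIsEmpty: boolean): Tier {
  // Latency-sensitive completions (member access, call arguments) go to the
  // 0.5B server; fresh lines and function bodies go to the speculative pair.
  if (!lineIsEmpty && FAST_TRIGGERS.has(charBeforeCursor)) return "fast";
  return "quality";
}

const TIER_ENDPOINTS: Record<Tier, string> = {
  fast: "http://127.0.0.1:8080/completion",    // 0.5B only
  quality: "http://127.0.0.1:8081/completion", // 0.5B draft + 1.5B target
};
```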
```
nanocomplete/
├── dash/                             VS Code extension ("Dash")
│   └── src/
│       ├── extension.ts              Entry point, cache warmup
│       ├── completionProvider.ts     Completion logic, metrics, telemetry
│       ├── fimFormatter.ts           FIM prompt construction
│       ├── crossFileContext.ts       Import resolution via LSP
│       └── llamaClient.ts            HTTP client for llama-server
├── evals/                            Evaluation harness (RepoBench, HumanEval)
├── scripts/                          Server, eval, and pipeline scripts
└── docs/                             Design documents and experiment results
```
See dash/README.md for setup (model download, server startup, extension install).
| Doc | Description |
|---|---|
| docs/01_project_spec.md | Project specification |
| docs/21_baseline_evals.md | Full baseline evaluation results |
| docs/14_speculative_decoding_approach.md | Speculative decoding validation |
| docs/26_teacher_model_testing.md | Teacher model testing in real IDE usage |
| docs/31_extension_architecture.md | Dash extension architecture |
| docs/40_training_strategy.md | Training approach and pipeline |
MIT