NanoComplete

Create a code completion model that runs locally on consumer hardware (16GB M4 MacBook Air) and rivals commercial products like Cursor.

Single developer, $1,000 budget.

Status: Eval harness and baseline evals done. VS Code extension built and validated. Latency analysis complete. Training next.

Hypothesis

A small model doesn't need general knowledge, instruction following, tool use, or "helpful assistant" behavior. By specializing entirely on one narrow task -- code completions capped at 32 tokens -- it could theoretically match SOTA models 60x its size.

Approach

Models: Qwen2.5-Coder family (all sizes share the same 151K-token vocabulary)

| Model              | Params | Role                                          |
|--------------------|--------|-----------------------------------------------|
| Qwen2.5-Coder-0.5B | 500M   | Primary inference model                       |
| Qwen2.5-Coder-1.5B | 1.5B   | Quality upgrade (speculative decoding target) |
| Qwen2.5-Coder-32B  | 32B    | Teacher for distillation                      |

Originally planned to use Google's Gemma family, but discovered during testing that the tokenizer vocabularies are incompatible -- speculative decoding requires the draft and target models to share exact token IDs. Pivoted to Qwen2.5-Coder, where every size shares the same 151K-token vocabulary; this also lets us skip continued pre-training, since the Qwen models are already code-specialized.
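
As a sanity check before committing to a family, the draft and target vocabularies can be compared directly. A minimal sketch, assuming Hugging Face-style tokenizer.json files downloaded locally -- the file paths, the `model.vocab` layout, and the helper names are assumptions for illustration, not part of this repo:

```ts
// check_vocab_compat.ts -- hypothetical helper, not part of this repo.
// Compares the token -> id maps of two Hugging Face tokenizer.json files.
// Speculative decoding requires the draft and target to agree exactly
// (added/special tokens should be checked the same way).
import { readFileSync } from "fs";

function loadVocab(path: string): Record<string, number> {
  const tokenizer = JSON.parse(readFileSync(path, "utf8"));
  // BPE tokenizers store the vocabulary under model.vocab (assumed layout).
  return tokenizer.model.vocab;
}

function vocabsMatch(draftPath: string, targetPath: string): boolean {
  const draft = loadVocab(draftPath);
  const target = loadVocab(targetPath);
  const draftTokens = Object.keys(draft);
  if (draftTokens.length !== Object.keys(target).length) return false;
  // Every token must exist in both vocabularies with the same id.
  return draftTokens.every((tok) => target[tok] === draft[tok]);
}

console.log(vocabsMatch("qwen2.5-coder-0.5b/tokenizer.json",
                        "qwen2.5-coder-1.5b/tokenizer.json"));
```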

Results

Summary

Built the full eval and IDE infrastructure before training to validate assumptions in real-world usage. Key findings:

  • 2x quality headroom from 0.5B to 32B teacher on eval benchmarks, and 1.4x on real IDE acceptance rate. Training is worthwhile.
  • Latency, not quality, is the primary bottleneck. Real-world latency averaged 2,092ms (vs 176ms in lab benchmarks) because prompt eval dominates at 82% of request time. Speculative decoding only helps the generation phase, which is already fast.
  • The core quality problem is context following. The 0.5B model generates from memory instead of using the cross-file context already provided in the prompt. Training needs to teach context copying, and whether 0.5B has the capacity for this is the key open question.

Originally planned to use speculative decoding with 0.5B as draft and 1.5B as target. These findings shifted the strategy: optimize the 0.5B model first (3.5x faster prompt eval than 1.5B), and only add the 1.5B if needed.

Baselines (NVIDIA L4, RepoBench Python v1.1 + HumanEval)

| Model                 | Params | RepoBench EM | RepoBench ES | HumanEval P@1 | Role              |
|-----------------------|--------|--------------|--------------|---------------|-------------------|
| Qwen2.5-Coder-0.5B    | 500M   | 0.175        | 0.397        | 0.232         | Primary           |
| Qwen2.5-Coder-1.5B    | 1.5B   | 0.230        | 0.451        | 0.354         | Quality upgrade   |
| Qwen2.5-Coder-32B-AWQ | 32B    | 0.337        | 0.556        | 0.707         | Teacher / ceiling |

The 32B teacher is 1.9x better than the 0.5B on exact match -- aiming to close this gap with training.

Speculative Decoding (M4 MacBook Air)

Generation-only latency with short prompts (cache hot). Real-world latency is higher due to prompt eval -- see Latency Analysis below.

| Tokens | Spec Decoding | 1.5B Alone | Speedup |
|--------|---------------|------------|---------|
| 16     | 109ms         | 212ms      | 1.94x   |
| 32     | 213ms         | 368ms      | 1.73x   |
| 64     | 460ms         | 856ms      | 1.86x   |
| 128    | 622ms         | 1670ms     | 2.68x   |

IDE Validation

Built "Dash," a VS Code extension (~1500 lines) to validate the end-to-end experience:

  • FIM prompt construction with cross-file context via the VS Code language server (sketched after this list)
  • KV cache warmup on file switches and cursor jumps
  • Per-completion telemetry with latency, confidence scores (logprobs), and accept/reject tracking
  • Context-aware debouncing (100ms at completion points, 500ms during active typing)
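
For reference, a minimal sketch of the FIM prompt shape the extension targets. The Qwen2.5-Coder FIM special tokens (`<|fim_prefix|>`, `<|fim_suffix|>`, `<|fim_middle|>`, `<|file_sep|>`) follow the model's documented fill-in-the-middle format; the helper names and the exact framing of cross-file snippets are illustrative, not the actual fimFormatter.ts implementation:

```ts
// Illustrative only -- not the actual fimFormatter.ts.
interface FimInput {
  prefix: string;        // file text before the cursor
  suffix: string;        // file text after the cursor
  contextFiles: { path: string; snippet: string }[]; // signatures from the LSP
}

function buildFimPrompt({ prefix, suffix, contextFiles }: FimInput): string {
  // Cross-file snippets (e.g. resolved import signatures) go ahead of the
  // current file, separated by the repo-level file separator token.
  const context = contextFiles
    .map((f) => `<|file_sep|>${f.path}\n${f.snippet}`)
    .join("\n");
  return `${context}\n<|fim_prefix|>${prefix}<|fim_suffix|>${suffix}<|fim_middle|>`;
}
```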

Tested with the base 1.5B model (speculative decoding), then again with the 32B teacher swapped in:

| Metric          | Base 1.5B | Teacher 32B |
|-----------------|-----------|-------------|
| Acceptance rate | 23.9%     | 33.3%       |

The 1.4x acceptance improvement from base to teacher affirms that training is worthwhile.

Latency Analysis

Initial benchmarks showed 176ms avg latency with 99% cache hit rate. Real-world usage told a different story:

| Metric         | Lab benchmark | Real IDE usage |
|----------------|---------------|----------------|
| Avg latency    | 176ms         | 2,092ms        |
| Cache hit rate | 99%           | 28%            |
| Prompt size    | ~20 tokens    | ~2,000 tokens  |

Root causes identified:

  1. Prompt eval dominates (82% of latency). With 2,048-token FIM prompts, the 1.5B target model takes ~2.5s for prompt evaluation on cache miss. Generation (the part speculative decoding speeds up) is only ~560ms.

  2. Cache misses are frequent. Real coding involves jumping between files and scrolling -- each jump invalidates the KV cache. The 28% hit rate reflects natural coding behavior, not a bug.

  3. Warmup contention. Cache warmup requests and completion requests competed for a single llama-server slot, causing completions to queue behind warmups.

Mitigations applied:

  • Increased server parallelism from 1 to 3 slots (~484 MB total, trivial on 16 GB)
  • Increased warmup debounce from 300ms to 1.5s to avoid firing warmups during rapid navigation (see the sketch after this list)
  • Estimated improvement: avg latency from ~2,092ms to ~1,500-1,700ms
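
A rough sketch of that warmup scheduling. The 1.5s debounce and the idea of priming the KV cache without generating tokens come from the mitigations above; the `warmCache` helper, the port, and the llama-server field names (`cache_prompt`, `n_predict`, and whether `n_predict: 0` behaves as a pure prompt evaluation) are assumptions based on llama.cpp's /completion API, not the shipped extension.ts code:

```ts
// Hypothetical warmup scheduler -- not the actual extension.ts code.
let warmupTimer: ReturnType<typeof setTimeout> | undefined;

function scheduleWarmup(prompt: string): void {
  // Debounce 1.5s so rapid file switches / cursor jumps don't each fire a
  // warmup that would compete with real completion requests for a slot.
  if (warmupTimer !== undefined) clearTimeout(warmupTimer);
  warmupTimer = setTimeout(() => warmCache(prompt), 1500);
}

async function warmCache(prompt: string): Promise<void> {
  // Prime the server's KV cache without generating tokens. Field names are
  // assumed from llama.cpp's llama-server /completion endpoint.
  await fetch("http://127.0.0.1:8080/completion", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt, n_predict: 0, cache_prompt: true }),
  });
}
```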

Key insight: The 1.5B model's prompt eval is the fundamental bottleneck. Speculative decoding only helps the generation phase, which is already fast. The 0.5B model alone has ~3.5x faster prompt eval.

Rejection Analysis (69 completions)

8.7% acceptance rate (6/69). Accepted completions had significantly higher confidence (avg logprob -0.035 vs -0.121 for rejected).

Rejection causes (excluding latency):

| Category              | %   | Description                                                           | Fix                                    |
|-----------------------|-----|-----------------------------------------------------------------------|----------------------------------------|
| Wrong content         | 46% | Plausible but incorrect code (wrong model names, wrong API patterns)  | Training (SFT + distillation)          |
| TSX/TS struggle       | 21% | 0% acceptance on TypeScript/TSX                                       | Training (broader language coverage)   |
| Redundant with suffix | 13% | Suggests code already present after the cursor                        | Extension (better suffix dedup)        |
| Trivial               | 13% | Single tokens or obvious fragments not worth showing                  | Extension (minimum completion length)  |
| Hallucinated API      | 8%  | Invents API calls that don't match the actual library                 | Training (distillation from 32B)       |

The core quality problem (46% wrong content + 8% hallucinated APIs) is that the base 0.5B model generates from memory instead of following context. The cross-file context system already provides the right information -- import resolution via the VS Code language server injects actual function signatures into the prompt. The model just ignores it. For example, it suggests client.completion() (old Anthropic SDK pattern memorized during pre-training) when both the suffix and cross-file context show the newer client.messages.create() API.

This means training needs to teach the model to copy from context rather than generate from memory. Whether a 0.5B model has enough capacity to do this reliably is the open question -- and exactly what the context matching experiment is designed to test. The redundant/trivial categories (26% combined) are solvable in the extension layer without training.
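The extension-layer fixes (suffix dedup, minimum length, plus the confidence signal from the rejection analysis) amount to a cheap post-filter on candidate completions. A sketch under assumed thresholds -- the function name and cutoff values are illustrative, not the shipped completionProvider.ts logic:

```ts
// Hypothetical post-filter for candidate completions. Thresholds are
// illustrative; the real extension would tune them from telemetry.
interface Candidate {
  text: string;
  avgLogprob: number;  // mean token logprob reported with the completion
}

function shouldShow(c: Candidate, suffix: string): boolean {
  const text = c.text.trim();
  // Trivial: single tokens or tiny fragments are not worth interrupting for.
  if (text.length < 3) return false;
  // Redundant with suffix: the code is already present after the cursor.
  if (suffix.trimStart().startsWith(text)) return false;
  // Low confidence: accepted completions averaged ~-0.035 avg logprob vs
  // ~-0.121 for rejected, so a cutoff in between drops likely rejects.
  if (c.avgLogprob < -0.08) return false;
  return true;
}
```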

Architecture

VS Code Extension (Dash)              llama-server
+---------------------------+         +-------------------+
| FIM prompt construction   |  HTTP   | Qwen2.5-Coder     |
| Cross-file context (LSP)  | ------> | Q4_K_M quantized  |
| Cache warmup              |         | Metal GPU accel   |
| Telemetry + metrics       | <------ | 3 parallel slots  |
| Confidence filtering      |         |                   |
+---------------------------+         +-------------------+
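
The HTTP hop in the diagram is a single POST per completion. A minimal client sketch, assuming llama.cpp's llama-server /completion endpoint and its common request fields; the port, the decoding settings, and the response shape are assumptions about the deployed server version, not code from llamaClient.ts:

```ts
// Minimal llama-server client sketch (not the actual llamaClient.ts).
interface CompletionResult {
  content: string; // generated text (assumed response field)
}

async function complete(prompt: string): Promise<CompletionResult> {
  const res = await fetch("http://127.0.0.1:8080/completion", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      prompt,              // FIM-formatted prompt from the extension
      n_predict: 32,       // completions are capped at 32 tokens
      temperature: 0.2,    // assumed decoding setting, not from the repo
      cache_prompt: true,  // reuse the KV cache across nearby requests
    }),
  });
  return (await res.json()) as CompletionResult;
}
```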

What's Next

Phase 1: Optimize the 0.5B model (current priority)

Shoot for the best possible performance using the 0.5B model. At 500M params with Q4_K_M quantization, prompt eval is fast (~700ms for 2K tokens vs ~2.5s for the 1.5B). If training closes the quality gap sufficiently, the 0.5B alone may be all we need.

Training pipeline:

  1. SFT on IDE-style completions (FIM format, cursor positions, short completions) -- example record sketched after this list
  2. Distillation from 32B teacher to transfer quality
  3. Smarter context -- don't greedily fill 2K tokens; use only what the completion needs
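
To make the SFT target concrete, one plausible shape for a training record. The field names, the provenance tags, and the JSONL framing are assumptions about a pipeline that doesn't exist yet, not a committed format:

```ts
// Hypothetical shape of one SFT example (e.g. one line of a JSONL file).
// The prompt uses the same FIM + cross-file format the extension sends at
// inference time; the completion is the short span the model should emit.
interface SftExample {
  prompt: string;      // "<|file_sep|>...<|fim_prefix|>...<|fim_suffix|>...<|fim_middle|>"
  completion: string;  // <= 32 tokens, ideally copied from the provided context
  source: "ide_telemetry" | "repo_mined" | "teacher_distilled"; // provenance (assumed)
}

const example: SftExample = {
  prompt:
    "<|fim_prefix|>const client = new Anthropic();\nconst msg = await client.<|fim_suffix|>\nconsole.log(msg);<|fim_middle|>",
  completion: "messages.create({ model, max_tokens: 256, messages })",
  source: "teacher_distilled",
};
```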

Phase 2: Evaluate whether 1.5B is needed

After training, compare the fine-tuned 0.5B against the base 1.5B on real IDE acceptance rate. If the trained 0.5B matches or exceeds the base 1.5B, skip speculative decoding entirely and ship the 0.5B alone.

Phase 3: Two-tier routing (if needed)

If 1.5B quality is still required for complex completions, run both models as separate servers:

  • Fast tier (0.5B only): Simple completions (after ., :, brackets). Sub-300ms.
  • Quality tier (0.5B + 1.5B speculative): Complex completions (new lines, function bodies). 1-2s.

The extension routes based on trigger context. Most completions hit the fast tier.
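
A sketch of how that routing might look in the extension. The trigger classification (after `.`, `:`, brackets goes to the fast tier) comes from the tier descriptions above; the function name, tier labels, and ports are illustrative:

```ts
// Illustrative router -- decides which server a completion request hits.
type Tier = "fast" | "quality"; // fast: 0.5B alone; quality: 0.5B + 1.5B spec decoding

function chooseTier(prefix: string): Tier {
  const lastChar = prefix.slice(-1);
  // Completions triggered right after ".", ":" or an opening bracket are
  // short and local -- send them to the fast tier for sub-300ms latency.
  if ([".", ":", "(", "[", "{"].includes(lastChar)) return "fast";
  // Everything else (new lines, function bodies) takes the slower,
  // higher-quality speculative-decoding path.
  return "quality";
}

const serverUrl: Record<Tier, string> = {
  fast: "http://127.0.0.1:8080",    // 0.5B llama-server (assumed port)
  quality: "http://127.0.0.1:8081", // 1.5B target + 0.5B draft (assumed port)
};
```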

Repository Structure

nanocomplete/
├── dash/              VS Code extension ("Dash")
│   └── src/
│       ├── extension.ts            Entry point, cache warmup
│       ├── completionProvider.ts   Completion logic, metrics, telemetry
│       ├── fimFormatter.ts         FIM prompt construction
│       ├── crossFileContext.ts     Import resolution via LSP
│       └── llamaClient.ts          HTTP client for llama-server
├── evals/             Evaluation harness (RepoBench, HumanEval)
├── scripts/           Server, eval, and pipeline scripts
└── docs/              Design documents and experiment results

Getting Started

See dash/README.md for setup (model download, server startup, extension install).

Documentation

| Doc                                      | Description                             |
|------------------------------------------|-----------------------------------------|
| docs/01_project_spec.md                  | Project specification                   |
| docs/21_baseline_evals.md                | Full baseline evaluation results        |
| docs/14_speculative_decoding_approach.md | Speculative decoding validation         |
| docs/26_teacher_model_testing.md         | Teacher model testing in real IDE usage |
| docs/31_extension_architecture.md        | Dash extension architecture             |
| docs/40_training_strategy.md             | Training approach and pipeline          |

License

MIT
