Create a code completion model that runs locally on consumer hardware (16GB M4 MacBook Air) and rivals commercial products like Cursor.
Single developer, $1,000 budget.
Status: Eval harness and baseline evals done. VS Code extension built and validated. Latency analysis complete. Training next.
A small model doesn't need general knowledge, instruction following, tool use, or "helpful assistant" behavior. By specializing entirely on one narrow task -- code completions capped at 32 tokens -- it could theoretically match SOTA models 60x its size.
Models: Qwen2.5-Coder family (all sizes share the same 151K-token vocabulary)
| Model | Params | Role |
|---|---|---|
| Qwen2.5-Coder-0.5B | 500M | Primary inference model |
| Qwen2.5-Coder-1.5B | 1.5B | Quality upgrade (speculative decoding target) |
| Qwen2.5-Coder-32B | 32B | Teacher for distillation |
Originally planned to use Google's Gemma family, but discovered during testing that their tokenizer vocabularies are incompatible -- speculative decoding requires draft and target to share exact token IDs. Pivoted to Qwen2.5-Coder, where all model sizes share the same 151K-token vocabulary. Also allows us to skip continued pre-training since Qwen models are already code-specialized.
Built the full eval and IDE infrastructure before training to validate assumptions in real-world usage. Key findings:
- 2x quality headroom from 0.5B to 32B teacher on eval benchmarks, and 1.4x on real IDE acceptance rate. Training is worthwhile.
- Latency, not quality, is the primary bottleneck. Real-world latency averaged 2,092ms (vs 176ms in lab benchmarks) because prompt eval dominates at 82% of request time. Speculative decoding only helps the generation phase, which is already fast.
- The core quality problem is context following. The 0.5B model generates from memory instead of using the cross-file context already provided in the prompt. Training needs to teach context copying, and whether 0.5B has the capacity for this is the key open question.
Originally planned to use speculative decoding with 0.5B as draft and 1.5B as target. These findings shifted the strategy: optimize the 0.5B model first (3.5x faster prompt eval than 1.5B), and only add the 1.5B if needed.
| Model | Params | RepoBench EM | RepoBench ES | HumanEval P@1 | Role |
|---|---|---|---|---|---|
| Qwen2.5-Coder-0.5B | 500M | 0.175 | 0.397 | 0.232 | Primary |
| Qwen2.5-Coder-1.5B | 1.5B | 0.230 | 0.451 | 0.354 | Quality upgrade |
| Qwen2.5-Coder-32B-AWQ | 32B | 0.337 | 0.556 | 0.707 | Teacher / ceiling |
The 32B teacher is 1.9x better than the 0.5B on exact match -- aiming to close this gap with training.
Generation-only latency with short prompts (cache hot). Real-world latency is higher due to prompt eval -- see Latency Analysis below.
| Tokens | Spec Decoding | 1.5B Alone | Speedup |
|---|---|---|---|
| 16 | 109ms | 212ms | 1.94x |
| 32 | 213ms | 368ms | 1.73x |
| 64 | 460ms | 856ms | 1.86x |
| 128 | 622ms | 1670ms | 2.68x |
Built "Dash," a VS Code extension (~1500 lines) to validate the end-to-end experience:
- FIM prompt construction with cross-file context via the VS Code language server (see the sketch after this list)
- KV cache warmup on file switches and cursor jumps
- Per-completion telemetry with latency, confidence scores (logprobs), and accept/reject tracking
- Context-aware debouncing (100ms at completion points, 500ms during active typing)
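As a rough illustration of the prompt construction step: the special tokens below (`<|repo_name|>`, `<|file_sep|>`, `<|fim_prefix|>`, `<|fim_suffix|>`, `<|fim_middle|>`) are Qwen2.5-Coder's FIM vocabulary, but everything else -- the function names, the chars/4 token estimate, the exact repo-level layout, the placeholder repo name -- is an assumption for the sketch, not the actual Dash source.

```typescript
// Minimal sketch of FIM prompt assembly with a cross-file context budget.
interface ContextSnippet {
  filePath: string; // e.g. a resolved import target
  content: string;  // signatures/declarations pulled via the language server
}

const CHARS_PER_TOKEN = 4; // crude estimate, good enough for budgeting

function approxTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

function buildFimPrompt(
  prefix: string,
  suffix: string,
  snippets: ContextSnippet[],
  budgetTokens = 2048,
): string {
  // The current file (FIM core) always goes in; context snippets fill what's left.
  const core = `<|fim_prefix|>${prefix}<|fim_suffix|>${suffix}<|fim_middle|>`;
  let remaining = budgetTokens - approxTokens(core);

  const contextParts: string[] = [];
  for (const s of snippets) {
    const part = `<|file_sep|>${s.filePath}\n${s.content}\n`;
    const cost = approxTokens(part);
    if (cost > remaining) break;
    contextParts.push(part);
    remaining -= cost;
  }

  // "workspace" is a placeholder repo name.
  return `<|repo_name|>workspace\n${contextParts.join("")}${core}`;
}
```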
Tested first with the base 1.5B model (speculative decoding), then with the 32B teacher swapped in:
| Metric | Base 1.5B | Teacher 32B |
|---|---|---|
| Acceptance rate | 23.9% | 33.3% |
The 1.4x acceptance improvement from base to teacher affirms that training is worthwhile.
Initial benchmarks showed 176ms avg latency with 99% cache hit rate. Real-world usage told a different story:
| Metric | Lab benchmark | Real IDE usage |
|---|---|---|
| Avg latency | 176ms | 2,092ms |
| Cache hit rate | 99% | 28% |
| Prompt size | ~20 tokens | ~2,000 tokens |
Root causes identified:
- Prompt eval dominates (82% of latency). With 2,048-token FIM prompts, the 1.5B target model takes ~2.5s for prompt evaluation on cache miss. Generation (the part speculative decoding speeds up) is only ~560ms.
- Cache misses are frequent. Real coding involves jumping between files and scrolling -- each jump invalidates the KV cache. The 28% hit rate reflects natural coding behavior, not a bug.
- Warmup contention. Cache warmup requests and completion requests competed for a single llama-server slot, causing completions to queue behind warmups.
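A rough consistency check using only the numbers above: 0.28 × 176ms (cache hit) + 0.72 × (2,500ms prompt eval + 560ms generation) ≈ 2,250ms, within about 10% of the observed 2,092ms average.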
Mitigations applied:
- Increased server parallelism from 1 to 3 slots (~484 MB total, trivial on 16 GB)
- Increased warmup debounce from 300ms to 1.5s (avoids firing warmups during rapid navigation; see the sketch after this list)
- Estimated improvement: avg latency from ~2,092ms to ~1,500-1,700ms
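A minimal sketch of the debounced warmup, assuming llama.cpp's HTTP server API (`/completion` with `cache_prompt` and `n_predict: 0`); the port and function names are placeholders rather than the actual Dash code. With 3 parallel slots, these requests no longer block real completions.

```typescript
// Debounced KV-cache warmup (illustrative sketch, not the shipped Dash code).
const WARMUP_DEBOUNCE_MS = 1500; // raised from 300ms to skip rapid navigation
let warmupTimer: ReturnType<typeof setTimeout> | undefined;

function scheduleWarmup(prompt: string): void {
  if (warmupTimer !== undefined) clearTimeout(warmupTimer);
  warmupTimer = setTimeout(() => void warmUp(prompt), WARMUP_DEBOUNCE_MS);
}

async function warmUp(prompt: string): Promise<void> {
  // Ask the server to evaluate the prompt but generate nothing, so the
  // KV cache is hot when a real completion request arrives.
  await fetch("http://127.0.0.1:8080/completion", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt, n_predict: 0, cache_prompt: true }),
  }).catch(() => {
    /* warmup is best-effort; ignore failures */
  });
}
```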
Key insight: The 1.5B model's prompt eval is the fundamental bottleneck. Speculative decoding only helps the generation phase, which is already fast. The 0.5B model alone has ~3.5x faster prompt eval.
8.7% acceptance rate (6/69). Accepted completions had significantly higher confidence (avg logprob -0.035 vs -0.121 for rejected).
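The confidence gap suggests a simple client-side filter. A sketch, with a made-up threshold sitting between the two observed averages; how the per-token logprobs are extracted from the server response is left to the client.

```typescript
// Confidence filter sketch (threshold is illustrative, tune against telemetry).
const MIN_AVG_LOGPROB = -0.08; // between accepted (-0.035) and rejected (-0.121) averages

function shouldShowCompletion(tokenLogprobs: number[]): boolean {
  if (tokenLogprobs.length === 0) return false;
  const avg =
    tokenLogprobs.reduce((sum, lp) => sum + lp, 0) / tokenLogprobs.length;
  return avg >= MIN_AVG_LOGPROB;
}
```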
Rejection causes (excluding latency):
| Category | % | Description | Fix |
|---|---|---|---|
| Wrong content | 46% | Plausible but incorrect code (wrong model names, wrong API patterns) | Training (SFT + distillation) |
| TSX/TS struggle | 21% | 0% acceptance on TypeScript/TSX | Training (broader language coverage) |
| Redundant with suffix | 13% | Suggests code already present after cursor | Extension (better suffix dedup) |
| Trivial | 13% | Single tokens or obvious fragments not worth showing | Extension (minimum completion length) |
| Hallucinated API | 8% | Invents API calls that don't match actual library | Training (distillation from 32B) |
The core quality problem (46% wrong content + 8% hallucinated APIs) is that the base 0.5B model generates from memory instead of following context. The cross-file context system already provides the right information -- import resolution via the VS Code language server injects actual function signatures into the prompt. The model just ignores it. For example, it suggests client.completion() (old Anthropic SDK pattern memorized during pre-training) when both the suffix and cross-file context show the newer client.messages.create() API.
This means training needs to teach the model to copy from context rather than generate from memory. Whether a 0.5B model has enough capacity to do this reliably is the open question -- and exactly what the context matching experiment is designed to test. The redundant/trivial categories (26% combined) are solvable in the extension layer without training.
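A sketch of those two extension-layer fixes (suffix dedup and a minimum completion length); names and thresholds are illustrative, not the Dash source.

```typescript
// Extension-layer filters for the "redundant with suffix" and "trivial" categories.
const MIN_USEFUL_LENGTH = 3; // non-whitespace characters worth showing

function isRedundantWithSuffix(completion: string, suffix: string): boolean {
  // Drop the completion if the code it proposes already appears right after
  // the cursor (ignoring leading whitespace differences).
  const normalized = completion.trim();
  return normalized.length > 0 && suffix.trimStart().startsWith(normalized);
}

function isTrivial(completion: string): boolean {
  return completion.trim().length < MIN_USEFUL_LENGTH;
}

function passesExtensionFilters(completion: string, suffix: string): boolean {
  return !isTrivial(completion) && !isRedundantWithSuffix(completion, suffix);
}
```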
```
VS Code Extension (Dash)                  llama-server
+---------------------------+           +-------------------+
| FIM prompt construction   |   HTTP    | Qwen2.5-Coder     |
| Cross-file context (LSP)  |  ------>  | Q4_K_M quantized  |
| Cache warmup              |           | Metal GPU accel   |
| Telemetry + metrics       |  <------  | 3 parallel slots  |
| Confidence filtering      |           |                   |
+---------------------------+           +-------------------+
```
Shoot for the best possible performance using the 0.5B model. At 500M params with Q4_K_M quantization, prompt eval is fast (~700ms for 2K tokens vs ~2.5s for the 1.5B). If training closes the quality gap sufficiently, the 0.5B alone may be all we need.
Training pipeline:
- SFT on IDE-style completions (FIM format, cursor positions, short completions)
- Distillation from 32B teacher to transfer quality (see the sketch after this list)
- Smarter context -- don't greedily fill 2K tokens; use only what the completion needs
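One simple reading of the distillation step is sequence-level distillation: sample the 32B teacher on IDE-style FIM prompts and train the 0.5B on its outputs. A hedged sketch, assuming the teacher is served by llama-server; the port, sampling settings, and function names are placeholders.

```typescript
// Build (prompt, teacher_completion) pairs by sampling the 32B teacher.
interface TrainingPair {
  prompt: string;     // FIM prompt, same format Dash sends at inference time
  completion: string; // teacher's completion, capped at 32 tokens
}

async function teacherCompletion(prompt: string): Promise<string> {
  const res = await fetch("http://127.0.0.1:8080/completion", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt, n_predict: 32, temperature: 0.2 }),
  });
  const body = (await res.json()) as { content: string };
  return body.content;
}

async function buildDistillationSet(prompts: string[]): Promise<TrainingPair[]> {
  const pairs: TrainingPair[] = [];
  for (const prompt of prompts) {
    pairs.push({ prompt, completion: await teacherCompletion(prompt) });
  }
  return pairs;
}
```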
After training, compare the fine-tuned 0.5B against the base 1.5B on real IDE acceptance rate. If the trained 0.5B matches or exceeds the base 1.5B, skip speculative decoding entirely and ship the 0.5B alone.
If 1.5B quality is still required for complex completions, run both models as separate servers:
- Fast tier (0.5B only): Simple completions (after `.`, `:`, brackets). Sub-300ms.
- Quality tier (0.5B + 1.5B speculative): Complex completions (new lines, function bodies). 1-2s.
The extension routes based on trigger context. Most completions hit the fast tier.
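A minimal sketch of that routing, assuming each tier runs as its own llama-server instance; the trigger set and ports are placeholders, not the shipped Dash behavior.

```typescript
// Trigger-based routing between the fast and quality tiers.
type Tier = "fast" | "quality";

const FAST_TRIGGERS = new Set([".", ":", "(", "[", "{", ","]);

function pickTier(charBeforeCursor: string, lineIsEmpty: boolean): Tier {
  // Latency-sensitive completions (member access, call arguments) go to the
  // 0.5B server; fresh lines and function bodies go to the speculative pair.
  if (!lineIsEmpty && FAST_TRIGGERS.has(charBeforeCursor)) return "fast";
  return "quality";
}

const TIER_ENDPOINTS: Record<Tier, string> = {
  fast: "http://127.0.0.1:8080/completion",    // 0.5B only
  quality: "http://127.0.0.1:8081/completion", // 0.5B draft + 1.5B target
};
```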
```
nanocomplete/
├── dash/                             VS Code extension ("Dash")
│   └── src/
│       ├── extension.ts              Entry point, cache warmup
│       ├── completionProvider.ts     Completion logic, metrics, telemetry
│       ├── fimFormatter.ts           FIM prompt construction
│       ├── crossFileContext.ts       Import resolution via LSP
│       └── llamaClient.ts            HTTP client for llama-server
├── evals/                            Evaluation harness (RepoBench, HumanEval)
├── scripts/                          Server, eval, and pipeline scripts
└── docs/                             Design documents and experiment results
```
See dash/README.md for setup (model download, server startup, extension install).
| Doc | Description |
|---|---|
| docs/01_project_spec.md | Project specification |
| docs/21_baseline_evals.md | Full baseline evaluation results |
| docs/14_speculative_decoding_approach.md | Speculative decoding validation |
| docs/26_teacher_model_testing.md | Teacher model testing in real IDE usage |
| docs/31_extension_architecture.md | Dash extension architecture |
| docs/40_training_strategy.md | Training approach and pipeline |
MIT