The Largest Coding Benchmark for Large Language Models
- List of LLM Models
- Languages and Task Ideas
- Task Prompts
- 1. Algorithms – Extreme Optimization
- 2. Computer Science Theory / Miscellaneous
- 3. Frontend (Next/React/Vanilla JS)
- 4. Secure Coding / Safety-Critical
- 5. Mobile Development (Android / iOS)
- 6. Backend
- 7. Desktop (Windows/Mac/Linux)
- 8. Pentest / Hardening
- 9. Systems Programming
- 10. Shell Scripts
- 11. Game Development (Unreal, Unity, Godot)
- 12. Miscellaneous / Other Picks
- Evaluation System (Rubric)
- For Evaluators
we will test many llms from 11 companies:
- GPT-5
- GPT-5 + Think Deeper
- GPT-5 + Web Search
- GPT-OSS-20B
- GPT-OSS-120B
- DeepSeek
- DeepSeek with Thinking
- DeepSeek with Web Search
- DeepSeek with Thinking + Web Search
- Qwen3-Max-preview
- Qwen3-Max-preview with web search
- Qwen3-Coder
- Qwen3-Coder with Web Search
- Qwen3-30B-A3B-2507
- Qwen3-30B-A3B-2507 with Web Search
- Qwen3-30B-A3B-2507 with Thinking [81,920 tokens]
- Qwen3-30B-A3B-2507 with Thinking + Web Search
- Qwen3-235B-A22B-2507
- Qwen3-235B-A22B-2507 with Internet Search
- Qwen3-235B-A22B-2507 with Thinking [81,920 tokens]
- Qwen3-235B-A22B-2507 with Thinking + Web Search
- Qwen3-Next-80B-A3B
- Qwen2.5-Max
- Qwen2.5-Max with Web Search
- Qwen2.5-Max with Thinking
- Qwen2.5-Max with Thinking + Internet Search
- Grok3
- Grok3 + DeepSearch
- Grok4
- Grok4 + DeepSearch
- Grok4 Heavy
- Grok4 Heavy + DeepSearch
- Claude Sonnet 4.1
- Claude Sonnet 4.1 with Web Search
- Claude Sonnet 4.1 with Extended Thinking
- Claude Sonnet 4.1 with Web Search + Extended Thinking
- Claude Sonnet 4
- Claude Sonnet 4 with Web Search
- Claude Sonnet 4 with Extended Thinking
- Claude Sonnet 4 with Web Search + Extended Thinking
- Claude Sonnet 3.7
- Claude Sonnet 3.7 with Web Search
- Claude Sonnet 3.7 with Extended Thinking
- Claude Sonnet 3.7 with Web Search + Extended Thinking
- Claude Opus 4.1
- Claude Opus 4.1 with Web Search
- Claude Opus 4.1 with Extended Thinking
- Claude Opus 4.1 with Web Search + Extended Thinking
- Claude Opus 3
- Claude Opus 3 with Web Search
- Claude Opus 3 with Extended Thinking
- Claude Opus 3 with Web Search + Extended Thinking
- Claude Haiku 3.5
- Claude Haiku 3.5 with Extended Thinking
- Gemini 2.5 Flash
- Gemini 2.5 Pro
- Copilot Fast Answer
- Copilot Think Deeper
- Copilot Smart (GPT-5)
- Kimi K2
- Kimi K2 with Web Search
- Kimi K1.5
- Kimi K1.5 with WebSearch
- Kimi K1.5 with Thinking
- Kimi K1.5 with Web Search + Thinking
- Highly optimized algorithms
- Computer science theory
- Frontend (Next.js / React / Vanilla JS)
- Secure coding
- Mobile development (Android / iOS)
- Backend systems
- Desktop applications (Windows, macOS, Linux)
- Penetration testing and security hardening
- Systems programming
- Shell scripting
- Game development (Unreal Engine, Unity, Godot)
- Python – widely supported by LLMs
- C – requires careful memory management
- Rust – modern, safe systems language
- C# – used for Windows apps and Unity
- C++ – used for performance-critical applications
- Kotlin – primary Android development
- Java – versatile backend and mobile language
- Go – efficient for backend and networking
- Lisp/Haskell – functional programming
- Bash – shell scripting
- Assembly – low-level performance tuning
Prompt:
You are given a weighted directed graph with up to N = 20,000 vertices and M up to 200,000 edges. Implement an algorithm that computes all-pairs shortest paths only for K specified source vertices (K ≤ 1000) and returns distances to all vertices for those K sources. Memory is extremely limited: your program must not allocate more than 150 MB of heap memory. Time limit: target O(K * (M log N / 4)) — show optimizations that reduce constant factors. Provide both code and a short explanation of optimizations used.
Languages: C, C++, Rust, Go
Difficulty: Very Hard
Constraints:
- No external libraries except standard library
- Memory usage ≤ 150 MB
- Use memory pools / custom allocators
- Code must fit within ~300 lines
Acceptance Criteria:
- Correctness on random graphs
- Peak memory under 150MB
- Runtime within 2× of optimal Dijkstra per source
Approximate Lines: 150–300
Prompt:
Implement a static kd-tree for 3D points (N up to 10M) optimized for minimal cache misses and compact memory layout. Provide construction and k-NN query (k ≤ 16) APIs. Emphasize contiguous storage, iterative algorithms (no recursion), and branch-minimizing traversal. Explain memory layout choices.
Languages: C, C++, Rust
Difficulty: Very Hard
Constraints:
- No recursion
- Data stored in a single contiguous block
- Include microbenchmarks
Acceptance Criteria:
- Correct nearest neighbors on tests
- Demonstrable cache miss reduction
Approximate Lines: 200
Prompt:
Implement the core of a DPLL-style SAT solver with unit propagation, watched literals, non-chronological backtracking (learning a single clause per conflict). The solver must solve crafted SAT instances up to 10^6 variables/clauses efficiently. Provide APIs: add_clause
, solve
. Explain heuristics used.
Languages: C, Rust, C++
Difficulty: Very Hard
Constraints:
- Must implement watched literals
- Iterative approach preferred
Acceptance Criteria:
- Solve small CNF benchmarks
- Demonstrate conflict-driven learning
Approximate Lines: 200–300
Prompt:
Write a Hindley–Milner style type inference engine (including let-polymorphism and simple algebraic data types) for an ML-like language. Return principal type or error. Provide tests.
Languages: OCaml, Haskell, Rust, Python
Difficulty: Hard
Constraints:
- Prefer immutable data structures
- Show Algorithm W
Acceptance Criteria:
- Correct principal types for provided expressions
Approximate Lines: 150
(... continue with the rest of the tasks similarly formatted ...)
Category | Points |
---|---|
Correctness & Functional Tests | 40 |
Performance & Resource Limits | 20 |
Security & Robustness | 15 |
Adherence to Constraints | 10 |
Code Quality, Readability | 10 |
Tests & Reproducibility | 5 |
- Partial solutions (e.g., design only) can receive up to 25 points.
- Disallowed external libraries deduct up to 10 points from "Adherence".
Default weights for evaluation categories:
Category | Weight |
---|---|
Algorithms | 0.12 |
CS Theory | 0.10 |
Frontend | 0.08 |
Secure Coding | 0.12 |
Mobile | 0.08 |
Backend | 0.12 |
Desktop | 0.06 |
Pentest/Hardening | 0.08 |
Systems | 0.08 |
Shell Scripts | 0.03 |
GameDev | 0.05 |
Misc/FP/Assembly | 0.08 |
Score Range | Level | Description |
---|---|---|
0–29 | Novice | Basic snippets, often incorrect |
30–54 | Competent | Works on easy/medium tasks |
55–74 | Advanced | Good performance, some edge cases missed |
75–89 | Expert | High correctness, strong tests |
90–100 | Architect | Production-ready, well-documented |
- Run unit/integration tests
- Measure memory and runtime
- Run fuzz tests for parsers/security tasks
- Check constraint adherence (e.g., banned APIs)
- Review comments and threat models
- Evaluate modularization if task is large
- Request brevity: Add “Solution must be ≤ 300 lines” to each prompt
- Accept pseudocode/core proofs for very long tasks
- Encourage inclusion of minimal unit tests within responses