Pure Rust inference engine for the SmolLM3-3B language model. No Python runtime, no CUDA, no external dependencies. Single executable + quantized weights = portable AI on any machine.
Now with GPU acceleration! Auto-detects Vulkan-compatible GPUs for ~4.8x faster inference, with intelligent fallback to CPU.
| Property | Value |
|---|---|
| Engine | QORA (Pure Rust) |
| Base Model | SmolLM3-3B (HuggingFaceTB/SmolLM3-3B) |
| Parameters | 3.07 Billion |
| Quantization | Q4 (4-bit symmetric, group_size=32) |
| Model Size | 1.68 GB (Q4) / ~6 GB (F16) |
| Executable | ~37 MB (GPU+CPU) / ~7 MB (CPU-only) |
| Context Length | 65,536 tokens (up to 128K with YaRN) |
| Platform | Windows x86_64, Linux x86_64, macOS aarch64 |
| GPU Backend | Vulkan (Windows/Linux) / Metal (macOS) — auto-detect with CPU fallback |
| System Intelligence | Auto-detects RAM, adjusts token limits, sentence-boundary clean stop |
QORA automatically detects Vulkan-compatible GPUs and uses them for inference. If no GPU is available or VRAM is insufficient, it falls back to CPU seamlessly.
| Requirement | Value |
|---|---|
| Minimum VRAM | ~2.5 GB (Q4 weights + KV cache + activations) |
| GPU API | Vulkan 1.1+ (Windows/Linux) or Metal (macOS) |
| Tested On | GTX 1660 SUPER (6 GB VRAM) |
| Metric | GPU | CPU | Speedup |
|---|---|---|---|
| Decode Speed | ~4.1 tok/s | ~0.86 tok/s | ~4.8x |
| VRAM Usage | ~2.3 GB | — | — |
QORA includes intelligent VRAM management:
- Pre-flight check: Probes GPU with a 256 MB test allocation before loading the full model
- Estimated VRAM: Prints estimated VRAM requirement before loading
- Panic recovery: If the GPU runs out of memory during inference, catches the error and falls back to CPU
- Manual override: Use the `--cpu` flag to skip GPU and run on CPU directly
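The panic-recovery and manual-override behaviour in the list above can be sketched with `std::panic::catch_unwind`. This is only an illustration of the pattern, not QORA's actual code: `generate_with_fallback`, `Backend`, and the two closures are hypothetical names standing in for the GPU and CPU inference paths.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Which path produced the result (hypothetical enum, for illustration only).
#[derive(Debug, PartialEq)]
pub enum Backend {
    Gpu,
    Cpu,
}

/// Try the GPU path first; if it panics (for example because an
/// out-of-memory condition surfaces as a panic), run the CPU path instead.
/// `force_cpu` mirrors the `--cpu` flag and skips the GPU attempt entirely.
pub fn generate_with_fallback<G, C>(force_cpu: bool, gpu: G, cpu: C) -> (Backend, String)
where
    G: FnOnce() -> String,
    C: FnOnce() -> String,
{
    if !force_cpu {
        // AssertUnwindSafe: all GPU-side state is dropped after a panic,
        // so nothing left half-updated is observed afterwards.
        if let Ok(text) = panic::catch_unwind(AssertUnwindSafe(gpu)) {
            return (Backend::Gpu, text);
        }
    }
    (Backend::Cpu, cpu())
}
```

Note that `catch_unwind` only works when the binary is built with the default unwinding panic strategy; with `panic = "abort"` the process would exit instead of falling back.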
SmolLM3-3B is a decoder-only transformer with several advanced features:
| Component | Details |
|---|---|
| Layers | 36 decoder layers |
| Hidden Size | 2,048 |
| Attention Heads | 16 (Query) / 4 (KV) — Grouped Query Attention |
| Head Dimension | 128 |
| MLP (Intermediate) | 11,008 (SwiGLU: gate + up + down) |
| Vocabulary | 128,256 tokens |
| Normalization | RMSNorm (eps=1e-6) |
| Position Encoding | NoPE scheme: RoPE on 3 of every 4 layers, no positional encoding on every 4th layer (9/36 layers) |
| RoPE Theta | 5,000,000 |
| Activation | SiLU (Sigmoid Linear Unit) |
| Embeddings | Tied (input = output projection) |
SmolLM3 uses a 3:1 RoPE-to-NoPE ratio: every 4th layer (zero-indexed layers 3, 7, 11, 15, 19, 23, 27, 31, 35) has no positional encoding at all, and the other 27 layers apply RoPE. This slightly reduces computational overhead and is intended to enable better long-context generalization.
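The 16 query / 4 KV head split in the table above means each KV head is shared by a group of four query heads. The sketch below shows that mapping and the resulting projection widths. It assumes the common convention that consecutive query heads share a KV head; the constant and function names are illustrative, not taken from the QORA source.

```rust
/// Values from the architecture table above.
pub const NUM_Q_HEADS: usize = 16;
pub const NUM_KV_HEADS: usize = 4;
pub const HEAD_DIM: usize = 128;

/// Number of query heads that share one KV head (16 / 4 = 4).
pub const Q_PER_KV: usize = NUM_Q_HEADS / NUM_KV_HEADS;

/// Index of the KV head used by query head `q`, assuming contiguous
/// grouping (query heads 0..=3 use KV head 0, 4..=7 use KV head 1, ...).
pub fn kv_head_for(q: usize) -> usize {
    assert!(q < NUM_Q_HEADS, "query head index out of range");
    q / Q_PER_KV
}

/// Width of the query projection (16 * 128 = 2048, the hidden size)
/// and of each key or value projection (4 * 128 = 512).
pub fn projection_widths() -> (usize, usize) {
    (NUM_Q_HEADS * HEAD_DIM, NUM_KV_HEADS * HEAD_DIM)
}
```

Because keys and values are only 512 wide instead of 2,048, the KV cache is a quarter of the size it would be with full multi-head attention.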
model/
qora.exe — ~37 MB Inference engine (GPU+CPU, single binary)
model.qora — 1.68 GB Q4 quantized weights (4-bit)
tokenizer.json — 16.4 MB Tokenizer vocabulary
config.json — 540 B Model configuration
README.md — This file
For the fastest results, use `--no-think --greedy`:

    .\qora.exe --load model.qora --prompt "What is X?" --no-think --greedy

This skips the thinking phase and uses deterministic decoding, so you get a direct answer immediately.

Tip: Think mode produces better answers for complex questions (math, coding, reasoning) but uses 100-300+ tokens just for thinking before the answer appears. For simple factual questions, `--no-think` is much faster.

    # Fastest: direct answer, no thinking, deterministic
    qora.exe --load model.qora --prompt "What is the capital of France?" --no-think --greedy
    # Fast: direct answer with some randomness
    qora.exe --load model.qora --prompt "Tell me about Mars" --no-think
    # Full quality: thinking mode (slower but better for complex questions)
    qora.exe --load model.qora --prompt "Solve: if x^2 + 3x = 10, what is x?" --max-tokens 1024
    # See what the model is thinking
    qora.exe --load model.qora --prompt "What is 2+2?" --show-think
    # Force CPU (skip GPU auto-detect)
    qora.exe --load model.qora --prompt "Hello" --cpu
    # Control output length
    qora.exe --load model.qora --prompt "Tell me a story" --max-tokens 512
    # Raw text completion (no chat template)
    qora.exe --load model.qora --prompt "Once upon a time" --raw --max-tokens 128

| Flag | Default | Description |
|---|---|---|
| `--load <path>` | `model.qora` | Load from .qor3b binary (fast, ~2-5s) |
| `--prompt <text>` | "Hello, how are you?" | Input prompt |
| `--max-tokens <n>` | auto (smart) | Maximum tokens to generate (auto-adjusted by RAM) |
| `--think-budget <n>` | auto (smart) | Maximum thinking tokens before forcing `</think>` |
| `--no-think` | off | Disable thinking mode (faster, direct answers) |
| `--greedy` | off | Greedy decoding (temperature=0, deterministic) |
| `--show-think` | off | Display thinking content on stderr |
| `--raw` | off | Raw text completion (no chat template) |
| `--cpu` | off | Force CPU inference (skip GPU auto-detect) |
| Mode | Speed (GPU) | Speed (CPU) | Best For |
|---|---|---|---|
| `--no-think --greedy` | ~4.1 tok/s | ~1 tok/s | Fastest. Simple factual questions. |
| `--no-think` | ~4.1 tok/s | ~1 tok/s | Fast with variety. General questions. |
| `--show-think` | ~4.1 tok/s | ~1 tok/s | See reasoning. Complex questions. |
| (default think mode) | ~4.1 tok/s | ~1 tok/s | Best quality but thinking uses 100-300+ tokens before answer appears. |
Tested on i5-11500 (6C/12T, AVX-512), 16GB RAM, GTX 1660 SUPER (6GB), Windows 11.
| Metric | GPU | CPU |
|---|---|---|
| Model Load (binary) | ~3-17s | ~3-17s |
| Prefill Speed | — | ~1.3-2.2 tok/s |
| Decode Speed | ~4.1 tok/s | ~1.1 tok/s |
| Single Decode Step | ~244ms | ~830ms |
| Memory (Q4) | ~2.3 GB VRAM | 1,681 MB RAM |
| Memory (F16) | ~6 GB VRAM | ~6,000 MB RAM |
All tests run with --greedy --no-think for deterministic, reproducible results.
Prompt: "What is the capital of France? Answer in one sentence."
Output:
The capital of France is Paris.
| Metric | Value |
|---|---|
| Correctness | PASS |
| Tokens Generated | 7 |
| Prefill | 123 tokens in 269.1s |
Prompt: "Solve step by step: If a train travels at 60 km/h for 2.5 hours, how far does it travel?"
Output:
To solve this problem, we can use the formula for distance:
Distance = Speed x Time
Given: Speed = 60 km/h Time = 2.5 hours
Step 1: Plug the values into the formula Distance = 60 km/h x 2.5 hours
Step 2: Perform the multiplication Distance = 150 km
So, the train travels 150 kilometers in 2.5 hours.
| Metric | Value |
|---|---|
| Correctness | PASS (150 km is correct) |
| Tokens Generated | 110 |
| Decode Speed | 0.20 tok/s |
Prompt: "Write a Python function to check if a number is prime. Include docstring."
Output:

    def is_prime(n):
        """
        Checks if a number is prime.
        Parameters:
        n (int): The number to check.
        Returns:
        bool: True if the number is prime, False otherwise.
        Raises:
        ValueError: If n is less than 1, since prime numbers are
        defined for natural numbers greater than 1.
        """
        if n < 1:
            raise ValueError("Prime numbers are defined for natural numbers greater than 1.")
        if n == 1 or n == 2:
            return True
        if n % 2 == 0:
            return False
        for i in range(3, int(n**0.5) + 1, 2):
            if n % i == 0:
                return False
        return True

| Metric | Value |
|---|---|
| Correctness | PARTIAL (n==1 should return False; otherwise correct algorithm) |
| Tokens Generated | 200 |
| Decode Speed | 0.29 tok/s |
| Code Quality | Good structure, docstring, error handling, efficient trial division |
Prompt: "Explain quantum computing in 3 sentences for a 10-year-old."
Output:
Quantum computing is like a super-fast version of regular computers that use tiny particles called qubits to perform calculations much faster than today's computers. In quantum computing, these qubits can be in multiple states at once, allowing them to solve complex problems much more quickly than classical computers. However, quantum computing is still a new and complex field that is still being developed, so it's like a futuristic computer that is still in the process of learning and growing!
| Metric | Value |
|---|---|
| Correctness | PASS |
| Age-Appropriate | Yes (simple language, relatable analogy) |
| Tokens Generated | 92 |
| Decode Speed | 0.24 tok/s |
Prompt: "What are the three laws of thermodynamics? Be concise."
Output:
First Law of Thermodynamics: The law of energy conservation states that energy cannot be created or destroyed, only converted from one form to another.
Second Law of Thermodynamics: This law describes the direction of spontaneous processes, stating that the entropy (a measure of disorder or randomness) of an isolated system always increases over time, or that the entropy of a closed system will always increase or remain constant.
Third Law of Thermodynamics: This law states that as the temperature of a system approaches absolute zero, the entropy of the system approaches a constant value...
| Metric | Value |
|---|---|
| Correctness | PASS (all three laws accurately described) |
| Tokens Generated | 128 |
| Decode Speed | 0.25 tok/s |
Prompt: "If all roses are flowers, and some flowers fade quickly, can we conclude that some roses fade quickly? Explain your reasoning."
Output:
Yes, we can conclude that some roses fade quickly based on the given information.
Reasoning:
- All roses are flowers (given).
- Some flowers fade quickly (given).
- Since roses are a subset of flowers, any characteristic or property that applies to flowers can also be applied to roses.
Therefore, since some flowers fade quickly, and roses are flowers, it logically follows that some roses also fade quickly.
| Metric | Value |
|---|---|
| Correctness | FAIL (this is a syllogistic fallacy — "some flowers fade" does not imply roses specifically fade) |
| Tokens Generated | 128 |
| Decode Speed | 0.25 tok/s |
| Note | This is a known weakness of small LMs — the correct answer is "No, we cannot conclude this" |
| Test Category | Result | Score |
|---|---|---|
| Factual Knowledge | PASS | 1/1 |
| Math Reasoning | PASS | 1/1 |
| Code Generation | PARTIAL | 0.5/1 |
| Explanation | PASS | 1/1 |
| Science Knowledge | PASS | 1/1 |
| Logical Reasoning | FAIL | 0/1 |
| Total | | 4.5/6 (75%) |
Official scores from the HuggingFace model card. QORA runs the same weights with Q4 quantization (minimal accuracy loss).
| Benchmark | SmolLM3-3B | Qwen2.5-3B | Llama3.2-3B | Qwen3-4B |
|---|---|---|---|---|
| HellaSwag | 76.15 | 74.19 | 75.52 | 74.37 |
| ARC-CF | 65.61 | 59.81 | 58.58 | 62.11 |
| BoolQ | 78.99 | 73.61 | 75.33 | 74.28 |
| PIQA | 78.89 | 78.35 | 78.51 | 77.58 |
| Winogrande | 58.88 | 61.41 | 58.72 | 59.59 |
| CommonsenseQA | 55.28 | 49.14 | 60.60 | 52.99 |
| Benchmark | SmolLM3-3B | Qwen2.5-3B | Llama3.2-3B | Qwen3-4B |
|---|---|---|---|---|
| MMLU-CF | 44.13 | 42.93 | 41.32 | 47.65 |
| MMLU Pro CF | 19.61 | 16.66 | 16.42 | 24.92 |
| MMLU Pro MCF | 32.70 | 31.32 | 25.07 | 41.07 |
| OpenBookQA | 40.60 | 40.20 | 42.00 | 42.40 |
| Benchmark | SmolLM3-3B | Qwen2.5-3B | Llama3.2-3B | Qwen3-4B |
|---|---|---|---|---|
| HumanEval+ | 30.48 | 34.14 | 25.00 | 54.87 |
| MBPP+ | 52.91 | 52.11 | 38.88 | 63.75 |
| MATH (4-shot) | 46.10 | 40.10 | 7.44 | 51.20 |
| GSM8K (5-shot) | 67.63 | 70.13 | 25.92 | 74.14 |
| Benchmark | SmolLM3-3B | Qwen2.5-3B | Llama3.2-3B | Qwen3-4B |
|---|---|---|---|---|
| IFEval | 76.7 | 65.6 | 71.6 | 68.9 |
| AIME 2025 | 9.3 | 2.9 | 0.3 | 17.1 |
| GSM-Plus | 72.8 | 74.1 | 59.2 | 82.1 |
| LiveCodeBench | 15.2 | 10.5 | 3.4 | 24.9 |
| GPQA Diamond | 35.7 | 32.2 | 29.4 | 44.4 |
| Global MMLU | 53.5 | 50.54 | 46.8 | 65.1 |
| BFCL (Tools) | 92.3 | — | 92.3 | 95.0 |
| Benchmark | No Think | With Think | Improvement |
|---|---|---|---|
| AIME 2025 | 9.3 | 36.7 | +295% |
| GSM-Plus | 72.8 | 83.4 | +15% |
| LiveCodeBench | 15.2 | 30.0 | +97% |
| GPQA Diamond | 35.7 | 41.7 | +17% |
| Global MMLU | 53.5 | 64.1 | +20% |
| Benchmark | SmolLM3-3B | Qwen2.5-3B | Llama3.2-3B | Qwen3-4B |
|---|---|---|---|---|
| RULER 32K | 76.35 | 75.93 | 77.58 | 83.98 |
| RULER 64K | 67.85 | 64.90 | 72.93 | 60.29 |
| RULER 128K | 61.03 | 62.23 | 71.30 | 47.23 |
| Language | SmolLM3-3B | Qwen2.5-3B | Llama3.2-3B | Qwen3-4B |
|---|---|---|---|---|
| French | 63.94 | 57.47 | 57.66 | 61.00 |
| Spanish | 65.85 | 58.25 | 59.39 | 61.85 |
| German | 59.56 | 49.99 | 53.19 | 56.43 |
| Italian | 62.49 | 53.21 | 54.96 | 58.76 |
| Portuguese | 63.22 | 57.38 | 56.84 | 59.89 |
| Model | Params | Format | Size on Disk | Best At |
|---|---|---|---|---|
| QORA (SmolLM3-3B) | 3.07B | Q4 | 1.68 GB | Reasoning, multilingual, instruction following |
| Qwen2.5-3B | 3B | — | ~6 GB | Math (GSM8K), Winogrande |
| Llama3.2-3B | 3.2B | — | ~6 GB | Long context (128K), CommonsenseQA |
| Qwen3-4B | 4B | — | ~8 GB | Overall best (larger model), math, code |
- Best-in-class reasoning among 3B models (HellaSwag 76.15, ARC 65.61, BoolQ 78.99)
- Best instruction following (IFEval 76.7) — beats even Qwen3-4B
- Best multilingual performance among 3B models across 5 European languages
- Thinking mode boosts AIME by 295% — competitive reasoning from a 3B model
- 128K context support (RULER 128K score of 61.03, ahead of Qwen3-4B at 47.23)
QORA uses the Cortex framework's wgpu backend for GPU acceleration (Vulkan on Windows/Linux, Metal on macOS):
- Q4 on GPU: Weights are uploaded as Burn quantized tensors (Q4S + PackedU32). Matmul performs on-the-fly dequantization on the GPU — no need to decompress the full model into VRAM.
- KV Cache: Stored as f32 tensors on GPU, concatenated each step.
- Sampling: Logits are transferred to CPU for top-p/temperature sampling.
- 128MB stack thread: GPU inference runs in a dedicated thread with 128MB stack to handle Burn's deep lazy computation graphs.
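Since the KV cache is held as f32 tensors, its size grows linearly with context length and can be estimated from the architecture numbers. The sketch below is a back-of-the-envelope calculation, assuming the cache stores one key and one value vector per KV head for each of the 36 layers with no padding or extra buffers; the function name is hypothetical.

```rust
pub const NUM_LAYERS: u64 = 36;
pub const NUM_KV_HEADS: u64 = 4;
pub const HEAD_DIM: u64 = 128;
pub const BYTES_PER_F32: u64 = 4;

/// Estimated KV cache size in bytes for `tokens` cached positions:
/// 2 (K and V) x layers x KV heads x head dim x 4 bytes, per token.
pub fn kv_cache_bytes(tokens: u64) -> u64 {
    2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_F32 * tokens
}
```

Under these assumptions each token costs 147,456 bytes (144 KiB), so an 8,192-token context would need 1,152 MiB for the cache alone.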
On CPUs with AVX-512 support (for example Intel 11th-gen Core and AMD Zen 4 or later), QORA automatically uses hand-written AVX-512 SIMD kernels for a ~2.5x CPU speedup:
| Kernel | Technique | Speedup |
|---|---|---|
| Q4 GEMV | `permutexvar_ps` 16-entry LUT lookup, nibble extract via `cvtepu8_epi32` | ~2.5x |
| F16 GEMV | `cvtph_ps` f16→f32 + `fmadd_ps` FMA accumulation | ~2.5x |
| Fused gate+up | Parallel gate & up SIMD LUT decode in SwiGLU MLP | ~2.5x |
Detection is automatic at runtime — falls back to scalar code on non-AVX-512 CPUs with zero overhead.
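Runtime detection of this kind is typically done once at startup with the standard `is_x86_feature_detected!` macro, after which calls go through a function pointer. The sketch below shows that dispatch pattern only. The names are hypothetical, and to stay portable it returns the scalar routine in both branches where a real build would plug in its AVX-512 kernel.

```rust
/// Plain scalar dot product, used as the portable fallback.
pub fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// True if the running CPU reports AVX-512 Foundation support.
/// Always false on non-x86 targets, where the check is compiled out.
pub fn has_avx512() -> bool {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx512f") {
            return true;
        }
    }
    false
}

pub type DotFn = fn(&[f32], &[f32]) -> f32;

/// Pick a kernel once; callers keep the returned function pointer.
pub fn select_dot_kernel() -> (&'static str, DotFn) {
    if has_avx512() {
        // A real engine would return its hand-written AVX-512 kernel here;
        // this sketch reuses the scalar routine so it runs everywhere.
        ("avx512", dot_scalar as DotFn)
    } else {
        ("scalar", dot_scalar as DotFn)
    }
}
```

Selecting the kernel once keeps the feature check out of the inner loop, which is one way the fallback path can avoid per-call overhead.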
QORA uses symmetric 4-bit quantization with group_size=32:
- Each group of 32 float values is quantized to 4-bit integers
- One f32 scale factor per group
- Total: 4 bits/weight + 1 bit/weight overhead = ~5 bits effective
- Memory reduction: 32-bit -> ~5 bits = 6.4x compression
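The scheme in the list above can be sketched as follows. This is a minimal illustration, not the `.qora` on-disk format: it assumes the largest magnitude in a group maps to 7 with values rounded into -7..=7 (some formats use -8..=7), and it stores one 4-bit value per `i8` for readability where a real format would pack two per byte.

```rust
pub const GROUP_SIZE: usize = 32;

/// One quantized group: a shared f32 scale plus 32 signed 4-bit values.
pub struct Q4Group {
    pub scale: f32,
    pub q: [i8; GROUP_SIZE],
}

/// Symmetric quantization: scale so the largest magnitude maps to 7,
/// then round each value to the nearest integer in -7..=7.
pub fn quantize_group(x: &[f32; GROUP_SIZE]) -> Q4Group {
    let max_abs = x.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    // An all-zero group gets scale 1.0 to avoid dividing by zero.
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 7.0 };
    let mut q = [0i8; GROUP_SIZE];
    for (qi, v) in q.iter_mut().zip(x.iter()) {
        *qi = (v / scale).round().clamp(-7.0, 7.0) as i8;
    }
    Q4Group { scale, q }
}

/// Reconstruct approximate f32 values: w = q * scale.
pub fn dequantize_group(g: &Q4Group) -> [f32; GROUP_SIZE] {
    let mut out = [0.0f32; GROUP_SIZE];
    for (o, qi) in out.iter_mut().zip(g.q.iter()) {
        *o = f32::from(*qi) * g.scale;
    }
    out
}

/// Effective storage cost: 4 bits per weight plus one 32-bit scale per group.
pub fn bits_per_weight() -> f32 {
    4.0 + 32.0 / GROUP_SIZE as f32
}
```

With `GROUP_SIZE = 32` the scale adds 32 / 32 = 1 bit per weight, giving the 5 bits and 32 / 5 = 6.4x figures quoted above. The round-trip error of each weight is at most half a scale step.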
1. Model Load — Read .qora binary (Q4 weights + f16 norms)
2. GPU Detect — Probe Vulkan GPU, check VRAM, fallback to CPU if needed
3. Upload — Transfer Q4 weights to GPU (or keep on CPU)
4. Tokenize — Encode prompt with chat template
5. Prefill — Process full prompt through 36 layers (batched)
6. Decode Loop — Generate tokens one at a time:
   a. Embedding lookup
   b. 36x: RMSNorm -> Attention (GQA, KV cache) -> RMSNorm -> SwiGLU MLP
   c. Final RMSNorm -> LM head (tied weights)
   d. Sample (top-p, temperature)
7. Detokenize — Decode token IDs back to text
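Two of the building blocks named in the decode loop, RMSNorm and the SwiGLU gate, are small enough to write out. The functions below are a reference sketch on plain `f32` slices with hypothetical names. They show the math only and say nothing about how QORA's kernels are organised; the matrix multiplications around them are omitted.

```rust
/// RMSNorm: y_i = x_i / sqrt(mean(x^2) + eps) * w_i
/// (the model table lists eps = 1e-6).
pub fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    assert_eq!(x.len(), weight.len());
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(weight).map(|(v, w)| v * inv_rms * w).collect()
}

/// SiLU activation: x * sigmoid(x).
pub fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Element-wise part of SwiGLU: silu(gate) * up.
/// `gate` and `up` are the outputs of the gate and up projections;
/// the result is what gets fed to the down projection.
pub fn swiglu(gate: &[f32], up: &[f32]) -> Vec<f32> {
    assert_eq!(gate.len(), up.len());
    gate.iter().zip(up).map(|(g, u)| silu(*g) * u).collect()
}
```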
| Parameter | Default | Description |
|---|---|---|
| Temperature | 0.6 (think) / 0.7 (no-think) | Controls randomness (0 = greedy) |
| Top-K | 20 | Keep only top 20 candidates before nucleus sampling |
| Top-P | 0.95 | Nucleus sampling threshold |
| Repetition Penalty | 1.1 | Discourages repeating recent tokens (window=64) |
| Presence Penalty | 1.5 | Flat subtraction for any previously-seen token |
| Max Tokens | auto (RAM-based) | Maximum generation length |
| Think Budget | auto (RAM-based) | Maximum thinking tokens |
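A sketch of how temperature, top-k, and top-p from the table above combine into a sampling step. It assumes temperature is applied first, then top-k, then top-p, and it leaves out the repetition and presence penalties. The random number is passed in as `u` so the example needs no external crate; all function names are hypothetical.

```rust
/// Greedy decoding: index of the largest logit (what temperature 0 reduces to).
pub fn argmax(logits: &[f32]) -> usize {
    let mut best = 0;
    for (i, &l) in logits.iter().enumerate() {
        if l > logits[best] {
            best = i;
        }
    }
    best
}

/// Returns the surviving (token_id, probability) pairs, most likely first.
pub fn filter_candidates(
    logits: &[f32],
    temperature: f32,
    top_k: usize,
    top_p: f32,
) -> Vec<(usize, f32)> {
    assert!(!logits.is_empty() && temperature > 0.0 && top_k > 0);
    // Softmax with temperature; subtracting the max keeps exp() stable.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let mut probs: Vec<(usize, f32)> = logits
        .iter()
        .enumerate()
        .map(|(i, &l)| (i, ((l - max) / temperature).exp()))
        .collect();
    let total: f32 = probs.iter().map(|p| p.1).sum();
    for p in probs.iter_mut() {
        p.1 /= total;
    }
    // Top-k: keep the k most probable tokens.
    probs.sort_by(|a, b| b.1.total_cmp(&a.1));
    probs.truncate(top_k);
    // Top-p: keep the shortest prefix whose cumulative mass reaches top_p.
    let mut cum = 0.0;
    let mut keep = probs.len();
    for (n, p) in probs.iter().enumerate() {
        cum += p.1;
        if cum >= top_p {
            keep = n + 1;
            break;
        }
    }
    probs.truncate(keep);
    // Renormalise what is left so it sums to 1 again.
    let kept: f32 = probs.iter().map(|p| p.1).sum();
    for p in probs.iter_mut() {
        p.1 /= kept;
    }
    probs
}

/// Draw one token given a uniform random number `u` in [0, 1).
pub fn pick(candidates: &[(usize, f32)], u: f32) -> usize {
    let mut cum = 0.0;
    for &(id, p) in candidates {
        cum += p;
        if u < cum {
            return id;
        }
    }
    candidates[candidates.len() - 1].0
}
```

With `--greedy` none of the filtering is needed and `argmax` alone decides the next token, which is why that mode is deterministic.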
QORA automatically detects your system resources and adjusts parameters:
| Available RAM | Max Tokens | Think Budget | Note |
|---|---|---|---|
| < 4 GB | 512 | 256 | Very low RAM warning |
| 4-8 GB | 1024 | 1024 | Low RAM warning |
| 8-12 GB | 2048 | 2048 | Normal |
| > 12 GB | 8192 | 8192 | Full capability |
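The tier table above amounts to a simple threshold lookup. The sketch below encodes it under one assumption the table leaves open: a value exactly on a boundary (4, 8, or 12 GB) is placed in the tier that starts at that value, except 12 GB, which stays in the 8-12 GB tier because the top tier is strictly greater than 12. `Limits` and `limits_for_ram` are hypothetical names.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Limits {
    pub max_tokens: u32,
    pub think_budget: u32,
}

/// Default generation limits for a given amount of available RAM (in GB).
pub fn limits_for_ram(available_gb: f64) -> Limits {
    let (max_tokens, think_budget) = if available_gb < 4.0 {
        (512, 256) // very low RAM
    } else if available_gb < 8.0 {
        (1024, 1024) // low RAM
    } else if available_gb <= 12.0 {
        (2048, 2048) // normal
    } else {
        (8192, 8192) // full capability
    };
    Limits { max_tokens, think_budget }
}
```

An explicit `--max-tokens` or `--think-budget` value would take the place of these defaults.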
Smart features:
- RAM detection: Reads available memory on Windows (wmic), Linux (/proc/meminfo), macOS (sysctl/vm_stat)
- Auto token limits: Defaults adjust based on available RAM — no manual tuning needed
- Length-aware prompting: System prompt includes length hints so the model respects token budget
- Sentence-boundary stop: At 85% of token budget, waits for a sentence ending (`.!?`) instead of cutting mid-sentence
- Loop detection: Detects repeating token patterns and forces EOS to prevent infinite loops
- Think budget enforcement: Forces `</think>` if thinking exceeds budget, ensuring the model always produces an answer
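The sentence-boundary stop from the list above can be expressed as a small predicate checked after each generated token. This is a sketch under two assumptions: the 85% threshold is computed with integer arithmetic, and generation stops unconditionally once the full budget is reached. `should_stop` is a hypothetical name, and loop detection and think-budget enforcement are not shown.

```rust
/// Decide whether to stop after `generated` tokens out of `budget`,
/// given the text produced so far.
pub fn should_stop(generated: usize, budget: usize, text_so_far: &str) -> bool {
    // Hard limit: never exceed the budget.
    if generated >= budget {
        return true;
    }
    // Soft limit: past 85% of the budget, stop at the next sentence ending.
    let soft_limit = budget * 85 / 100;
    let ends_sentence = text_so_far
        .trim_end()
        .chars()
        .last()
        .map_or(false, |c| matches!(c, '.' | '!' | '?'));
    generated >= soft_limit && ends_sentence
}
```

For a 100-token budget, output ending in "Done." stops at token 90, while output ending mid-sentence continues until the hard limit.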
| Engine | Model | Params | Size (Q4) | Purpose | GPU |
|---|---|---|---|---|---|
| QORA-LLM-3B | SmolLM3-3B | 3.07B | 1.68 GB | Text generation, reasoning, chat | Vulkan/Metal |
| QORA-LLM-4B | Qwen3.5-4B | 4B | ~2 GB | Multimodal (text + vision), DeltaNet | Vulkan/Metal |
| QORA-LLM-0.8B | Qwen3.5-0.8B | 0.8B | ~600 MB | Lightweight multimodal, mobile target | CPU only |
| QORA-Image | SDXS-512 | — | ~1.5 GB | Text-to-image generation (1-step) | Vulkan/Metal |
| QORA-TTS | Qwen3-TTS-0.6B | 0.6B | ~1.2 GB | Text-to-speech synthesis | CPU only |
| QORA-STT | Whisper-tiny | 39M | 144 MB | Speech-to-text transcription | CPU only |
All engines are pure Rust, single-binary executables with no Python dependencies. GPU-enabled engines auto-detect Vulkan (Windows/Linux) or Metal (macOS) with automatic CPU fallback.

    cd QOR3B
    # CPU-only build (all platforms)
    cargo build --release
    # GPU build: Windows/Linux (Vulkan)
    cargo build --release --features gpu
    # GPU build: macOS (Metal)
    cargo build --release --features gpu-metal

- `cortex`: Rust deep learning framework (GPU via wgpu/Vulkan or Metal backend)
- `rayon`: Thread pool for parallel GEMV, attention, and lm_head
- `half`: F16 support
- `serde` / `serde_json`: Config parsing
- `tokenizers`: HuggingFace tokenizer
Pre-built binaries are automatically built via GitHub Actions for:
- Windows x86_64 — CPU + GPU (Vulkan)
- Linux x86_64 — CPU + GPU (Vulkan)
- macOS aarch64 — CPU + GPU (Metal)
Create a git tag (e.g. v0.1.0) and push to trigger a release build.
The QORA inference engine is custom-built. The SmolLM3-3B model weights are released under the SmolLM3 License by HuggingFace.
Built with QORA — Pure Rust AI Inference