Zero-RSS trigram index for code search
Named for the triglyph: three vertical grooves in a Doric frieze—just like the three tiers in this index.
A naive trigram index using HashMap<u32, Vec<u32>> for posting lists turns a 2.86GB on-disk index into 2.1GB+ RSS:
- HashMap bucket allocation overhead
- Vec capacity slack (typically 2x actual usage)
- Per-allocation overhead (16+ bytes each)
Memory-map the index and only page in blocks actually accessed during queries.
Before: 2.1GB RSS for posting lists
After: ~0 bytes RSS (OS pages in only accessed blocks)
Three-tier design, like the three grooves of a triglyph:
┌─────────────────────────────────────────────────────────────┐
│ Tier 0: Presence Bitset (2.2KB) │
│ ───────────────────────────────────────────── │
│ • Direct-addressed bitset for 17,576 a-z trigrams │
│ • Zero false positives, O(1) lookup │
│ • Rejects absent trigrams before touching directory │
│ • Hot in L1 cache │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ Tier 1: Slot Directory (~281KB) │
│ ───────────────────────────────────────────── │
│ • 17,576 fixed slots, direct addressing │
│ • No binary search: slots[trigram_id] │
│ • Each slot: { offset, len_bytes, count } │
│ • Likely in L2/L3 cache │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ Tier 2: Posting Blocks (mmap'd, cold) │
│ ───────────────────────────────────────────── │
│ • Variable-sized blocks with 8-byte headers │
│ • Delta-varint encoded doc IDs │
│ • Block-skip: skip entire blocks when max < target │
│ • Page faults only for blocks actually needed │
└─────────────────────────────────────────────────────────────┘
use triglyph::{build_atomic_index, trigram_id};
use std::collections::HashMap;
use std::path::Path;
let mut postings: HashMap<u32, Vec<u32>> = HashMap::new();
// Add documents containing "foo"
postings.insert(trigram_id(b'f', b'o', b'o'), vec![1, 5, 10, 15, 20]);
// Add documents containing "bar"
postings.insert(trigram_id(b'b', b'a', b'r'), vec![5, 10, 25, 30]);
// Build atomically (tmp file + rename)
build_atomic_index(
Path::new("index.tri"),
&postings,
|trigram| Some(trigram as usize),
)?;use triglyph::{IndexReader, PresenceBitset, search_trigrams, extract_trigrams};
use memmap2::Mmap;
use std::fs::File;
// Open index
let file = File::open("index.tri")?;
let mmap = unsafe { Mmap::map(&file)? };
let reader = IndexReader::open(&mmap)?;
// Load presence bitset for fast rejection
let presence = PresenceBitset::load("index.tri.presence").ok();
// Extract trigrams from query
let query_trigrams = extract_trigrams("foobar");
// Returns: trigram IDs for "foo", "oob", "oba", "bar"
// Find documents containing ALL trigrams
let doc_ids = search_trigrams(&reader, presence.as_ref(), &query_trigrams)?;
// Returns: [5, 10] (intersection of foo ∩ bar)Store file metadata without keeping millions of paths in RAM:
use triglyph::{IndexedFile, build_file_list, FileListReader};
use std::path::Path;
let files = vec![
IndexedFile { path: "/src/main.rs".into(), mtime: 1234567890, size: 1024 },
IndexedFile { path: "/src/lib.rs".into(), mtime: 1234567891, size: 2048 },
];
build_file_list(
Path::new("index.files.str"),
Path::new("index.files.dir"),
&files,
)?;
// Query by doc ID
let str_mmap = unsafe { Mmap::map(&File::open("index.files.str")?)? };
let dir_mmap = unsafe { Mmap::map(&File::open("index.files.dir")?)? };
let file_list = FileListReader::open(&str_mmap, &dir_mmap)?;
let entry = file_list.get(0)?;
println!("{} ({} bytes)", entry.path, entry.size);Header (16 bytes)
├── magic: b"TRIG"
├── version: u32 = 1
├── slot_count: u32 = 17576
└── data_offset: u32 = 281232
Slot Directory (17576 × 16 bytes)
└── For each trigram:
├── offset: u64 (relative to data region)
├── len_bytes: u32
└── count: u32 (selectivity for query planning)
Data Region (variable)
└── For each non-empty trigram, contiguous blocks:
Block Header (8 bytes)
├── max_doc: u32 (for block-skip)
├── payload_len: u16
└── doc_count: u16 (1-128)
Block Payload (payload_len bytes)
└── delta-varint encoded doc IDs
Header (16 bytes)
├── magic: b"TRIPRES\0"
├── space: u32 = 17576
└── reserved: u32 = 0
Bitset (275 × 8 bytes)
└── bits[trigram / 64] & (1 << (trigram % 64))
.str: Concatenated UTF-8 paths (no separators)
.dir:
Header (16 bytes)
├── magic: b"FILEDIR\0"
├── version: u32 = 1
└── file_count: u32
Entries (file_count × 28 bytes)
└── For each file:
├── str_offset: u64
├── str_len: u32
├── mtime: u64
└── size: u64
- Tier 0: If any trigram absent from presence bitset → return empty
- Tier 1: Look up slots, collect data slices
- Sort by selectivity: Smallest posting lists first
- Intersect smallest two: Block-skip intersection
- Filter through rest: Retain candidates in all remaining lists
Block-skip intersection:
// Skip entire blocks where block.max < target
while self.block_max < target {
self.load_next_block();
}
// Then linear scan within block| Operation | Cost |
|---|---|
| Presence check | O(1), ~2.2KB in L1 |
| Slot lookup | O(1), direct addressing |
| Block skip | O(blocks skipped), no decompression |
| Varint decode | O(docs in block) |
| Intersection | O(min(list₁, list₂)) with skip |
Typical query touches:
- 2.2KB presence bitset (always hot)
- 16 bytes per trigram slot
- Only blocks containing matching doc IDs
Handles corruption gracefully:
- Varint overflow: Rejects values > u32::MAX
- Block validation: count ∈ [1, 128], len > 0
- Bounds checking: All slice accesses validated
- Iterator safety: Clears state on all failure paths (no livelock)
- Collision detection: Builder errors on duplicate slot mappings
- Atomic writes: tmp file + rename pattern
Run reader.validate_deep() in tests to probe all block headers.
cargo build --release
cargo test[dependencies]
triglyph = "0.1"Requires:
[dependencies]
memmap2 = "0.9"Lowercase ASCII a-z only, producing IDs 0..17575:
fn trigram_id(b0: u8, b1: u8, b2: u8) -> u32 {
let c0 = (b0 - b'a') as u32;
let c1 = (b1 - b'a') as u32;
let c2 = (b2 - b'a') as u32;
c0 * 676 + c1 * 26 + c2
}extract_trigrams() normalizes to lowercase and filters non-alphabetic characters.
Why not page-aligned blocks?
Variable-sized blocks with contiguous layout were chosen over 4KB alignment:
- Pro: Better compression, fewer wasted bytes
- Con: Headers and payloads intermix on pages
- Result: Skip avoids decompression but still touches header pages
Why fixed slot directory?
17,576 slots × 16 bytes = 281KB fits in cache. Direct addressing slots[trigram] beats binary search.
Why delta-varint?
Doc IDs are sorted, so deltas are small. Most encode in 1-2 bytes vs 4 bytes for raw u32.
Why separate presence file?
Keeps the hot 2.2KB isolated from cold posting data. Could embed in header, but separate sidecar is simpler.
A triglyph is an architectural element in the Doric order: a block with three vertical grooves (tri- + glyphein "to carve"). The three grooves map to the three tiers of this index:
- First groove: Presence bitset (tiny, always hot)
- Second groove: Slot directory (small, usually cached)
- Third groove: Posting blocks (large, paged on demand)
MIT
Icons from iconics (3,372+ semantic icons)