triglyph

Zero-RSS trigram index for code search

Named for the triglyph: three vertical grooves in a Doric frieze—just like the three tiers in this index.

The Problem

A naive trigram index using HashMap<u32, Vec<u32>> for posting lists turns a 2.86GB on-disk index into 2.1GB+ RSS:

HashMap bucket allocation overhead
Vec capacity slack (typically 2x actual usage)
Per-allocation overhead (16+ bytes each)

The Solution

Memory-map the index and only page in blocks actually accessed during queries.

Before: 2.1GB RSS for posting lists
After:  ~0 bytes RSS (OS pages in only accessed blocks)

Architecture

Three-tier design, like the three grooves of a triglyph:

┌─────────────────────────────────────────────────────────────┐
│  Tier 0: Presence Bitset (2.2KB)                            │
│  ─────────────────────────────────────────────              │
│  • Direct-addressed bitset for 17,576 a-z trigrams          │
│  • Zero false positives, O(1) lookup                        │
│  • Rejects absent trigrams before touching directory        │
│  • Hot in L1 cache                                          │
└─────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────┐
│  Tier 1: Slot Directory (~281KB)                            │
│  ─────────────────────────────────────────────              │
│  • 17,576 fixed slots, direct addressing                    │
│  • No binary search: slots[trigram_id]                      │
│  • Each slot: { offset, len_bytes, count }                  │
│  • Likely in L2/L3 cache                                    │
└─────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────┐
│  Tier 2: Posting Blocks (mmap'd, cold)                      │
│  ─────────────────────────────────────────────              │
│  • Variable-sized blocks with 8-byte headers                │
│  • Delta-varint encoded doc IDs                             │
│  • Block-skip: skip entire blocks when max < target         │
│  • Page faults only for blocks actually needed              │
└─────────────────────────────────────────────────────────────┘

Usage

Building an Index

use triglyph::{build_atomic_index, trigram_id};
use std::collections::HashMap;
use std::path::Path;

let mut postings: HashMap<u32, Vec<u32>> = HashMap::new();

// Add documents containing "foo"
postings.insert(trigram_id(b'f', b'o', b'o'), vec![1, 5, 10, 15, 20]);

// Add documents containing "bar"
postings.insert(trigram_id(b'b', b'a', b'r'), vec![5, 10, 25, 30]);

// Build atomically (tmp file + rename)
build_atomic_index(
    Path::new("index.tri"),
    &postings,
    |trigram| Some(trigram as usize),
)?;

Querying

use triglyph::{IndexReader, PresenceBitset, search_trigrams, extract_trigrams};
use memmap2::Mmap;
use std::fs::File;

// Open index
let file = File::open("index.tri")?;
let mmap = unsafe { Mmap::map(&file)? };
let reader = IndexReader::open(&mmap)?;

// Load presence bitset for fast rejection
let presence = PresenceBitset::load("index.tri.presence").ok();

// Extract trigrams from query
let query_trigrams = extract_trigrams("foobar");
// Returns: trigram IDs for "foo", "oob", "oba", "bar"

// Find documents containing ALL trigrams
let doc_ids = search_trigrams(&reader, presence.as_ref(), &query_trigrams)?;
// Returns: [5, 10] (intersection of foo ∩ bar)

File List

Store file metadata without keeping millions of paths in RAM:

use triglyph::{IndexedFile, build_file_list, FileListReader};
use std::path::Path;

let files = vec![
    IndexedFile { path: "/src/main.rs".into(), mtime: 1234567890, size: 1024 },
    IndexedFile { path: "/src/lib.rs".into(), mtime: 1234567891, size: 2048 },
];

build_file_list(
    Path::new("index.files.str"),
    Path::new("index.files.dir"),
    &files,
)?;

// Query by doc ID
let str_mmap = unsafe { Mmap::map(&File::open("index.files.str")?)? };
let dir_mmap = unsafe { Mmap::map(&File::open("index.files.dir")?)? };
let file_list = FileListReader::open(&str_mmap, &dir_mmap)?;

let entry = file_list.get(0)?;
println!("{} ({} bytes)", entry.path, entry.size);

File Formats

index.tri

Header (16 bytes)
├── magic: b"TRIG"
├── version: u32 = 1
├── slot_count: u32 = 17576
└── data_offset: u32 = 281232

Slot Directory (17576 × 16 bytes)
└── For each trigram:
    ├── offset: u64      (relative to data region)
    ├── len_bytes: u32
    └── count: u32       (selectivity for query planning)

Data Region (variable)
└── For each non-empty trigram, contiguous blocks:
    Block Header (8 bytes)
    ├── max_doc: u32     (for block-skip)
    ├── payload_len: u16
    └── doc_count: u16   (1-128)

    Block Payload (payload_len bytes)
    └── delta-varint encoded doc IDs

index.tri.presence

Header (16 bytes)
├── magic: b"TRIPRES\0"
├── space: u32 = 17576
└── reserved: u32 = 0

Bitset (275 × 8 bytes)
└── bits[trigram / 64] & (1 << (trigram % 64))

index.files.str / index.files.dir

.str: Concatenated UTF-8 paths (no separators)

.dir:
Header (16 bytes)
├── magic: b"FILEDIR\0"
├── version: u32 = 1
└── file_count: u32

Entries (file_count × 28 bytes)
└── For each file:
    ├── str_offset: u64
    ├── str_len: u32
    ├── mtime: u64
    └── size: u64

Query Algorithm

Tier 0: If any trigram absent from presence bitset → return empty
Tier 1: Look up slots, collect data slices
Sort by selectivity: Smallest posting lists first
Intersect smallest two: Block-skip intersection
Filter through rest: Retain candidates in all remaining lists

Block-skip intersection:

// Skip entire blocks where block.max < target
while self.block_max < target {
    self.load_next_block();
}
// Then linear scan within block

Performance

Operation	Cost
Presence check	O(1), ~2.2KB in L1
Slot lookup	O(1), direct addressing
Block skip	O(blocks skipped), no decompression
Varint decode	O(docs in block)
Intersection	O(min(list₁, list₂)) with skip

Typical query touches:

2.2KB presence bitset (always hot)
16 bytes per trigram slot
Only blocks containing matching doc IDs

Safety

Handles corruption gracefully:

Varint overflow: Rejects values > u32::MAX
Block validation: count ∈ [1, 128], len > 0
Bounds checking: All slice accesses validated
Iterator safety: Clears state on all failure paths (no livelock)
Collision detection: Builder errors on duplicate slot mappings
Atomic writes: tmp file + rename pattern

Run reader.validate_deep() in tests to probe all block headers.

Building

cargo build --release
cargo test

[dependencies]
triglyph = "0.1"

Requires:

[dependencies]
memmap2 = "0.9"

Trigram Encoding

Lowercase ASCII a-z only, producing IDs 0..17575:

fn trigram_id(b0: u8, b1: u8, b2: u8) -> u32 {
    let c0 = (b0 - b'a') as u32;
    let c1 = (b1 - b'a') as u32;
    let c2 = (b2 - b'a') as u32;
    c0 * 676 + c1 * 26 + c2
}

extract_trigrams() normalizes to lowercase and filters non-alphabetic characters.

Design Decisions

Why not page-aligned blocks?

Variable-sized blocks with contiguous layout were chosen over 4KB alignment:

Pro: Better compression, fewer wasted bytes
Con: Headers and payloads intermix on pages
Result: Skip avoids decompression but still touches header pages

Why fixed slot directory?

17,576 slots × 16 bytes = 281KB fits in cache. Direct addressing slots[trigram] beats binary search.

Why delta-varint?

Doc IDs are sorted, so deltas are small. Most encode in 1-2 bytes vs 4 bytes for raw u32.

Why separate presence file?

Keeps the hot 2.2KB isolated from cold posting data. Could embed in header, but separate sidecar is simpler.

Etymology

A triglyph is an architectural element in the Doric order: a block with three vertical grooves (tri- + glyphein "to carve"). The three grooves map to the three tiers of this index:

First groove: Presence bitset (tiny, always hot)
Second groove: Slot directory (small, usually cached)
Third groove: Posting blocks (large, paged on demand)

License

MIT

_{Icons from iconics (3,372+ semantic icons)}

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
icons		icons
src		src
.gitignore		.gitignore
Cargo.toml		Cargo.toml
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

triglyph

The Problem

The Solution

Architecture

Usage

Building an Index

Querying

File List

File Formats

index.tri

index.tri.presence

index.files.str / index.files.dir

Query Algorithm

Performance

Safety

Building

Trigram Encoding

Design Decisions

Etymology

License

About

Uh oh!

Releases

Packages

Uh oh!

Languages

johnzfitch/triglyph

Folders and files

Latest commit

History

Repository files navigation

triglyph

The Problem

The Solution

Architecture

Usage

Building an Index

Querying

File List

File Formats

index.tri

index.tri.presence

index.files.str / index.files.dir

Query Algorithm

Performance

Safety

Building

Trigram Encoding

Design Decisions

Etymology

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages