Skip to content

johnzfitch/triglyph

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

triglyph

search Zero-RSS trigram index for code search

Named for the triglyph: three vertical grooves in a Doric frieze—just like the three tiers in this index.


lightning The Problem

A naive trigram index using HashMap<u32, Vec<u32>> for posting lists turns a 2.86GB on-disk index into 2.1GB+ RSS:

  • HashMap bucket allocation overhead
  • Vec capacity slack (typically 2x actual usage)
  • Per-allocation overhead (16+ bytes each)

magic-wand The Solution

Memory-map the index and only page in blocks actually accessed during queries.

Before: 2.1GB RSS for posting lists
After:  ~0 bytes RSS (OS pages in only accessed blocks)

layers Architecture

Three-tier design, like the three grooves of a triglyph:

┌─────────────────────────────────────────────────────────────┐
│  Tier 0: Presence Bitset (2.2KB)                            │
│  ─────────────────────────────────────────────              │
│  • Direct-addressed bitset for 17,576 a-z trigrams          │
│  • Zero false positives, O(1) lookup                        │
│  • Rejects absent trigrams before touching directory        │
│  • Hot in L1 cache                                          │
└─────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────┐
│  Tier 1: Slot Directory (~281KB)                            │
│  ─────────────────────────────────────────────              │
│  • 17,576 fixed slots, direct addressing                    │
│  • No binary search: slots[trigram_id]                      │
│  • Each slot: { offset, len_bytes, count }                  │
│  • Likely in L2/L3 cache                                    │
└─────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────┐
│  Tier 2: Posting Blocks (mmap'd, cold)                      │
│  ─────────────────────────────────────────────              │
│  • Variable-sized blocks with 8-byte headers                │
│  • Delta-varint encoded doc IDs                             │
│  • Block-skip: skip entire blocks when max < target         │
│  • Page faults only for blocks actually needed              │
└─────────────────────────────────────────────────────────────┘

rocket Usage

Building an Index

use triglyph::{build_atomic_index, trigram_id};
use std::collections::HashMap;
use std::path::Path;

let mut postings: HashMap<u32, Vec<u32>> = HashMap::new();

// Add documents containing "foo"
postings.insert(trigram_id(b'f', b'o', b'o'), vec![1, 5, 10, 15, 20]);

// Add documents containing "bar"
postings.insert(trigram_id(b'b', b'a', b'r'), vec![5, 10, 25, 30]);

// Build atomically (tmp file + rename)
build_atomic_index(
    Path::new("index.tri"),
    &postings,
    |trigram| Some(trigram as usize),
)?;

Querying

use triglyph::{IndexReader, PresenceBitset, search_trigrams, extract_trigrams};
use memmap2::Mmap;
use std::fs::File;

// Open index
let file = File::open("index.tri")?;
let mmap = unsafe { Mmap::map(&file)? };
let reader = IndexReader::open(&mmap)?;

// Load presence bitset for fast rejection
let presence = PresenceBitset::load("index.tri.presence").ok();

// Extract trigrams from query
let query_trigrams = extract_trigrams("foobar");
// Returns: trigram IDs for "foo", "oob", "oba", "bar"

// Find documents containing ALL trigrams
let doc_ids = search_trigrams(&reader, presence.as_ref(), &query_trigrams)?;
// Returns: [5, 10] (intersection of foo ∩ bar)

File List

Store file metadata without keeping millions of paths in RAM:

use triglyph::{IndexedFile, build_file_list, FileListReader};
use std::path::Path;

let files = vec![
    IndexedFile { path: "/src/main.rs".into(), mtime: 1234567890, size: 1024 },
    IndexedFile { path: "/src/lib.rs".into(), mtime: 1234567891, size: 2048 },
];

build_file_list(
    Path::new("index.files.str"),
    Path::new("index.files.dir"),
    &files,
)?;

// Query by doc ID
let str_mmap = unsafe { Mmap::map(&File::open("index.files.str")?)? };
let dir_mmap = unsafe { Mmap::map(&File::open("index.files.dir")?)? };
let file_list = FileListReader::open(&str_mmap, &dir_mmap)?;

let entry = file_list.get(0)?;
println!("{} ({} bytes)", entry.path, entry.size);

folder File Formats

index.tri

Header (16 bytes)
├── magic: b"TRIG"
├── version: u32 = 1
├── slot_count: u32 = 17576
└── data_offset: u32 = 281232

Slot Directory (17576 × 16 bytes)
└── For each trigram:
    ├── offset: u64      (relative to data region)
    ├── len_bytes: u32
    └── count: u32       (selectivity for query planning)

Data Region (variable)
└── For each non-empty trigram, contiguous blocks:
    Block Header (8 bytes)
    ├── max_doc: u32     (for block-skip)
    ├── payload_len: u16
    └── doc_count: u16   (1-128)

    Block Payload (payload_len bytes)
    └── delta-varint encoded doc IDs

index.tri.presence

Header (16 bytes)
├── magic: b"TRIPRES\0"
├── space: u32 = 17576
└── reserved: u32 = 0

Bitset (275 × 8 bytes)
└── bits[trigram / 64] & (1 << (trigram % 64))

index.files.str / index.files.dir

.str: Concatenated UTF-8 paths (no separators)

.dir:
Header (16 bytes)
├── magic: b"FILEDIR\0"
├── version: u32 = 1
└── file_count: u32

Entries (file_count × 28 bytes)
└── For each file:
    ├── str_offset: u64
    ├── str_len: u32
    ├── mtime: u64
    └── size: u64

diagram Query Algorithm

  1. Tier 0: If any trigram absent from presence bitset → return empty
  2. Tier 1: Look up slots, collect data slices
  3. Sort by selectivity: Smallest posting lists first
  4. Intersect smallest two: Block-skip intersection
  5. Filter through rest: Retain candidates in all remaining lists

Block-skip intersection:

// Skip entire blocks where block.max < target
while self.block_max < target {
    self.load_next_block();
}
// Then linear scan within block

star Performance

Operation Cost
Presence check O(1), ~2.2KB in L1
Slot lookup O(1), direct addressing
Block skip O(blocks skipped), no decompression
Varint decode O(docs in block)
Intersection O(min(list₁, list₂)) with skip

Typical query touches:

  • 2.2KB presence bitset (always hot)
  • 16 bytes per trigram slot
  • Only blocks containing matching doc IDs

protection Safety

Handles corruption gracefully:

  • Varint overflow: Rejects values > u32::MAX
  • Block validation: count ∈ [1, 128], len > 0
  • Bounds checking: All slice accesses validated
  • Iterator safety: Clears state on all failure paths (no livelock)
  • Collision detection: Builder errors on duplicate slot mappings
  • Atomic writes: tmp file + rename pattern

Run reader.validate_deep() in tests to probe all block headers.


toolbox Building

cargo build --release
cargo test
[dependencies]
triglyph = "0.1"

Requires:

[dependencies]
memmap2 = "0.9"

key Trigram Encoding

Lowercase ASCII a-z only, producing IDs 0..17575:

fn trigram_id(b0: u8, b1: u8, b2: u8) -> u32 {
    let c0 = (b0 - b'a') as u32;
    let c1 = (b1 - b'a') as u32;
    let c2 = (b2 - b'a') as u32;
    c0 * 676 + c1 * 26 + c2
}

extract_trigrams() normalizes to lowercase and filters non-alphabetic characters.


settings Design Decisions

Why not page-aligned blocks?

Variable-sized blocks with contiguous layout were chosen over 4KB alignment:

  • Pro: Better compression, fewer wasted bytes
  • Con: Headers and payloads intermix on pages
  • Result: Skip avoids decompression but still touches header pages

Why fixed slot directory?

17,576 slots × 16 bytes = 281KB fits in cache. Direct addressing slots[trigram] beats binary search.

Why delta-varint?

Doc IDs are sorted, so deltas are small. Most encode in 1-2 bytes vs 4 bytes for raw u32.

Why separate presence file?

Keeps the hot 2.2KB isolated from cold posting data. Could embed in header, but separate sidecar is simpler.


tree Etymology

A triglyph is an architectural element in the Doric order: a block with three vertical grooves (tri- + glyphein "to carve"). The three grooves map to the three tiers of this index:

  1. First groove: Presence bitset (tiny, always hot)
  2. Second groove: Slot directory (small, usually cached)
  3. Third groove: Posting blocks (large, paged on demand)

lock License

MIT


Icons from iconics (3,372+ semantic icons)

About

Zero-RSS trigram index library for code search

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages