-
Hi, we were thinking about this when we merged the offset and timestamp indexes into a single structure, and we found that it does not scale well with a huge partition count. Our indexes aren't sparse, meaning we store them per message rather than per N messages (per batch), so the memory overhead would be pretty big. We are not concerned about cache misses, as our indexes are cached in memory per segment and, since they are iterated through fairly frequently, the odds of a cache miss are fairly small (we haven't measured this yet, as we are not at this stage of optimization). You also mentioned binary search, which we already use, and AFAIK binary search isn't the most cache-friendly search algorithm unless you use something like an Eytzinger layout. We don't want to use that, as it adds a lot of complexity and doesn't work well with dynamically sized containers: every time the collection has to grow, it has to be reallocated and copied over, and constructing an Eytzinger-layout array is very expensive.
-
Just to add: if we have an open segment, we store its indexes in memory, and each index entry points to a single message. There is no binary search when getting messages from an open segment by offset; it is an O(1) operation, because the position of an entry within the index is the message's relative offset in the segment. The only binary search that could happen is when fetching messages by timestamp, but that's another story and, IMHO, not worth optimizing at this point.
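To make that O(1) lookup concrete, here is a minimal sketch; the type and field names below are illustrative assumptions, not Iggy's actual API:

```rust
// Sketch of the O(1) open-segment lookup described above; SegmentIndex,
// IndexEntry, and their fields are assumptions for illustration.
struct IndexEntry {
    /// Byte position of the message within the segment file.
    position: u32,
    /// Message timestamp (offset and timestamp indexes were merged).
    timestamp: u64,
}

struct SegmentIndex {
    /// First offset stored in this (open) segment.
    start_offset: u64,
    /// One entry per message, appended in arrival order.
    entries: Vec<IndexEntry>,
}

impl SegmentIndex {
    /// Entries are stored per message in append order, so the vector index
    /// of an entry equals the message's offset relative to the segment
    /// start: a lookup by offset is a single bounds-checked array access.
    fn get(&self, offset: u64) -> Option<&IndexEntry> {
        let relative = offset.checked_sub(self.start_offset)? as usize;
        self.entries.get(relative)
    }
}
```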
-
Just to clarify: the optimization isn't about binary search at all; that keeps coming up (am I reading it wrong?), but it's not the target. The gain comes from improving the memory layout, which affects every path: O(1) open-segment lookups, sequential scans, timestamp lookups, compaction, replication, everything. Even with O(1) access, if an entry spans multiple cache lines or its hot fields aren't contiguous, the CPU still pays extra memory loads. So the proposal isn't about changing algorithms; it's simply about making each access cheaper at the hardware level. Whether the search is O(1), O(log N), or a scan doesn't matter; cache-line efficiency is orthogonal.
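As a rough illustration of why layout alone matters (the types below are made up for the example, not Iggy code), compare a pointer-chasing entry with a flat one:

```rust
use std::mem::{align_of, size_of};

// Illustration only, not Iggy code: a pointer-heavy entry adds a dereference
// (and a likely cache miss) per lookup, while a flat entry keeps its hot
// fields together so one load pulls them into the same cache line.
struct BoxedEntry {
    // offset and timestamp live elsewhere on the heap
    data: Box<(u64, u64)>,
}

#[repr(C)]
struct FlatEntry {
    offset: u64,
    timestamp: u64,
    position: u32,
}

fn main() {
    println!(
        "FlatEntry: {} bytes, align {} (fits well inside a 64-byte cache line)",
        size_of::<FlatEntry>(),
        align_of::<FlatEntry>()
    );
    println!(
        "BoxedEntry: {} bytes inline + a separate heap allocation per entry",
        size_of::<BoxedEntry>()
    );
}
```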
-
storage: add compact IndexEntryDense for cache-friendly in-memory segment index
Summary / Motivation
Segment index lookups (mapping offsets to file positions or timestamps) are a hot path in Iggy’s read performance. Currently, the in-memory representation of index entries may be scattered or pointer-heavy, causing extra cache misses and memory loads on every lookup.
Proposal: Introduce a dense, cache-line-aligned in-memory representation for segment index entries (IndexEntryDense). It is optional / opt-in: existing code paths continue to work, enabling gradual integration.
Design / Implementation Details
IndexEntryDense is a #[repr(C, align(64))] struct, sized and aligned so that each entry sits on a cache-line boundary and its hot fields stay contiguous in memory.
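A minimal sketch of what such an entry could look like; only the type name and the repr attribute come from this proposal, the field names and widths are assumptions:

```rust
// Sketch only: field names and widths are assumptions; the type name and
// the repr attribute are the parts taken from this proposal.
#[repr(C, align(64))]
#[derive(Clone, Copy)]
pub struct IndexEntryDense {
    /// Offset of the message relative to the segment's start offset.
    pub relative_offset: u32,
    /// Byte position of the message within the segment file.
    pub position: u32,
    /// Message timestamp, stored alongside the offset after the index merge.
    pub timestamp: u64,
}

// With the assumed fields the struct pads out to exactly one cache line,
// so a single lookup never straddles two lines.
const _: () = assert!(std::mem::size_of::<IndexEntryDense>() == 64);
const _: () = assert!(std::mem::align_of::<IndexEntryDense>() == 64);
```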
Benefits / Expected Impact
Testing & Benchmarking Plan
System-level tests: run the server under read-heavy load and measure cache misses and CPU cycles.
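At the micro level, a criterion benchmark along these lines could complement the system-level run; the crate choice, benchmark name, and entry layout are assumptions, not something this PR prescribes:

```rust
// Hypothetical microbenchmark (not part of the PR): measures raw lookup
// throughput over a dense, cache-line-aligned entry array. The entry layout
// mirrors the sketch above; all names here are assumptions.
use criterion::{criterion_group, criterion_main, Criterion};
use std::hint::black_box;

#[repr(C, align(64))]
#[derive(Clone, Copy)]
struct IndexEntryDense {
    relative_offset: u32,
    position: u32,
    timestamp: u64,
}

fn bench_dense_lookup(c: &mut Criterion) {
    // One million entries, roughly a large open segment's worth of messages.
    let entries: Vec<IndexEntryDense> = (0..1_000_000u32)
        .map(|i| IndexEntryDense {
            relative_offset: i,
            position: i.wrapping_mul(64),
            timestamp: i as u64,
        })
        .collect();

    c.bench_function("dense_index_o1_lookup", |b| {
        let mut i = 0usize;
        b.iter(|| {
            // Large prime stride so consecutive lookups land on distant
            // cache lines instead of being hidden by the prefetcher.
            i = (i + 7919) % entries.len();
            black_box(entries[i].position)
        })
    });
}

criterion_group!(benches, bench_dense_lookup);
criterion_main!(benches);
```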
Next Steps / Future Work
If this PR yields measurable improvement, the same dense, cache-line-friendly pattern can be applied to other hot paths as well. Such changes can be stacked for cumulative performance gains without altering on-disk formats.
Summary
This PR provides a small, safe, and measurable in-memory optimization that reduces cache-miss overhead on segment index lookups.
- Disk format remains unchanged.
- Opt-in design allows gradual adoption.
- Lays groundwork for further cache-friendly optimizations in the future.