sstable: lazy load the index block in single- and two- level iterator #4916

xxmplus · 2025-06-19T20:02:25Z

This commit implements the index block lazy loading in single-level iterator as default behavior. The index block loading is now deferred until first access. At the same time, it leads to semantic changes as the file reading ordering changes because of this.

Here is a quick overview of the change.

Before (eager loading)

Iterator Construction:
1.1 the constructor calls readTopLevelIndexBlock()
1.2 readTopLevelIndexBlock() calls readIndexBlock()
1.3 readIndexBlock(() calls blockReader.Read()
1.4 blockReader.Read() calls objstorage ReadHandle ReadAt()
1.5 ReadAt() loads the index block
1.6 construction completes
First Positioning (e.g., First(), SeekGE(), SeekPrefixGE()):
2.1 the positioning function uses the already-loaded index via PI(&i.index)
2.2 file opened for data read
2.3 ReadAt() loads the data block

After (lazy loading)

Iterator Construction:
1.1 the constructor sets indexLoaded = false to defer loading
1.2 construction completes (no I/O)
First Positioning (e.g., First(), SeekGE(), SeekPrefixGE()):
2.1 the positioning function calls the new indexIter() wrapper
2.2 indexIter() calls ensureIndexLoaded()
2.3 ensureIndexLoaded() calls ReadAt()
2.4 ReadAt() loads the index block
2.5 uses the newly-loaded index for positioning
2.6 ReadAt() loads the data block

As seen above, the I/O operation is now deferred. While the overall behavior is transparent to users, the internal semantic is different now. And the data-driven tests have to change accordingly. See the order of events change in file: checkpoint, cleaner, and event_listener.

What's interesting (optimized) is the change observed in flushable_ingest, which demonstrate the I/O saving thanks to the optimization.

-# When the key doesn't pass the bloom filter, we should see only two block
-# reads.
+# When the key doesn't pass the bloom filter, we should see only one block
+# read due to lazy loading optimization - bloom filter is checked first,
+# and if it rejects the key, we can avoid loading the index block entirely.
 get with-fs-logging
 small-00001-does-not-exist
 ----
-read-at(158, 41): 000004.sst
 read-at(199, 74): 000004.sst

Next is a comparison between eager loading and lazy loading. We should collect metrics on how often SeekPrefixGE() saved I/O; how often an iterator is created but not used; or how much the deferred operations improve memory usage if they are not available today (TODO).

Benchmark will be moved to a separate PR.
While one may argue the amortized cost is similar in the long run, lazy load could also help with perceived performance potentially. In other words, the average might be same in theory, but the common case is indeed optimized.

implements #3248

cockroach-teamcity · 2025-06-19T20:02:33Z

This change is

RaduBerinde

I'd expect the (constructor + first access) time to be similar, any ideas why it's so much larger after the change?

Reviewable status: 0 of 10 files reviewed, 5 unresolved discussions (waiting on @jbowens)

sstable/simple_benchmark_test.go line 93 at r1 (raw file):

// BenchmarkConstructionPlusFirstAccess measures construction + first index access
// How to: go test -bench=BenchmarkConstructionPlusFirstAccess -run=^$ ./sstable
func BenchmarkConstructionPlusFirstAccess(b *testing.B) {

[nit] all these benchmarks should start with BenchmarkIterator...

sstable/comparison_benchmark_test.go line 44 at r1 (raw file):

	})

	b.Run("SimulatedEagerLoading_ConstructOnly", func(b *testing.B) {

I don't understand all these benchmark variants (and the three files). We should have just:

construction
construction+first
construction+seekge
construction+seek-prefix-ge with no hit in filter
maybe construction+seek-prefix-ge with existing key

They should be in a different commit so we can easily obtain and show before and after values.

sstable/reader_iter_single_lvl.go line 173 at r1 (raw file):

	clearForResetBoundary struct{}

	// Lazy loading flag - must be below clearForResetBoundary to survive pool reuse

Why does this need to survive pool reuse?

In any case, we should put it next to another bool so it doesn't unnecessarily increase the struct size by a word

sstable/reader_iter_single_lvl.go line 451 at r1 (raw file):

	}
	// Load the next block.
	bhp, err := i.indexIter().BlockHandleWithProperties()

We don't need to keep calling the function each time..

sstable/reader_iter_single_lvl.go line 1585 at r1 (raw file):

func (i *singleLevelIterator[I, PI, D, PD]) indexIter() PI {
	if err := i.ensureIndexLoaded(); err != nil {

[nit] since it's a hot path, I'd move the if i.indexLoaded here, in case ensureIndexLoaded() doesn't get inlined.

xxmplus

Let me address that in the separate PR after split.

Reviewable status: 0 of 10 files reviewed, 5 unresolved discussions (waiting on @jbowens and @RaduBerinde)

sstable/reader_iter_single_lvl.go line 173 at r1 (raw file):

Previously, RaduBerinde wrote…

Why does this need to survive pool reuse?

In any case, we should put it next to another bool so it doesn't unnecessarily increase the struct size by a word

You are right. This is unnecessary. Moved it above clearForResetBoundary.

sstable/reader_iter_single_lvl.go line 451 at r1 (raw file):

Previously, RaduBerinde wrote…

We don't need to keep calling the function each time..

This is existing code replaced with the indexIter() wrapper. Test would fail if use PI(&i.index).

sstable/reader_iter_single_lvl.go line 1585 at r1 (raw file):

Previously, RaduBerinde wrote…

[nit] since it's a hot path, I'd move the if i.indexLoaded here, in case ensureIndexLoaded() doesn't get inlined.

updated. I'm always under the impression that the compiler is smart enough these days, anyway :p

sstable/comparison_benchmark_test.go line 44 at r1 (raw file):

Previously, RaduBerinde wrote…

I don't understand all these benchmark variants (and the three files). We should have just:

construction

construction+first

construction+seekge

construction+seek-prefix-ge with no hit in filter

maybe construction+seek-prefix-ge with existing key

They should be in a different commit so we can easily obtain and show before and after values.

hmmm, ok. I will split the PR.

sstable/simple_benchmark_test.go line 93 at r1 (raw file):

Previously, RaduBerinde wrote…

[nit] all these benchmarks should start with BenchmarkIterator...

ack

RaduBerinde

Reviewable status: 0 of 10 files reviewed, 6 unresolved discussions (waiting on @jbowens and @xxmplus)

sstable/reader_iter_single_lvl.go line 451 at r1 (raw file):

Previously, xxmplus (Andy Xu) wrote…

This is existing code replaced with the indexIter() wrapper. Test would fail if use PI(&i.index).

We already called indexIter() right above and returned if it wasn't valid, PI(&i.index) should be fine here.

sstable/reader_iter_single_lvl.go line 1582 at r2 (raw file):

	if !i.indexLoaded {
		if err := i.ensureIndexLoaded(); err != nil {
			i.err = err

Is it always safe to use the PI(&i.index) returned value even if the load failed? E.g. I believe the code i.indexIter().BlockHandleWithProperties() will crash if we hit an error here.

Also, this introduces a subtle place where i.err gets set; existing code might be modifying i.err with the assumption that it was nil before.

I think it would be cleaner to just call ensureIndexLoaded() and return any error at the beginning of every function that needs it, and then proceed to use PI(&i.index) as usual.

Related to cockroachdb#3248 and cockroachdb#4916

xxmplus

Reviewable status: 0 of 10 files reviewed, 6 unresolved discussions (waiting on @jbowens and @RaduBerinde)

sstable/reader_iter_single_lvl.go line 1582 at r2 (raw file):

Is it always safe to use the PI(&i.index) returned value even if the load failed?

It is not. This is indeed a good catch! We should always check for nil to be safe. Let me fix that in the next revision. Btw, when nil is detected, say, before calling i.indexIter().BlockHandleWithProperties(), what is the correct behavior? Is returning nil fine? I assume panicking is not ok, right?

existing code might be modifying i.err with the assumption that it was nil before.

I don't see existing code check i.err before setting it, what's the correct handling behavior?

I think it would be cleaner to just call ensureIndexLoaded() and return any error at the beginning of every function that needs it, and then proceed to use PI(&i.index) as usual.

I'm afraid it'd defeat the purpose of lazy loading. We do want to delay index loading until it's actually required. Also, all reference to PI(&i.index) is replaced by indexIter(). If we mixes the usage of them, it will be really confusing in the future.

RaduBerinde · 2025-06-25T02:00:44Z

sstable/reader_iter_single_lvl.go line 1582 at r2 (raw file):

Previously, xxmplus (Andy Xu) wrote…

Is it always safe to use the PI(&i.index) returned value even if the load failed?

It is not. This is indeed a good catch! We should always check for nil to be safe. Let me fix that in the next revision. Btw, when nil is detected, say, before calling i.indexIter().BlockHandleWithProperties(), what is the correct behavior? Is returning nil fine? I assume panicking is not ok, right?

existing code might be modifying i.err with the assumption that it was nil before.

I don't see existing code check i.err before setting it, what's the correct handling behavior?

I think it would be cleaner to just call ensureIndexLoaded() and return any error at the beginning of every function that needs it, and then proceed to use PI(&i.index) as usual.

I'm afraid it'd defeat the purpose of lazy loading. We do want to delay index loading until it's actually required. Also, all reference to PI(&i.index) is replaced by indexIter(). If we mixes the usage of them, it will be really confusing in the future.

Of course we would call it later (after the bloom filter check) in the one method where it matters (SeekPrefixGE). Methods need to return a nil KV on errors. We can change indexIter() to assert (when invariants.Enabled) that the filter is loaded, and use a separate error-returning ensureIndexLoaded() to do the load.

Related to #3248 and #4916

This commit implements the index block lazy loading in single-level iterator as default behavior. The index block loading is now deferred until first access. At the same time, it leads to semantic changes as the file reading ordering changes because of this. Here is a quick overview of the change. Before (eager loading) 1. Iterator Construction: 1.1 the constructor calls readTopLevelIndexBlock() 1.2 readTopLevelIndexBlock() calls readIndexBlock() 1.3 readIndexBlock(() calls blockReader.Read() 1.4 blockReader.Read() calls objstorage ReadHandle ReadAt() 1.5 ReadAt() loads the index block 1.6 construction completes 2. First Positioning (e.g., First(), SeekGE(), SeekPrefixGE()): 2.1 the positioning function uses the already-loaded index via PI(&i.index) 2.2 file opened for data read 2.3 ReadAt() loads the data block After (lazy loading) 1. Iterator Construction: 1.1 the constructor sets indexLoaded = false to defer loading 1.2 construction completes (no I/O) 2. First Positioning (e.g., First(), SeekGE(), SeekPrefixGE()): 2.1 the positioning function calls the new indexIter() wrapper 2.2 indexIter() calls ensureIndexLoaded() 2.3 ensureIndexLoaded() calls ReadAt() 2.4 ReadAt() loads the index block 2.5 uses the newly-loaded index for positioning 2.6 ReadAt() loads the data block As seen above, the I/O operation is now deferred. While the overall behavior is transparent to users, the internal semantic is different now. And the data-driven tests have to change accordingly. See the order of events change in file: checkpoint, cleaner, and event_listener. What's interesting (optimized) is the change observed in flushable_ingest, which demonstrate the I/O saving thanks to the optimization. ``` -# When the key doesn't pass the bloom filter, we should see only two block -# reads. +# When the key doesn't pass the bloom filter, we should see only one block +# read due to lazy loading optimization - bloom filter is checked first, +# and if it rejects the key, we can avoid loading the index block entirely. get with-fs-logging small-00001-does-not-exist ---- -read-at(158, 41): 000004.sst read-at(199, 74): 000004.sst ``` implements cockroachdb#3248

xxmplus

Reviewable status: 0 of 10 files reviewed, 5 unresolved discussions (waiting on @jbowens and @RaduBerinde)

sstable/reader_iter_single_lvl.go line 1582 at r2 (raw file):

Previously, RaduBerinde wrote…

Of course we would call it later (after the bloom filter check) in the one method where it matters (SeekPrefixGE). Methods need to return a nil KV on errors. We can change indexIter() to assert (when invariants.Enabled) that the filter is loaded, and use a separate error-returning ensureIndexLoaded() to do the load.

Alright, all indexIter() usages are removed. We now continue to use PI(&i.index) as before and call ensureIndexLoaded() when needed.

Related to cockroachdb#3248

This commit implements the index block lazy loading in two-level iterator as default behavior. The index block loading is now deferred until first access. Implements cockroachdb#3248

RaduBerinde

Reviewable status: 0 of 11 files reviewed, 5 unresolved discussions (waiting on @jbowens)

xxmplus requested a review from a team as a code owner June 19, 2025 20:02

xxmplus requested a review from jbowens June 19, 2025 20:02

xxmplus force-pushed the i3248-lazy-load-index-block-evaluation branch from b0f66ca to f203b44 Compare June 19, 2025 20:19

xxmplus mentioned this pull request Jun 19, 2025

sstable: index block always loaded; even if unused #3248

Open

RaduBerinde reviewed Jun 21, 2025

View reviewed changes

xxmplus commented Jun 23, 2025

View reviewed changes

xxmplus force-pushed the i3248-lazy-load-index-block-evaluation branch from f203b44 to 86145b6 Compare June 23, 2025 20:59

RaduBerinde reviewed Jun 23, 2025

View reviewed changes

xxmplus added a commit to xxmplus/pebble that referenced this pull request Jun 24, 2025

test: introduce single level iterator lazy load benchmark test

e709b38

Related to cockroachdb#3248 and cockroachdb#4916

xxmplus mentioned this pull request Jun 24, 2025

test: introduce single level iterator lazy load benchmark test #4953

Merged

xxmplus commented Jun 25, 2025

View reviewed changes

xxmplus added a commit that referenced this pull request Jun 25, 2025

test: introduce single level iterator lazy load benchmark test

9b2540f

Related to #3248 and #4916

xxmplus force-pushed the i3248-lazy-load-index-block-evaluation branch from 86145b6 to 6425752 Compare June 25, 2025 20:07

xxmplus commented Jun 25, 2025

View reviewed changes

xxmplus added 2 commits June 26, 2025 16:15

test: introduce two level iterator benchmark test

28081ae

Related to cockroachdb#3248

sstable: lazy load the index block in two level iterator

8841f49

This commit implements the index block lazy loading in two-level iterator as default behavior. The index block loading is now deferred until first access. Implements cockroachdb#3248

xxmplus changed the title ~~sstable: lazy load the index block in single level iterator~~ sstable: lazy load the index block in single- and two- level iterator Jun 27, 2025

RaduBerinde approved these changes Jun 27, 2025

View reviewed changes

xxmplus merged commit fe43e7b into cockroachdb:master Jun 27, 2025
17 of 20 checks passed

xxmplus deleted the i3248-lazy-load-index-block-evaluation branch June 27, 2025 20:30

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

sstable: lazy load the index block in single- and two- level iterator #4916

sstable: lazy load the index block in single- and two- level iterator #4916

xxmplus commented Jun 19, 2025 •

edited

Loading

Uh oh!

cockroach-teamcity commented Jun 19, 2025

Uh oh!

RaduBerinde left a comment

Uh oh!

xxmplus left a comment

Uh oh!

RaduBerinde left a comment

Uh oh!

xxmplus left a comment

Uh oh!

RaduBerinde commented Jun 25, 2025

Uh oh!

xxmplus left a comment

Uh oh!

RaduBerinde left a comment

Uh oh!

Uh oh!

Uh oh!

sstable: lazy load the index block in single- and two- level iterator #4916

sstable: lazy load the index block in single- and two- level iterator #4916

Conversation

xxmplus commented Jun 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cockroach-teamcity commented Jun 19, 2025

Uh oh!

RaduBerinde left a comment

Choose a reason for hiding this comment

Uh oh!

xxmplus left a comment

Choose a reason for hiding this comment

Uh oh!

RaduBerinde left a comment

Choose a reason for hiding this comment

Uh oh!

xxmplus left a comment

Choose a reason for hiding this comment

Uh oh!

RaduBerinde commented Jun 25, 2025

Uh oh!

xxmplus left a comment

Choose a reason for hiding this comment

Uh oh!

RaduBerinde left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

xxmplus commented Jun 19, 2025 •

edited

Loading