Skip to content

sstable: lazy load the index block in single- and two- level iterator #4916

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

xxmplus
Copy link
Contributor

@xxmplus xxmplus commented Jun 19, 2025

This commit implements the index block lazy loading in single-level iterator as default behavior. The index block loading is now deferred until first access. At the same time, it leads to semantic changes as the file reading ordering changes because of this.

Here is a quick overview of the change.

Before (eager loading)

  1. Iterator Construction:
    1.1 the constructor calls readTopLevelIndexBlock()
    1.2 readTopLevelIndexBlock() calls readIndexBlock()
    1.3 readIndexBlock(() calls blockReader.Read()
    1.4 blockReader.Read() calls objstorage ReadHandle ReadAt()
    1.5 ReadAt() loads the index block
    1.6 construction completes

  2. First Positioning (e.g., First(), SeekGE(), SeekPrefixGE()):
    2.1 the positioning function uses the already-loaded index via PI(&i.index)
    2.2 file opened for data read
    2.3 ReadAt() loads the data block

After (lazy loading)

  1. Iterator Construction:
    1.1 the constructor sets indexLoaded = false to defer loading
    1.2 construction completes (no I/O)

  2. First Positioning (e.g., First(), SeekGE(), SeekPrefixGE()):
    2.1 the positioning function calls the new indexIter() wrapper
    2.2 indexIter() calls ensureIndexLoaded()
    2.3 ensureIndexLoaded() calls ReadAt()
    2.4 ReadAt() loads the index block
    2.5 uses the newly-loaded index for positioning
    2.6 ReadAt() loads the data block

As seen above, the I/O operation is now deferred. While the overall behavior is transparent to users, the internal semantic is different now. And the data-driven tests have to change accordingly. See the order of events change in file: checkpoint, cleaner, and event_listener.

What's interesting (optimized) is the change observed in flushable_ingest, which demonstrate the I/O saving thanks to the optimization.

-# When the key doesn't pass the bloom filter, we should see only two block
-# reads.
+# When the key doesn't pass the bloom filter, we should see only one block
+# read due to lazy loading optimization - bloom filter is checked first,
+# and if it rejects the key, we can avoid loading the index block entirely.
 get with-fs-logging
 small-00001-does-not-exist
 ----
-read-at(158, 41): 000004.sst
 read-at(199, 74): 000004.sst

Next is a comparison between eager loading and lazy loading. We should collect metrics on how often SeekPrefixGE() saved I/O; how often an iterator is created but not used; or how much the deferred operations improve memory usage if they are not available today (TODO).

Benchmark will be moved to a separate PR.
While one may argue the amortized cost is similar in the long run, lazy load could also help with perceived performance potentially. In other words, the average might be same in theory, but the common case is indeed optimized.

implements #3248

@xxmplus xxmplus requested a review from a team as a code owner June 19, 2025 20:02
@xxmplus xxmplus requested a review from jbowens June 19, 2025 20:02
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@xxmplus xxmplus force-pushed the i3248-lazy-load-index-block-evaluation branch from b0f66ca to f203b44 Compare June 19, 2025 20:19
Copy link
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd expect the (constructor + first access) time to be similar, any ideas why it's so much larger after the change?

Reviewable status: 0 of 10 files reviewed, 5 unresolved discussions (waiting on @jbowens)


sstable/simple_benchmark_test.go line 93 at r1 (raw file):

// BenchmarkConstructionPlusFirstAccess measures construction + first index access
// How to: go test -bench=BenchmarkConstructionPlusFirstAccess -run=^$ ./sstable
func BenchmarkConstructionPlusFirstAccess(b *testing.B) {

[nit] all these benchmarks should start with BenchmarkIterator...


sstable/comparison_benchmark_test.go line 44 at r1 (raw file):

	})

	b.Run("SimulatedEagerLoading_ConstructOnly", func(b *testing.B) {

I don't understand all these benchmark variants (and the three files). We should have just:

  • construction
  • construction+first
  • construction+seekge
  • construction+seek-prefix-ge with no hit in filter
  • maybe construction+seek-prefix-ge with existing key

They should be in a different commit so we can easily obtain and show before and after values.


sstable/reader_iter_single_lvl.go line 173 at r1 (raw file):

	clearForResetBoundary struct{}

	// Lazy loading flag - must be below clearForResetBoundary to survive pool reuse

Why does this need to survive pool reuse?

In any case, we should put it next to another bool so it doesn't unnecessarily increase the struct size by a word


sstable/reader_iter_single_lvl.go line 451 at r1 (raw file):

	}
	// Load the next block.
	bhp, err := i.indexIter().BlockHandleWithProperties()

We don't need to keep calling the function each time..


sstable/reader_iter_single_lvl.go line 1585 at r1 (raw file):

func (i *singleLevelIterator[I, PI, D, PD]) indexIter() PI {
	if err := i.ensureIndexLoaded(); err != nil {

[nit] since it's a hot path, I'd move the if i.indexLoaded here, in case ensureIndexLoaded() doesn't get inlined.

Copy link
Contributor Author

@xxmplus xxmplus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me address that in the separate PR after split.

Reviewable status: 0 of 10 files reviewed, 5 unresolved discussions (waiting on @jbowens and @RaduBerinde)


sstable/reader_iter_single_lvl.go line 173 at r1 (raw file):

Previously, RaduBerinde wrote…

Why does this need to survive pool reuse?

In any case, we should put it next to another bool so it doesn't unnecessarily increase the struct size by a word

You are right. This is unnecessary. Moved it above clearForResetBoundary.


sstable/reader_iter_single_lvl.go line 451 at r1 (raw file):

Previously, RaduBerinde wrote…

We don't need to keep calling the function each time..

This is existing code replaced with the indexIter() wrapper. Test would fail if use PI(&i.index).


sstable/reader_iter_single_lvl.go line 1585 at r1 (raw file):

Previously, RaduBerinde wrote…

[nit] since it's a hot path, I'd move the if i.indexLoaded here, in case ensureIndexLoaded() doesn't get inlined.

updated. I'm always under the impression that the compiler is smart enough these days, anyway :p


sstable/comparison_benchmark_test.go line 44 at r1 (raw file):

Previously, RaduBerinde wrote…

I don't understand all these benchmark variants (and the three files). We should have just:

  • construction
  • construction+first
  • construction+seekge
  • construction+seek-prefix-ge with no hit in filter
  • maybe construction+seek-prefix-ge with existing key

They should be in a different commit so we can easily obtain and show before and after values.

hmmm, ok. I will split the PR.


sstable/simple_benchmark_test.go line 93 at r1 (raw file):

Previously, RaduBerinde wrote…

[nit] all these benchmarks should start with BenchmarkIterator...

ack

@xxmplus xxmplus force-pushed the i3248-lazy-load-index-block-evaluation branch from f203b44 to 86145b6 Compare June 23, 2025 20:59
Copy link
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 10 files reviewed, 6 unresolved discussions (waiting on @jbowens and @xxmplus)


sstable/reader_iter_single_lvl.go line 451 at r1 (raw file):

Previously, xxmplus (Andy Xu) wrote…

This is existing code replaced with the indexIter() wrapper. Test would fail if use PI(&i.index).

We already called indexIter() right above and returned if it wasn't valid, PI(&i.index) should be fine here.


sstable/reader_iter_single_lvl.go line 1582 at r2 (raw file):

	if !i.indexLoaded {
		if err := i.ensureIndexLoaded(); err != nil {
			i.err = err

Is it always safe to use the PI(&i.index) returned value even if the load failed? E.g. I believe the code i.indexIter().BlockHandleWithProperties() will crash if we hit an error here.

Also, this introduces a subtle place where i.err gets set; existing code might be modifying i.err with the assumption that it was nil before.

I think it would be cleaner to just call ensureIndexLoaded() and return any error at the beginning of every function that needs it, and then proceed to use PI(&i.index) as usual.

Copy link
Contributor Author

@xxmplus xxmplus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 10 files reviewed, 6 unresolved discussions (waiting on @jbowens and @RaduBerinde)


sstable/reader_iter_single_lvl.go line 1582 at r2 (raw file):

Is it always safe to use the PI(&i.index) returned value even if the load failed?

It is not. This is indeed a good catch! We should always check for nil to be safe. Let me fix that in the next revision. Btw, when nil is detected, say, before calling i.indexIter().BlockHandleWithProperties(), what is the correct behavior? Is returning nil fine? I assume panicking is not ok, right?

existing code might be modifying i.err with the assumption that it was nil before.

I don't see existing code check i.err before setting it, what's the correct handling behavior?

I think it would be cleaner to just call ensureIndexLoaded() and return any error at the beginning of every function that needs it, and then proceed to use PI(&i.index) as usual.

I'm afraid it'd defeat the purpose of lazy loading. We do want to delay index loading until it's actually required. Also, all reference to PI(&i.index) is replaced by indexIter(). If we mixes the usage of them, it will be really confusing in the future.

@RaduBerinde
Copy link
Member

sstable/reader_iter_single_lvl.go line 1582 at r2 (raw file):

Previously, xxmplus (Andy Xu) wrote…

Is it always safe to use the PI(&i.index) returned value even if the load failed?

It is not. This is indeed a good catch! We should always check for nil to be safe. Let me fix that in the next revision. Btw, when nil is detected, say, before calling i.indexIter().BlockHandleWithProperties(), what is the correct behavior? Is returning nil fine? I assume panicking is not ok, right?

existing code might be modifying i.err with the assumption that it was nil before.

I don't see existing code check i.err before setting it, what's the correct handling behavior?

I think it would be cleaner to just call ensureIndexLoaded() and return any error at the beginning of every function that needs it, and then proceed to use PI(&i.index) as usual.

I'm afraid it'd defeat the purpose of lazy loading. We do want to delay index loading until it's actually required. Also, all reference to PI(&i.index) is replaced by indexIter(). If we mixes the usage of them, it will be really confusing in the future.

Of course we would call it later (after the bloom filter check) in the one method where it matters (SeekPrefixGE). Methods need to return a nil KV on errors. We can change indexIter() to assert (when invariants.Enabled) that the filter is loaded, and use a separate error-returning ensureIndexLoaded() to do the load.

xxmplus added a commit that referenced this pull request Jun 25, 2025
This commit implements the index block lazy loading in single-level iterator as default behavior.
The index block loading is now deferred until first access. At the same time, it leads to semantic
changes as the file reading ordering changes because of this.

Here is a quick overview of the change.

Before (eager loading)

1. Iterator Construction:
   1.1 the constructor calls readTopLevelIndexBlock()
   1.2 readTopLevelIndexBlock() calls readIndexBlock()
   1.3 readIndexBlock(() calls blockReader.Read()
   1.4 blockReader.Read() calls objstorage ReadHandle ReadAt()
   1.5 ReadAt() loads the index block
   1.6 construction completes

2. First Positioning (e.g., First(), SeekGE(), SeekPrefixGE()):
   2.1 the positioning function uses the already-loaded index via PI(&i.index)
   2.2 file opened for data read
   2.3 ReadAt() loads the data block

After (lazy loading)

1. Iterator Construction:
   1.1 the constructor sets indexLoaded = false to defer loading
   1.2 construction completes (no I/O)

2. First Positioning (e.g., First(), SeekGE(), SeekPrefixGE()):
   2.1 the positioning function calls the new indexIter() wrapper
   2.2 indexIter() calls ensureIndexLoaded()
   2.3 ensureIndexLoaded() calls ReadAt()
   2.4 ReadAt() loads the index block
   2.5 uses the newly-loaded index for positioning
   2.6 ReadAt() loads the data block

As seen above, the I/O operation is now deferred. While the overall behavior is transparent to
users, the internal semantic is different now. And the data-driven tests have to change
accordingly. See the order of events change in file: checkpoint, cleaner, and event_listener.

What's interesting (optimized) is the change observed in flushable_ingest, which demonstrate
the I/O saving thanks to the optimization.

```
-# When the key doesn't pass the bloom filter, we should see only two block
-# reads.
+# When the key doesn't pass the bloom filter, we should see only one block
+# read due to lazy loading optimization - bloom filter is checked first,
+# and if it rejects the key, we can avoid loading the index block entirely.
 get with-fs-logging
 small-00001-does-not-exist
 ----
-read-at(158, 41): 000004.sst
 read-at(199, 74): 000004.sst
```

implements cockroachdb#3248
@xxmplus xxmplus force-pushed the i3248-lazy-load-index-block-evaluation branch from 86145b6 to 6425752 Compare June 25, 2025 20:07
Copy link
Contributor Author

@xxmplus xxmplus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 10 files reviewed, 5 unresolved discussions (waiting on @jbowens and @RaduBerinde)


sstable/reader_iter_single_lvl.go line 1582 at r2 (raw file):

Previously, RaduBerinde wrote…

Of course we would call it later (after the bloom filter check) in the one method where it matters (SeekPrefixGE). Methods need to return a nil KV on errors. We can change indexIter() to assert (when invariants.Enabled) that the filter is loaded, and use a separate error-returning ensureIndexLoaded() to do the load.

Alright, all indexIter() usages are removed. We now continue to use PI(&i.index) as before and call ensureIndexLoaded() when needed.

xxmplus added 2 commits June 26, 2025 16:15
This commit implements the index block lazy loading in two-level iterator as default behavior.
The index block loading is now deferred until first access.

Implements cockroachdb#3248
@xxmplus xxmplus changed the title sstable: lazy load the index block in single level iterator sstable: lazy load the index block in single- and two- level iterator Jun 27, 2025
Copy link
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: 0 of 11 files reviewed, 5 unresolved discussions (waiting on @jbowens)

@xxmplus xxmplus merged commit fe43e7b into cockroachdb:master Jun 27, 2025
17 of 20 checks passed
@xxmplus xxmplus deleted the i3248-lazy-load-index-block-evaluation branch June 27, 2025 20:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants