[PERF]: Modify get_range to return an iterator #5256

sanketkedia · 2025-08-13T00:29:50Z

Description of changes

Summarize the changes made by this PR.

Improvements & Bug fixes
- BlockfileReader's get_range() method used to return a Vec. The overhead of vector allocation and extend is non-trivial especially if the range is large. Regex brute force for e.g. has this overhead.
- This PR modifies the interface to return an iterator instead of a vector so that the overhead of materialization can be cut down
- get_all_data() is also updated to return an iterator and all callsites updated
New functionality
- ...

Test plan

How are these changes tested?

Tests pass locally with pytest for python, yarn test for js, cargo test for rust

Migration plan

None

Observability plan

Perf testing on staging

Documentation Changes

None

sanketkedia · 2025-08-13T00:30:07Z

[PERF]: Modify get_range to return an iterator #5256 👈 (View in Graphite)
main

This stack of pull requests is managed by Graphite. Learn more about stacking.

github-actions · 2025-08-13T00:30:09Z

propel-code-bot · 2025-08-13T00:37:25Z

Optimize get_range to Return Iterator Instead of Vec for BlockfileReader

This PR changes the signature and behavior of BlockfileReader's get_range and related methods across the Rust codebase, replacing Vector-based results (Vec) with iterator-based results (impl Iterator). This change is aimed at reducing the overhead of materializing large intermediate Vecs during range queries, such as those used in brute force regex or full dataset scans, thereby improving performance and memory efficiency. All relevant call-sites, including get_all_data in segment readers, the full-text indexing logic, and associated test utilities, have been updated to accommodate the new iterator-based APIs.

Key Changes

• BlockfileReader's get_range() and related APIs now return iterators (impl Iterator) instead of Vecs, reducing allocation overhead.
• All call-sites (including get_all_data, regex/literal expression matching, index and segment test code) updated to consume iterators and materialize via collect() where necessary.
• Type signatures and trait bounds expanded to propagate appropriate lifetimes and me/``referred_data regions required for returned iterators, affecting API signatures throughout blockstore, segment, and index crates.
• Associated test and utility code updated to collect iterators for assertions or computation, maintaining test correctness.

Affected Areas

• rust/blockstore/src/arrow/blockfile.rs (BlockfileReader implementation)
• rust/blockstore/src/types/reader.rs (BlockfileReader trait and logic)
• rust/segment/src/blockfile_record.rs (get_all_data and dependent logic)
• rust/segment/src/blockfile_metadata.rs (affected get_all_data callsites in tests and writers)
• rust/segment/src/types.rs (test helpers and usages of get_all_data)
• rust/index/src/fulltext/types.rs (FullTextIndexReader/Literal matching)
• rust/types/src/regex/literal_expr.rs (LiteralExpr trait updates)
• rust/blockstore/tests/blockfile_writer_test.rs (test iterations and comparisons)
• rust/segment/src/distributed_spann.rs, rust/index/src/spann/types.rs (segment test support)

This summary was automatically generated by @propel-code-bot

propel-code-bot · 2025-08-13T00:39:19Z

rust/segment/src/blockfile_metadata.rs

+            .expect("Error getting all data from record segment")
+            .collect::<Vec<_>>();


[PerformanceOptimization]

Iterator is unnecessarily collected into Vec, defeating the purpose of the lazy iterator API. Consider processing the iterator lazily to maintain memory efficiency benefits.

// Instead of: let res = record_segment_reader.get_all_data().await?.collect::<Vec<_>>(); // Consider streaming approach: for item in record_segment_reader.get_all_data().await? { // process item }

It's a test so idc

I understand that this is a test, and often the strict performance characteristics are less critical in test code. However, maintaining the lazy processing of iterators, even in tests, can contribute to code clarity and consistency with the intended API design. It also helps prevent accidental performance regressions if this pattern were to be copied into non-test code later. For this specific instance, given it's a test, your rationale is noted.

propel-code-bot · 2025-08-13T00:39:20Z

rust/index/src/fulltext/types.rs

                let positional_posting_list = self
                    .posting_lists_blockfile_reader
                    .get_range(token.text.as_str()..=token.text.as_str(), ..)
-                    .await?;
+                    .await?
+                    .collect::<Vec<_>>();


[PerformanceOptimization]

Potential performance issue: The iterator is collected into a Vec immediately after being returned, which defeats the memory efficiency purpose of returning an iterator. Consider processing the data lazily.

// Instead of collecting immediately: let positional_posting_list = self .posting_lists_blockfile_reader .get_range(token.text.as_str()..=token.text.as_str(), ..) .await? .collect::<Vec<_>>(); // Consider processing lazily: let iter = self .posting_lists_blockfile_reader .get_range(token.text.as_str()..=token.text.as_str(), ..) .await?; // Process iter without collecting unless absolutely necessary

Sicheng-Pan · 2025-08-13T00:39:48Z

rust/blockstore/src/types/reader.rs

@@ -94,18 +94,22 @@ impl<
        &'referred_data self,
        prefix_range: PrefixRange,
        key_range: KeyRange,
-    ) -> Result<Vec<(&'referred_data str, K, V)>, Box<dyn ChromaError>>
+    ) -> Result<
+        Box<dyn Iterator<Item = (&'referred_data str, K, V)> + 'referred_data>,


q: when do we return Box<dyn trait> vs impl trait

yeah I was originally doing an impl trait. Problem is that the concrete types of memory blockfile and arrow blockfiles are different even though they both implement the same iterator trait. Hence, it does not compile with impl trait and we have to pay small runtime cost of dynamic dispatch

HammadB · 2025-08-13T00:46:12Z

rust/blockstore/src/types/reader.rs

    where
-        PrefixRange: RangeBounds<&'prefix str> + Clone,
-        KeyRange: RangeBounds<K> + Clone,
+        PrefixRange: RangeBounds<&'prefix str> + Clone + 'referred_data,


Why do you need this added bound? Doesn't quite make sense to me

Since the iterator being returned has lifetime 'referred_data (via + 'referred_data), any values captured by closures within that iterator must also live at least as long as 'referred_data. Iterator captures the prefix and key range so they also need to live at least as long as 'referred_data

HammadB · 2025-08-13T00:57:43Z

rust/blockstore/src/arrow/blockfile.rs

    where
-        PrefixRange: RangeBounds<&'prefix str> + Clone,
-        KeyRange: RangeBounds<K> + Clone,
+        PrefixRange: RangeBounds<&'prefix str> + Clone + 'me,


I think this should be 'prefix

HammadB · 2025-08-13T01:04:00Z

rust/index/src/fulltext/types.rs

@@ -454,12 +455,13 @@ impl<'reader> NgramLiteralProvider<FullTextIndexError> for FullTextIndexReader<'
        ngram_range: NgramRange,
    ) -> Result<Vec<(&'me str, u32, &'me [u32])>, FullTextIndexError>
    where
-        NgramRange: Clone + RangeBounds<&'me str> + Send + Sync,
+        NgramRange: Clone + RangeBounds<&'me str> + Send + Sync + 'me,


unrelated - i think 'me can be a separate lifetime

HammadB · 2025-08-13T01:04:43Z

rust/types/src/regex/literal_expr.rs

@@ -94,7 +94,7 @@ pub trait NgramLiteralProvider<E, const N: usize = 3> {
        ngram_range: NgramRange,
    ) -> Result<Vec<(&'me str, u32, &'me [u32])>, E>
    where
-        NgramRange: Clone + RangeBounds<&'me str> + Send + Sync;
+        NgramRange: Clone + RangeBounds<&'me str> + Send + Sync + 'me;


same lifetime comment

HammadB

Looks good minus maybe the lifetime comments

sanketkedia · 2025-08-13T01:40:39Z

Summary: 1 successful workflow, 1 pending workflow

✅ Check PR Title (1 job succeeded)
🔄 PR checks (54 jobs succeeded, 1 job pending)

Last updated: 2025-08-13 19:23:16 UTC

[PERF]: Modify get_range to return an iterator

e3d0fc4

sanketkedia requested review from HammadB, jairad26 and Sicheng-Pan August 13, 2025 00:36

sanketkedia marked this pull request as ready for review August 13, 2025 00:36

propel-code-bot bot reviewed Aug 13, 2025

View reviewed changes

Sicheng-Pan reviewed Aug 13, 2025

View reviewed changes

Sicheng-Pan approved these changes Aug 13, 2025

View reviewed changes

HammadB reviewed Aug 13, 2025

View reviewed changes

HammadB approved these changes Aug 13, 2025

View reviewed changes

sanketkedia merged commit 04df791 into main Aug 13, 2025
118 of 120 checks passed

		.expect("Error getting all data from record segment")
		.collect::<Vec<_>>();

[PERF]: Modify get_range to return an iterator #5256

[PERF]: Modify get_range to return an iterator #5256

Uh oh!

Conversation

sanketkedia commented Aug 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description of changes

Test plan

Migration plan

Observability plan

Documentation Changes

Uh oh!

sanketkedia commented Aug 13, 2025

Uh oh!

github-actions bot commented Aug 13, 2025

Reviewer Checklist

Testing, Bugs, Errors, Logs, Documentation

System Compatibility

Quality

Uh oh!

propel-code-bot bot commented Aug 13, 2025

Uh oh!

propel-code-bot bot Aug 13, 2025

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

propel-code-bot bot Aug 13, 2025

Choose a reason for hiding this comment

Uh oh!

propel-code-bot bot Aug 13, 2025

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

HammadB left a comment

Choose a reason for hiding this comment

Uh oh!

sanketkedia commented Aug 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

sanketkedia commented Aug 13, 2025 •

edited

Loading

sanketkedia commented Aug 13, 2025 •

edited

Loading