Skip to content

[PERF]: Modify get_range to return an iterator #5256

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

sanketkedia
Copy link
Contributor

@sanketkedia sanketkedia commented Aug 13, 2025

Description of changes

Summarize the changes made by this PR.

  • Improvements & Bug fixes
    • BlockfileReader's get_range() method used to return a Vec. The overhead of vector allocation and extend is non-trivial especially if the range is large. Regex brute force for e.g. has this overhead.
    • This PR modifies the interface to return an iterator instead of a vector so that the overhead of materialization can be cut down
    • get_all_data() is also updated to return an iterator and all callsites updated
  • New functionality
    • ...

Test plan

How are these changes tested?

  • Tests pass locally with pytest for python, yarn test for js, cargo test for rust

Migration plan

None

Observability plan

Perf testing on staging

Documentation Changes

None

Copy link
Contributor Author

This stack of pull requests is managed by Graphite. Learn more about stacking.

Copy link

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of a unexpectedly high quality (Readability, Modularity, Intuitiveness)

@sanketkedia sanketkedia marked this pull request as ready for review August 13, 2025 00:36
Copy link
Contributor

Optimize get_range to Return Iterator Instead of Vec for BlockfileReader

This PR changes the signature and behavior of BlockfileReader's get_range and related methods across the Rust codebase, replacing Vector-based results (Vec) with iterator-based results (impl Iterator). This change is aimed at reducing the overhead of materializing large intermediate Vecs during range queries, such as those used in brute force regex or full dataset scans, thereby improving performance and memory efficiency. All relevant call-sites, including get_all_data in segment readers, the full-text indexing logic, and associated test utilities, have been updated to accommodate the new iterator-based APIs.

Key Changes

BlockfileReader's get_range() and related APIs now return iterators (impl Iterator) instead of Vecs, reducing allocation overhead.
• All call-sites (including get_all_data, regex/literal expression matching, index and segment test code) updated to consume iterators and materialize via collect() where necessary.
• Type signatures and trait bounds expanded to propagate appropriate lifetimes and me/``referred_data regions required for returned iterators, affecting API signatures throughout blockstore, segment, and index crates.
• Associated test and utility code updated to collect iterators for assertions or computation, maintaining test correctness.

Affected Areas

• rust/blockstore/src/arrow/blockfile.rs (BlockfileReader implementation)
• rust/blockstore/src/types/reader.rs (BlockfileReader trait and logic)
• rust/segment/src/blockfile_record.rs (get_all_data and dependent logic)
• rust/segment/src/blockfile_metadata.rs (affected get_all_data callsites in tests and writers)
• rust/segment/src/types.rs (test helpers and usages of get_all_data)
• rust/index/src/fulltext/types.rs (FullTextIndexReader/Literal matching)
• rust/types/src/regex/literal_expr.rs (LiteralExpr trait updates)
• rust/blockstore/tests/blockfile_writer_test.rs (test iterations and comparisons)
• rust/segment/src/distributed_spann.rs, rust/index/src/spann/types.rs (segment test support)

This summary was automatically generated by @propel-code-bot

Comment on lines +1228 to +1229
.expect("Error getting all data from record segment")
.collect::<Vec<_>>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[PerformanceOptimization]

Iterator is unnecessarily collected into Vec, defeating the purpose of the lazy iterator API. Consider processing the iterator lazily to maintain memory efficiency benefits.

// Instead of:
let res = record_segment_reader.get_all_data().await?.collect::<Vec<_>>();

// Consider streaming approach:
for item in record_segment_reader.get_all_data().await? {
    // process item
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a test so idc

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that this is a test, and often the strict performance characteristics are less critical in test code. However, maintaining the lazy processing of iterators, even in tests, can contribute to code clarity and consistency with the intended API design. It also helps prevent accidental performance regressions if this pattern were to be copied into non-test code later. For this specific instance, given it's a test, your rationale is noted.

Comment on lines 327 to +331
let positional_posting_list = self
.posting_lists_blockfile_reader
.get_range(token.text.as_str()..=token.text.as_str(), ..)
.await?;
.await?
.collect::<Vec<_>>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[PerformanceOptimization]

Potential performance issue: The iterator is collected into a Vec immediately after being returned, which defeats the memory efficiency purpose of returning an iterator. Consider processing the data lazily.

// Instead of collecting immediately:
let positional_posting_list = self
    .posting_lists_blockfile_reader
    .get_range(token.text.as_str()..=token.text.as_str(), ..)
    .await?
    .collect::<Vec<_>>();

// Consider processing lazily:
let iter = self
    .posting_lists_blockfile_reader
    .get_range(token.text.as_str()..=token.text.as_str(), ..)
    .await?;
// Process iter without collecting unless absolutely necessary

@@ -94,18 +94,22 @@ impl<
&'referred_data self,
prefix_range: PrefixRange,
key_range: KeyRange,
) -> Result<Vec<(&'referred_data str, K, V)>, Box<dyn ChromaError>>
) -> Result<
Box<dyn Iterator<Item = (&'referred_data str, K, V)> + 'referred_data>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

q: when do we return Box<dyn trait> vs impl trait

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I was originally doing an impl trait. Problem is that the concrete types of memory blockfile and arrow blockfiles are different even though they both implement the same iterator trait. Hence, it does not compile with impl trait and we have to pay small runtime cost of dynamic dispatch

where
PrefixRange: RangeBounds<&'prefix str> + Clone,
KeyRange: RangeBounds<K> + Clone,
PrefixRange: RangeBounds<&'prefix str> + Clone + 'referred_data,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need this added bound? Doesn't quite make sense to me

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the iterator being returned has lifetime 'referred_data (via + 'referred_data), any values captured by closures within that iterator must also live at least as long as 'referred_data. Iterator captures the prefix and key range so they also need to live at least as long as 'referred_data

where
PrefixRange: RangeBounds<&'prefix str> + Clone,
KeyRange: RangeBounds<K> + Clone,
PrefixRange: RangeBounds<&'prefix str> + Clone + 'me,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be 'prefix

@@ -454,12 +455,13 @@ impl<'reader> NgramLiteralProvider<FullTextIndexError> for FullTextIndexReader<'
ngram_range: NgramRange,
) -> Result<Vec<(&'me str, u32, &'me [u32])>, FullTextIndexError>
where
NgramRange: Clone + RangeBounds<&'me str> + Send + Sync,
NgramRange: Clone + RangeBounds<&'me str> + Send + Sync + 'me,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unrelated - i think 'me can be a separate lifetime

@@ -94,7 +94,7 @@ pub trait NgramLiteralProvider<E, const N: usize = 3> {
ngram_range: NgramRange,
) -> Result<Vec<(&'me str, u32, &'me [u32])>, E>
where
NgramRange: Clone + RangeBounds<&'me str> + Send + Sync;
NgramRange: Clone + RangeBounds<&'me str> + Send + Sync + 'me;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same lifetime comment

Copy link
Collaborator

@HammadB HammadB left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good minus maybe the lifetime comments

Copy link
Contributor Author

sanketkedia commented Aug 13, 2025

Summary: 1 successful workflow, 1 pending workflow

Last updated: 2025-08-13 19:23:16 UTC

@sanketkedia sanketkedia merged commit 04df791 into main Aug 13, 2025
118 of 120 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants