-
Notifications
You must be signed in to change notification settings - Fork 1.7k
[PERF]: Modify get_range to return an iterator #5256
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[PERF]: Modify get_range to return an iterator #5256
Conversation
Reviewer ChecklistPlease leverage this checklist to ensure your code review is thorough before approving Testing, Bugs, Errors, Logs, Documentation
System Compatibility
Quality
|
Optimize get_range to Return Iterator Instead of Vec for BlockfileReader This PR changes the signature and behavior of BlockfileReader's get_range and related methods across the Rust codebase, replacing Vector-based results (Vec) with iterator-based results (impl Iterator). This change is aimed at reducing the overhead of materializing large intermediate Vecs during range queries, such as those used in brute force regex or full dataset scans, thereby improving performance and memory efficiency. All relevant call-sites, including get_all_data in segment readers, the full-text indexing logic, and associated test utilities, have been updated to accommodate the new iterator-based APIs. Key Changes• Affected Areas• rust/blockstore/src/arrow/blockfile.rs ( This summary was automatically generated by @propel-code-bot |
.expect("Error getting all data from record segment") | ||
.collect::<Vec<_>>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[PerformanceOptimization]
Iterator is unnecessarily collected into Vec, defeating the purpose of the lazy iterator API. Consider processing the iterator lazily to maintain memory efficiency benefits.
// Instead of:
let res = record_segment_reader.get_all_data().await?.collect::<Vec<_>>();
// Consider streaming approach:
for item in record_segment_reader.get_all_data().await? {
// process item
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a test so idc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand that this is a test, and often the strict performance characteristics are less critical in test code. However, maintaining the lazy processing of iterators, even in tests, can contribute to code clarity and consistency with the intended API design. It also helps prevent accidental performance regressions if this pattern were to be copied into non-test code later. For this specific instance, given it's a test, your rationale is noted.
let positional_posting_list = self | ||
.posting_lists_blockfile_reader | ||
.get_range(token.text.as_str()..=token.text.as_str(), ..) | ||
.await?; | ||
.await? | ||
.collect::<Vec<_>>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[PerformanceOptimization]
Potential performance issue: The iterator is collected into a Vec immediately after being returned, which defeats the memory efficiency purpose of returning an iterator. Consider processing the data lazily.
// Instead of collecting immediately:
let positional_posting_list = self
.posting_lists_blockfile_reader
.get_range(token.text.as_str()..=token.text.as_str(), ..)
.await?
.collect::<Vec<_>>();
// Consider processing lazily:
let iter = self
.posting_lists_blockfile_reader
.get_range(token.text.as_str()..=token.text.as_str(), ..)
.await?;
// Process iter without collecting unless absolutely necessary
@@ -94,18 +94,22 @@ impl< | |||
&'referred_data self, | |||
prefix_range: PrefixRange, | |||
key_range: KeyRange, | |||
) -> Result<Vec<(&'referred_data str, K, V)>, Box<dyn ChromaError>> | |||
) -> Result< | |||
Box<dyn Iterator<Item = (&'referred_data str, K, V)> + 'referred_data>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
q: when do we return Box<dyn trait>
vs impl trait
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah I was originally doing an impl trait
. Problem is that the concrete types of memory blockfile and arrow blockfiles are different even though they both implement the same iterator trait. Hence, it does not compile with impl trait
and we have to pay small runtime cost of dynamic dispatch
where | ||
PrefixRange: RangeBounds<&'prefix str> + Clone, | ||
KeyRange: RangeBounds<K> + Clone, | ||
PrefixRange: RangeBounds<&'prefix str> + Clone + 'referred_data, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you need this added bound? Doesn't quite make sense to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the iterator being returned has lifetime 'referred_data (via + 'referred_data), any values captured by closures within that iterator must also live at least as long as 'referred_data. Iterator captures the prefix and key range so they also need to live at least as long as 'referred_data
where | ||
PrefixRange: RangeBounds<&'prefix str> + Clone, | ||
KeyRange: RangeBounds<K> + Clone, | ||
PrefixRange: RangeBounds<&'prefix str> + Clone + 'me, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be 'prefix
@@ -454,12 +455,13 @@ impl<'reader> NgramLiteralProvider<FullTextIndexError> for FullTextIndexReader<' | |||
ngram_range: NgramRange, | |||
) -> Result<Vec<(&'me str, u32, &'me [u32])>, FullTextIndexError> | |||
where | |||
NgramRange: Clone + RangeBounds<&'me str> + Send + Sync, | |||
NgramRange: Clone + RangeBounds<&'me str> + Send + Sync + 'me, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unrelated - i think 'me can be a separate lifetime
@@ -94,7 +94,7 @@ pub trait NgramLiteralProvider<E, const N: usize = 3> { | |||
ngram_range: NgramRange, | |||
) -> Result<Vec<(&'me str, u32, &'me [u32])>, E> | |||
where | |||
NgramRange: Clone + RangeBounds<&'me str> + Send + Sync; | |||
NgramRange: Clone + RangeBounds<&'me str> + Send + Sync + 'me; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same lifetime comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good minus maybe the lifetime comments
Summary: 1 successful workflow, 1 pending workflow
Last updated: 2025-08-13 19:23:16 UTC |
Description of changes
Summarize the changes made by this PR.
Test plan
How are these changes tested?
pytest
for python,yarn test
for js,cargo test
for rustMigration plan
None
Observability plan
Perf testing on staging
Documentation Changes
None