refactor: Rework get_online_features
helper functions
#5060
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
What this PR does / why we need it:
Is some use cases, for example in recommender systems, we need to retrieve features for hundreds and thousands of entities. As stated in the issue #3596, current implementation has a significant overhead.
I ran a benchmark test for 1000 entities with profiling and noticed, that
_populate_response_from_feature_data
and_convert_rows_to_protobuf
functions are quite heavy.A have several proposals for improvement:
_populate_response_from_feature_data
has 3 calls ofapply_list_mapping
where we calculateoutput_len = sum(len(item) for item in mapping_indexes)
which is time consuming. We could calculateoutput_len
only once, when we deduplicate entities. And we could combine all 3 vectors into response in one loop_convert_rows_to_protobuf
and start re-formatting it with zip(zip)s. We could constructfeature_data
exactly in the desired format and get rid of this unpacking.This test should cover all the above changes.
Which issue(s) this PR fixes:
Misc
For experimentation I used current benchmark tooling with some adjustments. What I changed:
redis
andsqlite
Here's results for the latest release (time in ms):
And here's results for proposed changes. Performance for 1000 and more entities is better: