Speed up ht.percentile
#1389
Comments
As far as I understand, computing percentiles in a distributed setting can be a real problem because of the heavy communication involved, since re-ordering the data might be necessary. One option might be to add a faster variant using sketching, i.e. drawing a "small" sample from the data set and computing the percentiles of this sample. Maybe https://arxiv.org/pdf/2004.08604.pdf helps, although this algorithm has been designed for 1d (?) data streams. |
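To make the sampling idea concrete, here is a minimal sketch in plain NumPy (not Heat, and not the algorithm from the linked paper). The function name `approx_percentile` and its parameters `sample_size` and `seed` are hypothetical and only serve the illustration. It draws a uniform random sample without replacement and computes the percentile of that sample:

```python
import numpy as np


def approx_percentile(x, q, sample_size=10_000, seed=0):
    """Estimate the q-th percentile(s) of x from a uniform random sample.

    Hypothetical helper for illustration only; it is not part of Heat.
    """
    x = np.asarray(x).ravel()
    # If the sample would cover the whole data set, compute the exact value.
    if sample_size >= x.size:
        return np.percentile(x, q)
    rng = np.random.default_rng(seed)
    # Uniform sample without replacement, so no element is counted twice.
    idx = rng.choice(x.size, size=sample_size, replace=False)
    return np.percentile(x[idx], q)


if __name__ == "__main__":
    data = np.random.default_rng(42).normal(size=1_000_000)
    print("exact :", np.percentile(data, [25, 50, 75]))
    print("approx:", approx_percentile(data, [25, 50, 75]))
```

In a distributed setting one could presumably let each process draw a sample from its local chunk and gather only the samples, which would avoid re-ordering the full array. The result is then an estimate whose accuracy depends on the sample size, not an exact percentile.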
I have implemented a version of |
Branch features/1389-Speed_up_ht_percentile created! |
This issue is stale because it has been open for 60 days with no activity. |
Closing this for now as #1420 has been merged and provides a faster alternative. |
Feature functionality
ht.percentile() performance seems disproportionately slow, even considering that it potentially requires sorting along the split axis. This function should be heavily refactored or even reimplemented from scratch. I vaguely remember implementing it myself back in the day. I'm afraid to look into that code.
Additional context
Code snippet to benchmark (I was running it on 4 processes on HDFML; on GPU it even deadlocks), based on @mrfh92's preprocessing tutorial.