Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up ht.percentile #1389

Closed
ClaudiaComito opened this issue Feb 29, 2024 · 6 comments
Closed

Speed up ht.percentile #1389

ClaudiaComito opened this issue Feb 29, 2024 · 6 comments
Assignees
Labels
Milestone

Comments

@ClaudiaComito
Copy link
Contributor

ClaudiaComito commented Feb 29, 2024

Feature functionality
ht.percentile() performance seems disproportionately slow, even considering that it potentially requires sorting along the split axis.

This function should be heavily refactored or even reimplemented from scratch. I vaguely remember implementing it myself back in the day. I'm afraid to look into that code.

Additional context
Code snippet to benchmark (I was running it on 4 processes on HDFML - on GPU it even deadlocks) based on @mrfh92 's preprocessing tutorial.

import heat as ht
import time

start = time.perf_counter()
X = ht.load_hdf5("/p/scratch/training2404/data/JPL_SBDB/sbdb_asteroids.h5",dataset="data",split=0, device="cpu")
end = time.perf_counter()
print(f"Loading data on {X.comm.size} ranks took {end-start} seconds")

if X.comm.rank == 0:
    print(f"X is a {X.ndim}-dimensional array with shape{X.shape}")
    print(f"X takes up {X.nbytes/1e6} MB of memory.")


start = time.perf_counter()
features_mean = ht.mean(X,axis=0)
end = time.perf_counter()

if X.comm.rank == 0:
    print(f"Mean on {X.comm.size} ranks took {end-start} seconds")

start = time.perf_counter()
features_median = ht.percentile(X,50.,axis=0)
end = time.perf_counter()

if X.comm.rank == 0:
    print(f"Median on {X.comm.size} ranks took {end-start} seconds")
Loading data on 4 ranks took 0.015628864999825964 seconds
X is a 2-dimensional array with shape(1317275, 9)
X takes up 47.4219 MB of memory.
Mean on 4 ranks took 0.004456439000023238 seconds
Median on 4 ranks took 62.82521499199993 seconds
@ClaudiaComito ClaudiaComito added enhancement New feature or request statistics labels Feb 29, 2024
@ClaudiaComito ClaudiaComito added this to the 1.5.0 milestone Feb 29, 2024
@mrfh92
Copy link
Collaborator

mrfh92 commented Mar 11, 2024

As far as I understand, computing percentiles in a distributed setting might really pose a problem in terms of quite heavy communication as re-ordering might be necessary. An option might be to add a faster variant using sketching, i.e. drawing a "small" sample from the data set and computing percentiles for this sample.

Maybe https://arxiv.org/pdf/2004.08604.pdf helps, although this algorithms has been designed for 1d (?) data streams.

@mrfh92
Copy link
Collaborator

mrfh92 commented Apr 4, 2024

I have implemented a version of ht.percentile that estimates the true percentile only on behalf of a random sketch of the entire data set. #1420

@ClaudiaComito ClaudiaComito self-assigned this Apr 11, 2024
Copy link
Contributor

Branch features/1389-Speed_up_ht_percentile created!

Copy link
Contributor

This issue is stale because it has been open for 60 days with no activity.

@github-actions github-actions bot added the stale label Jun 17, 2024
@mrfh92 mrfh92 removed the stale label Jun 24, 2024
Copy link
Contributor

This issue is stale because it has been open for 60 days with no activity.

@github-actions github-actions bot added the stale label Aug 26, 2024
@ClaudiaComito
Copy link
Contributor Author

Closing this for now as #1420 has been merged and provides a faster alternative.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants