[Data] Add approximate quantile to aggregator #57598

owenowenisme · 2025-10-09T13:18:53Z

Why are these changes needed?

Add ApproximateQuantile aggregator to Ray Data using DataSketches KLL.

Reason:
• Enables efficient support for the summary API.
• More scalable than exact Quantile on large datasets.

Note:
• DataSketches is not added as a Ray dependency; if missing, users are prompted to install it.
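
For readers unfamiliar with the optional-dependency pattern mentioned in the note, here is a minimal sketch of how such a guard can look. The helper name `require_optional` and the error wording are hypothetical and are not taken from this PR (the diff only shows a call to `self._require_datasketches()`); the sketch assumes the guard simply attempts the import and re-raises with an install hint.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, pip_name: str) -> ModuleType:
    """Import an optional dependency, or raise ImportError with an install hint.

    Hypothetical helper for illustration only; not the helper used in this PR.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Re-raise with an actionable message while keeping the original cause.
        raise ImportError(
            f"This feature requires `{module_name}`. "
            f"Install it with `pip install {pip_name}`."
        ) from exc
```

Calling the guard at construction time means a missing package surfaces when the aggregator is created rather than partway through a job.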

Here's a simple test showing the efficiency difference between ApproximateQuantile and Quantile:

import time

import ray
from ray.data.aggregate import ApproximateQuantile, Quantile

ray.init(num_cpus=16)

# Approximate median via the KLL sketch.
ds = ray.data.range(10**8)
start_time = time.time()
print(ds.aggregate(ApproximateQuantile(on="id", quantiles=[0.5])))
print(f"Time taken ApproximateQuantile: {time.time() - start_time} seconds")

# Exact median for comparison.
ds = ray.data.range(10**8)
start_time = time.time()
print(ds.aggregate(Quantile(on="id", q=0.5)))
print(f"Time taken Quantile: {time.time() - start_time} seconds")

In this run with 1e8 rows, ApproximateQuantile returned a median of 49,979,428.0 in ~12.46s, while the exact Quantile returned 49,999,999.5 in ~163.33s. The difference reflects the sketch’s accuracy trade-off in exchange for significant speed and scalability gains.

With k=800 (the default), the error rate is guaranteed to be < 0.45%. In this test the error rate is (49,999,999.5 - 49,979,428.0) / 49,999,999.5 = 0.00041143 = 0.041143%, which is well under 0.45%, and we get the approximate median 13.11x faster.

{'approx_quantile(id)': [49979428.0]}
Time taken ApproximateQuantile: 12.457247257232666 seconds
{'quantile(id)': 49999999.5}
Time taken Quantile: 163.32705521583557 seconds
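
The arithmetic above can be reproduced with a few lines of Python. This is only a sketch using the numbers printed in this run. The rank-error line is an extra assumption: KLL error bounds are normally stated in terms of normalized rank rather than value, so the snippet also computes that figure, assuming the ids are exactly 0..10**8 - 1 as produced by ray.data.range.

```python
# Figures copied from the run output above.
exact_median = 49_999_999.5
approx_median = 49_979_428.0
n_rows = 10**8
t_approx = 12.457247257232666
t_exact = 163.32705521583557

# Relative error in value space, as computed in the PR description.
relative_value_error = (exact_median - approx_median) / exact_median

# Normalized rank of the approximate answer: with ids 0..n_rows-1, the number
# of ids <= approx_median is int(approx_median) + 1.
approx_rank = (int(approx_median) + 1) / n_rows
rank_error = abs(0.5 - approx_rank)

speedup = t_exact / t_approx

print(f"relative value error:  {relative_value_error:.6%}")
print(f"normalized rank error: {rank_error:.6%}")
print(f"speedup:               {speedup:.2f}x")
```

Both error figures come out well below the 0.45% bound quoted for k=800.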

Related issue number

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run pre-commit jobs to lint the changes in this PR. (pre-commit setup)
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>

Signed-off-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com>

goutamvenkat-anyscale · 2025-10-10T17:39:52Z

python/ray/data/aggregate.py

+        """
+        self._require_datasketches()
+        self._quantiles = quantiles
+        self._k = k


Instead of k, let's use capacity_per_level.

capacity_per_level doesn't feel accurate to me. I don't think we need to hide the detail of k, since users will need to read the DataSketches docs anyway.

I added a link to the description of the k parameter to point users to those docs for more info.

goutamvenkat-anyscale · 2025-10-10T17:40:48Z

python/ray/data/aggregate.py

+        sketch = self.zero(self._k)
+        for value in column:
+            # we ignore nulls here
+            if value.as_py() is not None:


Do we need an as_py() conversion here? What type is this value?

This is needed because without the null check we get the error below when the value is None, for example in this test:

def test_approximate_quantile_ignores_nulls(self, ray_start_regular_shared_2_cpus):
    data = [
        {"id": 1, "value": 5.0},
        {"id": 2, "value": None},
        {"id": 3, "value": 15.0},
        {"id": 4, "value": None},
        {"id": 5, "value": 25.0},
    ]
    ds = ray.data.from_items(data)
    result = ds.aggregate(ApproximateQuantile(on="value", quantiles=[0.5]))
    assert result["approx_quantile(value)"] == [15.0]

TypeError: float() argument must be a string or a number, not 'pyarrow.lib.NullScalar'
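
To illustrate the null-skipping accumulate step discussed in this thread, here is a toy, stdlib-only sketch. It is not the PR's implementation: `ToySummary` is a hypothetical, exact, list-backed stand-in for a mergeable sketch (it is not KLL), the two-block split is invented for illustration, and plain Python values are used where the real code sees pyarrow scalars and checks `as_py()`.

```python
from typing import Iterable, List, Optional


class ToySummary:
    """Exact, list-backed stand-in for a mergeable quantile sketch (not KLL)."""

    def __init__(self) -> None:
        self.values: List[float] = []

    def update(self, value: float) -> None:
        self.values.append(float(value))

    def merge(self, other: "ToySummary") -> None:
        self.values.extend(other.values)

    def get_quantile(self, q: float) -> float:
        if not self.values:
            raise ValueError("summary is empty")
        ordered = sorted(self.values)
        # Simple nearest-rank lookup, clamped to the last element.
        index = min(int(q * len(ordered)), len(ordered) - 1)
        return ordered[index]


def summarize_block(column: Iterable[Optional[float]]) -> ToySummary:
    """Build one summary per block, skipping nulls before they reach update()."""
    summary = ToySummary()
    for value in column:
        if value is not None:
            summary.update(value)
    return summary


# The values from the test above, split into two hypothetical blocks.
block_a = [5.0, None, 15.0]
block_b = [None, 25.0]

merged = summarize_block(block_a)
merged.merge(summarize_block(block_b))
median = merged.get_quantile(0.5)
print(median)
```

Without the `is not None` guard, `float(None)` would raise a TypeError inside `update`, which mirrors the failure reported above.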

Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>

owenowenisme force-pushed the data/add-approximate-quantile-to-aggregrator branch from e0584b6 to 45381b1 Compare October 9, 2025 13:20

owenowenisme added the go add ONLY when ready to merge, run all tests label Oct 9, 2025

use test deps for datasketches

024f199

Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>

owenowenisme force-pushed the data/add-approximate-quantile-to-aggregrator branch from 45381b1 to 024f199 Compare October 9, 2025 23:55

owenowenisme added 4 commits October 10, 2025 00:39

add import error

f983338

Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>

update requirements_compiled.txt

93a531b

Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>

add stability annotation

153a44c

Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>

add to rst

891dccb

Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>

owenowenisme marked this pull request as ready for review October 10, 2025 08:27

owenowenisme requested a review from a team as a code owner October 10, 2025 08:27

owenowenisme added 2 commits October 10, 2025 17:26

Update incorrect comments

b0e9923

Signed-off-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com>

Remove print

a3978b4

Signed-off-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com>

ray-gardener bot added the data Ray Data-related issues label Oct 10, 2025

goutamvenkat-anyscale reviewed Oct 10, 2025

goutamvenkat-anyscale reviewed Oct 10, 2025

owenowenisme added 2 commits October 11, 2025 01:12

update KLL link

47d8d7b

Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>

add link to k params

47a89f8

Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>
