
[data] sort large dataset by ray.data.Dataset always fail #49679

Open
@Yanghello

Description

What happened + What you expected to happen

I ran a sort on a large dataset (1,000 million rows × 1,000 columns, ~1 TB total) with the following test code:

import ray

ray.init()

# Enable push-based shuffle for the sort.
ctx = ray.data.DataContext.get_current()
ctx.use_push_based_shuffle = True

data_path = "hdfs://ip:port/home/data/testdata/1y_rows_1000_columns/"

data = ray.data.read_csv(data_path)
data = data.sort("id")
data = data.materialize()

print(data.count())
print(data.schema())
print(data.take(10))

Running on my Ray cluster (40 workers × 16 CPUs / 64 GB each), it always fails with a worker-dead error. Is there some configuration I am missing for sorting data at this scale?
[Screenshot: worker-dead error, 2025-01-07 18:59:26]
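For context, a rough back-of-envelope memory estimate for this workload (all figures below are assumptions based on the cluster description above and Ray's commonly cited default of reserving roughly 30% of node memory for the object store; they are not measured values):

```python
# Back-of-envelope memory estimate for sorting ~1 TB on 40 x 64 GB workers.
# All constants are assumptions taken from the issue description, not
# measurements from the actual cluster.

TOTAL_DATA_GB = 1000           # ~1 TB dataset
NUM_WORKERS = 40
MEM_PER_WORKER_GB = 64
OBJECT_STORE_FRACTION = 0.3    # assumed default object store share per node

cluster_mem_gb = NUM_WORKERS * MEM_PER_WORKER_GB
object_store_gb = cluster_mem_gb * OBJECT_STORE_FRACTION
data_per_worker_gb = TOTAL_DATA_GB / NUM_WORKERS

print(f"cluster memory:            {cluster_mem_gb} GB")
print(f"aggregate object store:    {object_store_gb:.0f} GB")
print(f"data per worker:           {data_per_worker_gb:.0f} GB")

# A sort-shuffle materializes the whole dataset in the object store, so if
# the dataset exceeds the aggregate object store, Ray must spill to disk
# and memory pressure on individual workers can kill them.
print("dataset exceeds object store:", TOTAL_DATA_GB > object_store_gb)
```

Under these assumptions the ~1 TB dataset is larger than the ~768 GB aggregate object store, so heavy spilling (and per-worker memory pressure) during the shuffle is expected.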

Versions / Dependencies

ray == 2.10.0

Reproduction script

Same as the script in the description above.

Issue Severity

None


Metadata

Labels

P1 — Issue that should be fixed within a few weeks
bug — Something that is supposed to be working; but isn't
data — Ray Data-related issues
