[data] Ray Data adds >100ms delay before producing the first batch of a Dataset #42376

Open
@stephanie-wang

Description

What happened + What you expected to happen

When iterating through a Dataset multiple times, the first batch always appears to be slow to return.

Versions / Dependencies

3.0dev

Reproduction script

The script below builds a simple no-op Dataset that returns 10 rows and iterates over it three times. The first block always takes >100ms to return.

import time

import ray

ctx = ray.data.DataContext.get_current()
# Limit Ray Data execution to 2 CPUs.
ctx.execution_options.resource_limits.cpu = 2


def sleep(row):
    # No-op map function; uncomment the sleep to simulate per-row work.
    # time.sleep(1)
    return {"val": 1}


# Request 10 blocks for the 10 rows.
ds = ray.data.range(10, parallelism=10)
sleep_ds = ds.map(sleep)
batch_start = time.perf_counter()

# Iterate over the same Dataset three times.
for _ in range(3):
    start = time.perf_counter()

    i = 0
    for batch in sleep_ds.iter_batches(batch_size=None):
        # Time blocked waiting for this batch.
        print("blocked time", time.perf_counter() - batch_start)
        time.sleep(0.5)  # Simulate the consumer processing the batch.
        i += 1
        batch_start = time.perf_counter()

    end = time.perf_counter()
    print("Took", end - start, "expected time", 0.5 * i)
    print(sleep_ds.stats())

# Dump a Chrome trace of the run for inspection.
ray.timeline("timeline.json")
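
To separate the first-batch wait from time spent outside the iterator (for example, the `stats()` call between passes), the measurement could be factored into a small helper. This is only a sketch: `time_batches` and `slow_start` are hypothetical names, and `slow_start` is a plain-Python stand-in for a pipeline with a startup cost, not Ray Data itself.

```python
import time
from typing import Callable, Iterable, List


def time_batches(make_iter: Callable[[], Iterable], consume_s: float = 0.0) -> List[float]:
    """Return the time spent blocked waiting for each item of a fresh iterator."""
    waits = []
    # Start the clock right before the iterator is created, so that work
    # done between passes is not attributed to the first item.
    wait_start = time.perf_counter()
    for _ in make_iter():
        waits.append(time.perf_counter() - wait_start)
        time.sleep(consume_s)  # Simulated consumer work; not counted as wait.
        wait_start = time.perf_counter()
    return waits


def slow_start(n: int = 3, startup_s: float = 0.05):
    # Hypothetical stand-in: pays a one-time startup cost, then yields quickly.
    time.sleep(startup_s)
    for i in range(n):
        yield i


waits = time_batches(slow_start)
print("first wait:", waits[0], "later waits:", waits[1:])
```

With Ray installed, the same helper could be pointed at the Dataset above with `time_batches(lambda: sleep_ds.iter_batches(batch_size=None), consume_s=0.5)`, calling it once per pass and comparing the first entry against the rest.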

Issue Severity

None

Metadata

Labels

P1: Issue that should be fixed within a few weeks
bug: Something that is supposed to be working; but isn't
data: Ray Data-related issues
ray-2.10
