[data] Ray Data adds >100ms delay before producing the first batch of a Dataset #42376
Open
Description
What happened + What you expected to happen
When iterating through a Dataset multiple times, the first batch of each iteration is always slow to return (>100ms). I expected the first batch of a no-op pipeline to be available almost immediately.
Versions / Dependencies
3.0dev
Reproduction script
A simple no-op Dataset that just returns 10 rows. With batch_size=None each batch is one block, and the first block of every pass takes >100ms to return.
import ray
import time
import numpy as np

ctx = ray.data.DataContext.get_current()

def sleep(row):
    #time.sleep(1)
    return {"val": 1}

ctx.execution_options.resource_limits.cpu = 2
ds = ray.data.range(10, parallelism=10)
sleep_ds = ds.map(sleep)

batch_start = time.perf_counter()
for _ in range(3):
    start = time.perf_counter()
    i = 0
    for batch in sleep_ds.iter_batches(batch_size=None):
        print("blocked time", time.perf_counter() - batch_start)
        time.sleep(0.5)
        i += 1
        batch_start = time.perf_counter()
    end = time.perf_counter()
    print("Took", end - start, "expected time", 0.5 * i)

print(sleep_ds.stats())
ray.timeline("timeline.json")
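To isolate the first-batch delay from the per-batch consumer time, it may help to time only the gap between creating the iterator and receiving its first item. The sketch below is not part of the original script: `first_item_latencies` and `slow_start` are hypothetical names, and `slow_start` is a stand-in generator that simulates a startup delay so the helper can be tried without a Ray cluster.

```python
import time
from typing import Callable, Iterable, List


def first_item_latencies(make_iter: Callable[[], Iterable], epochs: int = 3) -> List[float]:
    """Return, for each epoch, the seconds elapsed between creating the
    iterator and receiving its first item."""
    latencies = []
    for _epoch in range(epochs):
        start = time.perf_counter()
        it = iter(make_iter())
        next(it, None)  # blocks until the first item arrives (or the iterator is empty)
        latencies.append(time.perf_counter() - start)
        # Drain the remaining items so every epoch runs to completion.
        for _item in it:
            pass
    return latencies


def slow_start():
    """Hypothetical stand-in for a pipeline with a fixed startup cost."""
    time.sleep(0.05)  # simulated startup delay before the first item
    yield from range(3)


if __name__ == "__main__":
    print(first_item_latencies(slow_start))
```

Against the script above, the same helper could presumably be called as `first_item_latencies(lambda: sleep_ds.iter_batches(batch_size=None))`, which would report one first-batch latency per pass without the 0.5s consumer sleep mixed in.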
Issue Severity
None