[Data] Poor utilization when reading images from URIs #48105
Open
Description
What happened + What you expected to happen
I have a pandas DataFrame that contains image URIs. I load the DataFrame with from_pandas and call map to load the images, but I'm observing that Ray Data isn't fully utilizing my cluster. See repro.
Versions / Dependencies
Ray 2.37
Reproduction script
import time

import numpy as np
import pandas as pd
import ray

df = pd.DataFrame({"uris": ["s3://spam/ham/eggs"] * 100})


def load_image(row):
    time.sleep(0.1)
    return {"image": np.zeros((256, 256, 3))}


# This pipeline will only launch one task to load the images.
ds = ray.data.from_pandas(df).map(load_image)
for _ in ds.iter_internal_ref_bundles():
    pass
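A possible workaround, sketched under an assumption this issue does not confirm: that the single task is a consequence of from_pandas turning one DataFrame into one block. If that holds, passing from_pandas a list of smaller DataFrames should give map several blocks to work on in parallel. The helper names split_dataframe and build_dataset and the num_blocks parameter are hypothetical, introduced only for illustration.

```python
import math
import time

import numpy as np
import pandas as pd


def split_dataframe(df: pd.DataFrame, num_chunks: int) -> list:
    """Split df row-wise into at most num_chunks pieces of roughly equal size."""
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")
    # Round up so that every row lands in some chunk.
    chunk_size = max(1, math.ceil(len(df) / num_chunks))
    return [
        df.iloc[start:start + chunk_size]
        for start in range(0, len(df), chunk_size)
    ]


def load_image(row):
    # Same stand-in as in the repro: simulate a slow image read.
    time.sleep(0.1)
    return {"image": np.zeros((256, 256, 3))}


def build_dataset(df: pd.DataFrame, num_blocks: int):
    """Hypothetical helper: build the dataset from several DataFrames."""
    # Imported here so split_dataframe can be used without Ray installed.
    import ray

    # from_pandas also accepts a list of DataFrames.
    chunks = split_dataframe(df, num_blocks)
    return ray.data.from_pandas(chunks).map(load_image)
```

With the repro's df, this would be called as build_dataset(df, num_blocks=10). Calling repartition on the dataset before map may achieve a similar effect; which option performs better here has not been measured.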
Issue Severity
None