[Data] Poor utilization when reading images from URIs #48105

Open
@bveeramani

Description

What happened + What you expected to happen

I have a pandas DataFrame that contains image URIs. I load the DataFrame with from_pandas and call map to load the images. I expected the map to run as many parallel tasks, but Ray Data launches only one task, so most of my cluster sits idle.

See repro.

Versions / Dependencies

Ray 2.37

Reproduction script

import time

import numpy as np
import pandas as pd

import ray

df = pd.DataFrame({"uris": ["s3://spam/ham/eggs"] * 100})


def load_image(row):
    # Stand-in for real image loading: sleep to simulate I/O, then return a dummy image.
    time.sleep(0.1)
    return {"image": np.zeros((256, 256, 3))}


# This pipeline will only launch one task to load the images.
ds = ray.data.from_pandas(df).map(load_image)
for _ in ds.iter_internal_ref_bundles():
    pass
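
A possible workaround, sketched under assumptions rather than confirmed as the intended fix: from_pandas also accepts a list of DataFrames, so splitting the DataFrame up front may give map more than one block to work on. The helper names split_dataframe and build_dataset and the default of 20 chunks are hypothetical, and the one-block-per-DataFrame behavior is an assumption I have not verified on 2.37.

```python
from typing import List

import pandas as pd


def split_dataframe(df: pd.DataFrame, num_chunks: int) -> List[pd.DataFrame]:
    """Split df into at most num_chunks contiguous, non-empty chunks."""
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")
    # Ceiling division, so no more than num_chunks pieces are produced.
    chunk_size = max(1, -(-len(df) // num_chunks))
    return [df.iloc[i:i + chunk_size] for i in range(0, len(df), chunk_size)]


def build_dataset(df: pd.DataFrame, load_image, num_chunks: int = 20):
    """Hypothetical helper: build the map pipeline from pre-split chunks."""
    # Imported lazily so split_dataframe can be used without Ray installed.
    import ray

    # Assumption: each DataFrame in the list becomes its own block, which
    # would let map schedule a separate task per block.
    return ray.data.from_pandas(split_dataframe(df, num_chunks)).map(load_image)
```

With the repro above, build_dataset(df, load_image) would pass 20 DataFrames of 5 rows each to from_pandas. Calling repartition on the dataset before map is another option that may be simpler.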

Issue Severity

None

Labels

P1, bug, data