[Data] ray.data.from_torch fails on datasets with variable shaped images #50229

Open
@crypdick

Description

What happened + What you expected to happen

I am trying to process some image data using Ray Data. However, when I use any torchvision dataset whose images vary in shape, it fails with a ray.air.util.tensor_extensions.arrow.ArrowConversionError: Error converting data to Arrow. Example stacktrace here. Repro script below.

Versions / Dependencies

ray 2.42.0
Python 3.11.11

Reproduction script

import numpy as np
import ray
import torchvision

ray.init()

def extract_and_process_image(row: dict) -> dict:
    """Discard label and convert image to numpy array."""
    return {"image": np.array(row["item"][0])}

dataset = torchvision.datasets.Caltech256(root="~/tmp/data", download=True)
ds = ray.data.from_torch(dataset)
ds = ds.map(extract_and_process_image)
print(ds.take(1))
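
For readers hitting the same error, here is a minimal sketch of one possible stopgap, assuming the failure comes from rows carrying images of different shapes (as the title suggests). It uses only NumPy: it first shows that differently shaped arrays cannot be stacked into one tensor, then pads each image to a common size. The helper `pad_to_shape` and the target size are hypothetical names chosen for illustration, not part of Ray or torchvision, and this has not been verified against Ray 2.42.0. It also assumes every image has the same number of channels (for example, after converting to RGB).

```python
import numpy as np


def pad_to_shape(image: np.ndarray, height: int, width: int) -> np.ndarray:
    """Zero-pad (or crop) an image to a fixed height and width.

    Hypothetical helper for illustration only; trailing dimensions
    (e.g. channels) are kept unchanged.
    """
    out = np.zeros((height, width) + image.shape[2:], dtype=image.dtype)
    h = min(height, image.shape[0])
    w = min(width, image.shape[1])
    out[:h, :w] = image[:h, :w]
    return out


# Two stand-in "images" with different heights and widths.
images = [
    np.ones((2, 3, 3), dtype=np.uint8),
    np.ones((4, 2, 3), dtype=np.uint8),
]

# Arrays with mismatched shapes cannot be stacked into one tensor.
try:
    np.stack(images)
    stack_failed = False
except ValueError:
    stack_failed = True

# After padding to a common shape, stacking succeeds.
padded = np.stack([pad_to_shape(img, 4, 3) for img in images])
print(stack_failed, padded.shape)
```

In the repro script, a helper like this could be called inside extract_and_process_image before returning the row. Padding alters the data, so it is a stopgap and not a fix for the underlying conversion error.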

Issue Severity

High: It blocks me from completing my task.

Labels

P1 (issue that should be fixed within a few weeks), bug (something that is supposed to be working, but isn't), data (Ray Data-related issues)
