[Datasets] Add ImageFolderDatasource #24641

Merged 28 commits on Jul 16, 2022

Member

@bveeramani bveeramani commented May 10, 2022

Why are these changes needed?

Popular datasets like ImageNet and Tiny ImageNet are arranged in a specific layout like this:

root/dog/xxx.png
root/dog/xxy.png
root/dog/[...]/xxz.png

root/cat/123.png
root/cat/nsdf3.png
root/cat/[...]/asd932_.png

This PR adds a datasource that reads such datasets.
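
As a rough illustration of the layout above, the sketch below walks a directory tree and pairs each image with the name of the first directory beneath the root. It is not the datasource added in this PR: `index_image_folder` and `IMAGE_EXTENSIONS` are hypothetical names, the extension list is an assumption, and files sitting directly in the root are simply skipped here.

```python
import tempfile
from pathlib import Path

# Assumed set of image extensions; not taken from the PR.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def index_image_folder(root):
    """Return (relative_path, label) pairs for image files under root."""
    root = Path(root)
    samples = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in IMAGE_EXTENSIONS:
            continue
        parts = path.relative_to(root).parts
        # The label is the first directory below root, however deeply
        # the file is nested. Files directly in root are skipped.
        if len(parts) >= 2:
            samples.append((path.relative_to(root).as_posix(), parts[0]))
    return samples


# Build a tiny throwaway tree that mirrors the layout described above.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["dog/xxx.png", "dog/nested/xxz.png", "cat/123.png"]:
        file = Path(tmp) / name
        file.parent.mkdir(parents=True, exist_ok=True)
        file.touch()
    samples = index_image_folder(tmp)
```

Taking the first path component as the label keeps nested files such as `root/dog/[...]/xxz.png` in the `dog` class.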

Related issue number

Closes #23977

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@bveeramani bveeramani changed the title [Datasets] Add ImageFolderDatasource [Datasets] [WIP] Add ImageFolderDatasource May 10, 2022
@bveeramani bveeramani marked this pull request as draft May 10, 2022 06:07
@bveeramani bveeramani changed the title [Datasets] [WIP] Add ImageFolderDatasource [Datasets] Add ImageFolderDatasource May 11, 2022
@bveeramani bveeramani marked this pull request as ready for review May 11, 2022 08:58
@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label May 11, 2022
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
@bveeramani bveeramani requested a review from maxpumperla as a code owner May 18, 2022 08:58
@richardliaw
Contributor

For the CUJ that Amog posted, can we add it as an example to test in CI?

@bveeramani
Member Author

For the CUJ that Amog posted, can we add it as an example to test in CI?

No. There are issues with applying TorchVision transformations that are completely unrelated to this PR.

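To make this code snippet work, you need to add several workarounds. The original snippet:

def preprocess(df):
    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])
    df["image"] = df["image"].map(preprocess)
    return df

With the workarounds applied:

import numpy as np
from ray.data.extensions import TensorArray
from torchvision import transforms

def preprocess(df):
    preprocess = transforms.Compose([
        # Convert each Ray tensor element to a NumPy array first.
        lambda ray_tensor: ray_tensor.to_numpy(),
        # ToTensor now runs before Resize/CenterCrop so they receive a torch tensor.
        transforms.ToTensor(),
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
        # Convert back to a float32 NumPy array so the result fits in a TensorArray.
        lambda torch_tensor: torch_tensor.numpy().astype(np.float32)
    ])
    df["image"] = TensorArray([preprocess(image) for image in df["image"]])
    return df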

@richardliaw
Contributor

@bveeramani sorry, what I mean is we should have an end-to-end test such as Amog's example (with whatever modifications you want to make).

Can you please do that before merging?

@bveeramani
Member Author

bveeramani commented Jul 15, 2022

Can you please do that before merging?

@richardliaw Added an E2E test. I wasn't sure what to test other than that it runs without errors.

Contributor

@clarkzinzow clarkzinzow left a comment

LGTM, great work!

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
@richardliaw richardliaw merged commit 34cf1f1 into ray-project:master Jul 16, 2022
path, data = records[0]

image = iio.imread(data)
label = _get_class_from_path(path, self.root)
Member

Do we have any docs or past discussion about this part? Basically, we're assuming the label comes from the user's file path, which has to be structured in a certain way to yield the correct label, with no knobs for passing in a custom label file or a join?

For example, if I read an S3 bucket with filenames like "dog.jpg" and "dog_2.jpg", my dataloader will end up getting these string values by default.

Member Author

Basically, we're assuming the label comes from the user's file path, which has to be structured in a certain way to yield the correct label, with no knobs for passing in a custom label file or a join?

Yeah, that's right. The datasource assumes that the layout is structured in the same way as ImageNet. The functionality of the datasource is based on that of TorchVision's ImageFolder.

For example, if I read an S3 bucket with filenames like "dog.jpg" and "dog_2.jpg", my dataloader will end up getting these string values by default.

Yeah, you're right. We don't validate that the label corresponds to a directory. In this case, we could raise an error stating that the folder isn't structured correctly.

Alternatively, if images aren't stored in a directory, we could set the label to None.

If images aren't stored in a sub-directory, then the image's label will be set to `None`.

.. code-block::

    root/dog/xxx.png  # Label is 'dog'
    root/123.jpg      # Label is `None`
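
The two options discussed in this thread (return `None`, or raise an error for a misplaced file) can be sketched as a single helper. This is a minimal illustration under assumptions, not the PR's `_get_class_from_path`: the name `label_from_path` and the `strict` flag are hypothetical, and paths are treated as POSIX-style strings.

```python
from pathlib import PurePosixPath


def label_from_path(path, root, strict=False):
    """Return the first directory below root, or None if there is none.

    Hypothetical helper for illustration only. relative_to() raises
    ValueError when path is not located under root.
    """
    relative = PurePosixPath(path).relative_to(PurePosixPath(root))
    if len(relative.parts) < 2:
        # The file sits directly in root, so no directory names its class.
        if strict:
            raise ValueError(f"{path!r} is not inside a class sub-directory")
        return None
    return relative.parts[0]


dog_label = label_from_path("root/dog/xxx.png", "root")
nested_label = label_from_path("root/dog/a/b/xxz.png", "root")
missing_label = label_from_path("root/123.jpg", "root")
```

With `strict=True` the same call on `root/123.jpg` raises `ValueError` instead, which corresponds to the "raise an error" alternative mentioned above.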

@bveeramani bveeramani deleted the image-datasource branch July 16, 2022 23:05
xwjiang2010 pushed a commit to xwjiang2010/ray that referenced this pull request Jul 19, 2022
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Stefan van der Kleij <s.vanderkleij@viroteq.com>
Development

Successfully merging this pull request may close these issues.

[Datasets] [Feature] Add datasource for canned ML datasets and imagery data sources
7 participants