add tests for prototype builtin datasets #4682

Merged
pmeier merged 19 commits into pytorch:main from datasets/tests/builtin on Nov 4, 2021

Conversation

pmeier
Collaborator

@pmeier pmeier commented Oct 21, 2021

Contributor

@bjuncek bjuncek left a comment

Generally looks ok to me, but I'm slightly worried that this might add some overhead for users adding datasets, no?

Also, is there a way to disable shufflers for testing? These tend to make things quite slow.

raise AssertionError(f"Loading the dataset should return an IterDataPipe, but got {type(dataset)} instead.")

@builtin_datasets
def test_sample(self, dataset, mock_info):
Contributor

Is it OK to test just a single sample?
In the function below we test num_samples, but not the validity of the samples themselves.

Collaborator Author

@pmeier pmeier Nov 3, 2021

I feel like it is reasonable to assume that if the first sample is valid, all the others will be as well. Otherwise there would need to be some weird branching within the dataset to cause this, and that should be caught during code review.

Of course, we could also merge test_sample and test_num_samples to check the validity of every sample. But then the question arises: do we need to check all samples for every single test? For example, what about the decoding tests? Should we also check every sample there? If not, what is the difference from test_sample?
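
(For illustration, a merged test could look roughly like the sketch below; it assumes the samples are dictionaries and that mock_info exposes the expected number of samples, which may not match the actual fixtures in this PR.)

@builtin_datasets
def test_samples(self, dataset, mock_info):
    # Iterate the datapipe once, checking the structure of every sample and
    # the total count in a single pass.
    num_samples = 0
    for sample in dataset:
        assert isinstance(sample, dict)
        assert sample  # no sample should be empty
        num_samples += 1

    assert num_samples == mock_info["num_samples"]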

Contributor

I feel like it's hard to do this without clogging up the CI. I'd say the first one is ok for 90% of the cases, but I think we should rethink these for the video datasets with clips, to somehow include a randomly drawn sample, because these can be a bit unstable.

Collaborator Author

If you are thinking about the video utilities you added in #4838, I don't think we need to change anything here. Since they are separate from the datasets, we can also test them separately.

Comment on lines +80 to +83
f"No mock data available for dataset '{name}'. "
f"Did you add a new dataset, but forget to provide mock data for it? "
f"Did you register the mock data function with `@DatasetMocks.register_mock_data_fn`?"
)
Contributor

Perhaps I don't understand the decorators that well, but it looks like you put most of the mock data functions here - won't this make this file huge and somewhat unreadable once we add more and more datasets?
This also adds one more hoop to jump through when adding datasets. Should we perhaps add this mock data in the datasets themselves?

Would it make more sense to add

Collaborator Author

Perhaps I don't understand the decorators that well, but it looks like you put most of the mock data functions here - won't this make this file huge and somewhat unreadable once we add more and more datasets?

Unfortunately, yes.

Should we perhaps add this mock data in the datasets themselves?

That is a fair question. In general, I like the idea of having the mock data close to the "actual" data. The problem I see is that we would need to pull some pure test utilities such as make_tar or make_image_folder inside the torchvision package. I'm not opposed to that, but it would break the current paradigm that our tests are kind of "standalone".

PyTorch core also has a testing namespace with an _internal submodule where a lot of this functionality lives. We could also do that. In light of #4721, we could also place some specialized comparison / generation functions there for downstream libraries to use. @NicolasHug
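
(To make the registration pattern concrete, here is a minimal sketch of how a decorator like the `@DatasetMocks.register_mock_data_fn` mentioned in the error message above could work; the names and signatures are assumptions and need not match the actual implementation in this PR.)

class DatasetMocks:
    def __init__(self):
        self._mock_data_fns = {}

    def register_mock_data_fn(self, fn):
        # The name of the decorated function doubles as the name of the
        # dataset it provides mock data for.
        self._mock_data_fns[fn.__name__] = fn
        return fn


dataset_mocks = DatasetMocks()


@dataset_mocks.register_mock_data_fn
def caltech101(info, root):
    # Assumed signature: generate a tiny fake archive under `root` and return
    # the number of samples it contains, so the tests know what to expect.
    ...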

@pmeier
Collaborator Author

pmeier commented Nov 3, 2021

Generally looks ok to me, but I'm slightly worried that this might add some overhead for users adding datasets, no?

Overhead as discussed in #4682 (comment), or overhead in general, because the contributor also has to provide mock data along with the dataset?

If you mean the latter, then yes, this is "overhead", but it is well spent. If you have a look at the changes in this PR, I've found bugs in ImageNet, MNIST (and variants), as well as Caltech101 that were not caught by me or anyone else during code review.

Plus, we currently also require a contributor to add tests including mock data for our legacy datasets API. In fact, IMO, it will be easier than it currently is, since users only need to add a single function for the mock data, rather than provide a test case and override custom methods.

Also, is there a way to disable shufflers for testing? These tend to make things quite slow.

I don't think the option is already available. IIRC, it will be available through the DataLoader. cc @ejguan @vitaly-fedyunin

@ejguan
Contributor

ejguan commented Nov 3, 2021

I don't think the option is already available. IIRC, it will be available through the DataLoader

Correct. I think we will provide a switch in the DataLoader to turn off all shufflers in the future. As a workaround, you may want to only add the shuffler to the pipeline when needed.
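
(A rough sketch of that workaround, assuming the torchdata datapipes API; this is a hypothetical helper, not the pattern used in this PR.)

from torchdata.datapipes.iter import IterableWrapper


def build_pipe(samples, *, shuffle=True, buffer_size=1000):
    # Only attach the shuffler when requested, so that e.g. tests can build a
    # deterministic pipe with shuffle=False.
    dp = IterableWrapper(samples)
    if shuffle:
        dp = dp.shuffle(buffer_size=buffer_size)
    return dp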

@pmeier
Collaborator Author

pmeier commented Nov 3, 2021

@ejguan

As a workaround, you may want to only add the shuffler to the pipeline when needed.

One should never use a dataset without shuffling, and shuffling has to happen inside our implementation, i.e. before decoding, for memory reasons. So unless we have a switch to turn it off through the data loader, we need to keep it in.

@ejguan
Contributor

ejguan commented Nov 3, 2021

Given that one should never use a dataset without shuffling and shuffling has to happen inside our implementation, unless we have a switch to turn it off through the data loader, we need to keep it in.

Understood. And the DataLoader does have an option to optionally attach a shuffler at the end as a hack to turn shuffling on/off. The reason I didn't propose it to you is that global shuffling is required, as we discussed. Shuffling at the end with an unlimited buffer is kind of crazy, especially considering that all the data are images.

Member

@fmassa fmassa left a comment

LGTM, thanks!

Let me know if you want to get this merged now, or wait to refactor following the comment from Bruno.

@fmassa
Member

fmassa commented Nov 3, 2021

One thing to keep in mind: ideally, our datasets wouldn't be stochastic in the standard case (so next(iter(ds)) always returns the same thing).
If users are always expected to use the dataset within a DataLoader, then that's fine, but this should be clear from the docs, etc.
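
(A tiny sketch of the property described above, assuming dict samples; this is a hypothetical check, not part of this PR.)

def assert_first_sample_deterministic(make_dataset):
    # Construct the dataset twice and pull one sample from each; without a
    # shuffler (or with a seeded one), next(iter(ds)) should be reproducible.
    first = next(iter(make_dataset()))
    second = next(iter(make_dataset()))
    assert first.keys() == second.keys()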

Comment on lines +102 to +106
def _decoder(self, dataset_type):
    if dataset_type == datasets.utils.DatasetType.RAW:
        return datasets.decoder.raw
    else:
        return lambda file: file.close()
Collaborator Author

@bjuncek This is the decoder we are passing to each dataset during the tests unless we manually specify a different one. Thus, although the video datasets do not use a decoder by default through the normal API, they will use one during testing.
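
(For context, a rough usage sketch; the decoder keyword of the prototype load entry point is an assumption here and may not match the actual API.)

from torchvision.prototype import datasets

# Assumed call: load a dataset with an explicit decoder, mirroring what the
# test helper above does for RAW datasets.
dataset = datasets.load("mnist", decoder=datasets.decoder.raw)
sample = next(iter(dataset))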

@pmeier pmeier merged commit 49ec677 into pytorch:main Nov 4, 2021
@pmeier pmeier deleted the datasets/tests/builtin branch November 4, 2021 08:51
facebook-github-bot pushed a commit to pytorch/data that referenced this pull request Nov 5, 2021
Summary:
Addresses #69 for `torchvision`. Note that the workflow will fail for another ~24 hours, since pytorch/vision#4682 will only then be included in the nightly release.

Pull Request resolved: #96

Reviewed By: wenleix

Differential Revision: D32171429

Pulled By: ejguan

fbshipit-source-id: 3b5cd0a56bbcc86672a62d047e4491a433811d6a
facebook-github-bot pushed a commit that referenced this pull request Nov 8, 2021
Summary:
* add tests for builtin prototype datasets

* fix caltech101

* fix emnist

* fix mnist and variants

* add iopath as test requirement

* fix MNIST warning

* fix qmnist data generation

* fix cifar data generation

* add tests for imagenet

* cleanup

Reviewed By: kazhang

Differential Revision: D32216667

fbshipit-source-id: 4efc2b61574f4523ce31d70c85cc2d5150f3a721
cyyever pushed a commit to cyyever/vision that referenced this pull request Nov 16, 2021
* add tests for builtin prototype datasets

* fix caltech101

* fix emnist

* fix mnist and variants

* add iopath as test requirement

* fix MNIST warning

* fix qmnist data generation

* fix cifar data generation

* add tests for imagenet

* cleanup