add tests for prototype builtin datasets #4682

Merged
pmeier merged 19 commits into pytorch:main from datasets/tests/builtin on Nov 4, 2021

Conversation

pmeier
Collaborator

@pmeier pmeier commented Oct 21, 2021

Contributor

@bjuncek bjuncek left a comment

Generally looks ok to me, but I'm slightly worried that this might add some overhead for users adding datasets, no?

Also, is there a way to disable shufflers for testing? These tend to make things quite slow.

raise AssertionError(f"Loading the dataset should return an IterDataPipe, but got {type(dataset)} instead.")

@builtin_datasets
def test_sample(self, dataset, mock_info):
Contributor

Is it OK to test just a single sample?
In the function below we test num_samples, but not the validity of the samples themselves.

Collaborator Author

@pmeier pmeier Nov 3, 2021

I feel like it is reasonable to assume that if the first sample is valid, all the others will be as well. Otherwise there would need to be some weird branching within the dataset to cause this, and that should be caught during code review.

Of course, we could also merge test_sample and test_num_samples to check the validity of every sample. But then the question arises: do we need to check all samples for every single test? For example, what about the decoding tests? Should we also check every sample there? If not, what is the difference from test_sample?
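
(For illustration, a merged test could look roughly like the sketch below; it assumes the samples are dictionaries and that mock_info exposes the expected number of samples, which may not match the actual fixtures in this PR.)

@builtin_datasets
def test_samples(self, dataset, mock_info):
    # Iterate the datapipe once, checking the structure of every sample and
    # the total count in a single pass.
    num_samples = 0
    for sample in dataset:
        assert isinstance(sample, dict)
        assert sample  # no sample should be empty
        num_samples += 1

    assert num_samples == mock_info["num_samples"]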

Contributor

I feel like it's hard to do this without clogging up the CI. I'd say the first one is ok for 90% of the cases, but I think we should rethink these for the video datasets with clips, to somehow include a randomly drawn sample, because these can be a bit unstable.

Collaborator Author

If you are thinking about the video utilities you added in #4838, I don't think we need to change anything here. Since they are separate from the datasets, we can also test them separately.

Comment on lines +80 to +83
f"No mock data available for dataset '{name}'. "
f"Did you add a new dataset, but forget to provide mock data for it? "
f"Did you register the mock data function with `@DatasetMocks.register_mock_data_fn`?"
)
Contributor

Perhaps I don't understand the decorators that well, but it looks like you put most of the mock data functions here - won't this make this file huge and somewhat unreadable once we add more and more datasets?
This also adds one more hoop to jump through when adding datasets. Should we perhaps add this mock data in the datasets themselves?

Would it make more sense to add

Collaborator Author

Perhaps I don't understand the decorators that well, but it looks like you put most of the mock data functions here - won't this make this file huge and somewhat unreadable once we add more and more datasets?

Unfortunately, yes.

Should we perhaps add this mock data in the datasets themselves?

That is a fair question. In general, I like the idea of having the mock data close to the "actual" data. The problem I see is that we would need to pull some pure test utilities such as make_tar or make_image_folder inside the torchvision package. I'm not opposed to that, but it would break the current paradigm that our tests are kind of "standalone".

PyTorch core also has a testing namespace with an _internal submodule where a lot of this functionality lives. We could also do that. In light of #4721, we could also place some specialized comparison / generation functions there for downstream libraries to use. @NicolasHug
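
(To make the registration pattern concrete, here is a minimal sketch of how a decorator like the `@DatasetMocks.register_mock_data_fn` mentioned in the error message above could work; the names and signatures are assumptions and need not match the actual implementation in this PR.)

class DatasetMocks:
    def __init__(self):
        self._mock_data_fns = {}

    def register_mock_data_fn(self, fn):
        # The name of the decorated function doubles as the name of the
        # dataset it provides mock data for.
        self._mock_data_fns[fn.__name__] = fn
        return fn


dataset_mocks = DatasetMocks()


@dataset_mocks.register_mock_data_fn
def caltech101(info, root):
    # Assumed signature: generate a tiny fake archive under `root` and return
    # the number of samples it contains, so the tests know what to expect.
    ...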

@pmeier
Collaborator Author

pmeier commented Nov 3, 2021

Generally looks ok to me, but I'm slightly worried that this might add some overhead for users adding datasets, no?

Overhead as discussed in #4682 (comment), or overhead in general, because the contributor also has to provide mock data along with the dataset?

If you mean the latter, then yes, this is "overhead", but it is well spent. If you have a look at the changes in this PR, I've found bugs in ImageNet, MNIST (and variants), as well as Caltech101 that were not caught by me or anyone else during code review.

Plus, we currently also require a contributor to add tests including mock data for our legacy datasets API. In fact, IMO, it will be easier than it currently is, since users only need to add a single function for the mock data, rather than provide a test case and override custom methods.

Also, is there a way to disable shufflers for testing? These tend to make things quite slow.

I don't think the option is already available. IIRC, it will be available through the DataLoader. cc @ejguan @vitaly-fedyunin

@ejguan
Contributor

ejguan commented Nov 3, 2021

I don't think the option is already available. IIRC, it will be available through the DataLoader

Correct. I think we will provide a switch in the DataLoader to turn off all shufflers in the future. As a workaround, you may want to only add the shuffler to the pipeline when needed.
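
(A rough sketch of that workaround, assuming the torchdata datapipes API; this is a hypothetical helper, not the pattern used in this PR.)

from torchdata.datapipes.iter import IterableWrapper


def build_pipe(samples, *, shuffle=True, buffer_size=1000):
    # Only attach the shuffler when requested, so that e.g. tests can build a
    # deterministic pipe with shuffle=False.
    dp = IterableWrapper(samples)
    if shuffle:
        dp = dp.shuffle(buffer_size=buffer_size)
    return dp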

@pmeier
Collaborator Author

pmeier commented Nov 3, 2021

@ejguan

As a workaround, you may want to only add the shuffler to the pipeline when needed.

One should never use a dataset without shuffling, and shuffling has to happen inside our implementation, i.e. before decoding, for memory reasons. So unless we have a switch to turn it off through the data loader, we need to keep it in.

@ejguan
Contributor

ejguan commented Nov 3, 2021

Given that one should never use a dataset without shuffling and shuffling has to happen inside our implementation, unless we have a switch to turn it off through the data loader, we need to keep it in.

Understood. And the DataLoader does have an option to optionally attach a shuffler at the end as a hack to turn shuffling on/off. The reason I didn't propose it to you is that global shuffling is required, as we discussed. Shuffling at the end with an unlimited buffer is kind of crazy, especially considering that all the data are images.

Member

@fmassa fmassa left a comment

LGTM, thanks!

Let me know if you want to get this merged now, or wait to refactor following the comment from Bruno.

@fmassa
Member

fmassa commented Nov 3, 2021

One thing to keep in mind: ideally, our datasets wouldn't be stochastic in the standard case (so next(iter(ds)) always returns the same thing).
If users are always expected to use the dataset within a DataLoader, then that's fine, but this should be clear from the docs, etc.
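
(A tiny sketch of the property described above, assuming dict samples; this is a hypothetical check, not part of this PR.)

def assert_first_sample_deterministic(make_dataset):
    # Construct the dataset twice and pull one sample from each; without a
    # shuffler (or with a seeded one), next(iter(ds)) should be reproducible.
    first = next(iter(make_dataset()))
    second = next(iter(make_dataset()))
    assert first.keys() == second.keys()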

Comment on lines +102 to +106
def _decoder(self, dataset_type):
    if dataset_type == datasets.utils.DatasetType.RAW:
        return datasets.decoder.raw
    else:
        return lambda file: file.close()
Collaborator Author

@bjuncek This is the decoder we are passing to each dataset during the tests unless we manually specify a different one. Thus, although the video datasets do not use a decoder by default through the normal API, they will use one during testing.
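
(For context, a rough usage sketch; the decoder keyword of the prototype load entry point is an assumption here and may not match the actual API.)

from torchvision.prototype import datasets

# Assumed call: load a dataset with an explicit decoder, mirroring what the
# test helper above does for RAW datasets.
dataset = datasets.load("mnist", decoder=datasets.decoder.raw)
sample = next(iter(dataset))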

@pmeier pmeier merged commit 49ec677 into pytorch:main Nov 4, 2021
@pmeier pmeier deleted the datasets/tests/builtin branch November 4, 2021 08:51
facebook-github-bot pushed a commit to pytorch/data that referenced this pull request Nov 5, 2021
Summary:
Addresses #69 for `torchvision`. Note that the workflow will fail for another ~24 hours, since pytorch/vision#4682 will only then be included in the nightly release.

Pull Request resolved: #96

Reviewed By: wenleix

Differential Revision: D32171429

Pulled By: ejguan

fbshipit-source-id: 3b5cd0a56bbcc86672a62d047e4491a433811d6a
facebook-github-bot pushed a commit that referenced this pull request Nov 8, 2021
Summary:
* add tests for builtin prototype datasets

* fix caltech101

* fix emnist

* fix mnist and variants

* add iopath as test requirement

* fix MNIST warning

* fix qmnist data generation

* fix cifar data generation

* add tests for imagenet

* cleanup

Reviewed By: kazhang

Differential Revision: D32216667

fbshipit-source-id: 4efc2b61574f4523ce31d70c85cc2d5150f3a721
cyyever pushed a commit to cyyever/vision that referenced this pull request Nov 16, 2021
* add tests for builtin prototype datasets

* fix caltech101

* fix emnist

* fix mnist and variants

* add iopath as test requirement

* fix MNIST warning

* fix qmnist data generation

* fix cifar data generation

* add tests for imagenet

* cleanup