Make extraction and decompression maps public #4113

adamjstewart · 2021-06-24T19:48:48Z

In #4098 I introduced a new archival extraction method for RAR files. However, it was decided that we want to avoid adding extract dependencies on third-party libraries. Instead, dataset authors can write their own custom extraction and decompression routines and add them to the maps in torchvision.datasets.utils.

The only change that this PR makes is to make these maps public. If anyone wants to change the global variable names, now is your last chance!

@pmeier @NicolasHug @calebrob6

Closes #4098 for now...

adamjstewart · 2021-06-24T20:05:55Z

Out of curiosity, is it normal that so many tests fail for reasons unrelated to the PR? CI seems more fragile than I would expect...

adamjstewart · 2021-06-24T22:57:36Z

P.S. Has anyone ever thought about releasing a standalone library with all of these utilities? I'm sure there's a lot of duplication between torchvision, torchaudio, and whatever other extensions PyTorch has. If we did that my torchgeo library could depend on torchutils instead of wrapping around torchvision.

pmeier · 2021-06-25T06:01:36Z

Out of curiosity, is it normal that so many tests fail for reasons unrelated to the PR? CI seems more fragile than I would expect...

Unfortunately, it is. Our master branch depends on PyTorch's master. If they break something that affects our workflow, our CI is now failing and the only thing we can do is wait for a fix.

If you look into the failing runs, it is not our tests that fail, but the build workflows:

CMake Error in CMakeLists.txt:
  Imported target "torch" includes non-existent path

    "/opt/conda/conda-bld/pytorch_1624518393304/work/torch/lib"

🤷

pmeier

LGTM, thanks @adamjstewart! One question though: should we document this somehow?

NicolasHug

Hi @adamjstewart ,

In #4098 it seemed that only ARCHIVE_EXTRACTORS was needed for users to register an unrar callable.

Could you detail your use-case for the rest of the dictionaries being made public here? Unless there's a strong and clear need for making those public, I'd prefer to keep them private. It is clear to me for ARCHIVE_EXTRACTORS, but not for the rest.

One question though: should we document this somehow?

I'm usually that guy who says "this should be documented" 5 times a day, but in this case we can probably keep it hidden: those who truly need it like @adamjstewart will find it anyway, and the rest don't need to know about it. Eventually, when we have a clearer idea of our user docs and our developer docs, we can start documenting this.

adamjstewart · 2021-06-25T15:07:59Z

should we document this somehow?

It will show up in the API docs, that's probably sufficient.

Could you detail your use-case for the rest of the dictionaries being made public here?

If ARCHIVE_EXTRACTORS was the only interface that needed to be made public, I wouldn't have had to submit #4097. With this PR, no matter what weird compression/archival scheme I encounter, I can immediately use torchvision without having to submit a new PR. Based on https://en.wikipedia.org/wiki/List_of_archive_formats, there are a lot of compression/archival schemes that we do not yet support. For example, .7z is becoming pretty popular, and doesn't have builtin Python support.

NicolasHug · 2021-06-25T15:14:37Z

Thanks for the feedback @adamjstewart

Sorry to be so conservative, but I'd prefer to only make public the things that we know we need to make public with an actual, immediate need. I'm happy to consider making ARCHIVE_EXTRACTORS public in the future, with a tangible use-case. But until then, as library maintainers, we need to be mindful of what it means to make something public: it always means maintenance cost with more work for us, and less flexibility to add new features that users and third-parth libraries (like yours) could benefit from.

adamjstewart · 2021-06-25T15:33:59Z

Here are some tangible use-cases for datasets that could be added if we added support for these decompressors/extractors:

.7z: https://www.kaggle.com/davidjfisher/illinois-doc-labeled-faces-dataset
.tar.lz: https://www.kaggle.com/solyoh21/68857tt7g7g7g7557
.lzma: https://www.kaggle.com/wanchaoyao/sakt-models
.tar.z: https://www.kaggle.com/noaa/noaa-precipitation-15-minute
.tar.zst: https://www.kaggle.com/kmahyyg/12306-captcha-2019

NicolasHug · 2021-06-25T15:36:19Z

There is no plan to vendor them in torchvision.
Do you have plans to vendor them in torchgeo?

adamjstewart · 2021-06-25T15:58:17Z

Nope, but given that I've already found 3 extensions that torchvision did not yet support (.rar, .bz2, .tbz) and I've only added 5 datasets to torchgeo, I imagine that I'll be back in a week or two to add support for a new compression/archive scheme 😄

adamjstewart · 2021-06-25T15:59:13Z

I can make only ARCHIVE_EXTRACTORS public for now, but since they all share the same purpose, it feels weird to have only one of them public.

fmassa · 2021-06-29T13:50:57Z

I think a proper way of doing this would be to provide a registration mechanism that allow users to implement the set of methods that are required for extraction to work.

Something like

def register_extractor(arg1, arg2):
    _ ARCHIVE_EXTRACTORS[arg1] = arg2
    # do more work

From this perspective, I agree with @NicolasHug that we are actually touching internal APIs, and we can definitely do better here. Making those maps public might go in the wrong direction, as we will have to keep BC and we might want to move into more feature-complete approaches in the future, like the one followed by TensorFlow Datasets.

So from this perspective, I would prefer if we avoid making those arguments public

adamjstewart · 2021-06-30T02:37:58Z

So how do we want to move forward with this? Do we want to:

Make _ARCHIVE_EXTRACTORS public, or add a public register_extractor function?
Do this just for archive extraction or also for decompression/aliases?
Move all of the download/decompression/extraction stuff into a separate library shared by torchvision/torchtext/torchaudio/torchgeo/etc.? Either into torch itself or a separate torchutils library?

Do torchaudio and torchtext have their own logic for this kind of download/decompression stuff? It feels like there's a lot of duplication between these libraries. I thought about duplicating this same logic myself in torchgeo but the download logic is pretty complex, especially for Google Drive.

adamjstewart · 2021-07-24T03:12:44Z

I ended up writing custom extraction logic for torchgeo instead of trying to upstream it to torchvision, so this PR is no longer necessary!

fmassa · 2021-08-13T12:26:49Z

Hey @adamjstewart

Sorry for the delay in replying, many of us were on holidays in the past month.

Yes, both torchaudio and torchtext have also similar requirements, and I think that we should upstream all this logic into a common location that can be used by downstream projects.
Maybe this should live in PyTorch, or in a separate library, but this is something we need to figure out.

cc @parmeet @mthrok @NicolasHug we should discuss again about those points in our next meeting

adamjstewart · 2021-08-13T13:35:03Z

If that effort happens let me know, I would love to contribute!

parmeet · 2021-08-13T13:53:14Z

Hey @adamjstewart

Sorry for the delay in replying, many of us were on holidays in the past month.

Yes, both torchaudio and torchtext have also similar requirements, and I think that we should upstream all this logic into a common location that can be used by downstream projects.
Maybe this should live in PyTorch, or in a separate library, but this is something we need to figure out.

cc @parmeet @mthrok @NicolasHug we should discuss again about those points in our next meeting

sounds good! BTW, we already have extractors implemented as part of pytorch data APIs. May be based on user-needs more extractors can live here in parallel.

fmassa · 2021-08-13T14:03:02Z

@parmeet good call, indeed we already provide DataPipes for extracting on-the-fly some formats.

We should probably figure out how to best use it to replace our current use-cases in torchvision for the old-style datasets (DataPipes-based datasets will be using the aforementioned DataPipes)

mthrok · 2021-08-17T15:54:14Z

Yes, both torchaudio and torchtext have also similar requirements, and I think that we should upstream all this logic into a common location that can be used by downstream projects.
Maybe this should live in PyTorch, or in a separate library, but this is something we need to figure out.

BTW, we already have extractors implemented as part of pytorch data APIs. May be based on user-needs more extractors can live here in parallel.

For the sake of future discussion, I would like to add a node that we should factor in maintenance cost of having the common stuff. For this particularly case of data utility, so far it sounds to me that somewhere in a PyTorch data module is best suited. I think that's a reasonable approach.

adamjstewart · 2021-08-17T16:05:08Z

I'm not very familiar with data pipes, but it seems like those extractors are designed to only allow for reading from compressed files, they can't be used to extract a compressed file. There are certainly use cases for reading from a compressed file, but there are also use cases for uncompressing a file first before reading.

I'm a bit worried about putting all of this logic directly in PyTorch since most of the PyTorch issues I open get ignored in the sea of 6K+ issues. I'm also concerned that I would get the same kind of pushback against adding dependencies like rarfile/unrar or changing private APIs to public APIs if we did things in PyTorch. I also need a dependency on zipfile-deflate64 for another extractor.

Of course, if there isn't an issue with adding these kind of optional dependencies to PyTorch, then I'm fine with whatever solution we choose.

parmeet · 2021-08-17T16:16:37Z

cc: @ejguan @VitalyFedyunin

ejguan · 2021-08-17T17:45:36Z

@adamjstewart

Agreed with @mthrok 's argument that a shared utility DataPipe across domains should be a better strategy.

it seems like those extractors are designed to only allow for reading from compressed files, they can't be used to extract a compressed file.

In the core repo, you are right that the existing ReaderDataPipe only support read from compressed file. And, thanks to @pmeier , I am convinced to have RarReader in torchdata. In terms of ExtractorDataPipe, if there are use cases, we definitely want to have them in torchdata.

I'm also concerned that I would get the same kind of pushback against adding dependencies like rarfile/unrar or changing private APIs to public APIs if we did things in PyTorch. I also need a dependency on zipfile-deflate64 for another extractor.

We will have these DataPipes landed in a separate repo targeting data utility. And, we don't need to enforce these third-party libraries as the dependency for the torchdata repo, because we can lazily import them in the corresponding DataPipe.

class ReadFromRar(IterDataPipe):
    def __init__(self):
        import rarfile
        self.rarfile = rarfile

Make extraction and decompression maps public

158cac7

facebook-github-bot added the cla signed label Jun 24, 2021

pmeier approved these changes Jun 25, 2021

View reviewed changes

pmeier requested a review from NicolasHug June 25, 2021 06:02

NicolasHug reviewed Jun 25, 2021

View reviewed changes

AnirudhDagar mentioned this pull request Jun 25, 2021

Port test_datasets_utils to pytest #4114

Merged

NicolasHug mentioned this pull request Jun 29, 2021

Add RAR file extraction support #4098

Closed

adamjstewart mentioned this pull request Jul 11, 2021

Download/checksum/decompression/extraction utilities torchgeo/torchgeo#40

Closed

adamjstewart closed this Jul 24, 2021

adamjstewart deleted the features/public-extract-decompress branch July 24, 2021 03:12

Make extraction and decompression maps public #4113

Make extraction and decompression maps public #4113

Uh oh!

Conversation

adamjstewart commented Jun 24, 2021

Uh oh!

adamjstewart commented Jun 24, 2021

Uh oh!

adamjstewart commented Jun 24, 2021

Uh oh!

pmeier commented Jun 25, 2021

Uh oh!

pmeier left a comment

Choose a reason for hiding this comment

Uh oh!

NicolasHug left a comment

Choose a reason for hiding this comment

Uh oh!

adamjstewart commented Jun 25, 2021

Uh oh!

NicolasHug commented Jun 25, 2021

Uh oh!

adamjstewart commented Jun 25, 2021

Uh oh!

NicolasHug commented Jun 25, 2021

Uh oh!

adamjstewart commented Jun 25, 2021

Uh oh!

adamjstewart commented Jun 25, 2021

Uh oh!

fmassa commented Jun 29, 2021

Uh oh!

adamjstewart commented Jun 30, 2021

Uh oh!

adamjstewart commented Jul 24, 2021

Uh oh!

fmassa commented Aug 13, 2021

Uh oh!

adamjstewart commented Aug 13, 2021

Uh oh!

parmeet commented Aug 13, 2021

Uh oh!

fmassa commented Aug 13, 2021

Uh oh!

mthrok commented Aug 17, 2021

Uh oh!

adamjstewart commented Aug 17, 2021

Uh oh!

parmeet commented Aug 17, 2021

Uh oh!

ejguan commented Aug 17, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

8 participants