Skip to content

Conversation

adamjstewart
Copy link
Contributor

In #4098 I introduced a new archival extraction method for RAR files. However, it was decided that we want to avoid adding extract dependencies on third-party libraries. Instead, dataset authors can write their own custom extraction and decompression routines and add them to the maps in torchvision.datasets.utils.

The only change that this PR makes is to make these maps public. If anyone wants to change the global variable names, now is your last chance!

@pmeier @NicolasHug @calebrob6

Closes #4098 for now...

@adamjstewart
Copy link
Contributor Author

Out of curiosity, is it normal that so many tests fail for reasons unrelated to the PR? CI seems more fragile than I would expect...

@adamjstewart
Copy link
Contributor Author

P.S. Has anyone ever thought about releasing a standalone library with all of these utilities? I'm sure there's a lot of duplication between torchvision, torchaudio, and whatever other extensions PyTorch has. If we did that my torchgeo library could depend on torchutils instead of wrapping around torchvision.

@pmeier
Copy link
Collaborator

pmeier commented Jun 25, 2021

Out of curiosity, is it normal that so many tests fail for reasons unrelated to the PR? CI seems more fragile than I would expect...

Unfortunately, it is. Our master branch depends on PyTorch's master. If they break something that affects our workflow, our CI is now failing and the only thing we can do is wait for a fix.

If you look into the failing runs, it is not our tests that fail, but the build workflows:

CMake Error in CMakeLists.txt:
  Imported target "torch" includes non-existent path

    "/opt/conda/conda-bld/pytorch_1624518393304/work/torch/lib"

🤷

Copy link
Collaborator

@pmeier pmeier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @adamjstewart! One question though: should we document this somehow?

@pmeier pmeier requested a review from NicolasHug June 25, 2021 06:02
Copy link
Member

@NicolasHug NicolasHug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @adamjstewart ,

In #4098 it seemed that only ARCHIVE_EXTRACTORS was needed for users to register an unrar callable.

Could you detail your use-case for the rest of the dictionaries being made public here? Unless there's a strong and clear need for making those public, I'd prefer to keep them private. It is clear to me for ARCHIVE_EXTRACTORS, but not for the rest.

One question though: should we document this somehow?

I'm usually that guy who says "this should be documented" 5 times a day, but in this case we can probably keep it hidden: those who truly need it like @adamjstewart will find it anyway, and the rest don't need to know about it. Eventually, when we have a clearer idea of our user docs and our developer docs, we can start documenting this.

@adamjstewart
Copy link
Contributor Author

should we document this somehow?

It will show up in the API docs, that's probably sufficient.

Could you detail your use-case for the rest of the dictionaries being made public here?

If ARCHIVE_EXTRACTORS was the only interface that needed to be made public, I wouldn't have had to submit #4097. With this PR, no matter what weird compression/archival scheme I encounter, I can immediately use torchvision without having to submit a new PR. Based on https://en.wikipedia.org/wiki/List_of_archive_formats, there are a lot of compression/archival schemes that we do not yet support. For example, .7z is becoming pretty popular, and doesn't have builtin Python support.

@NicolasHug
Copy link
Member

Thanks for the feedback @adamjstewart

Sorry to be so conservative, but I'd prefer to only make public the things that we know we need to make public with an actual, immediate need. I'm happy to consider making ARCHIVE_EXTRACTORS public in the future, with a tangible use-case. But until then, as library maintainers, we need to be mindful of what it means to make something public: it always means maintenance cost with more work for us, and less flexibility to add new features that users and third-parth libraries (like yours) could benefit from.

@adamjstewart
Copy link
Contributor Author

Here are some tangible use-cases for datasets that could be added if we added support for these decompressors/extractors:

@NicolasHug
Copy link
Member

There is no plan to vendor them in torchvision.
Do you have plans to vendor them in torchgeo?

@adamjstewart
Copy link
Contributor Author

Nope, but given that I've already found 3 extensions that torchvision did not yet support (.rar, .bz2, .tbz) and I've only added 5 datasets to torchgeo, I imagine that I'll be back in a week or two to add support for a new compression/archive scheme 😄

@adamjstewart
Copy link
Contributor Author

I can make only ARCHIVE_EXTRACTORS public for now, but since they all share the same purpose, it feels weird to have only one of them public.

@fmassa
Copy link
Member

fmassa commented Jun 29, 2021

I think a proper way of doing this would be to provide a registration mechanism that allow users to implement the set of methods that are required for extraction to work.

Something like

def register_extractor(arg1, arg2):
    _ ARCHIVE_EXTRACTORS[arg1] = arg2
    # do more work

From this perspective, I agree with @NicolasHug that we are actually touching internal APIs, and we can definitely do better here. Making those maps public might go in the wrong direction, as we will have to keep BC and we might want to move into more feature-complete approaches in the future, like the one followed by TensorFlow Datasets.

So from this perspective, I would prefer if we avoid making those arguments public

@adamjstewart
Copy link
Contributor Author

So how do we want to move forward with this? Do we want to:

  • Make _ARCHIVE_EXTRACTORS public, or add a public register_extractor function?
  • Do this just for archive extraction or also for decompression/aliases?
  • Move all of the download/decompression/extraction stuff into a separate library shared by torchvision/torchtext/torchaudio/torchgeo/etc.? Either into torch itself or a separate torchutils library?

Do torchaudio and torchtext have their own logic for this kind of download/decompression stuff? It feels like there's a lot of duplication between these libraries. I thought about duplicating this same logic myself in torchgeo but the download logic is pretty complex, especially for Google Drive.

@adamjstewart
Copy link
Contributor Author

I ended up writing custom extraction logic for torchgeo instead of trying to upstream it to torchvision, so this PR is no longer necessary!

@adamjstewart adamjstewart deleted the features/public-extract-decompress branch July 24, 2021 03:12
@fmassa
Copy link
Member

fmassa commented Aug 13, 2021

Hey @adamjstewart

Sorry for the delay in replying, many of us were on holidays in the past month.

Yes, both torchaudio and torchtext have also similar requirements, and I think that we should upstream all this logic into a common location that can be used by downstream projects.
Maybe this should live in PyTorch, or in a separate library, but this is something we need to figure out.

cc @parmeet @mthrok @NicolasHug we should discuss again about those points in our next meeting

@adamjstewart
Copy link
Contributor Author

If that effort happens let me know, I would love to contribute!

@parmeet
Copy link

parmeet commented Aug 13, 2021

Hey @adamjstewart

Sorry for the delay in replying, many of us were on holidays in the past month.

Yes, both torchaudio and torchtext have also similar requirements, and I think that we should upstream all this logic into a common location that can be used by downstream projects.
Maybe this should live in PyTorch, or in a separate library, but this is something we need to figure out.

cc @parmeet @mthrok @NicolasHug we should discuss again about those points in our next meeting

sounds good! BTW, we already have extractors implemented as part of pytorch data APIs. May be based on user-needs more extractors can live here in parallel.

@fmassa
Copy link
Member

fmassa commented Aug 13, 2021

@parmeet good call, indeed we already provide DataPipes for extracting on-the-fly some formats.

We should probably figure out how to best use it to replace our current use-cases in torchvision for the old-style datasets (DataPipes-based datasets will be using the aforementioned DataPipes)

@mthrok
Copy link
Contributor

mthrok commented Aug 17, 2021

Yes, both torchaudio and torchtext have also similar requirements, and I think that we should upstream all this logic into a common location that can be used by downstream projects.
Maybe this should live in PyTorch, or in a separate library, but this is something we need to figure out.

BTW, we already have extractors implemented as part of pytorch data APIs. May be based on user-needs more extractors can live here in parallel.

For the sake of future discussion, I would like to add a node that we should factor in maintenance cost of having the common stuff. For this particularly case of data utility, so far it sounds to me that somewhere in a PyTorch data module is best suited. I think that's a reasonable approach.

@adamjstewart
Copy link
Contributor Author

I'm not very familiar with data pipes, but it seems like those extractors are designed to only allow for reading from compressed files, they can't be used to extract a compressed file. There are certainly use cases for reading from a compressed file, but there are also use cases for uncompressing a file first before reading.

I'm a bit worried about putting all of this logic directly in PyTorch since most of the PyTorch issues I open get ignored in the sea of 6K+ issues. I'm also concerned that I would get the same kind of pushback against adding dependencies like rarfile/unrar or changing private APIs to public APIs if we did things in PyTorch. I also need a dependency on zipfile-deflate64 for another extractor.

Of course, if there isn't an issue with adding these kind of optional dependencies to PyTorch, then I'm fine with whatever solution we choose.

@parmeet
Copy link

parmeet commented Aug 17, 2021

cc: @ejguan @VitalyFedyunin

@ejguan
Copy link
Contributor

ejguan commented Aug 17, 2021

@adamjstewart

Agreed with @mthrok 's argument that a shared utility DataPipe across domains should be a better strategy.

it seems like those extractors are designed to only allow for reading from compressed files, they can't be used to extract a compressed file.

In the core repo, you are right that the existing ReaderDataPipe only support read from compressed file. And, thanks to @pmeier , I am convinced to have RarReader in torchdata. In terms of ExtractorDataPipe, if there are use cases, we definitely want to have them in torchdata.

I'm also concerned that I would get the same kind of pushback against adding dependencies like rarfile/unrar or changing private APIs to public APIs if we did things in PyTorch. I also need a dependency on zipfile-deflate64 for another extractor.

We will have these DataPipes landed in a separate repo targeting data utility. And, we don't need to enforce these third-party libraries as the dependency for the torchdata repo, because we can lazily import them in the corresponding DataPipe.

class ReadFromRar(IterDataPipe):
    def __init__(self):
        import rarfile
        self.rarfile = rarfile

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants