-
Notifications
You must be signed in to change notification settings - Fork 7.2k
Make extraction and decompression maps public #4113
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make extraction and decompression maps public #4113
Conversation
Out of curiosity, is it normal that so many tests fail for reasons unrelated to the PR? CI seems more fragile than I would expect... |
P.S. Has anyone ever thought about releasing a standalone library with all of these utilities? I'm sure there's a lot of duplication between torchvision, torchaudio, and whatever other extensions PyTorch has. If we did that my torchgeo library could depend on torchutils instead of wrapping around torchvision. |
Unfortunately, it is. Our master branch depends on PyTorch's master. If they break something that affects our workflow, our CI is now failing and the only thing we can do is wait for a fix. If you look into the failing runs, it is not our tests that fail, but the build workflows:
🤷 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks @adamjstewart! One question though: should we document this somehow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @adamjstewart ,
In #4098 it seemed that only ARCHIVE_EXTRACTORS
was needed for users to register an unrar callable.
Could you detail your use-case for the rest of the dictionaries being made public here? Unless there's a strong and clear need for making those public, I'd prefer to keep them private. It is clear to me for ARCHIVE_EXTRACTORS
, but not for the rest.
One question though: should we document this somehow?
I'm usually that guy who says "this should be documented" 5 times a day, but in this case we can probably keep it hidden: those who truly need it like @adamjstewart will find it anyway, and the rest don't need to know about it. Eventually, when we have a clearer idea of our user docs and our developer docs, we can start documenting this.
It will show up in the API docs, that's probably sufficient.
If |
Thanks for the feedback @adamjstewart Sorry to be so conservative, but I'd prefer to only make public the things that we know we need to make public with an actual, immediate need. I'm happy to consider making |
Here are some tangible use-cases for datasets that could be added if we added support for these decompressors/extractors:
|
There is no plan to vendor them in torchvision. |
Nope, but given that I've already found 3 extensions that torchvision did not yet support (.rar, .bz2, .tbz) and I've only added 5 datasets to torchgeo, I imagine that I'll be back in a week or two to add support for a new compression/archive scheme 😄 |
I can make only |
I think a proper way of doing this would be to provide a registration mechanism that allow users to implement the set of methods that are required for extraction to work. Something like def register_extractor(arg1, arg2):
_ ARCHIVE_EXTRACTORS[arg1] = arg2
# do more work From this perspective, I agree with @NicolasHug that we are actually touching internal APIs, and we can definitely do better here. Making those maps public might go in the wrong direction, as we will have to keep BC and we might want to move into more feature-complete approaches in the future, like the one followed by TensorFlow Datasets. So from this perspective, I would prefer if we avoid making those arguments public |
So how do we want to move forward with this? Do we want to:
Do |
I ended up writing custom extraction logic for torchgeo instead of trying to upstream it to torchvision, so this PR is no longer necessary! |
Hey @adamjstewart Sorry for the delay in replying, many of us were on holidays in the past month. Yes, both cc @parmeet @mthrok @NicolasHug we should discuss again about those points in our next meeting |
If that effort happens let me know, I would love to contribute! |
sounds good! BTW, we already have extractors implemented as part of pytorch data APIs. May be based on user-needs more extractors can live here in parallel. |
@parmeet good call, indeed we already provide DataPipes for extracting on-the-fly some formats. We should probably figure out how to best use it to replace our current use-cases in torchvision for the old-style datasets (DataPipes-based datasets will be using the aforementioned DataPipes) |
For the sake of future discussion, I would like to add a node that we should factor in maintenance cost of having the common stuff. For this particularly case of data utility, so far it sounds to me that somewhere in a PyTorch data module is best suited. I think that's a reasonable approach. |
I'm not very familiar with data pipes, but it seems like those extractors are designed to only allow for reading from compressed files, they can't be used to extract a compressed file. There are certainly use cases for reading from a compressed file, but there are also use cases for uncompressing a file first before reading. I'm a bit worried about putting all of this logic directly in PyTorch since most of the PyTorch issues I open get ignored in the sea of 6K+ issues. I'm also concerned that I would get the same kind of pushback against adding dependencies like rarfile/unrar or changing private APIs to public APIs if we did things in PyTorch. I also need a dependency on zipfile-deflate64 for another extractor. Of course, if there isn't an issue with adding these kind of optional dependencies to PyTorch, then I'm fine with whatever solution we choose. |
Agreed with @mthrok 's argument that a shared utility DataPipe across domains should be a better strategy.
In the core repo, you are right that the existing
We will have these DataPipes landed in a separate repo targeting data utility. And, we don't need to enforce these third-party libraries as the dependency for the torchdata repo, because we can lazily import them in the corresponding DataPipe. class ReadFromRar(IterDataPipe):
def __init__(self):
import rarfile
self.rarfile = rarfile |
In #4098 I introduced a new archival extraction method for RAR files. However, it was decided that we want to avoid adding extract dependencies on third-party libraries. Instead, dataset authors can write their own custom extraction and decompression routines and add them to the maps in
torchvision.datasets.utils
.The only change that this PR makes is to make these maps public. If anyone wants to change the global variable names, now is your last chance!
@pmeier @NicolasHug @calebrob6
Closes #4098 for now...