WIP: use pooch for dataset downloading #8679
Conversation
Cool initiative @drammock! Our datasets module has become a mess over the years.
FYI this is stalled until fatiando/pooch#223 can be solved.
Can we update our kiloword dataset in some trivial and maybe also helpful way to overcome this problem? For example, we could split it into multiple archives if we only use a subset of the files anyway. But if this is a problem with multiple datasets then maybe we should wait for an upstream fix.
The problem is not specific to kiloword; I just used that as an example because it's fairly small, so it downloads fast during testing. Pooch's built-in zip/tar extractors allow only limited control over the resulting location of the unpacked file(s). I'm working on an upstream PR now.
some interim updates: I'm working locally with my modified version of the pooch code (fatiando/pooch#224). it's working well for most datasets, but I haven't tackled the hard ones yet that allow partial downloading (eegbci, brainstorm, etc). The main downside is that, to get the benefit of pooch's hash comparison, we have to keep a copy of the archive file around on disk after unpacking it, thus ballooning the disk usage by between 50 and 90% (not sure what the avg compression ratio is for these datasets). If we don't keep the archives around, then we can still use pooch for the downloading/unpacking step, we just have to check first if the target dir exists, assume the correct files are in there, and return the path (bypassing pooch entirely). In that scenario, we would need a "force_download" flag or similar, which would download no matter what (since there's no old archive around to compare a hash to). In summary: it is hard for us to make use of one of pooch's nice features (conditional download based on hash) without doubling our footprint on disk. But it's probably still worth using pooch anyway because it handles downloading, verifying, and unpacking for us.
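For concreteness, this is roughly what that download/verify/unpack flow looks like with pooch (the URL, hash, and cache directory below are placeholders, and `Untar`'s `extract_dir` argument assumes the change proposed in fatiando/pooch#224 is available):

```python
import pooch

# Download the archive (or re-download it if the copy on disk no longer
# matches the known hash), then unpack it next to the archive.  Keeping
# the archive around is what makes the hash comparison possible on later
# calls, at the cost of the extra disk space discussed above.
unpacked_files = pooch.retrieve(
    url="https://example.com/MNE-kiloword-data.tar.gz",   # placeholder URL
    known_hash="sha256:0123456789abcdef",                 # placeholder hash
    fname="MNE-kiloword-data.tar.gz",
    path=pooch.os_cache("mne_data_example"),              # placeholder cache dir
    processor=pooch.Untar(extract_dir="MNE-kiloword-data"),
)
```

With a processor attached, `pooch.retrieve` returns the paths of the unpacked files rather than the archive itself.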
I think this is fine. This is what
We already have
What happens if you remove the archive? Does it download again?
I think if we remove the archive, then we have to just check for the presence of the expected folder, return it if it exists, and download/unpack the archive if the folder isn't there. A slightly more complicated option: check whether the archive is there; if so, check its hash and redownload if appropriate; if it's not there but the target folder already exists, just return the folder without downloading anything. This would allow users to set a "keep_archives" flag and then automatically get the latest version of the dataset if they were willing to use disk space to keep the archives around... but if they didn't want to do that they could still manually force a download with a "force_download" flag.
I'd rather skip this and add it later if we want. We don't get many complaints from people about datasets being out of date, and we don't update them very often. I'd rather expand support for our dataset
Let's do what other projects based on pooch do. Let's not try to be too clever here.
We might be in a unique situation trying to deal with large (1 GB+) archives as opposed to smaller ones, not sure... Naively, I think the simplest solution would be to use pooch to replace our
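For that simplest "just replace the downloader" case, the call is essentially a one-liner; a sketch (placeholder URL and hash, no unpacking, and the progress bar requires tqdm to be installed):

```python
import pooch

# Plain download-and-verify of one big archive, nothing else.
archive_path = pooch.retrieve(
    url="https://example.com/some-big-dataset.tar.gz",    # placeholder URL
    known_hash="sha256:0123456789abcdef",                  # placeholder hash
    fname="some-big-dataset.tar.gz",
    path=pooch.os_cache("mne_data_example"),               # placeholder cache dir
    downloader=pooch.HTTPDownloader(progressbar=True),     # progress bar for 1 GB+ files
)
```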
pooch is designed for a case where there are individual files mostly living in the same repo or on the same server. So I'm not sure that the way other projects do it is a good guide for us, since, as @larsoner says, we have big archives that often contain many files which all need to be present to be useful. I see 2 options; see the sketch below.
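Roughly, the two options could look something like this (just a sketch: the function and flag names are made up, and `extract_dir` again assumes fatiando/pooch#224):

```python
import os
import pooch

# Option 1: keep the archive on disk so pooch can hash-check it on every
# call and re-download only when the remote file changes.  Simple, but
# roughly doubles the disk footprint of each dataset.
def fetch_option1(url, known_hash, archive_name, data_dir, folder_name):
    return pooch.retrieve(
        url=url, known_hash=known_hash, fname=archive_name, path=data_dir,
        processor=pooch.Untar(extract_dir=folder_name))


# Option 2: delete the archive after unpacking (unless the user asks to
# keep it), and trust an existing unpacked folder unless force_download.
def fetch_option2(url, known_hash, archive_name, data_dir, folder_name,
                  keep_archive=False, force_download=False):
    target = os.path.join(data_dir, folder_name)
    archive = os.path.join(data_dir, archive_name)
    if (os.path.isdir(target) and not os.path.isfile(archive)
            and not force_download):
        return target  # no archive to hash-check; assume the folder is fine
    if force_download and os.path.isfile(archive):
        os.remove(archive)  # make pooch fetch a fresh copy
    pooch.retrieve(url=url, known_hash=known_hash, fname=archive_name,
                   path=data_dir,
                   processor=pooch.Untar(extract_dir=folder_name))
    if not keep_archive:
        os.remove(archive)
    return target
```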
I'm in favor of starting with option 1 in this PR, and maybe adding option 2 later. option 2 has more complicated logic and gives users more choices than they used to have... let's focus first on replacing our custom code with pooch equivalents, and then consider whether to expand. My 2c |
+1 for option 1
yet doubling the size of the home dir for network users can be a bad idea...
Another interim update: in getting the LIMO dataset to use pooch, I discovered that when the OSF servers zip a folder of files for you, the resulting zip will have a different md5 hash every time. So testing the integrity of the LIMO files will be a pain, because each zip archive contains the same 2 filenames (making it rather tricky to keep track of which
I didn't know our LIMO dataset worked that way. Also it's not great that the hash changes. Would it be feasible to make a single large archive .tar.gz like we do with other datasets? Or instead, can we download one file at a time, each with their own hash? Just thinking out loud of alternatives that allow us to verify hashes without actually looking at our code or the files...
In theory we can check hashes of individual files. I've also opened an issue with OSF about fixing the changing hash problem. I'd rather not have one big tar, since at present you can download just one subject at a time instead of all 18 at once.
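Per-file hashes map naturally onto pooch's registry mechanism; a sketch with a made-up base URL, file names, and hashes (not the real LIMO/OSF layout):

```python
import pooch

# One registry entry per file, each with its own hash, so single subjects
# can be fetched (and verified) independently of the rest.
limo_fetcher = pooch.create(
    path=pooch.os_cache("mne_limo_example"),
    base_url="https://osf.io/download/",                           # placeholder base URL
    registry={
        "S1/LIMO.mat": "md5:0123456789abcdef0123456789abcdef",     # placeholder hash
        "S1/Yr.mat": "md5:fedcba9876543210fedcba9876543210",       # placeholder hash
        # ... one entry per file, per subject
    },
)

# Download (or verify) only what the user asks for, e.g. subject 1:
fname = limo_fetcher.fetch("S1/LIMO.mat")
```

Each `fetch()` call downloads and verifies only the requested file, so a single subject can still be fetched on its own.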
Hi @drammock, I just uploaded a dataset to OSF and will upload a few more in the coming weeks. Do you know if the issue about hashes that you opened was resolved? If not, do you have an alternative location you would suggest for uploading that will work nicely with pooch?
nope, no response at all: CenterForOpenScience/osf.io#9594
I suggest you continue using OSF for now. Our pooch implementation was stalled for quite a while waiting for fatiando/pooch#224 to get merged, and when it did get merged I was too busy to pick this back up. Since then a lot has changed in
Thanks @drammock
@drammock I like the idea of doing an initial "replacement" of the existing
Do you have an idea/sequence for how to get started on helping you here? Thanks!
Do you mind if I start a new PR to bring things up to date w/ main? Is there a way to basically "co-author" the main commits with you?
I agree that the diff here is huge and starting a fresh PR is the right approach. Don't worry about co-signing commits; I'm sure I'll have a few accepted suggestions during review and/or I can push a few commits to your PR if appropriate.
EEGMI_URL = 'https://physionet.org/files/eegmmidb/1.0.0/'
@deprecated('mne.datasets.eegbci.data_path() is deprecated and will be removed' |
I can't recall why I was deprecating here (maybe to improve API consistency?)
maybe for the first PR you should try to minimize disruption (try to swap in pooch behind the scenes, and change the user-facing API minimally if at all)
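A sketch of what that minimal-disruption swap could look like for eegbci, keeping the public function intact and only changing its internals (the parameter names are a guess at the existing data_path signature, and the cache directory and hash handling are placeholders, not the actual MNE implementation):

```python
import pathlib
import pooch

EEGMI_URL = 'https://physionet.org/files/eegmmidb/1.0.0/'  # from the diff above

def data_path(url, path=None, force_update=False, update_path=None):
    """Keep the user-facing signature; only swap pooch in behind the scenes."""
    # update_path handling omitted in this sketch
    if path is None:
        path = pooch.os_cache("mne_eegbci_example")  # placeholder cache dir
    fname = url.split('/')[-1]
    if force_update:
        # removing any cached copy forces pooch to download a fresh one
        pathlib.Path(path, fname).unlink(missing_ok=True)
    # known_hash=None skips verification here; a per-file registry of
    # hashes could be swapped in later
    return pooch.retrieve(url=url, known_hash=None, fname=fname, path=path)
```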
WIP: the implementation is currently unfinished.
closes #8674