
WIP: use pooch for dataset downloading #8679

Closed · wanted to merge 13 commits into main from use-pooch

Conversation

@drammock (Member)

WIP: the implementation is currently unfinished.

closes #8674

@agramfort (Member)

Cool initiative @drammock! Our datasets module has become a mess over the years.

@drammock (Member Author) commented Jan 4, 2021

FYI this is stalled until fatiando/pooch#223 can be solved.

@larsoner (Member) commented Jan 4, 2021

Can we update our kiloword dataset in some trivial and maybe also helpful way to overcome this problem? For example we could split it into multiple archives if we only use a subset of the files anyway. But if this is a problem with multiple datasets then maybe we should wait for an upstream fix.

@drammock (Member Author) commented Jan 4, 2021

The problem is not specific to kiloword; I just used that as an example because it's fairly small, so it downloads fast during testing. Pooch's built-in zip/tar extractors allow only limited control over the resulting location of the unpacked file(s). I'm working on an upstream PR now.

@drammock (Member Author) commented Jan 5, 2021

Some interim updates:

I'm working locally with my modified version of the pooch code (fatiando/pooch#224). It's working well for most datasets, but I haven't yet tackled the hard ones that allow partial downloading (eegbci, brainstorm, etc.). The main downside is that, to get the benefit of pooch's hash comparison, we have to keep a copy of the archive file around on disk after unpacking it, ballooning the disk usage by 50 to 90% (I'm not sure what the average compression ratio is for these datasets).

If we don't keep the archives around, we can still use pooch for the downloading/unpacking step; we just have to check first whether the target dir exists, assume the correct files are in there, and return the path (bypassing pooch entirely). In that scenario we would need a "force_download" flag or similar, which would download no matter what (since there's no old archive around to compare a hash against).

In summary: it's hard for us to make use of one of pooch's nicest features (conditional download based on hash comparison) without roughly doubling our footprint on disk. But it's probably still worth using pooch anyway, because it handles downloading, verifying, and unpacking for us.
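
To make the tradeoff concrete, here is a minimal sketch of the keep-the-archive workflow using pooch's public API; the URL, hash, and cache path below are placeholders for illustration, not the real MNE values:

import pooch

# Download (or reuse) the cached archive, verify its hash, then unpack.
# Keeping the archive in `path` is what enables the hash comparison on
# subsequent calls -- and also what inflates the on-disk footprint.
unpacked_files = pooch.retrieve(
    url="https://example.com/kiloword.zip",  # placeholder URL
    known_hash="md5:" + "0" * 32,            # placeholder hash
    path="~/mne_data/archives",              # placeholder cache location
    processor=pooch.Unzip(),                 # unpack after download/verification
)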

@larsoner (Member) commented Jan 5, 2021

> we can still use pooch for the downloading/unpacking step; we just have to check first whether the target dir exists, assume the correct files are in there, and return the path (bypassing pooch entirely)

I think this is fine. This is what _data_path already does, more or less. Using pooch basically as a substitute for just _fetch_file already seems worthwhile.

> In that scenario we would need a "force_download" flag or similar, which would download no matter what (since there's no old archive around to compare a hash against)

We already have a force_update arg, so we can just keep it.
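
For illustration, the substitution could be as thin as this; the function name and force_update handling mirror the discussion here, so treat it as a sketch rather than the final implementation:

import os
import pooch

def _fetch_file(url, file_name, hash_=None, force_update=False):
    # Hypothetical pooch-backed stand-in for MNE's _fetch_file.
    path, fname = os.path.split(os.path.abspath(file_name))
    if force_update and os.path.isfile(file_name):
        os.remove(file_name)  # force pooch to re-download unconditionally
    # pooch skips the download if the file is already present with a
    # matching hash, and verifies the hash after any fresh download
    return pooch.retrieve(url=url, known_hash=hash_, fname=fname, path=path)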

@agramfort (Member) commented Jan 5, 2021 via email

@drammock (Member Author) commented Jan 5, 2021

> what happens if you remove the archive? it downloads again?

I think if we remove the archive, then we just have to check for the presence of the expected folder: return it if it exists, and download/unpack the archive if it doesn't.

A slightly more complicated option is to check whether the archive is there; if so, check its hash and re-download if appropriate, but if it's not there and the target folder already exists, just return the folder without downloading anything. This would allow users to set a "keep_archives" flag and then automatically get the latest version of the dataset, if they were willing to spend the disk space to keep the archives around. If they didn't want to do that, they could still manually force a download with a "force_download" flag.

@larsoner (Member) commented Jan 5, 2021

> A slightly more complicated option is to check whether the archive is there

I'd rather skip this and add it later if we want. We don't get many complaints from people about datasets being out of date, and we don't update them very often. If we're looking for a way to version datasets, I'd rather expand support for our dataset version.txt than keep these archives around: it takes up far less space for users and will be faster anyway. There is some added overhead at our end to keep the text files up to date and check them, but I don't think it's too onerous.
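
For illustration, the version.txt check amounts to something like this (the helper name and layout are hypothetical):

from pathlib import Path

def _dataset_up_to_date(target_folder, expected_version):
    # Hypothetical check: compare the on-disk version.txt to what we expect.
    version_file = Path(target_folder) / "version.txt"
    if not version_file.is_file():
        return False  # no marker file, so assume a (re)download is needed
    return version_file.read_text().strip() == expected_version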

@agramfort (Member) commented Jan 5, 2021 via email

@larsoner (Member) commented Jan 5, 2021

> Let's do like other projects do when based on pooch. Let's not try to be too clever here.

We might be in a unique situation trying to deal with large (1 GB+) archives as opposed to smaller ones, not sure...

Naively, I think the simplest solution would be to use pooch to replace our _fetch_file, then move on to the advanced stuff and versioning once it's supported. @drammock WDYT about this as a simpler first step?

@drammock (Member Author) commented Jan 5, 2021

Pooch is designed for the case where individual files mostly live in the same repo or on the same server. So I'm not sure that the way other projects do it is a good guide for us, since, as @larsoner says, we have big archives that often contain many files that all need to be present to be useful. I see 2 options; see the pseudocode below:

option 1

if target_folder_exists() and not force_download:
    return target_folder
else:
    use_pooch_to_download_verify_and_unpack()
    return target_folder

option 2

if force_download:
    use_pooch_to_download_verify_and_unpack()
elif local_archive_exists() and hash_check_fails():
    use_pooch_to_download_verify_and_unpack()
elif not target_folder_exists():
    use_pooch_to_download_verify_and_unpack()
return target_folder

I'm in favor of starting with option 1 in this PR and maybe adding option 2 later. Option 2 has more complicated logic and gives users more choices than they used to have... let's focus first on replacing our custom code with pooch equivalents, and then consider whether to expand. My 2c.
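
For concreteness, a sketch of what option 1 might look like with pooch; the function name, URL handling, and hash are placeholders, and Unzip(extract_dir=...) assumes fatiando/pooch#224 is merged:

from pathlib import Path
import pooch

def data_path(target_folder, url, known_hash, force_download=False):
    # Hypothetical option-1 fetcher: trust an existing folder, else use pooch.
    target_folder = Path(target_folder)
    if target_folder.is_dir() and not force_download:
        return target_folder  # assume the unpacked files in it are intact
    # Download the archive, verify its hash, and unpack into target_folder;
    # extract_dir is interpreted relative to `path` (needs fatiando/pooch#224).
    pooch.retrieve(
        url=url,
        known_hash=known_hash,
        path=str(target_folder.parent),
        processor=pooch.Unzip(extract_dir=target_folder.name),
    )
    return target_folder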

@agramfort (Member) commented Jan 5, 2021 via email

@drammock (Member Author) commented Jan 7, 2021

Another interim update: while getting the LIMO dataset to use pooch, I discovered that when the OSF servers zip a folder of files for you, the resulting zip has a different md5 hash every time. So testing the integrity of the LIMO files will be a pain, because each zip archive contains the same 2 filenames (making it rather tricky to keep track of which LIMO.mat has which md5). I suppose that's why the current code doesn't bother checking hashes for LIMO files 😆

@larsoner (Member) commented Jan 7, 2021

I didn't know our LIMO dataset worked that way. Also it's not great that the hash changes.

Would it be feasible to make a single large .tar.gz archive like we do with other datasets?

Or instead, can we download one file at a time, each with their own hash?

Just thinking out loud about alternatives that would let us verify hashes, without actually looking at our code or the files...

@drammock (Member Author) commented Jan 7, 2021

In theory we can check hashes of individual files. I've also opened an issue with OSF about fixing the changing-hash problem. I'd rather not have one big tar, since at present you can download just one subject at a time instead of all 18 at once.
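
A sketch of the per-file idea (the registry entries are made up for illustration; pooch.file_hash is the real helper):

import pooch

# Hypothetical per-file registry mapping each file to its own hash,
# sidestepping the unstable hash of the OSF-generated zip entirely.
registry = {
    "S1/LIMO.mat": "md5:" + "0" * 32,  # placeholder
    "S2/LIMO.mat": "md5:" + "f" * 32,  # placeholder
}

def verify_file(local_path, registry_key):
    # Compare a downloaded file's hash against the per-file registry.
    alg, expected = registry[registry_key].split(":")
    return pooch.file_hash(local_path, alg=alg) == expected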

drammock force-pushed the use-pooch branch 5 times, most recently from d245fbf to 2dc7ef4 on January 8, 2021 at 23:20
drammock mentioned this pull request on Jan 19, 2021
Base automatically changed from master to main on January 23, 2021
drammock force-pushed the use-pooch branch 2 times, most recently from 4093de4 to bc4acdd on February 16, 2021 at 22:45
@rob-luke (Member)

Hi @drammock, I just uploaded a dataset to OSF and will upload a few more in the coming weeks. Do you know if the issue about hashes that you opened was resolved? If not, do you have an alternative location you would suggest for uploading that will work nicely with pooch?

@drammock (Member Author)

> Hi @drammock, I just uploaded a dataset to OSF and will upload a few more in the coming weeks. Do you know if the issue about hashes that you opened was resolved?

Nope, no response at all: CenterForOpenScience/osf.io#9594

> If not, do you have an alternative location you would suggest for uploading that will work nicely with pooch?

I suggest you continue using OSF for now. Our pooch implementation was stalled for quite a while waiting for fatiando/pooch#224 to get merged, and when it finally did, I was too busy to pick this back up. Since then a lot has changed in main, so reviving this PR would basically mean starting over... so there's no point doing something inconvenient to accommodate a refactoring that may or may not ever get done.

@rob-luke (Member)

Thanks @drammock

@adam2392 (Member)

@drammock I like the idea of doing an initial "replacement" of the existing _fetch_file. I read through the discussion here. However, the diff in this draft is very large.

Do you have an idea of a sequence of steps for getting started on helping you here?

Thanks!

@adam2392 (Member)

Do you mind if I start a new PR to bring things up to date with main? Is there a way to basically "co-author" the main commits with you?

@drammock (Member Author)

I agree that the diff here is huge and starting a fresh PR is the right approach. Don't worry about co-signing commits; I'm sure I'll have a few accepted suggestions during review, and/or I can push a few commits to your PR if appropriate.

Inline review comments on the diff:

EEGMI_URL = 'https://physionet.org/files/eegmmidb/1.0.0/'


@deprecated('mne.datasets.eegbci.data_path() is deprecated and will be removed'
A Member commented:

@drammock should I deprecate these functions for these datasets in #9742 ?

@drammock (Member Author) replied:

I can't recall why I was deprecating here (maybe to improve API consistency?)

@drammock (Member Author):

Maybe for the first PR you should try to minimize disruption (swap in pooch behind the scenes, and change the user-facing API minimally, if at all).

drammock deleted the use-pooch branch on September 16, 2021 at 19:41
Successfully merging this pull request may close these issues:

RFC, MAINT: use pooch for download management? (#8674)