
WIP: use pooch for dataset downloading #8679

Closed · wanted to merge 13 commits into main from use-pooch

Conversation

@drammock (Member)

WIP: the implementation is currently unfinished.

closes #8674

@agramfort (Member)

Cool initiative @drammock! Our datasets module has become a mess over the years.

@drammock (Member Author) commented Jan 4, 2021

FYI this is stalled until fatiando/pooch#223 can be solved.

@larsoner (Member) commented Jan 4, 2021

Can we update our kiloword dataset in some trivial and maybe also helpful way to overcome this problem? For example we could split it into multiple archives if we only use a subset of the files anyway. But if this is a problem with multiple datasets then maybe we should wait for an upstream fix.

@drammock (Member Author) commented Jan 4, 2021

The problem is not specific to kiloword; I just used that as an example because it's fairly small, so it downloads fast during testing. Pooch's built-in zip/tar extractors allow only limited control over the resulting location of the unpacked file(s). I'm working on an upstream PR now.

@drammock (Member Author) commented Jan 5, 2021

Some interim updates:

I'm working locally with my modified version of the pooch code (fatiando/pooch#224). It's working well for most datasets, but I haven't yet tackled the hard ones that allow partial downloading (eegbci, brainstorm, etc.). The main downside is that, to get the benefit of pooch's hash comparison, we have to keep a copy of the archive file around on disk after unpacking it, ballooning the disk usage by 50 to 90% (I'm not sure what the average compression ratio is for these datasets).

If we don't keep the archives around, we can still use pooch for the downloading/unpacking step; we just have to check first whether the target dir exists, assume the correct files are in there, and return the path (bypassing pooch entirely). In that scenario we would need a "force_download" flag or similar, which would download no matter what (since there's no old archive around to compare a hash against).

In summary: it's hard for us to make use of one of pooch's nicest features (conditional download based on hash comparison) without roughly doubling our footprint on disk. But it's probably still worth using pooch anyway, because it handles downloading, verifying, and unpacking for us.
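
To make the tradeoff concrete, here is a minimal sketch of the keep-the-archive workflow using pooch's public API; the URL, hash, and cache path below are placeholders for illustration, not the real MNE values:

import pooch

# Download (or reuse) the cached archive, verify its hash, then unpack.
# Keeping the archive in `path` is what enables the hash comparison on
# subsequent calls -- and also what inflates the on-disk footprint.
unpacked_files = pooch.retrieve(
    url="https://example.com/kiloword.zip",  # placeholder URL
    known_hash="md5:" + "0" * 32,            # placeholder hash
    path="~/mne_data/archives",              # placeholder cache location
    processor=pooch.Unzip(),                 # unpack after download/verification
)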

@larsoner (Member) commented Jan 5, 2021

> we can still use pooch for the downloading/unpacking step; we just have to check first whether the target dir exists, assume the correct files are in there, and return the path (bypassing pooch entirely)

I think this is fine. This is what _data_path already does, more or less. Using pooch basically as a substitute for just _fetch_file already seems worthwhile.

> In that scenario we would need a "force_download" flag or similar, which would download no matter what (since there's no old archive around to compare a hash against)

We already have a force_update arg, so we can just keep it.
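
For illustration, the substitution could be as thin as this; the function name and force_update handling mirror the discussion here, so treat it as a sketch rather than the final implementation:

import os
import pooch

def _fetch_file(url, file_name, hash_=None, force_update=False):
    # Hypothetical pooch-backed stand-in for MNE's _fetch_file.
    path, fname = os.path.split(os.path.abspath(file_name))
    if force_update and os.path.isfile(file_name):
        os.remove(file_name)  # force pooch to re-download unconditionally
    # pooch skips the download if the file is already present with a
    # matching hash, and verifies the hash after any fresh download
    return pooch.retrieve(url=url, known_hash=hash_, fname=fname, path=path)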

@agramfort (Member) commented Jan 5, 2021 via email

@drammock (Member Author) commented Jan 5, 2021

> what happens if you remove the archive? it downloads again?

I think if we remove the archive, then we just have to check for the presence of the expected folder: return it if it exists, and download/unpack the archive if it doesn't.

A slightly more complicated option is to check whether the archive is there; if so, check its hash and re-download if appropriate, but if it's not there and the target folder already exists, just return the folder without downloading anything. This would allow users to set a "keep_archives" flag and then automatically get the latest version of the dataset, if they were willing to spend the disk space to keep the archives around. If they didn't want to do that, they could still manually force a download with a "force_download" flag.

@larsoner (Member) commented Jan 5, 2021

> A slightly more complicated option is to check whether the archive is there

I'd rather skip this and add it later if we want. We don't get many complaints from people about datasets being out of date, and we don't update them very often. If we're looking for a way to version datasets, I'd rather expand support for our dataset version.txt than keep these archives around: it takes up far less space for users and will be faster anyway. There is some added overhead at our end to keep the text files up to date and check them, but I don't think it's too onerous.
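
For illustration, the version.txt check amounts to something like this (the helper name and layout are hypothetical):

from pathlib import Path

def _dataset_up_to_date(target_folder, expected_version):
    # Hypothetical check: compare the on-disk version.txt to what we expect.
    version_file = Path(target_folder) / "version.txt"
    if not version_file.is_file():
        return False  # no marker file, so assume a (re)download is needed
    return version_file.read_text().strip() == expected_version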

@agramfort (Member) commented Jan 5, 2021 via email

@larsoner (Member) commented Jan 5, 2021

> Let's do like other projects do when based on pooch. Let's not try to be too clever here.

We might be in a unique situation trying to deal with large (1 GB+) archives as opposed to smaller ones, not sure...

Naively, I think the simplest solution would be to use pooch to replace our _fetch_file, then move on to the advanced stuff and versioning once it's supported. @drammock WDYT about this as a simpler first step?

@drammock (Member Author) commented Jan 5, 2021

Pooch is designed for the case where individual files mostly live in the same repo or on the same server. So I'm not sure that the way other projects do it is a good guide for us, since, as @larsoner says, we have big archives that often contain many files that all need to be present to be useful. I see 2 options; see the pseudocode below:

option 1

if target_folder_exists() and not force_download:
    return target_folder
else:
    use_pooch_to_download_verify_and_unpack()
    return target_folder

option 2

if force_download:
    use_pooch_to_download_verify_and_unpack()
elif local_archive_exists() and hash_check_fails():
    use_pooch_to_download_verify_and_unpack()
elif not target_folder_exists():
    use_pooch_to_download_verify_and_unpack()
return target_folder

I'm in favor of starting with option 1 in this PR and maybe adding option 2 later. Option 2 has more complicated logic and gives users more choices than they used to have... let's focus first on replacing our custom code with pooch equivalents, and then consider whether to expand. My 2c.
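
For concreteness, a sketch of what option 1 might look like with pooch; the function name, URL handling, and hash are placeholders, and Unzip(extract_dir=...) assumes fatiando/pooch#224 is merged:

from pathlib import Path
import pooch

def data_path(target_folder, url, known_hash, force_download=False):
    # Hypothetical option-1 fetcher: trust an existing folder, else use pooch.
    target_folder = Path(target_folder)
    if target_folder.is_dir() and not force_download:
        return target_folder  # assume the unpacked files in it are intact
    # Download the archive, verify its hash, and unpack into target_folder;
    # extract_dir is interpreted relative to `path` (needs fatiando/pooch#224).
    pooch.retrieve(
        url=url,
        known_hash=known_hash,
        path=str(target_folder.parent),
        processor=pooch.Unzip(extract_dir=target_folder.name),
    )
    return target_folder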

@agramfort (Member) commented Jan 5, 2021 via email

@drammock (Member Author) commented Jan 7, 2021

Another interim update: while getting the LIMO dataset to use pooch, I discovered that when the OSF servers zip a folder of files for you, the resulting zip has a different md5 hash every time. So testing the integrity of the LIMO files will be a pain, because each zip archive contains the same 2 filenames (making it rather tricky to keep track of which LIMO.mat has which md5). I suppose that's why the current code doesn't bother checking hashes for LIMO files 😆

@larsoner (Member) commented Jan 7, 2021

I didn't know our LIMO dataset worked that way. Also it's not great that the hash changes.

Would it be feasible to make a single large .tar.gz archive like we do with other datasets?

Or instead, can we download one file at a time, each with their own hash?

Just thinking out loud about alternatives that would let us verify hashes, without actually looking at our code or the files...

@drammock (Member Author) commented Jan 7, 2021

In theory we can check hashes of individual files. I've also opened an issue with OSF about fixing the changing-hash problem. I'd rather not have one big tar, since at present you can download just one subject at a time instead of all 18 at once.
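
A sketch of the per-file idea (the registry entries are made up for illustration; pooch.file_hash is the real helper):

import pooch

# Hypothetical per-file registry mapping each file to its own hash,
# sidestepping the unstable hash of the OSF-generated zip entirely.
registry = {
    "S1/LIMO.mat": "md5:" + "0" * 32,  # placeholder
    "S2/LIMO.mat": "md5:" + "f" * 32,  # placeholder
}

def verify_file(local_path, registry_key):
    # Compare a downloaded file's hash against the per-file registry.
    alg, expected = registry[registry_key].split(":")
    return pooch.file_hash(local_path, alg=alg) == expected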

drammock force-pushed the use-pooch branch 5 times, most recently from d245fbf to 2dc7ef4 on January 8, 2021 at 23:20
drammock mentioned this pull request on Jan 19, 2021
Base automatically changed from master to main on January 23, 2021
drammock force-pushed the use-pooch branch 2 times, most recently from 4093de4 to bc4acdd on February 16, 2021 at 22:45
@rob-luke (Member)

Hi @drammock, I just uploaded a dataset to OSF and will upload a few more in the coming weeks. Do you know if the issue about hashes that you opened was resolved? If not, do you have an alternative location you would suggest for uploading that will work nicely with pooch?

@drammock (Member Author)

> Hi @drammock, I just uploaded a dataset to OSF and will upload a few more in the coming weeks. Do you know if the issue about hashes that you opened was resolved?

Nope, no response at all: CenterForOpenScience/osf.io#9594

> If not, do you have an alternative location you would suggest for uploading that will work nicely with pooch?

I suggest you continue using OSF for now. Our pooch implementation was stalled for quite a while waiting for fatiando/pooch#224 to get merged, and when it finally did, I was too busy to pick this back up. Since then a lot has changed in main, so reviving this PR would basically mean starting over... so there's no point doing something inconvenient to accommodate a refactoring that may or may not ever get done.

@rob-luke (Member)

Thanks @drammock

@adam2392 (Member)

@drammock I like the idea of doing an initial "replacement" of the existing _fetch_file. I read through the discussion here. However, the diff in this draft is very large.

Do you have an idea of a sequence of steps for getting started on helping you here?

Thanks!

@adam2392 (Member)

Do you mind if I start a new PR to bring things up to date with main? Is there a way to basically "co-author" the main commits with you?

@drammock (Member Author)

I agree that the diff here is huge and starting a fresh PR is the right approach. Don't worry about co-signing commits; I'm sure I'll have a few accepted suggestions during review, and/or I can push a few commits to your PR if appropriate.

Inline review comments on the diff:

EEGMI_URL = 'https://physionet.org/files/eegmmidb/1.0.0/'


@deprecated('mne.datasets.eegbci.data_path() is deprecated and will be removed'
A Member commented:

@drammock should I deprecate these functions for these datasets in #9742 ?

@drammock (Member Author) replied:

I can't recall why I was deprecating here (maybe to improve API consistency?)

@drammock (Member Author):

Maybe for the first PR you should try to minimize disruption (swap in pooch behind the scenes, and change the user-facing API minimally, if at all).

drammock deleted the use-pooch branch on September 16, 2021 at 19:41
Successfully merging this pull request may close these issues:

RFC, MAINT: use pooch for download management? (#8674)