Add simple generate summaries and totals functions that group by directory. #103
base: main
…ctory name instead of by dandiset.
for more information, see https://pre-commit.ci
Indeed, I struggled with what to call these as well (even the word 'archive' is a bit close to our own use cases...). The goal of the tool (though I need to make this clearer in some dev documentation on data structures...) is to create a mirror of the S3 bucket contents, where every object in the bucket is its own directory with the 3 The most general way I can think of to refer to what we call data/dandisets is then 'top level', as in 'top level summaries and totals'. We take that to mean datasets for OpenNeuro (because your top level is datasets), while DANDI of course has a special separate structure that has to be manually assembled (technically we even have other unmentionable things at our top level besides
What do you think?
PR looks great! I will add more tests to enhance coverage in a follow-up, but I'd say this is good to merge once we get the 'mode' of extraction file patterns ironed out in the above comment.
src/s3_log_extraction/summarize/_generate_all_dataset_summaries.py
Outdated
…s.py Co-authored-by: Cody Baker <51133164+CodyCBakerPhD@users.noreply.github.com>
… rng.integers call
…g-extraction into enh/directory_based_totals
LGTM. @rwblair, ready to merge? I will add tests for all of this in a follow-up after the performance benchmarking.
src/s3_log_extraction/extractors/_dandi_s3_log_access_extractor.py
Outdated
I used 'dataset' in place of 'dandiset', but that's not honest. It just so happens that all of OpenNeuro's datasets have their own root-level prefix in the bucket. Because of that, I'm not sure they're a good default.
Not worth merging as is, but more of a starting point to discuss what the default behavior of update summaries and update totals should be.
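To make the idea of grouping by root-level prefix concrete, here is a minimal sketch. It is only an illustration under stated assumptions: the function name, the (object key, bytes sent) record shape, and the sample keys are hypothetical and are not taken from s3_log_extraction.

```python
from collections import defaultdict


def total_bytes_by_top_level(records):
    """Sum bytes sent per top-level directory (root-level prefix) of each object key.

    `records` is an iterable of (object_key, bytes_sent) pairs. This record shape
    is an assumption for illustration, not the package's actual data structure.
    """
    totals = defaultdict(int)
    for object_key, bytes_sent in records:
        # The top-level directory is everything before the first "/" in the key.
        top_level = object_key.lstrip("/").split("/", 1)[0]
        totals[top_level] += bytes_sent
    return dict(totals)


# Hypothetical sample records, loosely shaped like OpenNeuro-style keys.
example_records = [
    ("ds000001/sub-01/anat/sub-01_T1w.nii.gz", 1000),
    ("ds000001/dataset_description.json", 50),
    ("ds000002/participants.tsv", 200),
]
totals = total_bytes_by_top_level(example_records)
```

With a bucket layout like OpenNeuro's, each key of `totals` happens to be a dataset. For a layout where the top level is not a dataset, the same grouping would still run, but the resulting keys would need a different interpretation.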