Add simple generate summaries and totals functions that group by directory. #103

rwblair · 2025-06-10T18:00:30Z

I used "dataset" in place of "dandiset", but that's not quite honest: it just so happens that all of OpenNeuro's datasets have their own root-level prefix in the bucket. Because of that, I'm not sure they're a good default.

Not worth merging as is, but more of a starting point to discuss what default behavior of update summaries and update totals should be.

CodyCBakerPhD · 2025-06-11T01:38:22Z

> I used "dataset" in place of "dandiset", but that's not quite honest: it just so happens that all of OpenNeuro's datasets have their own root-level prefix in the bucket. Because of that, I'm not sure they're a good default.

Indeed I struggled with what to call these as well (even the phrase 'archive' is a bit close to our own use cases...)

The goal of the tool (though I need to make this clearer in some dev documentation on data structures...) is to create a mirror of the S3 bucket contents, where every object in the bucket is its own directory with the 3 .txt files contained therein.

The most general way I can think of to refer to what we call data/dandisets is then 'top level', as in 'top-level summaries and totals'. We take that to mean datasets for OpenNeuro (because your top level is datasets), whereas DANDI of course has a special separate structure that has to be manually assembled to match the same concept (technically we even have other unmentionable things at our top level besides blobs and zarr).

What do you think?
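To make the 'top level' idea concrete, here is a minimal sketch of totaling by the first path component of a mirrored bucket. It is not the code from this PR: the function names, the `bytes_sent.txt` file name, and the example object keys are all assumptions made up for illustration.

```python
import tempfile
from collections import defaultdict
from pathlib import Path


def total_bytes_by_top_level(mirror_root: Path, filename: str = "bytes_sent.txt") -> dict[str, int]:
    """Sum the integers in every `filename` under `mirror_root`, keyed by top-level directory."""
    totals: dict[str, int] = defaultdict(int)
    for path in mirror_root.rglob(filename):
        # The first component of the relative path is the 'top level' group.
        top_level = path.relative_to(mirror_root).parts[0]
        totals[top_level] += sum(int(value) for value in path.read_text().split())
    return dict(totals)


def build_example_mirror(root: Path) -> None:
    """Create a tiny, made-up mirror: one directory per object key, holding a bytes file."""
    example = {
        "ds000001/sub-01/anat/T1w.nii.gz": [100, 250],
        "ds000001/dataset_description.json": [10],
        "ds000002/sub-01/func/bold.nii.gz": [4000],
    }
    for object_key, byte_counts in example.items():
        object_directory = root / object_key  # each object key becomes a directory
        object_directory.mkdir(parents=True)
        # Hypothetical file name; one byte count per line.
        (object_directory / "bytes_sent.txt").write_text("\n".join(map(str, byte_counts)))


with tempfile.TemporaryDirectory() as temporary_directory:
    mirror_root = Path(temporary_directory)
    build_example_mirror(mirror_root)
    totals = total_bytes_by_top_level(mirror_root)
    print(totals)
```

With a layout like OpenNeuro's, where each dataset has its own root-level prefix, the groups line up with datasets. A bucket such as DANDI's, whose top level holds things like blobs and zarr, would still need its own mapping step.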

> Not worth merging as is, but more of a starting point to discuss what default behavior of update summaries and update totals should be.

PR looks great! I will add more tests to enhance coverage in a follow-up, but I'd say this is good to merge once we get the 'mode' of extraction file patterns ironed out in the above comment
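One way the file pattern question could be handled, and roughly what a later commit title in this PR describes ("make regex for log file name pattern a property of the extractor class"), is to let each extractor expose its own pattern. The sketch below is only an illustration under assumptions: the class names, the property name, and both regexes are hypothetical and not taken from the repository.

```python
import re


class BaseExtractor:
    """Hypothetical base extractor; the real class in the repository may differ."""

    @property
    def log_file_pattern(self) -> re.Pattern:
        # Assumed default: accept any non-empty file name.
        return re.compile(r".+")

    def select_log_files(self, file_names: list[str]) -> list[str]:
        """Keep only the names that fully match this extractor's pattern."""
        pattern = self.log_file_pattern
        return [name for name in file_names if pattern.fullmatch(name)]


class DatedLogExtractor(BaseExtractor):
    """Hypothetical subclass that only accepts names like 2025-06-10.log."""

    @property
    def log_file_pattern(self) -> re.Pattern:
        return re.compile(r"\d{4}-\d{2}-\d{2}\.log")


candidates = ["2025-06-10.log", "notes.txt", "2025-06-11.log"]
selected_default = BaseExtractor().select_log_files(candidates)
selected_dated = DatedLogExtractor().select_log_files(candidates)
print(selected_default)
print(selected_dated)
```

Keeping the pattern on the class means an archive with a different log naming scheme only overrides one property, and the selection logic stays shared.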

CodyCBakerPhD · 2025-06-20T18:35:28Z

LGTM @rwblair ready to merge?

I will add tests for all of this in a follow-up after the performance benchmarking.

rwblair and others added 2 commits June 10, 2025 12:45

Add simple generate summaries and totals functions that group by directory name instead of by dandiset.

9fa4307

[pre-commit.ci] auto fixes from pre-commit.com hooks

a3e1f71

for more information, see https://pre-commit.ci

CodyCBakerPhD reviewed Jun 11, 2025

src/s3_log_extraction/extractors/_s3_log_access_extractor.py (outdated)

CodyCBakerPhD assigned rwblair Jun 11, 2025

CodyCBakerPhD added 2 commits June 10, 2025 21:38

Merge branch 'main' into enh/directory_based_totals

948b171

Merge branch 'main' into enh/directory_based_totals

2b71944

CodyCBakerPhD reviewed Jun 16, 2025

src/s3_log_extraction/summarize/_generate_all_dataset_summaries.py (outdated)

rwblair and others added 4 commits June 18, 2025 14:50

Update src/s3_log_extraction/summarize/_generate_all_dataset_summaries.py

e4566a4

Co-authored-by: Cody Baker <51133164+CodyCBakerPhD@users.noreply.github.com>

pass dtype into rng.integers on new_index collision to match original rng.integers call

91c663f

make regex for log file name pattern a property of the extractor class

94381d7

Merge branch 'enh/directory_based_totals' of github.com:rwblair/s3-log-extraction into enh/directory_based_totals

1ff2b9c

Merge branch 'main' into enh/directory_based_totals

385d211

CodyCBakerPhD reviewed Jun 20, 2025

src/s3_log_extraction/extractors/_dandi_s3_log_access_extractor.py (outdated)

CodyCBakerPhD and others added 5 commits June 20, 2025 16:48

Update src/s3_log_extraction/extractors/_dandi_s3_log_access_extractor.py

0e719e6

Merge branch 'main' into enh/directory_based_totals

aeca81f

Merge branch 'main' into enh/directory_based_totals

15ce9bf

[pre-commit.ci] auto fixes from pre-commit.com hooks

c06645a

for more information, see https://pre-commit.ci

Merge branch 'main' into enh/directory_based_totals

ad3ce93

CodyCBakerPhD reviewed Jun 26, 2025

src/s3_log_extraction/_command_line_interface/_cli.py (outdated)

CodyCBakerPhD and others added 9 commits June 26, 2025 12:33

resolve conflict

51f94da

chore: resolve conflict

d5aa76f

Merge branch 'main' into enh/directory_based_totals

bb39862

[pre-commit.ci] auto fixes from pre-commit.com hooks

89f79a4

for more information, see https://pre-commit.ci

Merge branch 'main' into enh/directory_based_totals

345f852

Merge branch 'main' into enh/directory_based_totals

0edea33

[pre-commit.ci] auto fixes from pre-commit.com hooks

9ae6c73

for more information, see https://pre-commit.ci

Update __init__.py

a892f1d

Update _cli.py

a7dd8a3

pre-commit-ci bot and others added 8 commits July 10, 2025 05:42

[pre-commit.ci] auto fixes from pre-commit.com hooks

f4f6066

for more information, see https://pre-commit.ci

Merge branch 'main' into enh/directory_based_totals

62a7cdc

Merge branch 'main' into enh/directory_based_totals

f8d9ac1

Merge branch 'main' into enh/directory_based_totals

4d98bee

Merge branch 'main' into enh/directory_based_totals

0834b64

Merge branch 'main' into enh/directory_based_totals

fefb9a9

Merge branch 'main' into enh/directory_based_totals

f29b0cb

Merge branch 'main' into enh/directory_based_totals

b38db38
