
Adopt dandischema.digests.zarr.get_checksum() to dandi-cli: use zarr_checksum library constructs instead of copies in dandischema #1371

Merged Jan 3, 2024 (1 commit)

Conversation

jwodder (Member) commented Dec 8, 2023

And rewrite it in terms of zarr_checksum

Part of dandi/dandi-schema#209

@jwodder jwodder added the internal Changes only affect the internal API label Dec 8, 2023

codecov bot commented Dec 8, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (b9a1099) 88.45% compared to head (4c98c10) 88.53%.

Additional details and impacted files
```
@@            Coverage Diff             @@
##           master    #1371      +/-   ##
==========================================
+ Coverage   88.45%   88.53%   +0.08%
==========================================
  Files          77       77
  Lines       10463    10471       +8
==========================================
+ Hits         9255     9271      +16
+ Misses       1208     1200       -8
```

Flag unittests: 88.53% <100.00%> (+0.08%) ⬆️


```
 files.append(p)
 return ZarrStat(
     size=size,
-    digest=Digest.dandi_zarr(get_checksum(file_md5s, dir_md5s)),
+    digest=Digest.dandi_zarr(checksum_zarr_dir(file_info, dir_info)),
```
Member:
Do we really need to "manually" loop here via dirpath.iterdir(), or could we make use of https://github.com/dandi/zarr_checksum#python:

```
compute_zarr_checksum(yield_files_local("local_path"))
```

I.e., could we reuse yield_files_local to collect the files and then just pass them to compute_zarr_checksum, instead of reimplementing it as checksum_zarr_dir?

Member Author (jwodder):

yield_files_local() appears to only yield files recognized by the zarr module, while our current code recognizes all files that are actually in a *.zarr directory tree, including "extraneous" files.

Member:
Hm, that sounds like a possible source of differences/confusion which we should ideally clarify/unify while we're at it.

I started writing the items below, but then looked at the zarr-python traversal implementation at https://github.com/zarr-developers/zarr-python/blob/HEAD/zarr/storage.py#L1174, and it seems to really list all files. Although that is "not guaranteed", I guess -- but are you sure that it actually excludes any "extraneous" files, @jwodder?

Some thoughts I started to compose:

  • In dandi-cli we typically upload only files recognized by us (unless the DEVEL option allow_any_path is set, or for BIDS datasets), i.e. by default we ignore "extraneous" files too, so zarr_checksum's default (and, at the moment, only) behavior is somewhat in line with that, while our own zarr support diverges from it.
  • In principle we should not silently ignore extraneous files within a .zarr -- but I guess zarr validation doesn't fail on (doesn't care about) them, right?
  • Do we have any knowledge of such extraneous files having been uploaded already? It might be good to know, and since we have the dandizarrs datalad datasets on drogon we might be able to "quickly" (nothing is quick when dealing with that many files) check.

Member Author (jwodder):

> are you sure that it actually excludes any "extraneous" files

I assumed the code was doing something zarr-specific based on the fact that it was using the zarr library. If zarr_checksum just checksums every file present, then there's no need to bring in zarr and its heavy dependencies. (CC @AlmightyYakob)

> but I guess zarr validation doesn't fail (does not care) about them right?

I believe extraneous files are ignored by Zarr validation.

> do we have any knowledge of such extraneous files uploaded already?

I am not aware of any, and I do not know how to identify such files in the first place.

Member:

It does seem that NestedDirectoryStore.keys essentially calls os.walk on the local file path. I also assumed more was at play.

Member:

> What exactly are your remaining concerns, if any, about this PR?

Well, my main remaining concern is the duplication, and thus possible divergence, of implementations/behavior. Since this PR already improves things (we avoid using duplicate implementations within dandi-schema and use more of the zarr_checksum ones) and takes us one step closer to stripping zarr digest specifics from dandi-schema, let's proceed as is.

While reviewing this, though, here are some thoughts and the issues I filed:

  • I agree that (at least for now) we should not look into moving checksumming into the zarr Python library.
  • Generalize zarr_checksum into folder_checksum with a set of format-specific (e.g. zarr) adapters? zarr_checksum#50
  • We already have differences in implementations and, IMHO, a real potential to hit a difference in behavior sooner or later:
    1. Here in dandi-cli we use os.walk directly, but then use exclude_from_zarr to exclude .git etc. Shouldn't such excludes be defined in zarr_checksum instead? Commented on.
    2. zarr_checksum goes via NestedDirectoryStore.keys, which at the moment also uses os.walk without any filtering, but that could change later and we might not even spot it, since we cannot foresee how it might change. Shouldn't we centralize the walker for zarr, e.g. within zarr_checksum?
  • The need for a duplicated implementation in dandi-cli is argued to be warranted because of:
    1. A more efficient threaded walk. Ideally, IMHO, zarr_checksum should just become as efficient, i.e. carry such an implementation. @jwodder -- do you see why that should not or could not be done?
    2. We have it "integrated" with EntryUploadTracker, which (correct me if I'm wrong) is used (along with to_delete) to first collect the list of zarr subparts which changed and thus need to be uploaded (or deleted), to avoid a full upload. And it is during that (threaded!) walk that we also compute the checksum, eventually over all elements of the zarr folder, uploaded or not.
      But both of those aspects are unrelated to the code touched by this PR! Here we do a regular non-threaded walk and do not interact with upload at all. In my original comment I pointed out that we seem to need the custom looping/summarization only to derive the digest and other stats relevant for our ZarrStat and the LocalZarrEntry's it contains. Those largely mimic ZarrDirectoryDigest and ZarrChecksum, but unlike those of zarr_checksum, which do not contain the entire hierarchy, ZarrStat actually contains all the files, flattened while recursively traversing. So for now I guess we are doomed to have this duplicated implementation, but in the longer run I think we should see how we could avoid it.

Member Author (jwodder):

@yarikoptic

> more efficient threaded walk. Ideally IMHO zarr_checksum should just become as efficient, ie carry such an implementation. @jwodder - do you see why that should not or could not be done?

Which efficient threaded walk are you referring to? get_zarr_checksum() or the digesting-for-uploading machinery? I'm fine with the former being in zarr_checksum, but the latter currently does not belong there.

> Shouldn't we centralize the walker for zarr, e.g. within zarr_checksum?

Are you going to create an issue about this? If not, your concern is just going to go into the void.

> So, for now I guess we are doomed to have this duplicated implementation but I think in the longer run we might want to see how we could avoid it.

The entire ZarrAsset.stat() code is currently unused by the library. (I'd look up what commit removed the last use, but I can't get git log -S to honor any word boundary syntax I know of.)

Member:

> The entire ZarrAsset.stat() code is currently unused by the library. (I'd look up what commit removed the last use, but I can't get git log -S to honor any word boundary syntax I know of.)

Something like 47c524c, which seems to have been a big refactoring that stopped using .stat()?

Member Author (jwodder):

No, that commit only got rid of stat'ing Zarr entries on the Archive.

Member:

> > more efficient threaded walk. Ideally IMHO zarr_checksum should just become as efficient, ie carry such an implementation. @jwodder - do you see why that should not or could not be done?
>
> Which efficient threaded walk are you referring to? get_zarr_checksum() or the digesting-for-uploading machinery? I'm fine with the former being in zarr_checksum, but the latter currently does not belong there.

We are on the same page here. Filed dandi/zarr_checksum#52.

> > Shouldn't we centralize the walker for zarr, e.g. within zarr_checksum?
>
> Are you going to create an issue about this? If not, your concern is just going to go into the void.

"""
manifest = ZarrChecksumManifest(
files=[
ZarrChecksum(digest=digest, name=name, size=size)
Member:

Note to myself: instead of using dandi-schema's ZarrJSONChecksumSerializer, as the original get_checksum defined in dandi-schema did, we now start using zarr_checksum's ZarrChecksumManifest and ZarrChecksum. That would allow pruning dandischema.digests.zarr.
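For intuition, the dandi zarr digest has the shape of an md5 over a canonical manifest of per-entry digests, suffixed with the file count and total byte size. The stdlib-only sketch below illustrates that shape; the exact, authoritative serialization is defined by zarr_checksum's `ZarrChecksumManifest.generate_digest`, not by this code, and `manifest_digest` is a hypothetical name.

```python
import hashlib
import json

def manifest_digest(files):
    """Digest a {name: (md5, size)} mapping into a dandi-zarr-like
    "<md5>-<count>--<total_size>" string. Illustrative only: the
    canonical serialization lives in zarr_checksum and may differ."""
    # Sort entries by name so insertion order cannot affect the digest.
    entries = [
        {"digest": digest, "name": name, "size": size}
        for name, (digest, size) in sorted(files.items())
    ]
    payload = json.dumps({"directories": [], "files": entries},
                         separators=(",", ":"))
    md5 = hashlib.md5(payload.encode()).hexdigest()
    total = sum(size for _digest, size in files.values())
    return f"{md5}-{len(entries)}--{total}"
```

Because the entries are sorted before serialization, callers do not need to pre-sort their file listings to get a deterministic digest.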

```
    manifest = ZarrChecksumManifest(
        files=[
            ZarrChecksum(digest=digest, name=name, size=size)
            for name, (digest, size) in files.items()
```
Member:

Note to myself (worth adding as a code comment?): previously we sorted explicitly in

https://github.com/dandi/dandi-schema/blob/bdb7dd681435f1092814041eb01eb4fabe87bf57/dandischema/digests/zarr.py#L150

Now that is not needed, since ZarrChecksumManifest.generate_digest does it: https://github.com/dandi/zarr_checksum//blob/HEAD/zarr_checksum/checksum.py#L80

@yarikoptic yarikoptic changed the title Copy dandischema.digests.zarr.get_checksum() to dandi-cli Adopt dandischema.digests.zarr.get_checksum() to dandi-cli Jan 3, 2024
@yarikoptic yarikoptic changed the title Adopt dandischema.digests.zarr.get_checksum() to dandi-cli Adopt dandischema.digests.zarr.get_checksum() to dandi-cli: use zarr_checksum library constructs instead of copies in dandischema Jan 3, 2024
@yarikoptic yarikoptic merged commit 72816ab into master Jan 3, 2024
28 checks passed
@yarikoptic yarikoptic deleted the get-checksum branch January 3, 2024 19:48

github-actions bot commented Jan 9, 2024

🚀 PR was released in 0.59.0 🚀

Labels
internal Changes only affect the internal API released
3 participants