
Adopt dandischema.digests.zarr.get_checksum() to dandi-cli: use zarr_checksum library constructs instead of copies in dandischema #1371

Merged Jan 3, 2024 (1 commit)

Conversation

jwodder (Member) commented Dec 8, 2023

And rewrite it in terms of zarr_checksum

Part of dandi/dandi-schema#209

@jwodder jwodder added the internal Changes only affect the internal API label Dec 8, 2023

codecov bot commented Dec 8, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (b9a1099) 88.45% compared to head (4c98c10) 88.53%.

Additional details and impacted files
```
@@            Coverage Diff             @@
##           master    #1371      +/-   ##
==========================================
+ Coverage   88.45%   88.53%   +0.08%
==========================================
  Files          77       77
  Lines       10463    10471       +8
==========================================
+ Hits         9255     9271      +16
+ Misses       1208     1200       -8
```

Flag unittests: 88.53% <100.00%> (+0.08%) ⬆️


```
 files.append(p)
 return ZarrStat(
     size=size,
-    digest=Digest.dandi_zarr(get_checksum(file_md5s, dir_md5s)),
+    digest=Digest.dandi_zarr(checksum_zarr_dir(file_info, dir_info)),
```
Member:
Do we really need to "manually" loop here via dirpath.iterdir(), or could we make use of https://github.com/dandi/zarr_checksum#python:

```
compute_zarr_checksum(yield_files_local("local_path"))
```

I.e., could we reuse yield_files_local to collect the files and then just pass them to compute_zarr_checksum, instead of reimplementing it as checksum_zarr_dir?

Member Author (jwodder):

yield_files_local() appears to only yield files recognized by the zarr module, while our current code recognizes all files that are actually in a *.zarr directory tree, including "extraneous" files.

Member:
Hm, that sounds like a possible source of differences/confusion which we should ideally clarify/unify while we're at it.

I started writing the items below, but then looked at the zarr-python traversal implementation at https://github.com/zarr-developers/zarr-python/blob/HEAD/zarr/storage.py#L1174, and it seems to really list all files. Although that is "not guaranteed", I guess -- but are you sure that it actually excludes any "extraneous" files, @jwodder?

Some thoughts I started to compose:

  • In dandi-cli we typically upload only files recognized by us (unless the DEVEL option allow_any_path is set, or for BIDS datasets), i.e. by default we ignore "extraneous" files too, so zarr_checksum's default (and, at the moment, only) behavior is somewhat in line with that, while our own zarr support diverges from it.
  • In principle we should not silently ignore extraneous files within a .zarr -- but I guess zarr validation doesn't fail on (doesn't care about) them, right?
  • Do we have any knowledge of such extraneous files having been uploaded already? It might be good to know, and since we have the dandizarrs datalad datasets on drogon we might be able to "quickly" (nothing is quick when dealing with that many files) check.

Member Author (jwodder):

> are you sure that it actually excludes any "extraneous" files

I assumed the code was doing something zarr-specific based on the fact that it was using the zarr library. If zarr_checksum just checksums every file present, then there's no need to bring in zarr and its heavy dependencies. (CC @AlmightyYakob)

> but I guess zarr validation doesn't fail (does not care) about them right?

I believe extraneous files are ignored by Zarr validation.

> do we have any knowledge of such extraneous files uploaded already?

I am not aware of any, and I do not know how to identify such files in the first place.

Member:

It does seem that NestedDirectoryStore.keys essentially calls os.walk on the local file path. I also assumed more was at play.

Member:

> What exactly are your remaining concerns, if any, about this PR?

Well, my main remaining concern is the duplication, and thus possible divergence, of implementations/behavior. Since this PR already improves things (we avoid using duplicate implementations within dandi-schema and use more of the zarr_checksum ones) and takes us one step closer to stripping zarr digest specifics from dandi-schema, let's proceed as is.

While reviewing this, though, here are some thoughts and the issues I filed:

  • I agree that (at least for now) we should not look into moving checksumming into the zarr Python library.
  • Generalize zarr_checksum into folder_checksum with a set of format-specific (e.g. zarr) adapters? zarr_checksum#50
  • We already have differences in implementations and, IMHO, a real potential to hit a difference in behavior sooner or later:
    1. Here in dandi-cli we use os.walk directly, but then use exclude_from_zarr to exclude .git etc. Shouldn't such excludes be defined in zarr_checksum instead? Commented on.
    2. zarr_checksum goes via NestedDirectoryStore.keys, which at the moment also uses os.walk without any filtering, but that could change later and we might not even spot it, since we cannot foresee how it might change. Shouldn't we centralize the walker for zarr, e.g. within zarr_checksum?
  • The need for a duplicated implementation in dandi-cli is argued to be warranted because of:
    1. A more efficient threaded walk. Ideally, IMHO, zarr_checksum should just become as efficient, i.e. carry such an implementation. @jwodder -- do you see why that should not or could not be done?
    2. We have it "integrated" with EntryUploadTracker, which (correct me if I'm wrong) is used (along with to_delete) to first collect the list of zarr subparts which changed and thus need to be uploaded (or deleted), to avoid a full upload. And it is during that (threaded!) walk that we also compute the checksum, eventually over all elements of the zarr folder, uploaded or not.
      But both of those aspects are unrelated to the code touched by this PR! Here we do a regular non-threaded walk and do not interact with upload at all. In my original comment I pointed out that we seem to need the custom looping/summarization only to derive the digest and other stats relevant for our ZarrStat and the LocalZarrEntry's it contains. Those largely mimic ZarrDirectoryDigest and ZarrChecksum, but unlike those of zarr_checksum, which do not contain the entire hierarchy, ZarrStat actually contains all the files, flattened while recursively traversing. So for now I guess we are doomed to have this duplicated implementation, but in the longer run I think we should see how we could avoid it.

Member Author (jwodder):

@yarikoptic

> more efficient threaded walk. Ideally IMHO zarr_checksum should just become as efficient, ie carry such an implementation. @jwodder - do you see why that should not or could not be done?

Which efficient threaded walk are you referring to? get_zarr_checksum() or the digesting-for-uploading machinery? I'm fine with the former being in zarr_checksum, but the latter currently does not belong there.

> Shouldn't we centralize the walker for zarr, e.g. within zarr_checksum?

Are you going to create an issue about this? If not, your concern is just going to go into the void.

> So, for now I guess we are doomed to have this duplicated implementation but I think in the longer run we might want to see how we could avoid it.

The entire ZarrAsset.stat() code is currently unused by the library. (I'd look up what commit removed the last use, but I can't get git log -S to honor any word boundary syntax I know of.)

Member:

> The entire ZarrAsset.stat() code is currently unused by the library. (I'd look up what commit removed the last use, but I can't get git log -S to honor any word boundary syntax I know of.)

Something like 47c524c, which seems to have been a big refactoring that stopped using .stat()?

Member Author (jwodder):

No, that commit only got rid of stat'ing Zarr entries on the Archive.

Member:

> > more efficient threaded walk. Ideally IMHO zarr_checksum should just become as efficient, ie carry such an implementation. @jwodder - do you see why that should not or could not be done?
>
> Which efficient threaded walk are you referring to? get_zarr_checksum() or the digesting-for-uploading machinery? I'm fine with the former being in zarr_checksum, but the latter currently does not belong there.

We are on the same page here. Filed dandi/zarr_checksum#52.

> > Shouldn't we centralize the walker for zarr, e.g. within zarr_checksum?
>
> Are you going to create an issue about this? If not, your concern is just going to go into the void.

"""
manifest = ZarrChecksumManifest(
files=[
ZarrChecksum(digest=digest, name=name, size=size)
Member:

Note to myself: instead of using dandi-schema's ZarrJSONChecksumSerializer, as the original get_checksum defined in dandi-schema did, we now start using zarr_checksum's ZarrChecksumManifest and ZarrChecksum. That would allow pruning dandischema.digests.zarr.
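For intuition, the dandi zarr digest has the shape of an md5 over a canonical manifest of per-entry digests, suffixed with the file count and total byte size. The stdlib-only sketch below illustrates that shape; the exact, authoritative serialization is defined by zarr_checksum's `ZarrChecksumManifest.generate_digest`, not by this code, and `manifest_digest` is a hypothetical name.

```python
import hashlib
import json

def manifest_digest(files):
    """Digest a {name: (md5, size)} mapping into a dandi-zarr-like
    "<md5>-<count>--<total_size>" string. Illustrative only: the
    canonical serialization lives in zarr_checksum and may differ."""
    # Sort entries by name so insertion order cannot affect the digest.
    entries = [
        {"digest": digest, "name": name, "size": size}
        for name, (digest, size) in sorted(files.items())
    ]
    payload = json.dumps({"directories": [], "files": entries},
                         separators=(",", ":"))
    md5 = hashlib.md5(payload.encode()).hexdigest()
    total = sum(size for _digest, size in files.values())
    return f"{md5}-{len(entries)}--{total}"
```

Because the entries are sorted before serialization, callers do not need to pre-sort their file listings to get a deterministic digest.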

```
    manifest = ZarrChecksumManifest(
        files=[
            ZarrChecksum(digest=digest, name=name, size=size)
            for name, (digest, size) in files.items()
```
Member:

Note to myself (worth adding as a code comment?): previously we sorted explicitly in

https://github.com/dandi/dandi-schema/blob/bdb7dd681435f1092814041eb01eb4fabe87bf57/dandischema/digests/zarr.py#L150

Now that is not needed, since ZarrChecksumManifest.generate_digest does it: https://github.com/dandi/zarr_checksum//blob/HEAD/zarr_checksum/checksum.py#L80

@yarikoptic yarikoptic changed the title Copy dandischema.digests.zarr.get_checksum() to dandi-cli Adopt dandischema.digests.zarr.get_checksum() to dandi-cli Jan 3, 2024
@yarikoptic yarikoptic changed the title Adopt dandischema.digests.zarr.get_checksum() to dandi-cli Adopt dandischema.digests.zarr.get_checksum() to dandi-cli: use zarr_checksum library constructs instead of copies in dandischema Jan 3, 2024
@yarikoptic yarikoptic merged commit 72816ab into master Jan 3, 2024
28 checks passed
@yarikoptic yarikoptic deleted the get-checksum branch January 3, 2024 19:48

github-actions bot commented Jan 9, 2024

🚀 PR was released in 0.59.0 🚀

Labels
internal Changes only affect the internal API released
3 participants