Adopt dandischema.digests.zarr.get_checksum() to dandi-cli: use zarr_checksum library constructs instead of copies in dandischema #1371

Conversation
And rewrite it in terms of `zarr_checksum`. Part of dandi/dandi-schema#209.
Codecov Report: All modified and coverable lines are covered by tests ✅

    @@            Coverage Diff             @@
    ##           master    #1371      +/-   ##
    ==========================================
    + Coverage   88.45%   88.53%   +0.08%
    ==========================================
      Files          77       77
      Lines       10463    10471       +8
    ==========================================
    + Hits         9255     9271      +16
    + Misses       1208     1200       -8
        files.append(p)
    return ZarrStat(
        size=size,
-       digest=Digest.dandi_zarr(get_checksum(file_md5s, dir_md5s)),
+       digest=Digest.dandi_zarr(checksum_zarr_dir(file_info, dir_info)),
Do we really need to "manually" loop here via `dirpath.iterdir()`, or could we make use of https://github.com/dandi/zarr_checksum#python and `compute_zarr_checksum(yield_files_local("local_path"))`? I.e., could we reuse `yield_files_local` to collect the files and then just pass the result to `compute_zarr_checksum`, instead of reimplementing it as `checksum_zarr_dir`?
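For reference, the suggested reuse would look roughly like this sketch, following the linked zarr_checksum README (the path is a placeholder):

```python
from zarr_checksum import compute_zarr_checksum
from zarr_checksum.generators import yield_files_local

# Walk the local Zarr tree with the library's own generator and checksum
# whatever it yields; "path/to/example.zarr" is a placeholder path.
checksum = compute_zarr_checksum(yield_files_local("path/to/example.zarr"))
```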
`yield_files_local()` appears to only yield files recognized by the `zarr` module, while our current code recognizes all files that are actually in a `*.zarr` directory tree, including "extraneous" files.
Hm, that sounds like a possible cause of differences/confusion which we should ideally clarify/unify while at it.

I started writing the items below, but then looked at the zarr-python implementation of traversal at https://github.com/zarr-developers/zarr-python/blob/HEAD/zarr/storage.py#L1174 and it seems to really list all files. Although that is "not guaranteed", I guess. Are you sure that it actually excludes any "extraneous" files @jwodder?

Some thoughts I started to compose:

- in dandi-cli we typically (unless the DEVEL option `allow_any_path` or BIDS datasets) upload only files recognized by us too, i.e. by default we ignore "extraneous" files, so `zarr_checksum`'s default (and, ATM, only) behavior is somewhat in line with that, while our own support of `zarr` diverges from it
- in principle we should not silently ignore extraneous files within `.zarr`, but I guess zarr validation doesn't fail on (does not care about) them, right?
- do we have any knowledge of such extraneous files having been uploaded already? It might be good to know, and since we have dandizarrs datalad datasets on drogon, we might be able to "quickly" (nothing is quick when dealing with that many files) check.
> are you sure that it actually excludes any "extraneous" files

I assumed the code was doing something zarr-specific based on the fact that it was using the `zarr` library. If `zarr_checksum` just checksums every file present, then there's no need to bring in `zarr` and its heavy dependencies. (CC @AlmightyYakob)

> but I guess zarr validation doesn't fail (does not care) about them right?

I believe extraneous files are ignored by Zarr validation.

> do we have any knowledge of such extraneous files uploaded already?

I am not aware of any, and I do not know how to identify such files in the first place.
It does seem that `NestedDirectoryStore.keys` essentially calls `os.walk` on the local file path. I also assumed more was at play.
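In other words, the store's traversal is roughly equivalent to a plain walk like this (an illustrative sketch, not the library's actual code):

```python
import os

def iter_all_files(root: str):
    """Yield every file under a *.zarr tree, extraneous or not,
    mirroring what an os.walk-based store traversal effectively does."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for fname in filenames:
            yield os.path.join(dirpath, fname)
```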
> What exactly are your remaining concerns, if any, about this PR?

Well, my main and remaining concern is the duplication, and thus possible divergence, of implementations/behavior. As this PR already makes things better (we will avoid using duplicate implementations within dandi-schema and use more of the zarr_checksum ones) and takes us one step closer to stripping zarr digest specifics from dandi-schema, let's proceed forward as is.
But while reviewing this, here are some thoughts, issues I filed, etc.:

- I agree that (at least for now) we should not look into moving checksumming into the zarr Python library.
- Generalize `zarr_checksum` into `folder_checksum` with a set of format-specific (e.g. zarr) adapters? zarr_checksum#50
- We already have differences in implementations and IMHO a real potential to hit a difference in behavior sooner or later. Shouldn't we centralize the walker for zarr, e.g. within `zarr_checksum`? (See the exclusion sketch after this list.)
  - here in `dandi-cli` we go for `os.walk` directly, but then use `exclude_from_zarr` to exclude `.git` etc. Shouldn't such excludes be defined in `zarr_checksum` instead? Commented on.
  - in `zarr_checksum` it goes via `NestedDirectoryStore.keys`, which ATM also uses `os.walk` without any filtering etc., but that could change later and we might not even spot it, since we cannot foresee how it might change.
- The need for the duplicated implementation in `dandi-cli` is argued to be warranted because of:
  - a more efficient threaded walk. Ideally IMHO `zarr_checksum` should just become as efficient, i.e. carry such an implementation. @jwodder - do you see why that should not or could not be done?
  - it being "integrated" with `EntryUploadTracker`, which (CMIIW) is used (along with `to_delete`) to first collect a list of zarr subparts which changed and thus need to be uploaded (deleted), to avoid a full upload. And it is during that (threaded!) walk that we also compute the checksum, eventually over all elements of the zarr folder, uploaded or not.

  But both of those aspects are unrelated to the code touched by this PR! Here we do a regular, non-threaded walk and do not interact with upload in any way. In my original comment I pointed out that we seem to just need custom looping/summarization to derive the digest and other stats relevant for our `ZarrStat` and the `LocalZarrEntry`s contained by it. Those at large mimic `ZarrDirectoryDigest` and `ZarrChecksum`, but unlike those of `zarr_checksum`, which do not contain the entire hierarchy, `ZarrStat` actually does contain all the files, flattened while recursively traversing. So, for now I guess we are doomed to have this duplicated implementation, but I think in the longer run we might want to see how we could avoid it.
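For illustration, the `os.walk`-plus-excludes approach described above might look roughly like this; the exclusion names are hypothetical stand-ins, not dandi-cli's actual `exclude_from_zarr` logic:

```python
import os

# Hypothetical exclusion set; dandi-cli's real logic lives in its
# exclude_from_zarr helper, and these names are just illustrative.
EXCLUDED_NAMES = {".git", ".dandi", ".datalad", ".gitattributes"}

def iter_zarr_files(root: str):
    """Walk a *.zarr tree, pruning excluded directories and files."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk does not descend into excluded dirs.
        dirnames[:] = [d for d in dirnames if d not in EXCLUDED_NAMES]
        for fname in filenames:
            if fname not in EXCLUDED_NAMES:
                yield os.path.join(dirpath, fname)
```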
> more efficient threaded walk. Ideally IMHO `zarr_checksum` should just become as efficient, ie carry such an implementation. @jwodder - do you see why that should not or could not be done?

Which efficient threaded walk are you referring to: `get_zarr_checksum()` or the digesting-for-uploading machinery? I'm fine with the former being in `zarr_checksum`, but the latter currently does not belong there.
> Shouldn't we centralize the walker for zarr, e.g. within `zarr_checksum`?

Are you going to create an issue about this? If not, your concern is just going to go into the void.
> So, for now I guess we are doomed to have this duplicated implementation but I think in the longer run we might want to see how we could avoid it.

The entire `ZarrAsset.stat()` code is currently unused by the library. (I'd look up what commit removed the last use, but I can't get `git log -S` to honor any word boundary syntax I know of.)
> The entire `ZarrAsset.stat()` code is currently unused by the library. (I'd look up what commit removed the last use, but I can't get `git log -S` to honor any word boundary syntax I know of.)

Something like 47c524c, which seems to have been a big refactoring that stopped using `.stat()`?
No, that commit only got rid of stat'ing Zarr entries on the Archive.
> > more efficient threaded walk. Ideally IMHO `zarr_checksum` should just become as efficient, ie carry such an implementation. @jwodder - do you see why that should not or could not be done?
>
> Which efficient threaded walk are you referring to: `get_zarr_checksum()` or the digesting-for-uploading machinery? I'm fine with the former being in `zarr_checksum`, but the latter currently does not belong there.

We are on the same page here. Filed dandi/zarr_checksum#52.

> > Shouldn't we centralize the walker for zarr, e.g. within `zarr_checksum`?
>
> Are you going to create an issue about this? If not, your concern is just going to go into the void.
""" | ||
manifest = ZarrChecksumManifest( | ||
files=[ | ||
ZarrChecksum(digest=digest, name=name, size=size) |
Note to myself: instead of using dandi-schema's `ZarrJSONChecksumSerializer`, as was used in the original `get_checksum` defined in dandi-schema, we would start using the `zarr_checksum`-defined `ZarrChecksumManifest` and `ZarrChecksum`. That would allow us to prune `dandischema.digests.zarr`.
manifest = ZarrChecksumManifest(
    files=[
        ZarrChecksum(digest=digest, name=name, size=size)
        for name, (digest, size) in files.items()
Note to myself (worth adding as a comment?): previously we sorted the entries explicitly; now that is not needed, since `ZarrChecksumManifest.generate_digest` does it: https://github.com/dandi/zarr_checksum//blob/HEAD/zarr_checksum/checksum.py#L80
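A minimal sketch of the pattern in question, with hypothetical sample values (the digest strings and sizes below are made up; the `files=` usage follows the diff above, and the `directories=` field is assumed from `zarr_checksum`'s files/directories split):

```python
from zarr_checksum.checksum import ZarrChecksum, ZarrChecksumManifest

# Hypothetical inputs mirroring the file/directory split in this PR:
# entry name -> (md5-style digest, size in bytes). Values are made up.
files = {"0/0": ("9e107d9d372bb6826bd81d3542a419d6", 100)}
directories = {"0": ("4c93f6b1a3b2c1d0e9f8a7b6c5d4e3f2-1--100", 100)}

manifest = ZarrChecksumManifest(
    files=[
        ZarrChecksum(digest=digest, name=name, size=size)
        for name, (digest, size) in files.items()
    ],
    directories=[
        ZarrChecksum(digest=digest, name=name, size=size)
        for name, (digest, size) in directories.items()
    ],
)
# generate_digest() sorts entries internally, so no explicit sorting
# is needed before building the manifest.
digest = manifest.generate_digest()
```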
🚀 PR was released in