
do not fscache individual files digests for zarr-checksum #914

Closed
@yarikoptic

Description

A follow-up to #913, which might have more timing information. For that issue's timings I disabled fscacher (after first physically moving aside fscacher's entire dandi digests cache), and the run took about 23 seconds. When I stopped disabling the cache, the run took over 5 minutes (343 s vs 23 s is roughly 15x, so fscacher added on the order of 1400-1500% overhead, if I got it right):

(dandi-devel) jovyan@jupyter-yarikoptic:/shared/ngffdata$ time dandi digest -d zarr-checksum test64.ngff/0/0/0/0
test64.ngff/0/0/0/0: 53ad9d9fe8b4f34882fdaea76599da22
2022-02-16 19:02:54,861 [    INFO] Logs saved in /home/jovyan/.cache/dandi-cli/log/20220216185712Z-5383.log

real    5m43.311s
user    0m24.188s
sys     0m13.128s

Rerunning then took about 4 seconds, which is better than the original ~24 s but still slower than a plain full recompute could be (see #913):

(dandi-devel) jovyan@jupyter-yarikoptic:/shared/ngffdata$ time dandi digest -d zarr-checksum test64.ngff/0/0/0/0
test64.ngff/0/0/0/0: 53ad9d9fe8b4f34882fdaea76599da22
2022-02-16 19:05:03,854 [    INFO] Logs saved in /home/jovyan/.cache/dandi-cli/log/20220216190459Z-5515.log

real    0m4.576s
user    0m2.878s
sys     0m4.593s

And that is with the fscacher from con/fscacher#67 ... maybe that one really needs to become more efficient.

FWIW

With this patch I disabled caching of individual file digests but added caching for the zarr folder as a whole:
(dandi-devel) jovyan@jupyter-yarikoptic:~/dandi-cli$ git diff
diff --git a/dandi/support/digests.py b/dandi/support/digests.py
index 2226ea8..74e5199 100644
--- a/dandi/support/digests.py
+++ b/dandi/support/digests.py
@@ -81,7 +81,7 @@ class Digester:
 checksums = PersistentCache(name="dandi-checksums", envvar="DANDI_CACHE")
 
 
-@checksums.memoize_path
+#@checksums.memoize_path
 def get_digest(filepath: Union[str, Path], digest: str = "sha256") -> str:
     if digest == "dandi-etag":
         return cast(str, get_dandietag(filepath).as_str())
@@ -96,6 +96,7 @@ def get_dandietag(filepath: Union[str, Path]) -> DandiETag:
     return DandiETag.from_file(filepath)
 
 
+@checksums.memoize_path
 def get_zarr_checksum(
     path: Path,
     basepath: Optional[Path] = None,
and it ran "fast" in those original 22 sec, and reloaded result in fscacher in its 3-4 sec
(dandi-devel) jovyan@jupyter-yarikoptic:/shared/ngffdata$ mv  ~/.cache/fscacher/dandi-checksums{,.aside2}
(dandi-devel) jovyan@jupyter-yarikoptic:/shared/ngffdata$ time dandi digest -d zarr-checksum test64.ngff/0/0/0/0
test64.ngff/0/0/0/0: 53ad9d9fe8b4f34882fdaea76599da22
2022-02-16 19:22:10,832 [    INFO] Logs saved in /home/jovyan/.cache/dandi-cli/log/20220216192149Z-6188.log

real    0m22.338s
user    0m8.445s
sys     0m6.815s
(dandi-devel) jovyan@jupyter-yarikoptic:/shared/ngffdata$ time dandi digest -d zarr-checksum test64.ngff/0/0/0/0
test64.ngff/0/0/0/0: 53ad9d9fe8b4f34882fdaea76599da22
2022-02-16 19:23:02,610 [    INFO] Logs saved in /home/jovyan/.cache/dandi-cli/log/20220216192259Z-8240.log

real    0m3.401s
user    0m2.926s
sys     0m5.002s

Meanwhile, I think it would be worth just disabling fscaching of both the individual file digests within a zarr archive (the overhead of storing that many digests on the initial run seems too great to ignore) and the zarr folder altogether (until we make fscaching of folders more efficient), roughly as in the sketch below.
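
Concretely, the end state would look roughly like this (a sketch only, not a tested patch; function bodies elided):

# Sketch: dandi/support/digests.py with fscacher memoization removed from both
# the per-file digest and the zarr folder checksum, so every run recomputes.

# no @checksums.memoize_path -- caching overhead dominates for the many small
# chunk files inside a zarr archive
def get_digest(filepath, digest="sha256"):
    ...

# no @checksums.memoize_path here either, until fscacher's folder handling
# becomes cheaper than the ~3-4 s a cache hit currently costs
def get_zarr_checksum(path, basepath=None):
    ...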

I am not sure whether, alternatively, we should/could come up with some smarter policy/specification for what to cache. E.g. we could parametrize memoize_path to cache only if a file is larger than some specified size (say 500 KB or 1 MB in the case of digests). That would still need a stat() call per file to make the decision, so altogether there might still be some overhead, but I think it would make the caching more flexible/generic; a sketch of that idea follows.
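
For illustration, a minimal sketch of that size-threshold idea, assuming fscacher's existing PersistentCache.memoize_path API; the memoize_path_if_larger helper and its min_size parameter are hypothetical and not part of fscacher today:

import os
from functools import wraps
from pathlib import Path
from typing import Callable, Union


def memoize_path_if_larger(cache, min_size: int) -> Callable:
    """Consult ``cache`` only for files of at least ``min_size`` bytes."""

    def decorator(func: Callable) -> Callable:
        cached = cache.memoize_path(func)  # fscacher's existing decorator

        @wraps(func)
        def wrapper(path: Union[str, Path], *args, **kwargs):
            # The stat() here is the price of deciding; for small files we
            # assume recomputing the digest is cheaper than cache I/O.
            if os.stat(path).st_size >= min_size:
                return cached(path, *args, **kwargs)
            return func(path, *args, **kwargs)

        return wrapper

    return decorator


# Hypothetical usage mirroring digests.py:
# @memoize_path_if_larger(checksums, min_size=1_000_000)  # ~1 MB threshold
# def get_digest(filepath, digest="sha256"): ...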
