Description
A follow-up to #913, which might have more timing information. For the timings in that issue I disabled fscacher (after first physically moving aside the entire fscacher cache for dandi-digests), and the run took about 23 seconds. When I stopped disabling the cache, the run took over 5 minutes (so fscacher made it roughly 15x slower, ~1400% overhead, if I got it right):
(dandi-devel) jovyan@jupyter-yarikoptic:/shared/ngffdata$ time dandi digest -d zarr-checksum test64.ngff/0/0/0/0
test64.ngff/0/0/0/0: 53ad9d9fe8b4f34882fdaea76599da22
2022-02-16 19:02:54,861 [ INFO] Logs saved in /home/jovyan/.cache/dandi-cli/log/20220216185712Z-5383.log
real 5m43.311s
user 0m24.188s
sys 0m13.128s
and rerunning took ~4 sec (better than the original ~24 sec, but still slower than a plain full recompute could be, see #913):
(dandi-devel) jovyan@jupyter-yarikoptic:/shared/ngffdata$ time dandi digest -d zarr-checksum test64.ngff/0/0/0/0
test64.ngff/0/0/0/0: 53ad9d9fe8b4f34882fdaea76599da22
2022-02-16 19:05:03,854 [ INFO] Logs saved in /home/jovyan/.cache/dandi-cli/log/20220216190459Z-5515.log
real 0m4.576s
user 0m2.878s
sys 0m4.593s
and that is using fscacher from con/fscacher#67 ... maybe that one really needs to become more efficient.
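For reference, a rough standalone sketch of how the comparison could be reproduced outside of dandi (illustrative cache name, same zarr subfolder; this is not the code dandi itself runs):

```python
# Digest every chunk file under a zarr subfolder twice: once with a plain
# function and once memoized through fscacher, to see the per-file caching
# overhead on a cold cache (move ~/.cache/fscacher/<name> aside to re-run cold).
import hashlib
import time
from pathlib import Path

from fscacher import PersistentCache

cache = PersistentCache(name="digest-overhead-test")  # illustrative name


def md5_of(path: Path) -> str:
    return hashlib.md5(path.read_bytes()).hexdigest()


memoized_md5_of = cache.memoize_path(md5_of)

files = [p for p in Path("test64.ngff/0/0/0/0").rglob("*") if p.is_file()]

for label, fn in [("no cache", md5_of), ("fscacher, cold cache", memoized_md5_of)]:
    t0 = time.monotonic()
    for p in files:
        fn(p)
    print(f"{label}: {time.monotonic() - t0:.1f}s for {len(files)} files")
```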
FWIW, with this patch I disabled caching of individual file digests but added caching for the whole zarr folder:
(dandi-devel) jovyan@jupyter-yarikoptic:~/dandi-cli$ git diff
diff --git a/dandi/support/digests.py b/dandi/support/digests.py
index 2226ea8..74e5199 100644
--- a/dandi/support/digests.py
+++ b/dandi/support/digests.py
@@ -81,7 +81,7 @@ class Digester:
checksums = PersistentCache(name="dandi-checksums", envvar="DANDI_CACHE")
-@checksums.memoize_path
+#@checksums.memoize_path
def get_digest(filepath: Union[str, Path], digest: str = "sha256") -> str:
if digest == "dandi-etag":
return cast(str, get_dandietag(filepath).as_str())
@@ -96,6 +96,7 @@ def get_dandietag(filepath: Union[str, Path]) -> DandiETag:
return DandiETag.from_file(filepath)
+@checksums.memoize_path
def get_zarr_checksum(
path: Path,
basepath: Optional[Path] = None,
and it ran "fast" in those original 22 sec, and reloaded result in fscacher in its 3-4 sec
(dandi-devel) jovyan@jupyter-yarikoptic:/shared/ngffdata$ mv ~/.cache/fscacher/dandi-checksums{,.aside2}
(dandi-devel) jovyan@jupyter-yarikoptic:/shared/ngffdata$ time dandi digest -d zarr-checksum test64.ngff/0/0/0/0
test64.ngff/0/0/0/0: 53ad9d9fe8b4f34882fdaea76599da22
2022-02-16 19:22:10,832 [ INFO] Logs saved in /home/jovyan/.cache/dandi-cli/log/20220216192149Z-6188.log
real 0m22.338s
user 0m8.445s
sys 0m6.815s
(dandi-devel) jovyan@jupyter-yarikoptic:/shared/ngffdata$ time dandi digest -d zarr-checksum test64.ngff/0/0/0/0
test64.ngff/0/0/0/0: 53ad9d9fe8b4f34882fdaea76599da22
2022-02-16 19:23:02,610 [ INFO] Logs saved in /home/jovyan/.cache/dandi-cli/log/20220216192259Z-8240.log
real 0m3.401s
user 0m2.926s
sys 0m5.002s
Meanwhile, I think it would be worth just disabling fscaching of both the individual file digests within the zarr archive (the overhead of storing that many digests on the initial run seems too great to ignore) and the zarr folder altogether (until we make fscaching of folders more efficient).
Not sure if, alternatively, we should/could come up with some smarter policy/specification for what to cache. E.g. we could parametrize memoize_path to cache only if a file is larger than some specified size (e.g. 500 KB or 1 MB in the case of digests). That would still need an os.stat() call to make the decision, so altogether it might still carry some overhead... but I think it would make it more flexible/generic.
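E.g., a minimal sketch of what such a size-gated policy could look like on dandi's side (not proposing this exact shape for fscacher's API; the 500 KB threshold and helper names here are made up):

```python
# Only consult/populate the fscacher cache for files above a size threshold;
# small files (e.g. tiny zarr chunks) are just digested directly.
import hashlib
import os
from pathlib import Path
from typing import Union

from fscacher import PersistentCache

checksums = PersistentCache(name="dandi-checksums", envvar="DANDI_CACHE")
CACHE_MIN_SIZE = 500 * 1024  # made-up threshold: only cache files >= 500 KB


def _compute_digest(filepath: Union[str, Path], digest: str = "sha256") -> str:
    # Plain digest computation (stands in for the existing implementation).
    h = hashlib.new(digest)
    with open(filepath, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()


# Memoized variant built from the same function, as memoize_path is used today.
_cached_digest = checksums.memoize_path(_compute_digest)


def get_digest(filepath: Union[str, Path], digest: str = "sha256") -> str:
    # The os.stat() needed for the decision is the extra overhead mentioned above.
    if os.stat(filepath).st_size >= CACHE_MIN_SIZE:
        return _cached_digest(filepath, digest=digest)
    return _compute_digest(filepath, digest=digest)
```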