Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MNT: Reimplement file-hash caching in new hashing framework #683

Open
tclose opened this issue Aug 3, 2023 · 2 comments
Open

MNT: Reimplement file-hash caching in new hashing framework #683

tclose opened this issue Aug 3, 2023 · 2 comments
Labels
maintenance Refactors and improvements to code quality.

Comments

@tclose
Copy link
Contributor

tclose commented Aug 3, 2023

A casualty of the new hashing framework introduced by pydra#662 was the removal of file-hash caching (only calculating the file-hash once per task. For large files this could be a significant performance regression so it would be good to work out how to add it back in.

Suggestions

  1. Just cache the checksum in the task object (if we place guards on it changing post-execution, see pydra:#681, this might be sufficient)
  2. Return file mtime as part of bytes_repr and use this to create a local cache. This mapping could be potentially stored on disk for persistence between runs
@tclose
Copy link
Contributor Author

tclose commented Aug 13, 2023

@effigies

My idea for this feature is for the bytes_repr function overload of file-classes (and potentially any other type that you want to cache the hashes for) to yield a "time-stamp" object consisting of a key (file path) and modification time as the first item in the generator.

The calling hash_single function can check for these key/mtime pairs in a "hashes cache" dict loaded from the cache directory, and return the cached hash if present. Otherwise it proceeds through the remaining byte chunks, calculates the hash and then saves it into the hashes cache using the key/mtime

@effigies
Copy link
Contributor

Yes, that makes sense to me. I would probably genericly type the yield type as Union[CacheKey, bytes], where CacheKey is a tuple[Hashable, ...] newtype. You could imagine some type wanting something besides (Path, int), e.g., an S3 path with a version key.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
maintenance Refactors and improvements to code quality.
Projects
None yet
Development

No branches or pull requests

2 participants