checksum_states improvements #390


Merged
merged 5 commits into nipype:master from djarecka:enh/faster_checksum_states
Dec 10, 2020

Conversation

djarecka
Collaborator

@djarecka djarecka commented Dec 9, 2020

Acknowledgment

  • I acknowledge that this contribution will be available under the Apache 2 license.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Summary

  • Improve checksum_states so it does not have to calculate the content hash of big files for every single element of the task state (the function is used in result). A files_hash dictionary keeps track of the hashes of all files used by the task, combining files from different state elements.
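The idea behind the change can be sketched as follows. This is an illustrative example, not pydra's actual implementation: the class and method names are hypothetical, and the cache is keyed on the file's modification time so an unchanged file's hash is reused across state elements instead of being recomputed.

```python
import hashlib
import os


class FileHashCache:
    """Illustrative sketch of a per-task file-hash cache (hypothetical API)."""

    def __init__(self):
        # maps file path -> (mtime, sha256 hex digest)
        self._files_hash = {}

    def hash_file(self, path):
        mtime = os.path.getmtime(path)
        cached = self._files_hash.get(path)
        if cached is not None and cached[0] == mtime:
            # file unchanged since last hash: reuse the stored digest
            return cached[1]
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            # read in chunks so large files don't have to fit in memory
            for chunk in iter(lambda: f.read(65536), b""):
                digest.update(chunk)
        hexdigest = digest.hexdigest()
        self._files_hash[path] = (mtime, hexdigest)
        return hexdigest
```

With this, a state with many elements that all reference the same large input file pays the hashing cost once rather than once per element.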

Checklist

  • All tests passing
  • I have added tests to cover my changes
  • I have updated documentation (if necessary)
  • My code follows the code style of this project
    (we are using black: you can pip install pre-commit,
    run pre-commit install in the pydra directory
    and black will be run automatically with each commit)

… values of all files and doesn't recalculate for each element of the state
@codecov

codecov bot commented Dec 9, 2020

Codecov Report

Merging #390 (a4726a2) into master (6bf13d6) will increase coverage by 0.01%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master     #390      +/-   ##
==========================================
+ Coverage   82.54%   82.56%   +0.01%     
==========================================
  Files          19       19              
  Lines        3827     3831       +4     
  Branches     1045     1047       +2     
==========================================
+ Hits         3159     3163       +4     
  Misses        480      480              
  Partials      188      188              
Flag        Coverage Δ
unittests   82.48% <100.00%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files         Coverage Δ
pydra/engine/core.py   88.40% <100.00%> (+0.07%) ⬆️


Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6bf13d6...a4726a2. Read the comment docs.

@djarecka djarecka requested review from satra and nicolocin December 9, 2020 00:08
@satra
Contributor

satra commented Dec 9, 2020

I do think this is where you would use a separate LRU cache, so you can reuse hashes across tasks. So I would recommend implementing a more general solution to this.

@djarecka
Collaborator Author

djarecka commented Dec 9, 2020

So you want to save the hashes in files?

@satra
Contributor

satra commented Dec 9, 2020

The easiest may indeed be a single "database" file in the cache directory that can be protected against concurrent writes by a soft lock, but it would be harder to maintain an LRU cache there, since items would need to be shuffled.

The alternative would be to create a BaseSpec class attribute that is set before anything is initialized. I don't know the implications for pickling, or how to achieve concurrent/async updates, etc., but I suspect this must have been looked at by people in the community.

class LRUwrapper:
    cache = None

    def __repr__(self):
        return str(self.cache)

class BaseSpec:
    filecache = LRUwrapper()

@djarecka
Collaborator Author

djarecka commented Dec 9, 2020

It depends on how many things we want to fix with this PR. The part I was addressing here is run by the main node.

@satra
Contributor

satra commented Dec 10, 2020

@djarecka - it's fine for this PR to fix the present issue, but I would suggest at least filing an issue for improving hashing efficiency.

@djarecka
Collaborator Author

@satra - oh, OK. I was double-checking this on openmind first and was planning to try to follow your suggestion, but it might be better to do that in a new PR that modifies all the places that use hashes. I will open an issue.

@satra satra merged commit d5d6236 into nipype:master Dec 10, 2020
@djarecka djarecka deleted the enh/faster_checksum_states branch December 30, 2022 20:39