checksum_states improvements #390


Merged
merged 5 commits into nipype:master from djarecka:enh/faster_checksum_states
Dec 10, 2020

Conversation

djarecka
Collaborator

@djarecka djarecka commented Dec 9, 2020

Acknowledgment

  • I acknowledge that this contribution will be available under the Apache 2 license.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Summary

  • Improve checksum_states so it does not have to calculate the content hash of big files for every single element of the task state (the function is used in result). A files_hash dictionary keeps track of the hashes of all files used by the task, combining files from different state elements.
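The idea behind the change can be sketched as follows. This is an illustrative example, not pydra's actual implementation: the class and method names are hypothetical, and the cache is keyed on the file's modification time so an unchanged file's hash is reused across state elements instead of being recomputed.

```python
import hashlib
import os


class FileHashCache:
    """Illustrative sketch of a per-task file-hash cache (hypothetical API)."""

    def __init__(self):
        # maps file path -> (mtime, sha256 hex digest)
        self._files_hash = {}

    def hash_file(self, path):
        mtime = os.path.getmtime(path)
        cached = self._files_hash.get(path)
        if cached is not None and cached[0] == mtime:
            # file unchanged since last hash: reuse the stored digest
            return cached[1]
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            # read in chunks so large files don't have to fit in memory
            for chunk in iter(lambda: f.read(65536), b""):
                digest.update(chunk)
        hexdigest = digest.hexdigest()
        self._files_hash[path] = (mtime, hexdigest)
        return hexdigest
```

With this, a state with many elements that all reference the same large input file pays the hashing cost once rather than once per element.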

Checklist

  • All tests passing
  • I have added tests to cover my changes
  • I have updated documentation (if necessary)
  • My code follows the code style of this project
    (we are using black: you can pip install pre-commit,
    run pre-commit install in the pydra directory
    and black will be run automatically with each commit)

… values of all files and doesn't recalculate for each element of the state
@codecov

codecov bot commented Dec 9, 2020

Codecov Report

Merging #390 (a4726a2) into master (6bf13d6) will increase coverage by 0.01%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master     #390      +/-   ##
==========================================
+ Coverage   82.54%   82.56%   +0.01%     
==========================================
  Files          19       19              
  Lines        3827     3831       +4     
  Branches     1045     1047       +2     
==========================================
+ Hits         3159     3163       +4     
  Misses        480      480              
  Partials      188      188              
Flag        Coverage Δ
unittests   82.48% <100.00%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files         Coverage Δ
pydra/engine/core.py   88.40% <100.00%> (+0.07%) ⬆️


Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6bf13d6...a4726a2. Read the comment docs.

@djarecka djarecka requested review from satra and nicolocin December 9, 2020 00:08
@satra
Contributor

satra commented Dec 9, 2020

I do think this is where you would use a separate LRU cache, so you can reuse hashes across tasks. So I would recommend implementing a more general solution to this.

@djarecka
Collaborator Author

djarecka commented Dec 9, 2020

So you want to save the hashes in files?

@satra
Contributor

satra commented Dec 9, 2020

The easiest may indeed be a single "database" file in the cache directory that can be protected against concurrent writes by a soft lock, but it would be harder to maintain an LRU cache there, since items would need to be shuffled.

The alternative would be to create a BaseSpec class attribute that is set before anything is initialized. I don't know the implications for pickling, or how to achieve concurrent/async updates, etc., but I suspect this must have been looked at by people in the community.

class LRUwrapper:
    cache = None

    def __repr__(self):
        return str(self.cache)

class BaseSpec:
    filecache = LRUwrapper()

@djarecka
Collaborator Author

djarecka commented Dec 9, 2020

It depends on how many things we want to fix with this PR. The part I was addressing here is run by the main node.

@satra
Contributor

satra commented Dec 10, 2020

@djarecka - it's fine for this PR to fix the present issue, but I would suggest at least filing an issue for improving hashing efficiency.

@djarecka
Collaborator Author

@satra - oh, OK. I was double-checking this on openmind first and was planning to try to follow your suggestion, but it might be better to do that in a new PR that modifies all the places that use hashes. I will open an issue.

@satra satra merged commit d5d6236 into nipype:master Dec 10, 2020
@djarecka djarecka deleted the enh/faster_checksum_states branch December 30, 2022 20:39