Ensure immutability of `CalcJobNode` hash before and after storing #3130

sphuber · 2019-07-04T14:36:29Z

The `Node._get_objects_to_hash` method, which returns all the objects of
a `Node` instance that should be included in computing its hash, also
included the file repository. This proved problematic for the
`CalcJobNode` subclass. The hash of a node is computed during the `store`
method, in order to determine whether it can be cached from an existing
node with an identical hash. At that point the repository of the node is
empty, but as soon as the node is stored and the process is started, the
input files that the `CalcJob` class generates from the inputs are stored
in the repository of the node. If the hash were to be recomputed at that
point, it would differ from the original hash computed during storing.
This breaks the caching mechanism.
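The mismatch described above can be reproduced with a toy model. This is only a sketch and not AiiDA code: `ToyNode`, `compute_hash`, the attribute values and the file name are hypothetical stand-ins, and SHA-256 over JSON is an assumed replacement for the real hashing routine.

```python
import hashlib
import json


def compute_hash(objects):
    """Hash a JSON-serialisable list of objects (assumed stand-in for the real hashing)."""
    payload = json.dumps(objects, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class ToyNode:
    """Hypothetical, simplified node: attributes plus a file repository."""

    def __init__(self, attributes):
        self.attributes = dict(attributes)
        self.repository = {}  # maps file name -> file content

    def _get_objects_to_hash(self):
        # Mirrors the problematic behaviour: the repository is part of the hash.
        return [self.attributes, self.repository]

    def get_hash(self):
        return compute_hash(self._get_objects_to_hash())


node = ToyNode({"code": "pw.x", "ecutwfc": 30})

# Hash as it would be computed during storing: the repository is still empty.
hash_at_store = node.get_hash()

# Once the process starts, the generated input files land in the repository.
node.repository["aiida.in"] = "ecutwfc = 30\n"
hash_after_start = node.get_hash()

hashes_differ = hash_at_store != hash_after_start
print("hash changed after input files were written:", hashes_differ)
```

In this model, a lookup that recomputes the hash after the process has started would no longer match the hash recorded at store time.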

Since the input files that the `CalcJob` writes to the repository are
derived from the input nodes (which are included in the hash), they do
not semantically add anything to the provenance, so we can ignore them.
Because this only applies to the `CalcJobNode`, we override the method in
that subclass and omit the repository.
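A minimal sketch of the override, again using hypothetical toy classes rather than the actual AiiDA implementation. Here `attributes` stands in for everything other than the repository that goes into the hash, and SHA-256 over JSON is an assumed replacement for the real hashing.

```python
import hashlib
import json


class ToyNode:
    """Hypothetical base node whose hash includes the file repository."""

    def __init__(self, attributes):
        self.attributes = dict(attributes)
        self.repository = {}  # maps file name -> file content

    def _get_objects_to_hash(self):
        return [self.attributes, self.repository]

    def get_hash(self):
        payload = json.dumps(self._get_objects_to_hash(), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ToyCalcJobNode(ToyNode):
    """Hypothetical subclass that leaves the repository out of the hash."""

    def _get_objects_to_hash(self):
        # The repository only holds files derived from the inputs, so skip it.
        return [self.attributes]


calc = ToyCalcJobNode({"code": "pw.x", "ecutwfc": 30})
hash_at_store = calc.get_hash()

# Writing the generated input files no longer affects the hash.
calc.repository["aiida.in"] = "ecutwfc = 30\n"
hash_after_start = calc.get_hash()

hash_is_stable = hash_at_store == hash_after_start
print("hash unchanged after input files were written:", hash_is_stable)
```

Only the subclass changes behaviour; the toy base class still hashes its repository, which matches the statement that the issue applies to `CalcJobNode` alone.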

N.B.: this commit also loosens the condition under which a completed
process node is considered a valid cache. Previously, only processes
that finished successfully (i.e. with process state `finished` and exit
status `0`) were considered valid caches. This is loosened to also
accept non-zero exit statuses. Under the new rule, every process that
has `finished` is a valid cache, excluding only `excepted` and `killed`
processes.
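The old and new rules can be compared side by side. The function names below are hypothetical and the process states are plain strings for illustration; the actual check in the code base may be structured differently.

```python
def is_valid_cache_before(process_state, exit_status):
    """Old rule (sketch): only finished processes with exit status 0 qualify."""
    return process_state == "finished" and exit_status == 0


def is_valid_cache_after(process_state, exit_status):
    """New rule (sketch): any finished process qualifies, whatever its exit status."""
    return process_state == "finished"


# (process_state, exit_status) pairs; excepted and killed processes have no exit status here.
cases = [
    ("finished", 0),
    ("finished", 400),
    ("excepted", None),
    ("killed", None),
]

for state, status in cases:
    print(state, status, is_valid_cache_before(state, status), is_valid_cache_after(state, status))
```

The only case whose outcome changes is a `finished` process with a non-zero exit status.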

ltalirz

thanks @sphuber for the fix!

sphuber force-pushed the fix_3125_calcjob_hash branch from 8c4385e to ceb3c78 July 4, 2019 15:50

ltalirz approved these changes Jul 5, 2019

ltalirz merged commit 6d4f1ec into aiidateam:develop Jul 5, 2019

sphuber deleted the fix_3125_calcjob_hash branch July 5, 2019 07:31

sphuber mentioned this pull request Jul 5, 2019

update caching docs #3111

Merged

greschd mentioned this pull request Feb 11, 2020

improvements to caching documentation / ease of use #2549

Open

7 tasks
