
CalcJob: add the option to stash files after job completion #4424

Merged

Conversation


@sphuber sphuber commented Oct 6, 2020

Fixes #2150

Tests and documentation still to be added once API is agreed upon.

@sphuber sphuber force-pushed the fix/2150/add-archive-transport-task branch 2 times, most recently from 6ad761d to 1693fa7 on October 6, 2020 21:26

codecov bot commented Oct 6, 2020

Codecov Report

Merging #4424 (f05e9ea) into develop (241f251) will decrease coverage by 0.08%.
The diff coverage is 56.48%.


@@             Coverage Diff             @@
##           develop    #4424      +/-   ##
===========================================
- Coverage    79.57%   79.50%   -0.07%     
===========================================
  Files          515      519       +4     
  Lines        36941    37076     +135     
===========================================
+ Hits         29392    29473      +81     
- Misses        7549     7603      +54     
Flag Coverage Δ
django 74.24% <56.48%> (-0.05%) ⬇️
sqlalchemy 73.15% <56.48%> (-0.02%) ⬇️


Impacted Files Coverage Δ
aiida/engine/daemon/execmanager.py 62.79% <3.58%> (-6.96%) ⬇️
aiida/orm/nodes/data/remote/base.py 65.96% <14.29%> (ø)
aiida/engine/processes/calcjobs/tasks.py 65.43% <24.49%> (-5.95%) ⬇️
aiida/engine/processes/calcjobs/calcjob.py 86.43% <89.29%> (+0.95%) ⬆️
aiida/orm/nodes/data/remote/stash/folder.py 96.43% <96.43%> (ø)
aiida/common/datastructures.py 96.88% <100.00%> (+0.33%) ⬆️
aiida/orm/nodes/data/__init__.py 100.00% <100.00%> (ø)
aiida/orm/nodes/data/remote/__init__.py 100.00% <100.00%> (ø)
aiida/orm/nodes/data/remote/stash/__init__.py 100.00% <100.00%> (ø)
aiida/orm/nodes/data/remote/stash/base.py 100.00% <100.00%> (ø)
... and 6 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@sphuber sphuber force-pushed the fix/2150/add-archive-transport-task branch 3 times, most recently from 3334c48 to 38fc5ec on November 27, 2020 14:18
@sphuber sphuber marked this pull request as ready for review November 27, 2020 14:22
@sphuber sphuber changed the title CalcJob: add the option to archive files after job completion CalcJob: add the option to stash files after job completion Nov 30, 2020
@sphuber sphuber force-pushed the fix/2150/add-archive-transport-task branch from 38fc5ec to 66604f9 on December 8, 2020 19:36
@giovannipizzi giovannipizzi left a comment

Some suggestions for changes (I didn't do a complete review)

@ramirezfranciscof
Member

I think I am a bit wary of introducing more explicit file handling inside of the engine. I remember that not so long ago we were discussing some limitations that were the result of the engine copying files directly instead of using the methods of RemoteData. Was this for extending the copying of files from one remote machine to another? I don't remember now; I do remember the underlying issue was that these methods would open and close transports in an inefficient way and that it might be convenient to have methods that USE already open transports.

Perhaps if we were to deal with that now (since we are anyway modifying the RemoteData class) we could minimize the logic contained inside of the engine, which would just orchestrate the methods of the target remote node. Moreover, this would also prevent having the stash and unstash logic separated (both would live inside the respective remote class instead of one being in the engine and the other inside the transfer calcjob).

So instead of having a RemoteFolderData where we run jobs and RemoteStashData where we stash them, we would have a RemoteFolderData to run jobs but which could also be used as a class for the stashing (if you just copy), or you could also stash in a RemoteZipData, or a RemoteScriptData (which uses a bashscript for its methods), etc.

Does this idea make any sense?


sphuber commented Dec 10, 2020

I think I am a bit wary of introducing more explicit file handling inside of the engine. I remember that not so long ago we were discussing some limitations that were the result of the engine copying files directly instead of using the methods of RemoteData. Was this for extending the copying of files from one remote machine to another? I don't remember now; I do remember the underlying issue was that these methods would open and close transports in an inefficient way and that it might be convenient to have methods that USE already open transports.

Yes, the problem with the current implementation of RemoteData is that it will open a new connection itself. If this code is then used directly in processes that are run in large quantities, you can easily open too many connections and get banned. You could change the implementation to receive an open connection that it then uses, but then you are simply moving the problem: the code that calls the RemoteData becomes responsible for opening a connection, and if this is not done in a clever way, you end up with the same problem. The advantage of having the engine do these things is that it has all the necessary infrastructure, such as the transport queue.

Note that this is not the only complexity that the engine handles well and that would be difficult to reproduce in custom code. What happens if your call over the transport fails because the connection is interrupted? The engine again has complex functionality in the EBM (exponential backoff mechanism) to automatically retry and to pause if the error persists. Recreating this behavior in client scripts is not trivial. I don't think you can get away from the fact that if you want to perform tasks over a transport, you will have to write asynchronous code, which for most users will not be trivial. It is for that reason that I don't think we can give the RemoteData a simple interface to do operations on, at least not when it is used in automated processes. For manual use in, let's say, a shell or on the command line, that would be different. But you wouldn't be able to just use it in your work chains or calculation jobs. I am not sure if all this is clear.

So instead of having a RemoteFolderData where we run jobs and RemoteStashData where we stash them, we would have a RemoteFolderData to run jobs but which could also be used as a class for the stashing (if you just copy), or you could also stash in a RemoteZipData, or a RemoteScriptData (which uses a bashscript for its methods), etc.

I think this refers more to issue #4626, doesn't it? As I responded there to one of Leo's comments, I don't think the class should be determined by what type of stashing was used. It should rather be distinguished from a RemoteFolderData in the sense that that one can be readily accessed, whereas the RemoteStashData might not be, because it may be archived, written to tape, etc.

Does this idea make any sense?

That depends. If I understand your concern correctly (correct me if I am wrong) you do not like the concept that the interaction with this remote data has to be done indirectly through the engine. Instead, you would prefer to see this functionality directly accessible and usable through the various data classes. I hope I have explained clearly why I think that this approach won't work for cases where these actions have to be done in an automated way, that is to say in workchains. Of course for the single manual operation it might work, but I doubt that this is the majority of the use cases.

@sphuber sphuber force-pushed the fix/2150/add-archive-transport-task branch from 66604f9 to 9eabedf on December 10, 2020 17:01
@giovannipizzi
Member

To comment on the discussion above: I think this might be the right moment to change the interface of the RemoteData concerning auto-opening transports.
I would add a parameter transport to all concerned functions, e.g. is_empty, getfile, listdir, listdir_withattributes, _clean.
For backward compatibility, we make it optional for now - if not specified, we open a transport but also raise a deprecation warning. In 2.0, we make the parameter compulsory.

We instead add a get_transport method to do it (I would make it public; I think it's convenient for the users, which is the reason we were doing it "implicitly" in the first place).

If possible (though I'm not sure it's easy) we should also check that the transport is "correct", i.e. that it connects to the correct machine.

Reason: while I agree that this only shifts the problem, it shifts it in the right direction: currently, even a conscientious user cannot avoid opening a lot of connections. After this change, you can at least do something like:

  • get a list of remote data, group by computer
  • open the transport once
  • call all methods, for all remote data on the same computer, within the same connection

This will open up the possibility in the future to have e.g. a workflow step get a transport to perform operations (as it is now, you would have to reimplement those methods instead, if that possibility is implemented).
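
For illustration, a rough sketch of the batching pattern described above. This is hedged: the transport= keyword on the RemoteData methods is only a proposal at this point and does not exist, while the query and Computer.get_transport() call are ordinary AiiDA usage.

    from collections import defaultdict
    from aiida import load_profile, orm

    load_profile()

    # Group all RemoteData nodes by the computer they live on.
    remotes = orm.QueryBuilder().append(orm.RemoteData).all(flat=True)
    by_computer = defaultdict(list)
    for remote in remotes:
        by_computer[remote.computer.uuid].append(remote)

    # Open a single connection per computer and reuse it for every node.
    for nodes in by_computer.values():
        with nodes[0].computer.get_transport() as transport:
            for node in nodes:
                print(node.listdir(transport=transport))  # proposed keyword, not yet available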

@giovannipizzi
Member

Regarding multiple (sub)classes vs. same class with an attribute to indicate the type: I guess it's (up to a certain level) a matter of taste.
In the meeting a few days ago, we converged on a single StashData, with attributes distinguishing the type (and different from RemoteFolderData).
I guess having an attribute makes it easy to support more types without having to define more classes.
The only thing to think about is how different the Python API will be.
If we expect them to be essentially the same, independent of the StashData type, then a single class+attribute to distinguish is OK.
If we want to provide an API to act on each of them, and this differs between types, then we should probably go with the multiple-subclasses approach instead? I'm not sure, however, what we might want to support.
Definitely, however, even if we support multiple subclasses, I would vote for a single (sub)class for the StashData generated by the custom script installed on the supercomputer (still to be implemented), since we have no control over what the script will do beyond the API that was used to invoke it.

Finally, regarding conversion to RemoteFolderData:

  • in the case of copy, we can/should provide a simple calcfunction that converts a StashData in mode copy to a RemoteData
  • for any other type, I think converting a StashData to a RemoteData should/could be done with new options of the TransferCalcJob


ramirezfranciscof commented Dec 10, 2020

Note that this is not the only complexity that the engine handles well and that would be difficult to reproduce in custom code. What happens if your call over the transport fails because the connection is interrupted? The engine again has complex functionality in the EBM (exponential backoff mechanism) to automatically retry and to pause if the error persists. Recreating this behavior in client scripts is not trivial. I don't think you can get away from the fact that if you want to perform tasks over a transport, you will have to write asynchronous code, which for most users will not be trivial. It is for that reason that I don't think we can give the RemoteData a simple interface to do operations on, at least not when it is used in automated processes. For manual use in, let's say, a shell or on the command line, that would be different. But you wouldn't be able to just use it in your work chains or calculation jobs. I am not sure if all this is clear.

I don't see that what I'm proposing necessarily implies giving up on the engine's extra safeguards. For example, I'm thinking about changing:

        try:
            transport.copy(source_filepath, target_filepath)
        except (IOError, ValueError) as exception:
            execlogger.warning(f'failed to stash {source_filepath} to {target_filepath}: {exception}')

For:

        try:
            remote_folder.raw_copy(transport, source_filepath)
        except (SomeErrorType) as exception:
            execlogger.warning(f'failed to stash {source_filepath} to {remote_folder.path}: {exception}')

The raw_copy method does not currently exist, but it would be like the copy one except that it uses the provided transport instead of opening one. This might not seem like much, but it would for example allow passing the class (RemoteFolderData, RemoteZipData, RemoteScriptData, etc.) instead of specifying a string for the kind of stashing, which could significantly reduce upkeep and interdependence between the different parts of the code.

One thing that might complicate this, however, is that a truly generic copy procedure depends both on the transport and class of the source and on the transport and class of the target. For example, I am not sure what kind of transport is being used in the example above (or in the execmanager in general) because it allows copying a source from the remote machine into a target on the same remote machine: is this ssh? Because I would expect the ssh transport to copy stuff from the local machine to the remote machine. Now, looking at the plugin code, I see that apparently it has methods for both: copy for the first and put for the second. More confusingly though, the local transport also has both of these methods? And now I'm starting to get lost on how transports are supposed to work... can you copy files between machines with the local transport's put?

I think this refers more to issue #4626, doesn't it? As I responded there to one of Leo's comments, I don't think the class should be determined by what type of stashing was used. It should rather be distinguished from a RemoteFolderData in the sense that that one can be readily accessed, whereas the RemoteStashData might not be, because it may be archived, written to tape, etc.

Yeah, sorry; I did see the comment there. I actually started writing my comment there, but since it was more of a "bigger picture" comment related to many aspects, I wasn't sure exactly where to put it, and it ended up here (there are a lot of current threads related to the different parts of this project). Also, maybe the part you quote makes more sense in combination with what I'm proposing above, so that the class actually contains the logic on how to access the data through its methods.

EDIT: btw sorry for overlapping with @giovannipizzi , I started writing before his messages were sent he-he


sphuber commented Dec 10, 2020

To comment on the discussion above: I think this might be the right moment to change the interface of the RemoteData concerning auto-opening transports. I would add a parameter transport to all concerned functions, e.g. is_empty, getfile, listdir, listdir_withattributes, _clean.

I agree that this would definitely be a step in the right direction. These methods should ultimately be moved from the RemoteData to the RemoteFolderData, because the RemoteData will no longer necessarily be a "folder"; it won't even be instantiable any more. I have to think about how the deprecation pathway would go.

For backward compatibility, we make it optional for now - if not specified, we open a transport but also raise a deprecation warning. In 2.0, we make the parameter compulsory.

I am afraid that might be too fast. We sort of agreed to no longer deprecate things in 1.6 that will be removed in 2.0.

We instead add a get_transport method to do it (I would make it public; I think it's convenient for the users, which is the reason we were doing it "implicitly" in the first place). If possible (though I'm not sure it's easy) we should also check that the transport is "correct", i.e. that it connects to the correct machine.

This is indeed a problem and a critical check to perform, but I am not sure it is currently possible. The Transport only gets the hostname of the Computer instance. Now one could argue that the hostname should uniquely determine the actual computer; however, in AiiDA it is definitely possible to have multiple Computer instances with the same hostname. A RemoteData will however be coupled to only one of those instances. It is currently not possible to determine from the Transport instance which computer that should be. However, I am also not sure if getting the wrong computer in that case could cause problems. They do represent the same physical machine after all. I guess the easiest would be to pass the Computer instance into the Transport constructor, but this then tightly couples that class to an AiiDA ORM class, which is suboptimal as well.

Reason: while I agree that this only shifts the problem, it shifts it in the right direction: currently, even a conscientious user cannot avoid opening a lot of connections. This will open up the possibility in the future to have e.g. a workflow step get a transport to perform operations (as it is now, you would have to reimplement those methods instead, if that possibility is implemented).

While this is true, it does not solve the problem of connection failures. If you add this code to your workflow, even though it may use controlled transports so as not to overload the machine, the transport operation may still fail, which will raise an exception and cause the entire workchain to except, toppling everything upstream as well. I think one would have to wrap this code in an EBM-like retry. This is possible, but it would require the user to start writing asynchronous code.

@ramirezfranciscof I see now that you are proposing to move the logic for the actual operations as much as possible into the data types. I am not opposed to this per se and it might be good in certain cases. I would have to look more into this. However, as explained above, this would still not solve the main problems unless they are actually called by the engine in the calcjobs. Users calling this code themselves in their workchains is problematic. Now, it would be great if we could fix this in a generic way, but that will certainly be far from trivial.

The raw_copy method does not currently exist, but it would be like the copy one except that it uses the provided transport instead of opening one. This might not seem like much, but it would for example allow passing the class (RemoteFolderData, RemoteZipData, RemoteScriptData, etc.) instead of specifying a string for the kind of stashing

What do you mean with "instead of specifying a string"? You mean the input in the stash metadata options to indicate the type of stashing that needs to be performed? I don't see what passing a class would solve. Since this will have to be stored as an attribute, the class will have to be serialized, which essentially means it is transformed into a string anyway.

For example, I am not sure what kind of transport is being used in the example above (or in the execmanager in general) because it allows copying a source from the remote machine into a target on the same remote machine: is this ssh?

It is whatever transport the computer is configured to require, the computer being the one to which that calculation job was submitted. If it was submitted to a computer with local transport, it will be a local transport instance.

Because I would expect the ssh transport to copy stuff from the local machine to the remote machine.

You are confusing things here: the "remote" machine is simply the computer to which the calcjob was submitted. This might just as well be the localhost, in which case it would most likely use the local transport, but not necessarily: you can also configure a connection over ssh to your localhost. Again, the "remote" computer in AiiDA is not always "really remote". Important not to forget that.

More confusingly though, the local transport also has both of these methods? And now I'm starting to get lost on how transports are supposed to work... can you copy files between machines with the local transport's put?

The Transport is simply an abstract interface to copy and retrieve files and folders from one computer to the other. AiiDA is always running on the "one" computer, and it is connecting to the "other". The SshTransport facilitates this interaction if the "other" computer is (only) reachable over an SSH connection. The LocalTransport implements the operations if the "other" computer is the same as the "one" computer. That's all.
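
For illustration, a minimal sketch of this abstraction, assuming a computer labelled 'my-cluster' has already been configured (the label and paths are made up):

    from aiida import load_profile, orm

    load_profile()

    computer = orm.load_computer('my-cluster')  # the "other" computer, possibly localhost

    with computer.get_transport() as transport:
        # put: from the machine where AiiDA runs to the "other" computer
        transport.put('/tmp/input.txt', '/scratch/user/calc/input.txt')
        # copy: between two locations on the "other" computer itself
        transport.copy('/scratch/user/calc/input.txt', '/scratch/user/stash/input.txt')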


ramirezfranciscof commented Dec 11, 2020

It is whatever transport the computer is configured to require, the computer being the one to which that calculation job was submitted. If it was submitted to a computer with local transport, it will be a local transport instance.

You are confusing things here: the "remote" machine is simply the computer to which the calcjob was submitted. This might just as well be the localhost, in which case it would most likely use the local transport, but not necessarily: you can also configure a connection over ssh to your localhost. Again, the "remote" computer in AiiDA is not always "really remote". Important not to forget that.

The Transport is simply an abstract interface to copy and retrieve files and folders from one computer to the other. AiiDA is always running on the "one" computer, and it is connecting to the "other". The SshTransport facilitates this interaction if the "other" computer is (only) reachable over an SSH connection. The LocalTransport implements the operations if the "other" computer is the same as the "one" computer. That's all.

Ah, I see. Yes, I do know the remote computer can actually be your local one (I have run on localhost many times, haha), but I thought it might use the same ssh transport for that (I mean, you can ssh to your own computer). However, it is true that it does not have the information to do that if the computer's transport is set to local (i.e. the passwords, key, etc. that go into the computer config). It makes sense in retrospect.

Also thanks for the explanation on the transports; so basically for local transport the put and copy methods will do the same thing.

@ramirezfranciscof I see now that you are proposing to move the logic for the actual operations as much as possible into the data types. I am not opposed to this per se and it might be good in certain cases. I would have to look more into this. However, as explained above, this would still not solve the main problems unless they are actually called by the engine in the calcjobs. Users calling this code themselves in their workchains is problematic. Now, it would be great if we could fix this in a generic way, but that will certainly be far from trivial.

I see what you mean; I guess @giovannipizzi was thinking more exclusively about allowing the user the "performant" way of transporting, not necessarily the whole retry-on-error infrastructure. They would anyway be able to do some manual retry, no? Like, for example:

# very pseudo pseudo-code
tries = 0
error = True
while error and tries < 5:
    tries += 1
    error = None
    try:
        folder_node.copy(transport, file)  # hypothetical method that reuses an open transport
    except RaisedError as er:  # placeholder exception type
        error = er

The "general solution" problem seems interesting though. I clearly don't know enough of how this part work to propose a very specific solution, but would it be possible to automatically convert the steps of the workchain into async routines? And then maybe when you execute them if they catch a RetryError type (which would have to be manually raised by the workchain designer) it would mean they have to retry this step?

What do you mean with "instead of specifying a string"? You mean the input in the stash metadata options to indicate the type of stashing that needs to be performed? I don't see what passing a class would solve. Since this will have to be stored as an attribute, the class will have to be serialized, which essentially means it is transformed into a string anyway.

Well, sure, you serialize to store, but you already have rules to serialize classes, no? What I mean is that now you don't need to create and maintain an "extra" equivalence relationship with the enum of possible stashing methods (you have the definition of the enum in one part of the code, but then you have to "if ==" through it to decide which code to execute). The classes available are the possible stashing methods: the code of what to do is in the class, and passing the class is giving the instructions, so there is only one place where the option is fully specified.

If you will, it would be the same as if your enum were instead a dict, translator = {option_name: code_to_execute}, and you just executed translator[option_passed] (if such a thing were possible).

Since right now there is only one possible option, perhaps I'm not imagining correctly how you were thinking of implementing the selector of what to do once the functionality is extended.
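
For illustration, a hedged sketch of the two dispatch styles being contrasted here, with made-up callables (this is not the actual implementation in the PR):

    from enum import Enum

    class StashMode(Enum):
        COPY = 'copy'  # the only mode discussed so far

    def stash_copy(transport, source, target):
        """Placeholder for copy-based stashing using an already open transport."""
        transport.copy(source, target)

    # Style 1: an enum plus if/else to select the behaviour.
    def stash_with_enum(mode, transport, source, target):
        if mode == StashMode.COPY:
            stash_copy(transport, source, target)
        else:
            raise ValueError(f'unsupported stash mode: {mode}')

    # Style 2: the "translator" dict, mapping each option directly to its code.
    translator = {StashMode.COPY: stash_copy}

    def stash_with_dict(mode, transport, source, target):
        translator[mode](transport, source, target)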


sphuber commented Dec 11, 2020

I see what you mean; I guess @giovannipizzi was thinking more exclusively about allowing the user the "performant" way of transporting, not necessarily the whole retry-on-error infrastructure. They would anyway be able to do some manual retry, no?

Yes, they would be able to do this. Still, that doesn't solve the problem that people will put this in workflows, which is where it will break down. You will want a "sleep" in your retry to give a grace period, and if you don't do this asynchronously, you block the entire thread. Then of course you still want to "pause" rather than except after the maximum number of retries, which again is not trivial.
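
For illustration, a minimal asyncio sketch of the kind of non-blocking retry with a grace period being described here; the coroutine, exception type and intervals are made up, and this is not the engine's actual EBM code.

    import asyncio

    async def transport_operation():
        """Placeholder coroutine for some operation performed over a transport."""

    async def retry_with_backoff(max_attempts=5, initial_interval=20.0):
        for attempt in range(max_attempts):
            try:
                return await transport_operation()
            except IOError:
                if attempt + 1 == max_attempts:
                    raise  # the engine would instead pause the process at this point
                # awaiting the sleep yields to the event loop instead of blocking the thread
                await asyncio.sleep(initial_interval * 2 ** attempt)

    asyncio.run(retry_with_backoff())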

The "general solution" problem seems interesting though. I clearly don't know enough of how this part work to propose a very specific solution, but would it be possible to automatically convert the steps of the workchain into async routines? And then maybe when you execute them if they catch a RetryError type (which would have to be manually raised by the workchain designer) it would mean they have to retry this step?

As far as I am aware, it is not possible to convert sync code to async code on the fly. It is not just a matter of slapping async in front of the method definition; you also have to add await in the parts of the code where you call other coroutines. The only solution is to require users to write async code to begin with. But I am not sure whether we can allow writing "mixed" code, i.e. have only some of the outline steps in a workchain be async.

Since right now there is only one possible option, perhaps I'm not imagining correctly how you were thinking of implementing the selector of what to do once the functionality is extended.

No, that is more or less how I was imagining implementing it and extending it once more modes become available. I don't really see why this is such a problem, though. Sure, you can remove the if/else and just rely on passing the class and calling the single responsible method. To me that doesn't change much, but it does add complexity elsewhere: we do not yet support serialization of arbitrary classes in attributes. Additionally, your approach assumes that the stash method on those RemoteStashData subclasses can have exactly the same interface. For a normal copy and tar, maybe yes, but also for the custom remote script one? In the end, you may end up doing the same if/else, just on the class instead of on the mode enum. Again, I don't see this as the main problem in this whole discussion.

TLDR: SUMMARY OF THREAD. To summarize this very long discussion, here is how I see things:

  • The question and difference of opinion seems to be whether the RemoteData classes should function mostly as a data representer or as an operator. That is to say, should they merely store and reflect what data they contain, or should they also actually be functional classes to operate on that data.
  • From how classes are typically used in Python, it seems intuitive to make them operators and therefore implement functionality to manipulate their data. This seems to be what @ramirezfranciscof is suggesting.
  • On the contrary, I am not convinced that this is a wise design. The RemoteData needs a lot more information, which is not really part of it, to perform these operations on data. It depends on the computer that it is on (which, technically, can be retrieved from the RemoteData) and a transport to connect to that computer. The RemoteData is really just a "symbolic link". An instance of it doesn't really have access to its data, like an instance of a class typically has. It just has a reference which allows addressing it, but only when it gets the external resources that are required. It is therefore that I think we should treat these objects more as simple containers of a reference to data. The operator should be external to it and use it just to get the correct "address" and format of the data it references.
  • The problem that I see with treating RemoteData as operator classes shows when it is used in automated code. It needs to have the transport passed in (doable, but that already moves in the direction of the RemoteData no longer being the operator) and it does not itself have the capability to automatically recover from transient problems; this also has to be done from the outside.
  • Now of course, these objections are not as much of a problem when used interactively, but AiiDA is mostly an automated infrastructure and not primarily designed to be an interactive tool.
  • That being said, I see the argument that it should be usable as both. My final observation is just that if you give the RemoteData class "operating" abilities, then people will use it and they will put it in workflows. We can of course document that they shouldn't do this, but we will then still have to implement the operating code in the engine.

FINAL QUESTION: Does any of this discussion have any impact on the fundamental question of whether we should have the stashing functionality on a CalcJob at all? Or are we merely talking about how the various subclasses of RemoteData should be organized and what exact functionality they should possess?

@sphuber sphuber force-pushed the fix/2150/add-archive-transport-task branch from 9eabedf to db2466c on January 21, 2021 18:02

sphuber commented Jan 22, 2021

@chrisjsewell my pre-commit and docs checks started failing here ever since you released plumpy==0.18.4. Could you have a look and see whether you know why and if this is something we have to fix in aiida-core or in plumpy?

@chrisjsewell
Member

@chrisjsewell my pre-commit and docs checks started failing here ever since you released plumpy==0.18.4. Could you have a look and see whether you know why and if this is something we have to fix in aiida-core or in plumpy?

well that will be motivation for you to have a look at #4669 then 😉


chrisjsewell commented Jan 22, 2021

As I mention there, I'm not quite sure why mypy is not recognizing e.g. plumpy.Wait, and you need to use the full path plumpy.process_spec.Wait

@sphuber sphuber force-pushed the fix/2150/add-archive-transport-task branch 2 times, most recently from 0436370 to f51ce67 on January 27, 2021 10:32
@mbercx mbercx left a comment

Thanks @sphuber! I gave the code a first pass and left some comments. I'll do a bit of field testing and if I don't run into any issues write a first draft for the documentation.

@giovannipizzi
Member

Create the new remote.stash.folder: RemoteStashFolderData to be used for the copy mode.

OK

Change the current remote.stash: RemoteStashData entry point to be a non-instantiable base class

OK

I would leave the stash-based entry points under the remote. hierarchy. Any downsides besides the query problem marked by Giovanni?

I see your comment on array vs array.kpoints (even if in that case it's really just an ArrayData node with some additional Python methods, so really a subclass also in the Python sense, whereas here it is a different implementation). On the other hand, I agree that it's more intuitive if they all live inside remote. I don't see other downsides currently (and I really see limited need to query only for exactly RemoteData plus only one subclass of stash), so it's OK for me to go ahead like this, also so as not to block this forever, and I think this is the easiest approach.
[EDIT: thinking more about it, one issue is people searching for all output RemoteData nodes of a ProcessNode and now (by default, with subclassing = True) finding more than one. But maybe it's OK: there is the word 'remote' in the RemoteStash*Data class name(s), and if they have some stash data, it means they explicitly asked to stash.]

It would be great, though, if @mbercx (and @ramirezfranciscof ?) could confirm that there are no other issues, and (especially for @mbercx who's been keeping the design notes updated) could confirm that we're not missing one of the points we've been discussing, before you @sphuber do the work.


mbercx commented Feb 15, 2021

I don't know the history of that naughty KpointsData class, but I also don't see any downsides to having multiple classes for different types of stashing modes. It just offers us more flexibility in the future, and allows the user to query for a specific type of stashed data.

I've gone through the notes again, and I don't see any more issues with the currently proposed implementation. Just wanted to highlight again that we should also consider the unstashing of the data, via e.g. the TransferCalcJob. I can't think of any problems right now, but maybe you can still think of something that I've missed.

@sphuber sphuber force-pushed the fix/2150/add-archive-transport-task branch from 730022c to 0f67b80 on February 25, 2021 08:14

sphuber commented Feb 25, 2021

@ramirezfranciscof and @mbercx this is ready for review. I kept the final commit separate to easily see the changes since last review, but it should be squashed upon merge.

@mbercx mbercx self-requested a review March 1, 2021 11:49
@mbercx mbercx left a comment

Thanks @sphuber! Changes look good, but I think you still need to add _storable = True to the RemoteStashFolderData class?

I'll open a separate PR for the documentation. I think I'll add a subsection to the "How to run external codes" section.
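
For context, a hedged sketch of the class-attribute pattern being referred to (illustrative class bodies only, not the actual code added in this PR):

    from aiida.orm import Data

    class RemoteStashData(Data):
        """Base class; with `_storable = False` its nodes cannot be stored."""
        _storable = False

    class RemoteStashFolderData(RemoteStashData):
        """Concrete class for the 'copy' stash mode; must be storable."""
        _storable = True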

@sphuber sphuber force-pushed the fix/2150/add-archive-transport-task branch from 0f67b80 to 9941ffc on March 1, 2021 15:10

sphuber commented Mar 1, 2021

You are absolutely right. I have fixed it and added a unit test for it. I am hoping that this is what caused the daemon tests to hang.

@sphuber sphuber force-pushed the fix/2150/add-archive-transport-task branch from 9941ffc to be90296 on March 1, 2021 17:07

sphuber commented Mar 1, 2021

Alright @mbercx, it seems the halting daemon tests were indeed just due to the node not being storable. That is fixed now and all tests pass. @ramirezfranciscof this is ready for review, should you have the time.

@chrisjsewell chrisjsewell requested a review from mbercx March 4, 2021 14:20
@chrisjsewell
Member

@mbercx, @ramirezfranciscof can you give a timeline on when you can review this?


mbercx commented Mar 4, 2021

@mbercx, @ramirezfranciscof can you give a timeline on when you can review this?

This is good to go for me, the only issue I still saw was the storability one. I think we're just waiting for @ramirezfranciscof's stamp of approval?

mbercx previously approved these changes Mar 4, 2021
@ramirezfranciscof
Member

Yeah, sorry, I'm halfway through it, but I have another big pressing priority I need to finish right now that is consuming a lot of my attention 😅. If you are tired of waiting, I would understand; I can send the couple of comments I had (very minor, nothing critical so far I think) and just move on. Otherwise I could try to finish this tomorrow, maybe (it really depends on whether I manage to break through a specific obstacle I'm dealing with).

@ramirezfranciscof ramirezfranciscof left a comment

Apologies for taking a long time to see this, but all great! I just have a couple of comments and questions to understand this better (since this was already approved I'll just leave this as a comment).

@@ -415,6 +417,20 @@ def launch_all():
print('Running the `MultiplyAddWorkChain`')
run_multiply_add_workchain()

# Testing the stashing functionality
Member

Should we maybe also test the behaviour for when the target_base path doesn't exist?

I was also going to say to check what happens when an unavailable stash_mode is provided, but then I saw at the end that you did this in tests/orm/data/test_remote_stash.py. I'm still wondering if it makes sense to check how the daemon handles that error too. I guess this is also related to what Giovanni mentions regarding the codecov complaints, so perhaps you already thought about it and discarded the idea. Up to you.

Contributor Author

I was also going to say to check what happens when an unavailable stash_mode is provided, but then I saw at the end that you did this in tests/orm/data/test_remote_stash.py. I'm still wondering if it makes sense to check how the daemon handles that error too. I guess this is also related to what Giovanni mentions regarding the codecov complaints, so perhaps you already thought about it and discarded the idea. Up to you.

It is a good point, but tests/orm/data/test_remote_stash.py doesn't test what you want it to test. It merely checks that the node implements the checks for valid values. What you mean is to check that the CalcJob process also validates properly. This is done however. In tests/engine/test_calc_job.py all the way at the bottom I test the validator for the stash options, which ensures that the process won't even be created when an invalid stash mode is defined.

Should we maybe also test the behaviour for when the target_base path doesn't exist?

Could do. Would it be ok to only check that, e.g., by simply deleting the tmpdir that is generated before launching it? If I have to launch an additional calculation just for the other case as well, these tests will get out of control. This is the downside of integration tests. You are running so much code just to test one minor variation and target a single line.

Member

It is a good point, but tests/orm/data/test_remote_stash.py doesn't test what you want it to test. It merely checks that the node implements the checks for valid values. What you mean is to check that the CalcJob process also validates properly. This is done however. In tests/engine/test_calc_job.py all the way at the bottom I test the validator for the stash options, which ensures that the process won't even be created when an invalid stash mode is defined.

Ah, good point, you are right.

Could do. Would it be ok to only check that, e.g., by simply deleting the tmpdir that is generated before launching it? If I have to launch an additional calculation just for the other case as well, these tests will get out of control. This is the downside of integration tests. You are running so much code just to test one minor variation and target a single line.

Uhm, ok, I see; yes, if it just re-creates it then it should be ok. Now, not for this PR, but maybe for a later discussion: I wonder if we want to allow this path creation on a remote cluster, where trying to just create folders in arbitrary locations may lead to trouble. Does the AiiDA working directory for remotes also create the full path if there are missing dirs? I don't know, maybe I'm being overly cautious...

Comment on lines 236 to 240
if node.get_state() == CalcJobState.PARSING:
logger.warning(f'CalcJob<{node.pk}> already marked as PARSING, skipping task_retrieve_job')
if node.get_state() == CalcJobState.DONE:
logger.warning(f'CalcJob<{node.pk}> already marked as DONE, skipping task_retrieve_job')
Member

I'm not entirely sure why this change was made. Maybe you already commented on it in one of the meetings we had and I don't remember, but I couldn't find anything in the text of the PR. Why did you add the DONE state? And why did you change it here instead of PARSING? What is the purpose of these checks here if the states are supposed to be called in order, just a sanity check?

Member

What is the purpose of these checks here if the states are supposed to be called in order, just a sanity check?

No @ramirezfranciscof this is if e.g. the daemon was stopped/restarted during the transport tasks, we don't want to repeat a task that has already been finished.

but yes @sphuber, it appears here that you have simply replaced PARSING with DONE?
Is this the case, and what was the reason?
If it is a swap, then surely PARSING should be removed from CalcJobState and the places in the tests that use PARSING should also be replaced with DONE.
Technically this would also warrant a database migration, which brings into question whether it is really worth it 😬

Contributor Author

This change was necessary for one of the previous designs, where I think the stashing was performed after the parsing, and so a new state was necessary. However, in the current design the stashing (if needed) is done before retrieving, and the parsing then follows retrieving as it used to. So indeed, the change is now merely replacing PARSING with DONE, and I can revert it. Note that this state attribute is merely for internal purposes and is even removed after the process is terminated, so a migration wouldn't have been necessary.

Member

cheers

Note that this state attribute is merely for internal purposes and is even removed after the process is terminated, so a migration wouldn't have been necessary.

well, I meant in the (very rare) case that a migration occurred whilst you still had pending calculations in the PARSING state, but yeah, a technicality to be sure

Comment on lines +324 to +326
except plumpy.process_states.Interruption:
raise
Member

What kind of exception is this catching? Can't we re-raise something more informative?

@chrisjsewell chrisjsewell Mar 9, 2021

it's catching what it says it's catching 😜 an interruption, like pausing or killing, so no

Member

Right, but my underlying question is: why would you catch the exception just to raise with nothing? Why not just let plumpy.process_states.Interruption be the raised error?

Member

because otherwise it would be caught by the exception handler below

Contributor Author

Just in case this is what your question was trying to address: if you catch an exception and then simply call raise again, it will re-raise the original exception. Why is this useful? Because this way you can treat certain exceptions differently from others. In this case we want exceptions to be treated a certain way, except for the Interruption, which should just be bubbled up.
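
For illustration, a minimal self-contained sketch of this pattern (with a made-up exception class, not the actual code in tasks.py):

    class Interruption(Exception):
        """Stand-in for plumpy.process_states.Interruption."""

    def handle(exc):
        try:
            raise exc
        except Interruption:
            raise  # bare raise: re-raises the original exception so it bubbles up untouched
        except Exception as exception:
            print(f'handled here: {exception}')

    handle(ValueError('transport failed'))  # caught by the generic handler

    try:
        handle(Interruption('process was killed'))
    except Interruption as exception:
        print(f'bubbled up to the caller: {exception}')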

Member

if you catch an exception and then simply call raise again, it will re-raise the original exception.

Ahh ok. Yeah, sorry, I did not know this, I thought you had to explicitly provide the exception to raise.

@chrisjsewell chrisjsewell left a comment

from a quick skim, generally fine cheers, but still a few changes to make


sphuber added 2 commits March 10, 2021 11:25
A new namespace `stash` is added to the `metadata.options` input
namespace of the `CalcJob` process. This option namespace allows a user
to specify certain files created by the calculation job to be stashed
somewhere on the remote. This can be useful if those files need to be
kept for longer than the scratch space where the job ran is typically
retained, but should remain on the remote machine rather than being
retrieved. Examples are files that are necessary to restart a
calculation but are too big to be retrieved and stored permanently in
the local file repository.

The files that are to be stashed are specified through their relative
filepaths within the working directory in the `stash.source_list`
option. For now, the only supported option is to have AiiDA's engine
copy the files to another location on the same filesystem as the working
directory of the calculation job. The base path is defined through the
`stash.target_base` option. In the future, other methods may be
implemented, such as placing all files in a (compressed) tarball or even
stashing the files on tape.
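
For illustration, a hedged sketch of how these options might be specified when launching a `CalcJob`; the code label and paths are made up, the `x`/`y` inputs assume an arithmetic-add code, and the `stash_mode` key with the `StashMode` enum follows the discussion above.

    from aiida import load_profile, orm
    from aiida.common.datastructures import StashMode
    from aiida.engine import submit

    load_profile()

    builder = orm.load_code('add@my-cluster').get_builder()  # hypothetical code label
    builder.x = orm.Int(1)
    builder.y = orm.Int(2)
    builder.metadata.options.stash = {
        'source_list': ['aiida.out', 'restart/'],  # relative paths in the working directory
        'target_base': '/storage/project/stash',   # base path on the remote's filesystem
        'stash_mode': StashMode.COPY.value,        # copying is the only mode implemented here
    }
    node = submit(builder)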
Note that the big change in `execmanager.py` is simply moving
`stash_calculation` before `retrieve_calculation`. I did this so that
all actions appear in order, since stashing comes before retrieving.
There are no actual changes in the code.
@sphuber sphuber force-pushed the fix/2150/add-archive-transport-task branch from 523e729 to a721d4b on March 10, 2021 10:55

sphuber commented Mar 10, 2021

Alright gents, thanks for the review. I should have addressed all of the comments. If it is ok to be approved, please let me merge it myself, as I need to adapt the commit message a bit.

@chrisjsewell chrisjsewell self-requested a review March 10, 2021 11:03
@chrisjsewell chrisjsewell left a comment

good job 👌

@ramirezfranciscof ramirezfranciscof left a comment

Great work, thanks!

@sphuber sphuber merged commit 9359b07 into aiidateam:develop Mar 10, 2021
@sphuber sphuber deleted the fix/2150/add-archive-transport-task branch March 10, 2021 14:30
chrisjsewell added a commit that referenced this pull request Mar 18, 2021
* Dependencies: bump cryptography to 3.2 in `requirements` (#4520)

Bumps `cryptography` from 2.8 to 3.2.

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Sebastiaan Huber <mail@sphuber.net>

* CI: remove `run-on-comment` job in benchmark workflow (#4569)

This job is failing due to this change:
https://github.blog/changelog/2020-10-01-github-actions-deprecating-set-env-and-add-path-commands/
It's not really used, so lets just remove it

* Docs: update citations with AiiDA workflows paper (#4568)

Citation for the latest paper on the engine is added to the README and
the documentation index page. The paper in `aiida/__init__.py` is also
updated which was still referencing the original publication of 2016.

* Enforce verdi quicksetup --non-interactive (#4573)

When in non-interactive mode, do not ask whether to use existing user/database

* `SinglefileData`: add support for `pathlib.Path` for `file` argument (#3614)

* DOCS: Reverse daemon start and profile setup sections in intro. (#4574)

The profile must be setup prior to starting the daemons to avoid an error.

* Fix `verdi --version` in editable mode (#4576)

This commit fixes a bug
whereby click was using a version statically stored on install of the package.
This meant changes to `__version__` were not dynamically reflected.

* Improve `verdi node delete` performance (#4575)

The `verdi node delete` process fully loaded all ORM objects at multiple stages
during the process, which is highly inefficient.
This commit ensures the process now only loads the PKs when possible.
As an example, the time to delete 100 "empty" nodes (no attributes/objects)
is now reduced from ~32 seconds to ~5 seconds.

* `CalcJob`: add the `additional_retrieve_list` metadata option (#4437)

This new option allows one to specify additional files to be retrieved
on a per-instance basis, in addition to the files that are already
defined by the plugin to be retrieved. This was often implemented by
plugin packages themselves through a `settings` node that supported a key
that would allow a user to specify these additional files.

Since this is a common use case, we implement this functionality on
`aiida-core` instead to guarantee a consistent interface across plugins.

* Add options for transport tasks (#4583)

* Add options for transport tasks

When encountering failures during the execution of transport tasks, a runner
will wait for a time interval between transport task attempts. This time
interval between attempts is increased using an exponential backoff
mechanism, i.e. the time interval is equal to:

(TRANSPORT_TASK_RETRY_INITIAL_INTERVAL) * 2 ** (N_ATTEMPT - 1)

where N_ATTEMPT is the number of failed attempts. This mechanism is
interrupted once the TRANSPORT_TASK_MAXIMUM_ATTEMPTS is reached.

The initial interval and maximum attempts are currently fixed to 20
seconds and 5, respectively. This commit adds two configuration options
that use these defaults, but allow the user to adjust them using `verdi
config`.
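
For illustration, the interval progression this formula gives with the stated defaults of 20 seconds and 5 attempts (a quick check, not code from the commit):

    TRANSPORT_TASK_RETRY_INITIAL_INTERVAL = 20  # seconds, default
    TRANSPORT_TASK_MAXIMUM_ATTEMPTS = 5         # default

    intervals = [
        TRANSPORT_TASK_RETRY_INITIAL_INTERVAL * 2 ** (n_attempt - 1)
        for n_attempt in range(1, TRANSPORT_TASK_MAXIMUM_ATTEMPTS + 1)
    ]
    print(intervals)  # [20, 40, 80, 160, 320]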

* Fix command for getting EBM config options (#4587)

Currently the transport options for the EBM are obtained by using the
get_config function, e.g.:

`initial_interval = get_config().get_option(RETRY_INTERVAL_OPTION)`

However, it seems that `get_config()` does not get you the current
configuration (see #4586). 

Replacing `get_config().get_option()` with `get_config_option()` fixes this
issue for the EBM options.

* CI: revert apt source list removal

This work around was added some time ago because this source for the
`apt` package manager was causing the install of system dependencies to
fail.

* CI: Add workflow to run tests against various RabbitMQ versions

The main test workflow runs against a single version of RabbitMQ but
experience has shown that the code can break for different versions of
the RabbitMQ server. Here we add a new CI workflow that runs various
unit tests through pytest that simulate the typical interaction with the
RabbitMQ server in normal AiiDA operation. The difference is that these
are tested against the currently available versions of RabbitMQ.

The current setup, still only tests part of the functionality that AiiDA
uses, for example, the default credentials and virtual host are used.
Connections over TLS are also not tested. These options would require
the RabbitMQ service that is running in a docker container to be
configured differently. It is not clear how these various options can be
parametrized in concert with the actual unit tests.

* Engine: replace `tornado` with `asyncio`

The `plumpy` and `kiwipy` dependencies have already been migrated from
using `tornado` to the Python built-in module `asyncio` in the versions
`0.16.0` and `0.6.0`, respectively. This allows us to also rid AiiDA of
the `tornado` dependency, which has been giving requirement clashes with
other tools, specifically from the Jupyter and iPython world. The final
limitation was the `circus` library that is used to daemonize the daemon
workers, which as of `v0.17.1` also supports `tornado~=5`.

A summary of the changes (a brief illustrative sketch follows this list):

 * Replace `tornado.ioloop` with `asyncio` event loop.
 * Coroutines are marked with `async` instead of decorated with the
   `tornado.gen.coroutine` decorator.
 * Replace `yield` with `await` when calling a coroutine.
 * Replace `raise tornado.gen.Return` with `return` when returning from
   a coroutine.
 * Replace `add_callback` call on event loop with `call_soon` when
   scheduling a callback.
 * Replace `add_callback` call on event loop with `create_task` when
   scheduling `process.step_until_terminated()`.
 * Replace `run_sync` call on event loop with `run_until_complete`.
 * Replace `pika` uses with `aio-pika` which is now used by the `plumpy`
   and `kiwipy` libraries.
 * Replace `concurrent.Future` with `asyncio.Future`.
 * Replace `yield tornado.gen.sleep` with `await asyncio.sleep`.
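
For illustration, a minimal before/after sketch of the coroutine changes summarized above (a generic coroutine, not actual aiida-core code):

    import asyncio

    # Before (tornado style):
    #
    #     @tornado.gen.coroutine
    #     def fetch_result():
    #         result = yield some_coroutine()
    #         raise tornado.gen.Return(result)

    # After (asyncio style):
    async def some_coroutine():
        await asyncio.sleep(0.1)  # stand-in for real asynchronous work
        return 42

    async def fetch_result():
        return await some_coroutine()

    print(asyncio.run(fetch_result()))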

Additional changes:

 * Remove the `tornado` logger from the logging configuration.
 * Remove the `logging.tornado_loglevel` configuration option.
 * Turn the `TransportQueue.loop` attribute from method into property.
 * Call `Communicator.close()` instead of `Communicator.stop()` in the
   `Manager.close()` method. The `stop` method has been deprecated in
   `kiwipy==0.6.0`.

* `Process.kill`: properly resolve the killing futures

The result returned by `ProcessController.kill_process` that is called
in `Process.kill` for each of its children, if it has any, can itself be
a future, since the killing cannot always be performed directly, but
instead will be scheduled in the event loop. To resolve the future of
the main process, it will have to wait for the futures of all its
children to be resolved as well. Therefore an intermediate future needs
to be added that will be done once all child futures are resolved.

* Unwrap the futures returned by `ProcessController` in `verdi process`

The commands of `verdi process` that perform an RPC on a live process
will do so through the `ProcessController`, which returns a future.
Currently, the process controller uses the `LoopCommunicator` as its
communicator which adds an additional layer of wrapping. Ideally, the
return type of the communicator should not change depending on the
specific implementation that is used, however, for now that is the case
and so the future needs to be unwrapped explicitly one additional time.
Once the `LoopCommunicator` is fixed to return the same future type as
the base `Communicator` class, this workaround can and should be
removed.

* `Runner`: use global event loop and global runner for process functions

With the migration to `asyncio`, there is now only a single event loop
that is made reentrant through the `nest-asyncio` library, that monkey
patches `asyncio`'s built-in mechanism to prevent this. This means that
in the `Runner` constructor, we should simply get the global event loop
instead of creating a new one, if no explicit loop is passed into the
constructor. This also implies that the runner should never take charge
in closing the loop, because it no longer owns the global loop.

In addition, process functions now simply use the global runner instead
of creating a new runner. This used to be necessary because running in
the same runner, would mean running in the same loop and so the child
process would block the parent. However, with the new design on
`asyncio`, everything runs in a single reentrant loop and so child
processes no longer need to spawn their own independent nested runner.

* Engine: cancel active tasks when a daemon runner is shutdown

When a daemon runner is started, the `SIGINT` and `SIGTERM` signals are
captured to shutdown the runner before exiting the interpreter. However,
the async tasks associated with the interpreter should be properly
canceled first.

* Engine: enable `plumpy`'s reentrant event loop policy

The event loop implementation of `asyncio` does not allow to make the
event loop to be reentrant, which essentially means that event loops
cannot be nested. One event loop cannot be run within another event
loop. However, this concept is crucial for `plumpy`'s design to work and
was perfectly allowed by the previous event loop provider `tornado`.

To work around this, `plumpy` uses the library `nest_asyncio` to patch
the `asyncio` event loop and make it reentrant. The trick is that this
should be applied at the correct time. Here we update the `Runner` to
enable `plumpy`'s event loop policy, which will patch the default event
loop policy. This location is chosen since any process in `aiida-core`
*has* to be run by a `Runner` and only one runner instance will ever be
created in a Python interpreter. When the runner shuts down, the event
policy is reset to undo the patch.
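
A rough, self-contained sketch of what the reentrant patch allows (here
applying `nest_asyncio` directly; in `aiida-core` this happens through
`plumpy`'s event loop policy):

```
import asyncio
import nest_asyncio

nest_asyncio.apply()  # patch the event loop so that it can be re-entered

async def inner():
    return 42

def blocking_helper():
    # without the patch, running the loop from code that is already executing
    # inside that same loop raises "This event loop is already running"
    return asyncio.get_event_loop().run_until_complete(inner())

async def outer():
    return blocking_helper()

print(asyncio.get_event_loop().run_until_complete(outer()))  # prints 42
```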

* Tests: do not create or destroy event loop in test setup/teardown

* Engine: explicitly enable compatibility for RabbitMQ 3.5

RabbitMQ 3.6 changed the way integer values are interpreted for
connection parameters. This would cause certain integer values that used
to be perfectly acceptable to all of a sudden cause the declaration of
resources, such as channels and queues, to fail.

The library `pamqp`, which is used by `aiormq`, which in turn is
ultimately used by `kiwipy` to communicate with the RabbitMQ server,
adapted to these changes, but this broke code that used to work just
fine with RabbitMQ 3.5. For example, declaring a queue with a message
TTL would now fail when `32767 < TTL < 655636` due to the incorrect
interpretation of the integer type.

The library `pamqp` provides a way to enable compatibility with these
older versions. One should merely call the method:

    pamqp.encode.support_deprecated_rabbitmq()

This will enable the legacy integer conversion table and will restore
functionality for RabbitMQ 3.5.

* Dependencies: update minimum version for `notebook>=6.1.5` (#4593)

Lower versions suffer from vulnerability `GHSA-c7vm-f5p4-8fqh`.

Also update the requirement files to only use explicit pinned versions.
The compatibility operator was erroneously used for the `aio-pika`,
`pamqp` and `pytest-asyncio` dependencies.

For `pamqp` the minimum required version is upped to `2.3` since that
was the version that introduced the `support_deprecated_rabbitmq`
function that is required from that library.

* Daemon: replace deprecated classmethods of `asyncio.Task` in shutdown (#4608)

The `shutdown` function, that was attached to the loop of the daemon
runner in `aiida.engine.daemon.runner.start_daemon`, was calling the
classmethods `current_task` and `all_tasks` of `asyncio.Task` which have
been deprecated in Python 3.7 and are removed in Python 3.9. This would
prevent the daemon runners from being shutdown in Python 3.9. The
methods have been replaced with top level functions that can be imported
directly from `asyncio`.
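
As an illustration, a minimal sketch of the replacement, assuming the
shutdown boils down to cancelling all other tasks on the loop (this is not
the literal `shutdown_runner` implementation):

```
import asyncio

async def cancel_other_tasks():
    current = asyncio.current_task()  # replaces asyncio.Task.current_task()
    others = [task for task in asyncio.all_tasks() if task is not current]  # replaces asyncio.Task.all_tasks()
    for task in others:
        task.cancel()
    await asyncio.gather(*others, return_exceptions=True)

asyncio.get_event_loop().run_until_complete(cancel_other_tasks())
```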

This was not noticed in the tests because in the tests the daemon is
stopped but it is not checked whether this happens successfully. Anyway,
the error would only show up in the daemon log. To test the shutdown
method, it has been made into a standalone coroutine and renamed to
`shutdown_runner`.

Since the `shutdown_runner` is a coroutine, the unit test that calls it
also has to be one and therefore we need `pytest-asyncio` as a
dependency. The `event_loop` fixture, that is provided by this library,
is overridden such that it provides the event loop of the `Manager`,
since in AiiDA only ever this single reentrant loop should be used.

Note that the current CI tests run against Python 3.6 and Python 3.8 and
so will still not catch this problem, however, the `test-install`
workflow _does_ run against Python 3.9. I have opted not to change the
continuous integrations to run against Python 3.9 instead of 3.8, since
they take more than twice the time. Supposedly this is because certain
dependencies have to be built and compiled from scratch when the
testing environment is started.

* CLI: add the `verdi database version` command (#4613)

This shows the schema generation and version of the database of the
given profile, useful mostly for developers when debugging.

In addition to the new command, the code in `aiida.manage.manager` had
to be updated for the new functionality to work. The `get_backend_manager`
was so far _not_ loading the backend, although that really doesn't make
any sense. It is providing access to data from the database, but to do
so the backend should be loaded, otherwise a connection isn't possible.

This problem went unnoticed, because the `BackendManager` was so far only
used in `aiida.engine.utils.set_process_state_change_timestamp`. By the
time this gets used, the database backend will already have been loaded
through another code path.

For the new `verdi database version` command, however, the call to
get the backend manager needed to make sure that the database backend
itself was also loaded. It was not possible to have `get_backend_manager`
simply call `_load_backend()` because this would lead to infinite
recursion, as `_load_backend()` also calls `get_backend_manager`.
Therefore `_load_backend` is refactored to not call `get_backend_manager`
but rather to fetch the backend manager directly through `aiida.backends`.

* Add the `TransferCalcJob` plugin (#4194)

This calcjob allows the user to copy files between a remote machine and
the local machine running AiiDA. More specifically, it can do any of the
following:

* Take any number of files from any number of `RemoteData` folders in
a remote machine and copy them in the local repository of a single
newly created `FolderData` node.

* Take any number of files from any number of `FolderData` nodes in the
local machine and copy them in a single newly created `RemoteData` folder
in a given remote machine.

These are the main two use cases, but there are also other more complex
combinations allowed by the current implementation.

Co-authored-by: Sebastiaan Huber <mail@sphuber.net>

* Dependencies: update requirement `kiwipy~=0.7.1` and `plumpy~=0.18.0` (#4629)

A breaking change was released with `kiwipy==0.5.4` where the default
value for the task message TTL was changed. This caused connections to
existing RabbitMQ queues to fail. Since process task queues are
permanent in AiiDA, this would break all existing installations.

This problem was fixed by reverting the change which was released with
`kiwipy==0.5.5`, however, this was a support patch at the time and the
revert never made it into the main line, leaving all versions up from
`v0.6.0` still affected. Since these versions of `kiwipy` were never
required by a released version of `aiida-core`, but only the current
`develop`, which will become `v1.6.0`, we can simply update the
requirement to the latest patch `kiwipy==0.7.1` that addressed the
problem.

The dependency requirement for `plumpy` also had to be updated because
the old pinned minor version was pinned to `kiwipy~=0.6.0` which is not
compatible with our new requirements.

* Docs: add content from old documentation on caching/hashing (#4546)

Move the content of "Controlling hashing" and "Design guidelines" inside
of `developer_guide/core/caching.rst` to `topics/provenance/caching`.

* Engine: remove `with_persistence=False` from process function runner (#4633)

In principle the runner for a process function does not need a persister
since it runs in one go and does not have intermediate steps at which
the progress needs to be persisted. However, since the process function
implementation calls `Manager.get_runner`, if a runner has not yet been
created in the interpreter, one will be created and set to be the global
one. This is where the problem occurs because the process function
specifies `with_persistence=False` for the runner. This will cause any
subsequent process submissions to fail since the `submit` function will
call `runner.persister.save_checkpoint` which will fail since the
`persister` of the runner is `None`.

* `CalcJob`: improve testing and documentation of `retrieve_list` (#4611)

The documentation on the `retrieve_list` syntax and its functioning was
incorrect. The inaccuracies are corrected and extensive examples are
provided that give an example file hierarchy for the remote working
directory and then for a variety of definitions of the `retrieve_list`
the resulting file structure in the retrieved folder is depicted.
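
As a quick, hedged illustration of the kind of directives covered by that
documentation (file names are made up; the exact semantics of the tuple
elements are what the docs spell out):

```
from aiida.common.datastructures import CalcInfo

calcinfo = CalcInfo()
calcinfo.retrieve_list = [
    'aiida.out',                    # a single file in the remote working directory
    ('some/deep/file.dat', '.', 2)  # (remote path, local target, number of nesting levels to keep)
]
```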

* CI: remove the `numpy` install workaround for `pymatgen`

The problem occurred due to an outdated version of `setuptools` which
would be invoked when `pymatgen` gets installed from a tarball, in which
case the wheel has to be built. In this scenario, the build requirements
get installed by `setuptools`, which at outdated versions did not
respect the Python requirements of the dependencies, which would cause
an incompatible version of `numpy` to be installed, causing the build to
fail. By updating `setuptools` the workaround of manually installing a
compatible `numpy` version beforehand is no longer necessary.

* CI: skip `restapi.test_threaded_restapi:test_run_without_close_session`

This test has been consistently failing on Python 3.8 and 3.9 despite
the two reruns using flaky. For now we skip it entirely instead.

* Dependencies: update requirement `plumpy~=0.18.1` (#4642)

This patch release of `plumpy` fixes a critical bug that makes the new
`asyncio` based implementation of the engine compatible with Jupyter
notebooks.

* CLI: ensure `verdi database version` works even if schema outdated (#4641)

The command was failing if the database schema was out of sync because
the backend was loaded, through `get_manager`, with the default schema
check on. Since the database does not actually have to be used, other
than to retrieve the current schema version and generation, we can load
the backend without the check.

* Add `verdi group delete --delete-nodes` (#4578)

This commit makes a number of improvements to the deletion of nodes API/CLI:

1. Makes `delete_nodes` usable outside of `click`; 
   adding a callback for the confirmation step, 
   rather than calling `click.confirm` directly, 
   and using logging instead of `click.echo`
2. Moves the function from `aiida/manage/database/delete/nodes.py`
   to `aiida/tools/graph/deletions.py`,
   leaving a deprecation warning at the old location.
   This is a more intuitive place since the function is directly build on the graph traversal functionality.
3. Exposes API functions *via* `from aiida.tools import delete_nodes` and adds their use to the documentation (see the sketch after this list).
4. Adds `delete_group_nodes` mainly as a wrapper around `delete_nodes`;
   querying for all the node pks in the groups, then passing these to `delete_nodes`
5. Adds the ability to delete nodes to `verdi group delete --delete-nodes`,
   with the same flags and logic as `verdi node delete` 
6. Fixes a bug in `verdi node delete`, introduced by #4575, that occurred when a given node does not exist
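
A minimal sketch of the programmatic use exposed in item 3, assuming the keyword arguments shown here:

```
from aiida.tools import delete_nodes  # new public location

# a dry run only reports what would be deleted; pass dry_run=False to actually delete
pks_to_delete, was_deleted = delete_nodes([1234, 5678], dry_run=True)
```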

* 🧪 FIX: engine benchmark tests (#4652)

The `test_workchain_daemon` test group required updating to use asyncio (rather than tornado)

* Docs: Minor documentation fixes (#4643)

Small changes and fixes in the documentation.

* Docs: clarify docstrings of `get_last_job_info` and `get_detailed_job_info` (#4657)

`CalcJobNode`s contain two different job infos, the `detailed_job_info` and
the `last_job_info`. The distinction between the two was not obvious,
and not documented. The docstrings are improved to clarify the difference.

* docs: simplify proxycommand (#4662)

The 'netcat mode' `-W` was added in OpenSSH 5.4, released March 2010.
Given that this simplifies the setup and delegates handling of netcat
to ssh, this is what we should recommend.

For example, MacOS ships with OpenSSH 5.6 since MacOS 10.7, released October 2010.

* Docs: Add redirect for database backup page (#4675)

* Type checking: `aiida/engine` (+bug fixes) (#4669)

Added type checking for the modules

* `aiida.engine`
* `aiida.manage.manager`

Move `aiida.orm` imports to top of file in `aiida.engine` module. This should be
fine as `aiida.orm` should not import anything from `aiida.engine` and this way
we don't need import guards specifically for type checking.

* Fix `run_get_node`/`run_get_pk` namedtuples (#4677)

Fix a regression made in #4669, whereby the namedtuple's were incorrectly named

* REST API fixes

- Use node_type in construct_full_type().
- Don't use try/except for determining full_type.
- Remove unnecessary try/except in App for catch_internal_server.
- Use proper API_CONFIG for configure_api.

* New /querybuilder-endpoint with POST for REST API

The POST endpoint returns what the QueryBuilder would return, when
providing it with a proper queryhelp dictionary.
Furthermore, it returns the entities/results in the "standard" REST API
format - with the exception of `link_type` and `link_label` keys for
links. However, these particular keys are still present as `type` and
`label`, respectively.

The special Node property `full_type` will be removed from any entity,
if its value is `None`. There are two cases where this will be True:
- If the entity is not a `Node`; and
- If neither `node_type` nor `process_type` is among the projected
properties for any given `Node`.

Concerning security:
The /querybuilder-endpoint can be toggled on/off with the configuration
parameter `CLI_DEFAULTS['POSTING']`.
Added this to `verdi restapi` as `--posting/--no-posting` option.
The option is hidden by default, as the naming may be changed in the
future.

Reviewed by @ltalirz.

* Use importlib in .ci folder

* Fix: pre-store hash for -0. and 0. is now the same
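
A hedged illustration of the intent of this fix, assuming `make_hash` from `aiida.common.hashing` is the relevant helper:

```
from aiida.common.hashing import make_hash

# negative and positive floating-point zero should produce the same pre-store hash
assert make_hash(-0.) == make_hash(0.)
```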

* ci: update paramiko version (#4686)

Now that the Github Action runners switched to Ubuntu 20.04, the default SSH
key format of OpenSSH changed and is no longer supported by paramiko
<=2.7.1.

* Fix: release signal handlers after run execution (#4682)

After a process has executed (when running rather than submitting),
return the signal handlers to their original state.

This fixes an issue whereby using `CTRL-C` after a process has run still calls the `process.kill`.
It also releases the `kill_process` function's reference to the process,
a step towards allowing the finished process to be garbage collected.
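
A generic sketch of the restore pattern (not the actual `aiida-core` implementation):

```
import signal

original_handler = signal.getsignal(signal.SIGINT)
try:
    # install a custom handler only for the duration of the run
    signal.signal(signal.SIGINT, lambda signum, frame: print('kill requested'))
    pass  # ... the process runs here ...
finally:
    # put the original handler back so a later CTRL-C no longer triggers process.kill
    signal.signal(signal.SIGINT, original_handler)
```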

* Fix: `PluginVersionProvider` should cache process class (#4683)

Currently, the `PluginVersionProvider` is caching the process instance rather than the class.
This commit fixes the bug, meaning the cache will now work correctly.
Removing the reference to the process instance also is a step towards allowing it to be garbage collected.

* remove leftover use of Computer.name (#4681)

Remove leftover use of deprecated Computer.name attribute in `verdi
computer list`.

Also update minimum version of click dependency to 7.1, since click 7.1
introduces additional whitespace in the verdi autodocs (running with
click 7.0 locally resulted in the pre-commit check failing on CI).

Co-authored-by: Chris Sewell <chrisj_sewell@hotmail.com>

* Add `to_aiida_type` to the public API (#4672)

Since `to_aiida_type` is intended for public use,
this commit makes it part of the public API,
via `from aiida.orm import to_aiida_type`.

* Add .dockerignore (#4564)

This commit adds a `.dockerignore` file to inhibit any unnecessary/unwanted files being copied into the Docker container,
during the `COPY . aiida-core` command,
and also reduces the build time.

* CI: Remove `--use-feature=2020-resolver` pip feature flag tests. (#4689)

The feature is now on by default in the latest stable release.

* CI: Notify slack on failure of the test-install workflow. (#4690)

* Improve namedtuples in aiida/engine (#4688)

This commit replaces old-style namedtuples with `typing.NamedTuple` sub-classes.
This allows for typing of fields and better default value assignment.

Note this feature requires python>=3.6.1,
but it is anyhow intended that python 3.6 be dropped for the next release.
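
A short sketch of the pattern (the class and field names here are illustrative, not the actual tuples in `aiida/engine`):

```
from typing import NamedTuple, Optional


class LaunchResult(NamedTuple):
    pk: int
    uuid: Optional[str] = None  # typed field with a default value


result = LaunchResult(pk=42)
print(result.pk, result.uuid)
```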

* test AiiDA ipython magics and remove copy-paste in docs (#4548)

Adds tests for the AiiDA IPython extension.

Also:
 * move some additional lines from the registration snippet to
   aiida-core (where we can adapt it if the IPython API ever changes)
 * rename and deprecate the misnomer `load_ipython_extension` to
   `register_ipython_extension` (to be removed in aiida 3)
 * include the snippet to register the AiiDA ipython magics from the
   aiida-core codebase instead of the (already outdated) copy-pasted
   version.
 * revisit the corresponding section of the documentation, starting
   with the setup, and removing some generic information about jupyter.

* 🐛 FIX: typing failure (#4700)

As of numpy v1.20, `numpy.inf` is no longer recognised as an integer type

* 📚 DOCS: fix typo (#4711)

* BUILD: drop support for python 3.6 (#4701)

Following our support table, we drop python 3.6 support.

* BUILD: bump jenkins dockerimage to 20.04 (#4714)

Despite python3.7 being installed on the Jenkins dockerimage, pip
install failed after dropping python 3.6 support (likely because pip
from python 3.6 was being used).

We update ubuntu to 20.04, which comes with python 3.8.2 by default.

* Switch matrix order in continuous-integration tests job. (#4713)

To harmonize with test-install workflow.

* ♻️ REFACTOR: verdi export/import -> verdi archive (#4710)

This commit deprecates `verdi export` and `verdi import` and combines them into `verdi archive`.

* Dependencies: Require `ipython~=7.20` (#4715)

* Dependencies: Require `ipython~=7.20`

Package jedi version 0.18 introduces backwards incompatible changes that
break compatibility with ipython<7.20.

Fixes issue #4668.

* Automated update of requirements/ files. (#4716)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* ♻️ REFACTOR: `ci/` folder (#4565)

This commit looks to address two issues:

1. The `ci/` folder has become cluttered; 
   it contains configuration and scripts for both the GitHub Actions and Jenkins CI
   and it is not immediately clear which is for which.
2. The Jenkins tests are somewhat of a black-box to most,
   since it is certainly not trivial to set up and run them locally.
   This has led to them essentially not being touched since they were first written.

The changes are as follows:

1. Moved the GH actions specific scripts to `.github/system_tests`
2. Refactored the Jenkins setup/tests to use [molecule](https://molecule.readthedocs.io) in the `.molecule/` folder 
   (note we use molecule for testing all the quantum mobile code).
   You can read about this setup in `.molecule/README.md`,
   but essentially if you just run `tox -e molecule-django` locally it will create/launch a docker container,
   set up and run the tests within that container, then destroy the container.
   Locally, it additionally records and prints an analysis of queries made to the database during the workchain runs.
3. Moved the Jenkins configuration to `.jenkins/`, which is now mainly a thin wrapper around (2).

This makes these tests more portable and easier to understand, modify or add to.

* 🔧 MAINTAIN: drop setuptools upper pinning (#4725)

* CI: Improve polish workchain failure debugging (#4729)

* fix: don't pass process stack via context (#4699)

This PR fixes a memory leak: when running `CalcJob`s over an SSH connection,
the first CalcJob that was run remained in memory indefinitely.

`plumpy` uses the `contextvars` module to provide a reference to the
`current_process` anywhere in a task launched by a process.  When using any of
`asyncio`'s `call_soon`, `call_later` or `call_at` methods, each individual
function execution gets its own copy of this context.  This means that as
long as a handle to these scheduled executions remains in memory, the copy of
the `'process stack'` context var (and thus the process itself) remains in
memory.

In this particular case, a handle to such a task (`do_open` a `transport`)
remained in memory and caused the whole process to remain in memory as well via
the 'process stack' context variable.  This is fixed by explicitly passing an
empty context to the execution of `do_open` (which anyhow does not need access
to the `current_process`).  An explicit test is added to make sure that no
references to processes are leaked after running process via the interpreter
as well as in the daemon tests.

This PR adds the empty context in two other invocations of `call_later`, but
there are more places in the code where these methods are used. As such it is a
bit of a workaround.  Eventually, this problem should likely be addressed by
converting any functions that use `call_soon`, `call_later` or `call_at` and
all their parents in the call stack to coroutines.
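
A self-contained sketch of the mechanism (the context variable name mirrors the description above, the rest is illustrative):

```
import asyncio
import contextvars

process_stack = contextvars.ContextVar('process stack')


def do_open():
    # run with an empty context, the callback does not see (or keep alive)
    # the caller's copy of the 'process stack' context variable
    print(process_stack.get('no process captured'))


loop = asyncio.new_event_loop()
process_stack.set('some process')
loop.call_later(0, do_open, context=contextvars.Context())  # explicit empty context
loop.call_later(0.01, loop.stop)
loop.run_forever()
loop.close()
```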

Co-authored-by: Chris Sewell <chrisj_sewell@hotmail.com>

* CI: Add retry for polish workchains (#4733)

To mitigate failures on Jenkins

* 🐛 FIX: Standardise transport task interrupt handling (#4692)

For all transport tasks (upload, submit, update, retrieve),
both `plumpy.futures.CancelledError` and `plumpy.process_states.Interruption` exceptions
should be ignored by the exponential backoff mechanism (i.e. the task should not be retried)
and raised directly (as opposed to as a `TransportTaskException`),
so that they can be correctly caught by the `Waiting.execute` method.

As an example, this fixes a known bug, whereby the upload task could not be
cancelled via `CTRL-C` in an ipython shell.

* Update use of various deprecated APIs (#4719)

This replaces the use of various deprecated APIs pointed out by warnings
thrown during runs of the test suite.
It also introduces one new feature and a bug fix.

Features:

 * Add non-zero exit code for failure to most `verdi daemon` commands, 
    so tests will catch possible errors.

Bug fixes:

* A couple of files were opened but not closed

Updates of deprecated APIs:

* np.int is deprecated alias of int

* np.float is deprecated alias of float

* put_object_from_filelike: force is deprecated

* archive import/export:  `silent` keyword is deprecated in favor of logger

* computer name => label

* Fix tests writing to the repository of nodes after they had been stored
  by replacing all times we use `.open` with `'w'` or `'wb'` mode
  with a correct call to `put_object_from_filelike` *before* the node is stored.

In one case, the data comes from a small archive file. In this case,
I recreated the (zipped) .aiida file adding two additional (binary) files
obtained by gzipping a short string.
This was used to ensure that `inputcat` and `outputcat` work also
when binary data was requested. Actually, this is better than before,
where the actual input or output of the calculation were overwritten
and then replaced back.

* communicator: replace deprecated stop() by close()

* silence some deprecation warnings in tests of APIs that will be removed in 2.0

Note that while unmuting the `ResourceWarning` was good to spot
some issues (bug fix above), the warning is raised in a couple more 
places where it's less obvious to fix (typically related to the daemon
starting some process in the background - or being started itself -
and not being stopped before the test actually finished).
I think this is an acceptable compromise - maybe we'll figure out
how to selectively silence those, and keeping warnings visible will
help us figure out possible leaks in the future.

Co-authored-by: Giovanni Pizzi <giovanni.pizzi@epfl.ch>

* ✨ NEW: Add `verdi database summary` (#4737)

Prints a summary of the count of each entity and,
with the `-v` flag, an additional summary of the unique identifiers for some entities.

* Upgrading dependency of sqlalchemy-utils (#4724)

* Upgrading dependency of sqlalchemy-utils

In sqlalchemy-utils 0.35, imports from collections were correctly
fixed to import from collections.abc (where this is needed).
This removes a few deprecation warnings (claiming that this will not
work in py 3.9, even if in reality this will stop working in py 3.10).
This partially addresses #4723.

We are actually pinning to >=0.36 since in 0.36 a feature was dropped
that we were planning to use (see #3845). In this way, we avoid relying
on a feature that is removed in later versions (risking implementing
something that we then have to remove, or even worse remaining "pinned"
to an old version of sqlalchemy-utils because nobody has the time
to fix it with a different implementation [which is tricky and requires
some knowledge of how SqlAlchemy and PostgreSQL work]).

* Automated update of requirements/ files. (#4734)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Carl Simon Adorf <simon.adorf@epfl.ch>

* Bump aiida-prerequisites base image to 0.3.0 (#4738)

Changes in the new image:
- Updated conda (4.9.2)
- Start ssh-agent at user's startup

Co-authored-by: Chris Sewell <chrisj_sewell@hotmail.com>

* Add CalcJob test over SSH (#4732)

Adds a configuration for a remote computer (slurm docker container) and uses it
to run a CalcJob test over SSH.

This is a follow-up on the memory leak tests, since the leak of the process
instance was discovered to occur only when running CalcJobs on a remote
computer via an SSH connection.

Co-authored-by: Chris Sewell <chrisj_sewell@hotmail.com>

* 🧪 TESTS: Add pytest `requires_rmq` marker (#4739)

* Work on `verdi group remove-nodes` command (#4728).

* `verdi group remove-nodes`: Add warning when nodes are not in the group

Currently, the `verdi group remove-nodes` command does not raise any
warning when the nodes that the user wants to remove are not in the
group. It also says it removed the number of requested nodes from the
group, even when none of them is in the group specified.

Here we:

* Have the command fail with a `Critical` message when none of the
requested nodes are in the group.
* Raise a warning when any of the nodes requested are not in the
specified group, and list the PK's of the nodes that are missing.

Note that the `Group.remove_nodes()` method still does not raise any
warning when the requested nodes are not in the group.

* Fix bug and improve API

Fixes a bug when the user actually doesn't provide any nodes. In case
the `--clear` flag is also not provided, the command will fail since
there is nothing to remove. In case it is provided, the command will ask
for confirmation to remove all nodes unless the force flag is also set.

* Fail if both the `--clear` flag and nodes are provided

In the current API, it doesn't make sense to provide *both* the
`--clear` flag and a list of node identifiers. Here we check if both are
provided and abort the command in this case.

* Add tests.

* CI: Increase output verbosity of tests suite. (#4740)

* Skip test 'TestVerdiProcessDaemon::test_pause_play_kill'. (#4747)

The test randomly fails to complete within a reasonable amount of time
leading to a significant disruption of our CI pipeline.

Investigated in issue #4731.

* 🧪 TESTS: Fix pre-commit (pin astroid) (#4757)

Temporary fix for https://github.com/PyCQA/astroid/issues/895

* 🧪 TESTS: Fix plumpy incompatibility (#4751)

As of https://github.com/aiidateam/plumpy/commit/7004bd96bbaa678b5486a62677e139216877deef,
a paused workchain will hang if it is closed then played.
This test violated that rule and also was faulty,
in that it should test that the reloaded workchain can be played,
not the original workchain.

* 🔧 MAINTAIN: Reduce test warnings (#4742)

This commit reduces the number of pytest warnings of the test suite,
from 719 to 122:

- Replace `collections` with `collections.abc`
- pytest-asyncio does not work with `unittest.TestCase` derived tests (https://github.com/pytest-dev/pytest-asyncio/issues/77).
- `ProcessFuture` already closed via polling should not set a result via a broadcast event.
- Upgrade kiwipy and plumpy to fix:
  - https://github.com/aiidateam/kiwipy/pull/98
  - https://github.com/aiidateam/plumpy/pull/204
  - https://github.com/aiidateam/plumpy/pull/206

* 📚 DOCS: Add `BaseRestartWorkchain` how-to (#4709)

This section is adapted from:
https://github.com/aiidateam/aiida-tutorials/blob/master/docs/pages/2020_Intro_Week/sections/workflows_adv.rst

* CI: Bump reentry to v1.3.2 (#4746)

* 🐛 FIX: Node comments API (#4760)

* Fix hanging direct scheduler+ssh (#4735)

* Fix hanging direct scheduler+ssh

The fix is very simple: in the ssh transport, to emulate 'chdir',
we keep the current directory in memory, and we prepend every command
with a `cd FOLDER_NAME && ACTUALCOMMAND`.

One could put `;` instead of `&&`, but then if the folder does not
exist the ACTUALCOMMAND would still be run in the wrong folder, which is
very bad (imagine you are removing files...).

Now, in general this is not a problem. However, the direct scheduler
inserts a complex-syntax bash command to run the command in the background
and immediately get the PID of that process without waiting.
When combined with SSH, this hangs until the whole process is completed, unless
the actual command is wrapped in brackets.

A simple way to check this is running these two commands, that reproduce
the issue with plain ssh, without paramiko:

This hangs for 5 seconds:
```
ssh localhost 'cd tmp && sleep 5 > /dev/null 2>&1 & echo $!'
```

This returns immediately, as we want:
```
ssh localhost 'cd tmp && ( sleep 5 > /dev/null 2>&1 & echo $! )'
```

Also, a regression test is added for the hanging direct+ssh combination.
This test checks that submitting a long job over the direct scheduler
does not "hang" with any plugin.

Co-authored-by: Leopold Talirz <leopold.talirz@gmail.com>

* ♻️ REFACTOR: configuration management API and CLI (#4712)

This commit primarily refactors the `verdi config` command
and merges the `cache_config.yml` into the `config.json`.

`config.json` changes:
- A jsonschema is added to validate the `config.json`,
  and also provide the options/defaults previously in `aiida/manage/configuration/options.py`.
- Rename option keys (with migration),
  for consistency with the internal representation
  (also rename `user.` fields to `autofill.user.`)
- Allow the `config.json` to contain a `$schema` key,
  that is preserved when storing new data
- Deprecated `cache_config.yml`: auto-merged into `config.json`,
  with deprecation warning, then renamed
- An `rmq.task_timeout` option has also been added
  (with default increased from 5 to 10 seconds),
  to fix timeout errors at high process loads.

`verdi config` changes:
- Refactor `verdi config` into separate commands: list/get/set/show/unset
- Include deprecation for current `verdi config <KEY>`
- `verdi caching` lists all process entry points that are enabled/disabled for caching

Also, code in `aiida/manage/caching.py` now utilises
the `get_config_option` function to retrieve caching configuration.
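
For example, a minimal sketch of reading one of the new option keys through that accessor (assuming a configured profile):

```
from aiida.manage.configuration import get_config_option

# the increased default means fewer timeouts when many processes hit RabbitMQ at once
task_timeout = get_config_option('rmq.task_timeout')
print(task_timeout)
```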

* 🧪 TESTS: add  `config_with_profile` fixture (#4764)

This allows for the removal of
`temporary_config_instance` and `with_temporary_config_instance`
from `tests/utils/configuration.py`

* 👌 IMPROVE: `verdi config list/show` (#4762)

Ensure these commands still work before a profile has been configured.

* 👌 IMPROVE: Add config `logging.aiopika_loglevel` (#4768)

* 📚 DOCS: Add process submit diagram (#4766)

* 📚 DOCS: Add process submit diagram

* Create submit_sysml.pptx

* 👌 IMPROVE: CTRL-C on running process (#4771)

Do not call `kill` on a process that is already being killed.
Also log a different message,
so that the user can see that the original CTRL-C was actioned.

* 🐛 FIX: kill_calculation before job submitted (#4770)

`job_id` will not yet have been set,
so we should not ask the scheduler to kill it.

* 🐛 FIX: `ModificationNotAllowed` on workchain kill (#4773)

In `Process.kill` the parent is killed first, then the children.
However, for workchains when entering the `Wait` state, awaitables (e.g. children)
are each assigned to `WorkChain.on_process_finished` as a callback on termination.
When the child is killed, this callback then calls `resolve_awaitable`,
which tries to update the status of the parent.
The parent is already terminated though and the node sealed -> `ModificationNotAllowed`.

In this commit we therefore check if the parent is already in a terminal state,
before attempting to update its status.
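
A toy, self-contained sketch of the guard (a stand-in class; the real check happens in the workchain's awaitable callback):

```
class ToyParent:
    """Stand-in for the parent workchain, only to illustrate the guard."""

    def __init__(self):
        self.terminated = False

    def has_terminated(self):
        return self.terminated

    def resolve_awaitable(self, awaitable):
        print(f'updating parent status for {awaitable}')

    def on_child_finished(self, awaitable):
        if self.has_terminated():
            return  # parent already killed and its node sealed: do not touch it
        self.resolve_awaitable(awaitable)


parent = ToyParent()
parent.terminated = True           # simulate the parent being killed first
parent.on_child_finished('child')  # no-op instead of raising ModificationNotAllowed
```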

* 👌 IMPROVE: capture of node hashing errors (#4778)

Currently all exceptions are caught and ignored.
This commit adds a specific `HashingError` exception,
for known failure modes.
Only this exception is caught, if `ignore_errors=True`,
and the exception logged.

Also an `aiida_caplog` pytest fixture is added,
to enable logs from `AiiDA_LOGGER` to be captured.

* ⬆️ UPDATE: kiwipy/plumpy (#4776)

Update to new patch versions:

kiwipy v0.7.3:

- 👌 IMPROVE: Add debug logging for sending task/rpc/broadcast to RMQ.
- 👌 IMPROVE: Close created asyncio loop on RmqThreadCommunicator.close

plumpy v0.18.6:

- 👌 IMPROVE: Catch state change broadcast timeout

When using an RMQ communicator, the broadcast can timeout on heavy loads to RMQ.
This broadcast is not critical to the running of the process,
and so a timeout should not except it.
In aiida-core, the broadcast is subscribed to by `verdi process watch` (not critical),
in `aiida/engine/processes/futures.py:ProcessFuture` (unused),
and in `aiida/engine/runners.py:Runner.call_on_process_finish`
which has a backup polling mechanism on the node.

Also ensure the process PID is included in all log messages.

* Add fallback equality relationship based on uuid (#4753)

Add fallback equality relationship based on node uuid .

* Simplify AiidaTestCase implementation (#4779)

This simplifies the `AiidaTestCase` implementation - not yet replacing it with pytest fixtures, 
but hopefully getting one step closer to doing so eventually.

In particular
 * only truly backend-specific code is left in the backend-specific test classes
 * introduces `refurbish_db()`, which combines cleaning the db and repopulating it with a user (a common operation)
 * move creation of default computer from `setUpClass` to "on demand" (not needed by many tests)
 * merges the `reset_database` and `clean_db` functions that basically did the same thing
 * factors out the `get_default_user` function so that it can be reused outside the AiidaTestCase (`verdi setup`, pytest fixtures, ...) in a follow-up PR
 * add `orm.Computer.objects.get_or_create` (in analogy to similar methods for user, group, ...)

Note: While this change gets rid of unnecessary complexity, it does *not* switch to a mode where the database is cleaned between *every* test.
While some subclasses of `AiidaTestCase` do this, the `AiidaTestCase` itself only cleans the database in `setupClass`.
Some subclasses do significant test setup at the class level, which might slow things down if they had to be done for every test.

* 👌 IMPROVE: add broker_parameters to config schema (#4785)

* 👌 IMPROVE: Add 'exception' to projection mapping (#4786)

This commit adds `exception` to the list of allowed projections,
and also standardises the way the exception is set on the node
(capturing both the type and message).

* docs: reorder/simplify caching howto (#4787)

The howto on enabling caching has been reordered to move the concepts to
the beginning and technical details to where they fit better.

The figure has been simplified (complexity introduced by second input
node unnecessary).

Added explicit mention of the fact that hashing is enabled by default
(which may not be obvious).

Co-authored-by: Chris Sewell <chrisj_sewell@hotmail.com>

* docs: reference caching howto in workflow section (#4789)

It might be helpful to people learning about AiiDA workflows to know that caching exists and point them in that direction.

Co-authored-by: Leopold Talirz <leopold.talirz@gmail.com>

* setup: move away from legacy build backend (#4790)

The `pyproject.toml` was originally added in ca75832afb002b344b5854f2f049c74e80cad36b 
without specifying a backend, which implicitly defaults to the legacy `setuptools.build_meta:__legacy__` one. 
This choice was made explicit in a2bebb422f4a7b75e8ef65fd797f128abf12c6cc 

This can lead to issues when using a *system* version of setuptools
< 40.8.0, see [1].
We believe there is no good reason for sticking with the legacy build system.

I've tested that the `reentry_register` hook still works with the new
build backend.

[1] https://github.com/pypa/setuptools/issues/1694#issuecomment-466010982

* fix pymatgen imports (#4794)

pymatgen made a breaking change in v2021.3.4 that removed many classes
from the top level of the package.
The alternative imports were already available in previous versions,
i.e. we don't need to upgrade the pymatgen dependency.

* 🐛 FIX: `get_pymatgen_version` (#4796)

In version 2022.0.3 it was moved

* 👌 IMPROVE: add type checking for aiida/orm/nodes/process (#4772)

This commit adds type definitions to all code in `aiida/orm/nodes/process`,
and enables mypy type checking of the files.

Additionally, to fix mypy failures, two changes to the code were made:

1. Change `CalcJobNode.get_description` to return a string
2. In `aiida/engine/processes/calcjobs/tasks.py`,
   change `node.computer.get_authinfo(node.user)` to `node.get_authinfo()`,
   to use `CalcJobNode.get_authinfo` which checks if the computer is set.

* 🐛 FIX: `WorkChain.resolve_awaitable` (#4795)

An alteration to a recent fix (#4773);
`Process.has_terminated` is a method, not a property.

* 🐛 FIX: `Task.cancel` should not set state as EXCEPTED (#4792)

Currently, stopping the daemon in python 3.7 excepts all processes.
This is due to the code in `shutdown_runner`,
which cancels all asyncio tasks running on the loop,
including process continue and transport tasks.

Cancelling a task raises an `asyncio.CancelledError`.
In python 3.8+ this exception only inherits from `BaseException`,
and so is not caught by any `except Exception` "checkpoints" in plumpy/aiida-core.
In python <= 3.7 however, the exception is equal to `concurrent.futures.CancelledError`,
and so it was caught by one of:
`Process.step`, `Running.execute` or `ProcessLauncher.handle_continue_exception`
and the process was set to an excepted state.

Ideally in the long-term, we will alter `shutdown_runner`,
to not use such a "brute-force" mechanism.
But in the short term this commit directly fixes the issue,
by re-raising the `asyncio.CancelledError` exception.

* Docs: fix the citation links on the index page (#4800)

The links were still using markdown syntax instead of restructured text.

* `CalcJob`: add the option to stash files after job completion (#4424)

A new namespace `stash` is added to the `metadata.options` input
namespace of the `CalcJob` process. This option namespace allows a user
to specify certain files that are created by the calculation job to be
stashed somewhere on the remote. This can be useful if those files need
to be kept on the remote machine for longer than the scratch space where
the job was run is typically retained, but do not need to be
retrieved. Examples are files that are necessary to
restart a calculation but are too big to be retrieved and stored
permanently in the local file repository.

The files that are to be stashed are specified through their relative
filepaths within the working directory in the `stash.source_list`
option. For now, the only supported option is to have AiiDA's engine
copy the files to another location on the same filesystem as the working
directory of the calculation job. The base path is defined through the
`stash.target_base` option. In the future, other methods may be
implemented, such as placing all files in a (compressed) tarball or even
stash files on tape. Which mode is to be used is communicated through
the enum `aiida.common.datastructures.StashMode` which for now therefore
only has the `COPY` value.
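
A sketch of how the option namespace might be filled in when launching a calculation job (the target path and file names are illustrative):

```
from aiida.common.datastructures import StashMode

stash_options = {
    'source_list': ['aiida.out', 'restart.chk'],  # paths relative to the working directory
    'target_base': '/storage/project/stash',      # base path on the same filesystem
    'stash_mode': StashMode.COPY.value,           # currently the only supported mode
}

inputs = {'metadata': {'options': {'stash': stash_options}}}
```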

If the `stash` option namespace is defined for a calculation job, the
daemon will perform the stashing operations before the files are
retrieved. This means that the stashing happens before the parsing of
the output files (which occurs after the retrieving step), and so the
files will be stashed independent of the final exit status that the
parser will assign to the calculation job. As a result, files may be
stashed for calculations that will later be considered to have
failed. However, the stashed files can always be
deleted manually by the user afterwards if needed.

Finally, the stashed files are represented by an output node that is
attached to the calculation node through the label `remote_stash`. Just
like the `remote_folder` node, this represents a location or files on a
remote machine and so is merely a "symbolic link" of sorts. AiiDA does
not actually own the files and the contents may disappear at some point.
To be able to distinguish the stashed folder from the remote folder, a
new data plugin is used, the `RemoteStashFolderData`. The base class is
`RemoteStashData` which is not instantiable, but will merely serve as a
base class for future subclasses, one for each `StashMode` value. The
reason is that the way the files need to be accessed depends on the way they
were stashed, and so it is good to have separate classes for this.

It was considered to give `RemoteFolderData` and `RemoteData` the same
base class (changing the type of the `remote_folder` to a new subclass
`RemoteFolderData`) but this would introduce breaking changes and so this
was relegated to a potential future major release.

* `verdi process play`: only query for active processes with `--all` flag (#4671)

The query used to target all process nodes with the `paused` attribute, including
those in a terminal state. Here an additional filter is added to only query for nodes
in an active process state, because terminal nodes should not be affected. This
should speed up the query in principle.

* Dependencies: update pymatgen version specification (#4805)

Addresses #4797

* Dependencies: Pin sqlalchemy to minor release (#4809)

Version 1.4 currently breaks `verdi setup` and indeed,
according to https://www.sqlalchemy.org/download.html,
minor releases of SqlAlchemy may have breaking changes.

* 📚 DOCS: Add documentation on stashing (#4812)

Some additional minor changes

* Add link for `TransferCalcjob` feedback
* Add `versionadded` to `TransferCalcjob` docs

* 🔧 MAINTAIN: Add PyPI release workflow (#4807)

This workflow is intended to reduce the potential for manual errors and faulty releases.

When you create the release, and hence the git tag, this workflow is triggered;
it checks that the created tag matches the aiida package version,
runs pre-commit and (some) pytest tests and, if they all pass, deploys to PyPI.

* 🚀 RELEASE: v1.6.0 (#4816)

Co-authored-by: ramirezfranciscof <ramirezfranciscof@users.noreply.github.com>

Co-authored-by: Sebastiaan Huber <mail@sphuber.net>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Dominik Gresch <greschd@users.noreply.github.com>
Co-authored-by: flavianojs <flavianojs@live.com>
Co-authored-by: Marnik Bercx <mbercx@gmail.com>
Co-authored-by: Jason Eu <morty.yu@yahoo.com>
Co-authored-by: ramirezfranciscof <ramirezfranciscof@users.noreply.github.com>
Co-authored-by: Pranjal Mishra <39010495+pranjalmish1@users.noreply.github.com>
Co-authored-by: Leopold Talirz <leopold.talirz@gmail.com>
Co-authored-by: Casper Welzel Andersen <casper.andersen@epfl.ch>
Co-authored-by: Carl Simon Adorf <carl.simon.adorf@gmail.com>
Co-authored-by: Carl Simon Adorf <simon.adorf@epfl.ch>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Giovanni Pizzi <giovanni.pizzi@epfl.ch>
Co-authored-by: Aliaksandr Yakutovich <yakutovicha@gmail.com>
Co-authored-by: DanielMarchand <Daniel.marchand@gmail.com>