♻️ Allow for file uploads/downloads to be async #6079

Closed
wants to merge 4 commits into from

chrisjsewell
Member

Note, this PR is currently dependent on aiidateam/plumpy#272


Currently, a possible bottleneck for workers (running potentially 1000s of processes asynchronously) is the upload/retrieval of calculation input/output data from an external compute resource (e.g. HPC).
This runs in a blocking manner, i.e. all other async tasks have to wait until all the input or outputs are fully uploaded/retrieved.

This could be made asynchronous, either at the "file level" - relinquishing control to the event loop after each file upload/download, or even at the "byte level" - relinquishing control after each "chunk of a file" has been uploaded/downloaded.
(For other transports, like FirecREST, there are further aspects of async to consider.)
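
As a rough illustration of the two granularities, the sketch below uses plain asyncio and local file copies in place of a real transport. The names `copy_file_level`, `copy_byte_level` and `heartbeat` are hypothetical and are not part of aiida-core; the point is only where control is handed back to the event loop.

```python
import asyncio
import shutil
import tempfile
from pathlib import Path


async def copy_file_level(sources, dest_dir):
    """Copy whole files, yielding to the event loop after each file."""
    for src in sources:
        shutil.copy(src, dest_dir / src.name)  # blocks for the duration of one file
        await asyncio.sleep(0)  # let other tasks run before the next file


async def copy_byte_level(src, dest, chunk_size=4):
    """Copy one file in chunks, yielding to the event loop after each chunk."""
    with open(src, 'rb') as fin, open(dest, 'wb') as fout:
        while chunk := fin.read(chunk_size):
            fout.write(chunk)
            await asyncio.sleep(0)  # let other tasks run before the next chunk


async def heartbeat(log):
    """Stand-in for any other task the worker is running."""
    for _ in range(3):
        log.append('tick')
        await asyncio.sleep(0)


async def demo():
    log = []
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        src = tmp / 'input.txt'
        src.write_bytes(b'0123456789')
        out = tmp / 'out'
        out.mkdir()
        # All three coroutines share the loop; none has to wait for a full transfer.
        await asyncio.gather(
            copy_file_level([src], out),
            copy_byte_level(src, tmp / 'chunked.txt'),
            heartbeat(log),
        )
        return (out / 'input.txt').read_bytes(), (tmp / 'chunked.txt').read_bytes(), log


file_copy, chunk_copy, log = asyncio.run(demo())
```

The file-level variant still blocks while a single large file is transferred, whereas the byte-level variant bounds the blocking time by the chunk size.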

This particular PR does not actually implement any async behaviour for uploads/downloads; it only modifies the engine API so that implementations of the Transport interface can do so.

The PR changes the following functions/methods to async:

  • execmanager.upload_calculation
  • execmanager.retrieve_calculation
  • execmanager.retrieve_files_from_list
  • CalcJob.run
  • CalcJob._perform_dry_run
  • CalcJob._perform_import

However, none of these are intended for direct use by users, so I would suggest this change is backwards compatible.
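
To make the shape of such an API change concrete, here is a minimal sketch and not the actual aiida-core code. It assumes a transport exposing an awaitable `put_async` next to a blocking `put`; the classes `BlockingTransport` and `YieldingTransport`, the simplified `upload_calculation` signature, and the file names are all hypothetical.

```python
import asyncio


class BlockingTransport:
    """Hypothetical transport whose ``put_async`` just wraps the blocking ``put``."""

    def __init__(self):
        self.uploaded = []

    def put(self, path):
        self.uploaded.append(path)

    async def put_async(self, path):
        # Default behaviour: awaitable, but no real concurrency.
        self.put(path)


class YieldingTransport(BlockingTransport):
    """Hypothetical plugin that hands control back to the loop on every upload."""

    async def put_async(self, path):
        await asyncio.sleep(0)  # stand-in for non-blocking network I/O
        self.put(path)


async def upload_calculation(transport, paths):
    """Simplified stand-in for an engine function that has become a coroutine."""
    for path in paths:
        await transport.put_async(path)


async def main():
    results = []
    for transport in (BlockingTransport(), YieldingTransport()):
        await upload_calculation(transport, ['aiida.in', 'submit.sh'])
        results.append(transport.uploaded)
    return results


results = asyncio.run(main())
```

With this layout the engine code is identical for both transports; only a plugin that overrides `put_async` actually yields to the event loop.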

@chrisjsewell
Member Author

chrisjsewell commented Jul 9, 2023

As discussed with @giovannipizzi (who suggested it)

@khsrali
Contributor

khsrali commented Oct 15, 2024

Hi @chrisjsewell, I hope you still remember some of your implementation here 🤞 🥲
So I'm trying out this PR (I resolved the conflicts locally and also installed the relevant plumpy).

However, I'm facing a "racing" scenario among the "sub-tasks" of a calcjob.
That is, if transport.put_async in upload_calculation takes some time (I manually inserted await asyncio.sleep(1)), the other tasks of this specific process race ahead of each other, like submit or, I don't know, retrieve_calculation. Eventually, the calcjob ends up excepted because, e.g., the files are not uploaded yet and submit cannot find them.
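
The symptom described above can be reproduced in isolation with plain asyncio. This is only a sketch of the ordering problem and makes no claim about how plumpy actually schedules these steps; `upload`, `submit` and the `remote` set are hypothetical stand-ins.

```python
import asyncio


async def upload(remote):
    await asyncio.sleep(0.01)  # slow upload, like the manually inserted sleep
    remote.add('submit.sh')


async def submit(remote):
    if 'submit.sh' not in remote:
        raise FileNotFoundError('submit.sh has not been uploaded yet')
    return 'submitted'


async def racing():
    remote = set()
    # Both steps start at once, so submit runs before the upload has finished.
    return await asyncio.gather(upload(remote), submit(remote), return_exceptions=True)


async def sequential():
    remote = set()
    await upload(remote)  # wait for the upload to complete first
    return await submit(remote)


raced = asyncio.run(racing())
ordered = asyncio.run(sequential())
```

In the racing variant the second entry of `raced` is a `FileNotFoundError`, while awaiting the steps one after another lets `submit` succeed.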

To solve this, however, I have a main question:

  • Where and how exactly are these "sub-tasks" lined up in "the" concurrent queue, and where is that queue? How can I add these steps to that queue one after another?

I understand that a lot of this sort of magic is done in plumpy but JEEZ, it takes quite some time to understand it.

Maybe @sphuber could also answer this question

@khsrali
Contributor

khsrali commented Nov 21, 2024

I pulled these changes, fixed the issues, and developed a new transport plugin on top of them.
Check #6626
