Refactor the archive exporter #4448
Conversation
This PR is part of a set that aims to decouple the generation of the data to export from the actual writing of the archive, as a prelude to introducing a new archive format. This particular PR does not introduce any breaking changes; it:

- splits the amazingly complex `export_tree` function into a number of separate, self-contained functions
- adds mypy-compliant typing
- fixes a bug whereby, in Python 3.8, `logging.disable(level=` has changed to `logging.disable(lvl=` (see the sketch below)
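(Purely as an illustrative aside, not the fix applied in this PR: passing the level positionally sidesteps the keyword-name change entirely.)

```python
import logging

# The level can be passed positionally, which works regardless of whether the
# keyword parameter is spelled `level` or `lvl` in a given Python version.
logging.disable(logging.CRITICAL)  # silence all logging up to and including CRITICAL
logging.disable(logging.NOTSET)    # restore normal logging afterwards
```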
This refactor fully decouples the extraction of data from the database from the writing of that data to an archive. It provides an abstract writer class, and each concrete writer is identified by a string (this could in fact be made into an entry point).

This code is backwards incompatible in terms of the Python API, in that it removes the `extract_zip` and `export_tar` functions, but it should still work with the verdi CLI, which only calls the `export` function, whose signature remains the same.

The commit also adds a backward-compatible install requirement, `dataclasses`, which is included in the standard library from Python 3.7 onwards. The tests have not yet been altered to accommodate these changes.
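(For illustration only; the names below are hypothetical and not necessarily the classes added in this PR. The "abstract writer plus string lookup" pattern described above could look roughly like this.)

```python
from abc import ABC, abstractmethod
from typing import Type


class ArchiveWriterAbstract(ABC):
    """Base class implemented by all concrete archive writers."""

    @abstractmethod
    def write(self, export_data: dict) -> dict:
        """Write the collected export data and return metadata about the written archive."""


class WriterZip(ArchiveWriterAbstract):
    """Write the export data to a (compressed) zip file."""

    def write(self, export_data: dict) -> dict:
        ...  # serialise export_data into a zip archive here
        return {'format': 'zip'}


# concrete writers are registered under a string key
# (this mapping could later be replaced by an entry point group)
WRITERS = {'zip': WriterZip}


def get_writer(name: str) -> Type[ArchiveWriterAbstract]:
    """Look up a writer class by its string identifier."""
    return WRITERS[name]
```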
@sphuber this is the second part of my master plan 😄

Note the last thing I plan to do (maybe in a separate PR) is to decouple the ~hard-coded progress bar implementation into a more abstract (possibly callback-like) mechanism
thanks a lot @chrisjsewell
there are a couple of minor comments, but mainly naming issues, nothing major.
I will now check out the PR and do a local test, to see whether performance remains the same.
group_uuids: Dict[str, List[str]]
link_uuids: List[dict]
# all entity data from the database, except Node extras and attributes
entity_data: Dict[str, Dict[int, dict]]
Any particular reason why this does not need to be iterable?
well I've just initially copied what was being used previously.
It probably could be though.
node_data was the main one I wanted to make iterable, because that will be the one that takes up the most memory. I actually want to make it a generator, but just need to make sure that works with the progress bar updates
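(A minimal sketch of that idea, with hypothetical names rather than the actual code: a generator can still drive progress updates as long as the total count is known up front.)

```python
from typing import Callable, Dict, Iterator, Optional, Tuple


def iter_node_data(
    node_rows: Dict[int, dict],
    progress_callback: Optional[Callable[[int, int], None]] = None,
) -> Iterator[Tuple[int, dict]]:
    """Yield node data one entry at a time instead of materialising it all in memory."""
    total = len(node_rows)
    for count, (pk, fields) in enumerate(node_rows.items(), start=1):
        if progress_callback is not None:
            progress_callback(count, total)  # e.g. update a progress bar
        yield pk, fields
```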
I now did some tests exporting. Performance seems unaffected (had no reason to think it was, just checking to be safe) - there is just one minor detail: I think the final progress bar is no longer being removed.

[screenshots: old vs. new progress bar output]
@sphuber @giovannipizzi fyi - I just want to reiterate that (on the server I'm testing on, with a magnetic disk) the time spent on exporting the file repository strongly dominates the overall export time. Exporting the same set of nodes twice in a row, I am seeing huge differences in the timings for the export of the repository just from the disk cache - with the total export time going down by a factor of 6, from 192s the first time to ~31s the second time.

yep already got a fix for that 👍
Codecov Report
@@ Coverage Diff @@
## develop #4448 +/- ##
===========================================
+ Coverage 79.31% 79.37% +0.06%
===========================================
Files 475 475
Lines 34840 34930 +90
===========================================
+ Hits 27631 27722 +91
+ Misses 7209 7208 -1
thanks @chrisjsewell !
from my side this is fine, just two minor comments. you can fix them or not - you'll need to update the PR anyhow before merging.
@@ -475,9 +492,9 @@ def export(
    **traversal_rules
)

output_data = {}
report_data: Dict[str, Any] = {'time_write_start': None, 'time_write_stop': None, 'writer_data': None}
Just out of curiosity - is this the suggested way to build instances of dataclasses, i.e. to build a dict and then pass it to the constructor?
E.g. couldn't you just initialize it below with
report = ExportReport(time_collect_start=..., time_collect_stop=...)
and then just add to the instance?
It's not very important here, but in general it would avoid retyping the attributes of the data class as dict keys (possibly introducing typos).
well you would have to initialise it with a dummy `time_collect_start` etc, then assume that it will get overridden later on, which I don't think is a good idea
no, you could just initialize it once you have both collect times
ah potatas, potatas, both are valid 😉
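(For the record, the two approaches discussed above look roughly like this; the field values are dummies, but the field names follow the `ExportReport` attributes mentioned above.)

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExportReport:
    time_collect_start: float
    time_collect_stop: float
    time_write_start: Optional[float] = None
    time_write_stop: Optional[float] = None


# option 1: accumulate a plain dict, then unpack it into the constructor
report_data = {'time_collect_start': 1.0, 'time_collect_stop': 2.0}
report = ExportReport(**report_data)

# option 2: construct the instance once both collect times are known,
# then set the remaining fields as they become available
report = ExportReport(time_collect_start=1.0, time_collect_stop=2.0)
report.time_write_start = 3.0
report.time_write_stop = 4.0
```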
Co-authored-by: Leopold Talirz <leopold.talirz@gmail.com>
ok, so turns out the summary of traversal rules output by the …

Then to test (for those interested, but also a note to self):
```yaml
version: '3.4'
services:
  aiida:
    build:
      context: ../
      dockerfile: Dockerfile
    image: "aiidateam/aiida-core:latest"
    container_name: aiida-core
    networks:
      - aiida-core
    volumes:
      - "../:/tmp/mount_folder"
networks:
  aiida-core:
```

```console
$ docker-compose -f .ci/docker-aiida-core.yml up -d --build
$ docker exec -it --user aiida aiida-core /bin/bash
```
```python
from aiida import orm

# build a dummy provenance graph to export (recursive_provenance is a test helper
# that recursively builds a provenance graph below the root node)
root = orm.Dict()
recursive_provenance(root, depth=9, breadth=3, num_objects=1)
```
```console
$ verdi export create --input-calc-forward -N 1 -- tmp/mount_folder/out.aiida

EXPORT
-------------- --------------------------
Archive tmp/mount_folder/out.aiida
Format Zip (compressed)
Export version 0.9
Inclusion rules
----------------- ----
Include Comments True
Include Logs True
Traversal rules
--------------------------------- -----
Follow links input calc forwards True
Follow links input calc backwards True
Follow links create forwards True
Follow links create backwards True
Follow links return forwards True
Follow links return backwards False
Follow links input work forwards False
Follow links input work backwards True
Follow links call calc forwards True
Follow links call calc backwards True
Follow links call work forwards True
Follow links call work backwards True
Exporting a total of 40066 database entries, of which 40065 are Nodes.
Success: wrote the export archive file to tmp/mount_folder/out.aiida
```
hehe, I think this warrants opening an issue to be closed by this PR ;-)

lol do we get issue closing related pay

just as a record that this issue existed ;-)
I'm fine with this if the tests pass.
Please refrain from using emojis in the final commit message for the time being ;-)
👎 😢 😭

Honestly they do make the commit log so much clearer though (but I'll refrain)
Ah, got some autodoc nitpicks to fix: …

I believe this is due to the order of processing (i.e. the object referencing it is processed before the object being referenced). It must be a sphinx v3 regression; I'll open an issue on sphinx about it
This PR builds on #4448, with the goal of improving both the export writer API (allowing for "streamed" data writing) and the performance of the export process (CPU and memory usage).

The writer is now used as a context manager, rather than passing all data to it after extraction of the data from the AiiDA database. This means it is called throughout the export process, and will allow for less data to be kept in RAM when moving to a new archive format.

The number of database queries has also been reduced, resulting in a faster process.

Lastly, code for reads/writes to the archive has been moved to the https://github.com/aiidateam/archive-path package. This standardises the interface for both zip and tar and, especially for export to tar, provides much improved performance, since the data is now written directly to the archive (rather than writing to a folder and only compressing at the end).

Co-authored-by: Leopold Talirz <leopold.talirz@gmail.com>
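(Schematically, and with stand-in code rather than the actual aiida-core or archive-path API: using the writer as a context manager means data can be streamed into the archive as it is extracted, instead of being collected in full first.)

```python
import json
import zipfile
from contextlib import contextmanager


@contextmanager
def archive_writer(filename: str):
    """Yield an open zip archive and guarantee it is finalised on exit (illustrative only)."""
    archive = zipfile.ZipFile(filename, mode='w', compression=zipfile.ZIP_DEFLATED)
    try:
        yield archive
    finally:
        archive.close()


# entries are written as they are produced, so they need not all be held in RAM
with archive_writer('out.aiida') as archive:
    for name, entry in [('node-1', {'uuid': 'abc'}), ('node-2', {'uuid': 'def'})]:
        archive.writestr(f'{name}.json', json.dumps(entry))
```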
I would appreciate it if I were kept in the loop when we add dependencies to aiida-core.
thanks @csadorf, @chrisjsewell and I will remember it
This PR builds on top of #4446