Refactor the archive exporter #4448
Conversation
This PR is part of a set that aims to decouple the generation of the data to export from the actual writing of the archive, as a prelude to introducing a new archive format. This particular PR does not introduce any breaking changes; it:

- splits the amazingly complex `export_tree` function into a number of separate, self-contained functions
- adds mypy-compliant typing
- fixes a bug whereby, in Python 3.8, `logging.disable(level=` has changed to `logging.disable(lvl=` (see the sketch below)
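(Purely as an illustrative aside, not the fix applied in this PR: passing the level positionally sidesteps the keyword-name change entirely.)

```python
import logging

# The level can be passed positionally, which works regardless of whether the
# keyword parameter is spelled `level` or `lvl` in a given Python version.
logging.disable(logging.CRITICAL)  # silence all logging up to and including CRITICAL
logging.disable(logging.NOTSET)    # restore normal logging afterwards
```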
This refactor fully decouples the extraction of data from the database from the writing of that data to an archive. It provides an abstract writer class, and each concrete writer is identified by a string (this could in fact be made into an entry point).

This code is backwards incompatible in terms of the Python API, in that it removes the `extract_zip` and `export_tar` functions, but it should still work with the verdi CLI, which only calls the `export` function, whose signature remains the same.

The commit also adds a backward-compatible install requirement, `dataclasses`, which is included in the standard library from Python 3.7 onwards. The tests have not yet been altered to accommodate these changes.
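(For illustration only; the names below are hypothetical and not necessarily the classes added in this PR. The "abstract writer plus string lookup" pattern described above could look roughly like this.)

```python
from abc import ABC, abstractmethod
from typing import Type


class ArchiveWriterAbstract(ABC):
    """Base class implemented by all concrete archive writers."""

    @abstractmethod
    def write(self, export_data: dict) -> dict:
        """Write the collected export data and return metadata about the written archive."""


class WriterZip(ArchiveWriterAbstract):
    """Write the export data to a (compressed) zip file."""

    def write(self, export_data: dict) -> dict:
        ...  # serialise export_data into a zip archive here
        return {'format': 'zip'}


# concrete writers are registered under a string key
# (this mapping could later be replaced by an entry point group)
WRITERS = {'zip': WriterZip}


def get_writer(name: str) -> Type[ArchiveWriterAbstract]:
    """Look up a writer class by its string identifier."""
    return WRITERS[name]
```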
@sphuber this is the second part of my master plan 😄

Note the last thing I plan to do (maybe in a separate PR) is to decouple the ~hard-coded progress bar implementation into a more abstract (possibly callback-like) mechanism
thanks a lot @chrisjsewell
there are a couple of minor comments, but mainly naming issues, nothing major.
I will now check out the PR and do a local test, to see whether performance remains the same.
group_uuids: Dict[str, List[str]]
link_uuids: List[dict]
# all entity data from the database, except Node extras and attributes
entity_data: Dict[str, Dict[int, dict]]
Any particular reason why this does not need to be iterable?
well I've just initially copied what was being used previously.
It probably could be though.
node_data was the main one I wanted to make iterable, because that will be the one that takes up the most memory. I actually want to make it a generator, but just need to make sure that works with the progress bar updates
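(A minimal sketch of that idea, with hypothetical names rather than the actual code: a generator can still drive progress updates as long as the total count is known up front.)

```python
from typing import Callable, Dict, Iterator, Optional, Tuple


def iter_node_data(
    node_rows: Dict[int, dict],
    progress_callback: Optional[Callable[[int, int], None]] = None,
) -> Iterator[Tuple[int, dict]]:
    """Yield node data one entry at a time instead of materialising it all in memory."""
    total = len(node_rows)
    for count, (pk, fields) in enumerate(node_rows.items(), start=1):
        if progress_callback is not None:
            progress_callback(count, total)  # e.g. update a progress bar
        yield pk, fields
```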
I now did some tests exporting. Performance seems unaffected (had no reason to think it was, just checking to be safe) - there is just one minor detail: I think the final progress bar is no longer being removed.

[screenshots: old vs. new progress bar output]
@sphuber @giovannipizzi fyi - I just want to reiterate that (on the server I'm testing on, with a magnetic disk) the time spent on exporting the file repository strongly dominates the overall export time. Exporting the same set of nodes twice in a row, I am seeing huge differences in the timings for the export of the repository just from the disk cache - with the total export time going down by a factor of 6, from 192s the first time to ~31s the second time.

yep already got a fix for that 👍
Codecov Report
@@ Coverage Diff @@
## develop #4448 +/- ##
===========================================
+ Coverage 79.31% 79.37% +0.06%
===========================================
Files 475 475
Lines 34840 34930 +90
===========================================
+ Hits 27631 27722 +91
+ Misses 7209 7208 -1
thanks @chrisjsewell !
from my side this is fine, just two minor comments. you can fix them or not - you'll need to update the PR anyhow before merging.
@@ -475,9 +492,9 @@ def export(
    **traversal_rules
)

output_data = {}
report_data: Dict[str, Any] = {'time_write_start': None, 'time_write_stop': None, 'writer_data': None}
Just out of curiosity - is this the suggested way to build instances of dataclasses, i.e. to build a dict and then pass it to the constructor?
E.g. couldn't you just initialize it below with
report = ExportReport(time_collect_start=..., time_collect_stop=...)
and then just add to the instance?
It's not very important here, but in general it would avoid retyping the attributes of the data class as dict keys (possibly introducing typos).
well you would have to initialise it with a dummy `time_collect_start` etc, then assume that it will get overridden later on, which I don't think is a good idea
no, you could just initialize it once you have both collect times
ah potatas, potatas, both are valid 😉
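(For the record, the two approaches discussed above look roughly like this; the field values are dummies, but the field names follow the `ExportReport` attributes mentioned above.)

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExportReport:
    time_collect_start: float
    time_collect_stop: float
    time_write_start: Optional[float] = None
    time_write_stop: Optional[float] = None


# option 1: accumulate a plain dict, then unpack it into the constructor
report_data = {'time_collect_start': 1.0, 'time_collect_stop': 2.0}
report = ExportReport(**report_data)

# option 2: construct the instance once both collect times are known,
# then set the remaining fields as they become available
report = ExportReport(time_collect_start=1.0, time_collect_stop=2.0)
report.time_write_start = 3.0
report.time_write_stop = 4.0
```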
Co-authored-by: Leopold Talirz <leopold.talirz@gmail.com>
ok, so turns out the summary of traversal rules output by the …

Then to test (for those interested, but also a note to self):
```yaml
version: '3.4'
services:
  aiida:
    build:
      context: ../
      dockerfile: Dockerfile
    image: "aiidateam/aiida-core:latest"
    container_name: aiida-core
    networks:
      - aiida-core
    volumes:
      - "../:/tmp/mount_folder"
networks:
  aiida-core:
```

```console
$ docker-compose -f .ci/docker-aiida-core.yml up -d --build
$ docker exec -it --user aiida aiida-core /bin/bash
```
```python
from aiida import orm

# build a dummy provenance graph to export (recursive_provenance is a test helper
# that recursively builds a provenance graph below the root node)
root = orm.Dict()
recursive_provenance(root, depth=9, breadth=3, num_objects=1)
```
```console
$ verdi export create --input-calc-forward -N 1 -- tmp/mount_folder/out.aiida

EXPORT
-------------- --------------------------
Archive tmp/mount_folder/out.aiida
Format Zip (compressed)
Export version 0.9
Inclusion rules
----------------- ----
Include Comments True
Include Logs True
Traversal rules
--------------------------------- -----
Follow links input calc forwards True
Follow links input calc backwards True
Follow links create forwards True
Follow links create backwards True
Follow links return forwards True
Follow links return backwards False
Follow links input work forwards False
Follow links input work backwards True
Follow links call calc forwards True
Follow links call calc backwards True
Follow links call work forwards True
Follow links call work backwards True
Exporting a total of 40066 database entries, of which 40065 are Nodes.
Success: wrote the export archive file to tmp/mount_folder/out.aiida
```
hehe, I think this warrants opening an issue to be closed by this PR ;-)

lol do we get issue closing related pay

just as a record that this issue existed ;-)
I'm fine with this if the tests pass.
Please refrain from using emojis in the final commit message for the time being ;-)
👎 😢 😭

Honestly they do make the commit log so much clearer though (but I'll refrain)
Ah, got some autodoc nitpicks to fix: …

I believe this is due to the order of processing (i.e. the object referencing it is processed before the object being referenced). It must be a sphinx v3 regression; I'll open an issue on sphinx about it
This PR builds on #4448, with the goal of improving both the export writer API (allowing for "streamed" data writing) and the performance of the export process (CPU and memory usage).

The writer is now used as a context manager, rather than passing all data to it after extraction of the data from the AiiDA database. This means it is called throughout the export process, and will allow for less data to be kept in RAM when moving to a new archive format.

The number of database queries has also been reduced, resulting in a faster process.

Lastly, code for reads/writes to the archive has been moved to the https://github.com/aiidateam/archive-path package. This standardises the interface for both zip and tar and, especially for export to tar, provides much improved performance, since the data is now written directly to the archive (rather than writing to a folder and only compressing at the end).

Co-authored-by: Leopold Talirz <leopold.talirz@gmail.com>
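(Schematically, and with stand-in code rather than the actual aiida-core or archive-path API: using the writer as a context manager means data can be streamed into the archive as it is extracted, instead of being collected in full first.)

```python
import json
import zipfile
from contextlib import contextmanager


@contextmanager
def archive_writer(filename: str):
    """Yield an open zip archive and guarantee it is finalised on exit (illustrative only)."""
    archive = zipfile.ZipFile(filename, mode='w', compression=zipfile.ZIP_DEFLATED)
    try:
        yield archive
    finally:
        archive.close()


# entries are written as they are produced, so they need not all be held in RAM
with archive_writer('out.aiida') as archive:
    for name, entry in [('node-1', {'uuid': 'abc'}), ('node-2', {'uuid': 'def'})]:
        archive.writestr(f'{name}.json', json.dumps(entry))
```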
I would appreciate it if I were kept in the loop when we add dependencies to aiida-core.
thanks @csadorf, @chrisjsewell and I will remember it
This PR builds on top of #4446