Prioritize migrating datasets to use versioning #277

Closed
gnn opened this issue Jun 9, 2021 · 6 comments

gnn commented Jun 9, 2021

Refactoring existing datasets to use versioning in order to speed up DAG runs takes a lot of time and varies greatly in implementation complexity. To better manage the time spent on migration tasks, we'll use a table collecting the datasets and information about them, so that we can prioritize migrating them:

| Dataset | Running Time | Connectedness | Migration Complexity | Issue | Status | Priority |
| --- | --- | --- | --- | --- | --- | --- |
| OpenStreetMap | 5 | 4 | 1 | | Usable | 3 |
| Zensus | 4 | | | | In Progress | 3 |
| Database Setup | 1 | 4 (No upstream task, but everything depends on it.) | 1 | | Done | 1 |
| VG250 | 1 (seconds) | 4 (many datasets depend on it) | 1 | #283 | Done | 5 |
| substation | 2 (minutes) | 3 (osmTGmod depends on it) | 1 | #304 | Done | 4 |
| osmTGmod | 5 (up to 10 hours for Germany) | 3 (needs VG250 and substation, other tasks depend on it) | 3 (tool integration) | | Done | 4 |
| scenario_parameters | 1 (seconds) | 4 (central datasets depend on it) | 1 (no upstream tasks, only code) | #309 | Done | 3 |
| demandregio | 2 (~10 min) | 3 | 4 (tool integration, un-versioned data from ffe) | | Done | 2 |
| mv grid district | | | | #297 | Done | |
| potential areas | | | | #297 | Done | |
| retrieve MaStR data | | | | #297 | Done | |
| egon_peta_heat | 1 | 2 | 1 | #302 | Done | 1 |
| district_heating_areas | 1 | 2 | 1 | #303 | Done | 1 |
| power plants | | | | #314 | Done | |
| weather data, RE feedin | | | | #314 | Done | |
| dynamic line rating | | | | | | |

You can insert information into the table by editing this post, or by putting it into the comments. I'll regularly scan the comments and insert any missing information into the table.
The numbers range from one to five, with one meaning "Low" (i.e. "Fast" for "Running Time"), three meaning "Medium" and five meaning "High" ("Slow"). Feel free to use fractional numbers if five levels aren't enough for your categorization needs. Also, if you have actual numbers, e.g. time measurements for the running time or the number of upstream and downstream tasks for connectedness, feel free to insert them as parenthesized comments.

nailend commented Jun 25, 2021

Working on PR #307 I was wondering whether I should adapt the folder structure of the files concerned. For example:
moving data/importing/mastr.py into data/datasets/mastr/__init__.py

For the mastr data and re_potential_data, I don't see any arguments against it, as they are basically only imported anyway. This would also lead to unambiguous task IDs in the Airflow graph view!

But what about the scripts currently in the data/processing directory? Have you already decided on a procedure?

@gnn @ClaraBuettner @IlkaCu @ulfmueller

I would appreciate your opinion on that.

gnn commented Jun 30, 2021

> Working on PR #307 I was wondering whether I should adapt the folder structure of the files concerned. For example:
> moving data/importing/mastr.py into data/datasets/mastr/__init__.py

IMHO definitely yes. The move to Datasets was also the first step in getting rid of the "importing"/"processing" divide. Concerning your example, I'd also like to point out that data/importing/mastr.py should move to data/datasets/mastr.py unless the "mastr" directory is necessary in order to hold additional files.

> For the mastr data and re_potential_data, I don't see any arguments against it, as they are basically only imported anyway. This would also lead to unambiguous task IDs in the Airflow graph view!

> But what about the scripts currently in the data/processing directory? Have you already decided on a procedure?

Usually, both the code in the "importing" hierarchy and the code below the "processing" hierarchy can be merged into one module (file, script) for the whole dataset. If the subdivision into importing and processing code really is essential for a particular dataset, for example ds1, one could recreate the distinction below the "datasets" hierarchy via e.g. datasets/ds1/importing.py and datasets/ds1/processing.py.
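
To make the suggestion concrete, here is a minimal sketch of what such a merged module under the "datasets" hierarchy might look like. It assumes a `Dataset` base class along the lines of `egon.data.datasets.Dataset`; the dataset name, the task functions and the exact constructor arguments are illustrative placeholders, not taken from this thread:

```python
# data/datasets/ds1.py -- illustrative sketch, names are placeholders
from egon.data.datasets import Dataset  # assumed import path


def download():
    """Fetch the raw data (formerly the code under data/importing/)."""
    ...


def process():
    """Create the target tables from the raw data (formerly data/processing/)."""
    ...


class Ds1(Dataset):
    def __init__(self, dependencies):
        super().__init__(
            name="ds1",
            # Bumping the version is what triggers a re-run of the tasks
            # in a versioned workflow.
            version="0.0.1",
            dependencies=dependencies,
            tasks=(download, process),
        )
```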

IlkaCu commented Jul 14, 2021

As discussed in our last eGon-data telco, it is highly recommended to remove script-internal checks that skip parts of the code (because files, tables etc. already exist) as soon as a dataset gets refactored and included in the workflow-wide versioning.
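
For clarity, the kind of guard meant here is, as I understand it, something like the check below (the schema, table and function names are made up for illustration, and `egon.data.db` providing the engine is an assumption). Once the dataset participates in workflow-wide versioning, the version number decides whether tasks re-run, so such a short-circuit is redundant and can silently keep stale tables around:

```python
# Pattern to remove when refactoring a dataset to use versioning
# (schema/table/function names are illustrative only).
from sqlalchemy import inspect

from egon.data import db  # assumed helper module providing the engine


def create_tables():
    engine = db.engine()
    # Script-internal existence check: skip the work if the table is already
    # there. With dataset versioning this guard becomes redundant and may
    # hide outdated data, so it should be dropped.
    if inspect(engine).has_table("ds1_table", schema="demand"):
        return
    ...  # create and fill the table
```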

nesnoj commented Jul 21, 2021

Solve #341 as soon as this is done (or even better: consider it for each dataset during migration to avoid double work ;).

nesnoj commented Dec 21, 2021

With Zensus you migrated the last dataset, right, @IlkaCu? If so, please close.

IlkaCu commented Dec 22, 2021

> With Zensus you migrated the last dataset, right, @IlkaCu? If so, please close.

Right!

IlkaCu closed this as completed Dec 22, 2021