Prioritize migrating datasets to use versioning #277

Closed
gnn opened this issue Jun 9, 2021 · 6 comments

gnn commented Jun 9, 2021

Refactoring existing datasets to use versioning in order to speed up DAG runs takes a lot of time and varies greatly in implementation complexity. To better manage the time spent on migration tasks, we'll use a table collecting the datasets and information about them, so that we can prioritize migrating them:

| Dataset | Running Time | Connectedness | Migration Complexity | Issue | Status | Priority |
| --- | --- | --- | --- | --- | --- | --- |
| OpenStreetMap | 5 | 4 | 1 | | Usable | 3 |
| Zensus | 4 | | | | In Progress | 3 |
| Database Setup | 1 | 4 (No upstream task, but everything depends on it.) | 1 | | Done | 1 |
| VG250 | 1 (seconds) | 4 (many datasets depend on it) | 1 | #283 | Done | 5 |
| substation | 2 (minutes) | 3 (osmTGmod depends on it) | 1 | #304 | Done | 4 |
| osmTGmod | 5 (up to 10 hours for Germany) | 3 (needs VG250 and substation, other tasks depend on it) | 3 (tool integration) | | Done | 4 |
| scenario_parameters | 1 (seconds) | 4 (central datasets depend on it) | 1 (no upstream tasks, only code) | #309 | Done | 3 |
| demandregio | 2 (~10 min) | 3 | 4 (tool integration, un-versioned data from ffe) | | Done | 2 |
| mv grid district | | | | #297 | Done | |
| potential areas | | | | #297 | Done | |
| retrieve MaStR data | | | | #297 | Done | |
| egon_peta_heat | 1 | 2 | 1 | #302 | Done | 1 |
| district_heating_areas | 1 | 2 | 1 | #303 | Done | 1 |
| power plants | | | | #314 | Done | |
| weather data, RE feedin | | | | #314 | Done | |
| dynamic line rating | | | | | | |

You can insert information into the table by editing this post, or by putting it into the comments. I'll regularly scan the comments and insert any missing information into the table.
The numbers range from one to five, with one meaning "Low" (i.e. "Fast" for "Running Time"), three meaning "Medium" and five meaning "High" ("Slow"). Feel free to use fractional numbers if five levels aren't enough for your categorization needs. Also, if you have actual numbers, e.g. time measurements for the running time or the number of upstream and downstream tasks for connectedness, feel free to insert them as parenthesized comments.

nailend commented Jun 25, 2021

Working on PR #307 I was wondering whether I should adapt the folder structure of the files concerned. For example:
moving data/importing/mastr.py into data/datasets/mastr/__init__.py

For the mastr data and re_potential_data, I don't see any arguments against it, as they are basically only imported anyway. This would also lead to unambiguous task IDs in the Airflow graph view!

But what about the scripts currently in the data/processing directory? Have you already decided on a procedure?

@gnn @ClaraBuettner @IlkaCu @ulfmueller

I would appreciate your opinion on that.

gnn commented Jun 30, 2021

> Working on PR #307 I was wondering whether I should adapt the folder structure of the files concerned. For example:
> moving data/importing/mastr.py into data/datasets/mastr/__init__.py

IMHO definitely yes. The move to Datasets was also the first step in getting rid of the "importing"/"processing" divide. Concerning your example, I'd also like to point out that data/importing/mastr.py should move to data/datasets/mastr.py unless the "mastr" directory is necessary in order to hold additional files.

> For the mastr data and re_potential_data, I don't see any arguments against it, as they are basically only imported anyway. This would also lead to unambiguous task IDs in the Airflow graph view!

> But what about the scripts currently in the data/processing directory? Have you already decided on a procedure?

Usually, both the code in the "importing" hierarchy and the code below the "processing" hierarchy can be merged into one module (file, script) for the whole dataset. If the subdivision into importing and processing code really is essential for a particular dataset, for example ds1, one could recreate the distinction below the "datasets" hierarchy via e.g. datasets/ds1/importing.py and datasets/ds1/processing.py.
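
To make the suggestion concrete, here is a minimal sketch of what such a merged module under the "datasets" hierarchy might look like. It assumes a `Dataset` base class along the lines of `egon.data.datasets.Dataset`; the dataset name, the task functions and the exact constructor arguments are illustrative placeholders, not taken from this thread:

```python
# data/datasets/ds1.py -- illustrative sketch, names are placeholders
from egon.data.datasets import Dataset  # assumed import path


def download():
    """Fetch the raw data (formerly the code under data/importing/)."""
    ...


def process():
    """Create the target tables from the raw data (formerly data/processing/)."""
    ...


class Ds1(Dataset):
    def __init__(self, dependencies):
        super().__init__(
            name="ds1",
            # Bumping the version is what triggers a re-run of the tasks
            # in a versioned workflow.
            version="0.0.1",
            dependencies=dependencies,
            tasks=(download, process),
        )
```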

IlkaCu commented Jul 14, 2021

As discussed in our last eGon-data telco, it is highly recommended to remove script-internal checks that skip parts of the code (because files, tables etc. already exist) as soon as a dataset gets refactored and included in the workflow-wide versioning.
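
For clarity, the kind of guard meant here is, as I understand it, something like the check below (the schema, table and function names are made up for illustration, and `egon.data.db` providing the engine is an assumption). Once the dataset participates in workflow-wide versioning, the version number decides whether tasks re-run, so such a short-circuit is redundant and can silently keep stale tables around:

```python
# Pattern to remove when refactoring a dataset to use versioning
# (schema/table/function names are illustrative only).
from sqlalchemy import inspect

from egon.data import db  # assumed helper module providing the engine


def create_tables():
    engine = db.engine()
    # Script-internal existence check: skip the work if the table is already
    # there. With dataset versioning this guard becomes redundant and may
    # hide outdated data, so it should be dropped.
    if inspect(engine).has_table("ds1_table", schema="demand"):
        return
    ...  # create and fill the table
```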

nesnoj commented Jul 21, 2021

Solve #341 as soon as this is done (or even better: consider it for each dataset during migration to avoid double work ;).

nesnoj commented Dec 21, 2021

With Zensus you migrated the last dataset, right, @IlkaCu? If so, please close.

IlkaCu commented Dec 22, 2021

> With Zensus you migrated the last dataset, right, @IlkaCu? If so, please close.

Right!

IlkaCu closed this as completed Dec 22, 2021