Prioritize migrating datasets to use versioning #277
Comments
While working on PR #307, I was wondering whether I should adapt the folder structure of the affected files. For example: For the case of mastr data and re_potential_data, I don't see any arguments against it, as they are basically only imported anyway. This would also lead to unambiguous task IDs in the Airflow graph view! But what about scripts which currently are in the @gnn @ClaraBuettner @IlkaCu @ulfmueller I would appreciate your opinion on that.
IMHO definitely yes. The move to
Usually, both the code in the "importing" hierarchy and the code below the "processing" hierarchy can be merged into one module (file, script) for the whole dataset. If the subdivision into importing and processing code really is essential for a particular dataset, for example
As discussed in our last eGon-data telco, it is highly recommended to remove script-internal checks that skip parts of the code (because files, tables etc. already exist) as soon as the dataset gets refactored and included in the workflow-wide versioning.
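To illustrate why such script-internal checks become redundant, here is a minimal sketch of version-gated execution. It is not the eGon-data implementation: `completed_versions`, `run_dataset` and the dataset name are hypothetical, and an in-memory dict stands in for wherever completed versions would actually be recorded.

```python
from typing import Callable, Dict

# Hypothetical stand-in for a persistent record of completed dataset versions.
completed_versions: Dict[str, str] = {}


def run_dataset(name: str, version: str, task: Callable[[], None]) -> bool:
    """Run `task` unless this exact version of the dataset already completed.

    Returns True if the task ran and False if it was skipped.
    """
    if completed_versions.get(name) == version:
        return False  # Skipping is decided here, once, for the whole dataset.
    task()
    completed_versions[name] = version
    return True


calls = []


def import_data() -> None:
    # Note: no "does the file/table already exist?" check inside the task.
    calls.append("imported")


ran_first = run_dataset("example_dataset", "0.0.0", import_data)
ran_again = run_dataset("example_dataset", "0.0.0", import_data)
ran_bumped = run_dataset("example_dataset", "0.0.1", import_data)
```

With the skip decision made in one place, bumping the version string is the only thing needed to force a rerun, and a leftover internal check could otherwise silently prevent that rerun.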
Solve #341 as soon as this is done (or even better: consider it for each dataset to avoid double work ;).
With Zensus you migrated the last dataset, right, @IlkaCu? If so, please close.
Right!
Refactoring existing datasets to use versioning in order to speed up DAG runs takes a lot of time and varies greatly in implementation complexity. To better manage the time spent on migration tasks, we'll use a table that collects the datasets and information about them, so that we can prioritize migrating them:
You can insert information into the table by editing this post, or by putting information into the comments. I'll regularly scan the comments and insert any missing information into the tables.
The numbers range from one to five, with one meaning "Low" (i.e. "Fast" for "Running Time"), three meaning "Medium" and five meaning "High" ("Slow"). Feel free to use fractional numbers if five levels aren't enough for your categorization needs. Also, if you have actual numbers, e.g. time measurements for the running time or the number of upstream and downstream tasks for connectedness, feel free to insert them as parenthesized comments.
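As a rough illustration of how such ratings could be turned into a migration order, here is a small sketch. Everything in it is an assumption: the column set (running time, connectedness, implementation complexity), the scoring formula, and the dataset names and values are made up and not taken from the table.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DatasetRating:
    name: str
    running_time: float   # 1 = fast, 5 = slow
    connectedness: float  # 1 = few up-/downstream tasks, 5 = many
    complexity: float     # 1 = easy to migrate, 5 = hard to migrate


def priority(rating: DatasetRating) -> float:
    """Assumed heuristic: expected benefit divided by migration effort."""
    return (rating.running_time + rating.connectedness) / rating.complexity


# Made-up example rows; fractional ratings are allowed.
ratings: List[DatasetRating] = [
    DatasetRating("dataset_a", running_time=5, connectedness=3, complexity=1),
    DatasetRating("dataset_b", running_time=1, connectedness=1, complexity=4.5),
    DatasetRating("dataset_c", running_time=3, connectedness=4, complexity=3),
]

# Highest priority first.
ranked_names = [r.name for r in sorted(ratings, key=priority, reverse=True)]
```

Under this heuristic, slow and highly connected datasets that are cheap to migrate come first; a different weighting may suit the project better.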