Option for running workflow in test mode #112

Closed
8 tasks done
gplssm opened this issue Feb 10, 2021 · 14 comments · Fixed by #118 or #159
Labels: 🚀 feature New feature or feature request · 🧪 testing Tests and CI

gplssm (Contributor) commented Feb 10, 2021

Running the workflow with the dataset for all of Germany takes too long for effective testing, so a test mode that uses a subset of this data (e.g. a single federal state) is needed. Activating this test mode instructs the workflow to run only on this subset of the data.

Tasks

  • Implement a switch for test mode that can be toggled from the UI and CLI
  • Download and import the respective subset of OSM data
  • Download and import the respective subset of VG250 data (for federal states and below)
  • Select subset from DemandRegio
  • Download and import the respective subset of Zensus data
  • Add to docs: authors of new datasets are responsible for providing a cutting routine for test mode
  • Update README accordingly https://github.com/openego/eGon-data/blob/e53c7c292ef39dc81133321f5679527f23fcd81e/README.rst#test-mode
  • Add a note in CHANGELOG
gplssm added the 🚀 feature and 🧪 testing labels on Feb 10, 2021
gplssm (Contributor, Author) commented Feb 10, 2021

The switch for choosing the test mode can best be implemented using parameters passed to the DAG, as described on StackOverflow; a sketch is below.
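
For illustration, a minimal sketch of that approach, assuming Airflow 1.10.x as used at the time; the DAG id, task id, and the "testmode" key are made up, not taken from eGon-data:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG("testmode_example", start_date=datetime(2021, 2, 1), schedule_interval=None)


def run_task(**context):
    # Read a flag passed at trigger time, e.g.:
    #   airflow trigger_dag testmode_example -c '{"testmode": true}'
    conf = context["dag_run"].conf or {}
    testmode = conf.get("testmode", False)
    print(f"Running in test mode: {testmode}")


task = PythonOperator(
    task_id="example_task",
    python_callable=run_task,
    provide_context=True,  # Airflow 1.x: hand the context (incl. dag_run) to the callable
    dag=dag,
)
```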

gplssm (Contributor, Author) commented Feb 12, 2021

In the last eGon-data web conference we decided to use a separate database for running the workflow on test data. This decision was prepared by a discussion between @IlkaCu and @gnn.
As this issue is intended to provide a solution for using the test mode, it overlaps heavily with this decision/discussion.

@ClaraBuettner and @gplssm decided to wait here until we have moved forward with setting up this separate database with test data.
@IlkaCu can we maybe help in setting up this second database?

gplssm (Contributor, Author) commented Feb 22, 2021

Since we decided today that choosing a database and choosing the test mode are linked, default_args seemed like a good way to go to me. But then I read that kwargs in default_args are kind of deprecated and won't be supported in the future.

I would like to avoid using parameters as proposed in #112 (comment). As I explained above, we already chose to enable the test mode via a CLI flag, so we don't need to state this twice by additionally passing Airflow parameters.

I simply suggest introducing an additional argument for each function that needs to know whether a run was triggered in test mode. It might make sense to use a global variable for that in pipeline.py; see the sketch below.
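
A minimal sketch of that suggestion; the function name, URLs, and the TESTMODE global are illustrative, not the actual eGon-data code:

```python
# pipeline.py (sketch)
TESTMODE = False  # set once, e.g. from the CLI flag, before the DAG is built


def import_vg250(testmode=None):
    """Download and import VG250 data, cut to a subset in test mode."""
    if testmode is None:
        testmode = TESTMODE  # fall back to the global switch
    url = (
        "https://example.org/vg250_schleswig_holstein.zip"  # hypothetical subset URL
        if testmode
        else "https://example.org/vg250_germany.zip"  # hypothetical full URL
    )
    print(f"Importing from {url}")
```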

ClaraBuettner (Contributor) commented

I think that's fine. We need to update every function to use the test case, so adding an argument doesn't seem to be much extra effort.

And I wrote some lines that create a subset of vg250 which I would like to push. Can I simply create a branch for this issue which we use to implement the test case for all datasets?

gplssm (Contributor, Author) commented Feb 23, 2021

I think that's fine. We need to update every function to use the test case, so adding an argument doesn't seem to be much extra effort.

After my last comment I thought about making use of the new, decentralized dataset.ymls. We could specify a url_testmode in addition to the normal url; I sketched the idea below. But I need to know some more details about the decentralized dataset.yml.
Meanwhile, I'll use a temporary approach, which I'll show by pushing it.
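
A minimal sketch of that url_testmode idea, assuming a YAML file with both keys; the file layout, key names, and URLs are just my assumption:

```python
# Assumed dataset.yml contents:
#   vg250:
#     url: https://example.org/vg250_germany.zip
#     url_testmode: https://example.org/vg250_schleswig_holstein.zip

import yaml  # provided by the PyYAML package


def source_url(dataset, testmode=False, path="dataset.yml"):
    """Return the download URL for a dataset, honouring test mode."""
    with open(path) as f:
        config = yaml.safe_load(f)
    key = "url_testmode" if testmode else "url"
    return config[dataset][key]


# Usage: source_url("vg250", testmode=True)
```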

And I wrote some lines that create a subset of vg250 which I would like to push. Can I simply create a branch for this issue which we use to implement the test case for all datasets?

Please feel free to create a branch and push your code 🚀

ClaraBuettner (Contributor) commented

Here is the branch with my commits: https://github.com/openego/eGon-data/tree/features/%23112-option-testmode
The switch for the test case is currently an argument of the 'to_postgis' function. I wasn't able to set it in pipeline.py; that resulted in many errors when running the DAG.
On Monday we discussed the decentralization of datasets.yml; I think it will be possible to add another entry called url_testmode.

gplssm (Contributor, Author) commented Feb 23, 2021

The switch for the test case is currently an argument of the 'to_postgis' function. I wasn't able to set it in pipeline.py; that resulted in many errors when running the DAG.

You mean passing the parameter testcase to the to_postgres() function when it is called while running the DAG in pipeline.py?
Then use op_args of https://airflow.apache.org/docs/apache-airflow/1.10.14/_api/airflow/operators/python_operator/index.html#airflow.operators.python_operator.PythonOperator, for example as sketched below.
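
A minimal sketch of the op_args route, with a stand-in callable and DAG rather than the actual eGon-data ones:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def import_data(testcase):
    """Stand-in for the import function discussed above."""
    print(f"Importing data (test case: {testcase})")


dag = DAG("op_args_example", start_date=datetime(2021, 2, 1), schedule_interval=None)

import_task = PythonOperator(
    task_id="import_data",
    python_callable=import_data,
    op_args=[True],  # forwards the test-case flag positionally to the callable
    dag=dag,
)
```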

On Monday we discussed the decentralization of datasets.yml; I think it will be possible to add another entry called url_testmode.

👍

gplssm (Contributor, Author) commented Feb 23, 2021

We should discuss conceptually whether we want testmode to be a bool. At first I definitely thought so. Now I'm considering a smaller variant of the SH test case for CI, and it would be nice to use the same code for it as here...

gnn (Collaborator) commented Feb 23, 2021

I'm more than a bit against sprinkling such an option everywhere because it looks like noise instead of useful information. As I said on Monday, this rather looks like something which should be placed in a configuration file, or in a table storing configuration information in the database. Maybe we can talk about this tomorrow.
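
For illustration, a minimal sketch of the database-backed variant; the table name, columns, and connection string are made up, not an existing eGon-data schema:

```python
import sqlalchemy as sa

engine = sa.create_engine("postgresql://user:password@localhost/egon")  # placeholder DSN


def get_setting(key):
    """Read a single configuration value from a hypothetical settings table."""
    with engine.connect() as connection:
        row = connection.execute(
            sa.text("SELECT value FROM egon_settings WHERE key = :key"),
            {"key": key},
        ).fetchone()
    return row[0] if row else None


# Usage: tasks query the switch instead of receiving it as an argument.
testmode = get_setting("testmode") == "true"
```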

ClaraBuettner (Contributor) commented

The script of @gnn which cuts the zensus data is integrated. I decided not to delete the reduced files, to save the computation time of re-cutting on subsequent runs. If you don't like this, I can also change it again. The overall execution time (including cutting the zensus data) is ~25 min.

I also replaced the testmode variable with a string variable called dataset. To run the workflow on the whole dataset, it has to be set to dataset = 'main'. I can easily change this name; I just didn't have a better idea.
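
A minimal sketch of how such a string-valued switch could be used; the non-'main' values and the filter column are assumptions:

```python
DATASET = "main"  # full run for Germany; e.g. "SH" could select a test subset


def boundary_filter(dataset=None):
    """Return an SQL WHERE clause restricting the run to the chosen subset."""
    if dataset is None:
        dataset = DATASET
    if dataset == "main":
        return None  # no filter: process all of Germany
    return f"federal_state = '{dataset}'"  # e.g. restrict to Schleswig-Holstein
```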

ClaraBuettner (Contributor) commented

The NEP input data is now also available in test mode. I decided to scale district heating capacities by population share, because the demand is linked more to population than to area.
I also tested it and it worked. Without cutting the zensus data, the overall computation time is ~13 min.
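
For illustration, the population-share scaling boils down to something like this; all numbers are made up:

```python
population_subset = 2_900_000    # e.g. inhabitants of the test region
population_germany = 83_200_000  # total population of Germany
capacity_germany_mw = 1_000.0    # hypothetical national district heating capacity

share = population_subset / population_germany
capacity_subset_mw = capacity_germany_mw * share
print(f"Scaled capacity: {capacity_subset_mw:.1f} MW")  # -> 34.9 MW
```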

gplssm (Contributor, Author) commented Mar 5, 2021

I'll double-check that it runs successfully on my side. Afterwards, I'll do the remaining open tasks above and create another issue for the second-database stuff that @gnn is working on.

gplssm (Contributor, Author) commented Mar 5, 2021

The test mode, including all cutting, took 29 minutes for me. But I think I already had most of the data on my disk, so the download tasks were mostly skipped.

gplssm added a commit that referenced this issue Mar 5, 2021
gplssm (Contributor, Author) commented Mar 5, 2021

I'm gonna reopen this for @gnn...
