
Task run_osmtgmod fails #258

Closed
gplssm opened this issue May 25, 2021 · 31 comments · Fixed by #263 or #338

Labels: 🐛 bug Something isn't working

Comments

gplssm (Contributor) commented May 25, 2021:

The task run_osmtgmod fails, but only with --dataset-boundary=Everything.

[2021-05-21 04:56:22,636] {taskinstance.py:670} INFO - Dependencies all met for <TaskInstance: egon-data-processing-pipeline.run_osmtgmod 2021-05-20T10:52:08.299196+00:00 [queued]>
[2021-05-21 04:56:22,726] {taskinstance.py:670} INFO - Dependencies all met for <TaskInstance: egon-data-processing-pipeline.run_osmtgmod 2021-05-20T10:52:08.299196+00:00 [queued]>
[2021-05-21 04:56:22,726] {taskinstance.py:880} INFO - 
--------------------------------------------------------------------------------
[2021-05-21 04:56:22,726] {taskinstance.py:881} INFO - Starting attempt 1 of 1
[2021-05-21 04:56:22,726] {taskinstance.py:882} INFO - 
--------------------------------------------------------------------------------
[2021-05-21 04:56:22,767] {taskinstance.py:901} INFO - Executing <Task(PythonOperator): run_osmtgmod> on 2021-05-20T10:52:08.299196+00:00
[2021-05-21 04:56:22,778] {standard_task_runner.py:54} INFO - Started process 12373 to run task
[2021-05-21 04:56:22,847] {standard_task_runner.py:77} INFO - Running: ['airflow', 'run', 'egon-data-processing-pipeline', 'run_osmtgmod', '2021-05-20T10:52:08.299196+00:00', '--job_id', '195', '--pool', 'default_pool', '--raw', '-sd', 'DAGS_FOLDER/dags/pipeline.py', '--cfg_path', '/tmp/tmpsjuo6545']
[2021-05-21 04:56:22,848] {standard_task_runner.py:78} INFO - Job 195: Subtask run_osmtgmod
[2021-05-21 04:56:22,956] {logging_mixin.py:120} INFO - Running <TaskInstance: egon-data-processing-pipeline.run_osmtgmod 2021-05-20T10:52:08.299196+00:00 [running]> on host srvsim01.rl-institut.local
[2021-05-21 04:56:23,099] {logging_mixin.py:120} INFO - Manually updating geometry of substation in Garenfeld
[2021-05-21 04:56:23,208] {taskinstance.py:1150} ERROR - (psycopg2.errors.InvalidParameterValue) Geometry SRID (0) does not match column SRID (4326)

[SQL: 
            UPDATE power_ways
            SET way = '0102000000160000001612D5004A081E4020A8644A35B349407B0ACA'
            '7E27071E405F23EE563BB34940287CB60E0E061E4055A4C2D842B34940352FE29'
            '6EA051E4017940E7B46B34940C0D02346CF051E4042EBE1CB44B34940D67E219A'
            '2F051E40FECF06054AB349407F964A442F031E40C2F441F471B34940A8A544676'
            '1021E40AB9412CA8FB349409C4848881E021E40B7BA08C691B34940B22D4E1430'
            '001E40CE913856BDB34940E2810B122C001E40898CAEAFDBB349402CDAF043480'
            '11E40ED678C32F0B349402FE640E25C041E405A86F21AF1B3494061D525C46F04'
            '1E40ABEF60C892B34940DC2F9FAC18061E400D33D9E495B349401FD7868A71061'
            'E40D2D8A89894B3494083932353F4061E40077360DE88B34940624ED02687071E'
            '404F08782D7CB349405000C5C892091E403EFBDBAF4CB349403DDBFEF04E091E4'
            '0658D7A8846B349405AD5928E72081E405BE8EF4A37B349401612D5004A081E40'
            '20A8644A35B34940'
            WHERE name = 'Garenfeld'
            AND id = 24667346
            ]
(Background on this error at: http://sqlalche.me/e/13/9h9h)
Traceback (most recent call last):
  File "/home/local/RL-INSTITUT/guido.plessmann/miniconda3/envs/eGon-data/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1276, in _execute_context
    self.dialect.do_execute(
  File "/home/local/RL-INSTITUT/guido.plessmann/miniconda3/envs/eGon-data/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 608, in do_execute
    cursor.execute(statement, parameters)
psycopg2.errors.InvalidParameterValue: Geometry SRID (0) does not match column SRID (4326)


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/local/RL-INSTITUT/guido.plessmann/miniconda3/envs/eGon-data/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 984, in _run_raw_task
    result = task_copy.execute(context=context)
  File "/home/local/RL-INSTITUT/guido.plessmann/miniconda3/envs/eGon-data/lib/python3.8/site-packages/airflow/operators/python_operator.py", line 113, in execute
    return_value = self.execute_callable()
  File "/home/local/RL-INSTITUT/guido.plessmann/miniconda3/envs/eGon-data/lib/python3.8/site-packages/airflow/operators/python_operator.py", line 118, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
  File "/home/local/RL-INSTITUT/guido.plessmann/git-repos/eGon-data/src/egon/data/processing/osmtgmod/__init__.py", line 34, in run_osmtgmod
    osmtgmod(
  File "/home/local/RL-INSTITUT/guido.plessmann/git-repos/eGon-data/src/egon/data/processing/osmtgmod/__init__.py", line 197, in osmtgmod
    db.execute_sql(
  File "/home/local/RL-INSTITUT/guido.plessmann/git-repos/eGon-data/src/egon/data/db.py", line 62, in execute_sql
    con.execute(text(sql_string))
  File "/home/local/RL-INSTITUT/guido.plessmann/miniconda3/envs/eGon-data/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1011, in execute
    return meth(self, multiparams, params)
  File "/home/local/RL-INSTITUT/guido.plessmann/miniconda3/envs/eGon-data/lib/python3.8/site-packages/sqlalchemy/sql/elements.py", line 298, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/home/local/RL-INSTITUT/guido.plessmann/miniconda3/envs/eGon-data/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1124, in _execute_clauseelement
    ret = self._execute_context(
  File "/home/local/RL-INSTITUT/guido.plessmann/miniconda3/envs/eGon-data/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1316, in _execute_context
    self._handle_dbapi_exception(
  File "/home/local/RL-INSTITUT/guido.plessmann/miniconda3/envs/eGon-data/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1510, in _handle_dbapi_exception
    util.raise_(
  File "/home/local/RL-INSTITUT/guido.plessmann/miniconda3/envs/eGon-data/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 182, in raise_
    raise exception
  File "/home/local/RL-INSTITUT/guido.plessmann/miniconda3/envs/eGon-data/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1276, in _execute_context
    self.dialect.do_execute(
  File "/home/local/RL-INSTITUT/guido.plessmann/miniconda3/envs/eGon-data/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 608, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.DataError: (psycopg2.errors.InvalidParameterValue) Geometry SRID (0) does not match column SRID (4326)

[SQL: 
            UPDATE power_ways
            SET way = '0102000000160000001612D5004A081E4020A8644A35B349407B0ACA'
            '7E27071E405F23EE563BB34940287CB60E0E061E4055A4C2D842B34940352FE29'
            '6EA051E4017940E7B46B34940C0D02346CF051E4042EBE1CB44B34940D67E219A'
            '2F051E40FECF06054AB349407F964A442F031E40C2F441F471B34940A8A544676'
            '1021E40AB9412CA8FB349409C4848881E021E40B7BA08C691B34940B22D4E1430'
            '001E40CE913856BDB34940E2810B122C001E40898CAEAFDBB349402CDAF043480'
            '11E40ED678C32F0B349402FE640E25C041E405A86F21AF1B3494061D525C46F04'
            '1E40ABEF60C892B34940DC2F9FAC18061E400D33D9E495B349401FD7868A71061'
            'E40D2D8A89894B3494083932353F4061E40077360DE88B34940624ED02687071E'
            '404F08782D7CB349405000C5C892091E403EFBDBAF4CB349403DDBFEF04E091E4'
            '0658D7A8846B349405AD5928E72081E405BE8EF4A37B349401612D5004A081E40'
            '20A8644A35B34940'
            WHERE name = 'Garenfeld'
            AND id = 24667346
            ]
(Background on this error at: http://sqlalche.me/e/13/9h9h)
[2021-05-21 04:56:23,233] {taskinstance.py:1187} INFO - Marking task as FAILED. dag_id=egon-data-processing-pipeline, task_id=run_osmtgmod, execution_date=20210520T105208, start_date=20210521T045622, end_date=20210521T045623
[2021-05-21 04:56:27,540] {local_task_job.py:102} INFO - Task exited with return code 1
gplssm added the 🐛 bug label on May 25, 2021
gplssm (Contributor, Author) commented May 25, 2021:

Update: please don't act on this for now, because it might be related to a "No space left on device" error.

gplssm (Contributor, Author) commented May 26, 2021:

I checked again and I'm quite sure that this osmtgmod error is independent of insufficient disk space.

When reading this part of the error message
sqlalchemy.exc.DataError: (psycopg2.errors.InvalidParameterValue) Geometry SRID (0) does not match column SRID (4326)
it seems more like a data-related bug.
@ulfmueller, could you help here?

gplssm (Contributor, Author) commented May 26, 2021:

@ClaraBuettner will add a manual setting of the SRID for this geometry.

Additionally, we are trying to understand why the workflow ran successfully on the Flensburg server but not on the RLI server by comparing Python package versions.
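
For reference, a minimal sketch of the kind of fix applied in #263 (not necessarily the exact patch): wrap the manually provided WKB literal in ST_SetSRID so it matches the column's SRID 4326, issued via the project's db.execute_sql helper. The WKB hex literal is abbreviated here.

# Hedged sketch, assuming the fix is applied via egon.data.db.execute_sql;
# the WKB hex literal is abbreviated.
from egon.data import db

db.execute_sql(
    """
    UPDATE power_ways
    SET way = ST_SetSRID(
        '0102000000160000001612D5004A081E40...'::geometry, 4326
    )
    WHERE name = 'Garenfeld'
      AND id = 24667346;
    """
)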

ClaraBuettner (Contributor) commented:

The following versions of packages which might be related to this are installed on the server in Flensburg:
SQLAlchemy 1.3.24
SQLAlchemy-JSONField 0.9.0
SQLAlchemy-Utils 0.36.8
psycopg2 2.8.6
psycopg2-binary 2.8.6
GeoAlchemy2 0.6.3

gplssm (Contributor, Author) commented May 26, 2021:

These are our package versions

SQLAlchemy==1.3.24
SQLAlchemy-JSONField==0.9.0
SQLAlchemy-Utils==0.37.3
psycopg2==2.8.6
psycopg2-binary==2.8.6
GeoAlchemy2==0.6.3

The only difference is the version of SQLAlchemy-Utils.

gplssm (Contributor, Author) commented Jun 1, 2021:

Thanks to the fix in #263, which sets the SRID manually, this error does not occur anymore.
However, the task run_osmtgmod still fails. See the relevant part of the log:

[2021-05-27 12:02:53,369] {logging_mixin.py:120} WARNING - 05/27/2021 12:02:53 Executing SQL statement 70:
SELECT otg_unknown_value_heuristic ()
[2021-05-27 12:03:39,856] {__init__.py:474} ERROR - Exception raised with command 70. Check data and code and restart with 'python ego_otg.py egon-data-germany 70'.
Traceback (most recent call last):
  File "/home/local/RL-INSTITUT/guido.plessmann/git-repos/eGon-data/src/egon/data/processing/osmtgmod/__init__.py", line 471, in osmtgmod
    cur.execute(command)
psycopg2.errors.OutOfMemory: out of shared memory
HINT:  You might need to increase max_locks_per_transaction.
CONTEXT:  SQL statement "DROP TABLE IF EXISTS all_neighbours"
PL/pgSQL function otg_neighbour_heuristic() line 114 at SQL statement
SQL statement "SELECT otg_neighbour_heuristic ()"
PL/pgSQL function otg_unknown_value_heuristic() line 13 at PERFORM

[2021-05-27 12:03:39,859] {logging_mixin.py:120} WARNING - 05/27/2021 12:03:39 Exception raised with command 70. Check data and code and restart with 'python ego_otg.py egon-data-germany 70'.
Traceback (most recent call last):
  File "/home/local/RL-INSTITUT/guido.plessmann/git-repos/eGon-data/src/egon/data/processing/osmtgmod/__init__.py", line 471, in osmtgmod
    cur.execute(command)
psycopg2.errors.OutOfMemory: out of shared memory
HINT:  You might need to increase max_locks_per_transaction.
CONTEXT:  SQL statement "DROP TABLE IF EXISTS all_neighbours"
PL/pgSQL function otg_neighbour_heuristic() line 114 at SQL statement
SQL statement "SELECT otg_neighbour_heuristic ()"
PL/pgSQL function otg_unknown_value_heuristic() line 13 at PERFORM
[2021-05-27 12:03:44,036] {local_task_job.py:102} INFO - Task exited with return code 1

I remember the psycopg2.errors.OutOfMemory: out of shared memory error from earlier, which was resolved by installing the Java requirements. I double-checked that Java is installed.

@ClaraBuettner @ulfmueller do you have an idea why the task still fails?
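
For reference, the HINT in the log above points at max_locks_per_transaction. A minimal sketch of raising it from Python (connection parameters are placeholders taken from elsewhere in this thread, the value 1024 is illustrative, and this setting only takes effect after a PostgreSQL restart):

# Hypothetical sketch: raise max_locks_per_transaction as suggested by the HINT.
# Connection parameters are placeholders; adjust them to your egon-data setup.
import psycopg2

conn = psycopg2.connect(
    host="127.0.0.1", port=59734, dbname="egon-data", user="egon", password="data"
)
conn.autocommit = True  # ALTER SYSTEM cannot run inside a transaction block
with conn.cursor() as cur:
    cur.execute("ALTER SYSTEM SET max_locks_per_transaction = 1024;")
conn.close()
# The new value is only applied after restarting the PostgreSQL server,
# e.g. by restarting the egon-data database container.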

gplssm (Contributor, Author) commented Jun 1, 2021:

The above error might be related to a too small shm_size. This caused some other tasks to fail. I'll increase the shm_size, retry it and give feedback later. Just out of curiosity: what shm_size do you use on the Flensburg server?

gplssm added a commit that referenced this issue Jun 1, 2021
Set SRID for maunally updated geometry
nesnoj (Member) commented Jun 1, 2021:

@gplssm If a different shm_size solves the problem, could you add a note to the docs' troubleshooting section?

ClaraBuettner (Contributor) commented:

I don't know why this task still fails; installing Java always solved this issue in my case.

> The above error might be related to a too small shm_size. This caused some other tasks to fail. I'll increase the shm_size, retry it and give feedback later. Just out of curiosity: what shm_size do you use on the Flensburg server?

I didn't change the shm_size on the server in Flensburg, so if I understand it correctly it has the default size of 64 MB of shared memory. I can double-check this if you tell me how you checked the shm_size.

gplssm (Contributor, Author) commented Jun 1, 2021:

> I don't know why this task still fails; installing Java always solved this issue in my case.

> > The above error might be related to a too small shm_size. This caused some other tasks to fail. I'll increase the shm_size, retry it and give feedback later. Just out of curiosity: what shm_size do you use on the Flensburg server?

> I didn't change the shm_size on the server in Flensburg, so if I understand it correctly it has the default size of 64 MB of shared memory. I can double-check this if you tell me how you checked the shm_size.

Would be great if you could check this. Please use docker exec -it egon-data-local-database-container /bin/bash to jump into a terminal inside the Docker container. Then run df -h | grep shm.

ClaraBuettner (Contributor) commented:

> Would be great if you could check this. Please use docker exec -it egon-data-local-database-container /bin/bash to jump into a terminal inside the Docker container. Then run df -h | grep shm.

Thank you! The output of df -h | grep shm is: shm 64M 16K 64M 1% /dev/shm
So the server in Flensburg uses only 64 MB of shared memory.

gplssm (Contributor, Author) commented Jun 1, 2021:

> > Would be great if you could check this. Please use docker exec -it egon-data-local-database-container /bin/bash to jump into a terminal inside the Docker container. Then run df -h | grep shm.

> Thank you! The output of df -h | grep shm is: shm 64M 16K 64M 1% /dev/shm
> So the server in Flensburg uses only 64 MB of shared memory.

Thanks!
That's what I expected, since you haven't changed the shm_size. I'm still wondering why the workflow executes successfully in Flensburg but not in Berlin. Seems like increasing the shm_size (as done in #268) won't help 😞.
Anyway, I already started an attempt with the increased shm_size. We'll see...

nesnoj (Member) commented Jun 10, 2021:

@gplssm's attempt to increase the shm_size did not work in the end :(.

I created everything from scratch on the same server and faced the very same problem after 1 day. Taking this to the extreme with rather ridiculous values, I increased the shm_size on the fly to 32 GB and set max_locks_per_transaction = 16384 in the container's /var/lib/postgresql/data/postgresql.conf. Now it has been running for 3.5 days, but without any notable progress. I expect it to fail the very same way, just a little later. 😞

Could the table-locking problem be related to concurrent processes? But there's no extra multiprocessing in the osmTGmod task, right @ClaraBuettner?

As we don't know how to tackle this further, I have a request for you @gnn: if you can spare some time, could you please set up eGon-data on one of your machines to see whether you run into the same transaction jungle?

Relevant part of the log (the last entry seems to have been running since startup):

[2021-06-06 21:18:50,900] {logging_mixin.py:120} WARNING - 06/06/2021 21:18:50 Executing SQL statement 66:
ALTER TABLE power_line ADD COLUMN all_neighbours BIGINT [4][2][10][2] 
DEFAULT '{{{{NULL,NULL}, {NULL,NULL}, {NULL,NULL}, {NULL,NULL}, {NULL,NULL}, {NULL,NULL}, {NULL,NULL}, {NULL,NULL}, {NULL,NULL}, {NULL,NULL}}, 
{{NULL,NULL}, {NULL,NULL}, {NULL,NULL}, {NULL,NULL}, {NULL,NULL}, {NULL,NULL}, {NULL,NULL}, {NULL,NULL}, {NULL,NULL}, {NULL,NULL}}},
{{{NULL,NULL}, {NULL,NULL}, {NULL,NULL}, {NULL,NULL}, {NULL,NULL}, {NULL,NULL}, {NULL,NULL}, {NULL,NULL}, {NULL,NULL}, {NULL,NULL}},
{{NULL,NULL}, {NULL,NULL}, {NULL,NULL}, {NULL,NULL}, {NULL,NULL}, {NULL,NULL}, {NULL,NULL}, {NULL,NULL}, {NULL,NULL}, {NULL,NULL}}},
{{{NULL,NULL}, {NULL,NULL}, {NULL,NULL}, {NULL,NULL}, {NULL,NULL}, {NULL,NULL}, {NULL,NULL}, {NULL,NULL}, {NULL,NULL}, {NULL,NULL}},
{{NULL,NULL}, {NULL,NULL}, {NULL,NULL}, {NULL,NULL}, {NULL,NULL}, {NULL,NULL}, {NULL,NULL}, {NULL,NULL}, {NULL,NULL}, {NULL,NULL}}},
{{{NULL,NULL}, {NULL,NULL}, {NULL,NULL}, {NULL,NULL}, {NULL,NULL}, {NULL,NULL}, {NULL,NULL}, {NULL,NULL}, {NULL,NULL}, {NULL,NULL}},
{{NULL,NULL}, {NULL,NULL}, {NULL,NULL}, {NULL,NULL}, {NULL,NULL}, {NULL,NULL}, {NULL,NULL}, {NULL,NULL}, {NULL,NULL}, {NULL,NULL}}}}'
[2021-06-06 21:18:50,922] {__init__.py:467} INFO - Executing SQL statement 67:
CREATE INDEX all_neighbour_idx ON power_line(all_neighbours)

[2021-06-06 21:18:50,923] {logging_mixin.py:120} WARNING - 06/06/2021 21:18:50 Executing SQL statement 67:
CREATE INDEX all_neighbour_idx ON power_line(all_neighbours)
[2021-06-06 21:18:51,006] {__init__.py:467} INFO - Executing SQL statement 68:
UPDATE power_line SET all_neighbours = otg_get_all_neighbours (id)

[2021-06-06 21:18:51,006] {logging_mixin.py:120} WARNING - 06/06/2021 21:18:51 Executing SQL statement 68:
UPDATE power_line SET all_neighbours = otg_get_all_neighbours (id)
[2021-06-06 21:18:51,017] {__init__.py:467} INFO - Executing SQL statement 69:
ALTER TABLE power_line ADD COLUMN cables_from_neighbour BOOLEAN

[2021-06-06 21:18:51,017] {logging_mixin.py:120} WARNING - 06/06/2021 21:18:51 Executing SQL statement 69:
ALTER TABLE power_line ADD COLUMN cables_from_neighbour BOOLEAN
[2021-06-06 21:18:51,036] {__init__.py:467} INFO - Executing SQL statement 70:
SELECT otg_unknown_value_heuristic ()

[2021-06-06 21:18:51,037] {logging_mixin.py:120} WARNING - 06/06/2021 21:18:51 Executing SQL statement 70:
SELECT otg_unknown_value_heuristic ()

ClaraBuettner (Contributor) commented:

> @gplssm's attempt to increase the shm_size did not work in the end :(.

> I created everything from scratch on the same server and faced the very same problem after 1 day. Taking this to the extreme with rather ridiculous values, I increased the shm_size on the fly to 32 GB and set max_locks_per_transaction = 16384 in the container's /var/lib/postgresql/data/postgresql.conf. Now it has been running for 3.5 days, but without any notable progress. I expect it to fail the very same way, just a little later. 😞

Just for information: in my case a run for Germany needs ~1.5 days (6 tasks in parallel, 1 process per task). osmTGmod finishes after ~6 hours.

> Could the table-locking problem be related to concurrent processes? But there's no extra multiprocessing in the osmTGmod task, right @ClaraBuettner?

As far as I know, multiprocessing is only used in the OSM import (not the one for osmtgmod) and in some pandas operations. In #269 you can also limit the number of parallel processes per task.
Did you try to run only the osmtgmod task to prevent any concurrent-process problems?

nesnoj (Member) commented Jun 10, 2021:

Thanks for your quick reply!
The osmtgmod task is the only one running, so no side effects are to be expected here.

nesnoj reopened this on Jun 10, 2021
nesnoj (Member) commented Jul 9, 2021:

Now this problem also occurs with the SH test region, not only for entire Germany. And on 3 different machines, including your server in Flensburg, with a fresh venv and the current dev branch.
I also tried on your server with the identical configuration of your venv, @ClaraBuettner - same result.

@ClaraBuettner @IlkaCu @ulfmueller @gnn
The gist of it all: the problem does not seem to be specific to our infrastructure, and I guess your instances are quite fragile - you pray that you'll never lose them! 😄
Seriously, maybe at some point a bug was introduced and the osmtgmod task had already been executed on your machines before that?!

@ClaraBuettner, do you dare to re-run this task in your DE version? Because I don't...

Log:

[2021-07-08 23:30:46,795] {logging_mixin.py:120} WARNING - 07/08/2021 23:30:46 Executing SQL statement 69:
ALTER TABLE power_line ADD COLUMN cables_from_neighbour BOOLEAN
[2021-07-08 23:30:46,800] {__init__.py:467} INFO - Executing SQL statement 70:
SELECT otg_unknown_value_heuristic ()

[2021-07-08 23:30:46,800] {logging_mixin.py:120} WARNING - 07/08/2021 23:30:46 Executing SQL statement 70:
SELECT otg_unknown_value_heuristic ()
[2021-07-08 23:31:07,904] {__init__.py:474} ERROR - Exception raised with command 70. Check data and code and restart with 'python ego_otg.py egon-data 70'.
Traceback (most recent call last):
  File "/home/nesnoj/git-repos/openego/eGon-data/src/egon/data/processing/osmtgmod/__init__.py", line 471, in osmtgmod
    cur.execute(command)
psycopg2.errors.OutOfMemory: out of shared memory
HINT:  You might need to increase max_locks_per_transaction.
CONTEXT:  SQL statement "CREATE TABLE all_neighbours (
	id BIGINT,
	cables INT,
	frequency INT)"
PL/pgSQL function otg_neighbour_heuristic() line 14 at SQL statement
SQL statement "SELECT otg_neighbour_heuristic ()"
PL/pgSQL function otg_unknown_value_heuristic() line 13 at PERFORM

[2021-07-08 23:31:07,905] {logging_mixin.py:120} WARNING - 07/08/2021 23:31:07 Exception raised with command 70. Check data and code and restart with 'python ego_otg.py egon-data 70'.
Traceback (most recent call last):
  File "/home/nesnoj/git-repos/openego/eGon-data/src/egon/data/processing/osmtgmod/__init__.py", line 471, in osmtgmod
    cur.execute(command)
psycopg2.errors.OutOfMemory: out of shared memory
HINT:  You might need to increase max_locks_per_transaction.
CONTEXT:  SQL statement "CREATE TABLE all_neighbours (
	id BIGINT,
	cables INT,
	frequency INT)"
PL/pgSQL function otg_neighbour_heuristic() line 14 at SQL statement
SQL statement "SELECT otg_neighbour_heuristic ()"
PL/pgSQL function otg_unknown_value_heuristic() line 13 at PERFORM
[2021-07-08 23:31:10,482] {local_task_job.py:102} INFO - Task exited with return code 1

PS: Again, increasing shm_size (to 1G) and max_locks_per_transaction (to 1024) did not solve the problem :(.

nesnoj changed the title from "Task run_osmtgmod fails on dataset Everything" to "Task run_osmtgmod fails" on Jul 9, 2021
ClaraBuettner (Contributor) commented:

> Now this problem also occurs with the SH test region, not only for entire Germany. And on 3 different machines, including your server in Flensburg, with a fresh venv and the current dev branch.
> I also tried on your server with the identical configuration of your venv, @ClaraBuettner - same result.

> @ClaraBuettner @IlkaCu @ulfmueller @gnn
> The gist of it all: the problem does not seem to be specific to our infrastructure, and I guess your instances are quite fragile - you pray that you'll never lose them! 😄
> Seriously, maybe at some point a bug was introduced and the osmtgmod task had already been executed on your machines before that?!

The task ran successfully about 10 days ago on the current dev branch. I double-checked that the task ran to completion.
Since osmtgmod is a separate repository, I also verified that the osmtgmod repository is up to date. So it doesn't look like recent changes caused this problem.

> @ClaraBuettner, do you dare to re-run this task in your DE version? Because I don't...

I don't want to re-run the task, but I can create a new database in test mode. And I will create a backup of the osmtgmod results for Germany.

nesnoj (Member) commented Jul 13, 2021:

> The task ran successfully about 10 days ago on the current dev branch. I double-checked that the task ran to completion.
> Since osmtgmod is a separate repository, I also verified that the osmtgmod repository is up to date. So it doesn't look like recent changes caused this problem.

Thanks for double-checking this!

> > @ClaraBuettner, do you dare to re-run this task in your DE version? Because I don't...

> I don't want to re-run the task, but I can create a new database in test mode. And I will create a backup of the osmtgmod results for Germany.

Thank you. I'd think that there's not too much to screw up during setup?!

virtualenv --python=python3.8 eGon-data
source eGon-data/bin/activate
pip install --upgrade pip
pip install oedialect==0.0.8
pip install -e /path/to/cloned/repo/
egon-data --dataset-boundary Schleswig-Holstein --database-port 59777 --airflow-port 8888 --jobs 12 --docker-container-name egon-data-local-database-container-jonathan-sh serve

I'm curious whether you chose a different way for your setup?

ClaraBuettner (Contributor) commented:

> Thank you. I'd think that there's not too much to screw up during setup?!
>
> virtualenv --python=python3.8 eGon-data
> source eGon-data/bin/activate
> pip install --upgrade pip
> pip install oedialect==0.0.8
> pip install -e /path/to/cloned/repo/
> egon-data --dataset-boundary Schleswig-Holstein --database-port 59777 --airflow-port 8888 --jobs 12 --docker-container-name egon-data-local-database-container-jonathan-sh serve
>
> I'm curious whether you chose a different way for your setup?

I chose only 6 parallel jobs, since some tasks also have internal multi-threading. I'm not sure whether that might have caused your problems.

nesnoj (Member) commented Jul 13, 2021:

Unfortunately not, I also tried with 2 jobs.

I'm curious whether your attempt will work out... oh, this suspense is killing me... :)

ClaraBuettner (Contributor) commented:

I created a new instance of egon-data (in test mode) on the server; now I also run into the error.
I cloned the venv and copied an OSM file which had worked before. So I don't know what is different now.
@gnn I also tried what you explained and ran only the tasks up to osmtgmod, but the result was the same.

ClaraBuettner (Contributor) commented:

Okay, now I also get this error on my local machine in all environments that worked before. I also created a new one, and osmtgmod fails with the same error.
It also happens when I run the failing osmtgmod function outside of Airflow, directly in SQL.

ulfmueller (Member) commented:

Damn it! How on earth is that possible? @ClaraBuettner should we have a meeting on this to brainstorm the possible reasons?

nesnoj (Member) commented Jul 15, 2021:

Thanks for testing! Damn, I was hoping that I did something wrong...

ClaraBuettner (Contributor) commented:

This is the exact same error that occurs when Java is not installed (see #206). I assumed that there was a problem with the Java installation and tried to uninstall and re-install Java, but it didn't help.

Then I had a look into my database and noticed that all tables in the public schema which are used as input for osmtgmod are empty. In addition, osmtgmod_osm_import needed much less time than in previous, successful runs. So I assume that the initial problem is in osmtgmod_osm_import and run_osmtgmod only fails because there is no input data.

The function osmtgmod_osm_import includes this hint:

# BUG: Python continues (and sets osm_metadata)
# even in case osmosis fails!!!
proc = subprocess.Popen(
    "%s --read-pbf %s --write-pgsql \
        database=%s host=%s user=%s password=%s"
    % (
        os.path.join(
            egon.data.__path__[0],
            "processing/osmtgmod/osmTGmod/",
            config["osm_data"]["osmosis_path_to_binary"],
        ),
        filtered_osm_pbf_path_to_file,
        config_database,
        config["postgres_server"]["host"]
        + ":"
        + config["postgres_server"]["port"],
        config["postgres_server"]["user"],
        config["postgres_server"]["password"],
    ),
    shell=True,
)

I replaced the subprocess.Popen call with the egon-data internal subprocess function to fix the bug described there. When I did this, the task osmtgmod_osm_import failed with this log:


> egon.data.subprocess.CalledProcessError: 
> | Command '/home/clara/test-egon-data/git/eGon-
> | data/src/egon/data/processing/osmtgmod/osmTGmod/osmosis/bin/osmosis
> | --read-pbf /home/clara/test-egon-data/git/eGon-
> | data/src/egon/data/importing/openstreetmap/schleswig-
> | holstein-210101.osm.pbf --write-pgsql                 database=test-
> | egon-data host=127.0.0.1:59734 user=egon password=data' returned non-
> | zero exit status 1.
> | Output:
> | Errors:
> |   Juli 16, 2021 9:48:56 VORM. org.openstreetmap.osmosis.core.Osmosis
> |   run
> |   INFORMATION: Osmosis Version 0.44.1
> |   Juli 16, 2021 9:48:56 VORM. org.openstreetmap.osmosis.core.Osmosis
> |   run
> |   INFORMATION: Preparing pipeline.
> |   java.io.FileNotFoundException: /home/clara/test-egon-data/git/eGon-
> |   data/src/egon/data/importing/openstreetmap/schleswig-
> |   holstein-210101.osm.pbf (Datei oder Verzeichnis nicht gefunden)
> |           at java.base/java.io.FileInputStream.open0(Native Method)
> |           at
> |   java.base/java.io.FileInputStream.open(FileInputStream.java:219)
> |           at
> |   java.base/java.io.FileInputStream.<init>(FileInputStream.java:157)
> |           at crosby.binary.osmosis.OsmosisReaderFactory.createTaskMana
> |   gerImpl(OsmosisReaderFactory.java:39)

So there is a problem with Java: osmosis (which imports the input data for osmtgmod) fails, and we just didn't see this error message before.
I have not found a solution yet, but at least I understand better what is happening.
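
For illustration, a minimal sketch of the change described above, using the standard library's subprocess.run with check=True so that a non-zero osmosis exit status raises an exception instead of being ignored. The egon-data internal subprocess wrapper behaves similarly; config, config_database and filtered_osm_pbf_path_to_file are the variables from the snippet quoted above.

# Sketch only: fail loudly if osmosis exits with a non-zero status.
import os
import subprocess

import egon.data

osmosis_command = (
    "%s --read-pbf %s --write-pgsql database=%s host=%s user=%s password=%s"
    % (
        os.path.join(
            egon.data.__path__[0],
            "processing/osmtgmod/osmTGmod/",
            config["osm_data"]["osmosis_path_to_binary"],
        ),
        filtered_osm_pbf_path_to_file,
        config_database,
        config["postgres_server"]["host"]
        + ":"
        + config["postgres_server"]["port"],
        config["postgres_server"]["user"],
        config["postgres_server"]["password"],
    )
)
# check=True raises subprocess.CalledProcessError on failure, so the task
# no longer continues silently when the osmosis import fails.
subprocess.run(osmosis_command, shell=True, check=True)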

ClaraBuettner (Contributor) commented:

I finally found the problem, and it is very easy to solve. The .pbf file has been stored in a different directory since the OSM download was migrated to datasets. When I change the path, it works. I will prepare a PR which solves this.

The old file is not deleted in directories that were created before. That is why it does not fail in existing instances of egon-data. (On my local machine I had the smart idea to clean up my folders and delete all old files, which is why the error also occurred in my formerly running instances.)
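
A purely illustrative sketch of the kind of fix described above (the actual PR, config keys and paths may differ): resolve the .pbf file from the OSM dataset's new download location instead of the old importing/openstreetmap directory, and fail early if it is missing.

# Hypothetical sketch; the config access and keys are assumptions, not the actual PR.
from pathlib import Path

import egon.data.config

osm_config = egon.data.config.datasets()["openstreetmap"]  # assumed keys
filtered_osm_pbf_path_to_file = Path(
    osm_config["target"]["file"]  # assumed to point to the new .pbf location
)
if not filtered_osm_pbf_path_to_file.exists():
    raise FileNotFoundError(
        f"OSM .pbf file not found at {filtered_osm_pbf_path_to_file}"
    )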

nesnoj (Member) commented Jul 16, 2021:

🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉

nesnoj (Member) commented Jul 19, 2021:

While SH ran successfully, DE did not.
I get an error in the upstream task osmtgmod_osm_import: No space left on device, but there are 350 GB available. I double-checked this; the space is also available in the container. I tried to increase the shm size again; it runs somewhat longer but ends with the same result. ARGH

ClaraBuettner (Contributor) commented:

> While SH ran successfully, DE did not.
> I get an error in the upstream task osmtgmod_osm_import: No space left on device, but there are 350 GB available. I double-checked this; the space is also available in the container. I tried to increase the shm size again; it runs somewhat longer but ends with the same result. ARGH

I remember that the parallel task migrate_osm needs a lot of disk space in between. Did this task also run in parallel, and might it have caused this problem?

nesnoj (Member) commented Jul 22, 2021:

> I get an error in the upstream task osmtgmod_osm_import: No space left on device, but there are 350 GB available. I double-checked this; the space is also available in the container. I tried to increase the shm size again; it runs somewhat longer but ends with the same result. ARGH

I found the problem, see #344

nesnoj (Member) commented Jul 22, 2021:

After fixing #344, run_osmtgmod also works in normal mode. Finally feeling like 😄 🎉 😄 🎉 😄 🎉
