Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resolve segmentation fault in clisops PR 179 #183

Closed
agstephens opened this issue Aug 31, 2021 · 11 comments
Closed

Resolve segmentation fault in clisops PR 179 #183

agstephens opened this issue Aug 31, 2021 · 11 comments
Assignees

Comments

@agstephens
Copy link
Collaborator

agstephens commented Aug 31, 2021

  • clisops version: 0.6.5 - branch: regrid-main-ag
  • Python version: 3.8
  • Operating System: centos7

Description

Trying to debug why we get a segmentation fault in the clisop Pull Request: #179

I did this:

# Install clisops from scratch

## Create VM

Create a VM with 2GB RAM and 2 CPUs

## Install clisops and environment and run tests

--
mkdir ~/clisops-install
cd ~/clisops-install/

wget https://repo.anaconda.com/miniconda/Miniconda3-py38_4.10.3-Linux-x86_64.sh
bash Miniconda3-py38_4.10.3-Linux-x86_64.sh -b -p ~/clisops-install/miniconda

export PATH=${PWD}/miniconda/bin:$PATH

git clone https://github.com/roocs/clisops
cd clisops
git checkout regrid-main-ag

time conda env create -f environment.yml
# Took 7 minutes
source activate clisops
pip install -e ".[dev]"
### pip install --upgrade git+https://github.com/pangeo-data/xESMF.git
pip install pangeo-xesmf==0.6.0


pytest --cov tests
--

And yes, @sol1105 , I get a segmentation fault here too!
Now, I just need to compare the two environments.

@agstephens agstephens self-assigned this Aug 31, 2021
@agstephens
Copy link
Collaborator Author

I have fixed python=3.8. Latest results:

  • run all: pytest tests - segfault
  • run core tests: pytest tests/core - regrid tests fail (?)
    • can be fixed by referencing .data in:

clisops/core/regrid.py:513: in grid_reformat
ds_ref = xr.Dataset(

- run only core subset tests: tests pass

@agstephens
Copy link
Collaborator Author

agstephens commented Sep 1, 2021

The order the tests are run in seems to affect the result:

  1. No segfault with:
pytest tests/core/test_average.py tests/core/test_regrid.py tests/core/test_subset.py tests/ops/test_average.py tests/ops/test_subset.py tests/ops/test_xarray_mean.py tests/ops/test_regrid.py tests/test_config.py tests/test_dataset_roll.py tests/test_file_namers.py tests/test_output_utils.py tests/test_roll.py tests/test_utils/test_dataset_utils.py
  1. Segfault with:
pytest  # to run all tests
  1. Segfaults happen with these combinations:
pytest tests/test_config.py tests/test_dataset_roll.py tests/test_file_namers.py tests/test_output_utils.py tests/test_roll.py tests/core/test_average.py tests/core/test_regrid.py tests/core/test_subset.py

and

pytest tests/test_config.py tests/test_dataset_roll.py tests/test_file_namers.py tests/test_output_utils.py tests/test_roll.py tests/core/test_subset.py

and

pytest tests/test_config.py tests/test_dataset_roll.py tests/test_file_namers.py  tests/core/test_subset.py
pytest tests/test_config.py tests/test_file_namers.py tests/core/test_subset.py
pytest tests/test_file_namers.py tests/core/test_subset.py

But, THESE SUCCEED:

pytest tests/test_config.py tests/test_dataset_roll.py tests/core/test_subset.py
pytest tests/test_config.py tests/core/test_subset.py
pytest tests/core/test_subset.py

--

Does python -m pytest make any difference? No, it doesn't.

@agstephens
Copy link
Collaborator Author

So, to summarise...

FAILS:

pytest tests/test_file_namers.py tests/core/test_subset.py

SUCCEEDS:

pytest tests/test_file_namers.py
pytest tests/core/test_subset.py
pytest tests/core/test_subset.py tests/test_file_namers.py

ASSUMED REASON:

  • an interaction is causing the seg fault, between the two test modules when run in this order:
    • tests/test_file_namers.py
    • tests/core/test_subset.py

@agstephens
Copy link
Collaborator Author

@cehbrecht: please can you try to reproduce the problem outlined here. The first part of this issue explains how to set up the environment. Please confirm that you get a segfault that matches my experience above...

FAILS:

pytest tests/test_file_namers.py tests/core/test_subset.py

SUCCEEDS:

pytest tests/test_file_namers.py
pytest tests/core/test_subset.py
pytest tests/core/test_subset.py tests/test_file_namers.py

@cehbrecht
Copy link
Collaborator

I can reproduce the segfault error in a Debian docker container. Well, it works on my mac ;)

On my dev branch I can avoid the segfault by removing the regrid import:

# from .regrid import Grid, Weights, regrid

and use the direct import of regrid:

from clisops.core.regrid import Grid, Weights
from clisops.core.regrid import regrid as core_regrid

With these changes, the tests are working:

pytest tests
pytest tests/test_file_namers.py tests/core/test_subset.py

@cehbrecht
Copy link
Collaborator

you can also switch subset and regrid imports:

from .subset import (
create_mask,
subset_bbox,
subset_gridpoint,
subset_level,
subset_shape,
subset_time,
)
from .regrid import Grid, Weights, regrid

@agstephens
Copy link
Collaborator Author

This final fix works, as long we use: pangeo-xesmf==0.6.0

But there are tests failing. Need to now look into these...

@agstephens
Copy link
Collaborator Author

Hi @cehbrecht, I've made some progress on this stuff. Can you remember why we/you set the rules for which GitHub Actions are allowed to fail in this workflow:

https://github.com/roocs/clisops/blob/master/.github/workflows/main.yml#L26-L42

@cehbrecht
Copy link
Collaborator

I think it is a copy+paste thing. There is no py37-xarray env in tox.ini. We can probably change the env py37-xarray to py37 and it should not fail:

- python-version: 3.7
tox-env: py37-xarray
allowed_to_fail: true

ping @Zeitsperre

@agstephens
Copy link
Collaborator Author

@Zeitsperre I think we got the different CI environments from xclim initially. We now have these various allowed_to_fail options. Do you think it is okay for any of these to fail? I have lost track whether some of them always will because of dependencies used in the subsetting by shape/polygon. Thanks

@Zeitsperre
Copy link
Collaborator

@cehbrecht There should be less of a likelihood of segmentation faults now that read operations are separated by worker and cache as of #345. Closing this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants