Code Review: Methods to handle lines and polygons with CLIMADA exposures#221

Closed
Evelyn-M wants to merge 55 commits into develop from feature/lines_polys_exp

Conversation

@Evelyn-M (Collaborator)

Dear all

This is an incomplete, work-in-progress module, which I do not intend to merge soon.
Yet, I'd like some conceptual feedback and also make sure changes aren't horrendous in a month's time or so.

The changes include the following:
Merging util.interpolation and util.coordinates, which had led to circular imports. This also removes duplicate functions. The unit tests work after the merger.

Adding the following util functions in coordinates: interpolate_lines, interpolate_polygons and the helper dist_great_circle_allgeoms.

Adding the new module lines_polys_handler.py, which transforms line / polygon geometries into CLIMADA point exposures and back-transforms point impacts onto line / polygon geometries via some aggregation options; these are not yet implemented for polygons. (A rough sketch of the idea behind the line interpolation is given below.)

I also added "fake" unittests for all those functions in the respective test files.
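To make the intent of interpolate_lines concrete, here is a minimal, hypothetical sketch of the underlying idea (placing points at a fixed spacing along a shapely LineString); it is not the implementation in this branch, and the variable names are illustrative only:

```python
# Illustrative sketch only: place points at a fixed spacing along a line,
# which is the basic idea behind interpolating a line geometry into points.
# Distances are in the units of the line's coordinate reference system.
import numpy as np
from shapely.geometry import LineString, MultiPoint

line = LineString([(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)])  # toy line geometry
spacing = 0.25                                            # assumed point spacing

# Distances along the line at which to place representative points.
distances = np.arange(0.0, line.length, spacing)
points = MultiPoint([line.interpolate(d) for d in distances])
print(len(points.geoms), "points generated along the line")
```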

@chahank chahank requested review from zeliest and removed request for ThomasRoosli April 27, 2021 12:29
import numpy as np
from numba import jit
import pandas as pd
import pyproj
Member

This is a new dependency. Is it absolutely necessary? Has this been discussed with @emanuel-schmid? As a general convention, please discuss new dependencies openly before adding them in a pull request, to check whether a) they are really needed, b) they are compatible with the environment, and c) they do not pull in many extra implicit dependencies.

@tovogt (Collaborator) Apr 29, 2021

pyproj is not a new dependency for CLIMADA. It has been included in the requirements as an indirect dependency (through geopandas) for a long time and is used in several places in the code base. But maybe it's true that we should explicitly list it in the requirements file for the hypothetical case that geopandas might drop this dependency at some point in the future.

Still, it's not a big thing that @Evelyn-M imports it here and we should probably have this discussion in a different place.

@tovogt (Collaborator) Apr 29, 2021

Here is a full list of packages that are explicitly imported in CLIMADA, but not listed in the requirements files:

  • cftime
  • ee (on purpose)
  • fiona
  • mock
  • numpy (!)
  • pyproj
  • shapefile
  • shapely
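One rough way to reproduce such a list is sketched below; it assumes a plain name==version-style requirements file, the paths are illustrative, and the raw difference also contains standard-library and intra-package imports (and import names do not always match PyPI names, e.g. shapefile comes from pyshp), so the output needs manual review:

```python
# Sketch: collect top-level imports from the code base and compare them with
# names declared in a requirements-style file. Paths and parsing simplified.
import ast
from pathlib import Path

def imported_modules(src_dir):
    """Top-level module names imported anywhere under src_dir."""
    mods = set()
    for path in Path(src_dir).rglob("*.py"):
        tree = ast.parse(path.read_text(encoding="utf-8"))
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                mods.update(alias.name.split(".")[0] for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
                mods.add(node.module.split(".")[0])
    return mods

def declared_requirements(req_file):
    """Package names from a simple 'name==version' requirements file."""
    names = set()
    for line in Path(req_file).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            names.add(line.split("==")[0].split(">=")[0].strip().lower())
    return names

# Hypothetical paths; adapt to the actual repository layout.
print(sorted(imported_modules("climada") - declared_requirements("requirements.txt")))
```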

Member

Good points! It seems to me that some of these requirements should really be explicitly listed in the requirements file.

I agree that the import in general does not necessarily need to be discussed here, but this is still exactly the goal of pull requests of this type: discussing the general architecture, which includes the discussion of external dependencies.

MAX_DEM_TILES_DOWN = 300
"""Maximum DEM tiles to dowload"""

######### copied over from interpolation module #####################
Member

Does this mean that nothing in this pull request is new?

Collaborator Author

This means that I copied this part from the interpolation module (up to the next full line of #########). It's just a marker to facilitate the review; I will of course delete it later.

@chahank (Member) commented Apr 29, 2021

I have some questions about the use of certain methods. What would the following be used for? I do not really see concrete use cases. Also, they seem to not be used anywhere in the CLIMADA code base. This indicates to me that they should potentially be entirely removed instead of being ported.

  • interpol_index
  • index_nn_aprox
  • index_nn_haversine

@tovogt (Collaborator) commented Apr 29, 2021

The function interpol_index was originally implemented for the assign_centroids functionality. That functionality has now been moved to the util function assign_coordinates, which still uses interpol_index to find the index of the geographically nearest neighbor in a given point cloud.

The functions index_nn_approx and index_nn_haversine are helper functions for use within interpol_index.
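For context, the nearest-neighbour lookup that interpol_index provides can be sketched conceptually with a k-d tree; this is only an illustration, not CLIMADA's implementation (which additionally offers approximate and haversine variants):

```python
# Conceptual sketch: for each query point, find the index of the nearest
# point in a reference point cloud (what interpol_index is used for in
# assign_coordinates / assign_centroids).
import numpy as np
from scipy.spatial import cKDTree

centroids = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.5]])  # reference (lat, lon)
exposures = np.array([[0.1, 0.1], [0.9, 0.6]])               # points to assign

# Euclidean distance in degrees here; a haversine metric would be needed for
# true great-circle distances, which is the point of index_nn_haversine.
_, nearest_idx = cKDTree(centroids).query(exposures)
print(nearest_idx)  # [0 2]
```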

@chahank (Member) commented Apr 29, 2021

> The function interpol_index was originally implemented for the assign_centroids functionality. That functionality has now been moved to the util function assign_coordinates, which still uses interpol_index to find the index of the geographically nearest neighbor in a given point cloud.
>
> The functions index_nn_approx and index_nn_haversine are helper functions for use within interpol_index.

Strange, when I checked the references to these functions on GitHub it did not show them, but I do see them in the branch. Fair enough, thanks for clarifying.

@chahank chahank requested a review from ChrisFairless May 3, 2021 08:27
@chahank (Member) left a comment

A few reflections on the interpolate_polygons function.

axis=1)
# filter only centroids in actual polygons
for i, polygon in enumerate(gdf_points.geometry):
    in_geom = coord_on_land(lat=gdf_points['lat'].iloc[i],
Member

Why implicitly assume points on land? The util function to transform from polygons to points should not assume that. There are several use cases where points on the ocean are relevant.

Collaborator Author

That's not what it assumes; it does what the comment says: # filter only centroids in actual polygons.
The function coord_on_land simply checks whether the points generated from the even grid lie within the desired polygon and excludes the others. In short, it's just a wrapper for an sjoin function.

Member

This is confusing to me. Why use coord_on_land then? What happens if the coordinates or the centroids are in the water, and the polygon as well?

Why not use sjoin directly then?

Member

Thanks, I looked again at the code for coord_on_land, my bad. It can indeed be misused for filtering points in polygons.

Collaborator Author

I checked coord_on_land and it's basically a renaming of the shapely-native function vectorized.contains(), which is exactly what is needed here to see which points fall into the polygon. I now use the latter directly, as of commit e892d39.
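For reference, a minimal sketch of that filtering step with shapely 1.x's vectorized.contains (illustrative data only; in shapely 2.x the equivalent is shapely.contains_xy):

```python
# Keep only the candidate grid points that fall inside a polygon, using the
# vectorized containment check mentioned above.
import numpy as np
from shapely import vectorized
from shapely.geometry import Polygon

poly = Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])  # toy polygon
lon = np.array([0.5, 1.5, 3.0])                   # candidate point longitudes
lat = np.array([0.5, 1.5, 3.0])                   # candidate point latitudes

mask = vectorized.contains(poly, lon, lat)        # boolean array, one entry per point
lon_in, lat_in = lon[mask], lat[mask]
print(mask)  # [ True  True False]
```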

for axis in ['lon', 'lat']:
    gdf_points[axis] = gdf_points.apply(lambda row: row[axis].flatten(),
                                        axis=1)
# filter only centroids in actual polygons
Member

If I am not mistaken, you can thus easily end up with a polygon that is mapped to no point at all (if it is between raster points). I am not sure how to solve this, but it probably is not desired behaviour, or at least it should be warned about.

One possibility is to assign a value proportional to the cell area in a given polygon. This will lead to points belonging to different polygons at the same time though.

Collaborator Author

Well, if you choose your representative area per point too small, then yes. The function assumes that the user adequately judges the size of the polygons in relation to the area per point, i.e. if you have a tiny island of 1 km², you shouldn't interpolate your polygon with a point_area of 10'000 m².

Collaborator Author

There are also a lot of discussions online on how best to allocate a certain number of points within an arbitrary 2D shape. There are a few fancy algorithms out there, but honestly, I think for our purpose the commonly accepted easy way out is good enough: making an even grid of pre-determined resolution and then cropping it to the outline of the polygon. This is how it's implemented at the moment.

@chahank (Member) May 4, 2021

I agree with the simple implementation. But I would at least raise a warning if one of the polygons has no point associated.

Collaborator Author

A warning is now raised if a polygon contains no point at the given resolution. In that case, one "representative" point is still placed inside the shape.
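A minimal sketch of the behaviour described above (even grid over the polygon's bounding box, cropped to the polygon, with a warning and a representative-point fallback); the resolution handling is simplified and this is not the branch's code:

```python
# Sketch: lay an even grid over a polygon's bounding box, keep the points
# inside it, and fall back to one representative point (with a warning)
# if the grid is too coarse for the polygon.
import warnings
import numpy as np
from shapely.geometry import Point, Polygon

def points_in_polygon(poly, res):
    """Grid points with spacing `res` that lie inside `poly` (fallback: one point)."""
    x_min, y_min, x_max, y_max = poly.bounds
    xs, ys = np.meshgrid(np.arange(x_min, x_max, res),
                         np.arange(y_min, y_max, res))
    candidates = (Point(x, y) for x, y in zip(xs.ravel(), ys.ravel()))
    points = [pt for pt in candidates if poly.contains(pt)]
    if not points:
        warnings.warn("Polygon contains no grid point at this resolution; "
                      "using one representative point instead.")
        points = [poly.representative_point()]
    return points

tiny = Polygon([(0, 0), (0.1, 0), (0.1, 0.1), (0, 0.1)])
print(len(points_in_polygon(tiny, res=1.0)))  # 1 (the fallback point)
```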

# expand gdf from MultiPoint entries to single Points per row
return gdf_points.explode().drop(['distance_vector', 'length_full'], axis=1)

def interpolate_polygons(gdf_poly, area_point):
Member

Would it be useful to be able to provide the grid onto which the polygons are mapped? Think, for instance, of the case where one has a centroids raster and wants to map the polygons to this raster.

Collaborator Author

Ehm, if you already have a grid at hand, you may not really need this function: with a grid, you can just use an sjoin to see which points of the grid lie inside the polygon. The idea of this function is to help the user make a grid, so that they only need to think about the representative area (resolution) per grid point.
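For completeness, the sjoin variant mentioned here could look roughly like this (illustrative data; in older geopandas versions the keyword is op='within' instead of predicate='within'):

```python
# Sketch: if a grid of points already exists, a spatial join is enough to find
# which grid points fall inside which polygon.
import geopandas as gpd
from shapely.geometry import Point, Polygon

grid = gpd.GeoDataFrame(geometry=[Point(0.5, 0.5), Point(1.5, 1.5), Point(5, 5)])
polys = gpd.GeoDataFrame({"poly_id": [0]},
                         geometry=[Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])])

points_in_polys = gpd.sjoin(grid, polys, predicate="within", how="inner")
print(points_in_polys)  # the two points inside the polygon, tagged with poly_id
```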

Member

Ok, but then the function should only produce the grid. Another function can then be made that maps the polygons onto said grid.

Collaborator

Note that this function doesn't simply create "a grid". It creates a separate grid for each polygon. Not only is the origin of each of those grids different, but also the resolution. The reason is that polygons are assumed to be given in lat-lon coordinates, while the user specifies the desired resolution in meters. So the function chooses the grid resolution according to the specific degree-meter conversion factor at the latitudinal location of each polygon.

I partly agree with @chahank: this function should probably be better separated into "subfunctions". I think the most important separation would be to have a function _interpolate_one_polygon that works on each polygon separately and both generates the grid and applies it to the polygon. The function interpolate_polygons then just uses apply and explode. If you want to be even more modular (as suggested by @chahank), you can divide _interpolate_one_polygon further.
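The per-polygon degree/metre conversion described here can be sketched as follows (approximate constant; a rough illustration rather than the function's actual code):

```python
# Sketch: translate a resolution given in metres into grid spacings in degrees
# at a polygon's latitude, so that each polygon gets its own grid resolution.
import numpy as np

ONE_DEG_LAT_M = 111_000.0  # approximate metres per degree of latitude

def res_in_degrees(res_m, lat_deg):
    """Approximate (d_lat, d_lon) spacing in degrees for a spacing in metres."""
    d_lat = res_m / ONE_DEG_LAT_M
    d_lon = res_m / (ONE_DEG_LAT_M * np.cos(np.radians(lat_deg)))
    return d_lat, d_lon

print(res_in_degrees(1000, lat_deg=0))   # ~ (0.009, 0.009)
print(res_in_degrees(1000, lat_deg=60))  # longitude spacing roughly doubles
```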

Member

Good points! Since this is a pull request for testing the design, I suggest trying to split up the functions better.

After needing to use polygons recently, we (together with @ChrisFairless) saw that there are (at least) two different use cases. One is as you described: one needs the spatial information in each grid point and, for instance, divides a polygon into cells of equal surface. Another use case is when one simply needs to aggregate the impacts at the centroid points inside the polygon(s). I am not sure what the best way to handle both cases is, but separating the interpolation functions into subfunctions would allow both cases to be addressed more easily.

@Evelyn-M (Collaborator Author) May 11, 2021

Both good points! I will implement a more modular version and also think about your use case, @chahank, which was not the prime intention but is very easily implementable with the default geopandas functionalities. It might still fit somewhere here.

@chahank (Member) May 11, 2021

Great! I think this is really important for your design and should be taken into account right at the beginning. Even though it might not be the prime intention, a well-designed set of functions should allow for both use cases without difficulty. Whether this is possible might be a good way to test the modularity of the design.

I think the simplest use case is something like LitPop, where a given value is distributed over all chosen points inside a shape polygon (the basic functions should be integrable into LitPop). Afterwards, there should be a function with which the impacts are aggregated.

Now, the variations and difficulties come in when one chooses a way to distribute, a way to set the points inside the polygon (a grid, centroids, a grid of equal-surface cells, ...) and a way to aggregate the impacts.
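A minimal sketch of that simplest use case (a polygon's value distributed evenly over its points, impacts aggregated back per polygon); the column names are purely illustrative, not an existing API:

```python
# Sketch: disaggregate a polygon's total value evenly over its points, then
# aggregate per-point impacts back to one value per polygon.
import pandas as pd

# Hypothetical point exposure: each row is one point, tagged with its polygon.
points = pd.DataFrame({
    "poly_id":     [0, 0, 0, 1, 1],
    "total_value": [300.0, 300.0, 300.0, 80.0, 80.0],  # value of the parent polygon
})

# Disaggregation: split each polygon's value evenly over its points.
points["value"] = points["total_value"] / points.groupby("poly_id")["poly_id"].transform("size")

# Pretend point-level impact calculation (e.g. a flat 10% loss everywhere).
points["impact"] = 0.1 * points["value"]

# Aggregation: sum the point impacts back to the polygon level.
print(points.groupby("poly_id")["impact"].sum())  # poly 0: 30.0, poly 1: 8.0
```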

Collaborator Author

Commit a4ceae1 refactors the polygon (and line) interpolation: a sub-function _interpolate_one_polygon now creates a separate grid for each polygon at the desired resolution and keeps only those resulting points that lie inside it.

Parameters
----------
gdf_lines : gpd.GeoDataFrame or str
    GeoDataFrame or file path from which to read the GeoDataFrame
Member

I do not really like the mixing of data reading and data handling inside a util function.

Collaborator Author

Mmmh, how would you do it? Force the user to only one input type (i.e. gdfs)?

Member

Yes, exactly.

Collaborator Author

Changed this to allow only gdfs.

@chahank (Member) commented May 3, 2021

General comment: what is implemented here is basically a version of Exposure.set_lat_lon for Lines and Polygons instead of points. Would it make sense to include the helper functions there? Then there could be a quite natural flow for computing impacts from geometry exposures, by simply checking in impact.calc whether lat and lon are set and, if not, applying set_lat_lon. The difficulty is of course how to disaggregate by default, but maybe we can find a good compromise there. In addition, one could re-aggregate the impact by default again at the end of the computation. This might again lead to aggregation default problems, but could be a way to go.

I think in past discussions this idea was rejected due to the question of how to aggregate and disaggregate. However, we might give it a thought anew here, in particular if the util functions are not given different types of aggregation/disaggregation.

exp.check()
return exp

def agg_point_impact_to_lines(gdf_lines, exp_points, imp_points,
Member

Why are both gdf_lines and exp_points needed? The exp_points should contain all the information from gdf_lines.

Collaborator Author

True that - this has been removed.
It's still a bit "annoying" that the exposure has to be provided, simply because the multi-index on which to re-aggregate the points is needed. The impact instance only saves exposure coordinates as nd-arrays, without the indices.
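To illustrate why the exposures GeoDataFrame (with its index) is needed, a sketch of re-aggregating point impacts to their parent lines via a MultiIndex; all names here are hypothetical:

```python
# Sketch: the point exposure keeps a (line_id, point_nr) MultiIndex, which is
# what allows mapping per-point impacts back onto the original line geometries;
# the Impact object alone only stores bare coordinate arrays.
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1)], names=["line_id", "point_nr"])
impact_at_points = pd.Series([1.0, 2.0, 0.5, 3.0, 1.5], index=index)

# Re-aggregate: one impact value per original line.
print(impact_at_points.groupby(level="line_id").sum())  # line 0: 3.5, line 1: 4.5
```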

@Evelyn-M (Collaborator Author)

> General comment: what is implemented here is basically a version of Exposure.set_lat_lon for Lines and Polygons instead of points. Would it make sense to include the helper functions there? Then there could be a quite natural flow for computing impacts from geometry exposures, by simply checking in impact.calc whether lat and lon are set and, if not, applying set_lat_lon. The difficulty is of course how to disaggregate by default, but maybe we can find a good compromise there. In addition, one could re-aggregate the impact by default again at the end of the computation. This might again lead to aggregation default problems, but could be a way to go.
>
> I think in past discussions this idea was rejected due to the question of how to aggregate and disaggregate. However, we might give it a thought anew here, in particular if the util functions are not given different types of aggregation/disaggregation.

I agree that it would be nice for the impact calc flow, but it is also difficult without making lots of assumptions on the disaggregation and re-aggregation. I'll continue for now with the current structure, and then let's see if we can make a proposal to use these functionalities with some default assumptions in the impact.calc routine.

@chahank (Member) commented Dec 6, 2021

I propose to close this pull request since many changes have been made to the methods proposed for handling geometries and lines. New separate pull requests will be opened to handle the different proposed changes.

@Evelyn-M (Collaborator Author) commented Dec 6, 2021

Yes, we can do that
