Code Review: Methods to handle lines and polygons with CLIMADA exposures#221
Code Review: Methods to handle lines and polygons with CLIMADA exposures#221
Conversation
…python into feature/lines_polygons_exp
| import numpy as np | ||
| from numba import jit | ||
| import pandas as pd | ||
| import pyproj |
There was a problem hiding this comment.
This is a new dependency. Is it absolutely necessary? Has this been discussed with @emanuel-schmid . Please, as a general convention, before adding new dependencies in a pull request discuss it openly to see whether a) it is really needed b) it is compatible with the environment c) does not add many extra implicit dependencies.
There was a problem hiding this comment.
pyproj is not a new dependency for CLIMADA. It has been included in the requirements as an indirect dependency (through geopandas) for a long time and is used in several places in the code base. But maybe it's true that we should explicitly list it in the requirements file for the hypothetical case that geopandas might drop this dependency at some point in the future.
Still, it's not a big thing that @Evelyn-M imports it here and we should probably have this discussion in a different place.
There was a problem hiding this comment.
Here is a full list of packages that are explicitly imported in CLIMADA, but not listed in the requirements files:
- cftime
- ee (on purpose)
- fiona
- mock
- numpy (!)
- pyproj
- shapefile
- shapely
There was a problem hiding this comment.
Good points! It seems to me that some of these requirements should really be explicitly listed in the requirements file.
I agree that the import in general does not necessarily need to be discussed here, but this is still exactly the goal of pull requests of this type: discuss the general architecture which includes the discussion about external dependencies.
climada/util/coordinates.py
Outdated
| MAX_DEM_TILES_DOWN = 300 | ||
| """Maximum DEM tiles to dowload""" | ||
|
|
||
| ######### copied over from interpolation module ##################### |
There was a problem hiding this comment.
Does this mean that nothing in this pull request is new?
There was a problem hiding this comment.
This means that i copied this part from the interpolation method (until where the next full line of ######### is). It's just a marker to facilitate the review. I will of course delete this remark later.
|
I have some questions about the use of certain methods. What would the following be used for? I do not really see concrete use cases. Also, they seem to not be used anywhere in the CLIMADA code base. This indicates to me that they should potentially be entirely removed instead of being ported.
|
|
The function The functions |
Strange, when I checked the references to these functions on github it did not show them, but I see them in the branch. Fair enough. Thanks for clarifying. |
chahank
left a comment
There was a problem hiding this comment.
A few reflections on the interpolate_polygons function.
climada/util/coordinates.py
Outdated
| axis=1) | ||
| # filter only centroids in actual polygons | ||
| for i, polygon in enumerate(gdf_points.geometry): | ||
| in_geom = coord_on_land(lat=gdf_points['lat'].iloc[i], |
There was a problem hiding this comment.
Why implicitly assume points on land? The util function to transform from polygons to points should not assume that. There are several use cases where points on the ocean are relevant.
There was a problem hiding this comment.
That's not what it assumes. it does as said in the comment - # filter only centroids in actual polygons
The function coord_on_land simply checks whether the points generated from the even grid lie within the desired polygon, and excludes the others. in short, it's just a wrapper for an sjoin func
There was a problem hiding this comment.
This is confusing to me. Why use coord_on_land then? What happens if the coords or the centroids are in the water and the polygon also?
Why not use sjoin directly then?
There was a problem hiding this comment.
Thanks, I looked again at the code for coord_on_land, my bad. It can indeed be misused for filtering points in polygons.
There was a problem hiding this comment.
I checked the coord_on_land and it's basically a re-naming of the shapely-native function vectorized.contains() which is exactly what is needed here to see which points fall into the polygon. I use the latter directly now in commit e892d39
climada/util/coordinates.py
Outdated
| for axis in ['lon', 'lat']: | ||
| gdf_points[axis] = gdf_points.apply(lambda row: row[axis].flatten(), | ||
| axis=1) | ||
| # filter only centroids in actual polygons |
There was a problem hiding this comment.
If I am not mistaken, you can thus easily end up with a polygon that is mapped to no point at all (if it is between raster points). I am not sure how to solve this, but it probably is not desired behaviour, or at least it should be warned about.
One possibility is to assign a value proportional to the cell area in a given polygon. This will lead to points belonging to different polygons at the same time though.
There was a problem hiding this comment.
Well, if you choose your representative area per point too small, then yes. The function assumes that the user adequately judges the size of the polygons in relation to the area per point. Aka, if you have a tiny island of 1km^2, you shouldn't interpolate your polygon with a point_area of 10'000m2
There was a problem hiding this comment.
There's also a lot of discussions online on how to best allocate a certain amount of points within an arbitrary 2D shape. There's a few fancy algorithms out there, but honestly, i think for our purpose, the commonly accepted easy-way out is good enough. And that's making an even grid of pre-determined resolution, and then cropping it to the outline of the polygon. This is how it's implemented atm
There was a problem hiding this comment.
I agree with the simple implementation. But I would at least raise a warning if one of the polygons has no point associated.
There was a problem hiding this comment.
There was a problem hiding this comment.
A warning is now raised if a polygon has no point with the given resolution. Also, one "representative" point is then still set into the shape.
climada/util/coordinates.py
Outdated
| # expand gdf from MultiPoint entries to single Points per row | ||
| return gdf_points.explode().drop(['distance_vector', 'length_full'], axis=1) | ||
|
|
||
| def interpolate_polygons(gdf_poly, area_point): |
There was a problem hiding this comment.
Would it be useful to provide the grid to map the polygons onto? Think for instance that one has a centroids raster and wants to map the polygons to this raster.
There was a problem hiding this comment.
ehm. if you already have a grid at hand, you may not really need this function. with a grid, you can just use an sjoin to see which points of the grid lie inside the polygon. The idea of this func here is to help the user make a grid, where they just need to think about the representative area (resolution) per grid-point.
There was a problem hiding this comment.
Ok, but then the function should only produce the grid. Another function can then be made that maps the polygons onto said grid.
There was a problem hiding this comment.
Note that this function doesn't simply create "a grid". It creates a separate grid for each polygon. Not only the origin of each of those grids is different, but also the resolution. The reason for that is the fact that polygons are assumed to be given in lat-lon-coordinates, while the user specifies the desired resolution in meters. So, the function chooses the grid resolution according to the specific degree-meter conversion factor at the latitudinal location of each polygon.
I partly agree with @chahank: This function should probably be better separated into "subfunctions". I think the most important separation would be to have a function _interpolate_one_polygon that works on each polygon separately and does both generate the grid and apply it to the polygon. The function interpolate_polygons then just uses apply and explode. If you want to be even more modular (as suggested by @chahank), you can divide _interpolate_one_polygon further.
There was a problem hiding this comment.
Good points! Since this is a pull request for testing the design, I suggest to try to split up the functions better.
After needing to use polygons recently, we (together with @ChrisFairless ) saw that there are (at least) two different use cases. One is a you described, that one needs the spatial information in each grid point and for instance divide a polygons in cells of equal surface. Another use case is when one simply needs to aggregate the impacts at the centroid points inside of the polygon(s). I am not sure what is the best way to handle both cases. But separating the interpolation functions into subfunctions would allow to address both cases more easily.
There was a problem hiding this comment.
Both good points! I will implement a more modular version and also think about your use case @chahank , which was not the prime intention, and is very easily implementable with the default geopandas functionalities. It might still fit somewhere here.
There was a problem hiding this comment.
Great! I think this is really important for your design and should be taken into account right at the beginning. Even though it might not be the prime intention, a well-designed set of functions should without difficulty allow for both use cases. Whether this is possible might be a good way to test the (good) modularity of the design.
I think the most simple use-case is something like LitPop where a given value is distributed over all chosen points inside of a shape-polygon (the basic functions should be integrable/grated into LitPop). Afterward, there should be a function where the impacts are aggregated.
Now, the variations and difficulties come in when one chooses a way to distribute, a way to set the points inside of the polygon (a grid, centroids, grid of equal surface cells,...) and a way to aggregate the impacts.
There was a problem hiding this comment.
commit a4ceae1 refactors the polygon (and the line) interpolation: a sub-function _interpolate_one_polygon now creates a separate grid for each polygon with the desired resolution and cuts out only those resulting points that lie inside it.
climada/util/coordinates.py
Outdated
| Parameters | ||
| ---------- | ||
| gdf_lines : gpd.GeoDataframe or str | ||
| Geodataframe or filepath from which to read the GeoDataframe |
There was a problem hiding this comment.
I do not really like the mixing of data reading and data handling inside of a util funtion.
There was a problem hiding this comment.
mmmh. how would you do it? force the user to only one input type (i.e. gdfs)?
There was a problem hiding this comment.
Changed this to allowing only gdfs
|
General comment: what is implemented here is basically a version of I think in past discussions this idea was rejected due to the question of how to aggregate and disaggregate. However, we might give it a tough a new here. In particular, if the util functions do not get different types of aggregations/disaggregation. |
climada/util/lines_polys_handler.py
Outdated
| exp.check() | ||
| return exp | ||
|
|
||
| def agg_point_impact_to_lines(gdf_lines, exp_points, imp_points, |
There was a problem hiding this comment.
Why are both gdf_lines and exp_points needed? The exp_points should contain all the information from gdf_lines.
There was a problem hiding this comment.
True that- this has been removed.
It' still a bit "annoying" that the exposure has to be provided - simply because the multi-index is needed on which to re-aggregate the points. The impact instance only saves exposure coordinates as nd-arrays, without the indices.
I agree on the fact that it would be nice for the impact calc flow, but also difficult without making lots of assumptions on the disaggregation and re-aggregation. I'll continue for now with the current structure, and then lets see if we can make a proposal to use those functionalities with some default assumptions in the impact.calc routine |
|
I propose to close this pull request since many changes have been made to the methods proposed for handling geometries and lines. New separate pull requests will be opened to handle the different proposed changes. |
|
Yes, we can do that |
Dear all
This is an incomplete, work-in-progress module, which I do not intend to merge soon.
Yet, I'd like some conceptual feedback and also make sure changes aren't horrendous in a month's time or so.
The changes include the following:
Merging
util.interpolationandutil.coordinates, which has led to circular imports. This also removes duplicate functions. The unit tests work in the merger.Adding the following util functions in coordinates:
interpolate_lines,interpolate_polygonsand the helperdist_great_circle_allgeoms.Adding the new module
lines_polys_handler.pywhich does the transformation from line / poly geometries into CLIMADA point exposures and the back-transformation from point impacts into line / polygon geometries, via some aggregation options. They are not yet implemented for polys.I also added "fake" unittests for all those functions in the respective test files.