Code Review: Methods to handle lines and polygons with CLIMADA exposures#221

Closed
Evelyn-M wants to merge 55 commits into develop from feature/lines_polys_exp

Conversation

@Evelyn-M (Collaborator)

Dear all

This is an incomplete, work-in-progress module, which I do not intend to merge soon.
Yet, I'd like some conceptual feedback and also make sure changes aren't horrendous in a month's time or so.

The changes include the following:
Merging util.interpolation and util.coordinates, which had led to circular imports. This also removes duplicate functions. The unit tests work after the merger.

Adding the following util functions in coordinates: interpolate_lines, interpolate_polygons and the helper dist_great_circle_allgeoms.

Adding the new module lines_polys_handler.py, which transforms line / polygon geometries into CLIMADA point exposures and back-transforms point impacts onto line / polygon geometries via some aggregation options; these are not yet implemented for polygons. (A rough sketch of the idea behind the line interpolation is given below.)

I also added "fake" unittests for all those functions in the respective test files.
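To make the intent of interpolate_lines concrete, here is a minimal, hypothetical sketch of the underlying idea (placing points at a fixed spacing along a shapely LineString); it is not the implementation in this branch, and the variable names are illustrative only:

```python
# Illustrative sketch only: place points at a fixed spacing along a line,
# which is the basic idea behind interpolating a line geometry into points.
# Distances are in the units of the line's coordinate reference system.
import numpy as np
from shapely.geometry import LineString, MultiPoint

line = LineString([(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)])  # toy line geometry
spacing = 0.25                                            # assumed point spacing

# Distances along the line at which to place representative points.
distances = np.arange(0.0, line.length, spacing)
points = MultiPoint([line.interpolate(d) for d in distances])
print(len(points.geoms), "points generated along the line")
```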

@chahank chahank requested review from zeliest and removed request for ThomasRoosli April 27, 2021 12:29
import numpy as np
from numba import jit
import pandas as pd
import pyproj
Member

This is a new dependency. Is it absolutely necessary? Has this been discussed with @emanuel-schmid? As a general convention, please discuss new dependencies openly before adding them in a pull request, to check whether a) they are really needed, b) they are compatible with the environment, and c) they do not pull in many extra implicit dependencies.

@tovogt (Collaborator) Apr 29, 2021

pyproj is not a new dependency for CLIMADA. It has been included in the requirements as an indirect dependency (through geopandas) for a long time and is used in several places in the code base. But maybe it's true that we should explicitly list it in the requirements file for the hypothetical case that geopandas might drop this dependency at some point in the future.

Still, it's not a big thing that @Evelyn-M imports it here and we should probably have this discussion in a different place.

@tovogt (Collaborator) Apr 29, 2021

Here is a full list of packages that are explicitly imported in CLIMADA, but not listed in the requirements files:

  • cftime
  • ee (on purpose)
  • fiona
  • mock
  • numpy (!)
  • pyproj
  • shapefile
  • shapely
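One rough way to reproduce such a list is sketched below; it assumes a plain name==version-style requirements file, the paths are illustrative, and the raw difference also contains standard-library and intra-package imports (and import names do not always match PyPI names, e.g. shapefile comes from pyshp), so the output needs manual review:

```python
# Sketch: collect top-level imports from the code base and compare them with
# names declared in a requirements-style file. Paths and parsing simplified.
import ast
from pathlib import Path

def imported_modules(src_dir):
    """Top-level module names imported anywhere under src_dir."""
    mods = set()
    for path in Path(src_dir).rglob("*.py"):
        tree = ast.parse(path.read_text(encoding="utf-8"))
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                mods.update(alias.name.split(".")[0] for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
                mods.add(node.module.split(".")[0])
    return mods

def declared_requirements(req_file):
    """Package names from a simple 'name==version' requirements file."""
    names = set()
    for line in Path(req_file).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            names.add(line.split("==")[0].split(">=")[0].strip().lower())
    return names

# Hypothetical paths; adapt to the actual repository layout.
print(sorted(imported_modules("climada") - declared_requirements("requirements.txt")))
```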

Member

Good points! It seems to me that some of these requirements should really be explicitly listed in the requirements file.

I agree that the import in general does not necessarily need to be discussed here, but this is still exactly the goal of pull requests of this type: discussing the general architecture, which includes the discussion of external dependencies.

MAX_DEM_TILES_DOWN = 300
"""Maximum DEM tiles to dowload"""

######### copied over from interpolation module #####################
Member

Does this mean that nothing in this pull request is new?

Collaborator Author

This means that I copied this part from the interpolation module (up to the next full line of #########). It's just a marker to facilitate the review; I will of course delete it later.

@chahank (Member) commented Apr 29, 2021

I have some questions about the use of certain methods. What would the following be used for? I do not really see concrete use cases. Also, they seem to not be used anywhere in the CLIMADA code base. This indicates to me that they should potentially be entirely removed instead of being ported.

  • interpol_index
  • index_nn_aprox
  • index_nn_haversine

@tovogt (Collaborator) commented Apr 29, 2021

The function interpol_index was originally implemented for the assign_centroids functionality. That functionality has now been moved to the util function assign_coordinates, which still uses interpol_index to find the index of the geographically nearest neighbor in a given point cloud.

The functions index_nn_approx and index_nn_haversine are helper functions for use within interpol_index.
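For context, the nearest-neighbour lookup that interpol_index provides can be sketched conceptually with a k-d tree; this is only an illustration, not CLIMADA's implementation (which additionally offers approximate and haversine variants):

```python
# Conceptual sketch: for each query point, find the index of the nearest
# point in a reference point cloud (what interpol_index is used for in
# assign_coordinates / assign_centroids).
import numpy as np
from scipy.spatial import cKDTree

centroids = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.5]])  # reference (lat, lon)
exposures = np.array([[0.1, 0.1], [0.9, 0.6]])               # points to assign

# Euclidean distance in degrees here; a haversine metric would be needed for
# true great-circle distances, which is the point of index_nn_haversine.
_, nearest_idx = cKDTree(centroids).query(exposures)
print(nearest_idx)  # [0 2]
```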

@chahank (Member) commented Apr 29, 2021

> The function interpol_index was originally implemented for the assign_centroids functionality. That functionality has now been moved to the util function assign_coordinates, which still uses interpol_index to find the index of the geographically nearest neighbor in a given point cloud.
>
> The functions index_nn_approx and index_nn_haversine are helper functions for use within interpol_index.

Strange, when I checked the references to these functions on GitHub it did not show them, but I do see them in the branch. Fair enough, thanks for clarifying.

@chahank chahank requested a review from ChrisFairless May 3, 2021 08:27
@chahank (Member) left a comment

A few reflections on the interpolate_polygons function.

axis=1)
# filter only centroids in actual polygons
for i, polygon in enumerate(gdf_points.geometry):
    in_geom = coord_on_land(lat=gdf_points['lat'].iloc[i],
Member

Why implicitly assume points on land? The util function to transform from polygons to points should not assume that. There are several use cases where points on the ocean are relevant.

Collaborator Author

That's not what it assumes; it does what the comment says: # filter only centroids in actual polygons.
The function coord_on_land simply checks whether the points generated from the even grid lie within the desired polygon and excludes the others. In short, it's just a wrapper for an sjoin function.

Member

This is confusing to me. Why use coord_on_land then? What happens if the coordinates or the centroids are in the water, and the polygon as well?

Why not use sjoin directly then?

Member

Thanks, I looked again at the code for coord_on_land, my bad. It can indeed be misused for filtering points in polygons.

Collaborator Author

I checked coord_on_land and it's basically a renaming of the shapely-native function vectorized.contains(), which is exactly what is needed here to see which points fall into the polygon. I now use the latter directly, as of commit e892d39.
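For reference, a minimal sketch of that filtering step with shapely 1.x's vectorized.contains (illustrative data only; in shapely 2.x the equivalent is shapely.contains_xy):

```python
# Keep only the candidate grid points that fall inside a polygon, using the
# vectorized containment check mentioned above.
import numpy as np
from shapely import vectorized
from shapely.geometry import Polygon

poly = Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])  # toy polygon
lon = np.array([0.5, 1.5, 3.0])                   # candidate point longitudes
lat = np.array([0.5, 1.5, 3.0])                   # candidate point latitudes

mask = vectorized.contains(poly, lon, lat)        # boolean array, one entry per point
lon_in, lat_in = lon[mask], lat[mask]
print(mask)  # [ True  True False]
```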

for axis in ['lon', 'lat']:
    gdf_points[axis] = gdf_points.apply(lambda row: row[axis].flatten(),
                                        axis=1)
# filter only centroids in actual polygons
Member

If I am not mistaken, you can thus easily end up with a polygon that is mapped to no point at all (if it is between raster points). I am not sure how to solve this, but it probably is not desired behaviour, or at least it should be warned about.

One possibility is to assign a value proportional to the cell area in a given polygon. This will lead to points belonging to different polygons at the same time though.

Collaborator Author

Well, if you choose your representative area per point too small, then yes. The function assumes that the user adequately judges the size of the polygons in relation to the area per point, i.e. if you have a tiny island of 1 km², you shouldn't interpolate your polygon with a point_area of 10'000 m².

Collaborator Author

There are also a lot of discussions online on how best to allocate a certain number of points within an arbitrary 2D shape. There are a few fancy algorithms out there, but honestly, I think for our purpose the commonly accepted easy way out is good enough: making an even grid of pre-determined resolution and then cropping it to the outline of the polygon. This is how it's implemented at the moment.

@chahank (Member) May 4, 2021

I agree with the simple implementation. But I would at least raise a warning if one of the polygons has no point associated.

Collaborator Author

A warning is now raised if a polygon contains no point at the given resolution. In that case, one "representative" point is still placed inside the shape.
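A minimal sketch of the behaviour described above (even grid over the polygon's bounding box, cropped to the polygon, with a warning and a representative-point fallback); the resolution handling is simplified and this is not the branch's code:

```python
# Sketch: lay an even grid over a polygon's bounding box, keep the points
# inside it, and fall back to one representative point (with a warning)
# if the grid is too coarse for the polygon.
import warnings
import numpy as np
from shapely.geometry import Point, Polygon

def points_in_polygon(poly, res):
    """Grid points with spacing `res` that lie inside `poly` (fallback: one point)."""
    x_min, y_min, x_max, y_max = poly.bounds
    xs, ys = np.meshgrid(np.arange(x_min, x_max, res),
                         np.arange(y_min, y_max, res))
    candidates = (Point(x, y) for x, y in zip(xs.ravel(), ys.ravel()))
    points = [pt for pt in candidates if poly.contains(pt)]
    if not points:
        warnings.warn("Polygon contains no grid point at this resolution; "
                      "using one representative point instead.")
        points = [poly.representative_point()]
    return points

tiny = Polygon([(0, 0), (0.1, 0), (0.1, 0.1), (0, 0.1)])
print(len(points_in_polygon(tiny, res=1.0)))  # 1 (the fallback point)
```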

# expand gdf from MultiPoint entries to single Points per row
return gdf_points.explode().drop(['distance_vector', 'length_full'], axis=1)

def interpolate_polygons(gdf_poly, area_point):
Member

Would it be useful to be able to provide the grid onto which the polygons are mapped? Think, for instance, of the case where one has a centroids raster and wants to map the polygons to this raster.

Collaborator Author

Ehm, if you already have a grid at hand, you may not really need this function: with a grid, you can just use an sjoin to see which points of the grid lie inside the polygon. The idea of this function is to help the user make a grid, so that they only need to think about the representative area (resolution) per grid point.
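For completeness, the sjoin variant mentioned here could look roughly like this (illustrative data; in older geopandas versions the keyword is op='within' instead of predicate='within'):

```python
# Sketch: if a grid of points already exists, a spatial join is enough to find
# which grid points fall inside which polygon.
import geopandas as gpd
from shapely.geometry import Point, Polygon

grid = gpd.GeoDataFrame(geometry=[Point(0.5, 0.5), Point(1.5, 1.5), Point(5, 5)])
polys = gpd.GeoDataFrame({"poly_id": [0]},
                         geometry=[Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])])

points_in_polys = gpd.sjoin(grid, polys, predicate="within", how="inner")
print(points_in_polys)  # the two points inside the polygon, tagged with poly_id
```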

Member

Ok, but then the function should only produce the grid. Another function can then be made that maps the polygons onto said grid.

Collaborator

Note that this function doesn't simply create "a grid". It creates a separate grid for each polygon. Not only is the origin of each of those grids different, but also the resolution. The reason is that polygons are assumed to be given in lat-lon coordinates, while the user specifies the desired resolution in meters. So the function chooses the grid resolution according to the specific degree-meter conversion factor at the latitudinal location of each polygon.

I partly agree with @chahank: this function should probably be better separated into "subfunctions". I think the most important separation would be to have a function _interpolate_one_polygon that works on each polygon separately and both generates the grid and applies it to the polygon. The function interpolate_polygons then just uses apply and explode. If you want to be even more modular (as suggested by @chahank), you can divide _interpolate_one_polygon further.
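The per-polygon degree/metre conversion described here can be sketched as follows (approximate constant; a rough illustration rather than the function's actual code):

```python
# Sketch: translate a resolution given in metres into grid spacings in degrees
# at a polygon's latitude, so that each polygon gets its own grid resolution.
import numpy as np

ONE_DEG_LAT_M = 111_000.0  # approximate metres per degree of latitude

def res_in_degrees(res_m, lat_deg):
    """Approximate (d_lat, d_lon) spacing in degrees for a spacing in metres."""
    d_lat = res_m / ONE_DEG_LAT_M
    d_lon = res_m / (ONE_DEG_LAT_M * np.cos(np.radians(lat_deg)))
    return d_lat, d_lon

print(res_in_degrees(1000, lat_deg=0))   # ~ (0.009, 0.009)
print(res_in_degrees(1000, lat_deg=60))  # longitude spacing roughly doubles
```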

Member

Good points! Since this is a pull request for testing the design, I suggest trying to split up the functions better.

After needing to use polygons recently, we (together with @ChrisFairless) saw that there are (at least) two different use cases. One is as you described: one needs the spatial information in each grid point and, for instance, divides a polygon into cells of equal surface. Another use case is when one simply needs to aggregate the impacts at the centroid points inside the polygon(s). I am not sure what the best way to handle both cases is, but separating the interpolation functions into subfunctions would allow both cases to be addressed more easily.

@Evelyn-M (Collaborator Author) May 11, 2021

Both good points! I will implement a more modular version and also think about your use case, @chahank, which was not the prime intention but is very easily implementable with the default geopandas functionalities. It might still fit somewhere here.

@chahank (Member) May 11, 2021

Great! I think this is really important for your design and should be taken into account right at the beginning. Even though it might not be the prime intention, a well-designed set of functions should allow for both use cases without difficulty. Whether this is possible might be a good way to test the modularity of the design.

I think the simplest use case is something like LitPop, where a given value is distributed over all chosen points inside a shape polygon (the basic functions should be integrable into LitPop). Afterwards, there should be a function with which the impacts are aggregated.

Now, the variations and difficulties come in when one chooses a way to distribute, a way to set the points inside the polygon (a grid, centroids, a grid of equal-surface cells, ...) and a way to aggregate the impacts.
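A minimal sketch of that simplest use case (a polygon's value distributed evenly over its points, impacts aggregated back per polygon); the column names are purely illustrative, not an existing API:

```python
# Sketch: disaggregate a polygon's total value evenly over its points, then
# aggregate per-point impacts back to one value per polygon.
import pandas as pd

# Hypothetical point exposure: each row is one point, tagged with its polygon.
points = pd.DataFrame({
    "poly_id":     [0, 0, 0, 1, 1],
    "total_value": [300.0, 300.0, 300.0, 80.0, 80.0],  # value of the parent polygon
})

# Disaggregation: split each polygon's value evenly over its points.
points["value"] = points["total_value"] / points.groupby("poly_id")["poly_id"].transform("size")

# Pretend point-level impact calculation (e.g. a flat 10% loss everywhere).
points["impact"] = 0.1 * points["value"]

# Aggregation: sum the point impacts back to the polygon level.
print(points.groupby("poly_id")["impact"].sum())  # poly 0: 30.0, poly 1: 8.0
```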

Collaborator Author

Commit a4ceae1 refactors the polygon (and line) interpolation: a sub-function _interpolate_one_polygon now creates a separate grid for each polygon at the desired resolution and keeps only those resulting points that lie inside it.

Parameters
----------
gdf_lines : gpd.GeoDataFrame or str
    GeoDataFrame or file path from which to read the GeoDataFrame
Member

I do not really like the mixing of data reading and data handling inside a util function.

Collaborator Author

Mmmh, how would you do it? Force the user to only one input type (i.e. gdfs)?

Member

Yes, exactly.

Collaborator Author

Changed this to allow only gdfs.

@chahank (Member) commented May 3, 2021

General comment: what is implemented here is basically a version of Exposure.set_lat_lon for Lines and Polygons instead of points. Would it make sense to include the helper functions there? Then there could be a quite natural flow for computing impacts from geometry exposures, by simply checking in impact.calc whether lat and lon are set and, if not, applying set_lat_lon. The difficulty is of course how to disaggregate by default, but maybe we can find a good compromise there. In addition, one could re-aggregate the impact by default again at the end of the computation. This might again lead to aggregation default problems, but could be a way to go.

I think in past discussions this idea was rejected due to the question of how to aggregate and disaggregate. However, we might give it a thought anew here, in particular if the util functions are not given different types of aggregation/disaggregation.

exp.check()
return exp

def agg_point_impact_to_lines(gdf_lines, exp_points, imp_points,
Member

Why are both gdf_lines and exp_points needed? The exp_points should contain all the information from gdf_lines.

Collaborator Author

True that - this has been removed.
It's still a bit "annoying" that the exposure has to be provided, simply because the multi-index on which to re-aggregate the points is needed. The impact instance only saves exposure coordinates as nd-arrays, without the indices.
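To illustrate why the exposures GeoDataFrame (with its index) is needed, a sketch of re-aggregating point impacts to their parent lines via a MultiIndex; all names here are hypothetical:

```python
# Sketch: the point exposure keeps a (line_id, point_nr) MultiIndex, which is
# what allows mapping per-point impacts back onto the original line geometries;
# the Impact object alone only stores bare coordinate arrays.
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1)], names=["line_id", "point_nr"])
impact_at_points = pd.Series([1.0, 2.0, 0.5, 3.0, 1.5], index=index)

# Re-aggregate: one impact value per original line.
print(impact_at_points.groupby(level="line_id").sum())  # line 0: 3.5, line 1: 4.5
```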

@Evelyn-M (Collaborator Author)

> General comment: what is implemented here is basically a version of Exposure.set_lat_lon for Lines and Polygons instead of points. Would it make sense to include the helper functions there? Then there could be a quite natural flow for computing impacts from geometry exposures, by simply checking in impact.calc whether lat and lon are set and, if not, applying set_lat_lon. The difficulty is of course how to disaggregate by default, but maybe we can find a good compromise there. In addition, one could re-aggregate the impact by default again at the end of the computation. This might again lead to aggregation default problems, but could be a way to go.
>
> I think in past discussions this idea was rejected due to the question of how to aggregate and disaggregate. However, we might give it a thought anew here, in particular if the util functions are not given different types of aggregation/disaggregation.

I agree that it would be nice for the impact calc flow, but it is also difficult without making lots of assumptions on the disaggregation and re-aggregation. I'll continue for now with the current structure, and then let's see if we can make a proposal to use these functionalities with some default assumptions in the impact.calc routine.

@chahank (Member) commented Dec 6, 2021

I propose to close this pull request since many changes have been made to the methods proposed for handling geometries and lines. New separate pull requests will be opened to handle the different proposed changes.

@Evelyn-M (Collaborator Author) commented Dec 6, 2021

Yes, we can do that
