Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement geom_equals and binary predicates that depend only on it. #926

Merged
Merged
Show file tree
Hide file tree
Changes from 58 commits
Commits
Show all changes
74 commits
Select commit Hold shift + click to select a range
67a2a6a
Start refactoring binops to handle various ops and inputs.
thomcom Dec 1, 2022
3038253
Abstraction working with binops that depend on pip.
thomcom Dec 1, 2022
d7a842b
Clean it up in GeoSeries.
thomcom Dec 1, 2022
3301b95
Correct abstraction for pip_only tests.
thomcom Dec 1, 2022
f6688d7
Support the seven extra pip tests.
thomcom Dec 2, 2022
77b6858
Support all supportable pip calls.
thomcom Dec 2, 2022
c69d2e6
Remove unused abstraction.
thomcom Dec 2, 2022
de923c5
Clean up binops.intersects docs.
thomcom Dec 2, 2022
67b0df6
Pass all equals tests except one.
thomcom Dec 2, 2022
b4ea5e5
Merge
thomcom Dec 5, 2022
305f937
Finish abstracting pip based ops.
thomcom Dec 5, 2022
707217b
Pass all multi tests.
thomcom Dec 5, 2022
376833a
Add equals only binops.
thomcom Dec 7, 2022
f0d0c46
Rename binops to binpreds.
thomcom Dec 8, 2022
441a585
Cleaning up based on review.
thomcom Dec 8, 2022
e02a669
Cleaned up unneeded future material.
thomcom Dec 8, 2022
6444a4c
Resolve has_same_dimension naming issue.
thomcom Dec 8, 2022
29870d4
Fix postprocess docs.
thomcom Dec 9, 2022
c2cc722
Factor out op string argument and move logic to the __call__ function…
thomcom Dec 9, 2022
18631e3
Fix bug with complete overlap
thomcom Dec 9, 2022
a1ab8a7
Merge new pip only binpreds refactor into equals.
thomcom Dec 9, 2022
023247e
Pass all equals only tests with new factoring.
thomcom Dec 9, 2022
3ddf7c4
Fix contains issue.
thomcom Dec 9, 2022
82f40ec
Update python/cuspatial/cuspatial/core/binpreds/binpreds.py
thomcom Dec 12, 2022
b71f34f
Handle review comments from @isVoid.
thomcom Dec 12, 2022
5338a73
Remove more equals code.
thomcom Dec 12, 2022
af5ae69
Merge branch 'feature/all_pip_operations' into feature/all_equals_ope…
thomcom Dec 12, 2022
34b3fbf
Update python/cuspatial/cuspatial/core/geoseries.py
thomcom Dec 13, 2022
7be5a73
Handle review comments from @harrism
thomcom Dec 13, 2022
230b0b7
Merge branch 'feature/all_equals_operations' of github.com:thomcom/cu…
thomcom Dec 13, 2022
e2a3616
Merge branch 'branch-23.02' into feature/all_equals_operations
thomcom Dec 13, 2022
04cd58d
Update python/cuspatial/cuspatial/core/geoseries.py
thomcom Dec 13, 2022
c9211f3
Update python/cuspatial/cuspatial/core/binpreds/binpreds.py
thomcom Dec 14, 2022
a2acb0c
Update python/cuspatial/cuspatial/core/binpreds/binpreds.py
thomcom Dec 14, 2022
f749ca0
Replace operation with predicate
thomcom Dec 14, 2022
e94293a
Replace operation with predicate.
thomcom Dec 14, 2022
3e7dacd
Rewrite docs.
thomcom Dec 14, 2022
86d30b6
Merge branch 'feature/all_pip_operations' into feature/all_equals_ope…
thomcom Dec 14, 2022
2c2de39
Working on using instead of equals.
thomcom Dec 15, 2022
5448dab
Fix everything except equals tests.
thomcom Dec 16, 2022
5b39f70
Iterative refactor that actually compares some points and linestrings…
thomcom Dec 16, 2022
c7f1d21
More specific test cases.
thomcom Dec 16, 2022
56865de
Refactor out _compare_aligned_vertices
thomcom Dec 16, 2022
57a2d46
Add multipoint
thomcom Dec 16, 2022
b5002e2
Add misordered multipoint test and success.
thomcom Dec 16, 2022
3e1d4d5
Implement polygons cases which do not yet pass.
thomcom Dec 16, 2022
fdc2fe6
Committed a crime of refactoring in order to get polygons non reorder…
thomcom Dec 16, 2022
32fb2a4
Properly support equals operations.
thomcom Jan 31, 2023
6b0c4c7
Merge branch 'branch-23.04' into feature/all_equals_operations
thomcom Feb 1, 2023
5996035
Merge branch 'branch-23.04' into feature/all_equals_operations
thomcom Feb 14, 2023
049b15f
Resolve merge of geoseries.py
thomcom Feb 14, 2023
4356f24
Put GeoColumnAccessors back inside of GeoSeries.
thomcom Feb 14, 2023
9dffa70
Clean up test_pip_only_binpreds.py
thomcom Feb 14, 2023
654c28e
Discovered a bug with larger sets of points.
thomcom Feb 14, 2023
212881e
Cleanup and drop 31 points
thomcom Feb 15, 2023
61a4108
Rename binops to binpreds.
thomcom Mar 8, 2023
5416e90
Merge branch 'branch-23.04' into feature/all_equals_operations
thomcom Mar 8, 2023
96666a5
Update copyright dates.
thomcom Mar 8, 2023
21f694c
Getting started with _ sort_polygons. Needs to have the closed edge r…
thomcom Mar 9, 2023
7854069
Dropping first point.
thomcom Mar 15, 2023
10c0498
Merge branch 'branch-23.04' into feature/all_equals_operations
thomcom Mar 17, 2023
5278b5e
Convert polygons to multipoints without their extra endpoint for sort…
thomcom Mar 17, 2023
dbcb617
Refactor out messy _sort_linestrings.
thomcom Mar 17, 2023
4a5ff80
cleaning up
thomcom Mar 17, 2023
ce3aa49
Remove polygon related equals operations.
thomcom Mar 17, 2023
d88c33b
Handle a review comment about wording.
thomcom Mar 17, 2023
3a7e04f
Trying to resolve docs formatting error.
thomcom Mar 17, 2023
b7d4d17
I think I've fixed the docs ci issue.
thomcom Mar 19, 2023
01789d7
Skip NotImplemented tests.
thomcom Mar 20, 2023
b53fb6c
Refactor equals binpred to handle equality of linestrings, which requ…
thomcom Mar 21, 2023
a9e6563
Rename dtypes.
thomcom Mar 21, 2023
41ead90
Extra overlaps and crosses tests
thomcom Mar 21, 2023
c7e3686
Merge branch 'branch-23.04' into feature/all_equals_operations
thomcom Mar 22, 2023
2b857db
Handle review comments.
thomcom Mar 22, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
272 changes: 265 additions & 7 deletions python/cuspatial/cuspatial/core/binpreds/binpreds.py
Original file line number Diff line number Diff line change
@@ -1,19 +1,35 @@
# Copyright (c) 2022, NVIDIA CORPORATION.
# Copyright (c) 2022-2023, NVIDIA CORPORATION.

from abc import ABC, abstractmethod

import cupy as cp

import cudf

from cuspatial.core._column.geocolumn import GeoColumn
from cuspatial.core.binpreds.contains import contains_properly
from cuspatial.utils.column_utils import (
contains_only_linestrings,
contains_only_multipoints,
contains_only_points,
contains_only_polygons,
has_same_geometry,
)


class PreprocessorOutput:
thomcom marked this conversation as resolved.
Show resolved Hide resolved
def __init__(self, coords, indices) -> None:
self.vertices = coords
self.indices = indices

@property
def xy(self):
return self.vertices

def point_indices(self):
return self.indices


class BinaryPredicate(ABC):
@abstractmethod
def preprocess(self, lhs, rhs):
Expand Down Expand Up @@ -43,15 +59,19 @@ def preprocess(self, lhs, rhs):
The right-hand-side of the internal binary predicate, may be
reordered.
"""
pass
return (lhs, rhs, cudf.RangeIndex(len(rhs)))

@abstractmethod
def postprocess(self, point_indices, point_result):
def postprocess(
self, point_indices: cudf.Series, point_result: cudf.Series
) -> cudf.Series:
"""Postprocess the output data for the binary predicate. This method
should be implemented by subclasses.

Postprocess converts the raw results of the binary predicate into
the final result. This is where the discrete math rules are applied.
the final result. At this step the results for none, any, and all
thomcom marked this conversation as resolved.
Show resolved Hide resolved
are applied to the result of the equals, intersects, and
point-in-polygon predicates.

Parameters
----------
Expand Down Expand Up @@ -114,6 +134,15 @@ def __init__(self, lhs, rhs, align=True):
(self.lhs, self.rhs) = lhs.align(rhs) if align else (lhs, rhs)
self.align = align

def _cancel_op(self, lhs, rhs, result):
"""Used to disable computation of the binary predicate.

This occurs when the binary predicate is not supported for the
input types, and a final result can be computed only using
`preprocess` and `postprocess`."""
self._op = lambda x, y: result
return (lhs, rhs, result)

def __call__(self) -> cudf.Series:
"""Return the result of the binary predicate."""
# Type disambiguation
Expand Down Expand Up @@ -185,18 +214,21 @@ def postprocess(self, point_indices, point_result):
correct type for the predicate."""
result = cudf.DataFrame({"idx": point_indices, "pip": point_result})
df_result = result
# Discrete math recombination
if (
contains_only_linestrings(self.rhs)
or contains_only_polygons(self.rhs)
or contains_only_multipoints(self.rhs)
):
# process for completed linestrings, polygons, and multipoints.
# Not necessary for points.
# Compute the set of results for each point-in-polygon predicate.
# Group them by the original index, and sum the results. If the
# sum of points in the rhs feature is equal to the number of
# points found in the polygon, then the polygon contains the
# feature.
df_result = (
result.groupby("idx").sum().sort_index()
== result.groupby("idx").count().sort_index()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need to count the number of points in a feature? Isn't it just the size of the feature, which is stored? (Constant time vs. linear time)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case code complexity is reduced by using the point_indices object that is an argument to postprocess. Instead of having to handle all types and combinations in each postprocess, the relevant indices for a given row of the GeoSeries, which vary across geometry types, are computed in the preprocess step and then carried through the BinaryPredicate operations.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it just the size of the feature, which is stored?

The size of each feature in the column isn't stored. But can probably be easily computed with offest[1:] - offset[:-1]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, you meant to compute "is all points of the feature contained in the polygon". Isn't it just a logical_or to PiP result of the same feature, aka a groupby max?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the result of BOOL8 is actually uint8, if you do sum and there are more than 256 points in the multipoint, will it overflow?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the preprocess step,

# RHS conditioning:
point_indices = None
# point in polygon
if contains_only_linestrings(rhs):
# condition for linestrings
geom = rhs.lines
elif contains_only_polygons(rhs) is True:
# polygon in polygon
geom = rhs.polygons
elif contains_only_multipoints(rhs) is True:
# mpoint in polygon
geom = rhs.multipoints
else:
# no conditioning is required
geom = rhs.points
xy_points = geom.xy
# Arrange into shape for calling point-in-polygon, intersection, or
# equals
point_indices = geom.point_indices()
from cuspatial.core.geoseries import GeoSeries
final_rhs = GeoSeries(
GeoColumn._from_points_xy(xy_points._column)
).points
return (lhs, final_rhs, point_indices)
, depending on rhs type the point values are extracted and flattened into a GeoSeries of points. At that time point_indices is also recorded. In order to not use point_indices here that logic would have to be reproduced in the postprocess as well.

)
# Convert the result to a GeoSeries.
point_result = cudf.Series(
df_result["pip"], index=cudf.RangeIndex(0, len(df_result))
)
Expand Down Expand Up @@ -234,3 +266,229 @@ def preprocess(self, lhs, rhs):
if contains_only_polygons(rhs):
(lhs, rhs) = (rhs, lhs)
return super().preprocess(lhs, rhs)

def postprocess(self, point_indices, point_result):
"""Postprocess the output GeoSeries to ensure that they are of the
correct type for the predicate."""
result = cudf.DataFrame({"idx": point_indices, "pip": point_result})
df_result = result
# Discrete math recombination
if (
contains_only_linestrings(self.rhs)
or contains_only_polygons(self.rhs)
or contains_only_multipoints(self.rhs)
):
# process for completed linestrings, polygons, and multipoints.
# Not necessary for points.
df_result = (
result.groupby("idx").sum().sort_index()
== result.groupby("idx").count().sort_index()
)
point_result = cudf.Series(
df_result["pip"], index=cudf.RangeIndex(0, len(df_result))
)
point_result.name = None
return point_result


class EqualsBinpred(BinaryPredicate):
def _offset_equals(self, lhs, rhs):
"""Compute the pairwise length equality of two offset arrays"""
lhs_lengths = lhs[:-1] - lhs[1:]
rhs_lengths = rhs[:-1] - rhs[1:]
return lhs_lengths == rhs_lengths

def _sort_multipoints(self, lhs, rhs, initial):
"""Sort xy according to bins defined by offset"""
sort_indices = cp.repeat(lhs.point_indices(), 2)
lhs_xy, rhs_xy = lhs.xy, rhs.xy
lhs_xy.index = sort_indices
rhs_xy.index = sort_indices
lhs_df = lhs_xy.reset_index(drop=False, name="xy")
rhs_df = rhs_xy.reset_index(drop=False, name="xy")
lhs_sorted = lhs_df.sort_values(by=["index", "xy"]).reset_index(
drop=True
)
rhs_sorted = rhs_df.sort_values(by=["index", "xy"]).reset_index(
drop=True
)
return (
PreprocessorOutput(lhs_sorted["xy"], lhs.point_indices()),
PreprocessorOutput(rhs_sorted["xy"], rhs.point_indices()),
initial,
)

def _sort_linestrings(self, lhs, rhs, initial):
"""Swap first and last values of each linestring to ensure that
the first point is the lowest value. This is necessary to ensure
that the endpoints are not included in the comparison."""
# Save temporary values since lhs.x cannot be used modified.
lhs_x = lhs.x
lhs_y = lhs.y
rhs_x = rhs.x
rhs_y = rhs.y
point_range = cudf.Series(cp.arange(len(lhs_x)))
indices = point_range.groupby(lhs.point_indices())
# Create masks for the first and last values of each linestring
swap_lx = lhs_x[indices.last()].reset_index(drop=True) < lhs_x[
indices.first()
].reset_index(drop=True)
swap_ly = lhs_y[indices.last()].reset_index(drop=True) < lhs_y[
indices.first()
].reset_index(drop=True)
swap_rx = rhs_x[indices.last()].reset_index(drop=True) < rhs_x[
indices.first()
].reset_index(drop=True)
swap_ry = rhs_y[indices.last()].reset_index(drop=True) < rhs_y[
indices.first()
].reset_index(drop=True)
# Swap the first and last values of each linestring
(
lhs_x.iloc[indices.last()[swap_lx]],
thomcom marked this conversation as resolved.
Show resolved Hide resolved
lhs_x.iloc[indices.first()[swap_lx]],
) = (
lhs_x.iloc[indices.first()[swap_lx]],
lhs_x.iloc[indices.last()[swap_lx]],
)
(
lhs_y.iloc[indices.last()[swap_ly]],
lhs_y.iloc[indices.first()[swap_ly]],
) = (
lhs_y.iloc[indices.first()[swap_ly]],
lhs_y.iloc[indices.last()[swap_ly]],
)
(
rhs_x.iloc[indices.last()[swap_rx]],
rhs_x.iloc[indices.first()[swap_rx]],
) = (
rhs_x.iloc[indices.first()[swap_rx]],
rhs_x.iloc[indices.last()[swap_rx]],
)
(
rhs_y.iloc[indices.last()[swap_ry]],
rhs_y.iloc[indices.first()[swap_ry]],
) = (
rhs_y.iloc[indices.first()[swap_ry]],
rhs_y.iloc[indices.last()[swap_ry]],
)
# Reconstruct the xy columns
lhs_xy = lhs.xy
rhs_xy = rhs.xy
lhs_xy.iloc[::2] = lhs_x
lhs_xy.iloc[1::2] = lhs_y
rhs_xy.iloc[::2] = rhs_x
rhs_xy.iloc[1::2] = rhs_y

return (
PreprocessorOutput(lhs_xy, lhs.point_indices()),
PreprocessorOutput(rhs_xy, rhs.point_indices()),
initial,
)

def preprocess(self, lhs, rhs):
# Compare types
type_compare = lhs.dtype == rhs.dtype
# Any unmatched type is not equal
if (type_compare == False).all(): # noqa: E712
isVoid marked this conversation as resolved.
Show resolved Hide resolved
# Override _op so that it will not be run.
return self._cancel_op(lhs, rhs, type_compare)
# Get indices of matching types
if contains_only_multipoints(lhs):
lengths_equal = self._offset_equals(
lhs.multipoints.geometry_offset,
rhs.multipoints.geometry_offset,
)
if lengths_equal.any():
# Multipoints are equal if they contains the
# same unordered points.
return self._sort_multipoints(
lhs[lengths_equal].multipoints,
rhs[lengths_equal].multipoints,
lengths_equal,
)
else:
# No lengths are equal, so none can be equal.
return self._cancel_op(lhs, rhs, lengths_equal)
elif contains_only_linestrings(lhs):
lengths_equal = self._offset_equals(
lhs.lines.part_offset, rhs.lines.part_offset
)
if lengths_equal.any():
# Linestrings are equal if their sorted points
# are equal. This is unintuitive and perhaps
# incorrect, but it is the behavior of shapely.
thomcom marked this conversation as resolved.
Show resolved Hide resolved
return self._sort_linestrings(
lhs[lengths_equal].lines,
rhs[lengths_equal].lines,
lengths_equal,
)
else:
return self._cancel_op(lhs, rhs, lengths_equal)
elif contains_only_polygons(lhs):
geoms_equal = self._offset_equals(
lhs.polygons.part_offset, rhs.polygons.part_offset
)
lengths_equal = self._offset_equals(
lhs[geoms_equal].polygons.ring_offset,
rhs[geoms_equal].polygons.ring_offset,
)
if lengths_equal.any():
isVoid marked this conversation as resolved.
Show resolved Hide resolved
# Don't sort polygons
return (
lhs[lengths_equal].polygons,
rhs[lengths_equal].polygons,
lengths_equal,
)
else:
return self._cancel_op(lhs, rhs, lengths_equal)
elif contains_only_points(lhs):
return (lhs.points, rhs.points, type_compare)

def postprocess(self, lengths_equal, point_result):
# if point_result is not a Series, preprocessing terminated
# the results early.
if isinstance(point_result, cudf.Series):
point_result = point_result.sort_index()
lengths_equal[point_result.index] = point_result
return cudf.Series(lengths_equal)

def _vertices_equals(self, lhs, rhs):
"""Compute the equals relationship between interleaved xy
coordinate buffers."""
length = min(len(lhs), len(rhs))
a = lhs[:length:2]._column == rhs[:length:2]._column
b = rhs[1:length:2]._column == lhs[1:length:2]._column
return a & b

def _op(self, lhs, rhs):
indices = lhs.point_indices()
result = self._vertices_equals(lhs.xy, rhs.xy)
result_df = cudf.DataFrame(
{"idx": indices[: len(result)], "equals": result}
)
gb_idx = result_df.groupby("idx")
result = (gb_idx.sum().sort_index() == gb_idx.count().sort_index())[
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this doing?

Copy link
Contributor Author

@thomcom thomcom Mar 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$a_i == b_i \leftrightarrow \forall j$ in len(a) $a_j == b_j$

"equals"
]
result.index.name = None
result.name = None
return result


class CrossesBinpred(EqualsBinpred):
thomcom marked this conversation as resolved.
Show resolved Hide resolved
def postprocess(self, point_indices, point_result):
if has_same_geometry(self.lhs, self.rhs) and contains_only_points(
self.lhs
):
return cudf.Series([False] * len(self.lhs))
df_result = cudf.DataFrame({"idx": point_indices, "pip": point_result})
point_result = cudf.Series(
df_result["pip"], index=cudf.RangeIndex(0, len(df_result))
)
point_result.name = None
return point_result


class CoversBinpred(EqualsBinpred):
thomcom marked this conversation as resolved.
Show resolved Hide resolved
def postprocess(self, point_indices, point_result):
return cudf.Series(point_result, index=point_indices)
Loading