BUG: nan segfault in KDTree, reject non-finite input #18230

tylerjereddy · 2023-04-01T21:03:17Z

Fixes #18223 (BUG: cKDTree segmentation faults when NaN input and balanced_tree=False, compact_nodes=False), though I'd prefer not to do this (see discussion below)
add regression test + fix for the above issue; in short, don't allow nan to be assigned as a maximum or minimum value array prior to tree construction logic when the array has size greater than 1
this is an improvement in the sense of segfault avoidance, though it doesn't necessarily guarantee that we get the correct tree/attributes when nan is present (indeed, the shim is pretty darn strange, see note in code...)
I'm a bit hesitant to add a bunch of nan support here generally though; for example, what does it even mean if a 3D coordinate has "x" as np.nan, but regular floats for "y" and "z?" I doubt we handle all of these kinds of scenarios "correctly," whatever correctly would even mean here
this really is a rabbit hole I think, there's a similar unresolved discussion for nan handling in the conceptually similar pdist/cdist usage here, unresolved after 9 years:
pdist/cdist with missing values #3870
I think I'm actually tempted to error out when nan is present, rather than maintaining a bunch of code that might otherwise have to temporarily mask out the nans and then recalculate indices before/after the C++ code or whatever, but this likely requires some discussion...

@sturlamolden @peterbell10 thoughts? I suspect you'll agree this is a rabbit hole, but I'm curious whether you think that, e.g., erroring out when np.nan is present might be desirable long-term?
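To illustrate why a NaN in the min/max bookkeeping is a problem, here is a minimal pure-Python sketch. It is not the actual cKDTree C++ code; `running_bounds` is a hypothetical helper that only mimics a naive bounds pass built on `<` and `>` comparisons.

```python
import math


def running_bounds(values):
    """Track min/max with plain comparisons (hypothetical helper)."""
    lo = hi = values[0]
    for v in values[1:]:
        if v < lo:
            lo = v
        if v > hi:
            hi = v
    return lo, hi


nan = float("nan")

# NaN in the middle is silently skipped: both comparisons are False.
lo_mid, hi_mid = running_bounds([1.0, nan, 3.0])

# NaN in first position "sticks": nothing compares below or above it.
lo_first, hi_first = running_bounds([nan, 1.0, 3.0])

print(lo_mid, hi_mid)
print(lo_first, hi_first)
```

The same three values give different bounds depending only on where the NaN sits, which is one way the resulting tree can end up inconsistent.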

tylerjereddy · 2023-04-01T21:05:50Z

Also, I know I talked about KDTree stuff with @thomasjpfan, I wonder if this np.nan handling has come up for scikit-learn, which also has KDTree?

WarrenWeckesser · 2023-04-01T21:26:01Z

> I think I'm actually tempted to error out when nan is present,

This seems like the sanest approach, especially since the current behavior with nan input is undefined. If there is demand, an option to specify how to handle nan can be added in the future (along the lines of nan_policy in the stats functions).

sturlamolden · 2023-04-02T06:52:01Z

In my opinion…

A nan in kd-tree data or query data should probably generate an error.
An inf should have the meaning "far away from anything" unless a rectangular box is used, which indicates toroidal space (used in cosmology). In that case an inf would actually be a line, not a point. Until this is implemented and bulletproofed, we should just bail out with an error here as well.

sturlamolden · 2023-04-02T07:01:49Z

The reason it becomes a line when space is toroidal is that, to reach that point, we have to circumnavigate the universe (along the dimension with an inf) an infinite number of times. So the point infinitely far away is actually any point on that circumnavigation line.
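A small sketch of the wrap-around distance may help here. It assumes "toroid space" means a periodic box (which, as far as I know, is what the `boxsize` argument of KDTree selects); `periodic_distance_1d` is a hypothetical helper for one axis, not SciPy code.

```python
def periodic_distance_1d(a, b, box):
    """Shortest separation of a and b on a circle of circumference `box`.

    Assumes finite inputs and box > 0.
    """
    d = abs(a - b) % box
    return min(d, box - d)


# Going "around the back" of a box of size 10 is shorter than going across.
wrapped = periodic_distance_1d(1.0, 9.0, 10.0)
direct = periodic_distance_1d(2.0, 5.0, 10.0)
print(wrapped, direct)
```

Every finite coordinate wraps to a definite position in the box, so its distance is well defined. An infinite coordinate has no such position, which is the extra layer of complexity described above.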

sturlamolden · 2023-04-02T07:18:49Z

But for now, just scan the data for non-finite values and raise ValueError accordingly.
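A minimal sketch of such a scan, written in plain Python so it stands alone. `require_finite` is a hypothetical name; with NumPy arrays the equivalent test would be along the lines of `np.isfinite(data).all()`.

```python
import math


def require_finite(points):
    """Raise ValueError if any coordinate is NaN or +/-inf (hypothetical)."""
    for i, point in enumerate(points):
        for j, value in enumerate(point):
            if not math.isfinite(value):
                raise ValueError(
                    f"non-finite value {value!r} at point {i}, axis {j}"
                )
    return points


require_finite([(0.0, 1.0), (2.0, 3.0)])  # passes silently

try:
    require_finite([(0.0, 1.0), (float("nan"), 3.0)])
except ValueError as exc:
    message = str(exc)
    print(message)
```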

* KDTree now raises a ValueError when provided with non-finite data, per discussion in scipygh-18230
* adjust regression tests accordingly

tylerjereddy · 2023-04-04T23:33:27Z

Based on discussion above, I've revised KDTree to raise a ValueError when provided with non-finite data, for now.

@anntzer note that this changes the behavior you added in gh-16242--hopefully you'll agree that full-blown np.nan/np.inf support isn't ready at this time.

anntzer · 2023-04-05T07:20:37Z

I would be quite annoyed if support for nans was removed in the default compact_nodes=True, balanced_tree=True case as well (which, I guess(?), works), but I don't really have the bandwidth to look more into this right now, so just go ahead with what you think is best and I'll pin scipy as needed.

> I'm a bit hesitant to add a bunch of nan support here generally though; for example, what does it even mean if a 3D coordinate has "x" as np.nan, but regular floats for "y" and "z?" I doubt we handle all of these kinds of scenarios "correctly," whatever correctly would even mean here

I think nan is like inf here: a point where any coordinate is nan is infinitely far away from any other one (as dist(x, x') < dmax is always false -- yes, I realize that dist(x, x') >= dmax is also always false...).
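The comparison behavior mentioned in that parenthetical can be checked directly. This is only a sketch using the standard library's `math.dist` (Python 3.8+), not the distance code inside the tree; the coordinates and `dmax` are made up.

```python
import math

nan = float("nan")

# A point with one NaN coordinate against an ordinary point.
d = math.dist((nan, 0.0, 0.0), (1.0, 2.0, 3.0))
dmax = 5.0

within = d < dmax    # False: the point never counts as a neighbor
beyond = d >= dmax   # also False: it never counts as "too far" either

print(d, within, beyond)
```

Code that prunes with `d >= dmax` and code that accepts with `d < dmax` would therefore disagree about the same point.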

sturlamolden · 2023-04-05T08:09:09Z

A nan in a kd-tree makes no sense, mathematically speaking. It is wrong to say we currently have support for nans, because there is no way to support them.

An inf we can deal with, but if the space is toroid then an inf adds another layer of complexity.
A nan is nothing like an inf when it comes to spatial data.

There is the question of backwards compatibility, though. KDTree has been around for a while. Maybe someone relies on the bogus behavior?

However, I am not convinced anyone should rely on any particular behavior when it comes to feeding a nan into a kd-tree. Currently it is garbage in, garbage out.

anntzer · 2023-04-06T13:40:19Z

> Maybe someone relies on the bogus behavior?

At least I assumed (perhaps wrongly) that nans were treated as being infinitely far away from everyone else.

dopplershift · 2023-04-06T17:58:01Z

In the past I have blindly sent nans in and gotten clearly unspecified behavior (from "doesn't crash" to "crashes" #14527). While I'm sure the ValueError will catch some people, it's much better than getting behavior that is unspecified and might not even be actually working correctly.
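For callers in this situation, one possible workaround is to drop non-finite rows before building the tree and keep a map back to the original row numbers (the "mask out the nans and then recalculate indices" idea from the PR description, done on the caller's side). A rough pure-Python sketch; `drop_non_finite` is a hypothetical helper, not part of SciPy.

```python
import math


def drop_non_finite(points):
    """Return (finite rows, their original row numbers). Hypothetical."""
    kept = []
    original_index = []
    for i, point in enumerate(points):
        if all(math.isfinite(c) for c in point):
            kept.append(point)
            original_index.append(i)
    return kept, original_index


raw = [(0.0, 0.0), (float("nan"), 1.0), (2.0, 2.0), (3.0, float("inf"))]
kept, original_index = drop_non_finite(raw)

# A neighbor reported as row k of `kept` is row original_index[k] of `raw`.
print(kept)
print(original_index)
```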

peterbell10

Would it be worth adding a check_finite=True argument to the constructor? This would let users who need backward compatibility opt out of the check, while still making clear that non-finite values aren't supported.
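A sketch of what such an opt-out flag could look like. `FinitePoints` is a hypothetical stand-in for a tree constructor, and nothing in this thread confirms that a `check_finite` argument was actually added to KDTree.

```python
import math


class FinitePoints:
    """Hypothetical container standing in for a tree constructor."""

    def __init__(self, data, check_finite=True):
        if check_finite:
            for row in data:
                if not all(math.isfinite(v) for v in row):
                    raise ValueError("data must be finite; found NaN or inf")
        # With check_finite=False the caller accepts undefined behavior.
        self.data = [tuple(row) for row in data]


strict_error = None
try:
    FinitePoints([(0.0, float("nan"))])
except ValueError as exc:
    strict_error = exc

lenient = FinitePoints([(0.0, float("nan"))], check_finite=False)
print(strict_error, len(lenient.data))
```

Defaulting the flag to True keeps the safe behavior for everyone who does not ask for the old one explicitly.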

scipy/spatial/_ckdtree.pyx

anntzer · 2023-04-11T21:32:30Z

If they are clearly not supported, I'd rather get an actual error out rather than nonsense; I was simply under the (mistaken?) impression that they are.

tylerjereddy · 2023-04-28T18:35:40Z

I think there's a bit of debate here, but I believe the outcome is that we really do need to raise an error on non-finite input until we have a solid way to handle it. I'll probably make the small change Peter suggested, but not much else beyond that I don't think. Sounds like breaking backwards compatibility is "ok" because we probably should never have allowed the non-finite input in the first place.

sturlamolden · 2023-04-29T03:02:26Z

> If they are clearly not supported, I'd rather get an actual error out rather than nonsense; I was simply under the (mistaken?) impression that they are.

It is not just that you get nonsense returned. You may also get a segfault.

tylerjereddy · 2023-04-30T19:02:06Z

I pushed in the revision requested by Peter, and the tests did pass for me locally after rebasing on latest main. I'm expecting to see a few unrelated CI failures based on recent PRs.

Let me know if there's anything else we want to adjust here.

* style fix on the KDTree finite input check based on reviewer feedback

tylerjereddy · 2023-05-20T19:50:00Z

Although the CI failures appeared to be unrelated, there were tons of them, so I've tried a rebase/force push to see if it's cleaner now.

sturlamolden · 2023-05-20T20:13:39Z

We need an isfinite check on query vectors as well, not just the kd-tree data.

tylerjereddy · 2023-05-21T18:14:25Z

> We need an isfinite check on query vectors as well, not just the kd-tree data.

I think I'll open a follow-up issue for that and use release manager discretion a bit to merge this in to guard against at least the original segfault before we branch. Hopefully we guard against the queries as well before branching.

tylerjereddy · 2023-05-21T18:15:03Z

2/3 failing CI tests I've seen in other unrelated PRs, and the optimize prerelease failure doesn't look related either (I hope... didn't see it locally...).

tylerjereddy added the defect and scipy.spatial labels Apr 1, 2023

tylerjereddy requested a review from peterbell10 as a code owner April 1, 2023 21:03

tylerjereddy added a commit to tylerjereddy/scipy that referenced this pull request Apr 4, 2023

MAINT: PR 18230 revisions

66e3b8a

* KDTree now raises a ValueError when provided with non-finite data, per discussion in scipygh-18230
* adjust regression tests accordingly

peterbell10 reviewed Apr 11, 2023

scipy/spatial/_ckdtree.pyx (outdated)

rgommers mentioned this pull request Apr 12, 2023

RFC: SciPy array types & libraries support #18286

Open

tylerjereddy added the backport-candidate label Apr 29, 2023

tylerjereddy added a commit to tylerjereddy/scipy that referenced this pull request Apr 30, 2023

MAINT: PR 18230 revisions

f22261d

* KDTree now raises a ValueError when provided with non-finite data, per discussion in scipygh-18230
* adjust regression tests accordingly

tylerjereddy force-pushed the treddy_issue_18223 branch from 66e3b8a to 78355a8 Compare April 30, 2023 18:58

tylerjereddy changed the title from "WIP, BUG: nan segfault in KDTree" to "BUG: nan segfault in KDTree, reject non-finite input" Apr 30, 2023

tylerjereddy added this to the 1.11.0 milestone Apr 30, 2023

tylerjereddy closed this May 20, 2023

tylerjereddy reopened this May 20, 2023

tylerjereddy added 2 commits May 20, 2023 13:45

MAINT: PR 18230 revisions

746d76b

* KDTree now raises a ValueError when provided with non-finite data, per discussion in scipygh-18230
* adjust regression tests accordingly

MAINT: PR 18230 revisions

a897f5c

* style fix on the KDTree finite input check based on reviewer feedback

tylerjereddy force-pushed the treddy_issue_18223 branch from 78355a8 to a897f5c Compare May 20, 2023 19:49

tylerjereddy merged commit 7921d48 into scipy:main May 21, 2023

tylerjereddy deleted the treddy_issue_18223 branch May 21, 2023 18:15

tylerjereddy mentioned this pull request May 21, 2023

MAINT, BUG: guard against non-finite kd-tree queries #18497

Closed

martinfleis mentioned this pull request Jun 30, 2023

BUG: cKDtree.query no longer accepts DataFrame as input #18800

Closed

tylerjereddy removed the backport-candidate label Jun 30, 2023
