Skip to content

BUG: nan segfault in KDTree, reject non-finite input #18230

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
May 21, 2023

Conversation

tylerjereddy
Copy link
Contributor

  • Fixes BUG: cKDTree segmentation faults when NaN input and balanced_tree=False, compact_nodes=False #18223, though I'd prefer not do to this (see discussion below)

  • add regression test + fix for the above issue; in short, don't allow nan to be assigned as a maximum or minimum value array prior to tree construction logic when the array has size greater than 1

  • this is an improvement in the sense of segfault avoidance, though it doesn't necessarily guarantee that we get the correct tree/attributes when nan is present (indeed, the shim is pretty darn strange, see note in code...)

  • I'm a bit hesitant to add a bunch of nan support here generally though; for example, what does it even mean if a 3D coordinate has "x" as np.nan, but regular floats for "y" and "z?" I doubt we handle all of these kinds of scenarios "correctly," whatever correctly would even mean here

  • this really is a rabbit hole I think, there's a similar unresolved discussion for nan handling in the conceptually similar pdist/cdist usage here, unresolved after 9 years:
    pdist/cdist with missing values #3870

  • I think I'm actually tempted to error out when nan is present, rather than maintaining a bunch of code that might otherwise have to temporarily mask out the nans and then recalculate indices before/after the C++ code or whatever, but this likely requires some discussion...

@sturlamolden @peterbell10 thoughts? I suspect you'll agree this is a rabbit hole, but curious if you'd think i.e., erroring out with np.nan present might be desirable long-term?

@tylerjereddy tylerjereddy added defect A clear bug or issue that prevents SciPy from being installed or used as expected scipy.spatial labels Apr 1, 2023
@tylerjereddy
Copy link
Contributor Author

Also, I know I talked about KDTree stuff with @thomasjpfan, I wonder if this np.nan handling has come up for scikit-learn, which also has KDTree?

@WarrenWeckesser
Copy link
Member

I think I'm actually tempted to error out when nan is present,

This seems like the sanest approach, especially since the current behavior with nan input is undefined. If there is demand, an option to specify how to handle nan can be added in the future (along the lines of nan_policy in the stats functions).

@sturlamolden
Copy link
Contributor

In my opionion…

  • A nan in kd-tree data or query data should probably generate an error.

  • An inf should have the meaning ”far away from anything” unless a rectangular box is used, which indicates toroid space (used in cosmology). In that case an inf would actually be a line, not a point. Until this is implemented and bullet proofed, we should just bail out with an error here as well.

@sturlamolden
Copy link
Contributor

sturlamolden commented Apr 2, 2023

The reason why it becomes a line when space is toroid, is because to reach that point we have to circumnavigate the universe (along the dimension with an inf) an infinite number of times. So the point infinitely far away is actually any point on that circumnavigation line.

@sturlamolden
Copy link
Contributor

But for now, just scan the data for non-finite values and raise ValueError accordingly.

tylerjereddy added a commit to tylerjereddy/scipy that referenced this pull request Apr 4, 2023
* KDTree now raises a ValueError when
provided with non-finite data, per discussion
in scipygh-18230

* adjust regression tests accordingly
@tylerjereddy
Copy link
Contributor Author

Based on discussion above, I've revised KDTree to raise a ValueError when provided with non-finite data, for now.

@anntzer note that this changes the behavior you added in gh-16242--hopefully you'll agree that full-blown np.nan/np.inf support isn't ready at this time.

@anntzer
Copy link
Contributor

anntzer commented Apr 5, 2023

I would be quite annoyed if support for nans was removed in the default compact_nodes=True, balanced_tree=True case as well (which, I guess(?), works), but I don't really have the bandwidth to look more into this right now, so just go ahead with what you think it's best and I'll pin scipy as needed.

I'm a bit hesitant to add a bunch of nan support here generally though; for example, what does it even mean if a 3D coordinate has "x" as np.nan, but regular floats for "y" and "z?" I doubt we handle all of these kinds of scenarios "correctly," whatever correctly would even mean here

I think nan is like inf here: a point where any coordinate is nan is infinitely far away from any other one (as dist(x, x') < dmax is always false -- yes, I realize that dist(x, x') >= dmax is also always false...).

@sturlamolden
Copy link
Contributor

A nan in a kd-tree makes no sense, mathematically speaking. It is wrong to say we currently have support for nans, because there is no way to support them.

An inf we can deal with, but if the space is toroid then an inf adds another layer of complexity.
A nan is nothing like an inf when it comes to spatial data.

There is the question of backwards compatibility, though. KDTree has been around for a while. Maybe someone rely on the bogus behavior?

However I am not convinced anyone should rely on any particularly behavior when it comes to feeding a nan into a kd-tree. Currently it is garbage in - garbage out.

@anntzer
Copy link
Contributor

anntzer commented Apr 6, 2023

Maybe someone rely on the bogus behavior?

At least I assumed (perhaps wrongly) that nans were treated as being infinitely far away from everyone else.

@dopplershift
Copy link

In the past I have blindly sent nans in and gotten clearly unspecified behavior (from "doesn't crash" to "crashes" #14527). While I'm sure the ValueError will catch some people, it's much better than getting behavior that is unspecified and might not even be actually working correctly.

Copy link
Member

@peterbell10 peterbell10 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be worth adding a check_finite=True argument to the constructor? This would allow someone to preserve backward compatibility but also acknowledge that non-finites aren't supported.

@anntzer
Copy link
Contributor

anntzer commented Apr 11, 2023

If they are clearly not supported, I'd rather get an actual error out rather than nonsense; I was simply under the (mistaken?) impression that they are.

@tylerjereddy
Copy link
Contributor Author

I think there's a bit of debate here, but I believe the outcome is that we really do need to raise an error on non-finite input until we have a solid way to handle it. I'll probably make the small change Peter suggested, but not much else beyond that I don't think. Sounds like breaking backwards compatibility is "ok" because we probably should never have allowed the non-finite input in the first place.

@sturlamolden
Copy link
Contributor

If they are clearly not supported, I'd rather get an actual error out rather than nonsense; I was simply under the (mistaken?) impression that they are.

It is not just that you get nonsense returned. You may also get a segfault.

@tylerjereddy tylerjereddy added the backport-candidate This fix should be ported by a maintainer to previous SciPy versions. label Apr 29, 2023
tylerjereddy added a commit to tylerjereddy/scipy that referenced this pull request Apr 30, 2023
* KDTree now raises a ValueError when
provided with non-finite data, per discussion
in scipygh-18230

* adjust regression tests accordingly
@tylerjereddy tylerjereddy changed the title WIP, BUG: nan segfault in KDTree BUG: nan segfault in KDTree, reject non-finite input Apr 30, 2023
@tylerjereddy tylerjereddy added this to the 1.11.0 milestone Apr 30, 2023
@tylerjereddy
Copy link
Contributor Author

I pushed in the revision requested by Peter, and the tests did pass for me locally after rebasing on latest main. I'm expecting to see a few unrelated CI failures based on recent PRs.

Let me know if there's anything else we want to adjust here.

* Fixes scipy#18223, though I'd prefer not do to this (see discussion
below)

* add regression test + fix for the above issue; in short,
don't allow `nan` to be assigned as a maximum or minimum
value array prior to tree construction logic when the array
has size greater than 1

* this is an improvement in the sense of segfault avoidance,
though it doesn't necessarily guarantee that we get the
correct tree/attributes when `nan` is present (indeed, the shim
is pretty darn strange, see note in code...)

* I'm a bit hesitant to add a bunch of `nan` support here generally
though; for example, what does it even mean if a 3D coordinate has
"x" as np.nan, but regular floats for "y" and "z?" I doubt we handle
all of these kinds of scenarios "correctly," whatever correctly would
even mean here

* this really is a rabbit hole I think, there's a similar unresolved
discussion for `nan` handling in the conceptually similar `pdist/cdist`
usage here, unresolved after 9 years:
scipy#3870

* I think I'm actually tempted to error out when `nan` is present,
rather than maintaining a bunch of code that might otherwise
have to temporarily mask out the nans and then recalculate indices
before/afer the C++ code or whatever, but this likely requires
some discussion...
* KDTree now raises a ValueError when
provided with non-finite data, per discussion
in scipygh-18230

* adjust regression tests accordingly
* style fix on the KDTree finite input
check based on reviewer feedback
@tylerjereddy
Copy link
Contributor Author

Although the CI failures appeared to be unrelated, there were tons of them, so I've tried rebase/force push to see if cleaner now.

@sturlamolden
Copy link
Contributor

We need an isfinite ckeck on query vectors as well, not just the kd-tree data.

@tylerjereddy
Copy link
Contributor Author

We need an isfinite ckeck on query vectors as well, not just the kd-tree data.

I think I'll open a follow-up issue for that and use release manager discretion a bit to merge this in to guard against at least the original segfault before we branch. Hopefully we guard against the queries as well before branching.

@tylerjereddy
Copy link
Contributor Author

2/3 failing CI tests I've seen in other unrelated PRs, and the optimize prelease failure doesn't look related either (I hope.. didn't see it locally...).

@tylerjereddy tylerjereddy merged commit 7921d48 into scipy:main May 21, 2023
@tylerjereddy tylerjereddy deleted the treddy_issue_18223 branch May 21, 2023 18:15
@tylerjereddy tylerjereddy removed the backport-candidate This fix should be ported by a maintainer to previous SciPy versions. label Jun 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
defect A clear bug or issue that prevents SciPy from being installed or used as expected scipy.spatial
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: cKDTree segmentation faults when NaN input and balanced_tree=False, compact_nodes=False
6 participants