-
-
Notifications
You must be signed in to change notification settings - Fork 5.4k
BUG: nan segfault in KDTree, reject non-finite input #18230
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Also, I know I talked about KDTree stuff with @thomasjpfan, I wonder if this |
This seems like the sanest approach, especially since the current behavior with |
In my opionion…
|
The reason why it becomes a line when space is toroid, is because to reach that point we have to circumnavigate the universe (along the dimension with an inf) an infinite number of times. So the point infinitely far away is actually any point on that circumnavigation line. |
But for now, just scan the data for non-finite values and raise ValueError accordingly. |
* KDTree now raises a ValueError when provided with non-finite data, per discussion in scipygh-18230 * adjust regression tests accordingly
I would be quite annoyed if support for nans was removed in the default compact_nodes=True, balanced_tree=True case as well (which, I guess(?), works), but I don't really have the bandwidth to look more into this right now, so just go ahead with what you think it's best and I'll pin scipy as needed.
I think nan is like inf here: a point where any coordinate is nan is infinitely far away from any other one (as |
A nan in a kd-tree makes no sense, mathematically speaking. It is wrong to say we currently have support for nans, because there is no way to support them. An inf we can deal with, but if the space is toroid then an inf adds another layer of complexity. There is the question of backwards compatibility, though. KDTree has been around for a while. Maybe someone rely on the bogus behavior? However I am not convinced anyone should rely on any particularly behavior when it comes to feeding a nan into a kd-tree. Currently it is garbage in - garbage out. |
At least I assumed (perhaps wrongly) that nans were treated as being infinitely far away from everyone else. |
In the past I have blindly sent nans in and gotten clearly unspecified behavior (from "doesn't crash" to "crashes" #14527). While I'm sure the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be worth adding a check_finite=True
argument to the constructor? This would allow someone to preserve backward compatibility but also acknowledge that non-finites aren't supported.
If they are clearly not supported, I'd rather get an actual error out rather than nonsense; I was simply under the (mistaken?) impression that they are. |
I think there's a bit of debate here, but I believe the outcome is that we really do need to raise an error on non-finite input until we have a solid way to handle it. I'll probably make the small change Peter suggested, but not much else beyond that I don't think. Sounds like breaking backwards compatibility is "ok" because we probably should never have allowed the non-finite input in the first place. |
It is not just that you get nonsense returned. You may also get a segfault. |
* KDTree now raises a ValueError when provided with non-finite data, per discussion in scipygh-18230 * adjust regression tests accordingly
66e3b8a
to
78355a8
Compare
I pushed in the revision requested by Peter, and the tests did pass for me locally after rebasing on latest Let me know if there's anything else we want to adjust here. |
* Fixes scipy#18223, though I'd prefer not do to this (see discussion below) * add regression test + fix for the above issue; in short, don't allow `nan` to be assigned as a maximum or minimum value array prior to tree construction logic when the array has size greater than 1 * this is an improvement in the sense of segfault avoidance, though it doesn't necessarily guarantee that we get the correct tree/attributes when `nan` is present (indeed, the shim is pretty darn strange, see note in code...) * I'm a bit hesitant to add a bunch of `nan` support here generally though; for example, what does it even mean if a 3D coordinate has "x" as np.nan, but regular floats for "y" and "z?" I doubt we handle all of these kinds of scenarios "correctly," whatever correctly would even mean here * this really is a rabbit hole I think, there's a similar unresolved discussion for `nan` handling in the conceptually similar `pdist/cdist` usage here, unresolved after 9 years: scipy#3870 * I think I'm actually tempted to error out when `nan` is present, rather than maintaining a bunch of code that might otherwise have to temporarily mask out the nans and then recalculate indices before/afer the C++ code or whatever, but this likely requires some discussion...
* KDTree now raises a ValueError when provided with non-finite data, per discussion in scipygh-18230 * adjust regression tests accordingly
* style fix on the KDTree finite input check based on reviewer feedback
78355a8
to
a897f5c
Compare
Although the CI failures appeared to be unrelated, there were tons of them, so I've tried rebase/force push to see if cleaner now. |
We need an isfinite ckeck on query vectors as well, not just the kd-tree data. |
I think I'll open a follow-up issue for that and use release manager discretion a bit to merge this in to guard against at least the original segfault before we branch. Hopefully we guard against the queries as well before branching. |
2/3 failing CI tests I've seen in other unrelated PRs, and the |
Fixes BUG: cKDTree segmentation faults when NaN input and balanced_tree=False, compact_nodes=False #18223, though I'd prefer not do to this (see discussion below)
add regression test + fix for the above issue; in short, don't allow
nan
to be assigned as a maximum or minimum value array prior to tree construction logic when the array has size greater than 1this is an improvement in the sense of segfault avoidance, though it doesn't necessarily guarantee that we get the correct tree/attributes when
nan
is present (indeed, the shim is pretty darn strange, see note in code...)I'm a bit hesitant to add a bunch of
nan
support here generally though; for example, what does it even mean if a 3D coordinate has "x" as np.nan, but regular floats for "y" and "z?" I doubt we handle all of these kinds of scenarios "correctly," whatever correctly would even mean herethis really is a rabbit hole I think, there's a similar unresolved discussion for
nan
handling in the conceptually similarpdist/cdist
usage here, unresolved after 9 years:pdist/cdist with missing values #3870
I think I'm actually tempted to error out when
nan
is present, rather than maintaining a bunch of code that might otherwise have to temporarily mask out the nans and then recalculate indices before/after the C++ code or whatever, but this likely requires some discussion...@sturlamolden @peterbell10 thoughts? I suspect you'll agree this is a rabbit hole, but curious if you'd think i.e., erroring out with
np.nan
present might be desirable long-term?