-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DEPR: deprecate get_values #26409
DEPR: deprecate get_values #26409
Conversation
This comment has been minimized.
This comment has been minimized.
(wanted to make this a draft PR, but forgot and I don't see a way to change the status once created) |
This comment has been minimized.
This comment has been minimized.
Agree with putting "problem" in quotes. As you show, this is surprisingly hard to use correctly. I'd be happy to see it go away. |
Well, that's the part I didn't do yet :-) But the problem is that I suppose in some cases, this difference might not matter, and we can replace it with eg |
you need to be extremely careful here as this may have perf implications |
Do those two aspects not go together? We try to change it to Note that it is actually the "just change to use As explained above, the current |
pandas/core/arrays/categorical.py
Outdated
@@ -1505,6 +1505,9 @@ def get_values(self): | |||
A numpy array of the same dtype as categorical.categories.dtype or | |||
Index if datetime / periods. | |||
""" | |||
raise Exception("USING GET_VALUES") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I be concerned that CI is passing with this here? :)
Perhaps we need a test ensuring that Categorical.get_values()
raises a warning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I be concerned that CI is passing with this here? :)
No, that's a good sign :)
As that was my way to find all the places where it was being called, to fix those (although it is not a perfect way, as in certain places the error could also be catched)
But for sure still need to convert this into a warning and add tests that checks that.
OK, to get this into a merge-able state, I see currently two options: I think we need to keep for now the functionality of "get_values" (which means: give back some numpy array, special cases behaviour depending on the dtype, see above). So we can:
I think I personally like the first one better, but that has possible the complication that we need to repeat the logic in the json C code (didn't check that in detail though). And also, in general we always have this logic on the object itself (eg the existing |
slightly annoying, but you could call |
@@ -177,6 +177,9 @@ def get_values(self, dtype=None): | |||
return self.values.astype(object) | |||
return self.values | |||
|
|||
def get_block_values(self, dtype=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we really need this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this still needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, the JSON C code does handle blocks (if I remove the get_block_values
part in objToJSON.C that I introduced in this PR, a couple tests fail).
I could name this here the same as for Series/INdex (i.e. _internal_get_values
), but I prefer a distinct name to make it clear that the json code is handling blocks and not series/index (and that also makes it clear that all other places where _internal_get_values
is used is not handling blocks). That will also make it easier to isolate and try to remove the block handling in the json C code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Opened #27164 for follow-up on this JSON issue
@@ -162,15 +162,15 @@ def test_values(self): | |||
|
|||
exp = np.array([], dtype=np.object) | |||
tm.assert_numpy_array_equal(idx.values, exp) | |||
tm.assert_numpy_array_equal(idx.get_values(), exp) | |||
#tm.assert_numpy_array_equal(idx.get_values(), exp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove and/or use filterwarnings on the test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lol, this is what's failing :_.>
looks good, just some checks failing. |
Generally a big win, especially in sparse. Can we think of a better name than internal values? That doesn’t do much to tell me what it’s for. |
“_dense_values”? |
_lossless_values? If I’m writing an EA and need to implement it, what would this be for eg RangeArray? |
EA's don't have to implement this. Currently in master, Categorical is the only one that does. In general, the |
It's not fully lossless (eg for categorical you loose the categories and orderedness).
I hope to at some point get rid of it. But of course, that's maybe not for the short term. |
As said above (#26409 (comment)), I can also try to get rid of it entirely by concentrating the special case logic in |
The issue about getting a "dense" array (but preserving dtype) of a Categorical is #26410 |
@jreback do you want to do another check, or is this good to go? (the one failure is flaky test) |
lgtm but a check is failing? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, question. can you create a followup issue to actually see what we can fix of the remaining uses (unless you already did).
@@ -177,6 +177,9 @@ def get_values(self, dtype=None): | |||
return self.values.astype(object) | |||
return self.values | |||
|
|||
def get_block_values(self, dtype=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this still needed?
Travis failure is the pyyaml issue, so ignoring that one. |
Closes #19617
Reviving an old branch I had lying around. So this was more a quick experiment to see what it involves. The main "problem" is that we ourselves use
get_values
in a bunch of places (egvalues_from_object
which is massively used).The current situation:
DataFrame.get_values
simply points to.values
so is (I think?) always a numpy ndarray.Series.get_values
points to_data.get_values()
(the SingleBlockManager), which in turn doesnp.array(self._block.to_dense(), copy=False)
.So the
Block.to_dense()
in general returns a view on the array, although for ExtensionBlock it doesnp.array(self.values)
, with the exception for DatetimeTZBlock, which ensures a non-object array withnp.asarray(self.values, dtype=_NS_DTYPE)
, and for CategoricalBlock, it returns the Categorical.get_values which can be a DatetimeIndex.Index.get_values
points to.values
, so is a numpy ndarray with the exception of:CategoricalIndex
overrides the parent method to return theget_values
of the underlying Categorical, which in turn mostly returns an ndarray, but an EA for interval-backed dtypes and DatetimeIndex for datetimetz aware backed categorical.Summary: a complete mess :-)
This PR is for now just an experiment to see where it is used. Now, we should see for all those places what would be the best replacement (unless we are fine with keeping such a
_internal_get_values
, but it would be good to clean this up somewhat ..)