-
Notifications
You must be signed in to change notification settings - Fork 623
Support for numpy.ndarray and pandas.Series with any python object as entry #4444
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Support for numpy.ndarray and pandas.Series with any python object as entry #4444
Conversation
- Use `.iat` instead of `.iloc` to set values in pandas strategies
…rage since we now actually cover all types and this line is now not covered
|
Some form of timeout error in CI |
|
@tybug (I've hit retry, should be OK soon 🤞) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks so much for your PR, Shaun!
This is looking good, and I'm excited to ship it soon! Small comments below about testing and code-comments; and I can always push something to the changelog when I work out what I wanted for that.
hypothesis-python/RELEASE.rst
Outdated
| This version adds support for generating numpy.ndarray and pandas.Series with any python object as an element. | ||
| Effectively, hypothesis can now generate ``np.array([MyObject()], dtype=object)``. | ||
| The first use-case for this is with Pandas and Pandera where it is possible and sometimes required to have columns which themselves contain structured datatypes. | ||
| Pandera seems to be waiting for this change to support ``PythonDict, PythonTypedDict, PythonNamedTuple`` etc. | ||
|
|
||
| --- | ||
|
|
||
| - Accept ``dtype.kind = 'O'`` in ``from_dtype`` | ||
| - Use ``.iat`` instead of ``.iloc`` to set values in pandas strategies No newline at end of file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(apologies for this comment, it's late at night & I don't really know what I want to do instead, but thought it better to send a review now than wait until later)
I'd like to rework this note, to focus more tightly on the specific changes - as prose, not dot-points - and then afterwards note why this is valuable, with pandera only mentioned as one possible case for structured data within a pandas series. I'd also include cross-references to each class you mention, and (optional but encouraged) a thank-you note to yourself at the end of the changelog ("Thanks to Shaun Read for identifying and fixing these issues!" or similar).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, ok I've reworded the release notes and implemented all the suggestions
|
Some interesting error is occurring outside of the changes in this PR... |
|
sorry for dropping the requested review here, I'd want to be confident I understand the pandas interactions first and I don't have that requisite knowledge at the moment 😅 That failure might be a real crosshair failure, but I'm not sure it's worth pursuing with such a non-reproducer. |
As far as I understand Seems to be vaguely related. But the important points are:
From the docstrings: Demonstration: prints out: |
|
When do you think we could merge this? |
|
I'll take a look today, thanks for your patience (and contribution!) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good! I updated the changelog to be a bit more concise, and would like to improve our testing:
- I'd like to see a test combining
dtype="O"with a strategy that generates a custom (data)class, for both numpy and pandas- A test for combining custom objects and normal types in the same dtype="O" array/series would be nice as well
| # such as the string problems we had a while back. | ||
| raise HypothesisException( | ||
| f"Internal error when checking element={val!r} of {val.dtype!r} " | ||
| f"Internal error when checking element={val!r} of {getattr(val, 'dtype', type(val))!r} " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't help that this code isn't typed (not your fault!), but do we expect it be be possible that val is not a numpy object and therefore doesn't have dtype? If so, we should also change the possible mismatch of time units in dtypes code above. It seems this might be a latent bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes this it is now possible that val is not a numpy object so we need to fallback to use type(). However, I think we still need the above anyway just in case the wrong elements are used in a typed array.
|
I'm back again! |
As in: I'd like to see a test which defines a class or dataclass |
This reverts commit 1975676.
|
I've added some more rigorous tests and your predicted OverFlowError came up but for a different reason: printing out a dataframe which itself contains a numpy array as an entry will overflow if the array is too big. This is a "problem" with pandas so I just restricted the input strategy. There are now a threading and coverage failures which I need help interpreting please! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay Shaun - some comments below; I think the threading issue was probably just a flake.
| f"Could not add element={getattr(val, 'dtype', type(val))!r} of " | ||
| f"{getattr(val, 'dtype', type(val))!r} to array of " | ||
| f"{getattr(result, 'dtype', type(result))!r} - possible mismatch of time units in dtypes?" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these expressions would be easier to read outside of the f-string. It also looks like the dtype has been duplicated into the value position. For the type, consider using repr for dtypes and get_pretty_function_representation for types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is get_pretty_function_representation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, that's hypothesis.internal.reflection.get_pretty_function_description
| elem_changed = self._check_elements and val != result[idx] and val == val | ||
| elem_changed = self._check_elements and ( | ||
| not _array_or_scalar_equal(val, result[idx]) | ||
| and _array_or_scalar_equal(val, val) | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm concerned that this will be more expensive in a pretty hot loop - can we check that it doesn't slow down e.g. arrays or floats, or perhaps do this more complicated check only for object arrays?
| if dtype.kind == "O": | ||
| return value # for objects, just use the object other numpy might convert it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not quite clear on what this comment means - make it a line or two above the return?
| try: | ||
| len(x) | ||
| except OverflowError: | ||
| return False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like the coverage tests are reporting that this line was not executed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Out of interest, how do you find that out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
click on the "check-coverage" job, the scroll down to the summary table; direct link here
|
I'll push the changes in a bit |
…andas.impl). The tests are now a bit smarter (ensuring elements that go into an array/dataframe/series) are exactly the same on access.
|
Changes:
|
|
(looks like you merged master mid-release-process, and that's where the conflicts are coming from) |
…numpy_arrays_and_pandas_series
Ok finally figured that out |
|
Can I get a review? |
Bump |
…hub.com:philastrophist/hypothesis into allow_objects_in_numpy_arrays_and_pandas_series
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've made several direct changes here, since by the time I was deep enough in the review to give actionable feedback it was less effort to o so myself. I have one comment about the pandas changes, and then I think this is close to being ready.
hypothesis-python/docs/prolog.rst
Outdated
| .. |numpy| replace:: :ref:`NumPy <hypothesis-numpy>` | ||
| .. |pandas| replace:: :ref:`pandas <hypothesis-pandas>` | ||
| .. |django| replace:: :ref:`Django <hypothesis-django>` | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've avoided doing this because it's not obvious whether |numpy| refers to hypothesis-numpy or :pypi:'numpy'. I prefer being explicit with e.g. |hypothesis-numpy|.
|
|
||
| data = OrderedDict((c.name, None) for c in rewritten_columns) | ||
|
|
||
| # Depending on how the columns are going to be generated we group |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the code in this pull changes how data is ordered in some cases, which is key for good shrinking behavior. I'd like to keep the ordering from this OrderedDict approach
This change would add support for generating numpy.ndarray and pandas.Series with any python object as an element.
Effectively, hypothesis can now generate
np.array([MyObject()], dtype=object).The first use-case for this is with Pandas and Pandera where it is possible and sometimes required to have columns which themselves contain structured datatypes.
Pandera seems to be waiting for this change to support PythonDict, PythonTypedDict, PythonNamedTuple etc.
from_dtypeUse.iatinstead of.ilocto set values in pandas strategies (this allows setting of dictionaries as elements etc)