Skip to content

Conversation

philastrophist
Copy link

@philastrophist philastrophist commented Jun 20, 2025

This change would add support for generating numpy.ndarray and pandas.Series with any python object as an element.
Effectively, hypothesis can now generate np.array([MyObject()], dtype=object).
The first use-case for this is with Pandas and Pandera where it is possible and sometimes required to have columns which themselves contain structured datatypes.
Pandera seems to be waiting for this change to support PythonDict, PythonTypedDict, PythonNamedTuple etc.

  • Accept dtype.kind = 'O' in from_dtype
  • Add the base case of any type
  • Use .iat instead of .iloc to set values in pandas strategies (this allows setting of dictionaries as elements etc)
  • Construct Series rather than setting elements in pandas strategies (this allows dictionaries as elements etc)

@Zac-HD Zac-HD requested a review from Liam-DeVoe June 26, 2025 02:08
Shaun Read added 3 commits July 2, 2025 14:46
@philastrophist
Copy link
Author

Some form of timeout error in CI

@Zac-HD
Copy link
Member

Zac-HD commented Jul 3, 2025

@tybug FAILED hypothesis-python/tests/watchdog/test_database.py::test_database_listener_directory_move - Exception: timing out after waiting 1s for condition lambda: set(events) on Windows CI

(I've hit retry, should be OK soon 🤞)

Copy link
Member

@Zac-HD Zac-HD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for your PR, Shaun!

This is looking good, and I'm excited to ship it soon! Small comments below about testing and code-comments; and I can always push something to the changelog when I work out what I wanted for that.

Comment on lines 3 to 11
This version adds support for generating numpy.ndarray and pandas.Series with any python object as an element.
Effectively, hypothesis can now generate ``np.array([MyObject()], dtype=object)``.
The first use-case for this is with Pandas and Pandera where it is possible and sometimes required to have columns which themselves contain structured datatypes.
Pandera seems to be waiting for this change to support ``PythonDict, PythonTypedDict, PythonNamedTuple`` etc.

---

- Accept ``dtype.kind = 'O'`` in ``from_dtype``
- Use ``.iat`` instead of ``.iloc`` to set values in pandas strategies No newline at end of file
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(apologies for this comment, it's late at night & I don't really know what I want to do instead, but thought it better to send a review now than wait until later)

I'd like to rework this note, to focus more tightly on the specific changes - as prose, not dot-points - and then afterwards note why this is valuable, with pandera only mentioned as one possible case for structured data within a pandas series. I'd also include cross-references to each class you mention, and (optional but encouraged) a thank-you note to yourself at the end of the changelog ("Thanks to Shaun Read for identifying and fixing these issues!" or similar).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, ok I've reworded the release notes and implemented all the suggestions

@philastrophist
Copy link
Author

Some interesting error is occurring outside of the changes in this PR...

@philastrophist philastrophist requested a review from Zac-HD July 3, 2025 09:16
@Liam-DeVoe
Copy link
Member

sorry for dropping the requested review here, I'd want to be confident I understand the pandas interactions first and I don't have that requisite knowledge at the moment 😅

That failure might be a real crosshair failure, but I'm not sure it's worth pursuing with such a non-reproducer.

@philastrophist
Copy link
Author

sorry for dropping the requested review here, I'd want to be confident I understand the pandas interactions first and I don't have that requisite knowledge at the moment 😅

As far as I understand at and iat are more basic indexers than loc and iloc in that they can only access a single entry rather than possibly an subset of entries.
But ignoring vector access here, loc will transform dicts into a series and then set them. There's an interesting note in their source here:

# TODO(EA): ExtensionBlock.setitem this causes issues with
# setting for extensionarrays that store dicts. Need to decide
# if it's worth supporting that.

Seems to be vaguely related.

But the important points are:

  1. loc does transformations to the given values stopping us from inserting dicts into series using iloc/loc. This may or may not be a bug. Either way, editing this logic within pandas is likely to be fraught and it's difficult to tell what other transforms might be applied.
  2. at is the intended way to set single values within a dataframe/series according to the docs. It's technically faster but more importantly it doesn't perform any checks or transformations on the value. The logic is a lot simpler. The reason ruff warns against it is that "iloc is more idiomatic and versatile". We know, that in our use-case, we will only ever be setting a series element by integer index, which is what iat is for.

From the docstrings:

DataFrame.iat : Access a single value for a row/column label pair by integer position(s).
DataFrame.iloc : Access a group of rows and columns by integer position(s).
Similar to ``iloc``, in that both provide integer-based lookups. Use
    ``iat`` if you only need to get or set a single value in a DataFrame
    or Series.

Demonstration:

import pandas as pd

s = pd.Series([1, 2, 3], dtype=object)  # object dtype so we dont get mismatch warnings

s.iloc[0] = {'a': 1}
print('series with iloc:\n', s)
print('entry type with iloc:', type(s.iloc[0]))

s.iat[0] = {'a': 1}
print('with iat:\n', s)
print('entry type with iat:', type(s.iat[0]))

prints out:

series with iloc:
 0    a    1
dtype: int64
1                      2
2                      3
dtype: object
entry type with iloc: <class 'pandas.core.series.Series'>
with iat:
 0    {'a': 1}
1           2
2           3
dtype: object
entry type with iat: <class 'dict'>

@philastrophist
Copy link
Author

When do you think we could merge this?

@Liam-DeVoe
Copy link
Member

I'll take a look today, thanks for your patience (and contribution!)

Copy link
Member

@Liam-DeVoe Liam-DeVoe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! I updated the changelog to be a bit more concise, and would like to improve our testing:

  • I'd like to see a test combining dtype="O" with a strategy that generates a custom (data)class, for both numpy and pandas
    • A test for combining custom objects and normal types in the same dtype="O" array/series would be nice as well

# such as the string problems we had a while back.
raise HypothesisException(
f"Internal error when checking element={val!r} of {val.dtype!r} "
f"Internal error when checking element={val!r} of {getattr(val, 'dtype', type(val))!r} "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't help that this code isn't typed (not your fault!), but do we expect it be be possible that val is not a numpy object and therefore doesn't have dtype? If so, we should also change the possible mismatch of time units in dtypes code above. It seems this might be a latent bug.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this it is now possible that val is not a numpy object so we need to fallback to use type(). However, I think we still need the above anyway just in case the wrong elements are used in a typed array.

@philastrophist
Copy link
Author

I'm back again!
Could you clarify "custom (data)class, for both numpy and pandas"?

@Liam-DeVoe
Copy link
Member

Could you clarify "custom (data)class, for both numpy and pandas"?

As in: I'd like to see a test which defines a class or dataclass A with a bunch of fields of different types, and passes elements=st.builds(A) to the pandas and numpy strategies which have newly-added support for dtype="O". Then check that you can pull out elements of type A from the pandas series or numpy array. I want to make sure that supplying complicated classes to dtype="O" is well supported!

@philastrophist
Copy link
Author

philastrophist commented Aug 8, 2025

I've added some more rigorous tests and your predicted OverFlowError came up but for a different reason: printing out a dataframe which itself contains a numpy array as an entry will overflow if the array is too big. This is a "problem" with pandas so I just restricted the input strategy.

There are now a threading and coverage failures which I need help interpreting please!

cc @tybug @Zac-HD

@Liam-DeVoe Liam-DeVoe self-requested a review August 17, 2025 07:06
Copy link
Member

@Zac-HD Zac-HD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay Shaun - some comments below; I think the threading issue was probably just a flake.

Comment on lines 288 to 290
f"Could not add element={getattr(val, 'dtype', type(val))!r} of "
f"{getattr(val, 'dtype', type(val))!r} to array of "
f"{getattr(result, 'dtype', type(result))!r} - possible mismatch of time units in dtypes?"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these expressions would be easier to read outside of the f-string. It also looks like the dtype has been duplicated into the value position. For the type, consider using repr for dtypes and get_pretty_function_representation for types.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is get_pretty_function_representation?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, that's hypothesis.internal.reflection.get_pretty_function_description

Comment on lines 246 to 296
elem_changed = self._check_elements and val != result[idx] and val == val
elem_changed = self._check_elements and (
not _array_or_scalar_equal(val, result[idx])
and _array_or_scalar_equal(val, val)
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm concerned that this will be more expensive in a pretty hot loop - can we check that it doesn't slow down e.g. arrays or floats, or perhaps do this more complicated check only for object arrays?

Comment on lines 129 to 130
if dtype.kind == "O":
return value # for objects, just use the object other numpy might convert it
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not quite clear on what this comment means - make it a line or two above the return?

try:
len(x)
except OverflowError:
return False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the coverage tests are reporting that this line was not executed.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Out of interest, how do you find that out?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

click on the "check-coverage" job, the scroll down to the summary table; direct link here

@philastrophist
Copy link
Author

I'll push the changes in a bit

Shaun Read added 3 commits August 27, 2025 18:14
…andas.impl). The tests are now a bit smarter (ensuring elements that go into an array/dataframe/series) are exactly the same on access.
@philastrophist
Copy link
Author

Changes:

  • Sped up the hot path in numpy set_element by skipping for non-object cases
  • dropped using iat with pandas and instead construct the series using lists (type errors are still raised by pandas) to avoid pandas coercing values we don't want it to coerce (much cleaner anyway)
  • Made the tests a bit more sophisticated (checking exact parity of elements that go into the pandas/numpy strategy and their values when accessed in those numpy/pandas containers)
  • Remove overflow check pre-filter since overflow only happens when pandas errors and tries to display the erroring row using string.ljust
  • Removed assert_safe_equals since we can just assert list equality

@philastrophist philastrophist requested a review from Zac-HD August 27, 2025 18:04
@Zac-HD
Copy link
Member

Zac-HD commented Aug 27, 2025

(looks like you merged master mid-release-process, and that's where the conflicts are coming from)

@philastrophist
Copy link
Author

(looks like you merged master mid-release-process, and that's where the conflicts are coming from)

Ok finally figured that out

@philastrophist
Copy link
Author

Can I get a review?

@philastrophist
Copy link
Author

Can I get a review?

Bump

Copy link
Member

@Liam-DeVoe Liam-DeVoe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made several direct changes here, since by the time I was deep enough in the review to give actionable feedback it was less effort to o so myself. I have one comment about the pandas changes, and then I think this is close to being ready.

Comment on lines 168 to 171
.. |numpy| replace:: :ref:`NumPy <hypothesis-numpy>`
.. |pandas| replace:: :ref:`pandas <hypothesis-pandas>`
.. |django| replace:: :ref:`Django <hypothesis-django>`

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've avoided doing this because it's not obvious whether |numpy| refers to hypothesis-numpy or :pypi:'numpy'. I prefer being explicit with e.g. |hypothesis-numpy|.


data = OrderedDict((c.name, None) for c in rewritten_columns)

# Depending on how the columns are going to be generated we group
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the code in this pull changes how data is ordered in some cases, which is key for good shrinking behavior. I'd like to keep the ordering from this OrderedDict approach

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants