Support for numpy.ndarray and pandas.Series with any python object as entry #4444

philastrophist · 2025-06-20T12:57:02Z

This change would add support for generating numpy.ndarray and pandas.Series with any python object as an element.
Effectively, hypothesis can now generate np.array([MyObject()], dtype=object).
The first use-case for this is with Pandas and Pandera where it is possible and sometimes required to have columns which themselves contain structured datatypes.
Pandera seems to be waiting for this change to support PythonDict, PythonTypedDict, PythonNamedTuple etc.

- Accept `dtype.kind = 'O'` in `from_dtype`
- Add the base case of any type
- ~~Use `.iat` instead of `.iloc` to set values in pandas strategies (this allows setting of dictionaries as elements etc)~~
- Construct Series rather than setting elements in pandas strategies (this allows dictionaries as elements etc)
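
For readers unfamiliar with object arrays, here is a minimal sketch using plain NumPy, with no Hypothesis involved. `MyObject` is a hypothetical placeholder class; the snippet only shows what an object-dtype array holds and how its `dtype.kind` reads.

```python
import numpy as np


class MyObject:
    """Hypothetical stand-in for any user-defined Python object."""


# An object-dtype array stores references to arbitrary Python objects.
obj = MyObject()
arr = np.array([obj], dtype=object)

kind = arr.dtype.kind        # "O" is NumPy's kind code for the object dtype
same_object = arr[0] is obj  # the element is the original object, not a copy
```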


philastrophist · 2025-07-02T15:39:16Z

Some form of timeout error in CI

Zac-HD · 2025-07-03T04:34:58Z

@tybug FAILED hypothesis-python/tests/watchdog/test_database.py::test_database_listener_directory_move - Exception: timing out after waiting 1s for condition lambda: set(events) on Windows CI

(I've hit retry, should be OK soon 🤞)

Zac-HD

Thanks so much for your PR, Shaun!

This is looking good, and I'm excited to ship it soon! Small comments below about testing and code-comments; and I can always push something to the changelog when I work out what I wanted for that.

hypothesis-python/src/hypothesis/extra/numpy.py

hypothesis-python/src/hypothesis/extra/pandas/impl.py

hypothesis-python/tests/numpy/test_argument_validation.py

hypothesis-python/tests/pandas/test_series.py

Zac-HD · 2025-07-03T05:35:38Z

hypothesis-python/RELEASE.rst

+This version adds support for generating numpy.ndarray and pandas.Series with any python object as an element.
+Effectively, hypothesis can now generate ``np.array([MyObject()], dtype=object)``.
+The first use-case for this is with Pandas and Pandera where it is possible and sometimes required to have columns which themselves contain structured datatypes.
+Pandera seems to be waiting for this change to support ``PythonDict, PythonTypedDict, PythonNamedTuple`` etc.
+
+---
+
+- Accept ``dtype.kind = 'O'`` in ``from_dtype``
+- Use ``.iat`` instead of ``.iloc`` to set values in pandas strategies


(apologies for this comment, it's late at night & I don't really know what I want to do instead, but thought it better to send a review now than wait until later)

I'd like to rework this note, to focus more tightly on the specific changes - as prose, not dot-points - and then afterwards note why this is valuable, with pandera only mentioned as one possible case for structured data within a pandas series. I'd also include cross-references to each class you mention, and (optional but encouraged) a thank-you note to yourself at the end of the changelog ("Thanks to Shaun Read for identifying and fixing these issues!" or similar).

Great, ok I've reworded the release notes and implemented all the suggestions

philastrophist · 2025-07-03T09:16:06Z

Some interesting error is occurring outside of the changes in this PR...

Liam-DeVoe · 2025-07-03T20:37:49Z

sorry for dropping the requested review here, I'd want to be confident I understand the pandas interactions first and I don't have that requisite knowledge at the moment 😅

That failure might be a real crosshair failure, but I'm not sure it's worth pursuing with such a non-reproducer.

philastrophist · 2025-07-04T06:37:26Z

> sorry for dropping the requested review here, I'd want to be confident I understand the pandas interactions first and I don't have that requisite knowledge at the moment 😅

As far as I understand, at and iat are more basic indexers than loc and iloc, in that they can only access a single entry rather than a subset of entries.
But ignoring vector access here, loc will transform dicts into a series and then set them. There's an interesting note in their source here:

# TODO(EA): ExtensionBlock.setitem this causes issues with
# setting for extensionarrays that store dicts. Need to decide
# if it's worth supporting that.

Seems to be vaguely related.

But the important points are:

- loc transforms the values it is given, which stops us from inserting dicts into series using iloc/loc. This may or may not be a bug. Either way, editing this logic within pandas is likely to be fraught, and it's difficult to tell what other transforms might be applied.
- at is the intended way to set single values within a dataframe/series according to the docs. It's technically faster, but more importantly it doesn't perform any checks or transformations on the value, so the logic is a lot simpler. The reason ruff warns against it is that "iloc is more idiomatic and versatile". We know that in our use-case we will only ever be setting a series element by integer index, which is what iat is for.

From the docstrings:

DataFrame.iat : Access a single value for a row/column label pair by integer position(s).
DataFrame.iloc : Access a group of rows and columns by integer position(s).
Similar to ``iloc``, in that both provide integer-based lookups. Use
    ``iat`` if you only need to get or set a single value in a DataFrame
    or Series.

Demonstration:

import pandas as pd

s = pd.Series([1, 2, 3], dtype=object)  # object dtype so we don't get mismatch warnings

s.iloc[0] = {'a': 1}
print('series with iloc:\n', s)
print('entry type with iloc:', type(s.iloc[0]))

s.iat[0] = {'a': 1}
print('with iat:\n', s)
print('entry type with iat:', type(s.iat[0]))

prints out:

series with iloc:
 0    a    1
dtype: int64
1                      2
2                      3
dtype: object
entry type with iloc: <class 'pandas.core.series.Series'>
with iat:
 0    {'a': 1}
1           2
2           3
dtype: object
entry type with iat: <class 'dict'>

philastrophist · 2025-07-10T15:58:44Z

When do you think we could merge this?

Liam-DeVoe · 2025-07-10T16:49:19Z

I'll take a look today, thanks for your patience (and contribution!)


Liam-DeVoe

Looking good! I updated the changelog to be a bit more concise, and would like to improve our testing:

- I'd like to see a test combining dtype="O" with a strategy that generates a custom (data)class, for both numpy and pandas
- A test for combining custom objects and normal types in the same dtype="O" array/series would be nice as well

hypothesis-python/src/hypothesis/extra/pandas/impl.py

Liam-DeVoe · 2025-07-11T20:20:51Z

hypothesis-python/src/hypothesis/extra/numpy.py

            # such as the string problems we had a while back.
            raise HypothesisException(
-                f"Internal error when checking element={val!r} of {val.dtype!r} "
+                f"Internal error when checking element={val!r} of {getattr(val, 'dtype', type(val))!r} "


It doesn't help that this code isn't typed (not your fault!), but do we expect it to be possible that val is not a numpy object and therefore doesn't have a dtype? If so, we should also change the "possible mismatch of time units in dtypes" code above. It seems this might be a latent bug.

Yes, it is now possible that val is not a numpy object, so we need to fall back to type(). However, I think we still need the above anyway, just in case the wrong elements are used in a typed array.
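
A minimal sketch of the fallback discussed here, written without NumPy so it runs anywhere. `FakeScalar` and `describe_kind` are hypothetical names used only for illustration; the point is that `getattr` with a default returns the `dtype` when there is one and the plain Python type otherwise.

```python
class FakeScalar:
    """Hypothetical stand-in for a NumPy scalar: it exposes a ``dtype``."""

    dtype = "float64"


def describe_kind(val):
    # Prefer the dtype when the value has one, otherwise fall back to its type.
    return getattr(val, "dtype", type(val))


with_dtype = describe_kind(FakeScalar())  # the object's dtype attribute
without_dtype = describe_kind({"a": 1})   # a dict has no dtype, so its type
```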

hypothesis-python/tests/pandas/test_series.py

philastrophist · 2025-08-04T17:30:53Z

I'm back again!
Could you clarify "custom (data)class, for both numpy and pandas"?

Liam-DeVoe · 2025-08-05T17:24:51Z

> Could you clarify "custom (data)class, for both numpy and pandas"?

As in: I'd like to see a test which defines a class or dataclass A with a bunch of fields of different types, and passes elements=st.builds(A) to the pandas and numpy strategies which have newly-added support for dtype="O". Then check that you can pull out elements of type A from the pandas series or numpy array. I want to make sure that supplying complicated classes to dtype="O" is well supported!
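
The property being asked for can be sketched independently of Hypothesis. In the snippet below the `elements` list is a hand-written stand-in for values that `elements=st.builds(A)` would generate, and `A` is a hypothetical dataclass; it is not the test that was added to the PR. The check is that whatever goes into a dtype="O" array comes back out as the same objects, including when custom objects and ordinary types are mixed.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class A:
    """Hypothetical class with fields of several types."""

    x: int
    y: str
    z: float


# Stand-in for generated values; mixes A instances with plain Python types.
elements = [A(1, "a", 0.5), 42, "text", A(2, "b", 1.5)]

# Fill element by element so NumPy never tries to unpack an entry as a sequence.
arr = np.empty(len(elements), dtype=object)
for i, element in enumerate(elements):
    arr[i] = element

round_tripped = arr.tolist()
```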


philastrophist · 2025-08-08T08:16:52Z

I've added some more rigorous tests and your predicted OverflowError came up, but for a different reason: printing out a dataframe which itself contains a numpy array as an entry will overflow if the array is too big. This is a "problem" with pandas, so I just restricted the input strategy.

There are now threading and coverage failures which I need help interpreting, please!

cc @tybug @Zac-HD


Zac-HD

Sorry for the delay Shaun - some comments below; I think the threading issue was probably just a flake.

Zac-HD · 2025-08-20T04:14:52Z

hypothesis-python/src/hypothesis/extra/numpy.py

+                f"Could not add element={getattr(val, 'dtype', type(val))!r} of "
+                f"{getattr(val, 'dtype', type(val))!r} to array of "
+                f"{getattr(result, 'dtype', type(result))!r} - possible mismatch of time units in dtypes?"


I think these expressions would be easier to read outside of the f-string. It also looks like the dtype has been duplicated into the value position. For the type, consider using repr for dtypes and get_pretty_function_representation for types.

Where is get_pretty_function_representation?

Sorry, that's hypothesis.internal.reflection.get_pretty_function_description

Zac-HD · 2025-08-20T04:16:18Z

hypothesis-python/src/hypothesis/extra/numpy.py

-            elem_changed = self._check_elements and val != result[idx] and val == val
+            elem_changed = self._check_elements and (
+                not _array_or_scalar_equal(val, result[idx])
+                and _array_or_scalar_equal(val, val)
+            )


I'm concerned that this will be more expensive in a pretty hot loop - can we check that it doesn't slow down e.g. arrays or floats, or perhaps do this more complicated check only for object arrays?
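
For context, a small sketch of what the original scalar check `val != result[idx] and val == val` computes. It uses only the standard library: `store_as_float32` is a hypothetical helper that mimics writing a Python float into a float32 slot, and the `_array_or_scalar_equal` helper from the diff is not reproduced here.

```python
import math
import struct


def store_as_float32(val: float) -> float:
    """Round-trip through 4 bytes, mimicking a write into a float32 slot."""
    return struct.unpack("f", struct.pack("f", val))[0]


def elem_changed(val: float, stored: float) -> bool:
    # `val == val` is False only for NaN, so NaN is never reported as changed.
    return val != stored and val == val


exact = elem_changed(0.5, store_as_float32(0.5))  # 0.5 is exact in float32
lossy = elem_changed(0.1, store_as_float32(0.1))  # 0.1 gets rounded on storage
nan = elem_changed(math.nan, store_as_float32(math.nan))
```

Both comparisons are single scalar operations here, which is why replacing them with a more general equality helper inside the per-element loop raises the cost question above.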

Zac-HD · 2025-08-20T04:17:03Z

hypothesis-python/src/hypothesis/extra/pandas/impl.py

+            if dtype.kind == "O":
+                return value  # for objects, just use the object other numpy might convert it


I'm not quite clear on what this comment means - make it a line or two above the return?

Zac-HD · 2025-08-20T04:22:12Z

hypothesis-python/src/hypothesis/extra/numpy.py

+    try:
+        len(x)
+    except OverflowError:
+        return False


It looks like the coverage tests are reporting that this line was not executed.
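
As a hedged illustration of how such a branch can be reached at all: in CPython, `len()` raises OverflowError when the length does not fit in a C `ssize_t`. `has_usable_len` below is a hypothetical helper that only mirrors the try/except shape of the hunk; the rest of the PR's function is not shown in the diff and is not reconstructed here.

```python
import sys


def has_usable_len(x) -> bool:
    """Hypothetical helper mirroring the try/except shape in the diff."""
    try:
        len(x)
    except OverflowError:
        return False
    return True


small = has_usable_len(range(10))
huge = has_usable_len(range(sys.maxsize + 1))  # too large for a C ssize_t
```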

Thanks. Out of interest, how do you find that out?

Click on the "check-coverage" job, then scroll down to the summary table; direct link here

philastrophist · 2025-08-25T18:44:02Z

I'll push the changes in a bit


philastrophist · 2025-08-27T17:50:24Z

Changes:

- Sped up the hot path in numpy set_element by skipping the more expensive comparison for non-object cases
- Dropped iat in the pandas strategies; the series is now constructed from lists (type errors are still raised by pandas), which avoids pandas coercing values we don't want it to coerce (much cleaner anyway)
- Made the tests a bit more sophisticated (checking exact parity between the elements that go into the pandas/numpy strategy and their values when accessed in those numpy/pandas containers)
- Removed the overflow check pre-filter, since overflow only happens when pandas errors and tries to display the erroring row using string.ljust
- Removed assert_safe_equals since we can just assert list equality
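
A minimal sketch of the "construct from lists" approach, assuming nothing beyond the public pandas constructor. It is not the code in `impl.py`; it only shows that a dict passed as one item of a list survives construction as a dict, in contrast to the iloc assignment demonstrated earlier in this thread.

```python
import pandas as pd

values = [{"a": 1}, 2, 3]

# Build the Series in one step instead of assigning into an existing one.
s = pd.Series(values, dtype=object)

first = s.iloc[0]  # reading with iloc returns the stored element unchanged
```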


Zac-HD · 2025-08-27T18:19:23Z

(looks like you merged master mid-release-process, and that's where the conflicts are coming from)


philastrophist · 2025-08-27T18:51:26Z

> (looks like you merged master mid-release-process, and that's where the conflicts are coming from)

Ok finally figured that out

philastrophist · 2025-09-01T09:42:19Z

Can I get a review?

philastrophist · 2025-09-15T12:09:33Z

> Can I get a review?

Bump


Liam-DeVoe

I've made several direct changes here, since by the time I was deep enough in the review to give actionable feedback it was less effort to do so myself. I have one comment about the pandas changes, and then I think this is close to being ready.

Liam-DeVoe · 2025-08-17T20:57:24Z

hypothesis-python/docs/prolog.rst

+.. |numpy| replace:: :ref:`NumPy <hypothesis-numpy>`
+.. |pandas| replace:: :ref:`pandas <hypothesis-pandas>`
+.. |django| replace:: :ref:`Django <hypothesis-django>`
+


I've avoided doing this because it's not obvious whether `|numpy|` refers to hypothesis-numpy or ``:pypi:`numpy` ``. I prefer being explicit with e.g. `|hypothesis-numpy|`.

Liam-DeVoe · 2025-09-17T18:20:01Z

hypothesis-python/src/hypothesis/extra/pandas/impl.py


-            data = OrderedDict((c.name, None) for c in rewritten_columns)
-
            # Depending on how the columns are going to be generated we group


I believe the code in this pull changes how data is ordered in some cases, which is key for good shrinking behavior. I'd like to keep the ordering from this OrderedDict approach
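
To illustrate the ordering property of the removed line, here is a standalone sketch with hypothetical column names. Pre-seeding every key fixes the final column order up front, so filling the values in a different order later does not move them. A plain `dict` behaves the same way on Python 3.7+; `OrderedDict` is used here only to match the original code.

```python
from collections import OrderedDict

column_names = ["a", "b", "c"]  # hypothetical column names, in declared order

# Pre-seed every key so the final order is fixed before any data is generated.
data = OrderedDict((name, None) for name in column_names)

# Fill the columns in a different order, as grouped generation might.
for name in ["c", "a", "b"]:
    data[name] = [f"{name}-value"]

final_order = list(data)
```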

Shaun Read added 7 commits June 20, 2025 13:48

- Accept dtype.kind = 'O' in from_dtype

77fc61e

- Use `.iat` instead of `.iloc` to set values in pandas strategies

ruff: yes we really want .iat

895e360

linting

8fe26b8

add test for failing coverage and dtypes

d2bf820

linter

918a9f0

rst mistakes and linting

b514789

comparable datatypes only

07f8ea0

Zac-HD requested a review from Liam-DeVoe June 26, 2025 02:08

Shaun Read added 3 commits July 2, 2025 14:46

still keep the else line to catch unknown dtypes but remove from cove…

aa0ab3f

…rage since we now actually cover all types and this line is now not covered

make test agree with the from_dtype strategy

b9380b7

formatting

e0c2909

Zac-HD reviewed Jul 3, 2025

View reviewed changes

Shaun Read added 4 commits July 3, 2025 09:19

addressed comments :)

6aea8bc

formatting

b65e335

formatting

212a628

Got rst sytnax wrong again...

088d272

philastrophist requested a review from Zac-HD July 3, 2025 09:16

Liam-DeVoe added 2 commits July 11, 2025 15:58

Merge branch 'master' into allow_objects_in_numpy_arrays_and_pandas_s…

c52bb66

…eries

clean up some things

f8b6ad6

Liam-DeVoe reviewed Jul 11, 2025

View reviewed changes

restrict objects allowed in arrays, better tests

56ce942

Shaun Read added 3 commits August 8, 2025 08:01

Revert "formatting"

d0d57f4

This reverts commit 1975676.

formatting

f471a18

use proper linting tooling

1d6b5aa

Liam-DeVoe self-requested a review August 17, 2025 07:06

Merge branch 'master' into allow_objects_in_numpy_arrays_and_pandas_s…

7556a9a

…eries

Zac-HD reviewed Aug 20, 2025

View reviewed changes

Shaun Read added 3 commits August 27, 2025 18:14

cleaned up and simplified the implementation a lot (particularly in p…

74520e3

…andas.impl). The tests are now a bit smarter (ensuring elements that go into an array/dataframe/series) are exactly the same on access.

formatting/linting

f87f00d

removed assert_safe_equals; using list equality now

a32eb9c

philastrophist requested a review from Zac-HD August 27, 2025 18:04

Merge branch 'master' into allow_objects_in_numpy_arrays_and_pandas_s…

877212e

…eries

Shaun Read added 3 commits August 27, 2025 19:33

not sure how that happened

6226f0e

Merge remote-tracking branch 'upstream/master' into allow_objects_in_…

704d80e

…numpy_arrays_and_pandas_series

format

808c01c

Liam-DeVoe added 7 commits September 16, 2025 22:13

Merge branch 'master' into allow_objects_in_numpy_arrays_and_pandas_s…

77fac3b

…eries

Merge branch 'allow_objects_in_numpy_arrays_and_pandas_series' of git…

c1d6fc4

…hub.com:philastrophist/hypothesis into allow_objects_in_numpy_arrays_and_pandas_series

refactor tests

631ff0f

simplify numpy code

1f70970

format

5982df7

bring back array equality check

b26eaac

comment, weaker series dtype test

d3e5f3b

Liam-DeVoe reviewed Sep 17, 2025

View reviewed changes

simplify pandas code

adab86e

