
jorisvandenbossche
Member

@jorisvandenbossche jorisvandenbossche commented Dec 1, 2024

This improves construct_1d_object_array_from_listlike, especially for the case where the objects inside the array-like are themselves array-likes with a potentially expensive conversion to numpy.
It seems that when doing result[:] = values, numpy will still check the __array__ method for each object in values, while when iterating and assigning the objects one by one, that does not happen.

And even in the case where __array__ is not expensive at all (or is absent), it seems that iterating is faster than the single assignment:

In [12]: class A:
    ...:     def __init__(self):
    ...:         self.data = np.random.randn(5)
    ...:     def __array__(self, dtype=None, copy=None):
    ...:         #print("calling __array__")
    ...:         return self.data

In [13]: N = 10_000

In [14]: data = [A() for _ in range(N)]

In [17]: %%timeit
    ...: arr = np.empty((N, ), dtype=object)
    ...: arr[:] = data
    ...: 
5.39 ms ± 33.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [18]: %%timeit
    ...: arr = np.empty((N, ), dtype=object)
    ...: for i, obj in enumerate(data):
    ...:     arr[i] = obj
    ...: 
424 µs ± 6.78 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
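
The `__array__` observation above can be checked by counting calls instead of timing them. This is a minimal sketch, not pandas code: `Counted` and `fill_by_loop` are hypothetical names introduced for illustration. No expected count is given for the slice assignment, since that may depend on the NumPy version.

```python
import numpy as np


class Counted:
    """Hypothetical array-like that records how often __array__ is called."""

    calls = 0

    def __init__(self):
        self.data = np.zeros(5)

    def __array__(self, dtype=None, copy=None):
        Counted.calls += 1
        return self.data


def fill_by_loop(values):
    """Build a 1D object array by assigning the elements one by one."""
    result = np.empty(len(values), dtype=object)
    for i, obj in enumerate(values):
        result[i] = obj
    return result


data = [Counted() for _ in range(100)]

# Element-wise assignment: each object is stored as-is.
Counted.calls = 0
looped = fill_by_loop(data)
loop_calls = Counted.calls

# Single slice assignment: numpy inspects the elements first.
Counted.calls = 0
sliced = np.empty(len(data), dtype=object)
sliced[:] = data
slice_calls = Counted.calls

print("__array__ calls, loop:", loop_calls, "slice:", slice_calls)
```

Both arrays end up holding the original objects; only the amount of inspection numpy does on the way differs.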

This is a useful performance improvement in general, I assume, but I am specifically doing it to fix the performance issue reported in #59657. That is mostly relevant for the 2.3.x branch, because the issue is avoided on main by #57205 (which avoids the Series construction, which ends up calling construct_1d_object_array_from_listlike, in the first place).

@jorisvandenbossche jorisvandenbossche added the Performance and Constructors labels Dec 1, 2024
@jorisvandenbossche jorisvandenbossche added this to the 2.3 milestone Dec 1, 2024
@mroeschke
Member

Some mypy errors, but nice find!

mypy.....................................................................Failed
- hook id: mypy
- duration: 88.21s
- exit code: 1

pandas/core/dtypes/cast.py:1604: error: Argument 1 to "len" has incompatible type "Iterable[Any]"; expected "Sized"  [arg-type]
pandas/core/common.py:256: error: Unused "type: ignore" comment  [unused-ignore]
Found 2 errors in 2 files (checked 1446 source files)

@@ -1602,7 +1602,8 @@ def construct_1d_object_array_from_listlike(values: Sized) -> np.ndarray:
     # numpy will try to interpret nested lists as further dimensions, hence
     # making a 1D array that contains list-likes is a bit tricky:
     result = np.empty(len(values), dtype="object")
-    result[:] = values
+    for i, obj in enumerate(values):
Member

Any advantage of using np.fromiter(values, dtype="object", count=len(values))?

Member Author

Ah, nice, I wasn't aware of that. From a quick test, that seems to be even a bit faster.
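
For reference, the `np.fromiter` suggestion can be sketched as follows. This assumes a NumPy version in which `np.fromiter` accepts `dtype=object` (1.23 or later, as far as I know). The helper name `object_array_from_listlike` is hypothetical and is not the pandas function itself.

```python
import numpy as np


def object_array_from_listlike(values):
    """Hypothetical helper: build a 1D object array without expanding nested list-likes."""
    # count=len(values) lets numpy allocate the result once up front.
    return np.fromiter(values, dtype="object", count=len(values))


nested = [[1, 2], [3, 4, 5], (6,)]
arr = object_array_from_listlike(nested)

# The nested list-likes stay as single elements instead of becoming dimensions.
print(arr.shape, arr.dtype)
```

Because `count` is taken from `len(values)`, the input has to be sized as well as iterable, which is what the annotation discussion in this thread is about.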

@jorisvandenbossche
Member Author

> Some mypy errors,

If I want something that is both Iterable and Sized, is that Collection or Sequence?

@mroeschke
Member

> If I want something that is both Iterable and Sized, then that's Collection or Sequence ?

Either should work, judging by the inheritance structure (Sequence inherits from Collection): https://docs.python.org/3/library/collections.abc.html#collections-abstract-base-classes
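
The relationship between these abstract base classes can be checked directly with the standard library. The sketch below is illustrative only. `count_items` is a hypothetical function, not the annotation that was actually committed.

```python
from collections.abc import Collection, Iterable, Sequence, Sized


def count_items(values: Collection) -> int:
    """Hypothetical function that needs both len() and iteration."""
    seen = sum(1 for _ in values)  # uses Iterable
    assert seen == len(values)     # uses Sized
    return seen


# Collection combines Sized, Iterable and Container; Sequence adds ordering and indexing.
collection_is_sized = issubclass(Collection, Sized)
collection_is_iterable = issubclass(Collection, Iterable)
sequence_is_collection = issubclass(Sequence, Collection)

# A set is a Collection but not a Sequence, so Collection is the looser requirement.
set_is_collection = isinstance({1, 2}, Collection)
set_is_sequence = isinstance({1, 2}, Sequence)

# A generator is Iterable but not Sized, which is what mypy complained about for len().
gen_is_sized = isinstance((x for x in range(3)), Sized)
```

Annotating with Collection accepts anything that supports `len()` and iteration, while Sequence additionally requires indexing.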

@mroeschke mroeschke merged commit 8695401 into pandas-dev:main Dec 3, 2024
47 of 51 checks passed
@mroeschke
Member

Thanks @jorisvandenbossche

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Dec 3, 2024
mroeschke pushed a commit that referenced this pull request Dec 3, 2024
…_array_from_listlike) (#60483)

Backport PR #60461: PERF: improve construct_1d_object_array_from_listlike

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@jorisvandenbossche jorisvandenbossche deleted the construction-1d-object branch December 3, 2024 20:51
KevsterAmp pushed a commit to KevsterAmp/pandas that referenced this pull request Mar 12, 2025
* PERF: improve construct_1d_object_array_from_listlike

* use np.fromiter and update annotation
Labels
Constructors, Performance
Development

Successfully merging this pull request may close these issues.

PERF: Melt 2x slower when future.infer_string option enabled
2 participants