ENH: enable setitem dim2 test to work for EA with complex128 dtype #54445

Open
1 of 3 tasks
MichaelTiemannOSC opened this issue Aug 7, 2023 · 5 comments
Labels
Complex (Complex Numbers), Enhancement, ExtensionArray (Extending pandas with custom dtypes or arrays), Testing (pandas testing functions or related to the test suite)

Comments

@MichaelTiemannOSC
Contributor

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

I have read the threads related to DISCUSS/API: setitem-like operations should only update inplace (#39584) and friends (including #47577).

My problem arises from this test code from the Pint-Pandas test suite:

class TestSetitem(base.BaseSetitemTests):
    def test_setitem_2d_values(self, data):
        # GH50085
        original = data.copy()
        df = pd.DataFrame({"a": data, "b": data})
        df.loc[[0, 1], :] = df.loc[[1, 0], :].values
        assert (df.loc[0, :] == original[1]).all()
        assert (df.loc[1, :] == original[0]).all()
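
For comparison, the same 2D-values setitem pattern with a plain NumPy-backed complex128 frame (no extension array involved) already behaves as the test expects; a quick sketch of what the EA-backed path is being asked to match:

import numpy as np
import pandas as pd

# Plain complex128 columns, no EA: the row swap via .values round-trips.
data = np.array([1 + 1j, 2 + 2j, 3 + 3j], dtype=np.complex128)
df = pd.DataFrame({"a": data, "b": data})
df.loc[[0, 1], :] = df.loc[[1, 0], :].values
assert (df.loc[0, :] == 2 + 2j).all()
assert (df.loc[1, :] == 1 + 1j).all()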

As a result of PR #54441 I'm able to test Pint-Pandas with complex128 datatypes. PintArray EAs work perfectly with float and integer magnitudes, but fail in just this one case, with complex128. The failure starts in core/internals/managers.py in the fast_xs method:

        # GH#46406
        immutable_ea = isinstance(dtype, ExtensionDtype) and dtype._is_immutable

        if isinstance(dtype, ExtensionDtype) and not immutable_ea:
            cls = dtype.construct_array_type()
            result = cls._empty((n,), dtype=dtype)

result becomes a PintArray backed by a FloatingArray:

<PintArray>
[<NA>, <NA>]
Length: 2, dtype: pint[nanometer]
(Pdb) result._data
<FloatingArray>
[<NA>, <NA>]
Length: 2, dtype: Float64

The FloatingArray comes when the PintArray initializer finds nothing helpful in either dtype or values and falls back to creating a pd.array(values, ...):

    def __init__(self, values, dtype=None, copy=False):
        if dtype is None:
            if isinstance(values, _Quantity):
                dtype = values.units
            elif isinstance(values, PintArray):
                dtype = values._dtype
        if dtype is None:
            raise NotImplementedError

        if not isinstance(dtype, PintType):
            dtype = PintType(dtype)
        self._dtype = dtype

        if isinstance(values, _Quantity):
            values = values.to(dtype.units).magnitude
        elif isinstance(values, PintArray):
            values = values._data
        if isinstance(values, np.ndarray):
            dtype = values.dtype
            if dtype in dtypemap:
                dtype = dtypemap[dtype]
            values = pd.array(values, copy=copy, dtype=dtype)
            copy = False
        elif not isinstance(values, pd.core.arrays.numeric.NumericArray):
            values = pd.array(values, copy=copy)
        if copy:
            values = values.copy()
        self._data = values
        self._Q = self.dtype.ureg.Quantity
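
A small pandas-only illustration (no PintArray involved) of why that fallback lands on Float64: when pd.array sees nothing but NaN placeholders and no dtype hint, it infers a nullable Float64 array, whereas actual complex values keep a complex128-backed array:

import numpy as np
import pandas as pd

# Inference from NaN placeholders vs. from actual complex values.
backing_from_na = pd.array([np.nan, np.nan])                 # Float64
backing_from_values = pd.array(np.array([1 + 1j, 2 + 2j]))   # complex128-backed
print(backing_from_na.dtype, backing_from_values.dtype)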

fast_xs then fails because result[rl] cannot accept the complex128 data coming from blk.iget((i, loc)):

        for blk in self.blocks:
            # Such assignment may incorrectly coerce NaT to None
            # result[blk.mgr_locs] = blk._slice((slice(None), loc))
            for i, rl in enumerate(blk.mgr_locs):
                result[rl] = blk.iget((i, loc))
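
A rough standalone illustration (using a bare Float64 masked array to stand in for the PintArray's backing) of why that per-element assignment blows up once a block yields complex data:

import pandas as pd

# The freshly built result is Float64-backed, so a complex value is rejected.
result = pd.array([pd.NA, pd.NA], dtype="Float64")
try:
    result[0] = 2 + 1j
except TypeError as exc:
    print(exc)  # exact message varies across pandas versions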

As I see it, the problem is that we commit too soon to building our backing array with too-limited information.

@andrewgsavage
@topper-123
@jbrockmendel
@mroeschke

Feature Description

I will point out for the record:

  • values = [v[loc] for v in self.arrays]

So we have everything we need within the environment of fast_xs. Should we use that information to create a result that can hold slices of data beyond float64? Here's code that tries the fast path first and, if an exception is raised, falls back to the sure thing:

        try:
            for blk in self.blocks:
                # Such assignment may incorrectly coerce NaT to None
                # result[blk.mgr_locs] = blk._slice((slice(None), loc))
                for i, rl in enumerate(blk.mgr_locs):
                    result[rl] = blk.iget((i, loc))
        except TypeError:
            if isinstance(dtype, ExtensionDtype) and not immutable_ea:
                values = [v[loc] for v in self.arrays]
                result = cls._from_sequence(values, dtype=dtype)
            else:
                raise
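
A toy, pandas-only version of that fallback (the names here are illustrative: plain pd.arrays stand in for self.arrays, and pd.array stands in for cls._from_sequence). Building the result from the actual per-column scalars preserves the complex dtype instead of committing to Float64 up front:

import numpy as np
import pandas as pd

# Two "columns" of complex values, as the EA-backed frame would hold them.
arrays = [
    pd.array(np.array([1 + 1j, 2 + 2j])),
    pd.array(np.array([3 + 3j, 4 + 4j])),
]
loc = 0
values = [v[loc] for v in arrays]   # one scalar per column
result = pd.array(values)           # dtype inferred from the actual values
print(result.dtype)                 # complex128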

Alternative Solutions

I'm open to alternative solutions, but the above actually causes the test case to pass. Should I submit a PR?

Additional Context

No response

@MichaelTiemannOSC added the Enhancement and Needs Triage (Issue that has not been reviewed by a pandas team member) labels on Aug 7, 2023
MichaelTiemannOSC added a commit to MichaelTiemannOSC/pint-pandas that referenced this issue Aug 7, 2023
Additional code synchronizations (and the addition of a dtype-preserving map method).  These changes were initially developed to support uncertainties, but the uncertainty changes have all been stripped out to simplify merging of underlying code.  Once these changes are fully synced with a release version of Pandas 2.1, we can look at adding back uncertainties.

These changes also tolerate complex128 as a base type for magnitudes, with one exception (under discussion as pandas-dev/pandas#54445).

Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
MichaelTiemannOSC added a commit to MichaelTiemannOSC/pint-pandas that referenced this issue Aug 14, 2023
Until Pandas resolves pandas-dev/pandas#54445 we cannot feed complex128 types to the test_setitem_2d_values test case.

Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
MichaelTiemannOSC added a commit to MichaelTiemannOSC/pandas that referenced this issue Aug 14, 2023
When a loc indexer goes to create a fresh array to hold values from an Extension Array, it makes the array allocation based on the na_value of the EA, but that na_value may be smaller than the size of the things that the EA can hold (such as complex128).  Note that np.nan is itself a legitimate complex128 value.

If the allocated array cannot hold the values from the block manager, and if the EA is not immutable, it will try a second strategy of allocating an array based on the dtype of the values of the blocks.  If the blocks hold complex numbers, the array will be properly sized.

This should close pandas-dev#54445.

Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
@jbrockmendel
Member

The FloatingArray comes when the PintArray initializer finds nothing helpful in either dtype or values and falls back to creating a pd.array(values, ...):

From the wording it sounds like FloatingArray is not what you would actually want here? What type/dtype would you want here?

@MichaelTiemannOSC
Contributor Author

We have complex128 magnitudes to store, which cannot be stored in a FloatingArray. An array sized for complex128 would be great.
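
Since pandas has no nullable complex extension array, "an array sized for complex128" would in practice be a NumPy-backed array; a minimal sketch of such a holder:

import numpy as np
import pandas as pd

# A complex128-backed array accepts complex values without coercion.
holder = pd.array(np.zeros(2, dtype=np.complex128))
holder[0] = 2 + 1j
print(holder.dtype)  # complex128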

@MichaelTiemannOSC
Contributor Author

Apologies for slow responses...storms knocked out power yesterday and still no estimate of restoration. I did some tests and noticed that Pandas 2.0.2 passes this test with complex128, but not 2.1.0rc0. Still digging in...

@MichaelTiemannOSC
Contributor Author

Tracked it down: the fast_xs method in 2.1.0rc0 makes a PintArray with [<NA>, <NA>], whereas the 2.0.x version makes it with [np.nan, np.nan]. The former demands a FloatingArray, whereas the latter creates a float64 array. 2.0.x silently converts (2+0j) to 2.0 before sticking it into the float64 array; 2.1.0rc0 properly complains that complex won't fit in float.

When using a "true" complex number (such as (2+1j)), the 2.0.x version complains that float argument must be a string or a real number, not 'complex'.

When I get internet back I'll fix the test case. But the problem remains. One hack would be to not create an empty array, but an array holding the first value, then discard that value with [:1], preserving the underlying array type.
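
A loose sketch of that hack in isolation (hypothetical; not how fast_xs is actually written): seed the result from a real value so the backing dtype comes out right, then overwrite or trim the seed later:

import numpy as np
import pandas as pd

n = 2
seed = np.complex128(2 + 1j)          # pretend this is the first block value
result = pd.array(np.full(n, seed))   # complex128-backed and correctly sized
result[0] = 3 + 3j                    # the seed values get overwritten as usual
print(result.dtype)                   # complex128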

@lukemanley
Member

sorry if this is adding noise, but any chance #54508 fixes the fast_xs issue here?

MichaelTiemannOSC added a commit to MichaelTiemannOSC/pandas that referenced this issue Aug 17, 2023
As requested in review feedback, change tests to use request.node.addmarker in test_*.py testcases, rather than xfailing in base class tests.

Also, prototyped an attempt to create properly-sized EAs for complex numbers so that dim2 test case passes for complex (pandas-dev#54445).

Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
@mroeschke added the Testing (pandas testing functions or related to the test suite), Complex (Complex Numbers), and ExtensionArray (Extending pandas with custom dtypes or arrays) labels and removed the Needs Triage (Issue that has not been reviewed by a pandas team member) label on Jul 17, 2024