
BUG: iloc setting columns not taking effect #33198

Open
jbrockmendel opened this issue Apr 1, 2020 · 34 comments
Labels
Bug · Indexing (related to indexing on series/frames, not to indexes themselves)


@jbrockmendel
Member

import numpy as np
import pandas as pd

arr = np.random.randn(10 ** 6).reshape(500, 2000).astype(np.float64)
df = pd.DataFrame(arr)
df.iloc[:, 1000:] = df.iloc[:, 1000:].astype(np.float32)

>>> df.dtypes.value_counts()
float64    2000
dtype: int64

I would expect 1000 float64 columns and 1000 float32 columns. I get the expected behavior if I set a single column before trying to set all 1000:

df = pd.DataFrame(arr)
df[1000] = df[1000].astype(np.float32)
df.iloc[:, 1000:] = df.iloc[:, 1000:].astype(np.float32)

>>> df.dtypes.value_counts()
float32    1000
float64    1000
dtype: int64
@jbrockmendel added the Bug and Indexing labels on Apr 2, 2020
@jbrockmendel
Member Author

@MomIsBestFriend this is non-trivial, might be worth taking a shot at

@ShaharNaveh
Member

@MomIsBestFriend this is non-trivial, might be worth taking a shot at

@jbrockmendel I will definitely give it a shot. Thank you for giving me things I might be able to work on!

@ShaharNaveh
Member

@jbrockmendel I have tried to solve this, but I am nowhere near close to solving it, so I guess this one really is non-trivial 😞

But I did learn a lot about pdb while trying to solve this, so I guess that wasn't all a waste of time :)

@jbrockmendel
Member Author

how about let's walk through the process [i would use] to address this?

The line that isn't working as expected is df.iloc[:, 1000:] = df.iloc[:, 1000:].astype(np.float32). What is the actual function call that this is doing?

@ShaharNaveh
Member

ShaharNaveh commented Apr 7, 2020

how about let's walk through the process [i would use] to address this?

The line that isn't working as expected is df.iloc[:, 1000:] = df.iloc[:, 1000:].astype(np.float32). What is the actual function call that this is doing?

This takes a slice of df, casts it to float32 dtype, and then assigns it back into df at the position of the taken slice?


Edit:
this calls __setitem__

@jbrockmendel
Member Author

yes. more specifically, this is equivalent to

key = (slice(None), slice(1000, None))
value = df.iloc[:, 1000:].astype(np.float32)

df.iloc.__setitem__(key, value)

You can check that value here has the expected dtypes. So let's look at the __setitem__ call. Where inside that call should we look first?
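That check can be done directly. A minimal sketch reconstructing the frame from the original report and inspecting value before the set happens:

```python
import numpy as np
import pandas as pd

# Reconstruct the frame from the original report.
arr = np.random.randn(10 ** 6).reshape(500, 2000).astype(np.float64)
df = pd.DataFrame(arr)

# The __setitem__ call above is equivalent to this key/value pair.
key = (slice(None), slice(1000, None))
value = df.iloc[:, 1000:].astype(np.float32)

# `value` already has the expected dtype before the set happens.
assert (value.dtypes == np.float32).all()
```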

@ShaharNaveh
Member

ShaharNaveh commented Apr 7, 2020

I have looked mainly at

def __setitem__(self, key, value):
if isinstance(key, tuple):
key = tuple(com.apply_if_callable(x, self.obj) for x in key)
else:
key = com.apply_if_callable(key, self.obj)
indexer = self._get_setitem_indexer(key)
self._has_valid_setitem_indexer(key)
iloc = self if self.name == "iloc" else self.obj.iloc
iloc._setitem_with_indexer(indexer, value)

Tried to have things like:

 key = com.apply_if_callable(key, self.obj, dtype=self.obj.dtypes.dtype) 

@jbrockmendel
Member Author

(stop me if you don't want to engage in this exercise. I think it may be useful)

that's the right place to look. So setting self = df.iloc and walking through this, 658 is True, 659 is going to end up returning key unchanged, and 662 is going to end up giving indexer = key. 665 gives iloc = self, so let's look at _setitem_with_indexer

Looking at _setitem_with_indexer, where do things start to go wrong?

@ShaharNaveh
Member

(stop me if you don't want to engage in this exercise. I think it may be useful)

It's very useful!

that's the right place to look. So setting self = df.iloc and walking through this, 658 is True, 659 is going to end up returning key unchanged, and 662 is going to end up giving indexer = key. 665 gives iloc = self, so let's look at _setitem_with_indexer

Looking at _setitem_with_indexer, where do things start to go wrong?

I don't know, can you elaborate?

@jbrockmendel
Member Author

So looking through _setitem_with_indexer, from 1549-1574 it's not doing any setting, just figuring out take_split_path. What is take_split_path going to end up being?

@ShaharNaveh
Member

So looking through _setitem_with_indexer, from 1549-1574 it's not doing any setting, just figuring out take_split_path. What is take_split_path going to end up being?

False
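A rough intuition for why it comes out False: the frame is backed by a single float64 block, and a float64 block reports that it can hold float32 values, since that cast is lossless. A minimal NumPy sketch of the casting rule involved (an approximation of what the _can_hold_element check effectively decides here, not the actual pandas code path):

```python
import numpy as np

# A float64 container can safely absorb float32 values...
assert np.can_cast(np.float32, np.float64)

# ...but not the other way around under "safe" casting rules,
# since float64 -> float32 loses precision.
assert not np.can_cast(np.float64, np.float32)
```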

@ShaharNaveh
Member

ShaharNaveh commented Apr 7, 2020

@jbrockmendel removing the not at L1562 solves this issue!
I guess that in order to not break other things, there needs to be some extra logic.

@jbrockmendel
Member Author

removing the not at L1562 solves this issue!
I guess that in order to not break other things, there needs to be some extra logic.

I think you're right on both counts. My best guess as to the underlying issue is that the Block checks if it can hold the values being set, finds that it can, and so sets them inplace.

How should we proceed?

@ShaharNaveh
Member

removing the not at L1562 solves this issue!
I guess that in order to not break other things, there needs to be some extra logic.

I think you're right on both counts. My best guess as to the underlying issue is that the Block checks if it can hold the values being set, finds that it can, and so sets them inplace.

How should we proceed?

I think I got it

take_split_path = blk._can_hold_element(val) or self.obj._is_mixed_type

seems not to break anything (it passes pandas/tests/frame/methods/**/*) and still gives the expected output of

float32    1000
float64    1000
dtype: int64

@jbrockmendel
Member Author

seems not to break anything (it passes pandas/tests/frame/methods/**/*),

I'll be pretty surprised if that passes the full test suite, but give it a try

@ShaharNaveh
Member

removing the not at L1562 solves this issue!
I guess that in order to not break other things, there needs to be some extra logic.

I think you're right on both counts. My best guess as to the underlying issue is that the Block checks if it can hold the values being set, finds that it can, and so sets them inplace.

How should we proceed?

@jbrockmendel I don't know how to proceed from here.

I am trying to learn how to fix bugs in general and what a good way of thinking is when trying to solve them. How would you proceed from here?

@jbrockmendel
Member Author

So in _setitem_with_indexer we found that take_split_path is False, then L1575-1632 leaves our indexer unchanged (worth double-checking me on this)

item_labels gets defined on L1642, then since take_split_path is False all we have left to look at is L1781-L1816. What part of that chunk of code is causing the problem?

@ShaharNaveh
Member

ShaharNaveh commented Apr 7, 2020

So in _setitem_with_indexer we found that take_split_path is False, then L1575-1632 leaves our indexer unchanged (worth double-checking me on this)

item_labels gets defined on L1642, then since take_split_path is False all we have left to look at is L1781-L1816. What part of that chunk of code is causing the problem?

AFAICT L1814-L1816, but I am really not sure about that.


EDIT:

Maybe it's L1979 that returns before the set happens on L1815?

@jbrockmendel
Member Author

AFAICT L1814-L1816, but I am really not sure about that.

Until you get an intuition for this part of the code, trial and error.

Maybe it's L1979 that returns before the set happens on L1815?

Do we go through that path? I expected the condition on L1786-1795 wouldn't hold

@ShaharNaveh
Member

ShaharNaveh commented Apr 7, 2020

AFAICT L1814-L1816, but I am really not sure about that.

Until you get an intuition for this part of the code, trial and error.

Thank you for teaching me; I really don't take it for granted.

Maybe it's L1979 that returns before the set happens on L1815?

Do we go through that path? I expected the condition on L1786-1795 wouldn't hold

For some reason I thought we did; I just debugged it, and the if statement evaluates to False.

@jbrockmendel
Member Author

For some reason I thought we did; I just debugged it, and the if statement evaluates to False.

OK, so we're back at L1814-1816. Spoiler: 1815 is the relevant line. It ends up doing block.values[indexer] = value on the one block in self.obj._mgr (and that __setitem__ call works correctly).
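The silent cast on that line can be reproduced with plain NumPy (a sketch; the shapes mirror the single float64 block backing the frame, which stores columns as rows, but are otherwise illustrative):

```python
import numpy as np

# The lone float64 block backing the frame.
block_values = np.zeros((2000, 500), dtype=np.float64)

# Setting float32 values into it "works", but ndarray.__setitem__
# casts the values to the array's dtype -- the dtype never changes.
new_vals = np.ones((1000, 500), dtype=np.float32)
block_values[1000:, :] = new_vals

assert block_values.dtype == np.float64  # still float64; the cast was silent
```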

From this we deduce that we'd be better off going down the take_split_path route.

So as you surmised early on, the _can_hold_element check is where we need to focus our attention.

In #33381 you found that flipping that check breaks a bunch of tests, so the _can_hold_element check is correct in those cases. This leaves three possibilities to consider:

  1. the _can_hold_element implementation isn't what we want
  2. checking _can_hold_element isn't what we want
  3. [you tell me]

@ShaharNaveh
Member

For some reason I thought we did; I just debugged it, and the if statement evaluates to False.

OK, so we're back at L1814-1816. Spoiler: 1815 is the relevant line. It ends up doing block.values[indexer] = value on the one block in self.obj._mgr (and that __setitem__ call works correctly).

From this we deduce that we'd be better off going down the take_split_path route.

So as you surmised early on, the _can_hold_element check is where we need to focus our attention.

In #33381 you found that flipping that check breaks a bunch of tests, so the _can_hold_element check is correct in those cases. This leaves three possibilities to consider:

1. the _can_hold_element implementation isn't what we want

2. checking _can_hold_element isn't what we want

3. [you tell me]

I want to say: check for take_split_path with a different implementation? (I have no clue what the correct one is)

or maybe change L1815 to something else?


@jbrockmendel As you can tell I am not sure about anything here.

@jbrockmendel
Member Author

As you can tell I am not sure about anything here.

No worries, we were all in this position once.

I want to say check for take_split_path with a different implementation? (have no clue what is the correct one)

That is the right answer, but for the sake of learning the methodology I'd like you to think of what the 3rd possibility to consider would be.

@ShaharNaveh
Member

That is the right answer, but for the sake of learning the methodology I'd like you to think of what the 3rd possibility to consider would be.

@jbrockmendel I have looked at dir(blk); the only things that looked somewhat reasonable are should_store and _can_consolidate. Am I in the right direction? It doesn't seem so. I have also tried crazy things like isinstance(value.dtypes.iloc[0], type(blk.dtype)).


Am I supposed to say which method of blk is the correct one?
Or is the answer something more along the lines of the isinstance check?


@jbrockmendel I need (another) hint on whether I'm in the right direction.

@jbrockmendel
Member Author

The third possibility I was looking for is that this may not be possible; it's conceivable that the current behavior is intentional, and there is no good way to get the desired behavior without breaking an existing test.

So the next thing I would do is look at the failed tests in #33381 and ask: what is the difference between all of those cases and this case?

@ShaharNaveh
Member

So the next thing I would do is look at the failed tests in #33381 and ask: what is the difference between all of those cases and this case?

@jbrockmendel AFAICT the difference is with the multi-index and categorical-index.

@jbrockmendel
Member Author

AFAICT the difference is with the multi-index and categorical-index.

Let's look at the key/indexer being used in each case. Recall in this issue we're looking at [:, 1000:] AKA (slice(None), slice(1000, None))

Browsing through the cases in #33381 (let's just look at setting with .iloc; granted most of the cases in there are with .loc, which is a little more difficult exposition-wise), I see:

df2.iloc[1, 2] =...

df.iloc[[0, 1], [0, 1]] = ...

Since there are fewer useful examples in there than I expected, I'll make up some more and we'll pretend that they showed up in the logs:

df.iloc[:4, 9] = ...

df.iloc[[4, 6, 5], :] = ...

df.iloc[np.arange(4), [4, 6]] = ...

So what is the difference between the case we care about and all of those cases? Why would it be that in our case we want to get a new dtype (backed by a new ndarray), whereas in those cases we want to set values into the existing ndarray?

@ShaharNaveh
Member

ShaharNaveh commented Apr 8, 2020

So what is the difference between the case we care about and all of those cases? Why would it be that in our case we want to get a new dtype (backed by a new ndarray), whereas in those cases we want to set values into the existing ndarray?

Because in our example we look for a range of indexes? (RangeIndex)

EDIT:

or that we have a null slice?


I feel like I am missing something very basic here.

@jbrockmendel
Member Author

or that we have a null slice?

bingo!

In this case we are saying "set these entire columns" while in the other cases we are saying "set a subset of rows in these columns".

We're almost there. How do we change the behavior in the new cases without breaking the old cases?
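For concreteness, a "null slice" is the bare : — slice(None, None, None). A small sketch of the distinction (mimicking what pandas' internal com.is_null_slice helper checks; not the actual implementation):

```python
def is_null_slice(obj) -> bool:
    """Return True for the bare `:`, i.e. slice(None, None, None)."""
    return (
        isinstance(obj, slice)
        and obj.start is None
        and obj.stop is None
        and obj.step is None
    )

# df.iloc[:, 1000:] -> the row part is a null slice: entire columns are set.
whole_columns = (slice(None), slice(1000, None))
assert is_null_slice(whole_columns[0])

# df.iloc[:4, 9] -> the row part is not a null slice: only some rows are set.
partial_rows = (slice(None, 4), 9)
assert not is_null_slice(partial_rows[0])
```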

@ShaharNaveh
Member

We're almost there. How do we change the behavior in the new cases without breaking the old cases?

@jbrockmendel I have tried some things in the past hour; this is the solution with the fewest errors (from what I tried):

        # if there is only one block/type, still have to take split path
        # unless the block is one-dimensional or it can hold the value                                        
        if not take_split_path and self.obj._mgr.blocks:
            (blk,) = self.obj._mgr.blocks
            if 1 < blk.ndim:  # in case of dict, keys are indices
                val = list(value.values()) if isinstance(value, dict) else value

                if isinstance(indexer, tuple) and any( 
                    com.is_null_slice(idx) for idx in indexer
                ):   
                    take_split_path = blk._can_hold_element(val)
                else:
                    take_split_path = not blk._can_hold_element(val)

@jbrockmendel
Member Author

if isinstance(indexer, tuple) and any(com.is_null_slice(idx) for idx in indexer):

good! do all the tests pass with this?

I haven't tried it yet, but my guess is going to be

if not take_split_path and self.obj._mgr.blocks:
    (blk,) = self.obj._mgr.blocks
    if 1 < blk.ndim:
        take_split_path = not blk._can_hold_element(val)
        if not take_split_path and isinstance(indexer, tuple) and com.is_null_slice(indexer[0]):
            take_split_path = True

So the main difference is checking only the first element of the tuple for being a null slice, not any element. Can you tell me why?

@ShaharNaveh
Member

ShaharNaveh commented Apr 8, 2020

Can you tell me why?

@jbrockmendel My guess is that indexer[0] is our null slice, so that means the whole dataframe (all columns)?

@jbrockmendel
Member Author

Important background: to a very rough approximation, we can think of a DataFrame as a dict of Series with the keys being df.columns. So under the hood, the data in any given column is a single array. But the data in a single row contains elements from many columns, which may be many arrays.

So when we do df.iloc[:, foo] = bar we can say "cleanly swap out some new arrays for the old ones" whereas we cannot do that with df.iloc[foo, :] = bar
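The asymmetry can be seen with a toy frame (a sketch; hypothetical column names, behavior as of pandas around the time of this thread):

```python
import numpy as np
import pandas as pd

# Swapping out an entire column can replace the backing array,
# so the column's dtype is free to change.
df = pd.DataFrame({"a": np.zeros(3, dtype=np.float64)})
df["a"] = df["a"].astype(np.float32)
assert df["a"].dtype == np.float32

# Setting a subset of rows writes into the existing array instead,
# so the assigned value is cast to the column's dtype.
df2 = pd.DataFrame({"b": np.zeros(3, dtype=np.float64)})
df2.iloc[0, 0] = np.float32(1.0)
assert df2["b"].dtype == np.float64
```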
