
BUG: iloc setting columns not taking effect #33198

Open
jbrockmendel opened this issue Apr 1, 2020 · 34 comments
Labels
Bug · Indexing (related to indexing on series/frames, not to indexes themselves)


@jbrockmendel
Member

import numpy as np
import pandas as pd

arr = np.random.randn(10 ** 6).reshape(500, 2000).astype(np.float64)
df = pd.DataFrame(arr)
df.iloc[:, 1000:] = df.iloc[:, 1000:].astype(np.float32)

>>> df.dtypes.value_counts()
float64    2000
dtype: int64

I would expect 1000 float64 columns and 1000 float32 columns. I get the expected behavior if I set a single column before trying to set all 1000:

df = pd.DataFrame(arr)
df[1000] = df[1000].astype(np.float32)
df.iloc[:, 1000:] = df.iloc[:, 1000:].astype(np.float32)

>>> df.dtypes.value_counts()
float32    1000
float64    1000
dtype: int64
@jbrockmendel added the Bug and Indexing labels on Apr 2, 2020
@jbrockmendel
Member Author

@MomIsBestFriend this is non-trivial, might be worth taking a shot at

@ShaharNaveh
Member

@MomIsBestFriend this is non-trivial, might be worth taking a shot at

@jbrockmendel I will definitely give it a shot. Thank you for giving me things I might be able to work on!

@ShaharNaveh
Member

@jbrockmendel I have tried to solve this, but I am nowhere near close to solving it, so I guess this one really is non-trivial 😞

But I did learn a lot about pdb while trying to solve this, so I guess that wasn't all a waste of time :)

@jbrockmendel
Member Author

how about let's walk through the process [i would use] to address this?

The line that isn't working as expected is df.iloc[:, 1000:] = df.iloc[:, 1000:].astype(np.float32). What is the actual function call that this is doing?

@ShaharNaveh
Member

ShaharNaveh commented Apr 7, 2020

how about let's walk through the process [i would use] to address this?

The line that isn't working as expected is df.iloc[:, 1000:] = df.iloc[:, 1000:].astype(np.float32). What is the actual function call that this is doing?

This takes a slice of df, casts it to float32 dtype, and then assigns it back into df at the position of the taken slice?


Edit:
this calls __setitem__

@jbrockmendel
Member Author

yes. more specifically, this is equivalent to

key = (slice(None), slice(1000, None))
value = df.iloc[:, 1000:].astype(np.float32)

df.iloc.__setitem__(key, value)

You can check that value here has the expected dtypes. So let's look at the __setitem__ call. Where inside that call should we look first?
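That check can be done directly. A minimal sketch reconstructing the frame from the original report and inspecting value before the set happens:

```python
import numpy as np
import pandas as pd

# Reconstruct the frame from the original report.
arr = np.random.randn(10 ** 6).reshape(500, 2000).astype(np.float64)
df = pd.DataFrame(arr)

# The __setitem__ call above is equivalent to this key/value pair.
key = (slice(None), slice(1000, None))
value = df.iloc[:, 1000:].astype(np.float32)

# `value` already has the expected dtype before the set happens.
assert (value.dtypes == np.float32).all()
```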

@ShaharNaveh
Member

ShaharNaveh commented Apr 7, 2020

I have looked mainly at

def __setitem__(self, key, value):
if isinstance(key, tuple):
key = tuple(com.apply_if_callable(x, self.obj) for x in key)
else:
key = com.apply_if_callable(key, self.obj)
indexer = self._get_setitem_indexer(key)
self._has_valid_setitem_indexer(key)
iloc = self if self.name == "iloc" else self.obj.iloc
iloc._setitem_with_indexer(indexer, value)

Tried to have things like:

 key = com.apply_if_callable(key, self.obj, dtype=self.obj.dtypes.dtype) 

@jbrockmendel
Member Author

(stop me if you don't want to engage in this exercise. I think it may be useful)

that's the right place to look. So setting self = df.iloc and walking through this, 658 is True, 659 is going to end up returning key unchanged, and 662 is going to end up giving indexer = key. 665 gives iloc = self, so let's look at _setitem_with_indexer

Looking at _setitem_with_indexer, where do things start to go wrong?

@ShaharNaveh
Member

(stop me if you don't want to engage in this exercise. I think it may be useful)

It's very useful!

that's the right place to look. So setting self = df.iloc and walking through this, 658 is True, 659 is going to end up returning key unchanged, and 662 is going to end up giving indexer = key. 665 gives iloc = self, so let's look at _setitem_with_indexer

Looking at _setitem_with_indexer, where do things start to go wrong?

I don't know, can you elaborate?

@jbrockmendel
Member Author

So looking through _setitem_with_indexer, from 1549-1574 it's not doing any setting, just figuring out take_split_path. What is take_split_path going to end up being?

@ShaharNaveh
Member

So looking through _setitem_with_indexer, from 1549-1574 it's not doing any setting, just figuring out take_split_path. What is take_split_path going to end up being?

False
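A rough intuition for why it comes out False: the frame is backed by a single float64 block, and a float64 block reports that it can hold float32 values, since that cast is lossless. A minimal NumPy sketch of the casting rule involved (an approximation of what the _can_hold_element check effectively decides here, not the actual pandas code path):

```python
import numpy as np

# A float64 container can safely absorb float32 values...
assert np.can_cast(np.float32, np.float64)

# ...but not the other way around under "safe" casting rules,
# since float64 -> float32 loses precision.
assert not np.can_cast(np.float64, np.float32)
```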

@ShaharNaveh
Member

ShaharNaveh commented Apr 7, 2020

@jbrockmendel removing the not at L1562 solves this issue!
I guess that in order to not break other things, there needs to be some extra logic.

@jbrockmendel
Member Author

removing the not at L1562 solves this issue!
I guess that in order to not break other things, there needs to be some extra logic.

I think you're right on both counts. My best guess as to the underlying issue is that the Block checks if it can hold the values being set, finds that it can, and so sets them inplace.

How should we proceed?

@ShaharNaveh
Member

removing the not at L1562 solves this issue!
I guess that in order to not break other things, there needs to be some extra logic.

I think you're right on both counts. My best guess as to the underlying issue is that the Block checks if it can hold the values being set, finds that it can, and so sets them inplace.

How should we proceed?

I think I got it

take_split_path = blk._can_hold_element(val) or self.obj._is_mixed_type

seems not to break anything (it passes pandas/tests/frame/methods/**/*) and still gives the expected output of

float32    1000
float64    1000
dtype: int64

@jbrockmendel
Member Author

seems not to break anything (it passes pandas/tests/frame/methods/**/*),

I'll be pretty surprised if that passes the full test suite, but give it a try

@ShaharNaveh
Member

removing the not at L1562 solves this issue!
I guess that in order to not break other things, there needs to be some extra logic.

I think you're right on both counts. My best guess as to the underlying issue is that the Block checks if it can hold the values being set, finds that it can, and so sets them inplace.

How should we proceed?

@jbrockmendel I don't know how to proceed from here.

I am trying to learn how to fix bugs in general and what a good way of thinking is when trying to solve them. How would you proceed from here?

@jbrockmendel
Member Author

So in _setitem_with_indexer we found that take_split_path is False, then L1575-1632 leaves our indexer unchanged (worth double-checking me on this)

item_labels gets defined on L1642, then since take_split_path is False all we have left to look at is L1781-L1816. What part of that chunk of code is causing the problem?

@ShaharNaveh
Member

ShaharNaveh commented Apr 7, 2020

So in _setitem_with_indexer we found that take_split_path is False, then L1575-1632 leaves our indexer unchanged (worth double-checking me on this)

item_labels gets defined on L1642, then since take_split_path is False all we have left to look at is L1781-L1816. What part of that chunk of code is causing the problem?

AFAICT L1814-L1816, but I am really not sure about that.


EDIT:

Maybe it's L1979 that returns before the set happens on L1815?

@jbrockmendel
Member Author

AFAICT L1814-L1816, but I am really not sure about that.

Until you get an intuition for this part of the code, trial and error.

Maybe it's L1979 that returns before the set happens on L1815?

Do we go through that path? I expected the condition on L1786-1795 wouldn't hold

@ShaharNaveh
Member

ShaharNaveh commented Apr 7, 2020

AFAICT L1814-L1816, but I am really not sure about that.

Until you get an intuition for this part of the code, trial and error.

Thank you for teaching me; I really don't take it for granted.

Maybe it's L1979 that returns before the set happens on L1815?

Do we go through that path? I expected the condition on L1786-1795 wouldn't hold

For some reason I thought we did; I just debugged it, and the if statement evaluates to False.

@jbrockmendel
Member Author

For some reason I thought we did; I just debugged it, and the if statement evaluates to False.

OK, so we're back at L1814-1816. Spoiler: 1815 is the relevant line. It ends up doing block.values[indexer] = value on the one block in self.obj._mgr (and that __setitem__ call works correctly).
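The silent cast on that line can be reproduced with plain NumPy (a sketch; the shapes mirror the single float64 block backing the frame, which stores columns as rows, but are otherwise illustrative):

```python
import numpy as np

# The lone float64 block backing the frame.
block_values = np.zeros((2000, 500), dtype=np.float64)

# Setting float32 values into it "works", but ndarray.__setitem__
# casts the values to the array's dtype -- the dtype never changes.
new_vals = np.ones((1000, 500), dtype=np.float32)
block_values[1000:, :] = new_vals

assert block_values.dtype == np.float64  # still float64; the cast was silent
```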

From this we deduce that we'd be better off going down the take_split_path route.

So as you surmised early on, the _can_hold_element check is where we need to focus our attention.

In #33381 you found that flipping that check breaks a bunch of tests, so the _can_hold_element check is correct in those cases. This leaves three possibilities to consider:

  1. the _can_hold_element implementation isn't what we want
  2. checking _can_hold_element isn't what we want
  3. [you tell me]

@ShaharNaveh
Member

For some reason I thought we did; I just debugged it, and the if statement evaluates to False.

OK, so we're back at L1814-1816. Spoiler: 1815 is the relevant line. It ends up doing block.values[indexer] = value on the one block in self.obj._mgr (and that __setitem__ call works correctly).

From this we deduce that we'd be better off going down the take_split_path route.

So as you surmised early on, the _can_hold_element check is where we need to focus our attention.

In #33381 you found that flipping that check breaks a bunch of tests, so the _can_hold_element check is correct in those cases. This leaves three possibilities to consider:

1. the _can_hold_element implementation isn't what we want

2. checking _can_hold_element isn't what we want

3. [you tell me]

I want to say: check for take_split_path with a different implementation? (I have no clue what the correct one is)

or maybe change L1815 to something else?


@jbrockmendel As you can tell I am not sure about anything here.

@jbrockmendel
Member Author

As you can tell I am not sure about anything here.

No worries, we were all in this position once.

I want to say check for take_split_path with a different implementation? (have no clue what is the correct one)

That is the right answer, but for the sake of learning the methodology I'd like you to think of what the 3rd possibility to consider would be.

@ShaharNaveh
Member

That is the right answer, but for the sake of learning the methodology I'd like you to think of what the 3rd possibility to consider would be.

@jbrockmendel I have looked at dir(blk); the only things that looked somewhat reasonable are should_store and _can_consolidate. Am I in the right direction? It doesn't seem so. I have also tried crazy things like isinstance(value.dtypes.iloc[0], type(blk.dtype)).


Am I supposed to say which method of blk is the correct one?
Or is the answer something more along the lines of the isinstance check?


@jbrockmendel I need (another) hint on whether I'm in the right direction.

@jbrockmendel
Member Author

The third possibility I was looking for is that this may not be possible; it's conceivable that the current behavior is intentional, and there is no good way to get the desired behavior without breaking an existing test.

So the next thing I would do is look at the failed tests in #33381 and ask: what is the difference between all of those cases and this case?

@ShaharNaveh
Member

So the next thing I would do is look at the failed tests in #33381 and ask: what is the difference between all of those cases and this case?

@jbrockmendel AFAICT the difference is with the multi-index and categorical-index.

@jbrockmendel
Member Author

AFAICT the difference is with the multi-index and categorical-index.

Let's look at the key/indexer being used in each case. Recall in this issue we're looking at [:, 1000:] AKA (slice(None), slice(1000, None))

Browsing through the cases in #33381 (let's just look at setting with .iloc; granted most of the cases in there are with .loc, which is a little more difficult exposition-wise), I see:

df2.iloc[1, 2] =...

df.iloc[[0, 1], [0, 1]] = ...

Since there are fewer useful examples in there than I expected, I'll make up some more and we'll pretend that they showed up in the logs:

df.iloc[:4, 9] = ...

df.iloc[[4, 6, 5], :] = ...

df.iloc[np.arange(4), [4, 6]] = ...

So what is the difference between the case we care about and all of those cases? Why would it be that in our case we want to get a new dtype (backed by a new ndarray), whereas in those cases we want to set values into the existing ndarray?

@ShaharNaveh
Member

ShaharNaveh commented Apr 8, 2020

So what is the difference between the case we care about and all of those cases? Why would it be that in our case we want to get a new dtype (backed by a new ndarray), whereas in those cases we want to set values into the existing ndarray?

Because in our example we look for a range of indexes? (RangeIndex)

EDIT:

or that we have a null slice?


I feel like I am missing something very basic here.

@jbrockmendel
Member Author

or that we have a null slice?

bingo!

In this case we are saying "set these entire columns" while in the other cases we are saying "set a subset of rows in these columns".

We're almost there. How do we change the behavior in the new cases without breaking the old cases?
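For concreteness, a "null slice" is the bare : — slice(None, None, None). A small sketch of the distinction (mimicking what pandas' internal com.is_null_slice helper checks; not the actual implementation):

```python
def is_null_slice(obj) -> bool:
    """Return True for the bare `:`, i.e. slice(None, None, None)."""
    return (
        isinstance(obj, slice)
        and obj.start is None
        and obj.stop is None
        and obj.step is None
    )

# df.iloc[:, 1000:] -> the row part is a null slice: entire columns are set.
whole_columns = (slice(None), slice(1000, None))
assert is_null_slice(whole_columns[0])

# df.iloc[:4, 9] -> the row part is not a null slice: only some rows are set.
partial_rows = (slice(None, 4), 9)
assert not is_null_slice(partial_rows[0])
```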

@ShaharNaveh
Member

We're almost there. How do we change the behavior in the new cases without breaking the old cases?

@jbrockmendel I have tried some things in the past hour; this is the solution with the fewest errors (from what I tried):

        # if there is only one block/type, still have to take split path
        # unless the block is one-dimensional or it can hold the value                                        
        if not take_split_path and self.obj._mgr.blocks:
            (blk,) = self.obj._mgr.blocks
            if 1 < blk.ndim:  # in case of dict, keys are indices
                val = list(value.values()) if isinstance(value, dict) else value

                if isinstance(indexer, tuple) and any( 
                    com.is_null_slice(idx) for idx in indexer
                ):   
                    take_split_path = blk._can_hold_element(val)
                else:
                    take_split_path = not blk._can_hold_element(val)

@jbrockmendel
Member Author

if isinstance(indexer, tuple) and any(com.is_null_slice(idx) for idx in indexer):

good! do all the tests pass with this?

I haven't tried it yet, but my guess is going to be

if not take_split_path and self.obj._mgr.blocks:
    (blk,) = self.obj._mgr.blocks
    if 1 < blk.ndim:
        take_split_path = not blk._can_hold_element(val)
        if not take_split_path and isinstance(indexer, tuple) and com.is_null_slice(indexer[0]):
            take_split_path = True

So the main difference is checking only the first element of the tuple for being a null slice, not any element. Can you tell me why?

@ShaharNaveh
Member

ShaharNaveh commented Apr 8, 2020

Can you tell me why?

@jbrockmendel My guess is that indexer[0] is our null slice, so that means the whole dataframe (all columns)?

@jbrockmendel
Member Author

Important background: to a very rough approximation, we can think of a DataFrame as a dict of Series with the keys being df.columns. So under the hood, the data in any given column is a single array. But the data in a single row contains elements from many columns, which may be many arrays.

So when we do df.iloc[:, foo] = bar we can say "cleanly swap out some new arrays for the old ones" whereas we cannot do that with df.iloc[foo, :] = bar
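The asymmetry can be seen with a toy frame (a sketch; hypothetical column names, behavior as of pandas around the time of this thread):

```python
import numpy as np
import pandas as pd

# Swapping out an entire column can replace the backing array,
# so the column's dtype is free to change.
df = pd.DataFrame({"a": np.zeros(3, dtype=np.float64)})
df["a"] = df["a"].astype(np.float32)
assert df["a"].dtype == np.float32

# Setting a subset of rows writes into the existing array instead,
# so the assigned value is cast to the column's dtype.
df2 = pd.DataFrame({"b": np.zeros(3, dtype=np.float64)})
df2.iloc[0, 0] = np.float32(1.0)
assert df2["b"].dtype == np.float64
```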
