Behavior of pandas.DataFrame.duplicated() #8505

Closed
@wikiped

While trying to get a handle on duplicated records, I stumbled upon behavior that led me to conclude that .duplicated(take_last=True) seems to take the first of the duplicate rows, while .duplicated(take_last=False) takes the last rows.
Here is an illustration:

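import pandas as pd

# Rows 0 and 3 share (key1, key2) = (1, 2); rows 1, 4, 6, 7, 8 share (2, 2); rows 2 and 5 are unique.
data = {'key1': [1, 2, 3, 1, 2, 3, 2, 2, 2],
        'key2': [2, 2, 1, 2, 2, 2, 2, 2, 2],
        'dup': ['d1_1', 'd2_1', 'n_d', 'd1_2', 'd2_2', 'n_d', 'd2_3', 'd2_4', 'd2_5']}
df = pd.DataFrame(data, columns=['key1', 'key2', 'dup'])
print(df)
   key1  key2   dup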
0     1     2  d1_1
1     2     2  d2_1
2     3     1   n_d
3     1     2  d1_2
4     2     2  d2_2
5     3     2   n_d
6     2     2  d2_3
7     2     2  d2_4
8     2     2  d2_5

Now, with take_last=False, it would be fair to expect the first occurrences (d1_1 and d2_1) to be in the output, but this is not the case:

c1 = df.duplicated(['key1', 'key2'], take_last=False)
df[c1]
   key1  key2   dup
3     1     2  d1_2
4     2     2  d2_2
6     2     2  d2_3
7     2     2  d2_4
8     2     2  d2_5

And take_last=True outputs the first rows:

c2 = df.duplicated(['key1', 'key2'], take_last=True)
df[c2]
   key1  key2   dup
0     1     2  d1_1
1     2     2  d2_1
4     2     2  d2_2
6     2     2  d2_3
7     2     2  d2_4
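
For readers on later pandas releases, the two selections above can be sketched with the `keep` argument, which replaced `take_last` there. The mapping used here (`keep='first'` for `take_last=False`, `keep='last'` for `take_last=True`) is an assumption about equivalence, not something tested against 0.14.1. The sketch suggests that the mask flags every occurrence except the one that is kept, which would account for both outputs.

```python
import pandas as pd

data = {'key1': [1, 2, 3, 1, 2, 3, 2, 2, 2],
        'key2': [2, 2, 1, 2, 2, 2, 2, 2, 2],
        'dup': ['d1_1', 'd2_1', 'n_d', 'd1_2', 'd2_2', 'n_d', 'd2_3', 'd2_4', 'd2_5']}
df = pd.DataFrame(data, columns=['key1', 'key2', 'dup'])
keys = ['key1', 'key2']

# Assumed counterpart of take_last=False: the first occurrence is kept,
# so every later occurrence is flagged as a duplicate.
flag_keep_first = df.duplicated(keys, keep='first')

# Assumed counterpart of take_last=True: the last occurrence is kept,
# so every earlier occurrence is flagged as a duplicate.
flag_keep_last = df.duplicated(keys, keep='last')

print(df[flag_keep_first])
print(df[flag_keep_last])
```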

Unless I am misunderstanding the doc:

take_last : boolean, default False
    Take the last observed row in a row. Defaults to the first row
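
One reading of this docstring that fits the outputs above is that the "taken" row is the one treated as the original, and so it is the one row in each group that is not flagged. A minimal sketch of that reading, assuming a later pandas release where `keep` is available on both `duplicated` and `drop_duplicates`:

```python
import pandas as pd

data = {'key1': [1, 2, 3, 1, 2, 3, 2, 2, 2],
        'key2': [2, 2, 1, 2, 2, 2, 2, 2, 2],
        'dup': ['d1_1', 'd2_1', 'n_d', 'd1_2', 'd2_2', 'n_d', 'd2_3', 'd2_4', 'd2_5']}
df = pd.DataFrame(data, columns=['key1', 'key2', 'dup'])
keys = ['key1', 'key2']

# Rows flagged True are the ones that would be dropped.
mask = df.duplicated(keys, keep='first')

# Inverting the mask leaves one "taken" row per key combination.
kept = df[~mask]

# drop_duplicates with the same arguments should select the same rows.
deduped = df.drop_duplicates(subset=keys, keep='first')

print(kept.index.tolist())
print(deduped.index.tolist())
```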

it does feel that .duplicated() could be improved by fixing this behavior.
Additionally, it could gain one more parameter:

take_all : boolean, default False
    Take all observed rows. Overrides take_last

or alternatively a keyword parameter:

take : 'last', 'first', 'all', default 'last'
    Sets which observed duplicated rows to take. Default: take last observed rows.
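
The proposed `take` keyword does not exist in pandas. The sketch below is one possible interpretation, written as a hypothetical helper named `duplicated_take`. It assumes that `'first'` and `'last'` should flag only the first or last row of each duplicated group and that `'all'` should flag every row in such a group. It also assumes a later pandas release where `duplicated` accepts `keep`.

```python
import pandas as pd


def duplicated_take(frame, subset, take='last'):
    """Hypothetical helper illustrating the proposed `take` keyword."""
    # keep=False flags every row that belongs to a group with more than one row.
    in_dup_group = frame.duplicated(subset, keep=False)
    if take == 'all':
        return in_dup_group
    if take in ('first', 'last'):
        # duplicated(keep=take) flags everything except the kept row,
        # so negating it isolates the first (or last) row of each group.
        return in_dup_group & ~frame.duplicated(subset, keep=take)
    raise ValueError("take must be 'first', 'last' or 'all'")


data = {'key1': [1, 2, 3, 1, 2, 3, 2, 2, 2],
        'key2': [2, 2, 1, 2, 2, 2, 2, 2, 2],
        'dup': ['d1_1', 'd2_1', 'n_d', 'd1_2', 'd2_2', 'n_d', 'd2_3', 'd2_4', 'd2_5']}
df = pd.DataFrame(data, columns=['key1', 'key2', 'dup'])
keys = ['key1', 'key2']

print(df[duplicated_take(df, keys, take='first')])
print(df[duplicated_take(df, keys, take='last')])
print(df[duplicated_take(df, keys, take='all')])
```

Under these assumptions, `take='first'` selects the d1_1 and d2_1 rows that the report expected to see.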

Right now, getting all observed duplicates requires combining the two conditions above: c1 | c2.
This was done with pandas 0.14.1.
Thank you.
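
The `c1 | c2` workaround can be sketched as follows. This was not run on 0.14.1; it assumes a later pandas release where `keep` stands in for `take_last`. In those releases `keep=False` is documented as marking all duplicates, so it should produce the same mask in a single call.

```python
import pandas as pd

data = {'key1': [1, 2, 3, 1, 2, 3, 2, 2, 2],
        'key2': [2, 2, 1, 2, 2, 2, 2, 2, 2],
        'dup': ['d1_1', 'd2_1', 'n_d', 'd1_2', 'd2_2', 'n_d', 'd2_3', 'd2_4', 'd2_5']}
df = pd.DataFrame(data, columns=['key1', 'key2', 'dup'])
keys = ['key1', 'key2']

# The two conditions from the report, expressed with keep (assumed equivalents).
c1 = df.duplicated(keys, keep='first')
c2 = df.duplicated(keys, keep='last')

# A row is part of a duplicated group if either condition flags it.
all_dups = c1 | c2

# Single-call alternative: flag every member of every duplicated group.
all_dups_direct = df.duplicated(keys, keep=False)

print(df[all_dups])
print(all_dups.equals(all_dups_direct))
```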

Labels: Bug, Indexing, Reshaping
