Description
While trying to get a handle on duplicated records, I stumbled upon behavior that led me to conclude that .duplicated(take_last=True) seems to take the first of the duplicate rows, while .duplicated(take_last=False) takes the last rows.
Here is an illustration:
import pandas as pd
data = {'key1': [1, 2, 3, 1, 2, 3, 2, 2, 2],
        'key2': [2, 2, 1, 2, 2, 2, 2, 2, 2],
        'dup': ['d1_1', 'd2_1', 'n_d', 'd1_2', 'd2_2', 'n_d', 'd2_3', 'd2_4', 'd2_5']}
df = pd.DataFrame(data,columns=['key1','key2','dup'])
print(df)
   key1  key2   dup
0     1     2  d1_1
1     2     2  d2_1
2     3     1   n_d
3     1     2  d1_2
4     2     2  d2_2
5     3     2   n_d
6     2     2  d2_3
7     2     2  d2_4
8     2     2  d2_5
Now, with take_last=False, it would be fair to expect the first occurrences (d1_1 and d2_1) to be in the output, but this is not the case:
c1 = df.duplicated(['key1', 'key2'], take_last=False)
df[c1]
   key1  key2   dup
3     1     2  d1_2
4     2     2  d2_2
6     2     2  d2_3
7     2     2  d2_4
8     2     2  d2_5
And take_last=True outputs the first rows:
c2 = df.duplicated(['key1', 'key2'], take_last=True)
df[c2]
   key1  key2   dup
0     1     2  d1_1
1     2     2  d2_1
4     2     2  d2_2
6     2     2  d2_3
7     2     2  d2_4
Unless I am misunderstanding the doc:
take_last : boolean, default False
Take the last observed row in a row. Defaults to the first row
it does feel that .duplicated() could be improved by fixing this behavior.
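For what it's worth, both outputs above can be reproduced by a small pure-Python sketch under one assumption: the mask flags every row in a key group except the one being "taken" (the first occurrence for take_last=False, the last for take_last=True). This is only an illustration of that reading, not pandas' actual implementation, and duplicated_mask is a hypothetical helper name.

```python
# Pure-Python sketch (NOT pandas' implementation): flag every row whose key
# has already been seen while scanning in the chosen direction.
key1 = [1, 2, 3, 1, 2, 3, 2, 2, 2]
key2 = [2, 2, 1, 2, 2, 2, 2, 2, 2]
keys = list(zip(key1, key2))


def duplicated_mask(keys, take_last=False):
    """Hypothetical helper: True where a row repeats the key of the taken row."""
    n = len(keys)
    # Scan backwards when the last occurrence is the one to leave unflagged.
    order = range(n - 1, -1, -1) if take_last else range(n)
    seen = set()
    mask = [False] * n
    for i in order:
        if keys[i] in seen:
            mask[i] = True
        else:
            seen.add(keys[i])
    return mask


c1 = duplicated_mask(keys, take_last=False)
c2 = duplicated_mask(keys, take_last=True)
flagged_first = [i for i, flag in enumerate(c1) if flag]
flagged_last = [i for i, flag in enumerate(c2) if flag]
print(flagged_first)  # [3, 4, 6, 7, 8]
print(flagged_last)   # [0, 1, 4, 6, 7]
```

The flagged indices match the rows shown for take_last=False and take_last=True respectively.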
Additionally, it could have one more parameter:
take_all : boolean, default False
Take all observed rows. Overrides take_last
or alternatively a keyword parameter:
take : 'last', 'first', 'all', default 'last'
Sets which observed duplicated rows to take. Default: take last observed rows.
Right now, getting all observed duplicates requires combining the two conditions above: c1 | c2.
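To make the requested "take all" result concrete, here is a minimal pure-Python sketch of what such an option might return for the data above: every row whose key combination occurs more than once. The name all_dups is made up for illustration, and this is an assumption about the proposed semantics, not an existing pandas feature.

```python
from collections import Counter

# Same key columns as in the example DataFrame above.
key1 = [1, 2, 3, 1, 2, 3, 2, 2, 2]
key2 = [2, 2, 1, 2, 2, 2, 2, 2, 2]
keys = list(zip(key1, key2))

# Count how often each (key1, key2) pair occurs.
counts = Counter(keys)

# Hypothetical "take all" result: indices of every row in a group of size > 1.
all_dups = [i for i, k in enumerate(keys) if counts[k] > 1]
print(all_dups)  # [0, 1, 3, 4, 6, 7, 8]
```

This is the same set of rows as the union of the two outputs shown earlier (indices 3, 4, 6, 7, 8 and 0, 1, 4, 6, 7).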
This was done with pandas 0.14.1.
Thank you.