Result of function duplicated is incorrect #1987

@prutskov

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
  • Modin version: 0.8.0
  • Python version: 3.8

Describe the problem

The result of DataFrame.duplicated differs between Modin and pandas for DataFrames that have more than 32 rows.

Source code / logs

import pandas
import modin.pandas as pd

data = [[5, 's0'], [3, 's1'], [3, 's2'], [5, 's0'], [6, 's5'],
        [5, 's0'], [3, 's1'], [3, 's2'], [5, 's0'], [6, 's5'],
        [5, 's0'], [3, 's1'], [3, 's2'], [5, 's0'], [6, 's5'],
        [5, 's0'], [3, 's1'], [3, 's2'], [5, 's0'], [6, 's5'],
        [5, 's0'], [3, 's1'], [3, 's2'], [5, 's0'], [6, 's5'],
        [5, 's0'], [3, 's1'], [3, 's2'], [5, 's0'], [6, 's5'],
        [5, 's0'], [3, 's1'], [3, 's2']]
pdf = pandas.DataFrame(data).duplicated()
mdf = pd.DataFrame(data).duplicated()

print(f'pandas res\n {pdf}')
print(f'modin res\n {mdf}')

Result

pandas res
...
32     True
dtype: bool

modin res
 ...
32    False
dtype: bool

It looks like the function is applied to each partition separately, so rows that duplicate a row in an earlier partition are not flagged.
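
To illustrate the suspected cause, here is a minimal sketch in plain pandas that mimics per-partition evaluation. The 32-row chunk size and the splitting helper are assumptions inferred from the output above, not Modin's actual partitioning code.

```python
import pandas

# Same 33 rows as in the reproducer above.
pattern = [[5, "s0"], [3, "s1"], [3, "s2"], [5, "s0"], [6, "s5"]]
data = (pattern * 7)[:33]
df = pandas.DataFrame(data)

# Reference result: duplicates detected across the whole frame.
expected = df.duplicated()

# Assumption: rows are split into partitions of 32 rows each
# (hypothetical value chosen to match the observed behaviour).
chunk_size = 32
chunks = [df.iloc[i:i + chunk_size] for i in range(0, len(df), chunk_size)]

# Calling duplicated() on each chunk independently loses knowledge
# of rows seen in earlier chunks.
per_partition = pandas.concat([chunk.duplicated() for chunk in chunks])

print(expected.iloc[32])       # whole-frame result for row 32
print(per_partition.iloc[32])  # per-chunk result for row 32
```

Under this assumption, row 32 sits alone in the second chunk, so it has nothing to be compared against there and is reported as not duplicated, which matches the Modin output.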

Labels

bug 🦗 (Something isn't working)