Incorrect result of nlargest function #1734

Closed
dchigarev opened this issue Jul 15, 2020 · 0 comments · Fixed by #1727
@dchigarev
Collaborator

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10
  • Modin version (modin.__version__): 0.7.3+200.g8e7a5682
  • Python version: 3.7.5
  • Code we can use to reproduce:
if __name__ == "__main__":
    import modin.pandas as pd
    import pandas

    # test case from test_nlargest
    data = {
        "population": [
            59000000,
            65000000,
            434000,
            434000,
            434000,
            337000,
            11300,
            11300,
            11300,
        ],
        "GDP": [1937894, 2583560, 12011, 4520, 12128, 17036, 182, 38, 311],
        "alpha-2": ["IT", "FR", "MT", "MV", "BN", "IS", "NR", "TV", "AI"],
    }
    index = [
        "Italy",
        "France",
        "Malta",
        "Maldives",
        "Brunei",
        "Iceland",
        "Nauru",
        "Tuvalu",
        "Anguilla",
    ]
    modin_df = pd.DataFrame(data=data, index=index)
    pandas_df = pandas.DataFrame(data=data, index=index)

    md_res = modin_df.nlargest(3, "population")
    pd_res = pandas_df.nlargest(3, "population")

    print("pd_res:\n", pd_res, sep="")
    print("\nmd_res_partitions:\n", md_res._query_compiler._modin_frame._partitions[0][0].to_pandas(), sep="")
    print("\nmd_res:\n", md_res, sep="") # Exception: IndexError('positional indexers are out-of-bounds')
Output:
pd_res:
        population      GDP alpha-2
France    65000000  2583560      FR
Italy     59000000  1937894      IT
Malta       434000    12011      MT

md_res_partitions:
        population      GDP alpha-2
France    65000000  2583560      FR
Italy     59000000  1937894      IT
Malta       434000    12011      MT

md_res:
distributed.worker - WARNING -  Compute Failed
Function:  apply_list_of_funcs
args:      ([[b"\x80\x04\x95\xb0\x03\x00\x00\x00\x00\x00\x00\x8c\x17cloudpickle.cloudpickle\x94\x8c\x0e_fill_function\x94\x93\x94(h\x00\x8c\x0f_make_skel_func\x94\x93\x94h\x00\x8c\r_builtin_type\x94\x93\x94\x8c\x08CodeType\x94\x85\x94R\x94(K\x01K\x00K\x01K\x05K\x13C\x14t\x00\xa0\x01|\x00j\x02\x88\x01\x88\x00f\x02\x19\x00\xa1\x01S\x00\x94N\x85\x94\x8c\x06pandas\x94\x8c\tDataFrame\x94\x8c\x04iloc\x94\x87\x94\x8c\x02df\x94\x85\x94\x8cZc:\\users\\dchigare\\desktop\\repos\\modin\\modin\\engines\\dask\\pandas_on_dask\\frame\\partition.py\x94\x8c\x08<lambda>\x94KcC\x00\x94\x8c\x0bcol_indices\x94\x8c\x0brow_indices\x94\x86\x94)t\x94R\x94K\x02}\x94(\x8c\x0b__package__\x94\x8c'modin.engines.dask.pandas_on_dask.frame\x94\x8c\x08__name__\x94\x8c1modin.engines.dask.pandas_on_dask.frame.partition\x94\x8c\x08__file__\x94h\x12u\x87\x94R\x94}\x94(\x8c\x07globals\x94}\x94h\x0ch\x00\x8c\tsubimport\x94\x93\x94h\x0c\x85\x94R\x94s\x8c\x08defaults\x94N\x8c\x04dict\x94}\x94\x8c\x0eclosure_values\x94]\x94(\x8c\x15numpy.c
kwargs:    {}
Exception: IndexError('positional indexers are out-of-bounds')

Traceback (most recent call last):
  File "C:\Users\dchigare\Desktop\REPOS\TESTS\reprod.py", line 39, in <module>
    print("\nmd_res:\n", md_res, sep="") # Exception: IndexError('positional indexers are out-of-bounds')
  File "c:\users\dchigare\desktop\repos\modin\modin\pandas\base.py", line 3465, in __str__
    return repr(self)
  File "c:\users\dchigare\desktop\repos\modin\modin\pandas\dataframe.py", line 162, in __repr__
    result = repr(self._build_repr_df(num_rows, num_cols))
  File "c:\users\dchigare\desktop\repos\modin\modin\pandas\base.py", line 100, in _build_repr_df
    return self.iloc[indexer]._query_compiler.to_pandas()
  File "c:\users\dchigare\desktop\repos\modin\modin\backends\pandas\query_compiler.py", line 188, in to_pandas
    return self._modin_frame.to_pandas()
  File "c:\users\dchigare\desktop\repos\modin\modin\engines\base\frame\data.py", line 1350, in to_pandas
    df = self._frame_mgr_cls.to_pandas(self._partitions)
  File "c:\users\dchigare\desktop\repos\modin\modin\engines\base\frame\partition_manager.py", line 258, in to_pandas
    retrieved_objects = [[obj.to_pandas() for obj in part] for part in partitions]
  File "c:\users\dchigare\desktop\repos\modin\modin\engines\base\frame\partition_manager.py", line 258, in <listcomp>
    retrieved_objects = [[obj.to_pandas() for obj in part] for part in partitions]
  File "c:\users\dchigare\desktop\repos\modin\modin\engines\base\frame\partition_manager.py", line 258, in <listcomp>
    retrieved_objects = [[obj.to_pandas() for obj in part] for part in partitions]
  File "c:\users\dchigare\desktop\repos\modin\modin\engines\dask\pandas_on_dask\frame\partition.py", line 127, in to_pandas
    dataframe = self.get()
  File "c:\users\dchigare\desktop\repos\modin\modin\engines\dask\pandas_on_dask\frame\partition.py", line 63, in get
    return self.future.result()
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python37_64\lib\site-packages\distributed\client.py", line 218, in result
    raise exc.with_traceback(tb)
  File "c:\users\dchigare\desktop\repos\modin\modin\engines\dask\pandas_on_dask\frame\partition.py", line 27, in apply_list_of_funcs
    df = func(df, **kwargs)
  File "c:\users\dchigare\desktop\repos\modin\modin\engines\dask\pandas_on_dask\frame\partition.py", line 99, in <lambda>
    lambda df: pandas.DataFrame(df.iloc[row_indices, col_indices])
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python37_64\lib\site-packages\pandas\core\indexing.py", line 1762, in __getitem__
    return self._getitem_tuple(key)
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python37_64\lib\site-packages\pandas\core\indexing.py", line 2067, in _getitem_tuple
    self._has_valid_tuple(tup)
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python37_64\lib\site-packages\pandas\core\indexing.py", line 703, in _has_valid_tuple
    self._validate_key(k, i)
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python37_64\lib\site-packages\pandas\core\indexing.py", line 2009, in _validate_key
    raise IndexError("positional indexers are out-of-bounds")
IndexError: positional indexers are out-of-bounds
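
The last frames of the traceback can be reproduced with plain pandas. This is a minimal sketch, not Modin's code: it assumes the out-of-range row positions come from the stale 9-label index being applied to a partition that holds only 3 rows, which is an inference from the index printed in this report rather than something the traceback states. The helper name positional_select_fails is hypothetical.

```python
import pandas


def positional_select_fails(df, row_positions, col_positions):
    """Return True if df.iloc[row_positions, col_positions] raises IndexError."""
    try:
        df.iloc[row_positions, col_positions]
    except IndexError:
        return True
    return False


# A 3-row frame standing in for the partition that holds the nlargest result.
partition = pandas.DataFrame(
    {"population": [65000000, 59000000, 434000]},
    index=["France", "Italy", "Malta"],
)

# Assumption: row positions 0..8 are derived from the stale 9-label index.
out_of_bounds = positional_select_fails(partition, list(range(9)), [0])

# Row positions that match the real partition length succeed.
in_bounds = positional_select_fails(partition, list(range(3)), [0])

print(out_of_bounds, in_bounds)
```

Only the first call fails, because list-based iloc selection rejects any position beyond the frame's length.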

Describe the problem

If we inspect the columns and index of md_res in the debugger, we can see that they contain incorrect values. However, as the output of the code above shows, the partitions themselves contain the correct result. That is why the existing tests don't fail in CI: df_equals, which the tests use, converts the Modin result via to_pandas, and to_pandas only takes the data in the partitions into account, not the stored index and columns.

(Pdb) md_res.columns
Index(['__reduced__'], dtype='object')
(Pdb) md_res.index
Index(['Italy', 'France', 'Malta', 'Maldives', 'Brunei', 'Iceland', 'Nauru',
       'Tuvalu', 'Anguilla'],
      dtype='object')
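
A stricter check would compare the reported axes against the materialized frame as well. The sketch below uses only pandas and hard-codes the values printed in this report. axis_mismatches is a hypothetical helper, not part of Modin or its test utilities. In a real test, the reported labels would presumably be read from the Modin object's index and columns, and the materialized frame from its to_pandas conversion.

```python
import pandas


def axis_mismatches(reported_index, reported_columns, materialized):
    """List the axes whose reported labels differ from the materialized frame."""
    mismatched = []
    if list(reported_index) != list(materialized.index):
        mismatched.append("index")
    if list(reported_columns) != list(materialized.columns):
        mismatched.append("columns")
    return mismatched


# The data held by the partition, as printed under md_res_partitions.
materialized = pandas.DataFrame(
    {
        "population": [65000000, 59000000, 434000],
        "GDP": [2583560, 1937894, 12011],
        "alpha-2": ["FR", "IT", "MT"],
    },
    index=["France", "Italy", "Malta"],
)

# The stale labels reported by md_res.index and md_res.columns in the debugger.
stale_index = [
    "Italy", "France", "Malta", "Maldives", "Brunei",
    "Iceland", "Nauru", "Tuvalu", "Anguilla",
]
stale_columns = ["__reduced__"]

print(axis_mismatches(stale_index, stale_columns, materialized))
```

With the values from this report, both axes are flagged. A comparison that only looks at the materialized data sees neither.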
@dchigarev dchigarev added the bug 🦗 Something isn't working label Jul 15, 2020
@dchigarev dchigarev added this to the 0.8.0 milestone Jul 15, 2020
@dchigarev dchigarev self-assigned this Jul 15, 2020
@dchigarev dchigarev linked a pull request Jul 15, 2020 that will close this issue
dchigarev added a commit to dchigarev/modin that referenced this issue Jul 23, 2020
This fix is also fixing inconsistent indices in some of operations (modin-project#1731, modin-project#1732 and modin-project#1734)

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>