Make the number of column partitions dependent on the number of row groups at .read_parquet() #6558

Closed
dchigarev opened this issue Sep 14, 2023 · 0 comments · Fixed by #6559
Assignees
Labels
P2 Minor bugs or low-priority feature requests pandas.io

Comments

@dchigarev
Collaborator

In the current implementation, Modin's parquet reader tries to create as many column partitions as possible (even if each of them ends up containing only one column), without taking into account the number of row partitions that parquet's row groups already produce.

def build_columns(cls, columns):

As a result, .read_parquet() naturally produces square frames (#5296), which perform poorly in Modin.
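To make the effect concrete, here is a simplified sketch of the splitting behaviour described above. It is not Modin's code: `greedy_column_chunks` is a hypothetical helper that assumes columns are cut into up to `n_partitions` chunks regardless of how many row partitions already exist, and the row-partition count of 16 is simply taken from the reproducer's output.

```python
def greedy_column_chunks(columns, n_partitions):
    """Hypothetical helper: split columns into up to n_partitions chunks."""
    # Ceiling division gives the chunk width; never less than one column.
    chunk = max(1, -(-len(columns) // n_partitions))
    return [columns[i : i + chunk] for i in range(0, len(columns), chunk)]


columns = [f"col{j}" for j in range(16)]
col_parts = greedy_column_chunks(columns, n_partitions=16)

# Assumption: the row groups already yield 16 row partitions.
n_row_parts = 16
grid_shape = (n_row_parts, len(col_parts))
total_blocks = grid_shape[0] * grid_shape[1]
print(grid_shape, total_blocks)
```

With 16 columns and `n_partitions=16`, every column chunk holds a single column, so the frame is made of 256 tiny blocks even though the row axis alone already provides 16-way parallelism.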

import tempfile
import pandas
import modin.pandas as pd

import modin.config as cfg

cfg.NPartitions.put(16)

NROWS = 100
NCOLS = 16

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write 16 parquet files, each holding an NROWS x NCOLS frame
    for i in range(16):
        pandas.DataFrame({f"col{j}": range(NROWS) for j in range(NCOLS)}).to_parquet(
            f"{tmp_dir}/{i}.parquet"
        )

    # Read the whole directory back as a single Modin frame
    df = pd.read_parquet(tmp_dir)
    print(df._query_compiler._modin_frame._partitions.shape)  # (16, 16)

We may want to change the logic that generates column partitions so that, if there are already enough row partitions, column partitions are sized according to the cfg.MinPartitionSize parameter rather than one column per partition.
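One possible reading of this proposal is sketched below. It is only an illustration under stated assumptions, not the change that was merged: `choose_num_column_partitions` and its parameters are hypothetical names, "enough row parts" is assumed to mean "at least `n_partitions`", and the `min_partition_size=32` default is an illustrative value standing in for cfg.MinPartitionSize.

```python
import math


def choose_num_column_partitions(
    n_cols, n_row_parts, n_partitions=16, min_partition_size=32
):
    """Hypothetical heuristic for picking the number of column partitions."""
    if n_cols == 0:
        return 0
    if n_row_parts >= n_partitions:
        # Assumption: the row axis already gives enough parallelism, so
        # keep every column partition at least min_partition_size wide.
        col_parts = math.ceil(n_cols / min_partition_size)
    else:
        # Otherwise fall back to splitting columns as finely as possible.
        col_parts = n_cols
    # Never exceed n_partitions and always return at least one partition.
    return max(1, min(col_parts, n_partitions))


# 16 columns with 16 row partitions: one column partition is enough.
print(choose_num_column_partitions(n_cols=16, n_row_parts=16))
# 16 columns with a single row partition: split along columns instead.
print(choose_num_column_partitions(n_cols=16, n_row_parts=1))
```

Under these assumptions the reproducer above would yield a (16, 1) partition grid instead of (16, 16), while a file with a single row group would still be split along the column axis.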

@dchigarev dchigarev added pandas.io P2 Minor bugs or low-priority feature requests labels Sep 14, 2023
@dchigarev dchigarev self-assigned this Sep 14, 2023
dchigarev added a commit to dchigarev/modin that referenced this issue Sep 14, 2023
…er '.read_parquet()'

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
dchigarev added a commit to dchigarev/modin that referenced this issue Sep 15, 2023
…ad_parquet()'

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
anmyachev added a commit that referenced this issue Sep 16, 2023
…#6559)

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
Co-authored-by: Anatoly Myachev <anatoliimyachev@mail.com>