min/max_row_groups not respected #2814

Closed
@vincenzon

I'm using deltalake version 0.19.1 and want the Parquet files in my Delta table to have a large number of row groups. I set min_rows_per_group = 10000 and max_rows_per_group = 100000 for a 1,000,000-row table, but I get a single row group.

Specifically I ran:

import os
import polars as pl
import pandas as pd
from deltalake import DeltaTable, write_deltalake
from pyarrow.parquet import read_metadata

nr = 1000000
df = pl.DataFrame({
    'P': ['X'] * nr,
    'A': [f'abc_{i}' for i in range(nr)],
    'B': [f'def_{i}' for i in range(nr)]
})

write_deltalake('data/row_groups',
                df.to_arrow(),
                partition_by='P',
                mode='overwrite',
                min_rows_per_group = nr // 100,
                max_rows_per_group = nr // 10,
                engine = 'rust'
                )

dt = DeltaTable('data/row_groups')

pq_file = os.path.join('data/row_groups/', dt.get_add_actions(flatten=True).to_pandas()['path'].values[0])

read_metadata(pq_file)

which shows:

<pyarrow._parquet.FileMetaData object at 0x7f04c8b3a390>
  created_by: parquet-rs version 52.2.0
  num_columns: 2
  num_rows: 1000000
  num_row_groups: 1
  format_version: 1.0
  serialized_size: 504

I expected the min_rows_per_group/max_rows_per_group settings to be respected at the level of the Parquet file.
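To make that expectation concrete, here is a minimal sketch of the arithmetic. It assumes the two settings bound the number of rows in each row group, which is my reading of the parameter names and not something I have confirmed in the deltalake source. The helper names `row_group_count_bounds` and `row_group_sizes` are hypothetical. The second helper only reads `num_row_groups` and `row_group(i).num_rows`, which pyarrow's `FileMetaData` exposes, so it can be pointed at the result of `read_metadata(pq_file)`.

```python
def row_group_count_bounds(num_rows, min_rows_per_group, max_rows_per_group):
    """Return (fewest, most) row groups, ASSUMING every group must hold
    between min_rows_per_group and max_rows_per_group rows."""
    # Integer ceiling division: the fewest groups occur when each is full.
    fewest = -(-num_rows // max_rows_per_group)
    # The most groups occur when each holds only the minimum.
    most = num_rows // min_rows_per_group
    # A short final group could fall below the minimum, so never report
    # an upper bound smaller than the lower bound.
    return fewest, max(most, fewest)


def row_group_sizes(meta):
    """List the row count of each row group in a FileMetaData-like object."""
    return [meta.row_group(i).num_rows for i in range(meta.num_row_groups)]


nr = 1_000_000
print(row_group_count_bounds(nr, nr // 100, nr // 10))
```

Under that assumption, a 1,000,000-row file written with these settings would contain between 10 and 100 row groups, so the single row group reported above falls outside the range. Calling `row_group_sizes(read_metadata(pq_file))` would show the actual per-group row counts.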

Am I misunderstanding what those settings are meant for? Thank you,

Matt
