I'm using deltalake version 0.19.1 and trying to make the parquet files in my Delta Lake table contain a large number of row groups. I set min_rows_per_group = 10000 and max_rows_per_group = 100000 for a 1,000,000-row table, but I still get a single row group.
Specifically, I ran:
import os

import polars as pl
from deltalake import DeltaTable, write_deltalake
from pyarrow.parquet import read_metadata

nr = 1_000_000

# One partition value so everything lands in a single data file.
df = pl.DataFrame({
    'P': ['X'] * nr,
    'A': [f'abc_{i}' for i in range(nr)],
    'B': [f'def_{i}' for i in range(nr)],
})

write_deltalake(
    'data/row_groups',
    df.to_arrow(),
    partition_by='P',
    mode='overwrite',
    min_rows_per_group=nr // 100,  # 10,000
    max_rows_per_group=nr // 10,   # 100,000
    engine='rust',
)

# Locate the data file recorded in the table's add actions
# and inspect its parquet metadata.
dt = DeltaTable('data/row_groups')
pq_file = os.path.join(
    'data/row_groups/',
    dt.get_add_actions(flatten=True).to_pandas()['path'].values[0],
)
read_metadata(pq_file)
which shows:
<pyarrow._parquet.FileMetaData object at 0x7f04c8b3a390>
created_by: parquet-rs version 52.2.0
num_columns: 2
num_rows: 1000000
num_row_groups: 1
format_version: 1.0
serialized_size: 504
I expected the min_rows_per_group/max_rows_per_group settings to be respected at the level of the parquet file.
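For comparison, here is a minimal sketch of the layout I expected, continuing from the script above and writing the same table with pyarrow's dataset writer directly (which I believe the pyarrow engine delegates to; the output directory and the default part-0.parquet basename are just illustrative):

import pyarrow.dataset as ds

# Write the same Arrow table with pyarrow's own row-group knobs.
ds.write_dataset(
    df.to_arrow(),
    'data/pa_row_groups',
    format='parquet',
    min_rows_per_group=nr // 100,
    max_rows_per_group=nr // 10,
)

# Inspect the row-group count of the file pyarrow produced.
read_metadata('data/pa_row_groups/part-0.parquet')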
Am I misunderstanding what those settings are meant for? Thank you,
Matt
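P.S. If the rust engine instead expects row-group sizing to go through deltalake's WriterProperties class, something like the sketch below might be the intended route (max_row_group_size is my guess at the relevant field; I have not confirmed it):

from deltalake import WriterProperties, write_deltalake

# Sketch: pass row-group sizing to the rust writer through
# WriterProperties rather than min/max_rows_per_group (assumption).
write_deltalake(
    'data/row_groups',
    df.to_arrow(),
    partition_by='P',
    mode='overwrite',
    engine='rust',
    writer_properties=WriterProperties(max_row_group_size=nr // 10),
)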