
Ability to append to an existing directory of parquet files with new partitions (mode=append) #18750

Open
pascalwhoop opened this issue Sep 15, 2024 · 2 comments
Labels
enhancement New feature or an improvement of an existing feature needs decision Awaiting decision by a maintainer

Comments


pascalwhoop commented Sep 15, 2024

Description

Hey.
Spark has mode=append for writing parquet files. This is quite useful: it just adds more partitions to the folder of an existing dataset, which is great for writing in batches across multiple runs.

How would you solve this in Polars? I know appending data to an existing parquet file is a whole different game, but just adding more files should be fairly straightforward, no? I suspect simply not overwriting / deleting the existing folder structure would do the trick.

Edit

Digging into this, I realize there's already a way to do this with partitioned data, as long as the partition we write to is unique / always new (e.g. by generating a run_id column).

Polars writes parquet like this

pa.parquet.write_to_dataset(
    table=tbl,
    root_path=file,
    **(pyarrow_options or {}),
)

and pyarrow's default existing_data_behavior is "overwrite_or_ignore",

so it should just add more files and ignore the existing ones. Exactly what I was looking for. Will whip up a quick example.

@pascalwhoop pascalwhoop added the enhancement New feature or an improvement of an existing feature label Sep 15, 2024
@coastalwhite coastalwhite added the needs decision Awaiting decision by a maintainer label Sep 15, 2024
@deanm0000
Collaborator

It does seem nuts to me that it silently overwrites files #18242.


cceyda commented Oct 19, 2024

(versions 1.7.1 & 1.9.0) I think this may be solved, because I was able to append just fine (as long as it is a new partition; I haven't tested existing-partition behaviour).

Example code, modified from the linked issue:

import polars as pl

df_a = pl.DataFrame(
    {
        'type': ['a', 'b'],
        'date': ['2024-08-15', '2024-08-16'],
        'value': [68, 70],
    }
)
df_a.write_parquet('./example_part.parquet', partition_by='date')

df_b = pl.DataFrame(
    {
        'type': ['a', 'b'],
        'date': ['2024-08-17', '2024-08-18'],
        'value': [72, 74],
    }
)
df_b.write_parquet('./example_part.parquet', partition_by='date')

pl.read_parquet('./example_part.parquet')

returns:

| type | date         | value |
| ---- | ------------ | ----- |
| str  | str          | i64   |
| "a"  | "2024-08-15" | 68    |
| "b"  | "2024-08-16" | 70    |
| "a"  | "2024-08-17" | 72    |
| "b"  | "2024-08-18" | 74    |
