
Ability to append to an existing directory of parquet files with new partitions (mode=append) #18750

Open
pascalwhoop opened this issue Sep 15, 2024 · 2 comments
Labels
enhancement New feature or an improvement of an existing feature needs decision Awaiting decision by a maintainer

Comments


pascalwhoop commented Sep 15, 2024

Description

Hey.
Spark has mode=append for writing parquet files. This is quite useful: it just adds more partitions to the folder of an existing dataset, which is great for writing in batches across multiple runs.

How would you solve this in Polars? I know appending data to an existing parquet file is a whole different game, but just adding more files should be fairly straightforward, no? I suspect simply not overwriting / deleting the existing folder structure would do the trick.

Edit

Digging into this, I realize there's already a way to do this with partitioned data, as long as the partition we write to is unique / always new (e.g. by generating a run_id column).

Polars writes parquet like this

pa.parquet.write_to_dataset(
    table=tbl,
    root_path=file,
    **(pyarrow_options or {}),
)

and pyarrow's default existing_data_behavior is "overwrite_or_ignore",

so it should just add more files and ignore the existing ones. Exactly what I was looking for. Will whip up a quick example.

@pascalwhoop pascalwhoop added the enhancement New feature or an improvement of an existing feature label Sep 15, 2024
@coastalwhite coastalwhite added the needs decision Awaiting decision by a maintainer label Sep 15, 2024
@deanm0000
Collaborator

It does seem nuts to me that it silently overwrites files #18242.


cceyda commented Oct 19, 2024

(versions 1.7.1 & 1.9.0) I think this may be solved, because I was able to append just fine (as long as it is a new partition; I haven't tested existing-partition behaviour).

Example code, modified from the linked issue:

import polars as pl

df_a = pl.DataFrame(
    {
        'type': ['a', 'b'],
        'date': ['2024-08-15', '2024-08-16'],
        'value': [68, 70],
    }
)
df_a.write_parquet('./example_part.parquet', partition_by='date')

df_b = pl.DataFrame(
    {
        'type': ['a', 'b'],
        'date': ['2024-08-17', '2024-08-18'],
        'value': [72, 74],
    }
)
df_b.write_parquet('./example_part.parquet', partition_by='date')

pl.read_parquet('./example_part.parquet')

returns:

| type | date         | value |
| ---- | ------------ | ----- |
| str  | str          | i64   |
| "a"  | "2024-08-15" | 68    |
| "b"  | "2024-08-16" | 70    |
| "a"  | "2024-08-17" | 72    |
| "b"  | "2024-08-18" | 74    |
