Problem with join operations followed by sink_csv on LazyFrame #15157
Comments
As the error states, we don't support streaming the full query yet, so this is expected behavior. In the future we might resolve the … We are working on supporting streaming for more of our queries; it is an ongoing process.
Thanks for clarifying. Is it also expected behavior that the third example runs without an error, but silently omits anything from the join?
It would be great if an error could be raised when trying to sink a query that's not fully supported, instead of generating an incorrect result. Unsurprisingly, this affects sink_parquet too.
EDIT: This might be different, as there's not even a join here, just a concat. Should this be its own issue? Or is it the same underlying problem?
I also ran across this problem and worked up a minimal example. The streaming engine is probably the biggest draw of polars for me, so I'd really love to see this fixed.
Minimal example
Contents of sunk.csv:
Contents of written.csv:
Interestingly this bug goes away if you omit the
EDIT: This example doesn't take the same logical path—it's actually equivalent (I think) to my above example without
This produces identical CSVs containing both rows.
@sclamons polars/py-polars/polars/functions/eager.py Line 163 in 64b45a8
It looks like
(df1.join(df2, how="full", on=["Name", "X"], suffix="_PL_CONCAT_RIGHT")
    .with_columns(
        pl.coalesce([name, f"{name}_PL_CONCAT_RIGHT"])
        for name in ["Name", "X"]
    )
    .collect(streaming=True)
)
# shape: (1, 4)
# ┌──────┬─────┬──────────────────────┬───────────────────┐
# │ Name ┆ X   ┆ Name_PL_CONCAT_RIGHT ┆ X_PL_CONCAT_RIGHT │
# │ ---  ┆ --- ┆ ---                  ┆ ---               │
# │ str  ┆ i64 ┆ str                  ┆ i64               │
# ╞══════╪═════╪══════════════════════╪═══════════════════╡
# │ B    ┆ 2   ┆ B                    ┆ 2                 │
# └──────┴─────┴──────────────────────┴───────────────────┘
I think your Rust example is just doing a default vertical concat, so it's not equivalent to the Python repro?
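For comparison, here is a pure-Python sketch of what a full join on both key columns followed by a coalesce would normally yield. It does not call polars at all; the helper names `full_join` and `coalesce` and the input rows are hypothetical, chosen only to illustrate the semantics, and the sketch assumes keys are unique within each input.

```python
def full_join(left, right):
    """Full outer join of two lists of (Name, X) tuples on both columns.

    Returns (left_row_or_None, right_row_or_None) pairs.
    Assumes keys are unique within each input (keeps the sketch short).
    """
    left_keys = set(left)
    right_keys = set(right)
    # Rows from the left side, paired with their match on the right (if any).
    joined = [(row, row if row in right_keys else None) for row in left]
    # Rows that exist only on the right side.
    joined += [(None, row) for row in right if row not in left_keys]
    return joined


def coalesce(pair):
    """Take the left row if present, otherwise the right row."""
    left_row, right_row = pair
    return left_row if left_row is not None else right_row


# Hypothetical inputs: one overlapping row ("B", 2) and one unique row per side.
df1 = [("A", 1), ("B", 2)]
df2 = [("B", 2), ("C", 3)]

result = [coalesce(pair) for pair in full_join(df1, df2)]
print(result)  # [('A', 1), ('B', 2), ('C', 3)]
```

The streaming output quoted above has a single row, whereas a full join would normally also keep rows that appear on only one side.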
@cmdlineluser Yes, you're right—the Rust example isn't taking the
Checks
Reproducible example
The same problem exists if the overlap is not empty:
I originally noticed this after an additional concat operation, which does not error, but silently omits some of the data:
Log output
Issue description
sink_csv does not behave as expected after join operations on LazyFrames. In some cases it errors. In other cases, it silently produces different results compared to collect().write_csv().
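One way to catch the silent mismatch is to compare the two output files directly. The following is a minimal sketch using only the standard library; the helper name `csv_rows_match` is hypothetical, and it assumes that row order may legitimately differ between the two code paths, so rows are compared as a multiset rather than line by line.

```python
import csv
from collections import Counter


def csv_rows_match(path_a: str, path_b: str) -> bool:
    """Return True if both CSV files share a header and the same rows.

    Row order is ignored (an assumption of this sketch); duplicate rows
    are still counted, so a dropped duplicate is reported as a mismatch.
    """
    def load(path):
        with open(path, newline="") as f:
            rows = [tuple(r) for r in csv.reader(f)]
        if not rows:
            return (), Counter()
        # First row is treated as the header, the rest as data.
        return rows[0], Counter(rows[1:])

    return load(path_a) == load(path_b)
```

With the file names used in the minimal example from the comments, the check would be `csv_rows_match("sunk.csv", "written.csv")`, which should return `True` once both paths produce the same data.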
Expected behavior
df.sink_csv(file) and df.collect().write_csv(file) should produce identical output.
Installed versions