-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
inner join involving hive-partitioned parquet dataset and filters on LHS and RHS causes panic #9797
Comments
Yes, is 36.0.0 the first version to include this fix? |
Yes, that is my understanding of the release notes: https://github.com/apache/arrow-datafusion/blob/main/dev/changelog/36.0.0.md#3600-2024-02-16 |
I'm not able to run the query with 36.0.0. Perhaps there was some breaking change in 35.0.0 or 36.0.0 (can't find one that looks relevant in the release notes) that affected the syntax for how hive-partitioned parquet datasets are registered? Because now, after loading one such dataset as an external table, via e.g.
the columns of the table are only the hive partition columns |
I am not quite sure what is going on here -- maybe @devinjdangelo or @metesynnada remembers 37.0.0 also has significant changes in these areas so maybe things will work in 37.0.0 🤔 |
If it is helpful I can try to produce a self-contained repro |
Yes that would be most helpful. I am not sure how to make this ticket actionable otherwise |
Perhaps the issue was with the syntax for the partitioned parquet file location(s). I did not change the wildcard based syntax I had been using before, shown at the top of this issue. However, I see in the docs now that the location only need be the parent directory of the top-level partition: Using a version of that example NYC taxi data (I actually just downloaded one file), I did verify that the wildcard based location doesn't produce an error during table creation but does not work. The error is slightly different -- the table is completely empty -- but maybe it's the same issue
I don't know if the wildcard syntax I used was what this DDL documentation page used to show or if I got the syntax wrong from the get-go, and it just happened to be an unsupported format that worked up until now? Naturally I'm attempting the query again with my original dataset, but since its a rather large amount of data re-recreating the external table will take some time. |
OK, with DataFusion 36.0.0, I still get a panic when running this query in its original context -- which is a rust module using the datafusion crate and its dependencies. |
Oops, apologies for the misleading noise in my previous comment -- I can't exactly reproduce what I did wrong but I must have run some subtly different query when using the DataFusion python modules. The panic occurs identically in 34 and 36, whether using the python module or my project using the equivalent crates. |
I wonder if this issue still happens? |
Describe the bug
I am attempting to optimize an inner join on two hive-partitioned parquet datasets
lhs
andrhs
, joined on one columnjoin_col
. In a base working query, the left-hand side of the inner join islhs
itself and the right-hand siderhs
filtered by some numeric columns. It just so happens for this data that equality onjoin_col
implies equality of a less granular partition columnjoin_col_partitioned
. However, a modification to the query, in which the left-hand sidelhs
is first filtered to records withjoin_col_partitioned
among the distinct values present in the right-hand side, results in a repeatable panic.To Reproduce
I cannot provide the raw data but will provide all the details that I can, with some details obfuscated.
External tables
The tables
lhs
andrhs
in question are created withand
The table
lhs
also contains columnskeep_col
andjoin_col
, as well as others not referenced in the query; similarlyrhs
contains columnscol2
andjoin_col
. Both tables have the same number of records, 28914441.Working query
The working base version of the query is
and returns 375130 records, which is precisely the number of records in
SELECT * FROM rhs WHERE col1=7 AND col2>=0 AND col2<=25000
.Modified panic-inducing query
The modification to the query that causes the panic is
The nested query
SELECT DISTINCT join_col_partitioned FROM rhs WHERE col1=7 AND col2>=0 AND col2<=25000
returns two distinct values ofjoin_col_partitioned
.Running this query dependably produces a panic (repeated 5x or so times).
Logical and physical plans and backtraces
I have included the the output of
EXPLAIN
for both of these queries in theAdditional context
section, as well as the regular (RUST_BACKTRACE=1
) and full (RUST_BACKTRACE=full
) backtraces for the panic.Expected behavior
I expected the second query to return the exact data as the first query (due to the aforementioned fact that for these datasets, equality on
join_col
implies equality onjoin_col_partitioned
, and not to result in a panic.Additional context
Environment
Explanation for working query
Output of
EXPLAIN SELECT a.keep_col FROM lhs AS a INNER JOIN (SELECT * FROM rhs WHERE col1=7 AND col2>=0 AND col2<=25000) AS b ON a.join_col = b.join_col;
:Explanation for failing query
Output of
EXPLAIN SELECT a.keep_col FROM (SELECT * FROM lhs WHERE join_col_partitioned IN (SELECT DISTINCT join_col_partitioned FROM rhs WHERE col1=7 AND col2>=0 AND col2<=25000)) AS a INNER JOIN (SELECT * FROM rhs WHERE col1=7 AND col2>=0 AND col2<=25000) AS b ON a.join_col = b.join_col;
:Regular backtrace
Full backtrace
The text was updated successfully, but these errors were encountered: