Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mismatch between schema and batches on a CREATE TABLE with a windowing query #5695

Closed
milevin opened this issue Mar 22, 2023 · 15 comments · Fixed by #6417
Closed

Mismatch between schema and batches on a CREATE TABLE with a windowing query #5695

milevin opened this issue Mar 22, 2023 · 15 comments · Fixed by #6417
Labels
bug Something isn't working

Comments

@milevin
Copy link

milevin commented Mar 22, 2023

Describe the bug

This doesn't work but should:

DataFusion CLI v20.0.0
❯ create table temp as with orders as (
  select 1 as o_custkey
)
SELECT RANK() OVER (PARTITION BY o_custkey)
FROM orders;
Error during planning: Mismatch between schema and batches

To Reproduce

See above

Expected behavior

This should not throw the mismatch error

Additional context

http://sqlfiddle.com/#!17/1d310/1

Note: if I slap round(...) around the window expression, it begins to work:

DataFusion CLI v20.0.0
❯ create table temp as with orders as (
  select 1 as o_custkey
)
SELECT round(RANK() OVER (PARTITION BY o_custkey), 5)
FROM orders;
0 rows in set. Query took 0.012 seconds.
@milevin milevin added the bug Something isn't working label Mar 22, 2023
@comphead
Copy link
Contributor

Actually the problem is the column name mismatches between schema and the batch

[datafusion/core/src/datasource/memory.rs:60] &schema = Schema {
    fields: [
        Field {
            name: "RANK() PARTITION BY [orders.o_custkey] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING",
            data_type: UInt64,
            nullable: true,
            dict_id: 0,
            dict_is_ordered: false,
            metadata: {},
        },
    ],
    metadata: {},
}
[datafusion/core/src/datasource/memory.rs:61] &batches.schema() = Schema {
    fields: [
        Field {
            name: "RANK()",
            data_type: UInt64,
            nullable: false,
            dict_id: 0,
            dict_is_ordered: false,
            metadata: {},
        },
    ],
    metadata: {},
}
thread 'sql::parquet::parquet_query_with_max_min1' panicked at 'called `Result::unwrap()` on an `Err` value: Plan("Mismatch between schema and batches")', datafusion/core/tests/sql/mod.rs:1163:33

The reason for this I believe in for window function the name is shortened and looks not consistent

SELECT RANK() OVER (PARTITION BY x order by x) FROM (select 1 as x);
+--------+
| RANK() |
+--------+
| 1      |
+--------+

once you wrap it into the function it stops name shorten

SELECT abs(RANK() OVER (PARTITION BY x order by x)) FROM (select 1 as x);
+------------------------------------------------------------------------------------------------------------+
| abs(RANK() PARTITION BY [x] ORDER BY [x ASC NULLS LAST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) |
+------------------------------------------------------------------------------------------------------------+
| 1.0                                                                                                        |
+------------------------------------------------------------------------------------------------------------+

@comphead
Copy link
Contributor

as a workarounds you may want to use

❯ create table temp1 as with orders as (
  select 1 as o_custkey
)
SELECT 0+RANK() OVER (PARTITION BY o_custkey)
FROM orders;
0 rows in set. Query took 0.008 seconds.

@milevin
Copy link
Author

milevin commented Mar 22, 2023

Great, thank you for looking into this! I also discovered a similar workaround (and added it into the Additional context).

We might have observed the same issue outside of windowing functions; I'll see if I can create more repros.

@comphead
Copy link
Contributor

I'm looking into this today!

@comphead
Copy link
Contributor

@milevin I have looked into the code and another workaround, more natural is to give an alias

❯ create table temp1 as with orders as (
  select 1 as o_custkey
)
SELECT RANK() OVER (PARTITION BY o_custkey) as a
FROM orders;
0 rows in set. Query took 0.010 seconds.

The code currently uses alias if its given or shortened the name to prevent huge unreadable names.
@alamb I'm not sure tbh if we should revert
https://github.com/apache/arrow-datafusion/blob/26e1b20ea3362ea62cb713004a0636b8af6a16d7/datafusion/core/src/physical_plan/planner.rs#L1630

@alamb
Copy link
Contributor

alamb commented Mar 23, 2023

The code currently uses alias if its given or shortened the name to prevent huge unreadable names.
@alamb I'm not sure tbh if we should revert

I don't know the context of why this was done / if we should do something more fancy

Perhaps @ozankabak and @mustafasrepo have some input into this

@milevin
Copy link
Author

milevin commented Mar 24, 2023

@comphead Here is another instance

❯ create table temp as WITH w1 AS (select 1 as x , max(10) as y), w2 AS (select 5 as n_regionkey)
select count(*) count, n_regionkey from w2 group by n_regionkey
union all
select x, y from w1 order by n_regionkey, count desc;
Error during planning: Mismatch between schema and batches

Note: no windowing here, and all the columns are aliased

@ozankabak
Copy link
Contributor

Note: no windowing here, and all the columns are aliased

You beat me to it, I looked around and was about to write that this seems to be a more general issue than windowing 🙂

@comphead
Copy link
Contributor

Yes @ozankabak the problem is broader, for the case above the diff is:

Schema {
    fields: [
        Field {
            name: "count",
            data_type: Int64,
            nullable: true,
            dict_id: 0,
            dict_is_ordered: false,
            metadata: {},
        },
        Field {
            name: "n_regionkey",
            data_type: Int64,
            nullable: true,
            dict_id: 0,
            dict_is_ordered: false,
            metadata: {},
        },
    ],
    metadata: {},
}
[arrow-datafusion/datafusion/core/src/datasource/memory.rs:61] &batches.schema() = Schema {
    fields: [
        Field {
            name: "count",
            data_type: Int64,
            nullable: true,
            dict_id: 0,
            dict_is_ordered: false,
            metadata: {},
        },
        Field {
            name: "y",
            data_type: Int64,
            nullable: true,
            dict_id: 0,
            dict_is_ordered: false,
            metadata: {},
        },
    ],
    metadata: {},
}
Error during planning: Mismatch between schema and batches

I'm checking if its related to WITH

@milevin
Copy link
Author

milevin commented Mar 24, 2023

Is there a bug in the logic of inferring output column names? Note how in this example, DF is assigning y to the second output column -- pretty randomly! -- instead of using the correct n_regionkey.

(It is my understanding that in a SELECT with a UNION clause, the output column names are derived from the first SELECT sub-clause; perhaps somebody can dig through the standard to confirm that.)

Either way, these look like two different issues, both manifesting as the mismatch error.

@comphead
Copy link
Contributor

you are right, col names from first union all branch are the driving

This case is not correct, col names has to be count, n_regionkey

❯ WITH w1 AS (select 1 as x , max(10) as y), w2 AS (select 5 as n_regionkey)
select count(*) count, n_regionkey from w2 group by n_regionkey
union all
select x, y from w1 order by n_regionkey, count desc;
+-------+----+
| count | y  |
+-------+----+
| 1     | 5  |
| 1     | 10 |
+-------+----+

If I remove order by I'm getting even more surprising

❯ WITH w1 AS (select 1 as x , max(10) as y), w2 AS (select 5 as n_regionkey)
select count(*) count, n_regionkey from w2 group by n_regionkey
union all
select x, y from w1;
+---+----+
| x | y  |
+---+----+
| 1 | 10 |
| 1 | 5  |
+---+----+

The bug partially related to wrong col name derivation in UNION ALL

❯ select  1 a, 2 b union all select 3 c, 4 d
;
+---+---+
| c | d |
+---+---+
| 3 | 4 |
| 1 | 2 |
+---+---+

I will prepare a fix for UNION ALL first and then test out other scenarios, like not deterministic column naming with and without ORDER BY

@milevin
Copy link
Author

milevin commented Mar 25, 2023

@comphead Another day another "Mismatch" instance:

DataFusion CLI v20.0.0
❯ create table temp as
with t1 as (select 1 as col1, 'asd' as col2), t2 as (select 1 as col3, 'sdf' as col4)
select col2, col4 from t1 left join t2 on col1 = col3;
Error during planning: Mismatch between schema and batches

Looks unrelated to the other two cases I provided.

@milevin
Copy link
Author

milevin commented Mar 25, 2023

In this last case, the issue is with inferring nullability incorrectly.

Schema {
    fields: [
        Field {
            name: "col2",
            data_type: Utf8,
            nullable: false,
            dict_id: 0,
            dict_is_ordered: false,
            metadata: {},
        },
        Field {
            name: "col4",
            data_type: Utf8,
            nullable: false,                                              <<<<<<<<----------------
            dict_id: 0,
            dict_is_ordered: false,
            metadata: {},
        },
    ],
    metadata: {},
}


[
    Schema {
        fields: [
            Field {
                name: "col2",
                data_type: Utf8,
                nullable: false,
                dict_id: 0,
                dict_is_ordered: false,
                metadata: {},
            },
            Field {
                name: "col4",
                data_type: Utf8,
                nullable: true,
                dict_id: 0,
                dict_is_ordered: false,
                metadata: {},
            },
        ],
        metadata: {},
    },
]

It incorrectly infers that col4 is non-nullable. It should be nullable.

Any idea which code is responsible for this?

@comphead
Copy link
Contributor

Thanks @milevin for reporting all of this. Tbh I think there are multiple places. I’ll start with Union all first. For nullability check we probably need to weaken the check itself to test names and data types only

@alamb
Copy link
Contributor

alamb commented Mar 28, 2023

Update is that @comphead is tracking the issues on #5747 (thanks @comphead !)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants