Skip to content

Optimizer Internal Error: 'eliminate_cross_join' causes schema drift when LEFT JOIN has scalar subquery over same CTE (SQLStorm) #17774

@2010YOUY01

Description

@2010YOUY01

Describe the bug

datafusion-cli is compiled from the latest main commit 5bbdb7e

To reproduce, tpch dataset can be generated via https://github.com/clflushopt/tpchgen-rs/tree/main/tpchgen-cli, and update the following paths

CREATE EXTERNAL TABLE customer
STORED AS PARQUET
LOCATION '/Users/yongting/Code/datafusion-sqlstorm/data/customer.parquet';

CREATE EXTERNAL TABLE nation
STORED AS PARQUET
LOCATION '/Users/yongting/Code/datafusion-sqlstorm/data/nation.parquet';

CREATE EXTERNAL TABLE part
STORED AS PARQUET
LOCATION '/Users/yongting/Code/datafusion-sqlstorm/data/part.parquet';

CREATE EXTERNAL TABLE region
STORED AS PARQUET
LOCATION '/Users/yongting/Code/datafusion-sqlstorm/data/region.parquet';

CREATE EXTERNAL TABLE lineitem
STORED AS PARQUET
LOCATION '/Users/yongting/Code/datafusion-sqlstorm/data/lineitem.parquet';

CREATE EXTERNAL TABLE orders
STORED AS PARQUET
LOCATION '/Users/yongting/Code/datafusion-sqlstorm/data/orders.parquet';

CREATE EXTERNAL TABLE partsupp
STORED AS PARQUET
LOCATION '/Users/yongting/Code/datafusion-sqlstorm/data/partsupp.parquet';

CREATE EXTERNAL TABLE supplier
STORED AS PARQUET
LOCATION '/Users/yongting/Code/datafusion-sqlstorm/data/supplier.parquet';

WITH MonthlySales AS (
  SELECT
    DATE_TRUNC('month', o_orderdate) AS sales_month,
    SUM(l_extendedprice)             AS total_sales
  FROM orders AS o
  JOIN lineitem AS l
    ON o.o_orderkey = l.l_orderkey
  GROUP BY sales_month
),
TopRegions AS (
  SELECT
    n.n_name                 AS region_name,
    SUM(ps.ps_supplycost)    AS region_supply_cost
  FROM partsupp AS ps
  JOIN supplier AS s
    ON ps.ps_suppkey = s.s_suppkey
  JOIN nation AS n
    ON s.s_nationkey = n.n_nationkey
  GROUP BY region_name
  ORDER BY region_supply_cost DESC
  LIMIT 5
)
SELECT
  ms.sales_month,
  ms.total_sales,
  tr.region_name,
  tr.region_supply_cost
FROM MonthlySales AS ms
LEFT JOIN TopRegions AS tr
  ON tr.region_supply_cost > (
       SELECT AVG(region_supply_cost) FROM TopRegions
     );

Optimizer rule 'eliminate_cross_join' failed
caused by
Check optimizer-specific invariants after optimizer rule: eliminate_cross_join
caused by
Internal error: Failed due to a difference in schemas: original schema: DFSchema { inner: Schema { fields: [Field { name: "sales_month", data_type: Timestamp(Nanosecond, None), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "total_sales", data_type: Decimal128(25, 2), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "region_name", data_type: Utf8View, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "region_supply_cost", data_type: Decimal128(25, 2), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, field_qualifiers: [Some(Bare { table: "ms" }), Some(Bare { table: "ms" }), Some(Bare { table: "tr" }), Some(Bare { table: "tr" })], functional_dependencies: FunctionalDependencies { deps: [FunctionalDependence { source_indices: [0], target_indices: [0, 1], nullable: true, mode: Multi }, FunctionalDependence { source_indices: [2], target_indices: [2, 3], nullable: true, mode: Multi }] } }, new schema: DFSchema { inner: Schema { fields: [Field { name: "sales_month", data_type: Timestamp(Nanosecond, None), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "total_sales", data_type: Decimal128(25, 2), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "region_name", data_type: Utf8View, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "region_supply_cost", data_type: Decimal128(25, 2), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "avg(topregions.region_supply_cost)", data_type: Decimal128(29, 6), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, field_qualifiers: [Some(Bare { table: "ms" }), Some(Bare { table: "ms" }), Some(Bare { table: "tr" }), Some(Bare { table: "tr" }), Some(Bare { table: "tr" })], functional_dependencies: FunctionalDependencies { deps: [FunctionalDependence { source_indices: [0], target_indices: [0, 1], nullable: true, mode: Multi }, FunctionalDependence { source_indices: [2], target_indices: [2, 3], nullable: true, mode: Multi }] } }.
This issue was likely caused by a bug in DataFusion's code. Please help us to resolve this by filing a bug report in our issue tracker: https://github.com/apache/datafusion/issues

To Reproduce

No response

Expected behavior

No response

Additional context

Found by SQLStorm #17698

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions