Internal error: Physical input schema should be the same as the one converted from logical input schema. #35

Description

@alamb

Describe the bug

When running queries against parquet files whose fields carry metadata, and that metadata is not stripped on read, DataFusion fails with the error in the title.

To Reproduce

Repro

-- First, ensure that parquet metadata is not skipped (it is skipped by default)
> set datafusion.execution.parquet.skip_metadata = false;

SELECT
  'foo' AS name,
  COUNT(
    CASE
      WHEN prev_value = FALSE AND value = TRUE THEN 1
      ELSE NULL
    END
  ) AS count_true_rises
FROM
  (
    SELECT
      value,
      LAG(value) OVER (ORDER BY time ) AS prev_value
    FROM
      'repro.parquet'
);

Results in

Internal error: Physical input schema should be the same as the one converted from logical input schema. Differences: .
This issue was likely caused by a bug in DataFusion's code. Please help us to resolve this by filing a bug report in our issue tracker: https://github.com/apache/datafusion/issues
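
For reference (not part of the original report), the same failure should be reproducible from the Rust API instead of the CLI. A minimal sketch, assuming a recent DataFusion release where SessionConfig::set_bool, SessionContext::new_with_config and register_parquet are available:

use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Keep parquet field metadata when reading (it is skipped by default)
    let config = SessionConfig::new()
        .set_bool("datafusion.execution.parquet.skip_metadata", false);
    let ctx = SessionContext::new_with_config(config);

    // Register the file generated by the snippet further down and run the same query
    ctx.register_parquet("repro", "repro.parquet", ParquetReadOptions::default())
        .await?;

    let df = ctx
        .sql(
            "SELECT 'foo' AS name,
                    COUNT(CASE WHEN prev_value = FALSE AND value = TRUE THEN 1 ELSE NULL END) AS count_true_rises
             FROM (SELECT value, LAG(value) OVER (ORDER BY time) AS prev_value FROM repro)",
        )
        .await?;
    df.show().await?;
    Ok(())
}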

I made the parquet file available here:

parquet-with-metadata.zip

Here is the code used to generate the parquet file (I am not sure how else to create parquet files with field metadata):

Details

use std::collections::HashMap;
use std::fs::File;
use std::sync::Arc;
use arrow::array::{BooleanArray, RecordBatch, TimestampNanosecondArray};
use arrow::datatypes::{DataType, Field, Schema, SchemaRef, TimeUnit};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Write a parquet file whose "value" field carries key/value metadata
    let mut metadata = HashMap::new();
    metadata.insert(String::from("year"), String::from("2015"));
    let schema: SchemaRef = Arc::new(Schema::new(vec![
        Field::new("time", DataType::Timestamp(TimeUnit::Nanosecond, None), false),
        Field::new("value", DataType::Boolean, false)
            .with_metadata(metadata),
    ]));

    let time = TimestampNanosecondArray::from(vec![1_420_070_400_000_000_000i64, 1_420_070_401_000_000_000i64]);
    let value = BooleanArray::from(vec![true, false]);
    let batch = RecordBatch::try_new(schema.clone(), vec![
        Arc::new(time),
        Arc::new(value),
    ])?;


    // Write the batch to repro.parquet using the Arrow parquet writer
    println!("Writing parquet file with metadata repro.parquet...");
    let writer = File::create("repro.parquet")?;
    let mut arrow_writer = parquet::arrow::ArrowWriter::try_new(
        writer,
        schema.clone(),
        None,
    )?;
    arrow_writer.write(&batch)?;
    arrow_writer.close()?;

    Ok(())
}
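
Not part of the original report, but as a quick sanity check the field metadata can be read back out of repro.parquet with the parquet crate's Arrow reader. A minimal sketch:

use std::fs::File;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Open the generated file and inspect the Arrow schema embedded in it
    let file = File::open("repro.parquet")?;
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;
    // Print any key/value metadata attached to each field
    for field in builder.schema().fields() {
        println!("{}: {:?}", field.name(), field.metadata());
    }
    Ok(())
}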

Note that this is all the more confusing because the error message lists no differences:

...  converted from logical input schema. Differences: . <-- no differences are listed!!!

The actual difference is the field-level metadata on the value column.
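
To illustrate why the two schemas compare as unequal (this example is mine, not from the report): Arrow's Field equality appears to include field-level metadata, so a schema with metadata on value is not equal to the same schema without it.

use std::collections::HashMap;
use arrow::datatypes::{DataType, Field, Schema};

fn main() {
    // Two schemas that differ only in field-level metadata on "value"
    let plain = Schema::new(vec![Field::new("value", DataType::Boolean, false)]);
    let with_md = Schema::new(vec![Field::new("value", DataType::Boolean, false)
        .with_metadata(HashMap::from([("year".to_string(), "2015".to_string())]))]);

    // Field-level metadata participates in equality, so these schemas are not equal,
    // which is presumably what trips the physical/logical schema check
    assert_ne!(plain, with_md);
}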

Expected behavior

I expect the query to run without error.

Additional context

_No response_
