When the file extension is changed, read silently fails. #7954

@jsimpson-gro

Description

Describe the bug

When reading a Parquet file whose extension does not match the default `file_extension` in `ParquetReadOptions` (`.parquet`), `read_parquet` returns an empty DataFrame instead of an error. I believe this is easy to run into as a beginner (reading a Parquet file with the default read options was one of the first things I tried), and the lack of feedback makes it hard to debug.

To Reproduce

//! ```cargo
//! [dependencies]
//! anyhow = "1.0"
//! datafusion = "32.0"
//! tokio = { version = "1.33", features = ["macros", "rt-multi-thread"] }
//! ```
use std::sync::Arc;

use datafusion::arrow::array::{Float32Array, Int32Array};
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::dataframe::DataFrameWriteOptions;
use datafusion::parquet::basic::Compression;
use datafusion::parquet::file::properties::WriterProperties;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let ctx = SessionContext::new();

    // Make up a new dataframe.
    let write_df = ctx.read_batch(RecordBatch::try_new(
        Arc::new(Schema::new(vec![
            Field::new("purchase_id", DataType::Int32, false),
            Field::new("price", DataType::Float32, false),
            Field::new("quantity", DataType::Int32, false),
        ])),
        vec![
            Arc::new(Int32Array::from(vec![1, 2, 3, 4, 5])),
            Arc::new(Float32Array::from(vec![1.12, 3.40, 2.33, 9.10, 6.66])),
            Arc::new(Int32Array::from(vec![1, 3, 2, 4, 3])),
        ],
    )?)?;

    write_df
        .write_parquet(
            "output.parquet.snappy",
            DataFrameWriteOptions::new().with_single_file_output(true),
            Some(
                WriterProperties::builder()
                    .set_compression(Compression::SNAPPY)
                    .build(),
            ),
        )
        .await?;

    let read_df = ctx
        .read_parquet(
            "output.parquet.snappy",
            ParquetReadOptions {
                // If this line is uncommented, the read will be successful.
                // file_extension: "parquet.snappy",
                ..Default::default()
            },
        )
        .await?;

    read_df.show().await?;

    Ok(())
}

Gives output:

$ cargo run
   Compiling bug_repro_open_no_error v0.1.0 (/home/jacob/src/gro-sandbox/user/jsimpson/app/bug_repro_open_no_error)
    Finished dev [unoptimized + debuginfo] target(s) in 7.56s
     Running `target/debug/bug_repro_open_no_error`
++
++

Expected behavior

I would like to get an error if the read fails, rather than an empty DataFrame.
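
As a rough illustration of the requested behavior, here is a minimal std-only sketch of a guard that rejects a single-file path whose name does not end with the configured extension. `check_extension` and `ExtensionMismatch` are hypothetical names, not part of DataFusion, and the suffix-based comparison is an assumption based on the fact that `"parquet.snappy"` works in the repro above.

```rust
use std::fmt;

/// Hypothetical error type for illustration; not a DataFusion type.
#[derive(Debug, PartialEq)]
pub struct ExtensionMismatch {
    pub path: String,
    pub expected: String,
}

impl fmt::Display for ExtensionMismatch {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "path '{}' does not end with expected extension '{}'",
            self.path, self.expected
        )
    }
}

impl std::error::Error for ExtensionMismatch {}

/// Hypothetical guard: returns an error instead of silently skipping the file.
/// Assumes the extension is matched as a plain suffix of the path.
pub fn check_extension(path: &str, expected: &str) -> Result<(), ExtensionMismatch> {
    if path.ends_with(expected) {
        Ok(())
    } else {
        Err(ExtensionMismatch {
            path: path.to_string(),
            expected: expected.to_string(),
        })
    }
}
```

With a check like this, calling `check_extension("output.parquet.snappy", ".parquet")` would return an error naming both the path and the expected extension, which is the kind of feedback that would have made the problem obvious.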

Additional context

No response

Labels: bug, good first issue