Closed
Labels: bug, good first issue
Description
Describe the bug
When reading a Parquet file whose extension does not match the default configuration, an empty DataFrame is produced and no error is reported. I believe this is easy to run into as a beginner (reading a Parquet file with the default read options is one of the first things I tried), and the lack of feedback makes it hard to debug.
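To illustrate why a mismatched extension can yield an empty result rather than an error, here is a minimal sketch of suffix-based file filtering. It assumes the reader keeps only paths that end with the configured extension, and that the default is `.parquet`. `filter_by_extension` is a hypothetical helper written for this illustration, not DataFusion's actual code.

```rust
/// Keep only the paths that end with `extension`.
/// Hypothetical helper for illustration; not part of DataFusion.
fn filter_by_extension<'a>(paths: &[&'a str], extension: &str) -> Vec<&'a str> {
    paths
        .iter()
        .copied()
        // A path that does not end with the extension is silently dropped.
        .filter(|path| path.ends_with(extension))
        .collect()
}
```

Under that assumption, `output.parquet.snappy` does not end with `.parquet`, so the list of files to scan is empty and the read succeeds with zero rows.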
To Reproduce
//! ```cargo
//! [dependencies]
//! anyhow = "1.0"
//! datafusion = "32.0"
//! tokio = { version = "1.33", features = ["macros", "rt-multi-thread"] }
//! ```
use std::sync::Arc;

use datafusion::arrow::array::{Float32Array, Int32Array};
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::dataframe::DataFrameWriteOptions;
use datafusion::parquet::basic::Compression;
use datafusion::parquet::file::properties::WriterProperties;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let ctx = SessionContext::new();

    // Make up a new dataframe.
    let write_df = ctx.read_batch(RecordBatch::try_new(
        Arc::new(Schema::new(vec![
            Field::new("purchase_id", DataType::Int32, false),
            Field::new("price", DataType::Float32, false),
            Field::new("quantity", DataType::Int32, false),
        ])),
        vec![
            Arc::new(Int32Array::from(vec![1, 2, 3, 4, 5])),
            Arc::new(Float32Array::from(vec![1.12, 3.40, 2.33, 9.10, 6.66])),
            Arc::new(Int32Array::from(vec![1, 3, 2, 4, 3])),
        ],
    )?)?;

    // Write it to a single file whose name does not end in ".parquet".
    write_df
        .write_parquet(
            "output.parquet.snappy",
            DataFrameWriteOptions::new().with_single_file_output(true),
            Some(
                WriterProperties::builder()
                    .set_compression(Compression::SNAPPY)
                    .build(),
            ),
        )
        .await?;

    // Read the same file back with the default read options.
    let read_df = ctx
        .read_parquet(
            "output.parquet.snappy",
            ParquetReadOptions {
                // If this line is uncommented, the read will be successful.
                // file_extension: "parquet.snappy",
                ..Default::default()
            },
        )
        .await?;

    read_df.show().await?;

    Ok(())
}
Gives output:
$ cargo run
Compiling bug_repro_open_no_error v0.1.0 (/home/jacob/src/gro-sandbox/user/jsimpson/app/bug_repro_open_no_error)
Finished dev [unoptimized + debuginfo] target(s) in 7.56s
Running `target/debug/bug_repro_open_no_error`
++
++
Expected behavior
I would like to get an error if the read fails, rather than an empty DataFrame.
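One way to get this behavior, sketched as a caller-side check run before the read: compare the path against the expected extension and return an error on a mismatch. `ExtensionMismatch` and `ensure_extension` are hypothetical names introduced for this illustration; they are not part of DataFusion, and the suffix comparison is an assumption about how the reader matches files.

```rust
use std::fmt;

/// Error describing a path that does not end with the expected extension.
/// Hypothetical type for illustration; not part of DataFusion.
#[derive(Debug, PartialEq)]
struct ExtensionMismatch {
    path: String,
    expected: String,
}

impl fmt::Display for ExtensionMismatch {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "path '{}' does not end with the expected extension '{}'",
            self.path, self.expected
        )
    }
}

impl std::error::Error for ExtensionMismatch {}

/// Return an error instead of letting a mismatched path be silently ignored.
fn ensure_extension(path: &str, expected: &str) -> Result<(), ExtensionMismatch> {
    if path.ends_with(expected) {
        Ok(())
    } else {
        Err(ExtensionMismatch {
            path: path.to_string(),
            expected: expected.to_string(),
        })
    }
}
```

Calling `ensure_extension("output.parquet.snappy", ".parquet")?` before `read_parquet` in the reproduction would surface the mismatch as an error rather than an empty table. This check only fits single-file paths; a directory read would need a different rule, such as erroring when no files match.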
Additional context
No response