Description
Describe the bug
Given a table in a SessionContext and the RecordBatch that backs it (e.g. through ctx.register_batch()), I want to refer to the table's columns using the field names found in the RecordBatch's schema.
In some cases, this fails with col(col_name). I must instead use col(format!("\"{col_name}\"")), which is hard to discover and easy to forget even when one is aware of the issue. This is compounded by the fact that only specific column names trigger the failure: "A" fails, but "Column A" does not. (I'm now assuming that the space in the latter triggers some auto-escaping mechanism.)
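To make the workaround harder to miss, the quoting can live in one small helper. The sketch below is only an illustration: quote_ident is a hypothetical name, not a DataFusion function, and it assumes that doubling embedded double quotes (the standard SQL escape) is what the identifier parser expects.

```rust
/// Wraps a column name in double quotes so that it is treated as a quoted
/// (case-preserving) identifier. Embedded double quotes are doubled, which is
/// the standard SQL escape. Hypothetical helper, not part of DataFusion.
fn quote_ident(name: &str) -> String {
    format!("\"{}\"", name.replace('"', "\"\""))
}
```

With this in place, the select call would read .select(vec![col(quote_ident(col_name))])?. DataFusion also appears to offer an ident function that builds a column expression without normalizing the name; whether it is available depends on the version in use, so check the API docs before relying on it.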
To Reproduce
use arrow::array::{Int64Array, RecordBatch};
use arrow::datatypes::{DataType, Field, Schema};
use datafusion::common::DataFusionError;
use datafusion::logical_expr::col;
use datafusion::prelude::SessionContext;
use std::sync::Arc;

async fn test_single_column(col_name: &str) -> Result<(), DataFusionError> {
    // create a simple batch
    let column = Int64Array::from(vec![1, 2, 3]);
    let schema = Schema::new(vec![Field::new(col_name, DataType::Int64, false)]);
    println!("Column name: {col_name}");
    println!("Initial arrow schema name: {}", schema.fields()[0].name());
    let batch = RecordBatch::try_new(Arc::new(schema), vec![Arc::new(column)])
        .expect("could not create record batch");

    // create a DataFusion context
    let ctx = SessionContext::new();
    ctx.register_batch("test", batch)?;
    println!(
        "Session context schema name: {}",
        ctx.table("test").await?.schema().fields()[0].name()
    );

    let result = ctx
        .table("test")
        .await?
        .select(vec![col(col_name)])?
        // use this instead to avoid the issue
        //.select(vec![col(format!("\"{col_name}\""))])?
        .collect()
        .await?
        .into_iter()
        .last()
        .ok_or(DataFusionError::External("no batch returned".into()))?;
    println!(
        "Result batch schema name: {}",
        result.schema().fields()[0].name()
    );
    Ok(())
}

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let names = &["A", "a", "Column A"];
    for name in names {
        if let Err(e) = test_single_column(name).await {
            eprintln!("Error processing column name '{}': {}", name, e);
        }
        println!("--------------------------------");
    }
    Ok(())
}
Result:
Column name: A
Initial arrow schema name: A
Session context schema name: A
Error processing column name 'A': Schema error: No field named a. Valid fields are test."A".
--------------------------------
Column name: a
Initial arrow schema name: a
Session context schema name: a
Result batch schema name: a
--------------------------------
Column name: Column A
Initial arrow schema name: Column A
Session context schema name: Column A
Result batch schema name: Column A
Noteworthy:
- The error message is particularly confusing, since I did use "A". (Edit: once you know the issue, the quotes in the message make it technically correct.)
- The (seemingly) inconsistent behaviour between "A" and "Column A" (with the latter actually working).
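The three outcomes above are consistent with a simple rule, sketched below as a toy model. This is a guess fitted to the observed results only, not DataFusion's actual implementation; resolve_like_col is a hypothetical name, and the "used verbatim" branch is an assumption. It also ignores qualified names such as table.column.

```rust
/// Toy model of how a name passed to `col` appears to be resolved, based only
/// on the three results above. Hypothetical; not DataFusion's code.
fn resolve_like_col(raw: &str) -> String {
    let is_quoted = raw.len() >= 2 && raw.starts_with('"') && raw.ends_with('"');
    if is_quoted {
        // Quoted identifier: strip the outer quotes and keep the case as written.
        raw[1..raw.len() - 1].replace("\"\"", "\"")
    } else if !raw.is_empty() && raw.chars().all(|c| c.is_ascii_alphanumeric() || c == '_') {
        // Looks like a bare SQL identifier: folded to lowercase.
        raw.to_lowercase()
    } else {
        // Assumption: anything else (e.g. a name containing a space) is used verbatim.
        raw.to_string()
    }
}
```

Under this model, "A" resolves to a (which does not match the field A), while "a", "Column A", and the quoted form "\"A\"" all resolve to the name stored in the schema.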
Expected behavior
All three test cases pass
Additional context
In this test case, as in the actual codebase I'm working on, I am not using any SQL. This makes the name-casing issue particularly unexpected.
Probably related: