Inconsistent column name case handling when round tripping names from arrow metadata #15922

@abey79

Describe the bug

Given a table in a SessionContext and the RecordBatch that backs it (e.g. through ctx.register_batch()), I want to refer to the table's columns using the field names found in the RecordBatch's schema.

In some cases, col(col_name) fails, and I must instead write col(format!("\"{col_name}\"")). This workaround is hard to discover and easy to forget even when one is aware of the issue. The problem is compounded by the fact that only some column names trigger the failure: "A" does, but "Column A" does not. (I now assume that the space in the latter triggers some auto-escaping mechanism.)
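The quoting workaround can be wrapped in a small helper so that it is applied uniformly. The sketch below is only an illustration: quote_identifier is a hypothetical name, not a DataFusion API, and doubling embedded double quotes follows the usual SQL convention, which is assumed (not verified here) to be what the identifier parser expects.

```rust
/// Hypothetical helper (not part of DataFusion): wrap a field name in double
/// quotes so that it is treated as a case-sensitive identifier.
/// Embedded double quotes are doubled, following the usual SQL convention;
/// whether the parser honours that escape is an assumption.
fn quote_identifier(name: &str) -> String {
    format!("\"{}\"", name.replace('"', "\"\""))
}

// Intended use with the reproduction code in this report (assumption):
//     .select(vec![col(quote_identifier(col_name))])?
```

With this helper, "A" becomes "\"A\"" and "Column A" becomes "\"Column A\"", so every field name is passed to col() in the same quoted form regardless of its case or content.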

To Reproduce

use arrow::array::{Int64Array, RecordBatch};
use arrow::datatypes::{DataType, Field, Schema};
use datafusion::common::DataFusionError;
use datafusion::logical_expr::col;
use datafusion::prelude::SessionContext;
use std::sync::Arc;

async fn test_single_column(col_name: &str) -> Result<(), DataFusionError> {
    // create a simple batch
    let column = Int64Array::from(vec![1, 2, 3]);
    let schema = Schema::new(vec![Field::new(col_name, DataType::Int64, false)]);

    println!("Column name: {col_name}");
    println!("Initial arrow schema name: {}", schema.fields()[0].name());

    let batch = RecordBatch::try_new(Arc::new(schema), vec![Arc::new(column)])
        .expect("could not create record batch");

    // create a DataFusion context
    let ctx = SessionContext::new();
    ctx.register_batch("test", batch)?;

    println!(
        "Session context schema name: {}",
        ctx.table("test").await?.schema().fields()[0].name()
    );

    let result = ctx
        .table("test")
        .await?
        .select(vec![col(col_name)])?
        // use this instead to avoid the issue
        //.select(vec![col(format!("\"{col_name}\""))])?
        .collect()
        .await?
        .into_iter()
        .last()
        .ok_or(DataFusionError::External("no batch returned".into()))?;

    println!(
        "Result batch schema name: {}",
        result.schema().fields()[0].name()
    );

    Ok(())
}

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let names = &["A", "a", "Column A"];

    for name in names {
        if let Err(e) = test_single_column(name).await {
            eprintln!("Error processing column name '{}': {}", name, e);
        }

        println!("--------------------------------");
    }

    Ok(())
}

Result:

Column name: A
Initial arrow schema name: A
Session context schema name: A
Error processing column name 'A': Schema error: No field named a. Valid fields are test."A".
--------------------------------
Column name: a
Initial arrow schema name: a
Session context schema name: a
Result batch schema name: a
--------------------------------
Column name: Column A
Initial arrow schema name: Column A
Session context schema name: Column A
Result batch schema name: Column A

Noteworthy:

  • The error message is particularly confusing, since I did pass "A". (Edit: once you know the issue, you may notice that the quotes around "A" make the message technically correct.)
  • The behaviour is (seemingly) inconsistent between "A" and "Column A", with the latter actually working.
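The three outcomes above are consistent with SQL-style identifier normalization being applied to the string passed to col(). The following is a toy model of that assumed behaviour, written only to make the hypothesis concrete. It is not DataFusion's code, and normalize_column_name is a hypothetical name.

```rust
/// Toy model of the ASSUMED normalization (not DataFusion's implementation):
/// - a double-quoted name keeps its case, with the quotes stripped;
/// - a bare identifier (letters, digits, underscores) is lower-cased;
/// - anything else (e.g. a name containing a space) is used verbatim.
fn normalize_column_name(raw: &str) -> String {
    // Quoted identifier: strip the outer quotes and undo doubled quotes.
    if raw.len() >= 2 && raw.starts_with('"') && raw.ends_with('"') {
        return raw[1..raw.len() - 1].replace("\"\"", "\"");
    }

    // Bare identifier: starts with a letter or underscore, and every
    // following character is a letter, digit, or underscore.
    let mut chars = raw.chars();
    let is_bare = match chars.next() {
        Some(c) if c.is_ascii_alphabetic() || c == '_' => {
            chars.all(|c| c.is_ascii_alphanumeric() || c == '_')
        }
        _ => false,
    };

    if is_bare {
        raw.to_lowercase()
    } else {
        raw.to_string()
    }
}
```

Under this model, "A" resolves to the non-existent field a, "a" resolves to itself, and "Column A" is left untouched, which matches the output shown in the result section.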

Expected behavior

All three test cases pass.

Additional context

In this test case, as in the actual codebase I'm working on, I am not using any SQL. This makes the name-casing issue particularly unexpected.

Probably related:

Labels: bug
