Schema error when returning DenseUnion from ScalarUDF #13762

@kylebarron

Describe the bug

Returning a dense union array from a ScalarUDF currently fails with a schema mismatch error.

To Reproduce

use std::any::Any;
use std::sync::{Arc, OnceLock};

use arrow::array::UnionBuilder;
use arrow::datatypes::{Float64Type, Int32Type};
use arrow_array::Array;
use arrow_schema::{DataType, Field, UnionFields, UnionMode};
use datafusion::logical_expr::{
    ColumnarValue, Documentation, ScalarUDFImpl, Signature, Volatility,
};

#[derive(Debug)]
pub(super) struct UnionExample {
    signature: Signature,
}

impl UnionExample {
    pub fn new() -> Self {
        Self {
            signature: Signature::any(0, Volatility::Immutable),
        }
    }
}

static DOC: OnceLock<Documentation> = OnceLock::new();

impl ScalarUDFImpl for UnionExample {
    fn as_any(&self) -> &dyn Any {
        self
    }

    fn name(&self) -> &str {
        "example_union"
    }

    fn signature(&self) -> &Signature {
        &self.signature
    }

    fn return_type(&self, _arg_types: &[DataType]) -> datafusion::error::Result<DataType> {
        let fields = UnionFields::new(
            vec![0, 1],
            vec![
                Arc::new(Field::new("a", DataType::Int32, false)),
                Arc::new(Field::new("b", DataType::Float64, false)),
            ],
        );
        Ok(DataType::Union(fields, UnionMode::Dense))
    }

    fn invoke(&self, _args: &[ColumnarValue]) -> datafusion::error::Result<ColumnarValue> {
        // Not expected to be reached in this reproduction: the UDF is called with zero arguments.
        todo!()
    }

    fn invoke_no_args(&self, _number_rows: usize) -> datafusion::error::Result<ColumnarValue> {
        let mut builder = UnionBuilder::new_dense();
        builder.append::<Int32Type>("a", 1).unwrap();
        builder.append::<Float64Type>("b", 3.0).unwrap();
        builder.append::<Int32Type>("a", 4).unwrap();
        let arr = builder.build().unwrap();

        assert_eq!(arr.type_id(0), 0);
        assert_eq!(arr.type_id(1), 1);
        assert_eq!(arr.type_id(2), 0);

        assert_eq!(arr.value_offset(0), 0);
        assert_eq!(arr.value_offset(1), 0);
        assert_eq!(arr.value_offset(2), 1);

        let arr = arr.slice(0, 1);

        assert!(matches!(
            arr.data_type(),
            DataType::Union(_, UnionMode::Dense)
        ));

        Ok(ColumnarValue::Array(Arc::new(arr)))
    }

    fn documentation(&self) -> Option<&Documentation> {
        Some(DOC.get_or_init(|| Documentation::builder().build().unwrap()))
    }
}

#[cfg(test)]
mod test {
    use super::*;
    use datafusion::prelude::*;

    #[tokio::test]
    async fn test() {
        let ctx = SessionContext::new();
        ctx.register_udf(UnionExample::new().into());

        let out = ctx.sql("SELECT example_union();").await.unwrap();
        out.show().await.unwrap();
    }
}

Gives

called `Result::unwrap()` on an `Err` value: 
ArrowError(InvalidArgumentError("column types must match schema types, expected 
Union([(0, Field { name: \"a\", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), (1, Field { name: \"b\", data_type: Float64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} })], Dense) 
but found 
Union([(0, Field { name: \"a\", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), (1, Field { name: \"b\", data_type: Float64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} })], Sparse) at column index 0"), None)

The only difference between the two types is the union mode: "expected" is Dense while "found" is Sparse. I'm returning dense array data from invoke_no_args, and return_type() also declares a dense union, so it seems the union array is being converted from dense to sparse somewhere internally.
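For readers less familiar with the two union layouts, here is a minimal std-only sketch of how they differ and why a mode mismatch alone is enough to fail a type equality check. It is a toy model, not the arrow-rs API: `Mode`, `Value`, `ToyUnion`, `build`, and `check_mode` are all hypothetical names made up for illustration, and the error string only imitates the shape of the message above.

```rust
/// Hypothetical, simplified stand-in for the union mode (not the arrow-rs type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Mode {
    Dense,
    Sparse,
}

/// One value of either child type from the reproduction: "a" (Int32) or "b" (Float64).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Value {
    A(i32),
    B(f64),
}

/// Toy union "array": a type id per row, offsets for dense mode only, one buffer per child.
#[derive(Debug, PartialEq)]
pub struct ToyUnion {
    pub mode: Mode,
    pub type_ids: Vec<i8>,
    pub offsets: Option<Vec<i32>>,
    pub child_a: Vec<Option<i32>>,
    pub child_b: Vec<Option<f64>>,
}

/// Build a toy union in the requested mode.
/// Dense: each child stores only its own values, and `offsets` points into the child.
/// Sparse: every child is as long as the union; unused slots hold a placeholder (`None`).
pub fn build(mode: Mode, values: &[Value]) -> ToyUnion {
    let mut type_ids = Vec::new();
    let mut offsets = Vec::new();
    let mut child_a = Vec::new();
    let mut child_b = Vec::new();
    for v in values {
        match *v {
            Value::A(x) => {
                type_ids.push(0);
                offsets.push(child_a.len() as i32);
                child_a.push(Some(x));
                if mode == Mode::Sparse {
                    child_b.push(None); // placeholder slot in the other child
                }
            }
            Value::B(x) => {
                type_ids.push(1);
                offsets.push(child_b.len() as i32);
                child_b.push(Some(x));
                if mode == Mode::Sparse {
                    child_a.push(None); // placeholder slot in the other child
                }
            }
        }
    }
    // Offsets are only part of the dense layout.
    let offsets = if mode == Mode::Dense { Some(offsets) } else { None };
    ToyUnion { mode, type_ids, offsets, child_a, child_b }
}

/// Imitates a strict "column type must equal schema type" check:
/// identical fields are not enough, the mode has to match too.
pub fn check_mode(expected: Mode, found: Mode) -> Result<(), String> {
    if expected == found {
        Ok(())
    } else {
        Err(format!("expected {:?} but found {:?}", expected, found))
    }
}
```

Calling `build(Mode::Dense, &[Value::A(1), Value::B(3.0), Value::A(4)])` gives type ids `[0, 1, 0]` and offsets `[0, 0, 1]`, which lines up with the assertions in invoke_no_args above. Building the same values with `Mode::Sparse` gives two children of length 3 and no offsets, and `check_mode(Mode::Dense, Mode::Sparse)` returns an error even though the child fields are the same.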

Expected behavior

Returning a dense union from a ScalarUDF should not error.

Additional context

I need to use a dense union to represent geospatial vector data of unknown geometry type and coordinate dimension. geoarrow/geoarrow#43

Labels: bug
