Describe the bug
When you use the lead or lag built-in window functions on a column whose data type is a list or struct, the query fails with the error: Exception: Arrow error: Compute error: concat requires input of at least one array
I have root-caused this to list_to_array_of_size in datafusion/common/src/scalar/mod.rs, which does not check whether the arrays it is about to concat have any contents. They will not, because WindowAggState::new() calls to_array_of_size(0). These calls work for primitive data, but list data needs an additional check. I am submitting a PR to resolve the issue.
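To illustrate the failure mode, here is a minimal std-only sketch. It is not the DataFusion code: ListColumn, concat, repeat_unguarded and repeat_guarded are hypothetical stand-ins that only model the behavior described above, where repeating a list value zero times hands an empty slice to concat. The guarded variant shows the kind of size check that is missing; the actual fix in the PR may look different.

```rust
/// Hypothetical stand-in for an Arrow list array: one inner Vec per row.
type ListColumn = Vec<Vec<i64>>;

/// Models a concat kernel that rejects an empty slice of inputs,
/// using the same error text as in the report above.
fn concat(parts: &[ListColumn]) -> Result<ListColumn, String> {
    if parts.is_empty() {
        return Err("concat requires input of at least one array".to_string());
    }
    Ok(parts.iter().flatten().cloned().collect())
}

/// Unguarded: builds `size` one-row columns and concatenates them.
/// With size == 0 there is nothing to concatenate, so this returns an error.
fn repeat_unguarded(value: &[i64], size: usize) -> Result<ListColumn, String> {
    let parts: Vec<ListColumn> = (0..size).map(|_| vec![value.to_vec()]).collect();
    concat(&parts)
}

/// Guarded: returns an empty column up front when size == 0,
/// and only calls concat when there is at least one part.
fn repeat_guarded(value: &[i64], size: usize) -> Result<ListColumn, String> {
    if size == 0 {
        return Ok(ListColumn::new());
    }
    repeat_unguarded(value, size)
}
```

Calling repeat_unguarded(&[1, 2, 3], 0) returns the concat error, while repeat_guarded(&[1, 2, 3], 0) returns an empty column.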
To Reproduce
The data file is a simple CSV:
a,b,c
1,2,3
4,5,6
7,8,9
10,11,12
Code to reproduce:
use datafusion::{
    logical_expr::{
        expr::WindowFunction, BuiltInWindowFunction, WindowFrame, WindowFunctionDefinition,
    },
    prelude::*,
};

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    let mut df = ctx
        .read_csv(
            "/Users/tsaucer/working/testing_ballista/lead_lag/example.csv",
            CsvReadOptions::default(),
        )
        .await?;
    // Build a list column from the three integer columns.
    df = df.with_column("array_col", make_array(vec![col("a"), col("b"), col("c")]))?;
    df.clone().show().await?;

    // Note: despite the variable name, this builds a lead() expression.
    let lag_expr = Expr::WindowFunction(WindowFunction::new(
        WindowFunctionDefinition::BuiltInWindowFunction(BuiltInWindowFunction::Lead),
        vec![col("array_col")],
        vec![],
        vec![],
        WindowFrame::new(None),
        None,
    ));
    df = df.select(vec![col("a"), col("b"), col("c"), col("array_col"), lag_expr.alias("lagged")])?;
    // The error is raised here, when the window function is evaluated.
    df.show().await?;
    Ok(())
}
Results:
+----+----+----+--------------+
| a  | b  | c  | array_col    |
+----+----+----+--------------+
| 1  | 2  | 3  | [1, 2, 3]    |
| 4  | 5  | 6  | [4, 5, 6]    |
| 7  | 8  | 9  | [7, 8, 9]    |
| 10 | 11 | 12 | [10, 11, 12] |
+----+----+----+--------------+
Error: ArrowError(ComputeError("concat requires input of at least one array"), None)
Expected behavior
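lead and lag should work on list and struct columns. Here is the output from the PR I will put up shortly (the example uses lead, so each row of the lagged column shows the next row's array).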
+----+----+----+--------------+
| a  | b  | c  | array_col    |
+----+----+----+--------------+
| 1  | 2  | 3  | [1, 2, 3]    |
| 4  | 5  | 6  | [4, 5, 6]    |
| 7  | 8  | 9  | [7, 8, 9]    |
| 10 | 11 | 12 | [10, 11, 12] |
+----+----+----+--------------+
+----+----+----+--------------+--------------+
| a  | b  | c  | array_col    | lagged       |
+----+----+----+--------------+--------------+
| 1  | 2  | 3  | [1, 2, 3]    | [4, 5, 6]    |
| 4  | 5  | 6  | [4, 5, 6]    | [7, 8, 9]    |
| 7  | 8  | 9  | [7, 8, 9]    | [10, 11, 12] |
| 10 | 11 | 12 | [10, 11, 12] |              |
+----+----+----+--------------+--------------+
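For reference, the last column in the table above follows the usual lead semantics with an offset of 1 and no default value: each row takes the value from the next row, and the final row is null. A minimal std-only sketch of that behavior, where the lead helper is hypothetical and is not DataFusion's implementation:

```rust
/// Hypothetical helper modelling lead(column, offset) with no default value:
/// row i receives the value from row i + offset, or None past the end.
fn lead<T: Clone>(column: &[T], offset: usize) -> Vec<Option<T>> {
    (0..column.len())
        .map(|i| column.get(i + offset).cloned())
        .collect()
}
```

Applied to the four array_col values with an offset of 1, this yields the three following arrays and then None, matching the empty cell in the last row.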
Additional context
This is the root cause for apache/datafusion-python#647