feat: support pivot and unpivot #17946
base: main
Conversation
DataFrame API
Pull Request Overview
This PR adds PIVOT and UNPIVOT functionality to DataFusion, enabling users to transform rows into columns (PIVOT) and columns into rows (UNPIVOT) through both SQL syntax and DataFrame API methods.
Key Changes
- Implements SQL parsing and logical plan building for PIVOT and UNPIVOT operations
- Adds DataFrame methods `pivot()` and `unpivot()` for programmatic access (a usage sketch follows below)
- Includes comprehensive test coverage through SQL logic tests and DataFrame integration tests
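For orientation, here is a minimal usage sketch of the two new methods, assembled from the signatures and corrected examples quoted in the review comments below. The `pivot()`/`unpivot()` signatures come from this PR and may still change; the `sales` table, its columns, the literal pivot values, and the exact argument order are assumptions for illustration only.

```rust
use datafusion::common::Column;
use datafusion::functions_aggregate::expr_fn::sum;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    // Hypothetical input with columns: id, category, value, jan, feb, mar
    ctx.register_csv("sales", "sales.csv", CsvReadOptions::new())
        .await?;
    let df = ctx.table("sales").await?;

    // PIVOT: the literal values "A" and "B" of `category` become output columns,
    // each filled with sum(value) over the matching rows.
    let pivoted = df.clone().pivot(
        vec![sum(col("value"))],        // aggregate functions
        vec![Column::from("category")], // pivot column(s)
        vec![lit("A"), lit("B")],       // literal values that become columns
        None,                           // optional defaults for NULL cells
    )?;
    pivoted.show().await?;

    // UNPIVOT: fold the jan/feb/mar columns into (month, value) rows.
    let unpivoted = df.unpivot(
        vec!["value".to_string()], // name of the output value column
        "month".to_string(),       // name of the output variable column
        vec![
            (vec!["jan".to_string()], Some("jan".to_string())),
            (vec!["feb".to_string()], Some("feb".to_string())),
            (vec!["mar".to_string()], Some("mar".to_string())),
        ],
    )?;
    unpivoted.show().await?;
    Ok(())
}
```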
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| datafusion/sqllogictest/test_files/unpivot.slt | New SQL logic tests for UNPIVOT functionality with NULL handling variations |
| datafusion/sqllogictest/test_files/pivot.slt | New SQL logic tests for PIVOT functionality with single and multiple columns |
| datafusion/sql/src/relation/mod.rs | Implements SQL-to-logical plan conversion for PIVOT and UNPIVOT table factors |
| datafusion/sql/src/expr/identifier.rs | Changes visibility of identifier helper methods to pub(crate) for use in relation module |
| datafusion/expr/src/logical_plan/builder.rs | Adds pivot() and unpivot() builder methods implementing core transformation logic |
| datafusion/core/tests/dataframe/mod.rs | Comprehensive integration tests for DataFrame pivot/unpivot APIs |
| datafusion/core/src/dataframe/mod.rs | Adds public pivot() and unpivot() methods to DataFrame API |
```rust
    DataFusionError::Plan("named_struct function not found".to_string())
})?;
```
Copilot AI (Nov 9, 2025)
Inconsistent indentation in the error handling closure. The opening brace and closure body on lines 2465-2467 should be indented consistently with the similar error handlers below on lines 2473-2475 and 2481-2483.
| /// vec!["jan".to_string(), "feb".to_string(), "mar".to_string()], | ||
| /// "month".to_string(), | ||
| /// vec![(vec!["jan".to_string(), "feb".to_string(), "mar".to_string()], None)], |
Copilot AI (Nov 9, 2025)
The documentation example is incorrect. The first parameter, `value_column_names`, should be a single output value column name (e.g., `vec!["value".to_string()]`), not the list of columns being unpivoted. According to the implementation and the test cases, the columns being unpivoted belong only in the `unpivot_columns` parameter.

This should be:

```rust
vec!["value".to_string()], // Name for the value column in output
"month".to_string(),
vec![
    (vec!["jan".to_string()], Some("jan".to_string())),
    (vec!["feb".to_string()], Some("feb".to_string())),
    (vec!["mar".to_string()], Some("mar".to_string())),
],
```

Suggested change: replace the three quoted doc-comment lines with this corrected example (keeping the `///` prefix).
```rust
        ))
    })
    .collect::<Result<Vec<_>>>(),
_ => plan_err!("Pivot default value cannot be NULL"),
```
Copilot AI (Nov 9, 2025)
The error message is misleading. The code actually checks if the expression is NOT a List literal, but the error says "Pivot default value cannot be NULL". This should be more accurate, such as "Pivot default value must be a list literal" or "Expected list literal for pivot default values".
Suggested change:

```rust
_ => plan_err!("Pivot default value must be a list literal"),
```
```rust
};
let pivot_col_name = format!(
    "{}_{}",
    value_alias.replace("\"", "").replace("'", ""),
```
Copilot AI (Nov 9, 2025)
Using string replacements to sanitize column names is fragile and may not handle all edge cases. Consider using a more robust approach or documenting why only quotes need to be removed. For example, other special characters like backticks, brackets, or control characters might also appear in aliases.
Suggested change:

```rust
sanitize_column_name(value_alias),
```
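No `sanitize_column_name` helper exists in the PR as reviewed, so the suggestion assumes one would be added. A minimal sketch of what such a helper could look like (the name and exact behavior are assumptions, not DataFusion API):

```rust
/// Hypothetical helper in the spirit of the suggestion above: instead of stripping
/// only single and double quotes, map every character that is not alphanumeric or
/// an underscore to an underscore.
fn sanitize_column_name(name: &str) -> String {
    name.chars()
        .map(|c| if c.is_ascii_alphanumeric() || c == '_' { c } else { '_' })
        .collect()
}
```

This keeps generated pivot column names predictable regardless of how the value alias was quoted.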
```rust
match condition {
    None => {
        return plan_err!("Pivot value condition cannot be empty")
    }
    Some(cond) => cond,
}
```
Copilot AI (Nov 9, 2025)
This check is unreachable and unnecessary. Since we're in the value_column.len() > 1 branch (the _ case matches 2 or more), and we iterate through value_column.iter().enumerate(), the condition variable will always be Some after the loop completes. The first iteration sets it to Some, and subsequent iterations keep it as Some. This check can never fail and should be removed.
Suggested change: replace the whole `match` with:

```rust
condition.unwrap()
```
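To make the reasoning concrete without the full diff at hand, here is a hypothetical reconstruction of the loop shape being described; the function name, parameters, and expression construction are assumptions, and only the `Option<Expr>` accumulation pattern is taken from the comment:

```rust
use datafusion_common::Column;
use datafusion_expr::{col, Expr};

// Hypothetical sketch: this branch only runs when there are two or more value
// columns, so the loop body executes at least once, the first iteration sets
// `condition` to `Some`, and the post-loop `None` check can never fire.
fn build_match_condition(value_columns: &[Column], values: &[Expr]) -> Expr {
    let mut condition: Option<Expr> = None;
    for (i, value_column) in value_columns.iter().enumerate() {
        let eq = col(value_column.clone()).eq(values[i].clone());
        condition = Some(match condition {
            None => eq,
            Some(prev) => prev.and(eq),
        });
    }
    condition.expect("at least two value columns, so the loop ran")
}
```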
```rust
    named_struct_fn: &Arc<ScalarUDF>,
    make_array_fn: &Arc<ScalarUDF>,
    get_field_fn: &Arc<ScalarUDF>,
) -> Result<Self> {
```
Copilot AI (Nov 9, 2025)
Missing validation for empty input vectors. The function should validate that `value_column_names` and `unpivot_columns` are not empty, since an unpivot operation without them does not make semantic sense. Consider adding early validation like:

```rust
if value_column_names.is_empty() {
    return plan_err!("Unpivot requires at least one value column name");
}
if unpivot_columns.is_empty() {
    return plan_err!("Unpivot requires at least one column to unpivot");
}
```

Suggested change: insert these two checks immediately after `) -> Result<Self> {`.
```rust
        datafusion_common::DataFusionError::Plan(
            "get_field function not found".to_string(),
        )
    })?;
```
Copilot AI (Nov 9, 2025)
Inconsistent indentation in the error handling closure. The opening brace on line 370 should be indented to align with the previous error handlers on lines 352 and 361 for consistency.
```rust
/// let pivoted = df.pivot(
///     vec![sum(col("value"))],
///     vec![Column::from("category")],
///     vec![col("value")],
```
Copilot AI (Nov 9, 2025)
The documentation example appears to be incorrect. The `value_source` parameter should contain literal values that will become column names (e.g., `lit("A")`, `lit("B")`), not `col("value")`. Based on the test cases and implementation, this should be:

```rust
vec![lit("A"), lit("B")]
```

Suggested change: replace `/// vec![col("value")],` with `/// vec![lit("A"), lit("B")],`.
```rust
    value_column: Vec<Column>,
    value_source: Vec<Expr>,
    default_on_null: Option<Vec<Expr>>,
) -> Result<Self> {
```
Copilot AI (Nov 9, 2025)
Missing validation for empty input vectors. The function should validate that `aggregate_functions` and `value_source` are not empty, since a pivot operation without aggregate functions or values does not make semantic sense. Consider adding early validation like:

```rust
if aggregate_functions.is_empty() {
    return plan_err!("Pivot requires at least one aggregate function");
}
if value_source.is_empty() {
    return plan_err!("Pivot requires at least one value in value_source");
}
```

Suggested change: insert these two checks immediately after `) -> Result<Self> {`.
```rust
    struct_exprs.push(struct_expr);
}

let unpivot_array_column = "__unpivot_array";
```
Copilot AI (Nov 9, 2025)
The hardcoded column name __unpivot_array could potentially conflict with existing columns in the schema. If a user's table already has a column named __unpivot_array, this would cause a collision and unexpected behavior. Consider either:
- Checking if this name exists in the schema and generating a unique name if needed
- Using a UUID or other guaranteed-unique identifier
- Using a more obscure prefix that's less likely to conflict (though this doesn't eliminate the risk)
Suggested change:

```rust
// Generate a unique column name for the unpivot array to avoid collisions
let base_name = "__unpivot_array".to_string();
let mut unpivot_array_column = base_name.clone();
let mut i = 1;
let existing_names: HashSet<&str> =
    schema.iter().map(|(_, f)| f.name().as_str()).collect();
while existing_names.contains(unpivot_array_column.as_str()) {
    unpivot_array_column = format!("{}{}", base_name, i);
    i += 1;
}
```
Which issue does this PR close?
pivot & unpivot (melt) to DataFrame #12907

Rationale for this change

Currently, PIVOT and UNPIVOT are not supported. For reference, Spark's syntax:
https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-pivot.html
https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-unpivot.html
What changes are included in this PR?
Support for PIVOT and UNPIVOT, via both SQL syntax and the DataFrame API.
Are these changes tested?
Yes: unit tests (SQL logic tests and DataFrame integration tests).
Are there any user-facing changes?
No