Skip to content

Conversation

@chenkovsky
Copy link
Contributor

@chenkovsky chenkovsky commented Oct 6, 2025

Which issue does this PR close?

Rationale for this change

currently pivot and unpivot are not supported.

https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-pivot.html
https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-unpivot.html

What changes are included in this PR?

support pivot and unpivot.

Are these changes tested?

UT

Are there any user-facing changes?

No

@github-actions github-actions bot added sql SQL Planner logical-expr Logical plan and expressions core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Oct 6, 2025
@alamb alamb changed the title feat: support pivot and unpivot feat: support pivot and unpivot in DataFrame API Oct 7, 2025
@alamb alamb changed the title feat: support pivot and unpivot in DataFrame API feat: support pivot and unpivot Oct 7, 2025
@Weijun-H Weijun-H requested a review from Copilot November 9, 2025 16:20
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds PIVOT and UNPIVOT functionality to DataFusion, enabling users to transform rows into columns (PIVOT) and columns into rows (UNPIVOT) through both SQL syntax and DataFrame API methods.

Key Changes

  • Implements SQL parsing and logical plan building for PIVOT and UNPIVOT operations
  • Adds DataFrame methods pivot() and unpivot() for programmatic access
  • Includes comprehensive test coverage through SQL logic tests and DataFrame integration tests

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 10 comments.

Show a summary per file
File Description
datafusion/sqllogictest/test_files/unpivot.slt New SQL logic tests for UNPIVOT functionality with NULL handling variations
datafusion/sqllogictest/test_files/pivot.slt New SQL logic tests for PIVOT functionality with single and multiple columns
datafusion/sql/src/relation/mod.rs Implements SQL-to-logical plan conversion for PIVOT and UNPIVOT table factors
datafusion/sql/src/expr/identifier.rs Changes visibility of identifier helper methods to pub(crate) for use in relation module
datafusion/expr/src/logical_plan/builder.rs Adds pivot() and unpivot() builder methods implementing core transformation logic
datafusion/core/tests/dataframe/mod.rs Comprehensive integration tests for DataFrame pivot/unpivot APIs
datafusion/core/src/dataframe/mod.rs Adds public pivot() and unpivot() methods to DataFrame API

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +2466 to +2467
DataFusionError::Plan("named_struct function not found".to_string())
})?;
Copy link

Copilot AI Nov 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inconsistent indentation in the error handling closure. The opening brace and closure body on lines 2465-2467 should be indented consistently with the similar error handlers below on lines 2473-2475 and 2481-2483.

Suggested change
DataFusionError::Plan("named_struct function not found".to_string())
})?;
DataFusionError::Plan("named_struct function not found".to_string())
})?;

Copilot uses AI. Check for mistakes.
Comment on lines +2443 to +2445
/// vec!["jan".to_string(), "feb".to_string(), "mar".to_string()],
/// "month".to_string(),
/// vec![(vec!["jan".to_string(), "feb".to_string(), "mar".to_string()], None)],
Copy link

Copilot AI Nov 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation example is incorrect. The first parameter value_column_names should be a single value column name (e.g., vec!["value".to_string()]), not the list of columns being unpivoted. According to the implementation and test cases, the columns being unpivoted should only be in the unpivot_columns parameter.

This should be:

vec!["value".to_string()],  // Name for the value column in output
"month".to_string(),
vec![
    (vec!["jan".to_string()], Some("jan".to_string())),
    (vec!["feb".to_string()], Some("feb".to_string())),
    (vec!["mar".to_string()], Some("mar".to_string()))
],
Suggested change
/// vec!["jan".to_string(), "feb".to_string(), "mar".to_string()],
/// "month".to_string(),
/// vec![(vec!["jan".to_string(), "feb".to_string(), "mar".to_string()], None)],
/// vec!["value".to_string()], // Name for the value column in output
/// "month".to_string(),
/// vec![
/// (vec!["jan".to_string()], Some("jan".to_string())),
/// (vec!["feb".to_string()], Some("feb".to_string())),
/// (vec!["mar".to_string()], Some("mar".to_string()))
/// ],

Copilot uses AI. Check for mistakes.
))
})
.collect::<Result<Vec<_>>>(),
_ => plan_err!("Pivot default value cannot be NULL"),
Copy link

Copilot AI Nov 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message is misleading. The code actually checks if the expression is NOT a List literal, but the error says "Pivot default value cannot be NULL". This should be more accurate, such as "Pivot default value must be a list literal" or "Expected list literal for pivot default values".

Suggested change
_ => plan_err!("Pivot default value cannot be NULL"),
_ => plan_err!("Pivot default value must be a list literal"),

Copilot uses AI. Check for mistakes.
};
let pivot_col_name = format!(
"{}_{}",
value_alias.replace("\"", "").replace("'", ""),
Copy link

Copilot AI Nov 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using string replacements to sanitize column names is fragile and may not handle all edge cases. Consider using a more robust approach or documenting why only quotes need to be removed. For example, other special characters like backticks, brackets, or control characters might also appear in aliases.

Suggested change
value_alias.replace("\"", "").replace("'", ""),
sanitize_column_name(value_alias),

Copilot uses AI. Check for mistakes.
Comment on lines +1582 to +1587
match condition {
None => {
return plan_err!("Pivot value condition cannot be empty")
}
Some(cond) => cond,
}
Copy link

Copilot AI Nov 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check is unreachable and unnecessary. Since we're in the value_column.len() > 1 branch (the _ case matches 2 or more), and we iterate through value_column.iter().enumerate(), the condition variable will always be Some after the loop completes. The first iteration sets it to Some, and subsequent iterations keep it as Some. This check can never fail and should be removed.

Suggested change
match condition {
None => {
return plan_err!("Pivot value condition cannot be empty")
}
Some(cond) => cond,
}
condition.unwrap()

Copilot uses AI. Check for mistakes.
named_struct_fn: &Arc<ScalarUDF>,
make_array_fn: &Arc<ScalarUDF>,
get_field_fn: &Arc<ScalarUDF>,
) -> Result<Self> {
Copy link

Copilot AI Nov 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing validation for empty input vectors. The function should validate that value_column_names and unpivot_columns are not empty, as an unpivot operation without these doesn't make semantic sense. Consider adding early validation like:

if value_column_names.is_empty() {
    return plan_err!("Unpivot requires at least one value column name");
}
if unpivot_columns.is_empty() {
    return plan_err!("Unpivot requires at least one column to unpivot");
}
Suggested change
) -> Result<Self> {
) -> Result<Self> {
if value_column_names.is_empty() {
return plan_err!("Unpivot requires at least one value column name");
}
if unpivot_columns.is_empty() {
return plan_err!("Unpivot requires at least one column to unpivot");
}

Copilot uses AI. Check for mistakes.
Comment on lines +371 to +374
datafusion_common::DataFusionError::Plan(
"get_field function not found".to_string(),
)
})?;
Copy link

Copilot AI Nov 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inconsistent indentation in the error handling closure. The opening brace on line 370 should be indented to align with the previous error handlers on lines 352 and 361 for consistency.

Suggested change
datafusion_common::DataFusionError::Plan(
"get_field function not found".to_string(),
)
})?;
datafusion_common::DataFusionError::Plan(
"get_field function not found".to_string(),
)
})?;

Copilot uses AI. Check for mistakes.
/// let pivoted = df.pivot(
/// vec![sum(col("value"))],
/// vec![Column::from("category")],
/// vec![col("value")],
Copy link

Copilot AI Nov 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation example appears to be incorrect. The value_source parameter should contain literal values that will become column names (e.g., lit("A"), lit("B")), not col("value"). Based on the test cases and implementation, this should be:

vec![lit("A"), lit("B")]
Suggested change
/// vec![col("value")],
/// vec![lit("A"), lit("B")],

Copilot uses AI. Check for mistakes.
value_column: Vec<Column>,
value_source: Vec<Expr>,
default_on_null: Option<Vec<Expr>>,
) -> Result<Self> {
Copy link

Copilot AI Nov 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing validation for empty input vectors. The function should validate that aggregate_functions and value_source are not empty, as a pivot operation without aggregate functions or values doesn't make semantic sense. Consider adding early validation like:

if aggregate_functions.is_empty() {
    return plan_err!("Pivot requires at least one aggregate function");
}
if value_source.is_empty() {
    return plan_err!("Pivot requires at least one value in value_source");
}
Suggested change
) -> Result<Self> {
) -> Result<Self> {
if aggregate_functions.is_empty() {
return plan_err!("Pivot requires at least one aggregate function");
}
if value_source.is_empty() {
return plan_err!("Pivot requires at least one value in value_source");
}

Copilot uses AI. Check for mistakes.
struct_exprs.push(struct_expr);
}

let unpivot_array_column = "__unpivot_array";
Copy link

Copilot AI Nov 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The hardcoded column name __unpivot_array could potentially conflict with existing columns in the schema. If a user's table already has a column named __unpivot_array, this would cause a collision and unexpected behavior. Consider either:

  1. Checking if this name exists in the schema and generating a unique name if needed
  2. Using a UUID or other guaranteed-unique identifier
  3. Using a more obscure prefix that's less likely to conflict (though this doesn't eliminate the risk)
Suggested change
let unpivot_array_column = "__unpivot_array";
// Generate a unique column name for the unpivot array to avoid collisions
let mut base_name = "__unpivot_array".to_string();
let mut unpivot_array_column = base_name.clone();
let mut i = 1;
let existing_names: HashSet<&str> = schema.iter().map(|(_, f)| f.name().as_str()).collect();
while existing_names.contains(unpivot_array_column.as_str()) {
unpivot_array_column = format!("{}{}", base_name, i);
i += 1;
}

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate logical-expr Logical plan and expressions sql SQL Planner sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add pivot & unpivot (melt) to DataFrame

1 participant