Skip to content

Conversation

@jcsherin
Copy link
Collaborator

@jcsherin jcsherin commented Nov 24, 2025

Upgrade Steps

  • Created branch-51-upstream which is upstream DF 51 at commit fd35a09.
  • This PR branch for upgrading the fork is checked out from above branch.
❯ git reflog show jacob/branch-51-upgrade | tail -n 1
fd35a0943 jacob/branch-51-upgrade@{22}: branch: Created from HEAD

Verified DF patches (present in this PR)

Missing patches (not yet applied to this PR)

1. make DefaultSchemaAdapter public Commit 848fd57

Reason: Moved out of core to datafusion-datasource. Needs to be patched once more.

Status: ✅ Fixed.

2. Add partition filters' equivalence classes info to the execution plan if it's DataSourceExec Commit c7628fb.

  • The branch for adding partition filters' equivalence classes into execution plan of DataSourceExec is missing.
  • The helper add_partition_filter_equivalence_info method is missing.

Status: ⏳ To be verified later if this is ported upstream. If not, add it back to fork.

Merged upstream

TODO

  • Add self-review comments
  • Verify patches which are not yet upstream exists in this branch
  • Identify and doc patches which are missing in this branch
  • Resolve review comments (wip)

andygrove and others added 30 commits November 4, 2024 19:40
* Initial commit

* Fix formatting

* Add across partitions check

* Add new test case

Add a new test case

* Fix buggy test
…#13909) (apache#13934)

* Set utf8view as return type when input type is the same

* Verify that the returned type from call to scalar function matches the return type specified in the return_type function

* Match return type to utf8view

Co-authored-by: Tim Saucer <timsaucer@gmail.com>
* fix: fetch is missed in the EnfoceSorting

* fix conflict

* resolve comments from alamb

* update
…e#14415) (apache#14453)

* chore: Fixed CI

* chore

* chore: Fixed clippy

* chore

Co-authored-by: Alex Huang <huangweijun1001@gmail.com>
* Test for string / numeric coercion

* fix tests

* Update tests

* Add tests to stringview

* add numeric coercion
@jcsherin jcsherin self-assigned this Nov 24, 2025
Comment on lines 34 to 54
<<<<<<< HEAD
fn udafs(&self) -> HashSet<String>;

/// Returns names of all available window user defined functions.
fn udwfs(&self) -> HashSet<String>;
=======
fn udafs(&self) -> HashSet<String> {
// This default implementation is provided temporarily
// to maintain backward compatibility for the 50.1 release.
// It will be reverted to a required method in future versions.
HashSet::default()
}

/// Returns names of all available window user defined functions.
fn udwfs(&self) -> HashSet<String> {
// This default implementation is provided temporarily
// to maintain backward compatibility for the 50.1 release.
// It will be reverted to a required method in future versions.
HashSet::default()
}
>>>>>>> origin/branch-51
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 218 to 225
=======
if let Some(comparison) = scalar.partial_cmp(current_best) {
let is_better = if find_greater {
comparison == std::cmp::Ordering::Greater
} else {
comparison == std::cmp::Ordering::Less
};
>>>>>>> origin/branch-51
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

apache#16624 Upstream fix for using try_cmp instead of partial_cmp for scalar values.

let mut new_plan = AnalyzeExec::new(
self.verbose,
self.show_statistics,
self.metric_types.clone(),
Copy link
Collaborator Author

@jcsherin jcsherin Nov 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The AnalyzeExec::new method now takes metric_types: Vec<MetricType> as the third argument. https://docs.rs/datafusion/51.0.0/datafusion/physical_plan/analyze/struct.AnalyzeExec.html#method.new

) -> Result<Option<Arc<dyn ExecutionPlan>>> {
let mut new_plan =
ProjectionExec::try_new(self.expr.clone(), Arc::clone(self.input()))?;
ProjectionExec::try_new(self.expr().to_vec(), Arc::clone(self.input()))?;
Copy link
Collaborator Author

@jcsherin jcsherin Nov 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true
}

#[allow(deprecated)]
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deprecated since 44.0.0: Use UnionExec::try_new instead
https://docs.rs/datafusion/51.0.0/datafusion/physical_plan/union/struct.UnionExec.html#method.try_new

Maybe we can switch to try_new in a follow-on PR?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also okay to change it directly in the PR if not a big change

Comment on lines 631 to 647
=======
match reassign_predicate_columns(filter, &schema, true) {
Ok(filter) => {
match Self::add_filter_equivalence_info(
filter,
&mut eq_properties,
&schema,
) {
Ok(()) => {}
Err(e) => {
warn!("Failed to add filter equivalence info: {e}");
#[cfg(debug_assertions)]
panic!("Failed to add filter equivalence info: {e}");
}
}
}
>>>>>>> origin/branch-51
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Picking upstream apache#17703

Comment on lines 868 to 886
=======
macro_rules! ignore_dangling_col {
($col:expr) => {
if let Some(col) = $col.as_any().downcast_ref::<Column>() {
if schema.index_of(col.name()).is_err() {
continue;
}
}
};
}

let (equal_pairs, _) = collect_columns_from_predicate(&filter);
for (lhs, rhs) in equal_pairs {
// Ignore any binary expressions that reference non-existent columns in the current schema
// (e.g. due to unnecessary projections being removed)
ignore_dangling_col!(lhs);
ignore_dangling_col!(rhs);
eq_properties.add_equal_conditions(Arc::clone(lhs), Arc::clone(rhs))?
>>>>>>> origin/branch-51
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Picking upstream apache#17703

Comment on lines 2355 to 2359
<<<<<<< HEAD
.with_projection_indices(Some(vec![0, 1, 2]))
=======
.with_projection(Some(vec![0, 1, 2]))
>>>>>>> origin/branch-51
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The FileScanConfigBuilder::with_projection() method has been deprecated in favor of with_projection_indices()

https://datafusion.apache.org/library-user-guide/upgrading.html#filescanconfig-projection-renamed-to-filescanconfig-projection-exprs

Comment on lines 47 to 65
<<<<<<< HEAD
/// Number of row groups whose bloom filters were checked, tracked with matched/pruned counts
pub row_groups_pruned_bloom_filter: PruningMetrics,
/// Number of row groups whose statistics were checked, tracked with matched/pruned counts
pub row_groups_pruned_statistics: PruningMetrics,
=======
/// Number of row groups whose bloom filters were checked and matched (not pruned)
pub row_groups_matched_bloom_filter: Count,
/// Number of row groups pruned by bloom filters
pub row_groups_pruned_bloom_filter: Count,
/// Number of row groups pruned due to limit pruning.
pub limit_pruned_row_groups: Count,
/// Number of row groups whose statistics were checked and fully matched
pub row_groups_fully_matched_statistics: Count,
/// Number of row groups whose statistics were checked and matched (not pruned)
pub row_groups_matched_statistics: Count,
/// Number of row groups pruned by statistics
pub row_groups_pruned_statistics: Count,
>>>>>>> origin/branch-51
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

metric type changed to PruningMetrics.

Self {
files_ranges_pruned_statistics,
predicate_evaluation_errors,
row_groups_matched_bloom_filter,
Copy link
Collaborator Author

@jcsherin jcsherin Nov 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one row_groups_matched_bloom_filter was missing, so added it back.

Comment on lines 195 to 200
<<<<<<< HEAD
metrics.row_groups_pruned_statistics.add_matched(1);
=======
fully_contained_candidates_original_idx.push(*idx);
metrics.row_groups_matched_statistics.add(1);
>>>>>>> origin/branch-51
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keeps full_contained_candidates_original_idx + upstream API change to add_matched.

Comment on lines 56 to 60
<<<<<<< HEAD
=======
#[cfg(feature = "parquet_encryption")]
use datafusion_common::encryption::map_config_decryption_to_decryption;
>>>>>>> origin/branch-51
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

safe to remove?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @zhuqi-lucas I forgot the discusssion result about the decryption

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be safe, we finally remove encryption for default, and we don't use it.

Comment on lines 2010 to 2015
<<<<<<< HEAD
truncated_rows__ =
=======

truncated_rows__ =
>>>>>>> origin/branch-51
Copy link
Collaborator Author

@jcsherin jcsherin Nov 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed with proto-common regen.sh script.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The structs ListingOptions, ListingTable, and ListingTableConfig are now available within the datafusion-catalog-listing crate.

https://datafusion.apache.org/library-user-guide/upgrading.html#reorganization-of-listingtable-into-datafusion-catalog-listing-crate

Comment on lines 911 to 913
=======
- DataSourceExec: file_groups={2 groups: [[test1.parquet], [test2.parquet]]}, projection=[a, b, c], file_type=test, pushdown_supported=true, predicate=DynamicFilterPhysicalExpr [ true ]
>>>>>>> origin/branch-51
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rendering format changed to DynamicFilter. Keeping upstream.

Comment on lines +121 to +125
//!
//! // Workaround for `node_id` not being serializable:
//! let mut annotator = NodeIdAnnotator::new();
//! let physical_round_trip = annotate_node_id_for_execution_plan(&physical_round_trip, &mut annotator)?;
//!
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a new doc test added for round-trip execution plan. The node_id is not currently serialized so had to manually annotate it before assertion.

Is this ok?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's ok due to node_id is our internal implementation.

fn add_merge_on_top(input: DistributionContext) -> DistributionContext {
/// Updated node with an execution plan, where desired single
/// distribution is satisfied by adding [`SortPreservingMergeExec`].
fn add_merge_on_top(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was previously named add_spm_on_top per DF patch tracking doc.

ints Map("entries": Struct("key": Utf8, "value": Int64), unsorted) NO
strings Map("entries": Struct("key": Utf8, "value": Utf8), unsorted) NO
timestamp Utf8View NO
timestamp Utf8 NO
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are other tests too because of this setting: datafusion.execution.parquet.schema_force_view_types false

Comment on lines -433 to +441
BinaryView 616161 BinaryView 616161 BinaryView 616161
BinaryView 626262 BinaryView 626262 BinaryView 626262
BinaryView 636363 BinaryView 636363 BinaryView 636363
BinaryView 646464 BinaryView 646464 BinaryView 646464
BinaryView 656565 BinaryView 656565 BinaryView 656565
BinaryView 666666 BinaryView 666666 BinaryView 666666
BinaryView 676767 BinaryView 676767 BinaryView 676767
BinaryView 686868 BinaryView 686868 BinaryView 686868
BinaryView 696969 BinaryView 696969 BinaryView 696969
Binary 616161 LargeBinary 616161 BinaryView 616161
Binary 626262 LargeBinary 626262 BinaryView 626262
Binary 636363 LargeBinary 636363 BinaryView 636363
Binary 646464 LargeBinary 646464 BinaryView 646464
Binary 656565 LargeBinary 656565 BinaryView 656565
Binary 666666 LargeBinary 666666 BinaryView 666666
Binary 676767 LargeBinary 676767 BinaryView 676767
Binary 686868 LargeBinary 686868 BinaryView 686868
Binary 696969 LargeBinary 696969 BinaryView 696969
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

datafusion.execution.parquet.schema_force_view_types false

@jcsherin jcsherin marked this pull request as ready for review November 24, 2025 05:28
@jcsherin jcsherin added the do not merge Do not merge until this label is removed label Nov 24, 2025
@jcsherin jcsherin marked this pull request as draft November 24, 2025 05:28
Comment on lines +121 to +125
//!
//! // Workaround for `node_id` not being serializable:
//! let mut annotator = NodeIdAnnotator::new();
//! let physical_round_trip = annotate_node_id_for_execution_plan(&physical_round_trip, &mut annotator)?;
//!
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's ok due to node_id is our internal implementation.

Comment on lines 56 to 60
<<<<<<< HEAD
=======
#[cfg(feature = "parquet_encryption")]
use datafusion_common::encryption::map_config_decryption_to_decryption;
>>>>>>> origin/branch-51
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be safe, we finally remove encryption for default, and we don't use it.

@jcsherin jcsherin marked this pull request as ready for review November 24, 2025 11:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.