Upgrade fork to DF 51 #25

base: branch-51-upstream
Conversation
Commits:
- Initial commit; Fix formatting; Add across partitions check; Add new test case; Fix buggy test
- …#13909) (apache#13934): Set utf8view as return type when input type is the same; Verify that the returned type from a call to a scalar function matches the return type specified in the return_type function; Match return type to utf8view (Co-authored-by: Tim Saucer <timsaucer@gmail.com>)
- This reverts commit 5383d30.
- fix: fetch is missed in the EnforceSorting; fix conflict; resolve comments from alamb; update
- …it disabled by default
- …e#14415) (apache#14453): chore: Fixed CI; chore; chore: Fixed clippy; chore (Co-authored-by: Alex Huang <huangweijun1001@gmail.com>)
- Test for string / numeric coercion; fix tests; Update tests; Add tests to stringview; add numeric coercion
datafusion/expr/src/registry.rs (outdated)
```rust
<<<<<<< HEAD
    fn udafs(&self) -> HashSet<String>;

    /// Returns names of all available window user defined functions.
    fn udwfs(&self) -> HashSet<String>;
=======
    fn udafs(&self) -> HashSet<String> {
        // This default implementation is provided temporarily
        // to maintain backward compatibility for the 50.1 release.
        // It will be reverted to a required method in future versions.
        HashSet::default()
    }

    /// Returns names of all available window user defined functions.
    fn udwfs(&self) -> HashSet<String> {
        // This default implementation is provided temporarily
        // to maintain backward compatibility for the 50.1 release.
        // It will be reverted to a required method in future versions.
        HashSet::default()
    }
>>>>>>> origin/branch-51
```
These FunctionRegistry methods are now required.
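For downstream implementors this means the two methods must now be written out rather than inherited. A minimal sketch of the shape involved, using a hypothetical registry struct (not the fork's actual code):

```rust
use std::collections::HashSet;

// Hypothetical registry, for illustration only.
struct ForkRegistry {
    udaf_names: HashSet<String>,
    udwf_names: HashSet<String>,
}

impl ForkRegistry {
    /// Names of all available aggregate user defined functions.
    /// Required under DF 51: there is no longer a default body
    /// returning `HashSet::default()`.
    fn udafs(&self) -> HashSet<String> {
        self.udaf_names.clone()
    }

    /// Names of all available window user defined functions.
    fn udwfs(&self) -> HashSet<String> {
        self.udwf_names.clone()
    }
}
```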
```rust
=======
            if let Some(comparison) = scalar.partial_cmp(current_best) {
                let is_better = if find_greater {
                    comparison == std::cmp::Ordering::Greater
                } else {
                    comparison == std::cmp::Ordering::Less
                };
>>>>>>> origin/branch-51
```
apache#16624 is the upstream fix that uses try_cmp instead of partial_cmp for scalar values.
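A sketch of what the resolved branch looks like after that fix, assuming `ScalarValue::try_cmp` returns `Result<Ordering>` and turns incomparable values into an error instead of the `None` that `partial_cmp` silently yields:

```rust
use std::cmp::Ordering;
use datafusion_common::{Result, ScalarValue};

// Decide whether `scalar` beats `current_best`; incomparable values
// surface as an error rather than being skipped.
fn is_better(
    scalar: &ScalarValue,
    current_best: &ScalarValue,
    find_greater: bool,
) -> Result<bool> {
    let comparison = scalar.try_cmp(current_best)?;
    Ok(if find_greater {
        comparison == Ordering::Greater
    } else {
        comparison == Ordering::Less
    })
}
```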
```rust
        let mut new_plan = AnalyzeExec::new(
            self.verbose,
            self.show_statistics,
            self.metric_types.clone(),
```
The AnalyzeExec::new method now takes metric_types: Vec<MetricType> as the third argument. https://docs.rs/datafusion/51.0.0/datafusion/physical_plan/analyze/struct.AnalyzeExec.html#method.new
```diff
     ) -> Result<Option<Arc<dyn ExecutionPlan>>> {
         let mut new_plan =
-            ProjectionExec::try_new(self.expr.clone(), Arc::clone(self.input()))?;
+            ProjectionExec::try_new(self.expr().to_vec(), Arc::clone(self.input()))?;
```
The first argument has changed to self.expr().to_vec().
https://docs.rs/datafusion/51.0.0/datafusion/physical_plan/projection/struct.ProjectionExec.html#method.expr
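A sketch of the rebuilt call, assuming the DF 51 accessor where `expr()` returns a borrowed slice of (expression, alias) pairs, so `.to_vec()` produces the owned Vec that `try_new` takes:

```rust
use std::sync::Arc;
use datafusion_common::Result;
use datafusion_physical_plan::projection::ProjectionExec;

// Rebuild a projection node from an existing one; `expr()` borrows the
// (PhysicalExpr, alias) pairs, `.to_vec()` clones them into an owned Vec.
fn rebuild(projection: &ProjectionExec) -> Result<ProjectionExec> {
    ProjectionExec::try_new(
        projection.expr().to_vec(),
        Arc::clone(projection.input()),
    )
}
```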
```rust
            true
        }

        #[allow(deprecated)]
```
Deprecated since 44.0.0: Use UnionExec::try_new instead
https://docs.rs/datafusion/51.0.0/datafusion/physical_plan/union/struct.UnionExec.html#method.try_new
Maybe we can switch to try_new in a follow-on PR?
It's also okay to change it directly in this PR if it's not a big change.
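A sketch of that follow-on change, assuming DF 51's `UnionExec::try_new(inputs) -> Result<Arc<dyn ExecutionPlan>>`, which validates its inputs instead of panicking like the deprecated constructor:

```rust
use std::sync::Arc;
use datafusion_common::Result;
use datafusion_physical_plan::{union::UnionExec, ExecutionPlan};

fn build_union(inputs: Vec<Arc<dyn ExecutionPlan>>) -> Result<Arc<dyn ExecutionPlan>> {
    // Before (deprecated since 44.0.0, silenced with #[allow(deprecated)]):
    //     Arc::new(UnionExec::new(inputs))
    UnionExec::try_new(inputs)
}
```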
```rust
=======
        match reassign_predicate_columns(filter, &schema, true) {
            Ok(filter) => {
                match Self::add_filter_equivalence_info(
                    filter,
                    &mut eq_properties,
                    &schema,
                ) {
                    Ok(()) => {}
                    Err(e) => {
                        warn!("Failed to add filter equivalence info: {e}");
                        #[cfg(debug_assertions)]
                        panic!("Failed to add filter equivalence info: {e}");
                    }
                }
            }
>>>>>>> origin/branch-51
```
Picking upstream apache#17703
```rust
=======
        macro_rules! ignore_dangling_col {
            ($col:expr) => {
                if let Some(col) = $col.as_any().downcast_ref::<Column>() {
                    if schema.index_of(col.name()).is_err() {
                        continue;
                    }
                }
            };
        }

        let (equal_pairs, _) = collect_columns_from_predicate(&filter);
        for (lhs, rhs) in equal_pairs {
            // Ignore any binary expressions that reference non-existent columns in the current schema
            // (e.g. due to unnecessary projections being removed)
            ignore_dangling_col!(lhs);
            ignore_dangling_col!(rhs);
            eq_properties.add_equal_conditions(Arc::clone(lhs), Arc::clone(rhs))?
>>>>>>> origin/branch-51
```
Picking upstream apache#17703
```rust
<<<<<<< HEAD
            .with_projection_indices(Some(vec![0, 1, 2]))
=======
            .with_projection(Some(vec![0, 1, 2]))
>>>>>>> origin/branch-51
```
The FileScanConfigBuilder::with_projection() method has been deprecated in favor of with_projection_indices()
```rust
<<<<<<< HEAD
    /// Number of row groups whose bloom filters were checked, tracked with matched/pruned counts
    pub row_groups_pruned_bloom_filter: PruningMetrics,
    /// Number of row groups whose statistics were checked, tracked with matched/pruned counts
    pub row_groups_pruned_statistics: PruningMetrics,
=======
    /// Number of row groups whose bloom filters were checked and matched (not pruned)
    pub row_groups_matched_bloom_filter: Count,
    /// Number of row groups pruned by bloom filters
    pub row_groups_pruned_bloom_filter: Count,
    /// Number of row groups pruned due to limit pruning.
    pub limit_pruned_row_groups: Count,
    /// Number of row groups whose statistics were checked and fully matched
    pub row_groups_fully_matched_statistics: Count,
    /// Number of row groups whose statistics were checked and matched (not pruned)
    pub row_groups_matched_statistics: Count,
    /// Number of row groups pruned by statistics
    pub row_groups_pruned_statistics: Count,
>>>>>>> origin/branch-51
```
The metric type changed to PruningMetrics.
```rust
        Self {
            files_ranges_pruned_statistics,
            predicate_evaluation_errors,
            row_groups_matched_bloom_filter,
```
This row_groups_matched_bloom_filter field was missing, so it was added back.
```rust
<<<<<<< HEAD
        metrics.row_groups_pruned_statistics.add_matched(1);
=======
        fully_contained_candidates_original_idx.push(*idx);
        metrics.row_groups_matched_statistics.add(1);
>>>>>>> origin/branch-51
```
Keeps fully_contained_candidates_original_idx, plus the upstream API change to add_matched.
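Taken with the struct change above, the call sites move roughly as follows. This is only a sketch: `metrics` stands for the surrounding metrics struct, and `add_pruned` is an assumed counterpart of the `add_matched` seen in the hunk:

```rust
// Pre-51 fork: two independent Count metrics, bumped separately.
//     metrics.row_groups_matched_statistics.add(1);
//     metrics.row_groups_pruned_statistics.add(1);

// DF 51: one PruningMetrics per check, recording both outcomes.
metrics.row_groups_pruned_statistics.add_matched(1);
metrics.row_groups_pruned_statistics.add_pruned(1); // assumed counterpart of add_matched
```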
```rust
<<<<<<< HEAD
=======
#[cfg(feature = "parquet_encryption")]
use datafusion_common::encryption::map_config_decryption_to_decryption;
>>>>>>> origin/branch-51
```
safe to remove?
cc @zhuqi-lucas I forgot the discussion result about the decryption.
This should be safe; we ended up removing encryption from the defaults, and we don't use it.
```rust
<<<<<<< HEAD
                    truncated_rows__ =
=======

                    truncated_rows__ =
>>>>>>> origin/branch-51
```
Fixed with the proto-common regen.sh script.
The structs ListingOptions, ListingTable, and ListingTableConfig are now available within the datafusion-catalog-listing crate.
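A sketch of the corresponding import change, assuming the types are re-exported at the crate root as the comment above suggests:

```rust
// Pre-51: via the core crate.
// use datafusion::datasource::listing::{ListingOptions, ListingTable, ListingTableConfig};

// DF 51: from the dedicated crate.
use datafusion_catalog_listing::{ListingOptions, ListingTable, ListingTableConfig};
```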
```text
=======
- DataSourceExec: file_groups={2 groups: [[test1.parquet], [test2.parquet]]}, projection=[a, b, c], file_type=test, pushdown_supported=true, predicate=DynamicFilterPhysicalExpr [ true ]
>>>>>>> origin/branch-51
```
The rendering format changed to DynamicFilter; keeping the upstream version.
```rust
//!
//! // Workaround for `node_id` not being serializable:
//! let mut annotator = NodeIdAnnotator::new();
//! let physical_round_trip = annotate_node_id_for_execution_plan(&physical_round_trip, &mut annotator)?;
//!
```
This is a new doc test added for round-tripping an execution plan. The node_id is not currently serialized, so it had to be manually annotated before the assertion. Is this ok?
It's ok, since node_id is our internal implementation.
```diff
-fn add_merge_on_top(input: DistributionContext) -> DistributionContext {
+/// Updated node with an execution plan, where desired single
+/// distribution is satisfied by adding [`SortPreservingMergeExec`].
+fn add_merge_on_top(
```
This was previously named add_spm_on_top per the DF patch tracking doc.
```diff
 ints      Map("entries": Struct("key": Utf8, "value": Int64), unsorted) NO
 strings   Map("entries": Struct("key": Utf8, "value": Utf8), unsorted)  NO
-timestamp Utf8View NO
+timestamp Utf8 NO
```
Other tests changed too because of this setting: datafusion.execution.parquet.schema_force_view_types = false.
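For reference, a sketch of pinning the option off when building a session, which is what these slt expectations correspond to:

```rust
use datafusion::prelude::*;

fn main() {
    // With the flag off, Parquet string/binary columns load as Utf8/Binary
    // instead of the Utf8View/BinaryView defaults.
    let config = SessionConfig::new()
        .set_bool("datafusion.execution.parquet.schema_force_view_types", false);
    let _ctx = SessionContext::new_with_config(config);
}
```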
```text
BinaryView 616161 BinaryView 616161 BinaryView 616161
BinaryView 626262 BinaryView 626262 BinaryView 626262
BinaryView 636363 BinaryView 636363 BinaryView 636363
BinaryView 646464 BinaryView 646464 BinaryView 646464
BinaryView 656565 BinaryView 656565 BinaryView 656565
BinaryView 666666 BinaryView 666666 BinaryView 666666
BinaryView 676767 BinaryView 676767 BinaryView 676767
BinaryView 686868 BinaryView 686868 BinaryView 686868
BinaryView 696969 BinaryView 696969 BinaryView 696969
Binary 616161 LargeBinary 616161 BinaryView 616161
Binary 626262 LargeBinary 626262 BinaryView 626262
Binary 636363 LargeBinary 636363 BinaryView 636363
Binary 646464 LargeBinary 646464 BinaryView 646464
Binary 656565 LargeBinary 656565 BinaryView 656565
Binary 666666 LargeBinary 666666 BinaryView 666666
Binary 676767 LargeBinary 676767 BinaryView 676767
Binary 686868 LargeBinary 686868 BinaryView 686868
Binary 696969 LargeBinary 696969 BinaryView 696969
```
Same setting: datafusion.execution.parquet.schema_force_view_types = false.
Upgrade Steps
- The PR base is branch-51-upstream, which is upstream DF 51 at commit fd35a09.
- Merged branch-51 (not branch-50) into this PR branch. The last commit in the fork is ff301c8 - Add restriction for enabling limit pruning.

Verified DF patches (present in this PR)
Missing patches (not yet applied to this PR)
1. make DefaultSchemaAdapter public (commit 848fd57)
   Reason: Moved out of core to datafusion-datasource. Needs to be patched once more.
   Status: ✅ Fixed.
2. Add partition filters' equivalence classes info to the execution plan if it's DataSourceExec (commit c7628fb)
   DataSourceExec is missing. The add_partition_filter_equivalence_info method is missing.
   Status: ⏳ To be verified later whether this has been ported upstream. If not, add it back to the fork.
Merged upstream
TODO