[pull] main from apache:main#40
Conversation
Both the WHERE clause and the HAVING clause translate to a Filter plan node. They differ in how references and aggregates are handled: HAVING applies after aggregation and may reference aggregate expressions, so its filter is placed above the Aggregate plan node. Once the plan has been built, however, filters created from HAVING carry no special additional semantics. Remove the unnecessary field. For reference, the field was added along with its usage in the a50aeef commit, and the usage was later removed in the eb62e28 commit.
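The placement described above can be sketched with a minimal plan enum (illustrative types, not DataFusion's actual `LogicalPlan`): both clauses produce the same Filter node, and only its position relative to the Aggregate differs.

```rust
// Minimal sketch of WHERE vs HAVING placement. Both become a plain
// Filter node; HAVING's filter sits above the Aggregate so it can
// reference aggregate expressions. Illustrative, not DataFusion code.
#[derive(Debug)]
enum Plan {
    Scan(&'static str),
    Filter { input: Box<Plan> },
    Aggregate { input: Box<Plan> },
}

// WHERE: filter below the aggregation, applied to raw input rows.
fn plan_where(table: &'static str) -> Plan {
    Plan::Aggregate {
        input: Box::new(Plan::Filter {
            input: Box::new(Plan::Scan(table)),
        }),
    }
}

// HAVING: the same Filter node, placed above the aggregation.
fn plan_having(table: &'static str) -> Plan {
    Plan::Filter {
        input: Box::new(Plan::Aggregate {
            input: Box::new(Plan::Scan(table)),
        }),
    }
}

fn main() {
    // After planning, both are ordinary Filter nodes; only position differs.
    assert!(matches!(plan_where("t"), Plan::Aggregate { .. }));
    assert!(matches!(plan_having("t"), Plan::Filter { .. }));
    println!("ok");
}
```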
* Clarify docs and names in parquet predicate pushdown tests * Update datafusion/datasource/src/file_scan_config.rs Co-authored-by: Adrian Garcia Badaracco <1755071+adriangb@users.noreply.github.com> * clippy --------- Co-authored-by: Adrian Garcia Badaracco <1755071+adriangb@users.noreply.github.com>
* Fix name() for FilterPushdown physical optimizer rule Typo that wasn't caught during review... * fix
* fix according to review * fix to_string error * fix test by stripping backtrace
* Speedup tpch run with memtable * Clippy * Clippy
* Specialize unique join * handle splitting * rename a bit * fix * fix * fix * fix * Fix the test, add explanation * Simplify * Update datafusion/physical-plan/src/joins/join_hash_map.rs Co-authored-by: Christian <9384305+ctsk@users.noreply.github.com> * Update datafusion/physical-plan/src/joins/join_hash_map.rs Co-authored-by: Christian <9384305+ctsk@users.noreply.github.com> * Simplify * Simplify * Simplify --------- Co-authored-by: Christian <9384305+ctsk@users.noreply.github.com>
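The idea behind a unique-key specialization can be sketched as follows: when the build side is known to contain no duplicate keys, each probe returns at most one row index, so the chained "next row with the same key" list can be skipped entirely. This is an illustrative sketch, not DataFusion's `JoinHashMap`.

```rust
use std::collections::HashMap;

// Sketch of a join hash map specialized for unique build-side keys:
// key -> single row index, no chaining. Illustrative names only.
struct UniqueJoinMap {
    map: HashMap<u64, u32>,
}

impl UniqueJoinMap {
    // Returns None if a duplicate key is found, signalling that the
    // caller should fall back to the general chained map.
    fn build(keys: &[u64]) -> Option<Self> {
        let mut map = HashMap::with_capacity(keys.len());
        for (row, &k) in keys.iter().enumerate() {
            if map.insert(k, row as u32).is_some() {
                return None;
            }
        }
        Some(UniqueJoinMap { map })
    }

    // A probe yields at most one matching build-side row.
    fn probe(&self, key: u64) -> Option<u32> {
        self.map.get(&key).copied()
    }
}

fn main() {
    let m = UniqueJoinMap::build(&[10, 20, 30]).unwrap();
    assert_eq!(m.probe(20), Some(1));
    assert_eq!(m.probe(99), None);
    // Duplicate keys force the fallback path.
    assert!(UniqueJoinMap::build(&[1, 1]).is_none());
    println!("ok");
}
```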
* added test * added parameterTest * cargo fmt * Update sql_integration.rs * allow needless_lifetimes * remove needless lifetime * update some tests * move to params.rs
* feat: array_length for fixed size list * remove list view
Bumps [tokio](https://github.com/tokio-rs/tokio) from 1.45.0 to 1.45.1. - [Release notes](https://github.com/tokio-rs/tokio/releases) - [Commits](tokio-rs/tokio@tokio-1.45.0...tokio-1.45.1) --- updated-dependencies: - dependency-name: tokio dependency-version: 1.45.1 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Add failing test to demonstrate problem * Improve `unproject_sort_expr` to handle arbitrary expressions (#83) * Remove redundant return
Bumps [rustyline](https://github.com/kkawakam/rustyline) from 15.0.0 to 16.0.0. - [Release notes](https://github.com/kkawakam/rustyline/releases) - [Changelog](https://github.com/kkawakam/rustyline/blob/master/History.md) - [Commits](kkawakam/rustyline@v15.0.0...v16.0.0) --- updated-dependencies: - dependency-name: rustyline dependency-version: 16.0.0 dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Add sha2 Spark function
* feat: replace snapshot tests for enforce_sorting * feat: modify assert_optimized macro to test one snapshot with a combined physical plan * feat: update assert_optimized to support snapshot testing * Revert "feat: replace snapshot tests for enforce_sorting" This reverts commit 8c921fa. * feat: migrate core test to insta * fix format * fix format * fix typo * refactor: rename function * fix: remove trimming * refactor: replace get_plan_string with displayable in projection_pushdown --------- Co-authored-by: Cheng-Yuan-Lai <a186235@gmail.com> Co-authored-by: Ian Lai <Ian.Lai@senao.com>
Run `cargo test --test sqllogictests -- --complete` and commit the results.
* Add PhysicalExpr optimizer and cast unwrapping * address pr feedback * Update datafusion/pruning/src/pruning_predicate.rs * more lit(Xi64)
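The cast-unwrapping idea can be illustrated with a toy rewrite: a predicate like `cast(col AS Int64) = lit` can be turned into `col = narrowed_lit` whenever the literal round-trips through the column's narrower type, which lets pruning use the column's native statistics. Illustrative sketch, not the actual optimizer.

```rust
// Sketch of unwrapping a cast in an equality predicate: the cast on
// the column side can be removed only if the Int64 literal fits the
// column's native Int32 type. Hypothetical helper, not DataFusion API.
fn unwrap_cast_in_eq(lit: i64) -> Option<i32> {
    // Safe only when the literal round-trips through the narrower type;
    // otherwise the original cast must be kept.
    i32::try_from(lit).ok()
}

fn main() {
    assert_eq!(unwrap_cast_in_eq(5), Some(5));
    assert_eq!(unwrap_cast_in_eq(-7), Some(-7));
    // Out-of-range literal: the rewrite does not apply.
    assert_eq!(unwrap_cast_in_eq(i64::MAX), None);
    println!("ok");
}
```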
Bumps [tokio](https://github.com/tokio-rs/tokio) from 1.45.1 to 1.46.0. - [Release notes](https://github.com/tokio-rs/tokio/releases) - [Commits](tokio-rs/tokio@tokio-1.45.1...tokio-1.46.0) --- updated-dependencies: - dependency-name: tokio dependency-version: 1.46.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…pt limit pushdown (#16641) Co-authored-by: Adrian Garcia Badaracco <1755071+adriangb@users.noreply.github.com>
* Convert Option<Vec<sort expression>> to Vec<sort expression> * clippy * fix comment * fix doc * change back to Expr * remove redundant check
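The motivation for dropping the `Option` wrapper can be sketched briefly: an empty `Vec` already expresses "no sort expressions", so `Option<Vec<_>>` forces callers through a redundant second layer of matching. Illustrative types, not the actual DataFusion signatures.

```rust
// Stand-in for a sort expression type.
struct SortExpr;

// Before: two layers to inspect (None vs Some(empty) vs Some(non-empty)).
fn is_sorted_before(exprs: &Option<Vec<SortExpr>>) -> bool {
    matches!(exprs, Some(v) if !v.is_empty())
}

// After: a single emptiness check on a slice.
fn is_sorted_after(exprs: &[SortExpr]) -> bool {
    !exprs.is_empty()
}

fn main() {
    // None and Some(vec![]) meant the same thing; the slice form
    // collapses them into one representation.
    assert_eq!(is_sorted_before(&None), is_sorted_after(&[]));
    assert_eq!(is_sorted_before(&Some(vec![])), is_sorted_after(&[]));
    assert!(is_sorted_after(&[SortExpr]));
    println!("ok");
}
```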
* Improve error message when ScalarValue fails to cast array The `as_*_array` functions and the `downcast_value!` have the benefit of reporting the array type when there is a mismatch. This makes the error message more actionable. * test
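The benefit of reporting the mismatched type can be shown with a small sketch using `std::any::Any` in place of Arrow's downcast helpers (illustrative, not the `ScalarValue` code itself):

```rust
use std::any::Any;

// Sketch: naming the actual type on a failed downcast makes the error
// actionable. `type_name` stands in for the Arrow DataType a real
// helper would report. Hypothetical function, not DataFusion's API.
fn downcast_i64(value: &dyn Any, type_name: &str) -> Result<i64, String> {
    value
        .downcast_ref::<i64>()
        .copied()
        .ok_or_else(|| format!("expected Int64 array, found {type_name}"))
}

fn main() {
    assert_eq!(downcast_i64(&42i64, "Int64"), Ok(42));
    let err = downcast_i64(&"oops", "Utf8").unwrap_err();
    // The mismatched type is named in the message.
    assert!(err.contains("Utf8"));
    println!("ok");
}
```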
* Add an example of embedding indexes inside a parquet file * Add page image * Add prune file example * Fix clippy * polish code * Fmt * address comments * Add debug * Add new example, but it will fail with page index * add debug * add debug * polish * debug * Using low level API to support * polish * fix * merge * fix * complete solution * polish comments * adjust image * add comments part 1 * pin to new arrow-rs * pin to new arrow-rs * add comments part 2 * merge upstream * merge upstream * polish code * Rename example and add it to the list * Work on comments * More documentation * Documentation obsession, encapsulate example * Update datafusion-examples/examples/parquet_embedded_index.rs Co-authored-by: Sherin Jacob <jacob@protoship.io> --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> Co-authored-by: Sherin Jacob <jacob@protoship.io>
* Implementation for regex_instr * linting and typo addressed in bench * prettier formatting * scalar_functions_formatting * linting format macros * formatting * address comments to PR * formatting * clippy * fmt * address docs typo * remove unnecessary struct and comment * delete redundant lines add tests for subexp correct function signature for benches * refactor get_index * comments addressed * update doc * clippy upgrade --------- Co-authored-by: Nirnay Roy <nirnayroy1012@gmail.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> Co-authored-by: Dmitrii Blaginin <dmitrii@blaginin.me>
…nts (#16672) - Refactored the `DataFusionError` enum to use `Box<T>` for: - `ArrowError` - `ParquetError` - `AvroError` - `object_store::Error` - `ParserError` - `SchemaError` - `JoinError` - Updated all relevant match arms and constructors to handle boxed errors. - Refactored error-related macros (`arrow_datafusion_err!`, `sql_datafusion_err!`, etc.) to use `Box<T>`. - Adjusted test cases and error assertions for boxed variants. - Documentation update to the upgrade guide to explain the required changes and rationale.
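The rationale for boxing can be sketched in a few lines: a Rust enum is as large as its largest variant, so moving a bulky payload behind a `Box` shrinks every `Result<T, Error>` that carries the enum. The variants below are stand-ins, not the real `DataFusionError`.

```rust
// Sketch: boxing a large variant shrinks the whole enum. The [u64; 16]
// payload stands in for a bulky error like SchemaError; names are
// illustrative, not DataFusion's actual types.
#[allow(dead_code)]
enum FatError {
    Io(std::io::Error),
    Schema([u64; 16]), // 128 bytes inline, paid by every variant
}

#[allow(dead_code)]
enum SlimError {
    Io(std::io::Error),
    Schema(Box<[u64; 16]>), // payload moved behind a pointer
}

fn main() {
    use std::mem::size_of;
    // The boxed form is strictly smaller, so Results carrying it are
    // cheaper to move and return.
    assert!(size_of::<SlimError>() < size_of::<FatError>());
    println!("fat: {} bytes, slim: {} bytes",
             size_of::<FatError>(), size_of::<SlimError>());
}
```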
…on and Mapping (#16583) - Introduced a new `schema_adapter_factory` field in `ListingTableConfig` and `ListingTable` - Added `with_schema_adapter_factory` and `schema_adapter_factory()` methods to both structs - Modified execution planning logic to apply schema adapters during scanning - Updated statistics collection to use mapped schemas - Implemented detailed documentation and example usage in doc comments - Added new unit and integration tests validating schema adapter behavior and error cases
* Reuse Rows in RowCursorStream * WIP * Fmt * Add comment, make it backwards compatible * Add comment, make it backwards compatible * Add comment, make it backwards compatible * Clippy * Clippy * Return error on non-unique reference * Comment * Update datafusion/physical-plan/src/sorts/stream.rs Co-authored-by: Oleks V <comphead@users.noreply.github.com> * Fix * Extract logic * Doc fix --------- Co-authored-by: Oleks V <comphead@users.noreply.github.com>
#16630) * Perf: fast CursorValues compare for StringViewArray using inline_key_fast * fix * polish * polish * add test --------- Co-authored-by: Daniël Heres <danielheres@gmail.com>
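The "inline key" trick can be illustrated independently of arrow-rs: pack a short prefix of the string plus its length into a single `u128`, so most comparisons become one integer compare instead of a byte-by-byte memcmp. This is a simplified sketch of the idea, not the actual `inline_key_fast` implementation, and it assumes strings without interior NUL bytes.

```rust
// Sketch of a fast inline sort key: the first 12 bytes of the string,
// big-endian so integer order matches lexicographic order, with the
// length in the low 32 bits to break ties between shared prefixes.
fn inline_key(s: &[u8]) -> u128 {
    let mut buf = [0u8; 12];
    let n = s.len().min(12);
    buf[..n].copy_from_slice(&s[..n]);
    let mut key = 0u128;
    for b in buf {
        key = (key << 8) | b as u128; // prefix in the high bits
    }
    (key << 32) | s.len() as u128 // length in the low bits
}

fn main() {
    assert!(inline_key(b"apple") < inline_key(b"banana"));
    // Zero-padding makes a proper prefix sort before its extension.
    assert!(inline_key(b"app") < inline_key(b"apple"));
    assert_eq!(inline_key(b"kiwi"), inline_key(b"kiwi"));
    println!("ok");
}
```

For strings longer than the inlined prefix that share all 12 bytes, a real implementation must still fall back to comparing the remaining bytes; the key only short-circuits the common case.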
One step towards #16652. Co-authored-by: Oleks V <comphead@users.noreply.github.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Daniël Heres <danielheres@gmail.com>
* Refactor StreamJoinMetrics to reuse BaselineMetrics Signed-off-by: Alan Tang <jmtangcs@gmail.com> * use the record_poll method to update output rows Signed-off-by: Alan Tang <jmtangcs@gmail.com> --------- Signed-off-by: Alan Tang <jmtangcs@gmail.com>
* Remove unused AggregateUDF struct * Fix docs --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* chore: refactor `BuildProbeJoinMetrics` to use `BaselineMetrics` Closes #16495 Here's an example of an `explain analyze` of a hash join showing these metrics: ``` [(WatchID@0, WatchID@0)], metrics=[output_rows=100, elapsed_compute=2.313624ms, build_input_batches=1, build_input_rows=100, input_batches=1, input_rows=100, output_batches=1, build_mem_used=3688, build_time=865.832µs, join_time=1.369875ms] ``` Notice `output_rows=100, elapsed_compute=2.313624ms` in the above. * test: add checks for join metrics in tests * fix: add record_poll to ExhaustedProbeSide for nested_loop_join This was needed because ExhaustedProbeSide state can also return output rows - in certain types of joins. Without this, the output_rows metric for nested loop join was wrong!
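The shape of this refactor can be sketched simply: the join-specific metrics struct embeds a shared baseline struct, and a `record_poll`-style helper counts output rows as batches are produced, so any state that emits rows (including an exhausted-probe-side state) must route its output through it. Illustrative names, not DataFusion's actual metrics types.

```rust
// Sketch of reusing a shared baseline metrics struct inside a
// join-specific one. record_poll counts rows whenever a poll yields a
// batch; a None (stream end) leaves the counter untouched.
struct BaselineMetrics {
    output_rows: usize,
}

impl BaselineMetrics {
    fn record_poll(&mut self, batch_rows: Option<usize>) -> Option<usize> {
        if let Some(n) = batch_rows {
            self.output_rows += n;
        }
        batch_rows
    }
}

struct JoinMetrics {
    build_input_rows: usize,   // join-specific counter stays local
    baseline: BaselineMetrics, // shared counters live in the baseline
}

fn main() {
    let mut m = JoinMetrics {
        build_input_rows: 100,
        baseline: BaselineMetrics { output_rows: 0 },
    };
    // Every output path must go through record_poll, or output_rows
    // undercounts - the bug described for the exhausted-probe state.
    m.baseline.record_poll(Some(60));
    m.baseline.record_poll(Some(40));
    m.baseline.record_poll(None);
    assert_eq!(m.baseline.output_rows, 100);
    assert_eq!(m.build_input_rows, 100);
    println!("ok");
}
```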
* Use compression type in file suffices - Add FileFormat::compression_type method - Specify meaningful values for CSV only - Use compression type as a part of extension for files * Add CSV tests * Add glob dep, use env logging * Use a glob pattern with compression suffix for TableProviderFactory * Conform to clippy standards --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
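The extension scheme can be sketched in a few lines: the compression type is appended to the format's extension, so written files get names that glob patterns like `*.csv.gz` can match. Hypothetical helper, not the actual `FileFormat::compression_type` signature.

```rust
// Sketch: derive a file extension from the format plus an optional
// compression type. Illustrative function, not DataFusion's API.
fn file_extension(format: &str, compression: Option<&str>) -> String {
    match compression {
        // e.g. "csv" + "gz" -> "csv.gz"
        Some(c) => format!("{format}.{c}"),
        None => format.to_string(),
    }
}

fn main() {
    assert_eq!(file_extension("csv", Some("gz")), "csv.gz");
    assert_eq!(file_extension("csv", None), "csv");
    println!("ok");
}
```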
* Refactor SortMergeJoinMetrics to reuse BaselineMetrics Signed-off-by: Alan Tang <jmtangcs@gmail.com> * use record_poll method to update output_rows Signed-off-by: Alan Tang <jmtangcs@gmail.com> * chore: Replace replace_poll with replace_output Signed-off-by: Alan Tang <jmtangcs@gmail.com> --------- Signed-off-by: Alan Tang <jmtangcs@gmail.com>
* Add support for Arrow Dictionary type in Substrait This commit adds support for the Arrow Dictionary type in Substrait plans. Resolves #16273 * Add more specific type variation consts
* fix sqllogictest condition mismatch * Update test file condition * revert changes in sqllogictests --------- Co-authored-by: Leon Lin <lliangyu@amazon.com>
…ring physical planning (#16454) * Fix duplicates on Join creation during physical planning * Add Substrait reproducer * Better error message & more doc * Handle case for right/left/full joins as well
--- updated-dependencies: - dependency-name: tokio dependency-version: 1.46.1 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
See Commits and Changes for more details.
Created by
pull[bot] (v2.0.0-alpha.1)
Summary by Sourcery
Enhance DataFusion's support for ordered-set aggregate functions by introducing a WITHIN GROUP clause for functions like approx_percentile_cont.