
[pull] main from apache:main#40

Merged
pull[bot] merged 424 commits into Stars1233:main from apache:main
Jul 7, 2025

@pull pull bot commented Apr 23, 2025

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.1)


Summary by Sourcery

Enhance DataFusion's support for ordered-set aggregate functions by introducing a WITHIN GROUP clause for functions like approx_percentile_cont

New Features:

  • Add support for WITHIN GROUP clause in aggregate functions
  • Implement ordered-set aggregate function semantics

Enhancements:

  • Modify approx_percentile_cont to support sorting and percentile calculation
  • Update function argument parsing to handle WITHIN GROUP clause
  • Improve aggregate function documentation to reflect new syntax

Documentation:

  • Update documentation for approx_percentile_cont and approx_median to show new WITHIN GROUP syntax
  • Add explanations for ordered-set aggregate function behavior
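
As a rough illustration of the ordered-set semantics summarized above, the following Rust sketch computes a percentile over values that are first sorted by the WITHIN GROUP ordering. It models a call shaped like `approx_percentile_cont(0.5) WITHIN GROUP (ORDER BY col)`, which is an assumed form of the new syntax. `percentile_cont_within_group` is a hypothetical helper, not a DataFusion API, and it uses exact linear interpolation rather than the approximation the real function performs.

```rust
/// Hypothetical helper (not part of DataFusion): exact continuous percentile
/// over `values`, ordered ascending or descending as a WITHIN GROUP
/// (ORDER BY ...) clause would request.
pub fn percentile_cont_within_group(
    values: &[f64],
    percentile: f64,
    descending: bool,
) -> Option<f64> {
    // Reject empty input and percentiles outside [0, 1] (NaN included).
    if values.is_empty() || !(0.0..=1.0).contains(&percentile) {
        return None;
    }

    // Step 1: apply the ordering named in WITHIN GROUP (ORDER BY ...).
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    if descending {
        sorted.reverse();
    }

    // Step 2: locate the fractional position of the percentile.
    let pos = percentile * (sorted.len() - 1) as f64;
    let lower = pos.floor() as usize;
    let upper = pos.ceil() as usize;
    let frac = pos - lower as f64;

    // Step 3: interpolate linearly between the two neighbouring values.
    Some(sorted[lower] + (sorted[upper] - sorted[lower]) * frac)
}
```

The ordering matters: with this sketch, the 0.25 percentile over a descending order equals the 0.75 percentile over an ascending one, which is why the sort direction is part of the call rather than an afterthought.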


sourcery-ai bot commented Apr 23, 2025

Reviewer's Guide by Sourcery

This pull request refactors the approx_percentile_cont and approx_percentile_cont_with_weight aggregate functions to use the SQL standard's WITHIN GROUP (ORDER BY) clause. It also adds three validation checks: when WITHIN GROUP is used, ORDER BY is permitted only inside the WITHIN GROUP clause; OVER and WITHIN GROUP cannot be used together; and WITHIN GROUP is required when calling an ordered-set aggregate function.

No diagrams generated as the changes look simple and do not need a visual representation.

File-Level Changes

Change: Refactors approx_percentile_cont and approx_percentile_cont_with_weight to align with the SQL standard's WITHIN GROUP (ORDER BY) clause.
Details:
  • Updates the function signatures to accept ordering expressions.
  • Modifies the function calls to use the WITHIN GROUP (ORDER BY) syntax.
  • Adjusts the internal logic to handle the ordering within the aggregation.
  • Updates documentation and examples to reflect the new syntax.
Files:
  • datafusion/core/tests/dataframe/dataframe_functions.rs
  • datafusion/sql/src/expr/function.rs
  • datafusion/functions-aggregate/src/approx_percentile_cont.rs
  • datafusion/expr/src/udaf.rs
  • docs/source/user-guide/sql/aggregate_functions.md
  • datafusion/functions-aggregate/src/approx_percentile_cont_with_weight.rs
  • datafusion/sql/src/unparser/expr.rs
  • datafusion/core/benches/aggregate_query_sql.rs
  • datafusion/proto/tests/cases/roundtrip_logical_plan.rs
  • datafusion/functions-aggregate/src/approx_median.rs
  • datafusion/proto/tests/cases/roundtrip_physical_plan.rs
  • datafusion/sqllogictest/test_files/aggregate.slt

Change: Implements is_ordered_set_aggregate to indicate whether a function is an ordered-set aggregate function.
Details:
  • Adds a new trait method is_ordered_set_aggregate to the AggregateUDFImpl trait.
  • Implements the new trait method for ApproxPercentileCont and ApproxPercentileContWithWeight to return true.
  • Updates the schema name generation to include the WITHIN GROUP clause for ordered-set aggregate functions.
Files:
  • datafusion/sql/src/expr/function.rs
  • datafusion/functions-aggregate/src/approx_percentile_cont.rs
  • datafusion/expr/src/udaf.rs
  • datafusion/functions-aggregate/src/approx_percentile_cont_with_weight.rs

Change: Adds validation to ensure that, when WITHIN GROUP is used, an ORDER BY clause is permitted only inside the WITHIN GROUP clause.
Details:
  • Adds a check that rejects an ORDER BY clause outside the WITHIN GROUP clause when WITHIN GROUP is used.
Files:
  • datafusion/sql/src/expr/function.rs

Change: Adds validation to ensure that the OVER and WITHIN GROUP clauses cannot be used together.
Details:
  • Adds a check that rejects calls combining OVER with WITHIN GROUP.
Files:
  • datafusion/sql/src/expr/function.rs

Change: Adds validation to ensure that the WITHIN GROUP clause is required when calling an ordered-set aggregate function.
Details:
  • Adds a check that rejects ordered-set aggregate calls lacking a WITHIN GROUP clause.
Files:
  • datafusion/sql/src/expr/function.rs
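
The three validation rules listed above can be sketched in Rust roughly as follows. This is not the code from datafusion/sql/src/expr/function.rs: the `AggregateCall` struct, its fields, the `validate_within_group` function, and the error strings are all hypothetical stand-ins chosen for illustration.

```rust
/// Hypothetical summary of a parsed aggregate call; the real planner works on
/// the SQL AST, not on flags like these.
#[derive(Debug, Default, Clone, Copy)]
pub struct AggregateCall {
    /// What a method like is_ordered_set_aggregate would report.
    pub is_ordered_set_aggregate: bool,
    /// The call has a WITHIN GROUP (ORDER BY ...) clause.
    pub has_within_group: bool,
    /// The call has an ORDER BY inside its argument list.
    pub has_order_by_in_args: bool,
    /// The call has an OVER (...) window clause.
    pub has_over: bool,
}

/// Applies the three checks in the order they are listed in the review guide.
pub fn validate_within_group(call: &AggregateCall) -> Result<(), String> {
    // Rule 1: with WITHIN GROUP present, ORDER BY may only appear inside it.
    if call.has_within_group && call.has_order_by_in_args {
        return Err("ORDER BY is only permitted in the WITHIN GROUP clause".to_string());
    }
    // Rule 2: OVER and WITHIN GROUP are mutually exclusive.
    if call.has_within_group && call.has_over {
        return Err("OVER and WITHIN GROUP cannot be used together".to_string());
    }
    // Rule 3: ordered-set aggregates must carry a WITHIN GROUP clause.
    if call.is_ordered_set_aggregate && !call.has_within_group {
        return Err("WITHIN GROUP is required for ordered-set aggregates".to_string());
    }
    Ok(())
}
```

Keeping each rule as a separate early return makes it easy to report a specific message per violation, which is the behaviour the guide's wording suggests.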

@pull pull bot added the ⤵️ pull label Apr 23, 2025

coderabbitai bot commented Apr 23, 2025

Important

Review skipped

Bot user detected.

To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


alamb and others added 15 commits May 24, 2025 06:32
Both WHERE clause and HAVING clause translate to a Filter plan node.
They differ in how the references and aggregates are handled.
HAVING goes after aggregation and may reference aggregate expressions
and therefore HAVING's filter will be placed after Aggregation plan
node.

Once a plan has been built, however, there is no special additional
semantics to filters created from HAVING. Remove the unnecessary field.

For reference, the field was added along with usage in
a50aeef commit and the usage was later
removed in eb62e28 commit.
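
The point of the commit message above can be sketched with a toy plan tree in Rust. The `Plan` enum and `build_plan` function are hypothetical and much simpler than DataFusion's logical plan types; the sketch only shows that WHERE and HAVING both become the same kind of filter node and differ solely in where they sit relative to the aggregation.

```rust
/// Hypothetical, simplified plan tree (not DataFusion's LogicalPlan).
#[derive(Debug, Clone, PartialEq)]
pub enum Plan {
    Scan { table: String },
    /// One filter node type serves both WHERE and HAVING; no flag is needed.
    Filter { predicate: String, input: Box<Plan> },
    Aggregate { group_by: Vec<String>, aggregates: Vec<String>, input: Box<Plan> },
}

/// Builds Scan -> [WHERE filter] -> Aggregate -> [HAVING filter].
pub fn build_plan(
    table: &str,
    where_pred: Option<&str>,
    group_by: &[&str],
    aggregates: &[&str],
    having: Option<&str>,
) -> Plan {
    let mut plan = Plan::Scan { table: table.to_string() };

    // WHERE filters rows before aggregation, so it sits below the Aggregate.
    if let Some(pred) = where_pred {
        plan = Plan::Filter { predicate: pred.to_string(), input: Box::new(plan) };
    }

    plan = Plan::Aggregate {
        group_by: group_by.iter().map(|s| s.to_string()).collect(),
        aggregates: aggregates.iter().map(|s| s.to_string()).collect(),
        input: Box::new(plan),
    };

    // HAVING may reference aggregates, so it sits above the Aggregate.
    if let Some(pred) = having {
        plan = Plan::Filter { predicate: pred.to_string(), input: Box::new(plan) };
    }

    plan
}
```

Once the tree is built, position alone distinguishes the two filters, which is consistent with the message's claim that a dedicated HAVING field carries no extra semantics.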
* Clarify docs and names in parquet predicate pushdown tests

* Update datafusion/datasource/src/file_scan_config.rs

Co-authored-by: Adrian Garcia Badaracco <1755071+adriangb@users.noreply.github.com>

* clippy

---------

Co-authored-by: Adrian Garcia Badaracco <1755071+adriangb@users.noreply.github.com>
* Fix name() for FilterPushdown physical optimizer rule

Typo that wasn't caught during review...

* fix
fix according to review

fix to_string error

fix test by stripping backtrace
)

Added `tables: HashMap<String, Arc<dyn TableSource>>` and `MyContextProvider::with_schema` method for dynamically defining tables for optimizer integration tests.
* Speedup tpch run with memtable

* Clippy

* Clippy
* Specialize unique join

* handle splitting

* rename a bit

* fix

* fix

* fix

* fix

* Fix the test, add explanation

* Simplify

* Update datafusion/physical-plan/src/joins/join_hash_map.rs

Co-authored-by: Christian <9384305+ctsk@users.noreply.github.com>

* Update datafusion/physical-plan/src/joins/join_hash_map.rs

Co-authored-by: Christian <9384305+ctsk@users.noreply.github.com>

* Simplify

* Simplify

* Simplify

---------

Co-authored-by: Christian <9384305+ctsk@users.noreply.github.com>
* added test

* added parameterTest

* cargo fmt

* Update sql_integration.rs

* allow needless_lifetimes

* remove needless lifetime

* update some tests

* move to params.rs
* feat: array_length for fixed size list

* remove list view
Bumps [tokio](https://github.com/tokio-rs/tokio) from 1.45.0 to 1.45.1.
- [Release notes](https://github.com/tokio-rs/tokio/releases)
- [Commits](tokio-rs/tokio@tokio-1.45.0...tokio-1.45.1)

---
updated-dependencies:
- dependency-name: tokio
  dependency-version: 1.45.1
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Add failing test to demonstrate problem

* Improve `unproject_sort_expr` to handle arbitrary expressions (#83)

* Remove redundant return
Bumps [rustyline](https://github.com/kkawakam/rustyline) from 15.0.0 to 16.0.0.
- [Release notes](https://github.com/kkawakam/rustyline/releases)
- [Changelog](https://github.com/kkawakam/rustyline/blob/master/History.md)
- [Commits](kkawakam/rustyline@v15.0.0...v16.0.0)

---
updated-dependencies:
- dependency-name: rustyline
  dependency-version: 16.0.0
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Chen-Yuan-Lai and others added 29 commits July 2, 2025 12:01
* feat: replace snapshot tests for enforce_sorting

* feat: modify assert_optimized macro to test one snapshot with a combined physical plan

* feat: update assert_optimized to support snapshot testing

* Revert "feat: replace snapshot tests for enforce_sorting"

This reverts commit 8c921fa.

* feat: migrate core test to insta

* fix format

* fix format

* fix typo

* refactor: rename function

* fix: remove trimming

* refactor: replace get_plan_string with displayable in projection_pushdown

---------

Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com>
Co-authored-by: Ian Lai <Ian.Lai@senao.com>
Run `cargo test --test sqllogictests -- --complete` and commit the
results.
* Add PhysicalExpr optimizer and cast unwrapping

* address pr feedback

* Update datafusion/pruning/src/pruning_predicate.rs

* more lit(Xi64)
Bumps [tokio](https://github.com/tokio-rs/tokio) from 1.45.1 to 1.46.0.
- [Release notes](https://github.com/tokio-rs/tokio/releases)
- [Commits](tokio-rs/tokio@tokio-1.45.1...tokio-1.46.0)

---
updated-dependencies:
- dependency-name: tokio
  dependency-version: 1.46.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…pt limit pushdown (#16641)

Co-authored-by: Adrian Garcia Badaracco <1755071+adriangb@users.noreply.github.com>
* Convert Option<Vec<sort expression>> to Vec<sort expression>

* clippy

* fix comment

* fix doc

* change back to Expr

* remove redundant check
* Improve error message when ScalarValue fails to cast array

The `as_*_array` functions and the `downcast_value!` have the benefit of
reporting the array type when there is a mismatch. This makes the error
message more actionable.

* test
* Add an example of embedding indexes inside a parquet file

* Add page image

* Add prune file example

* Fix clippy

* polish code

* Fmt

* address comments

* Add debug

* Add new example, but it will fail with page index

* add debug

* add debug

* polish

* debug

* Using low level API to support

* polish

* fix

* merge

* fix

* complete solution

* polish comments

* adjust image

* add comments part 1

* pin to new arrow-rs

* pin to new arrow-rs

* add comments part 2

* merge upstream

* merge upstream

* polish code

* Rename example and add it to the list

* Work on comments

* More documentation

* Documentation obsession, encapsulate example

* Update datafusion-examples/examples/parquet_embedded_index.rs

Co-authored-by: Sherin Jacob <jacob@protoship.io>

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Sherin Jacob <jacob@protoship.io>
* Implementation for regex_instr

* linting and typo addressed in bench

* prettier formatting

* scalar_functions_formatting

* linting format macros

* formatting

* address comments to PR

* formatting

* clippy

* fmt

* address docs typo

* remove unnecessary struct and comment

* delete redundant lines
add tests for subexp
correct function signature for benches

* refactor get_index

* comments addressed

* update doc

* clippy upgrade

---------

Co-authored-by: Nirnay Roy <nirnayroy1012@gmail.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Dmitrii Blaginin <dmitrii@blaginin.me>
…nts (#16672)

- Refactored the `DataFusionError` enum to use `Box<T>` for:
  - `ArrowError`
  - `ParquetError`
  - `AvroError`
  - `object_store::Error`
  - `ParserError`
  - `SchemaError`
  - `JoinError`
- Updated all relevant match arms and constructors to handle boxed errors.
- Refactored error-related macros (`arrow_datafusion_err!`, `sql_datafusion_err!`, etc.) to use `Box<T>`.
- Adjusted test cases and error assertions for boxed variants.
- Documentation update to the upgrade guide to explain the required changes and rationale.
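
A small Rust sketch of what boxing an error variant involves is given here. The types `BigError`, `InlineError`, and `BoxedError` are hypothetical and unrelated to the real `DataFusionError` definition. The sketch assumes that a common motivation for such a refactor is to shrink the enum, and therefore every `Result` that carries it; the commit message itself only says the rationale is in the upgrade guide.

```rust
use std::mem::size_of;

/// Hypothetical large error payload standing in for a wrapped library error.
#[derive(Debug)]
pub struct BigError {
    pub message: String,
    pub context: [u64; 16],
}

/// Before: the payload is stored inline, so the enum is at least as large
/// as its largest variant.
#[derive(Debug)]
pub enum InlineError {
    Big(BigError),
    Other(String),
}

/// After: the payload lives on the heap and the variant holds one pointer.
#[derive(Debug)]
pub enum BoxedError {
    Big(Box<BigError>),
    Other(String),
}

/// Constructors and `?` conversions now have to box the inner error.
impl From<BigError> for BoxedError {
    fn from(e: BigError) -> Self {
        BoxedError::Big(Box::new(e))
    }
}

/// Match arms bind a `&Box<BigError>`; field access auto-dereferences.
pub fn describe(err: &BoxedError) -> String {
    match err {
        BoxedError::Big(inner) => format!("big: {}", inner.message),
        BoxedError::Other(msg) => format!("other: {msg}"),
    }
}

/// Returns (inline size, boxed size) in bytes for comparison.
pub fn enum_sizes() -> (usize, usize) {
    (size_of::<InlineError>(), size_of::<BoxedError>())
}
```

Callers that previously constructed a variant directly from the inner error would, under this pattern, either call `Box::new` themselves or go through a `From` impl like the one shown.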
…on and Mapping (#16583)

- Introduced a new `schema_adapter_factory` field in `ListingTableConfig` and `ListingTable`
- Added `with_schema_adapter_factory` and `schema_adapter_factory()` methods to both structs
- Modified execution planning logic to apply schema adapters during scanning
- Updated statistics collection to use mapped schemas
- Implemented detailed documentation and example usage in doc comments
- Added new unit and integration tests validating schema adapter behavior and error cases
* Reuse Rows in RowCursorStream

* WIP

* Fmt

* Add comment, make it backwards compatible

* Add comment, make it backwards compatible

* Add comment, make it backwards compatible

* Clippy

* Clippy

* Return error on non-unique reference

* Comment

* Update datafusion/physical-plan/src/sorts/stream.rs

Co-authored-by: Oleks V <comphead@users.noreply.github.com>

* Fix

* Extract logic

* Doc fix

---------

Co-authored-by: Oleks V <comphead@users.noreply.github.com>
#16630)

* Perf: fast CursorValues compare for StringViewArray using inline_key_fast

* fix

* polish

* polish

* add test

---------

Co-authored-by: Daniël Heres <danielheres@gmail.com>
One step towards #16652.

Co-authored-by: Oleks V <comphead@users.noreply.github.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Daniël Heres <danielheres@gmail.com>
* Refactor StreamJoinMetrics to reuse BaselineMetrics

Signed-off-by: Alan Tang <jmtangcs@gmail.com>

* use the record_poll method to update output rows

Signed-off-by: Alan Tang <jmtangcs@gmail.com>

---------

Signed-off-by: Alan Tang <jmtangcs@gmail.com>
* Remove unused AggregateUDF struct

* Fix docs

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
)

* chore: refactor `BuildProbeJoinMetrics` to use `BaselineMetrics`

Closes #16495

Here's an example of an `explain analyze` of a hash join showing these metrics:
```
[(WatchID@0, WatchID@0)], metrics=[output_rows=100, elapsed_compute=2.313624ms, build_input_batches=1, build_input_rows=100, input_batches=1, input_rows=100, output_batches=1, build_mem_used=3688, build_time=865.832µs, join_time=1.369875ms]
```

Notice `output_rows=100, elapsed_compute=2.313624ms` in the above.

* test: add checks for join metrics in tests

* fix: add record_poll to ExhaustedProbeSide for nested_loop_join

This was needed because ExhaustedProbeSide state can also return output
rows - in certain types of joins. Without this, the output_rows metric
for nested loop join was wrong!
* Use compression type in file suffixes

- Add FileFormat::compression_type method
- Specify meaningful values for CSV only
- Use compression type as a part of extension for files

* Add CSV tests

* Add glob dep, use env logging

* Use a glob pattern with compression suffix for TableProviderFactory

* Conform to clippy standards

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
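
The idea in the commit above, appending a compression suffix to a format's file extension and using it in a glob pattern, can be sketched in Rust as follows. The `Compression` enum, the suffix strings, `file_extension`, and `glob_pattern` are hypothetical illustrations and do not reproduce the `FileFormat::compression_type` method named in the message.

```rust
/// Hypothetical compression kinds with conventional file suffixes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Compression {
    Uncompressed,
    Gzip,
    Bzip2,
    Xz,
    Zstd,
}

impl Compression {
    /// Suffix appended after the format extension; empty when uncompressed.
    pub fn suffix(self) -> &'static str {
        match self {
            Compression::Uncompressed => "",
            Compression::Gzip => ".gz",
            Compression::Bzip2 => ".bz2",
            Compression::Xz => ".xz",
            Compression::Zstd => ".zst",
        }
    }
}

/// Combines a format extension such as "csv" with the compression suffix.
/// A leading dot on `format_ext` is tolerated.
pub fn file_extension(format_ext: &str, compression: Compression) -> String {
    format!(".{}{}", format_ext.trim_start_matches('.'), compression.suffix())
}

/// Builds a glob pattern matching files of that format and compression
/// directly inside `dir`.
pub fn glob_pattern(dir: &str, format_ext: &str, compression: Compression) -> String {
    format!(
        "{}/*{}",
        dir.trim_end_matches('/'),
        file_extension(format_ext, compression)
    )
}
```

Under these assumptions a gzip-compressed CSV table would be listed with a pattern ending in `*.csv.gz` rather than `*.csv`.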
* Refactor SortMergeJoinMetrics to reuse BaselineMetrics

Signed-off-by: Alan Tang <jmtangcs@gmail.com>

* use record_poll method to update output_rows

Signed-off-by: Alan Tang <jmtangcs@gmail.com>

* chore: Replace replace_poll with replace_output

Signed-off-by: Alan Tang <jmtangcs@gmail.com>

---------

Signed-off-by: Alan Tang <jmtangcs@gmail.com>
* Add support for Arrow Dictionary type in Substrait

This commit adds support for the Arrow Dictionary type in Substrait
plans.

Resolves #16273

* Add more specific type variation consts
* fix sqllogictest condition mismatch

* Update test file condition

* revert changes in sqllogictests

---------

Co-authored-by: Leon Lin <lliangyu@amazon.com>
…ring physical planning (#16454)

* Fix duplicates on Join creation during physical planning

* Add Substrait reproducer

* Better error message & more doc

* Handle case for right/left/full joins as well
---
updated-dependencies:
- dependency-name: tokio
  dependency-version: 1.46.1
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
@pull pull bot merged commit ebb8e95 into Stars1233:main Jul 7, 2025
3 checks passed