Conversation

@mach-kernel mach-kernel commented Jul 29, 2025

Changes

  • Merges upstream-48.0.1 into spiceai-48 (branched off of spiceai-47) to apply upstream changes to our previous DF release
  • Reconciles upstream changes merging ExtendedColumnProjector -> PartitionColumnProjector + Spice AI tweaks
  • Reconciles FileScanConfig / FileScanConfigBuilder deprecations for our changes (see the builder sketch after this list)
  • Applies SPI changes for create_physical_plan with filters & related partition pruning support to examples and datasources
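For reviewers unfamiliar with the deprecation, here is a minimal sketch of the builder shape we migrated to. The schema, file, and limit are placeholders rather than our actual datasource code, and the import paths are from memory, so they may differ slightly between releases:

```rust
use std::sync::Arc;

use datafusion::arrow::datatypes::SchemaRef;
use datafusion::datasource::listing::PartitionedFile;
use datafusion::datasource::object_store::ObjectStoreUrl;
use datafusion::datasource::physical_plan::{FileScanConfig, FileScanConfigBuilder, ParquetSource};

// Replaces direct FileScanConfig construction/mutation with the builder API.
fn build_scan_config(file_schema: SchemaRef) -> FileScanConfig {
    FileScanConfigBuilder::new(
        ObjectStoreUrl::local_filesystem(),
        file_schema,
        Arc::new(ParquetSource::default()),
    )
    .with_file(PartitionedFile::new("data/part-0.parquet", 1024))
    .with_limit(Some(10))
    .build()
}
```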

Diff of non-upstream changes

Test changes

Please look at these commits and ensure that the correct assumptions are being made.

  • test_metadata_columns dependent on implicit sort behavior: d514acb (see the ordering sketch below the test summary)
    • (See commit message)
  • test_count_wildcard_on_sort stale snapshot: a34bcf3
    • Looks like CommonSubexprEliminate on count(*) rewrote the plan. The new plan looks OK and more closely matches the DF API plan.
test result: ok. 614 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 6.52s
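Referenced from the first bullet above: this is not what d514acb does (that commit instead extends the selection so all ids appear); it is just a generic sketch of how a test can pin result order explicitly instead of relying on scan order, with hypothetical table and column names:

```rust
use datafusion::error::Result;
use datafusion::prelude::{col, SessionContext};

async fn collect_deterministically(ctx: &SessionContext) -> Result<()> {
    let batches = ctx
        .sql("SELECT id, size, location, last_modified FROM t")
        .await?
        // Explicit sort removes the dependence on implicit scan ordering.
        .sort(vec![
            col("id").sort(true, false),
            col("location").sort(true, false),
        ])?
        .collect()
        .await?;
    assert!(!batches.is_empty());
    Ok(())
}
```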

xudong963 and others added 30 commits April 23, 2025 09:36
… functions (apache#13511)

* Add within group variable to aggregate function and arguments

* Support within group and disable null handling for ordered set aggregate functions (apache#13511)

* Refactored function to match updated signature

* Modify proto to support within group clause

* Modify physical planner and accumulator to support ordered set aggregate function

* Support session management for ordered set aggregate functions

* Align code, tests, and examples with changes to aggregate function logic

* Ensure compatibility with new `within_group` and `order_by` handling.

* Adjust tests and examples to align with the new logic.

* Fix typo in existing comments

* Enhance test

* Add test cases for changed signature

* Update signature in docs

* Fix bug: handle missing within_group when applying children tree node

* Change the signature of approx_percentile_cont for consistency

* Add missing within_group for expr display

* Handle edge case when over and within group clause are used together

* Apply clippy advice: avoids too many arguments

* Add new test cases using descending order

* Apply cargo fmt

* Revert unintended submodule changes

* Apply prettier guidance

* Apply doc guidance by update_function_doc.sh

* Rollback WITHIN GROUP and related logic after converting it into expr

* Make it not handle redundant logic

* Roll back ordered set aggregate functions from session; store the same info in the udf itself

* Convert within group to order by when converting sql to expr

* Add function to determine it is ordered-set aggregate function

* Rollback within group from proto

* Utilize within group as order by in functions-aggregate

* Apply clippy

* Convert order by to within group

* Apply cargo fmt

* Remove plain line breaks

* Remove duplicated column arg in schema name

* Refactor boolean functions to just return primitive type

* Make within group necessary in the signature of existing ordered set aggr funcs

* Apply cargo fmt

* Support a single ordering expression in the signature

* Apply cargo fmt

* Add dataframe function test cases to verify descending ordering

* Apply cargo fmt

* Apply code reviews

* Uses order by consistently after done with sql

* Remove redundant comment

* Serve a clearer error msg

* Handle error cases in the same code block

* Update error msg in test as corresponding code changed

* fix

---------

Co-authored-by: Jay Zhan <jayzhan211@gmail.com>
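For orientation on the WITHIN GROUP commits above, an example of the ordered-set aggregate call shape as I read it from the commit messages; treat the exact argument layout as an assumption rather than confirmed syntax:

```rust
use datafusion::error::Result;
use datafusion::prelude::SessionContext;

async fn query_p75(ctx: &SessionContext) -> Result<()> {
    // Assumed post-change syntax: the percentile is a direct argument and the
    // sorted input comes from the WITHIN GROUP clause.
    ctx.sql("SELECT approx_percentile_cont(0.75) WITHIN GROUP (ORDER BY price) FROM sales")
        .await?
        .show()
        .await?;
    Ok(())
}
```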
Bumps [env_logger](https://github.com/rust-cli/env_logger) from 0.11.7 to 0.11.8.
- [Release notes](https://github.com/rust-cli/env_logger/releases)
- [Changelog](https://github.com/rust-cli/env_logger/blob/main/CHANGELOG.md)
- [Commits](rust-cli/env_logger@v0.11.7...v0.11.8)

---
updated-dependencies:
- dependency-name: env_logger
  dependency-version: 0.11.8
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…pache#15828)

* add `memory_limit` to `MemoryPool`, and impl it for the pools in datafusion.

* Update datafusion/execution/src/memory_pool/mod.rs

Co-authored-by: Ruihang Xia <waynestxia@gmail.com>

---------

Co-authored-by: Ruihang Xia <waynestxia@gmail.com>
* Preserve projection for inline scan

* fix

---------

Co-authored-by: Vadim Piven <vadim.piven@milaboratories.com>
Bumps [pyo3](https://github.com/pyo3/pyo3) from 0.24.1 to 0.24.2.
- [Release notes](https://github.com/pyo3/pyo3/releases)
- [Changelog](https://github.com/PyO3/pyo3/blob/main/CHANGELOG.md)
- [Commits](PyO3/pyo3@v0.24.1...v0.24.2)

---
updated-dependencies:
- dependency-name: pyo3
  dependency-version: 0.24.2
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…che#15822)

* Fix: fetch is missing in EnforceSort

* add ut test_parallelize_sort_preserves_fetch

* add ut: test_plan_with_order_preserving_variants_preserves_fetch

* update

* address comments
* Fix ILIKE expression support in SQL unparser (#76)

* update tests
…ng `map_err` (apache#15796)

* First Step

* Final Step?

* Homogenisation
* Read benchmark SessionConfig from env

* Set target partitions from env by default

fix

* Set batch size from env by default

* Fix batch size option for tpch ci

* Log environment variable configuration

* Document benchmarking env variable config

* Add DATAFUSION_* env config to the bench.sh help output (the usage text pasted below documents the supported variables):

Orchestrates running benchmarks against DataFusion checkouts

Usage:
./bench.sh data [benchmark] [query]
./bench.sh run [benchmark]
./bench.sh compare <branch1> <branch2>
./bench.sh venv

**********
Examples:
**********
# Create the datasets for all benchmarks in /Users/christian/MA/datafusion/benchmarks/data
./bench.sh data

# Run the 'tpch' benchmark on the datafusion checkout in /source/datafusion
DATAFUSION_DIR=/source/datafusion ./bench.sh run tpch

**********
* Commands
**********
data:         Generates or downloads data needed for benchmarking
run:          Runs the named benchmark
compare:      Compares results from benchmark runs
venv:         Creates new venv (unless already exists) and installs compare's requirements into it

**********
* Benchmarks
**********
all(default): Data/Run/Compare for all benchmarks
tpch:                   TPCH inspired benchmark on Scale Factor (SF) 1 (~1GB), single parquet file per table, hash join
tpch_mem:               TPCH inspired benchmark on Scale Factor (SF) 1 (~1GB), query from memory
tpch10:                 TPCH inspired benchmark on Scale Factor (SF) 10 (~10GB), single parquet file per table, hash join
tpch_mem10:             TPCH inspired benchmark on Scale Factor (SF) 10 (~10GB), query from memory
cancellation:           How long cancelling a query takes
parquet:                Benchmark of parquet reader's filtering speed
sort:                   Benchmark of sorting speed
sort_tpch:              Benchmark of sorting speed for end-to-end sort queries on TPCH dataset
clickbench_1:           ClickBench queries against a single parquet file
clickbench_partitioned: ClickBench queries against a partitioned (100 files) parquet
clickbench_extended:    ClickBench "inspired" queries against a single parquet (DataFusion specific)
external_aggr:          External aggregation benchmark
h2o_small:              h2oai benchmark with small dataset (1e7 rows) for groupby,  default file format is csv
h2o_medium:             h2oai benchmark with medium dataset (1e8 rows) for groupby, default file format is csv
h2o_big:                h2oai benchmark with large dataset (1e9 rows) for groupby,  default file format is csv
h2o_small_join:         h2oai benchmark with small dataset (1e7 rows) for join,  default file format is csv
h2o_medium_join:        h2oai benchmark with medium dataset (1e8 rows) for join, default file format is csv
h2o_big_join:           h2oai benchmark with large dataset (1e9 rows) for join,  default file format is csv
imdb:                   Join Order Benchmark (JOB) using the IMDB dataset converted to parquet

**********
* Supported Configuration (Environment Variables)
**********
DATA_DIR            directory to store datasets
CARGO_COMMAND       command that runs the benchmark binary
DATAFUSION_DIR      directory to use (default /Users/christian/MA/datafusion/benchmarks/..)
RESULTS_NAME        folder where the benchmark files are stored
PREFER_HASH_JOIN    Prefer hash join algorithm (default true)
VENV_PATH           Python venv to use for compare and venv commands (default ./venv, override by <your-venv>/bin/activate)
DATAFUSION_*        Set the given datafusion configuration

* fmt
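For the env-config commit above, a sketch of how `DATAFUSION_*` variables map to a session in Rust via `SessionConfig::from_env()`; whether the benchmark binaries go through exactly this call is an assumption on my part:

```rust
use datafusion::error::Result;
use datafusion::prelude::{SessionConfig, SessionContext};

fn context_from_env() -> Result<SessionContext> {
    // Picks up DATAFUSION_* variables, e.g.
    // DATAFUSION_EXECUTION_BATCH_SIZE=8192 or DATAFUSION_EXECUTION_TARGET_PARTITIONS=8.
    let config = SessionConfig::from_env()?;
    Ok(SessionContext::new_with_config(config))
}
```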
…5764)

* predicate pruning: support dictionaries

* more types

* clippy

* add tests

* add tests

* simplify to dicts

* revert most changes

* just check for strings, more tests

* more tests

* remove unnecessary, now-confusing clause
* add fetch to CoalescePartitionsExecNode

* gen proto code

* Add test

* fix

* fix build

* Fix test build

* remove comments
Bumps [clap](https://github.com/clap-rs/clap) from 4.5.36 to 4.5.37.
- [Release notes](https://github.com/clap-rs/clap/releases)
- [Changelog](https://github.com/clap-rs/clap/blob/master/CHANGELOG.md)
- [Commits](clap-rs/clap@clap_complete-v4.5.36...clap_complete-v4.5.37)

---
updated-dependencies:
- dependency-name: clap
  dependency-version: 4.5.37
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Fix `from_unixtime` function documentation

* Update scalar_functions.md
* interval singleton

* fmt

* impl from
* refactor and make `QueryBuilder` more configurable.

* fix tests.

* fix clippy.

* extract `QueryBuilder` to a dedicated module.

* add `min_group_by_columns`, and fix some bugs.
Bumps [aws-config](https://github.com/smithy-lang/smithy-rs) from 1.6.1 to 1.6.2.
- [Release notes](https://github.com/smithy-lang/smithy-rs/releases)
- [Changelog](https://github.com/smithy-lang/smithy-rs/blob/main/CHANGELOG.md)
- [Commits](https://github.com/smithy-lang/smithy-rs/commits)

---
updated-dependencies:
- dependency-name: aws-config
  dependency-version: 1.6.2
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…apache#15723)

* Add slt tests for datafusion.execution.parquet.coerce_int96 setting

* tweak
* Improve `ListingTable` / `ListingTableOptions` docs

* Update datafusion/core/src/datasource/listing/table.rs

Co-authored-by: Alex Huang <huangweijun1001@gmail.com>

---------

Co-authored-by: Alex Huang <huangweijun1001@gmail.com>
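Since the `ListingTable` / `ListingTableOptions` docs are touched above, a minimal usage sketch for orientation; the path and table name are hypothetical:

```rust
use std::sync::Arc;

use datafusion::datasource::file_format::parquet::ParquetFormat;
use datafusion::datasource::listing::{
    ListingOptions, ListingTable, ListingTableConfig, ListingTableUrl,
};
use datafusion::error::Result;
use datafusion::prelude::SessionContext;

async fn register_listing_table(ctx: &SessionContext) -> Result<()> {
    // Hypothetical local path; any registered object store URL works the same way.
    let table_path = ListingTableUrl::parse("file:///data/events/")?;
    let options = ListingOptions::new(Arc::new(ParquetFormat::default()))
        .with_file_extension(".parquet");
    let schema = options.infer_schema(&ctx.state(), &table_path).await?;
    let config = ListingTableConfig::new(table_path)
        .with_listing_options(options)
        .with_schema(schema);
    ctx.register_table("events", Arc::new(ListingTable::try_new(config)?))?;
    Ok(())
}
```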
…adline (apache#15883)

I noticed that https://datafusion.apache.org/library-user-guide/upgrading.html#filescanconfig-filescanconfigbuilder had "FileScanConfig –> FileScanConfigBuilder" as a top-level headline. It should probably be under the 47 release.
blaginin and others added 22 commits June 5, 2025 17:02
* Handle dicts for distinct count

* Fix sqllogictests

* Add bench

* Fix no fix the bench

* Do not panic if error type is bad

* Add full bench query

* Set the bench

* Add dict of dict test

* Fix tests

* Rename method

* Increase the grouping test

* Increase the grouping test a bit more :)

* Fix flakiness

---------

Co-authored-by: Dmitrii Blaginin <blaginin@bmac.local>
* Add substrait roundtrip option in sqllogictests

* Fix doc link and missing license header

* Add README.md entry for the Substrait round-trip mode

* Link tracking issue in README.md

* Use clap's `conflicts_with` instead of manually checking flag compatibility

* Add sqllogictest-substrait job to the CI

* Revert committed formatting changes to README.md
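A generic sketch of the clap `conflicts_with` pattern mentioned above; the struct and flag names are illustrative, not the actual sqllogictest CLI:

```rust
use clap::Parser;

#[derive(Parser, Debug)]
struct Options {
    /// Round-trip each query through Substrait before executing it.
    #[arg(long)]
    substrait_round_trip: bool,

    /// Hypothetical flag that cannot be combined with the round-trip mode;
    /// clap rejects the combination so no manual compatibility check is needed.
    #[arg(long, conflicts_with = "substrait_round_trip")]
    complete: bool,
}

fn main() {
    let opts = Options::parse();
    println!("{opts:?}");
}
```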
* Work in progress adding user defined aggregate function FFI support

* Intermediate work. Going through groups accumulator

* MVP for aggregate udf via FFI

* Clean up after rebase

* Add unit test for FFI Accumulator Args

* Adding unit tests and fixing memory errors in aggregate ffi udf

* Working through additional unit and integration tests for UDAF ffi

* Switch to a accumulator that supports convert to state to get a little better coverage

* Set feature so we do not get an error warning in stable rustc

* Add more options to test

* Add unit test for FFI RecordBatchStream

* Add a few more args to ffi accumulator test fn

* Adding more unit tests on ffi aggregate udaf

* taplo format

* Update code comment

* Correct function name

* Temp fix record batch test dependencies

* Address some comments

* Revise comments and address PR comments

* Remove commented code

* Refactor GroupsAccumulator

* Add documentation

* Split integration tests

* Address comments to refactor error handling for opt filter

* Fix linting errors

* Fix linting and add deref

* Remove extra tests and unnecessary code

* Adjustments to FFI aggregate functions after rebase on main

* cargo fmt

* cargo clippy

* Re-implement cleaned up code that was removed in last push

* Minor review comments

---------

Co-authored-by: Crystal Zhou <crystal.zhouxiaoyue@hotmail.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
…ion (apache#16255)

* Add BaselineMetrics to LazyMemoryStream

* UT
* Initial commit of UDWF via FFI

* Work in progress on integration testing of udwf

* Rebase due to UDF changes upstream
…ndow expression" (apache#16307)

* Revert "Improve performance of constant aggregate window expression (apache#16234)"

This reverts commit 0c30374.

* update changelog

* update changelog
* [branch-48] Update CHANGELOG for latest 48.0.0 release

* prettier
…n scan (apache#16646) (apache#16656)

* respect parquet filter pushdown config in scan

* Add test

Co-authored-by: Adrian Garcia Badaracco <1755071+adriangb@users.noreply.github.com>
…tistics to true apache#16447  (apache#16659)

* Set the default value of `datafusion.execution.collect_statistics` to `true` (apache#16447)

* fix sqllogicaltests
* Add upgrade note

(cherry picked from commit 2d7ae09)

* Update row group pruning

---------

Co-authored-by: Adam Gutglick <adam@spiraldb.com>
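The commits above flip the default of `datafusion.execution.collect_statistics` and make the scan respect the parquet filter-pushdown setting; a minimal sketch of setting both options explicitly, using the config keys as documented:

```rust
use datafusion::prelude::{SessionConfig, SessionContext};

fn context_with_explicit_settings() -> SessionContext {
    let config = SessionConfig::new()
        // Collect file statistics when tables are registered
        // (this is the option whose default changes to true).
        .set_bool("datafusion.execution.collect_statistics", true)
        // Push row filters down into the parquet scan.
        .set_bool("datafusion.execution.parquet.pushdown_filters", true);
    SessionContext::new_with_config(config)
}
```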
…#16657)

* Column indices were not computed correctly, causing a panic

* Add unit tests

Co-authored-by: Tim Saucer <timsaucer@gmail.com>
* Update version to 48.0.1

* Add link to upgrade guide in changelog script

* prettier

* update guide
…ata_cols -> filescanconfigbuilder, tests to use partitioncolumnprojector
… changed between releases. The test has been extended to select a multiple of the partitions so that we can assert on ids 0,1,2,3 entirely.

The output of the failure BEFORE this change is below -- I think the new "actual" output is more correct, as it starts in order of the month/day buckets. In any case, for the purposes of testing the metadata columns, this works, which is why the change was made.

expected:

[
    "+----+------+----------------------------------------+----------------------+",
    "| id | size | location                               | last_modified        |",
    "+----+------+----------------------------------------+----------------------+",
    "| 0  | 1851 | year=2021/month=09/day=09/file.parquet | 1970-01-01T00:00:00Z |",
    "| 0  | 1851 | year=2021/month=10/day=09/file.parquet | 1970-01-01T00:00:00Z |",
    "| 0  | 1851 | year=2021/month=10/day=28/file.parquet | 1970-01-01T00:00:00Z |",
    "| 1  | 1851 | year=2021/month=09/day=09/file.parquet | 1970-01-01T00:00:00Z |",
    "| 1  | 1851 | year=2021/month=10/day=09/file.parquet | 1970-01-01T00:00:00Z |",
    "| 1  | 1851 | year=2021/month=10/day=28/file.parquet | 1970-01-01T00:00:00Z |",
    "| 2  | 1851 | year=2021/month=09/day=09/file.parquet | 1970-01-01T00:00:00Z |",
    "| 2  | 1851 | year=2021/month=10/day=09/file.parquet | 1970-01-01T00:00:00Z |",
    "| 2  | 1851 | year=2021/month=10/day=28/file.parquet | 1970-01-01T00:00:00Z |",
    "| 3  | 1851 | year=2021/month=10/day=28/file.parquet | 1970-01-01T00:00:00Z |",
    "+----+------+----------------------------------------+----------------------+",
]
actual:

[
    "+----+------+----------------------------------------+----------------------+",
    "| id | size | location                               | last_modified        |",
    "+----+------+----------------------------------------+----------------------+",
    "| 0  | 1851 | year=2021/month=09/day=09/file.parquet | 1970-01-01T00:00:00Z |",
    "| 0  | 1851 | year=2021/month=10/day=09/file.parquet | 1970-01-01T00:00:00Z |",
    "| 0  | 1851 | year=2021/month=10/day=28/file.parquet | 1970-01-01T00:00:00Z |",
    "| 1  | 1851 | year=2021/month=09/day=09/file.parquet | 1970-01-01T00:00:00Z |",
    "| 1  | 1851 | year=2021/month=10/day=09/file.parquet | 1970-01-01T00:00:00Z |",
    "| 1  | 1851 | year=2021/month=10/day=28/file.parquet | 1970-01-01T00:00:00Z |",
    "| 2  | 1851 | year=2021/month=09/day=09/file.parquet | 1970-01-01T00:00:00Z |",
    "| 2  | 1851 | year=2021/month=10/day=09/file.parquet | 1970-01-01T00:00:00Z |",
    "| 2  | 1851 | year=2021/month=10/day=28/file.parquet | 1970-01-01T00:00:00Z |",
    "| 3  | 1851 | year=2021/month=09/day=09/file.parquet | 1970-01-01T00:00:00Z |",
    "+----+------+----------------------------------------+----------------------+",
]
@mach-kernel mach-kernel self-assigned this Jul 29, 2025
@lukekim lukekim requested review from phillipleblanc, kczimm and a team July 29, 2025 18:41

@kczimm kczimm left a comment


LGTM!

@mach-kernel mach-kernel merged commit 9c4022b into spiceai-48 Aug 1, 2025
@mach-kernel mach-kernel deleted the upstream-48.0.1 branch August 1, 2025 14:01
@mach-kernel mach-kernel restored the upstream-48.0.1 branch August 1, 2025 19:59