Conversation

@mach-kernel mach-kernel commented Jul 29, 2025

Changes

  • Merges upstream-48.0.1 into spiceai-48 (branched off of spiceai-47) to apply upstream changes to our previous DF release
  • Reconciles upstream changes merging ExtendedColumnProjector -> PartitionColumnProjector + Spice AI tweaks
  • Reconciles FileScanConfig / FileScanConfigBuilder deprecations for our changes (see the builder sketch after this list)
  • Applies SPI changes for create_physical_plan with filters & related partition pruning support to examples and datasources
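For reviewers unfamiliar with the deprecation, here is a minimal sketch of the builder shape we migrated to. The schema, file, and limit are placeholders rather than our actual datasource code, and the import paths are from memory, so they may differ slightly between releases:

```rust
use std::sync::Arc;

use datafusion::arrow::datatypes::SchemaRef;
use datafusion::datasource::listing::PartitionedFile;
use datafusion::datasource::object_store::ObjectStoreUrl;
use datafusion::datasource::physical_plan::{FileScanConfig, FileScanConfigBuilder, ParquetSource};

// Replaces direct FileScanConfig construction/mutation with the builder API.
fn build_scan_config(file_schema: SchemaRef) -> FileScanConfig {
    FileScanConfigBuilder::new(
        ObjectStoreUrl::local_filesystem(),
        file_schema,
        Arc::new(ParquetSource::default()),
    )
    .with_file(PartitionedFile::new("data/part-0.parquet", 1024))
    .with_limit(Some(10))
    .build()
}
```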

Diff of non-upstream changes

Test changes

Please look at these commits and ensure that the correct assumptions are being made.

  • test_metadata_columns dependent on implicit sort behavior: d514acb (see the ordering sketch below the test summary)
    • (See commit message)
  • test_count_wildcard_on_sort stale snapshot: a34bcf3
    • Looks like CommonSubexprEliminate on count(*) rewrote the plan. The new plan looks OK and more closely matches the DF API plan.
test result: ok. 614 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 6.52s
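Referenced from the first bullet above: this is not what d514acb does (that commit instead extends the selection so all ids appear); it is just a generic sketch of how a test can pin result order explicitly instead of relying on scan order, with hypothetical table and column names:

```rust
use datafusion::error::Result;
use datafusion::prelude::{col, SessionContext};

async fn collect_deterministically(ctx: &SessionContext) -> Result<()> {
    let batches = ctx
        .sql("SELECT id, size, location, last_modified FROM t")
        .await?
        // Explicit sort removes the dependence on implicit scan ordering.
        .sort(vec![
            col("id").sort(true, false),
            col("location").sort(true, false),
        ])?
        .collect()
        .await?;
    assert!(!batches.is_empty());
    Ok(())
}
```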

xudong963 and others added 30 commits April 23, 2025 09:36
… functions (apache#13511)

* Add within group variable to aggregate function and arguments

* Support within group and disable null handling for ordered set aggregate functions (apache#13511)

* Refactored function to match updated signature

* Modify proto to support within group clause

* Modify physical planner and accumulator to support ordered set aggregate function

* Support session management for ordered set aggregate functions

* Align code, tests, and examples with changes to aggregate function logic

* Ensure compatibility with new `within_group` and `order_by` handling.

* Adjust tests and examples to align with the new logic.

* Fix typo in existing comments

* Enhance test

* Add test cases for changed signature

* Update signature in docs

* Fix bug: handle missing within_group when applying children tree node

* Change the signature of approx_percentile_cont for consistency

* Add missing within_group for expr display

* Handle edge case when over and within group clause are used together

* Apply clippy advice: avoids too many arguments

* Add new test cases using descending order

* Apply cargo fmt

* Revert unintended submodule changes

* Apply prettier guidance

* Apply doc guidance by update_function_doc.sh

* Rollback WITHIN GROUP and related logic after converting it into expr

* Make it not handle redundant logic

* Roll back ordered set aggregate functions from session; store the same info in the udf itself

* Convert within group to order by when converting sql to expr

* Add function to determine it is ordered-set aggregate function

* Rollback within group from proto

* Utilize within group as order by in functions-aggregate

* Apply clippy

* Convert order by to within group

* Apply cargo fmt

* Remove plain line breaks

* Remove duplicated column arg in schema name

* Refactor boolean functions to just return primitive type

* Make within group necessary in the signature of existing ordered set aggr funcs

* Apply cargo fmt

* Support a single ordering expression in the signature

* Apply cargo fmt

* Add dataframe function test cases to verify descending ordering

* Apply cargo fmt

* Apply code reviews

* Uses order by consistently after done with sql

* Remove redundant comment

* Serve a clearer error msg

* Handle error cases in the same code block

* Update error msg in test as corresponding code changed

* fix

---------

Co-authored-by: Jay Zhan <jayzhan211@gmail.com>
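For orientation on the WITHIN GROUP commits above, an example of the ordered-set aggregate call shape as I read it from the commit messages; treat the exact argument layout as an assumption rather than confirmed syntax:

```rust
use datafusion::error::Result;
use datafusion::prelude::SessionContext;

async fn query_p75(ctx: &SessionContext) -> Result<()> {
    // Assumed post-change syntax: the percentile is a direct argument and the
    // sorted input comes from the WITHIN GROUP clause.
    ctx.sql("SELECT approx_percentile_cont(0.75) WITHIN GROUP (ORDER BY price) FROM sales")
        .await?
        .show()
        .await?;
    Ok(())
}
```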
Bumps [env_logger](https://github.com/rust-cli/env_logger) from 0.11.7 to 0.11.8.
- [Release notes](https://github.com/rust-cli/env_logger/releases)
- [Changelog](https://github.com/rust-cli/env_logger/blob/main/CHANGELOG.md)
- [Commits](rust-cli/env_logger@v0.11.7...v0.11.8)

---
updated-dependencies:
- dependency-name: env_logger
  dependency-version: 0.11.8
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…pache#15828)

* add `memory_limit` to `MemoryPool`, and impl it for the pools in datafusion.

* Update datafusion/execution/src/memory_pool/mod.rs

Co-authored-by: Ruihang Xia <waynestxia@gmail.com>

---------

Co-authored-by: Ruihang Xia <waynestxia@gmail.com>
* Preserve projection for inline scan

* fix

---------

Co-authored-by: Vadim Piven <vadim.piven@milaboratories.com>
Bumps [pyo3](https://github.com/pyo3/pyo3) from 0.24.1 to 0.24.2.
- [Release notes](https://github.com/pyo3/pyo3/releases)
- [Changelog](https://github.com/PyO3/pyo3/blob/main/CHANGELOG.md)
- [Commits](PyO3/pyo3@v0.24.1...v0.24.2)

---
updated-dependencies:
- dependency-name: pyo3
  dependency-version: 0.24.2
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…che#15822)

* Fix: fetch is missing in EnforceSort

* add ut test_parallelize_sort_preserves_fetch

* add ut: test_plan_with_order_preserving_variants_preserves_fetch

* update

* address comments
* Fix ILIKE expression support in SQL unparser (#76)

* update tests
…ng `map_err` (apache#15796)

* First Step

* Final Step?

* Homogenisation
* Read benchmark SessionConfig from env

* Set target partitions from env by default

fix

* Set batch size from env by default

* Fix batch size option for tpch ci

* Log environment variable configuration

* Document benchmarking env variable config

* Add DATAFUSION_* env config to the bench.sh help output (the usage text pasted below documents the supported variables):

Orchestrates running benchmarks against DataFusion checkouts

Usage:
./bench.sh data [benchmark] [query]
./bench.sh run [benchmark]
./bench.sh compare <branch1> <branch2>
./bench.sh venv

**********
Examples:
**********
# Create the datasets for all benchmarks in /Users/christian/MA/datafusion/benchmarks/data
./bench.sh data

# Run the 'tpch' benchmark on the datafusion checkout in /source/datafusion
DATAFUSION_DIR=/source/datafusion ./bench.sh run tpch

**********
* Commands
**********
data:         Generates or downloads data needed for benchmarking
run:          Runs the named benchmark
compare:      Compares results from benchmark runs
venv:         Creates new venv (unless already exists) and installs compare's requirements into it

**********
* Benchmarks
**********
all(default): Data/Run/Compare for all benchmarks
tpch:                   TPCH inspired benchmark on Scale Factor (SF) 1 (~1GB), single parquet file per table, hash join
tpch_mem:               TPCH inspired benchmark on Scale Factor (SF) 1 (~1GB), query from memory
tpch10:                 TPCH inspired benchmark on Scale Factor (SF) 10 (~10GB), single parquet file per table, hash join
tpch_mem10:             TPCH inspired benchmark on Scale Factor (SF) 10 (~10GB), query from memory
cancellation:           How long cancelling a query takes
parquet:                Benchmark of parquet reader's filtering speed
sort:                   Benchmark of sorting speed
sort_tpch:              Benchmark of sorting speed for end-to-end sort queries on TPCH dataset
clickbench_1:           ClickBench queries against a single parquet file
clickbench_partitioned: ClickBench queries against a partitioned (100 files) parquet
clickbench_extended:    ClickBench "inspired" queries against a single parquet (DataFusion specific)
external_aggr:          External aggregation benchmark
h2o_small:              h2oai benchmark with small dataset (1e7 rows) for groupby,  default file format is csv
h2o_medium:             h2oai benchmark with medium dataset (1e8 rows) for groupby, default file format is csv
h2o_big:                h2oai benchmark with large dataset (1e9 rows) for groupby,  default file format is csv
h2o_small_join:         h2oai benchmark with small dataset (1e7 rows) for join,  default file format is csv
h2o_medium_join:        h2oai benchmark with medium dataset (1e8 rows) for join, default file format is csv
h2o_big_join:           h2oai benchmark with large dataset (1e9 rows) for join,  default file format is csv
imdb:                   Join Order Benchmark (JOB) using the IMDB dataset converted to parquet

**********
* Supported Configuration (Environment Variables)
**********
DATA_DIR            directory to store datasets
CARGO_COMMAND       command that runs the benchmark binary
DATAFUSION_DIR      directory to use (default /Users/christian/MA/datafusion/benchmarks/..)
RESULTS_NAME        folder where the benchmark files are stored
PREFER_HASH_JOIN    Prefer hash join algorithm (default true)
VENV_PATH           Python venv to use for compare and venv commands (default ./venv, override by <your-venv>/bin/activate)
DATAFUSION_*        Set the given datafusion configuration

* fmt
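For the env-config commit above, a sketch of how `DATAFUSION_*` variables map to a session in Rust via `SessionConfig::from_env()`; whether the benchmark binaries go through exactly this call is an assumption on my part:

```rust
use datafusion::error::Result;
use datafusion::prelude::{SessionConfig, SessionContext};

fn context_from_env() -> Result<SessionContext> {
    // Picks up DATAFUSION_* variables, e.g.
    // DATAFUSION_EXECUTION_BATCH_SIZE=8192 or DATAFUSION_EXECUTION_TARGET_PARTITIONS=8.
    let config = SessionConfig::from_env()?;
    Ok(SessionContext::new_with_config(config))
}
```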
…5764)

* predicate pruning: support dictionaries

* more types

* clippy

* add tests

* add tests

* simplify to dicts

* revert most changes

* just check for strings, more tests

* more tests

* remove unnecessary, now-confusing clause
* add fetch to CoalescePartitionsExecNode

* gen proto code

* Add test

* fix

* fix build

* Fix test build

* remove comments
Bumps [clap](https://github.com/clap-rs/clap) from 4.5.36 to 4.5.37.
- [Release notes](https://github.com/clap-rs/clap/releases)
- [Changelog](https://github.com/clap-rs/clap/blob/master/CHANGELOG.md)
- [Commits](clap-rs/clap@clap_complete-v4.5.36...clap_complete-v4.5.37)

---
updated-dependencies:
- dependency-name: clap
  dependency-version: 4.5.37
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Fix `from_unixtime` function documentation

* Update scalar_functions.md
* interval singleton

* fmt

* impl from
* refactor and make `QueryBuilder` more configurable.

* fix tests.

* fix clippy.

* extract `QueryBuilder` to a dedicated module.

* add `min_group_by_columns`, and fix some bugs.
Bumps [aws-config](https://github.com/smithy-lang/smithy-rs) from 1.6.1 to 1.6.2.
- [Release notes](https://github.com/smithy-lang/smithy-rs/releases)
- [Changelog](https://github.com/smithy-lang/smithy-rs/blob/main/CHANGELOG.md)
- [Commits](https://github.com/smithy-lang/smithy-rs/commits)

---
updated-dependencies:
- dependency-name: aws-config
  dependency-version: 1.6.2
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…apache#15723)

* Add slt tests for datafusion.execution.parquet.coerce_int96 setting

* tweak
* Improve `ListingTable` / `ListingTableOptions` docs

* Update datafusion/core/src/datasource/listing/table.rs

Co-authored-by: Alex Huang <huangweijun1001@gmail.com>

---------

Co-authored-by: Alex Huang <huangweijun1001@gmail.com>
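Since the `ListingTable` / `ListingTableOptions` docs are touched above, a minimal usage sketch for orientation; the path and table name are hypothetical:

```rust
use std::sync::Arc;

use datafusion::datasource::file_format::parquet::ParquetFormat;
use datafusion::datasource::listing::{
    ListingOptions, ListingTable, ListingTableConfig, ListingTableUrl,
};
use datafusion::error::Result;
use datafusion::prelude::SessionContext;

async fn register_listing_table(ctx: &SessionContext) -> Result<()> {
    // Hypothetical local path; any registered object store URL works the same way.
    let table_path = ListingTableUrl::parse("file:///data/events/")?;
    let options = ListingOptions::new(Arc::new(ParquetFormat::default()))
        .with_file_extension(".parquet");
    let schema = options.infer_schema(&ctx.state(), &table_path).await?;
    let config = ListingTableConfig::new(table_path)
        .with_listing_options(options)
        .with_schema(schema);
    ctx.register_table("events", Arc::new(ListingTable::try_new(config)?))?;
    Ok(())
}
```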
…adline (apache#15883)

I noticed that https://datafusion.apache.org/library-user-guide/upgrading.html#filescanconfig-filescanconfigbuilder had "FileScanConfig –> FileScanConfigBuilder" as a top-level headline. It should probably be under the 47 release.
blaginin and others added 22 commits June 5, 2025 17:02
* Handle dicts for distinct count

* Fix sqllogictests

* Add bench

* Fix no fix the bench

* Do not panic if error type is bad

* Add full bench query

* Set the bench

* Add dict of dict test

* Fix tests

* Rename method

* Increase the grouping test

* Increase the grouping test a bit more :)

* Fix flakiness

---------

Co-authored-by: Dmitrii Blaginin <blaginin@bmac.local>
* Add substrait roundtrip option in sqllogictests

* Fix doc link and missing license header

* Add README.md entry for the Substrait round-trip mode

* Link tracking issue in README.md

* Use clap's `conflicts_with` instead of manually checking flag compatibility

* Add sqllogictest-substrait job to the CI

* Revert committed formatting changes to README.md
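A generic sketch of the clap `conflicts_with` pattern mentioned above; the struct and flag names are illustrative, not the actual sqllogictest CLI:

```rust
use clap::Parser;

#[derive(Parser, Debug)]
struct Options {
    /// Round-trip each query through Substrait before executing it.
    #[arg(long)]
    substrait_round_trip: bool,

    /// Hypothetical flag that cannot be combined with the round-trip mode;
    /// clap rejects the combination so no manual compatibility check is needed.
    #[arg(long, conflicts_with = "substrait_round_trip")]
    complete: bool,
}

fn main() {
    let opts = Options::parse();
    println!("{opts:?}");
}
```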
* Work in progress adding user defined aggregate function FFI support

* Intermediate work. Going through groups accumulator

* MVP for aggregate udf via FFI

* Clean up after rebase

* Add unit test for FFI Accumulator Args

* Adding unit tests and fixing memory errors in aggregate ffi udf

* Working through additional unit and integration tests for UDAF ffi

* Switch to a accumulator that supports convert to state to get a little better coverage

* Set feature so we do not get an error warning in stable rustc

* Add more options to test

* Add unit test for FFI RecordBatchStream

* Add a few more args to ffi accumulator test fn

* Adding more unit tests on ffi aggregate udaf

* taplo format

* Update code comment

* Correct function name

* Temp fix record batch test dependencies

* Address some comments

* Revise comments and address PR comments

* Remove commented code

* Refactor GroupsAccumulator

* Add documentation

* Split integration tests

* Address comments to refactor error handling for opt filter

* Fix linting errors

* Fix linting and add deref

* Remove extra tests and unnecessary code

* Adjustments to FFI aggregate functions after rebase on main

* cargo fmt

* cargo clippy

* Re-implement cleaned up code that was removed in last push

* Minor review comments

---------

Co-authored-by: Crystal Zhou <crystal.zhouxiaoyue@hotmail.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
…ion (apache#16255)

* Add BaselineMetrics to LazyMemoryStream

* UT
* Initial commit of UDWF via FFI

* Work in progress on integration testing of udwf

* Rebase due to UDF changes upstream
…ndow expression" (apache#16307)

* Revert "Improve performance of constant aggregate window expression (apache#16234)"

This reverts commit 0c30374.

* update changelog

* update changelog
* [branch-48] Update CHANGELOG for latest 48.0.0 release

* prettier
…n scan (apache#16646) (apache#16656)

* respect parquet filter pushdown config in scan

* Add test

Co-authored-by: Adrian Garcia Badaracco <1755071+adriangb@users.noreply.github.com>
…tistics to true apache#16447  (apache#16659)

* Set the default value of `datafusion.execution.collect_statistics` to `true` (apache#16447)

* fix sqllogicaltests
* Add upgrade note

(cherry picked from commit 2d7ae09)

* Update row group pruning

---------

Co-authored-by: Adam Gutglick <adam@spiraldb.com>
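The commits above flip the default of `datafusion.execution.collect_statistics` and make the scan respect the parquet filter-pushdown setting; a minimal sketch of setting both options explicitly, using the config keys as documented:

```rust
use datafusion::prelude::{SessionConfig, SessionContext};

fn context_with_explicit_settings() -> SessionContext {
    let config = SessionConfig::new()
        // Collect file statistics when tables are registered
        // (this is the option whose default changes to true).
        .set_bool("datafusion.execution.collect_statistics", true)
        // Push row filters down into the parquet scan.
        .set_bool("datafusion.execution.parquet.pushdown_filters", true);
    SessionContext::new_with_config(config)
}
```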
…#16657)

* Column indices were not computed correctly, causing a panic

* Add unit tests

Co-authored-by: Tim Saucer <timsaucer@gmail.com>
* Update version to 48.0.1

* Add link to upgrade guide in changelog script

* prettier

* update guide
…ata_cols -> filescanconfigbuilder, tests to use partitioncolumnprojector
… changed between releases. The test has been extended to select a multiple of the partitions so that we can assert on ids 0,1,2,3 entirely.

The output of the failure BEFORE this change is below -- I think the new "actual" output is more correct, as it starts in order of the month/day buckets. In any case, for the purposes of testing the metadata columns, this works, which is why the change was made.

expected:

[
    "+----+------+----------------------------------------+----------------------+",
    "| id | size | location                               | last_modified        |",
    "+----+------+----------------------------------------+----------------------+",
    "| 0  | 1851 | year=2021/month=09/day=09/file.parquet | 1970-01-01T00:00:00Z |",
    "| 0  | 1851 | year=2021/month=10/day=09/file.parquet | 1970-01-01T00:00:00Z |",
    "| 0  | 1851 | year=2021/month=10/day=28/file.parquet | 1970-01-01T00:00:00Z |",
    "| 1  | 1851 | year=2021/month=09/day=09/file.parquet | 1970-01-01T00:00:00Z |",
    "| 1  | 1851 | year=2021/month=10/day=09/file.parquet | 1970-01-01T00:00:00Z |",
    "| 1  | 1851 | year=2021/month=10/day=28/file.parquet | 1970-01-01T00:00:00Z |",
    "| 2  | 1851 | year=2021/month=09/day=09/file.parquet | 1970-01-01T00:00:00Z |",
    "| 2  | 1851 | year=2021/month=10/day=09/file.parquet | 1970-01-01T00:00:00Z |",
    "| 2  | 1851 | year=2021/month=10/day=28/file.parquet | 1970-01-01T00:00:00Z |",
    "| 3  | 1851 | year=2021/month=10/day=28/file.parquet | 1970-01-01T00:00:00Z |",
    "+----+------+----------------------------------------+----------------------+",
]
actual:

[
    "+----+------+----------------------------------------+----------------------+",
    "| id | size | location                               | last_modified        |",
    "+----+------+----------------------------------------+----------------------+",
    "| 0  | 1851 | year=2021/month=09/day=09/file.parquet | 1970-01-01T00:00:00Z |",
    "| 0  | 1851 | year=2021/month=10/day=09/file.parquet | 1970-01-01T00:00:00Z |",
    "| 0  | 1851 | year=2021/month=10/day=28/file.parquet | 1970-01-01T00:00:00Z |",
    "| 1  | 1851 | year=2021/month=09/day=09/file.parquet | 1970-01-01T00:00:00Z |",
    "| 1  | 1851 | year=2021/month=10/day=09/file.parquet | 1970-01-01T00:00:00Z |",
    "| 1  | 1851 | year=2021/month=10/day=28/file.parquet | 1970-01-01T00:00:00Z |",
    "| 2  | 1851 | year=2021/month=09/day=09/file.parquet | 1970-01-01T00:00:00Z |",
    "| 2  | 1851 | year=2021/month=10/day=09/file.parquet | 1970-01-01T00:00:00Z |",
    "| 2  | 1851 | year=2021/month=10/day=28/file.parquet | 1970-01-01T00:00:00Z |",
    "| 3  | 1851 | year=2021/month=09/day=09/file.parquet | 1970-01-01T00:00:00Z |",
    "+----+------+----------------------------------------+----------------------+",
]
@mach-kernel mach-kernel self-assigned this Jul 29, 2025
@lukekim lukekim requested review from phillipleblanc, kczimm and a team July 29, 2025 18:41

@kczimm kczimm left a comment


LGTM!

@mach-kernel mach-kernel merged commit 9c4022b into spiceai-48 Aug 1, 2025
@mach-kernel mach-kernel deleted the upstream-48.0.1 branch August 1, 2025 14:01
@mach-kernel mach-kernel restored the upstream-48.0.1 branch August 1, 2025 19:59