forked from apache/datafusion
[spiceai-48] -> Update to DataFusion 48 #92
Merged
+55,546
−23,048
… EnforceDistribution (apache#15808)
* save * fmt
… functions (apache#13511) * Add within group variable to aggregate function and arguments * Support within group and disable null handling for ordered set aggregate functions (apache#13511) * Refactored function to match updated signature * Modify proto to support within group clause * Modify physical planner and accumulator to support ordered set aggregate function * Support session management for ordered set aggregate functions * Align code, tests, and examples with changes to aggregate function logic * Ensure compatibility with new `within_group` and `order_by` handling. * Adjust tests and examples to align with the new logic. * Fix typo in existing comments * Enhance test * Add test cases for changed signature * Update signature in docs * Fix bug : handle missing within_group when applying children tree node * Change the signature of approx_percentile_cont for consistency * Add missing within_group for expr display * Handle edge case when over and within group clause are used together * Apply clippy advice: avoids too many arguments * Add new test cases using descending order * Apply cargo fmt * Revert unintended submodule changes * Apply prettier guidance * Apply doc guidance by update_function_doc.sh * Rollback WITHIN GROUP and related logic after converting it into expr * Make it not to handle redundant logic * Rollback ordered set aggregate functions from session to save same info in udf itself * Convert within group to order by when converting sql to expr * Add function to determine it is ordered-set aggregate function * Rollback within group from proto * Utilize within group as order by in functions-aggregate * Apply clippy * Convert order by to within group * Apply cargo fmt * Remove plain line breaks * Remove duplicated column arg in schema name * Refactor boolean functions to just return primitive type * Make within group necessary in the signature of existing ordered set aggr funcs * Apply cargo fmt * Support a single ordering expression in the signature * Apply cargo fmt * Add dataframe function test cases to verify descending ordering * Apply cargo fmt * Apply code reviews * Uses order by consistently after done with sql * Remove redundant comment * Serve more clear error msg * Handle error cases in the same code block * Update error msg in test as corresponding code changed * fix --------- Co-authored-by: Jay Zhan <jayzhan211@gmail.com>
Bumps [env_logger](https://github.com/rust-cli/env_logger) from 0.11.7 to 0.11.8. - [Release notes](https://github.com/rust-cli/env_logger/releases) - [Changelog](https://github.com/rust-cli/env_logger/blob/main/CHANGELOG.md) - [Commits](rust-cli/env_logger@v0.11.7...v0.11.8) --- updated-dependencies: - dependency-name: env_logger dependency-version: 0.11.8 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…pache#15828) * add `memory_limit` to `MemoryPool`, and impl it for the pools in datafusion. * Update datafusion/execution/src/memory_pool/mod.rs Co-authored-by: Ruihang Xia <waynestxia@gmail.com> --------- Co-authored-by: Ruihang Xia <waynestxia@gmail.com>
* Preserve projection for inline scan * fix --------- Co-authored-by: Vadim Piven <vadim.piven@milaboratories.com>
Bumps [pyo3](https://github.com/pyo3/pyo3) from 0.24.1 to 0.24.2. - [Release notes](https://github.com/pyo3/pyo3/releases) - [Changelog](https://github.com/PyO3/pyo3/blob/main/CHANGELOG.md) - [Commits](PyO3/pyo3@v0.24.1...v0.24.2) --- updated-dependencies: - dependency-name: pyo3 dependency-version: 0.24.2 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…che#15822) * Fix: fetch is missing in EnforceSort * add ut test_parallelize_sort_preserves_fetch * add ut: test_plan_with_order_preserving_variants_preserves_fetch * update * address comments
* Fix ILIKE expression support in SQL unparser (#76) * update tests
…ng `map_err` (apache#15796) * First Step * Final Step? * Homogenisation
* Read benchmark SessionConfig from env * Set target partitions from env by default fix * Set batch size from env by default * Fix batch size option for tpch ci * Log environment variable configuration * Document benchmarking env variable config * Add DATAFUSION_* env config to Error: unknown command: help

    Orchestrates running benchmarks against DataFusion checkouts

    Usage:
    ./bench.sh data [benchmark] [query]
    ./bench.sh run [benchmark]
    ./bench.sh compare <branch1> <branch2>
    ./bench.sh venv

    **********
    Examples:
    **********
    # Create the datasets for all benchmarks in /Users/christian/MA/datafusion/benchmarks/data
    ./bench.sh data

    # Run the 'tpch' benchmark on the datafusion checkout in /source/datafusion
    DATAFUSION_DIR=/source/datafusion ./bench.sh run tpch

    **********
    * Commands
    **********
    data:    Generates or downloads data needed for benchmarking
    run:     Runs the named benchmark
    compare: Compares results from benchmark runs
    venv:    Creates new venv (unless already exists) and installs compare's requirements into it

    **********
    * Benchmarks
    **********
    all(default):           Data/Run/Compare for all benchmarks
    tpch:                   TPCH inspired benchmark on Scale Factor (SF) 1 (~1GB), single parquet file per table, hash join
    tpch_mem:               TPCH inspired benchmark on Scale Factor (SF) 1 (~1GB), query from memory
    tpch10:                 TPCH inspired benchmark on Scale Factor (SF) 10 (~10GB), single parquet file per table, hash join
    tpch_mem10:             TPCH inspired benchmark on Scale Factor (SF) 10 (~10GB), query from memory
    cancellation:           How long cancelling a query takes
    parquet:                Benchmark of parquet reader's filtering speed
    sort:                   Benchmark of sorting speed
    sort_tpch:              Benchmark of sorting speed for end-to-end sort queries on TPCH dataset
    clickbench_1:           ClickBench queries against a single parquet file
    clickbench_partitioned: ClickBench queries against a partitioned (100 files) parquet
    clickbench_extended:    ClickBench "inspired" queries against a single parquet (DataFusion specific)
    external_aggr:          External aggregation benchmark
    h2o_small:              h2oai benchmark with small dataset (1e7 rows) for groupby, default file format is csv
    h2o_medium:             h2oai benchmark with medium dataset (1e8 rows) for groupby, default file format is csv
    h2o_big:                h2oai benchmark with large dataset (1e9 rows) for groupby, default file format is csv
    h2o_small_join:         h2oai benchmark with small dataset (1e7 rows) for join, default file format is csv
    h2o_medium_join:        h2oai benchmark with medium dataset (1e8 rows) for join, default file format is csv
    h2o_big_join:           h2oai benchmark with large dataset (1e9 rows) for join, default file format is csv
    imdb:                   Join Order Benchmark (JOB) using the IMDB dataset converted to parquet

    **********
    * Supported Configuration (Environment Variables)
    **********
    DATA_DIR         directory to store datasets
    CARGO_COMMAND    command that runs the benchmark binary
    DATAFUSION_DIR   directory to use (default /Users/christian/MA/datafusion/benchmarks/..)
    RESULTS_NAME     folder where the benchmark files are stored
    PREFER_HASH_JOIN Prefer hash join algorithm (default true)
    VENV_PATH        Python venv to use for compare and venv commands (default ./venv, override by <your-venv>/bin/activate)
    DATAFUSION_*     Set the given datafusion configuration

* fmt
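The `DATAFUSION_*` convention mentioned in the commit above can be illustrated with a small standalone sketch. It assumes, without confirmation from this page, that a dotted configuration key maps to an environment variable by upper-casing it and replacing dots with underscores. `env_var_name` and `overrides_from_env` are hypothetical helpers written for illustration, not DataFusion APIs.

```rust
use std::collections::BTreeMap;

/// Assumed mapping: "datafusion.execution.batch_size"
/// becomes "DATAFUSION_EXECUTION_BATCH_SIZE".
fn env_var_name(key: &str) -> String {
    key.to_uppercase().replace('.', "_")
}

/// Hypothetical helper: given the config keys we care about and a set of
/// (name, value) pairs such as `std::env::vars()`, returns the keys that
/// have an override together with the raw override value.
fn overrides_from_env<I>(keys: &[&str], vars: I) -> BTreeMap<String, String>
where
    I: IntoIterator<Item = (String, String)>,
{
    let vars: BTreeMap<String, String> = vars.into_iter().collect();
    keys.iter()
        .filter_map(|k| {
            vars.get(&env_var_name(k))
                .map(|v| (k.to_string(), v.clone()))
        })
        .collect()
}
```

Passing `std::env::vars()` as the second argument would pick up overrides from the real process environment. Taking the variables as an iterator keeps the helper easy to test without mutating global state.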
…5764) * predicate pruning: support dictionaries * more types * clippy * add tests * add tests * simplify to dicts * revert most changes * just check for strings, more tests * more tests * remove unnecessary now confusing clause
* add fetch to CoalescePartitionsExecNode * gen proto code * Add test * fix * fix build * Fix test build * remove comments
Bumps [clap](https://github.com/clap-rs/clap) from 4.5.36 to 4.5.37. - [Release notes](https://github.com/clap-rs/clap/releases) - [Changelog](https://github.com/clap-rs/clap/blob/master/CHANGELOG.md) - [Commits](clap-rs/clap@clap_complete-v4.5.36...clap_complete-v4.5.37) --- updated-dependencies: - dependency-name: clap dependency-version: 4.5.37 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Fix `from_unixtime` function documentation * Update scalar_functions.md
* interval singleton * fmt * impl from
* refactor and make `QueryBuilder` more configurable. * fix tests. * fix clippy. * extract `QueryBuilder` to a dedicated module. * add `min_group_by_columns`, and fix some bugs.
Bumps [aws-config](https://github.com/smithy-lang/smithy-rs) from 1.6.1 to 1.6.2. - [Release notes](https://github.com/smithy-lang/smithy-rs/releases) - [Changelog](https://github.com/smithy-lang/smithy-rs/blob/main/CHANGELOG.md) - [Commits](https://github.com/smithy-lang/smithy-rs/commits) --- updated-dependencies: - dependency-name: aws-config dependency-version: 1.6.2 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…apache#15723) * Add slt tests for datafusion.execution.parquet.coerce_int96 setting * tweak
* Improve `ListingTable` / `ListingTableOptions` docs * Update datafusion/core/src/datasource/listing/table.rs Co-authored-by: Alex Huang <huangweijun1001@gmail.com> --------- Co-authored-by: Alex Huang <huangweijun1001@gmail.com>
…adline (apache#15883) I noticed that https://datafusion.apache.org/library-user-guide/upgrading.html#filescanconfig-filescanconfigbuilder had "FileScanConfig –> FileScanConfigBuilder" as a top-level headline. It should probably be under the 47 release
* Handle dicts for distinct count * Fix sqllogictests * Add bench * Fix no fix the bench * Do not panic if error type is bad * Add full bench query * Set the bench * Add dict of dict test * Fix tests * Rename method * Increase the grouping test * Increase the grouping test a bit more :) * Fix flakiness --------- Co-authored-by: Dmitrii Blaginin <blaginin@bmac.local>
* Add substrait roundtrip option in sqllogictests * Fix doc link and missing license header * Add README.md entry for the Substrait round-trip mode * Link tracking issue in README.md * Use clap's `conflicts_with` instead of manually checking flag compatibility * Add sqllogictest-substrait job to the CI * Revert committed formatting changes to README.md
* Work in progress adding user defined aggregate function FFI support * Intermediate work. Going through groups accumulator * MVP for aggregate udf via FFI * Clean up after rebase * Add unit test for FFI Accumulator Args * Adding unit tests and fixing memory errors in aggregate ffi udf * Working through additional unit and integration tests for UDAF ffi * Switch to an accumulator that supports convert to state to get a little better coverage * Set feature so we do not get an error warning in stable rustc * Add more options to test * Add unit test for FFI RecordBatchStream * Add a few more args to ffi accumulator test fn * Adding more unit tests on ffi aggregate udaf * taplo format * Update code comment * Correct function name * Temp fix record batch test dependencies * Address some comments * Revise comments and address PR comments * Remove commented code * Refactor GroupsAccumulator * Add documentation * Split integration tests * Address comments to refactor error handling for opt filter * Fix linting errors * Fix linting and add deref * Remove extra tests and unnecessary code * Adjustments to FFI aggregate functions after rebase on main * cargo fmt * cargo clippy * Re-implement cleaned up code that was removed in last push * Minor review comments --------- Co-authored-by: Crystal Zhou <crystal.zhouxiaoyue@hotmail.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
…ion (apache#16255) * Add BaselineMetrics to LazyMemoryStream * UT
* Initial commit of UDWF via FFI * Work in progress on integration testing of udwf * Rebase due to UDF changes upstream
…ndow expression" (apache#16307) * Revert "Improve performance of constant aggregate window expression (apache#16234)" This reverts commit 0c30374. * update changelog * update changelog
* [branch-48] Update CHANGELOG for latest 48.0.0 release * prettier
…n scan (apache#16646) (apache#16656) * respect parquet filter pushdown config in scan * Add test Co-authored-by: Adrian Garcia Badaracco <1755071+adriangb@users.noreply.github.com>
…tistics to true apache#16447 (apache#16659) * Set the default value of `datafusion.execution.collect_statistics` to `true` (apache#16447) * fix sqllogicaltests * Add upgrade note (cherry picked from commit 2d7ae09) * Update row group pruning --------- Co-authored-by: Adam Gutglick <adam@spiraldb.com>
…#16657) * Column indices were not computed correctly, causing a panic * Add unit tests Co-authored-by: Tim Saucer <timsaucer@gmail.com>
* Update version to 48.0.1 * Add link to upgrade guide in changelog script * prettier * update guide
…ata_cols -> filescanconfigbuilder, tests to use partitioncolumnprojector
… changed between releases. The test has been extended to select a multiple of the partitions such that we can assert on ids 0,1,2,3 entirely. The output of the failure BEFORE this change is below. I think that the new "actual" output is more correct, as it starts in order of the month/day buckets. Anyway, for purposes of testing the metadata columns, this works, and this is why this change was made.

expected:
    [
        "+----+------+----------------------------------------+----------------------+",
        "| id | size | location                               | last_modified        |",
        "+----+------+----------------------------------------+----------------------+",
        "| 0  | 1851 | year=2021/month=09/day=09/file.parquet | 1970-01-01T00:00:00Z |",
        "| 0  | 1851 | year=2021/month=10/day=09/file.parquet | 1970-01-01T00:00:00Z |",
        "| 0  | 1851 | year=2021/month=10/day=28/file.parquet | 1970-01-01T00:00:00Z |",
        "| 1  | 1851 | year=2021/month=09/day=09/file.parquet | 1970-01-01T00:00:00Z |",
        "| 1  | 1851 | year=2021/month=10/day=09/file.parquet | 1970-01-01T00:00:00Z |",
        "| 1  | 1851 | year=2021/month=10/day=28/file.parquet | 1970-01-01T00:00:00Z |",
        "| 2  | 1851 | year=2021/month=09/day=09/file.parquet | 1970-01-01T00:00:00Z |",
        "| 2  | 1851 | year=2021/month=10/day=09/file.parquet | 1970-01-01T00:00:00Z |",
        "| 2  | 1851 | year=2021/month=10/day=28/file.parquet | 1970-01-01T00:00:00Z |",
        "| 3  | 1851 | year=2021/month=10/day=28/file.parquet | 1970-01-01T00:00:00Z |",
        "+----+------+----------------------------------------+----------------------+",
    ]

actual:
    [
        "+----+------+----------------------------------------+----------------------+",
        "| id | size | location                               | last_modified        |",
        "+----+------+----------------------------------------+----------------------+",
        "| 0  | 1851 | year=2021/month=09/day=09/file.parquet | 1970-01-01T00:00:00Z |",
        "| 0  | 1851 | year=2021/month=10/day=09/file.parquet | 1970-01-01T00:00:00Z |",
        "| 0  | 1851 | year=2021/month=10/day=28/file.parquet | 1970-01-01T00:00:00Z |",
        "| 1  | 1851 | year=2021/month=09/day=09/file.parquet | 1970-01-01T00:00:00Z |",
        "| 1  | 1851 | year=2021/month=10/day=09/file.parquet | 1970-01-01T00:00:00Z |",
        "| 1  | 1851 | year=2021/month=10/day=28/file.parquet | 1970-01-01T00:00:00Z |",
        "| 2  | 1851 | year=2021/month=09/day=09/file.parquet | 1970-01-01T00:00:00Z |",
        "| 2  | 1851 | year=2021/month=10/day=09/file.parquet | 1970-01-01T00:00:00Z |",
        "| 2  | 1851 | year=2021/month=10/day=28/file.parquet | 1970-01-01T00:00:00Z |",
        "| 3  | 1851 | year=2021/month=09/day=09/file.parquet | 1970-01-01T00:00:00Z |",
        "+----+------+----------------------------------------+----------------------+",
    ]
…inate rewriting. the plan looks correct.
kczimm approved these changes on Jul 30, 2025
LGTM!
Changes
* `upstream-48.0.1` into `spiceai-48` (branched off of `spiceai-47`) to apply upstream changes to our previous DF release
* `ExtendedColumnProjector` -> `PartitionColumnProjector` + Spice AI tweaks
* `FileScanConfig` / `FileScanConfigBuilder` deprecations for our changes
* `create_physical_plan` with filters & related partition pruning support to examples and datasources

Diff of non-upstream changes

Test changes
Please look at these commits and ensure that the correct assumptions are being made.
* `test_metadata_columns` dependent on implicit sort behavior: d514acb
* `test_count_wildcard_on_sort` stale snapshot: a34bcf3
  * `CommonSubexprEliminate` on `count(*)` rewrote the plan. The new plan looks OK and more closely matches the DF API plan.
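
The implicit-sort dependency noted for `test_metadata_columns` is a general hazard when asserting on query output. As a minimal sketch (std-only Rust, with a hypothetical helper `same_rows_ignoring_order` that is not part of DataFusion or this PR), rows can be compared after sorting both sides so that the assertion does not depend on the order in which partitions happen to be scanned. This is an alternative technique for illustration only; the commit referenced above changes which rows the test selects instead.

```rust
/// Hypothetical helper: returns true when both slices contain the same
/// formatted rows with the same multiplicities, regardless of order.
fn same_rows_ignoring_order(expected: &[&str], actual: &[&str]) -> bool {
    // Copy the row references so the callers' slices are left untouched.
    let mut expected_sorted: Vec<&str> = expected.to_vec();
    let mut actual_sorted: Vec<&str> = actual.to_vec();

    // After sorting, equal multisets of rows become equal sequences.
    expected_sorted.sort_unstable();
    actual_sorted.sort_unstable();

    expected_sorted == actual_sorted
}
```

A test would call `same_rows_ignoring_order(&expected_rows, &actual_rows)` on the data rows only, leaving out the table border and header lines. The trade-off is that such a check no longer detects a genuine ordering regression, so it only fits queries that make no ordering guarantee.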