
Disallow floating point pruning #28


Merged

Conversation

etseidl (Collaborator) commented Apr 25, 2025

@adriangb here's what I came up with. Sorry there's a lot of thrash here because I merged in main to get your changes to the allowed cast types for pruning. The real changes are pretty much in pruning.rs around line 1586. I also modified the ColumnOrder definitions a bit to better match the Parquet spec.

Anyway, I think this works well, so thanks for getting the column orders pushed down. My only concern at this point is that this is all pretty Parquet-specific, so I don't know how well received it will be. We could also try handling this at the data source level, but that seems more complicated (we'd probably have to reproduce a bunch of the predicate-building code).

Let me know what you think of this. Thanks!
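
For readers skimming the diff, here is a minimal sketch of the core idea (rejecting min/max pruning for floating point columns). This is illustrative only; the function name and guard placement are assumptions, not the code merged into pruning.rs.

```rust
// Illustrative sketch, not the merged pruning.rs change.
use arrow::datatypes::DataType;

/// Min/max statistics for floating point columns are unreliable across
/// Parquet writers (NaN and -0.0 ordering has varied historically), so
/// pruning on them is unsound unless the file's ColumnOrder guarantees
/// a well-defined ordering.
fn prunable_type(dt: &DataType) -> bool {
    !matches!(
        dt,
        DataType::Float16 | DataType::Float32 | DataType::Float64
    )
}

fn main() {
    assert!(prunable_type(&DataType::Int32));
    assert!(!prunable_type(&DataType::Float64));
}
```

In the predicate rewriter, a guard like this would fall back to the "cannot prune" expression for float columns instead of building a min/max comparison.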

Garamda and others added 18 commits April 23, 2025 18:46
… functions (apache#13511)

* Add within group variable to aggregate function and arguments

* Support within group and disable null handling for ordered set aggregate functions (apache#13511)

* Refactored function to match updated signature

* Modify proto to support within group clause

* Modify physical planner and accumulator to support ordered set aggregate function

* Support session management for ordered set aggregate functions

* Align code, tests, and examples with changes to aggregate function logic

* Ensure compatibility with new `within_group` and `order_by` handling.

* Adjust tests and examples to align with the new logic.

* Fix typo in existing comments

* Enhance test

* Add test cases for changed signature

* Update signature in docs

* Fix bug: handle missing within_group when applying children tree node

* Change the signature of approx_percentile_cont for consistency

* Add missing within_group for expr display

* Handle edge case when over and within group clause are used together

* Apply clippy advice: avoids too many arguments

* Add new test cases using descending order

* Apply cargo fmt

* Revert unintended submodule changes

* Apply prettier guidance

* Apply doc guidance by update_function_doc.sh

* Rollback WITHIN GROUP and related logic after converting it into expr

* Avoid handling redundant logic

* Roll back ordered set aggregate functions from session; store the same info in the udf itself

* Convert within group to order by when converting sql to expr

* Add function to determine whether a function is an ordered-set aggregate

* Rollback within group from proto

* Utilize within group as order by in functions-aggregate

* Apply clippy

* Convert order by to within group

* Apply cargo fmt

* Remove plain line breaks

* Remove duplicated column arg in schema name

* Refactor boolean functions to just return primitive type

* Make within group necessary in the signature of existing ordered set aggr funcs

* Apply cargo fmt

* Support a single ordering expression in the signature

* Apply cargo fmt

* Add dataframe function test cases to verify descending ordering

* Apply cargo fmt

* Apply code reviews

* Use order by consistently once done with sql

* Remove redundant comment

* Serve a clearer error msg

* Handle error cases in the same code block

* Update error msg in test as corresponding code changed

* fix

---------

Co-authored-by: Jay Zhan <jayzhan211@gmail.com>
Bumps [env_logger](https://github.com/rust-cli/env_logger) from 0.11.7 to 0.11.8.
- [Release notes](https://github.com/rust-cli/env_logger/releases)
- [Changelog](https://github.com/rust-cli/env_logger/blob/main/CHANGELOG.md)
- [Commits](rust-cli/env_logger@v0.11.7...v0.11.8)

---
updated-dependencies:
- dependency-name: env_logger
  dependency-version: 0.11.8
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…pache#15828)

* add `memory_limit` to `MemoryPool`, and impl it for the pools in datafusion.

* Update datafusion/execution/src/memory_pool/mod.rs

Co-authored-by: Ruihang Xia <waynestxia@gmail.com>

---------

Co-authored-by: Ruihang Xia <waynestxia@gmail.com>
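
The `memory_limit` commit above reads naturally as a new trait method. A minimal sketch of the shape such an API could take follows; the enum, trait, and pool below are illustrative assumptions, not DataFusion's actual definitions.

```rust
// Hedged sketch only; the real trait lives in DataFusion's
// execution::memory_pool module and may differ in detail.

/// What a pool reports as its configured limit.
enum MemoryLimit {
    Finite(usize), // the pool refuses allocations beyond this many bytes
    Infinite,      // the pool never refuses an allocation
}

trait MemoryPool {
    /// Report the configured limit so operators can plan spilling early.
    fn memory_limit(&self) -> MemoryLimit;
}

/// A pool with a fixed byte budget.
struct GreedyPool {
    capacity: usize,
}

impl MemoryPool for GreedyPool {
    fn memory_limit(&self) -> MemoryLimit {
        MemoryLimit::Finite(self.capacity)
    }
}

fn main() {
    let pool = GreedyPool { capacity: 512 * 1024 * 1024 };
    match pool.memory_limit() {
        MemoryLimit::Finite(bytes) => println!("pool limited to {bytes} bytes"),
        MemoryLimit::Infinite => println!("pool is unbounded"),
    }
}
```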
* Preserve projection for inline scan

* fix

---------

Co-authored-by: Vadim Piven <vadim.piven@milaboratories.com>
Bumps [pyo3](https://github.com/pyo3/pyo3) from 0.24.1 to 0.24.2.
- [Release notes](https://github.com/pyo3/pyo3/releases)
- [Changelog](https://github.com/PyO3/pyo3/blob/main/CHANGELOG.md)
- [Commits](PyO3/pyo3@v0.24.1...v0.24.2)

---
updated-dependencies:
- dependency-name: pyo3
  dependency-version: 0.24.2
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…che#15822)

* Fix: fetch is missing in EnforceSort

* add ut test_parallelize_sort_preserves_fetch

* add ut: test_plan_with_order_preserving_variants_preserves_fetch

* update

* address comments
* Fix ILIKE expression support in SQL unparser (apache#76)

* update tests
…ng `map_err` (apache#15796)

* First Step

* Final Step?

* Homogenisation
* Read benchmark SessionConfig from env

* Set target partitions from env by default

fix

* Set batch size from env by default

* Fix batch size option for tpch ci

* Log environment variable configuration

* Document benchmarking env variable config

* Add DATAFUSION_* env config to bench.sh help output:

Orchestrates running benchmarks against DataFusion checkouts

Usage:
./bench.sh data [benchmark] [query]
./bench.sh run [benchmark]
./bench.sh compare <branch1> <branch2>
./bench.sh venv

**********
Examples:
**********
# Create the datasets for all benchmarks in /Users/christian/MA/datafusion/benchmarks/data
./bench.sh data

# Run the 'tpch' benchmark on the datafusion checkout in /source/datafusion
DATAFUSION_DIR=/source/datafusion ./bench.sh run tpch

**********
* Commands
**********
data:         Generates or downloads data needed for benchmarking
run:          Runs the named benchmark
compare:      Compares results from benchmark runs
venv:         Creates new venv (unless already exists) and installs compare's requirements into it

**********
* Benchmarks
**********
all(default): Data/Run/Compare for all benchmarks
tpch:                   TPCH inspired benchmark on Scale Factor (SF) 1 (~1GB), single parquet file per table, hash join
tpch_mem:               TPCH inspired benchmark on Scale Factor (SF) 1 (~1GB), query from memory
tpch10:                 TPCH inspired benchmark on Scale Factor (SF) 10 (~10GB), single parquet file per table, hash join
tpch_mem10:             TPCH inspired benchmark on Scale Factor (SF) 10 (~10GB), query from memory
cancellation:           How long cancelling a query takes
parquet:                Benchmark of parquet reader's filtering speed
sort:                   Benchmark of sorting speed
sort_tpch:              Benchmark of sorting speed for end-to-end sort queries on TPCH dataset
clickbench_1:           ClickBench queries against a single parquet file
clickbench_partitioned: ClickBench queries against a partitioned (100 files) parquet
clickbench_extended:    ClickBench "inspired" queries against a single parquet (DataFusion specific)
external_aggr:          External aggregation benchmark
h2o_small:              h2oai benchmark with small dataset (1e7 rows) for groupby,  default file format is csv
h2o_medium:             h2oai benchmark with medium dataset (1e8 rows) for groupby, default file format is csv
h2o_big:                h2oai benchmark with large dataset (1e9 rows) for groupby,  default file format is csv
h2o_small_join:         h2oai benchmark with small dataset (1e7 rows) for join,  default file format is csv
h2o_medium_join:        h2oai benchmark with medium dataset (1e8 rows) for join, default file format is csv
h2o_big_join:           h2oai benchmark with large dataset (1e9 rows) for join,  default file format is csv
imdb:                   Join Order Benchmark (JOB) using the IMDB dataset converted to parquet

**********
* Supported Configuration (Environment Variables)
**********
DATA_DIR            directory to store datasets
CARGO_COMMAND       command that runs the benchmark binary
DATAFUSION_DIR      directory to use (default /Users/christian/MA/datafusion/benchmarks/..)
RESULTS_NAME        folder where the benchmark files are stored
PREFER_HASH_JOIN    Prefer hash join algorithm (default true)
VENV_PATH           Python venv to use for compare and venv commands (default ./venv, override by <your-venv>/bin/activate)
DATAFUSION_*        Set the given datafusion configuration

* fmt
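
As a rough sketch of the env-driven defaults these commits describe: the helper below is hypothetical, and the variable name assumes DataFusion's usual DATAFUSION_<SECTION>_<KEY> mapping for `datafusion.execution.batch_size`.

```rust
use std::env;

/// Read a numeric benchmark default from the environment, falling back to a
/// hard-coded value when the variable is unset or unparsable.
fn env_or(name: &str, default: usize) -> usize {
    env::var(name)
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(default)
}

fn main() {
    // e.g. DATAFUSION_EXECUTION_BATCH_SIZE=16384 ./bench.sh run tpch
    let batch_size = env_or("DATAFUSION_EXECUTION_BATCH_SIZE", 8192);
    let partitions = env_or(
        "DATAFUSION_EXECUTION_TARGET_PARTITIONS",
        std::thread::available_parallelism().map(|n| n.get()).unwrap_or(1),
    );
    println!("batch_size={batch_size}, target_partitions={partitions}");
}
```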
…5764)

* predicate pruning: support dictionaries

* more types

* clippy

* add tests

* add tests

* simplify to dicts

* revert most changes

* just check for strings, more tests

* more tests

* remove unnecessary, now-confusing clause
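
For the dictionary-pruning commits just above ("just check for strings"), this is a hedged sketch of what such a type check can look like; the function and names are illustrative, not the merged code.

```rust
use arrow::datatypes::DataType;

/// Casting between a dictionary-encoded string column and a plain string
/// type is order-preserving, so min/max pruning stays sound across it.
fn is_string_or_string_dict(dt: &DataType) -> bool {
    match dt {
        DataType::Utf8 | DataType::LargeUtf8 => true,
        DataType::Dictionary(_, value) => {
            matches!(value.as_ref(), DataType::Utf8 | DataType::LargeUtf8)
        }
        _ => false,
    }
}

fn main() {
    assert!(is_string_or_string_dict(&DataType::Utf8));
    assert!(!is_string_or_string_dict(&DataType::Float32));
}
```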

hyperlint-ai bot commented Apr 25, 2025

PR Change Summary

Disallowed floating point pruning and made modifications to better align with the Parquet specification.

  • Disallowed floating point pruning in pruning.rs
  • Updated ColumnOrder definitions to match Parquet spec
  • Enhanced benchmark documentation with new configuration options
  • Added ArkFlow to the list of active projects using DataFusion

Modified Files

  • benchmarks/README.md
  • docs/source/user-guide/introduction.md
  • docs/source/user-guide/sql/aggregate_functions.md


adriangb (Member) commented:

@etseidl I didn't look at the contents but I gave you push access to this fork so that you can just push your changes to the PR and take ownership of it 😄

etseidl merged commit 4c82fbf into pydantic:column-orders-parquet-stats on Apr 25, 2025
1 of 2 checks passed
etseidl (Collaborator, Author) commented Apr 25, 2025

Takeover complete 😄

Thanks @adriangb
