
Disallow floating point pruning #28


Merged

Conversation

etseidl (Collaborator) commented Apr 25, 2025

@adriangb here's what I came up with. Sorry there's a lot of thrash here because I merged in main to get your changes to the allowed cast types for pruning. The real changes are pretty much in pruning.rs around line 1586. I also modified the ColumnOrder definitions a bit to better match the Parquet spec.

Anyway, I think this works well, so thanks for getting the column orders pushed down. My only concern at this point is that this is all pretty Parquet-specific, so I don't know how well received it will be. We could also try handling this at the data source level, but that seems more complicated (we'd probably have to reproduce a bunch of the predicate-building code).

Let me know what you think of this. Thanks!
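
For readers skimming the diff, here is a minimal sketch of the core idea (rejecting min/max pruning for floating point columns). This is illustrative only; the function name and guard placement are assumptions, not the code merged into pruning.rs.

```rust
// Illustrative sketch, not the merged pruning.rs change.
use arrow::datatypes::DataType;

/// Min/max statistics for floating point columns are unreliable across
/// Parquet writers (NaN and -0.0 ordering has varied historically), so
/// pruning on them is unsound unless the file's ColumnOrder guarantees
/// a well-defined ordering.
fn prunable_type(dt: &DataType) -> bool {
    !matches!(
        dt,
        DataType::Float16 | DataType::Float32 | DataType::Float64
    )
}

fn main() {
    assert!(prunable_type(&DataType::Int32));
    assert!(!prunable_type(&DataType::Float64));
}
```

In the predicate rewriter, a guard like this would fall back to the "cannot prune" expression for float columns instead of building a min/max comparison.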

Garamda and others added 18 commits April 23, 2025 18:46
… functions (apache#13511)

* Add within group variable to aggregate function and arguments

* Support within group and disable null handling for ordered set aggregate functions (apache#13511)

* Refactored function to match updated signature

* Modify proto to support within group clause

* Modify physical planner and accumulator to support ordered set aggregate function

* Support session management for ordered set aggregate functions

* Align code, tests, and examples with changes to aggregate function logic

* Ensure compatibility with new `within_group` and `order_by` handling.

* Adjust tests and examples to align with the new logic.

* Fix typo in existing comments

* Enhance test

* Add test cases for changed signature

* Update signature in docs

* Fix bug: handle missing within_group when applying children tree node

* Change the signature of approx_percentile_cont for consistency

* Add missing within_group for expr display

* Handle edge case when over and within group clause are used together

* Apply clippy advice: avoids too many arguments

* Add new test cases using descending order

* Apply cargo fmt

* Revert unintended submodule changes

* Apply prettier guidance

* Apply doc guidance by update_function_doc.sh

* Rollback WITHIN GROUP and related logic after converting it into expr

* Avoid handling redundant logic

* Roll back ordered set aggregate functions from session; store the same info in the udf itself

* Convert within group to order by when converting sql to expr

* Add function to determine whether a function is an ordered-set aggregate

* Rollback within group from proto

* Utilize within group as order by in functions-aggregate

* Apply clippy

* Convert order by to within group

* Apply cargo fmt

* Remove plain line breaks

* Remove duplicated column arg in schema name

* Refactor boolean functions to just return primitive type

* Make within group necessary in the signature of existing ordered set aggr funcs

* Apply cargo fmt

* Support a single ordering expression in the signature

* Apply cargo fmt

* Add dataframe function test cases to verify descending ordering

* Apply cargo fmt

* Apply code reviews

* Use order by consistently once done with sql

* Remove redundant comment

* Serve a clearer error msg

* Handle error cases in the same code block

* Update error msg in test as corresponding code changed

* fix

---------

Co-authored-by: Jay Zhan <jayzhan211@gmail.com>
Bumps [env_logger](https://github.com/rust-cli/env_logger) from 0.11.7 to 0.11.8.
- [Release notes](https://github.com/rust-cli/env_logger/releases)
- [Changelog](https://github.com/rust-cli/env_logger/blob/main/CHANGELOG.md)
- [Commits](rust-cli/env_logger@v0.11.7...v0.11.8)

---
updated-dependencies:
- dependency-name: env_logger
  dependency-version: 0.11.8
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…pache#15828)

* add `memory_limit` to `MemoryPool`, and impl it for the pools in datafusion.

* Update datafusion/execution/src/memory_pool/mod.rs

Co-authored-by: Ruihang Xia <waynestxia@gmail.com>

---------

Co-authored-by: Ruihang Xia <waynestxia@gmail.com>
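
The `memory_limit` commit above reads naturally as a new trait method. A minimal sketch of the shape such an API could take follows; the enum, trait, and pool below are illustrative assumptions, not DataFusion's actual definitions.

```rust
// Hedged sketch only; the real trait lives in DataFusion's
// execution::memory_pool module and may differ in detail.

/// What a pool reports as its configured limit.
enum MemoryLimit {
    Finite(usize), // the pool refuses allocations beyond this many bytes
    Infinite,      // the pool never refuses an allocation
}

trait MemoryPool {
    /// Report the configured limit so operators can plan spilling early.
    fn memory_limit(&self) -> MemoryLimit;
}

/// A pool with a fixed byte budget.
struct GreedyPool {
    capacity: usize,
}

impl MemoryPool for GreedyPool {
    fn memory_limit(&self) -> MemoryLimit {
        MemoryLimit::Finite(self.capacity)
    }
}

fn main() {
    let pool = GreedyPool { capacity: 512 * 1024 * 1024 };
    match pool.memory_limit() {
        MemoryLimit::Finite(bytes) => println!("pool limited to {bytes} bytes"),
        MemoryLimit::Infinite => println!("pool is unbounded"),
    }
}
```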
* Preserve projection for inline scan

* fix

---------

Co-authored-by: Vadim Piven <vadim.piven@milaboratories.com>
Bumps [pyo3](https://github.com/pyo3/pyo3) from 0.24.1 to 0.24.2.
- [Release notes](https://github.com/pyo3/pyo3/releases)
- [Changelog](https://github.com/PyO3/pyo3/blob/main/CHANGELOG.md)
- [Commits](PyO3/pyo3@v0.24.1...v0.24.2)

---
updated-dependencies:
- dependency-name: pyo3
  dependency-version: 0.24.2
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…che#15822)

* Fix: fetch is missing in EnforceSort

* add ut test_parallelize_sort_preserves_fetch

* add ut: test_plan_with_order_preserving_variants_preserves_fetch

* update

* address comments
* Fix ILIKE expression support in SQL unparser (apache#76)

* update tests
…ng `map_err` (apache#15796)

* First Step

* Final Step?

* Homogenisation
* Read benchmark SessionConfig from env

* Set target partitions from env by default

fix

* Set batch size from env by default

* Fix batch size option for tpch ci

* Log environment variable configuration

* Document benchmarking env variable config

* Add DATAFUSION_* env config to bench.sh help output:

Orchestrates running benchmarks against DataFusion checkouts

Usage:
./bench.sh data [benchmark] [query]
./bench.sh run [benchmark]
./bench.sh compare <branch1> <branch2>
./bench.sh venv

**********
Examples:
**********
# Create the datasets for all benchmarks in /Users/christian/MA/datafusion/benchmarks/data
./bench.sh data

# Run the 'tpch' benchmark on the datafusion checkout in /source/datafusion
DATAFUSION_DIR=/source/datafusion ./bench.sh run tpch

**********
* Commands
**********
data:         Generates or downloads data needed for benchmarking
run:          Runs the named benchmark
compare:      Compares results from benchmark runs
venv:         Creates new venv (unless already exists) and installs compare's requirements into it

**********
* Benchmarks
**********
all(default): Data/Run/Compare for all benchmarks
tpch:                   TPCH inspired benchmark on Scale Factor (SF) 1 (~1GB), single parquet file per table, hash join
tpch_mem:               TPCH inspired benchmark on Scale Factor (SF) 1 (~1GB), query from memory
tpch10:                 TPCH inspired benchmark on Scale Factor (SF) 10 (~10GB), single parquet file per table, hash join
tpch_mem10:             TPCH inspired benchmark on Scale Factor (SF) 10 (~10GB), query from memory
cancellation:           How long cancelling a query takes
parquet:                Benchmark of parquet reader's filtering speed
sort:                   Benchmark of sorting speed
sort_tpch:              Benchmark of sorting speed for end-to-end sort queries on TPCH dataset
clickbench_1:           ClickBench queries against a single parquet file
clickbench_partitioned: ClickBench queries against a partitioned (100 files) parquet
clickbench_extended:    ClickBench "inspired" queries against a single parquet (DataFusion specific)
external_aggr:          External aggregation benchmark
h2o_small:              h2oai benchmark with small dataset (1e7 rows) for groupby,  default file format is csv
h2o_medium:             h2oai benchmark with medium dataset (1e8 rows) for groupby, default file format is csv
h2o_big:                h2oai benchmark with large dataset (1e9 rows) for groupby,  default file format is csv
h2o_small_join:         h2oai benchmark with small dataset (1e7 rows) for join,  default file format is csv
h2o_medium_join:        h2oai benchmark with medium dataset (1e8 rows) for join, default file format is csv
h2o_big_join:           h2oai benchmark with large dataset (1e9 rows) for join,  default file format is csv
imdb:                   Join Order Benchmark (JOB) using the IMDB dataset converted to parquet

**********
* Supported Configuration (Environment Variables)
**********
DATA_DIR            directory to store datasets
CARGO_COMMAND       command that runs the benchmark binary
DATAFUSION_DIR      directory to use (default /Users/christian/MA/datafusion/benchmarks/..)
RESULTS_NAME        folder where the benchmark files are stored
PREFER_HASH_JOIN    Prefer hash join algorithm (default true)
VENV_PATH           Python venv to use for compare and venv commands (default ./venv, override by <your-venv>/bin/activate)
DATAFUSION_*        Set the given datafusion configuration

* fmt
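
As a rough sketch of the env-driven defaults these commits describe: the helper below is hypothetical, and the variable name assumes DataFusion's usual DATAFUSION_<SECTION>_<KEY> mapping for `datafusion.execution.batch_size`.

```rust
use std::env;

/// Read a numeric benchmark default from the environment, falling back to a
/// hard-coded value when the variable is unset or unparsable.
fn env_or(name: &str, default: usize) -> usize {
    env::var(name)
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(default)
}

fn main() {
    // e.g. DATAFUSION_EXECUTION_BATCH_SIZE=16384 ./bench.sh run tpch
    let batch_size = env_or("DATAFUSION_EXECUTION_BATCH_SIZE", 8192);
    let partitions = env_or(
        "DATAFUSION_EXECUTION_TARGET_PARTITIONS",
        std::thread::available_parallelism().map(|n| n.get()).unwrap_or(1),
    );
    println!("batch_size={batch_size}, target_partitions={partitions}");
}
```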
…5764)

* predicate pruning: support dictionaries

* more types

* clippy

* add tests

* add tests

* simplify to dicts

* revert most changes

* just check for strings, more tests

* more tests

* remove unnecessary, now-confusing clause
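
For the dictionary-pruning commits just above ("just check for strings"), this is a hedged sketch of what such a type check can look like; the function and names are illustrative, not the merged code.

```rust
use arrow::datatypes::DataType;

/// Casting between a dictionary-encoded string column and a plain string
/// type is order-preserving, so min/max pruning stays sound across it.
fn is_string_or_string_dict(dt: &DataType) -> bool {
    match dt {
        DataType::Utf8 | DataType::LargeUtf8 => true,
        DataType::Dictionary(_, value) => {
            matches!(value.as_ref(), DataType::Utf8 | DataType::LargeUtf8)
        }
        _ => false,
    }
}

fn main() {
    assert!(is_string_or_string_dict(&DataType::Utf8));
    assert!(!is_string_or_string_dict(&DataType::Float32));
}
```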

hyperlint-ai bot commented Apr 25, 2025

PR Change Summary

Disallowed floating point pruning and made modifications to better align with the Parquet specification.

  • Disallowed floating point pruning in pruning.rs
  • Updated ColumnOrder definitions to match Parquet spec
  • Enhanced benchmark documentation with new configuration options
  • Added ArkFlow to the list of active projects using DataFusion

Modified Files

  • benchmarks/README.md
  • docs/source/user-guide/introduction.md
  • docs/source/user-guide/sql/aggregate_functions.md


adriangb (Member) commented:

@etseidl I didn't look at the contents but I gave you push access to this fork so that you can just push your changes to the PR and take ownership of it 😄

etseidl merged commit 4c82fbf into pydantic:column-orders-parquet-stats on Apr 25, 2025
1 of 2 checks passed
etseidl (Collaborator, Author) commented Apr 25, 2025

Takeover complete 😄

Thanks @adriangb
