Disallow floating point pruning #28

Conversation
PR Change Summary

Disallowed floating point pruning and made modifications to better align with the Parquet specification.
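The change summarized above — refusing to prune row groups on floating point columns — can be sketched roughly as follows. This is a hedged illustration with hypothetical names (`PhysicalType`, `stats_usable_for_pruning` are not DataFusion's actual API): the idea is that IEEE 754 ordering is not total (NaN is unordered), so min/max statistics cannot be trusted to bound every value in a FLOAT or DOUBLE column.

```rust
// Hypothetical sketch, not DataFusion's actual code: conservatively
// refuse to use min/max statistics for pruning on floating point
// columns, since NaN values break the invariant min <= v <= max.

#[derive(Debug, Clone, Copy, PartialEq)]
enum PhysicalType {
    Int32,
    Int64,
    Float,
    Double,
    ByteArray,
}

/// Decide whether min/max statistics of this physical type may be
/// used for predicate pruning.
fn stats_usable_for_pruning(t: PhysicalType) -> bool {
    // Disallow FLOAT and DOUBLE entirely; everything else has a
    // total order, so its statistics bound all values.
    !matches!(t, PhysicalType::Float | PhysicalType::Double)
}

fn main() {
    assert!(stats_usable_for_pruning(PhysicalType::Int64));
    assert!(!stats_usable_for_pruning(PhysicalType::Float));
    // Why: NaN compares as unordered, so a predicate like `v >= -1.0`
    // is neither provably true nor provably false from min/max alone.
    let v = f64::NAN;
    assert!(!(v >= -1.0) && !(v <= 1.0));
    println!("ok");
}
```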
@etseidl I didn't look at the contents but I gave you push access to this fork so that you can just push your changes to the PR and take ownership of it 😄
Merged commit 4c82fbf into pydantic:column-orders-parquet-stats
Takeover complete 😄 Thanks @adriangb
@adriangb here's what I came up with. Sorry there's a lot of thrash here because I merged in main to get your changes to the allowed cast types for pruning. The real changes are pretty much in pruning.rs around line 1586. I also modified the `ColumnOrder` definitions somewhat to better match the Parquet spec.

Anyway, I think this works well, so thanks for getting the column orders pushed down. My only concern at this point is that this is all pretty Parquet specific, so I don't know how well received it will be. We could also try handling this at the data source level, but that seems more complicated (we'd probably have to reproduce a bunch of the predicate building code).
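For context on the Parquet-spec behavior at issue, here is a hedged paraphrase in Rust (my own sketch; `float_min_max` is a hypothetical helper, not parquet-rs or DataFusion code). The spec recommends that writers skip NaN when accumulating FLOAT/DOUBLE min/max statistics and normalize a zero min to -0.0 and a zero max to +0.0 — which means the stored statistics may not bound a NaN that is actually present in the data, and pruning on them is unsound.

```rust
// Hedged paraphrase of the Parquet spec's recommended FLOAT/DOUBLE
// statistics handling (hypothetical helper, not real library code):
// skip NaN, and normalize signed zeros at the bounds.
fn float_min_max(values: &[f64]) -> Option<(f64, f64)> {
    let mut min = f64::INFINITY;
    let mut max = f64::NEG_INFINITY;
    let mut seen = false;
    for &v in values {
        if v.is_nan() {
            continue; // NaNs are excluded from statistics
        }
        seen = true;
        if v < min { min = v; }
        if v > max { max = v; }
    }
    if !seen {
        return None; // all values were NaN: no usable statistics
    }
    // Normalize signed zeros: -0.0 as min, +0.0 as max.
    if min == 0.0 { min = -0.0; }
    if max == 0.0 { max = 0.0; }
    Some((min, max))
}

fn main() {
    // The NaN is silently dropped, so the resulting (min, max) of
    // (-0.0, 2.5) does NOT bound every value in the chunk.
    let (min, max) = float_min_max(&[f64::NAN, 0.0, 2.5]).unwrap();
    assert!(min == 0.0 && min.is_sign_negative());
    assert_eq!(max, 2.5);
    println!("ok");
}
```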
Let me know what you think of this. Thanks!