forked from apache/spark
[pull] master from apache:master #1108
Merged
…gate functions

### What changes were proposed in this pull request?

This PR adds comprehensive documentation for Spark SQL's sketch-based approximate functions powered by the Apache DataSketches library. The new documentation page (`sql-ref-sketch-aggregates.md`) covers:

**Function Reference:**
- **HyperLogLog (HLL) Sketch Functions**: `hll_sketch_agg`, `hll_union_agg`, `hll_sketch_estimate`, `hll_union`
- **Theta Sketch Functions**: `theta_sketch_agg`, `theta_union_agg`, `theta_intersection_agg`, `theta_sketch_estimate`, `theta_union`, `theta_intersection`, `theta_difference`
- **KLL Quantile Sketch Functions**: `kll_sketch_agg_*`, `kll_sketch_to_string_*`, `kll_sketch_get_n_*`, `kll_sketch_merge_*`, `kll_sketch_get_quantile_*`, `kll_sketch_get_rank_*`
- **Approximate Top-K Functions**: `approx_top_k_accumulate`, `approx_top_k_combine`, `approx_top_k_estimate`

**Best Practices:**
- Guidance on choosing between HLL and Theta sketches
- Accuracy vs. memory trade-offs for each sketch type
- Tips for storing and reusing sketches

**Common Use Cases and Examples:**
- Tracking daily unique users with HLL sketches (ETL workflow)
- Computing percentiles over time with KLL sketches
- Set operations with Theta sketches (intersection, difference for cohort analysis)
- Finding trending items with Top-K sketches

The PR also adds links to this new documentation page from:
- `sql-ref-functions.md` (under Aggregate-like Functions)
- `sql-ref.md` (under Functions section)
- `_data/menu-sql.yaml` (navigation menu)

### Why are the changes needed?

Spark SQL has added several sketch-based approximate functions using the Apache DataSketches library (HLL sketches in 3.5.0, Theta/KLL/Top-K sketches in 4.1.0), but there was no comprehensive documentation explaining:
- How to use these functions together in practical ETL workflows
- How to store sketches and merge them across multiple data batches
- Best practices for choosing the right sketch type and tuning accuracy parameters

This documentation fills that gap and helps users understand the full power of sketch-based analytics in Spark SQL.

### Does this PR introduce _any_ user-facing change?

Yes, this PR adds new documentation pages that are user-facing. No code changes are included.

### How was this patch tested?

Documentation-only change. The examples were verified against the existing function implementations and test cases in the codebase.

### Was this patch authored or co-authored using generative AI tooling?

Yes, code assistance with `claude-4.5-opus-high` in combination with manual editing by the author.

Closes #53297 from dtenedor/sketch-function-docs.

Authored-by: Daniel Tenedorio <daniel.tenedorio@databricks.com>
Signed-off-by: Daniel Tenedorio <daniel.tenedorio@databricks.com>
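
The "store sketches and merge them across batches" idea from the commit above can be illustrated with a toy sketch. This is a minimal pure-Python K-Minimum-Values sketch, written only to show why mergeable sketches suit ETL workflows; it is not the Apache DataSketches implementation and not Spark's code, and the names `build_sketch`, `merge_sketches`, and `estimate` are hypothetical. They play roughly the roles of an aggregate, a union, and an estimate function.

```python
import hashlib

K = 64  # sketch size (assumed value; larger K gives lower estimation error)


def _hash01(value):
    """Hash a value to a float in [0, 1] using the first 8 bytes of SHA-256."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def build_sketch(values, k=K):
    """Keep only the k smallest distinct hash values seen in one batch."""
    return sorted({_hash01(v) for v in values})[:k]


def merge_sketches(a, b, k=K):
    """Union two sketches: the k smallest hashes of the combined set."""
    return sorted(set(a) | set(b))[:k]


def estimate(sketch, k=K):
    """Estimate the distinct count from the k-th smallest hash value."""
    if len(sketch) < k:
        return float(len(sketch))  # fewer than k distinct values: count is exact
    return (k - 1) / sketch[k - 1]


# Two daily batches with overlapping user ids, sketched independently.
day1 = build_sketch(range(0, 1000))
day2 = build_sketch(range(500, 1500))
both_days = merge_sketches(day1, day2)
```

Merging the two stored daily sketches gives exactly the same sketch as re-scanning all the raw rows, which is the property that lets sketches be persisted per batch and combined later.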
….test_with_none_and_nan`

### What changes were proposed in this pull request?

Creating a DataFrame from an ndarray containing NaN values had a bug: NaN was incorrectly converted to null when Arrow optimization was on. The bug happened to be resolved in #53280.

### Why are the changes needed?

For test coverage.

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

CI.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #53305 from zhengruifeng/reenable_test_with_none_and_nan.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
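
The NaN-versus-null distinction behind this bug can be shown in plain Python. The sketch below is an illustration only, not Spark's conversion code; `classify` and `buggy_is_null` are hypothetical helpers. NaN is a real floating-point value, while `None` marks a missing one, so a check that lumps them together loses information.

```python
import math


def classify(value):
    """Distinguish a missing value (None) from a float NaN."""
    if value is None:
        return "null"
    if isinstance(value, float) and math.isnan(value):
        return "nan"
    return "value"


def buggy_is_null(value):
    """Hypothetical faulty check: treats NaN as missing (NaN != NaN is True)."""
    return value is None or value != value


column = [1.0, float("nan"), None]
kinds = [classify(v) for v in column]
conflated = [buggy_is_null(v) for v in column]
```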
### What changes were proposed in this pull request?

Optimize Py4J calls in schema inference.

### Why are the changes needed?

To fetch all configs in a single Py4J call.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #53300 from zhengruifeng/py4j_infer_schema.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
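
The benefit of fetching all configs in one call is fewer round trips across the Python/JVM boundary. The sketch below models that with a stand-in object that counts calls; `FakeGateway`, `get`, `get_many`, and the config keys are all hypothetical and are not Py4J or PySpark APIs.

```python
class FakeGateway:
    """Stand-in for a remote process; every method call is one round trip."""

    def __init__(self, confs):
        self._confs = dict(confs)
        self.round_trips = 0

    def get(self, key):
        """Fetch one config value (one round trip per key)."""
        self.round_trips += 1
        return self._confs[key]

    def get_many(self, keys):
        """Fetch several config values in a single round trip."""
        self.round_trips += 1
        return {k: self._confs[k] for k in keys}


# Hypothetical config keys, used only for illustration.
confs = {"example.conf.a": "1", "example.conf.b": "true", "example.conf.c": "x"}
keys = list(confs)

one_by_one = FakeGateway(confs)
values_a = {k: one_by_one.get(k) for k in keys}

batched = FakeGateway(confs)
values_b = batched.get_many(keys)
```

Both approaches return the same values; only the number of boundary crossings differs.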
…lizer` with `GroupPandasUDFSerializer`

### What changes were proposed in this pull request?

This PR consolidates `GroupPandasUDFSerializer` to support both `SQL_GROUPED_MAP_PANDAS_UDF` and `SQL_GROUPED_MAP_PANDAS_ITER_UDF`, aligning with the design pattern used by `GroupArrowUDFSerializer`.

### Why are the changes needed?

When the `Iterator[pandas.DataFrame]` API was added to `groupBy().applyInPandas()` in SPARK-53614 (#52716), a new `GroupPandasIterUDFSerializer` class was created. However, this class is nearly identical to `GroupPandasUDFSerializer`, differing only in whether batches are processed lazily (iterator mode) or all at once (regular mode).

### Does this PR introduce _any_ user-facing change?

No, this is an internal refactoring that maintains backward compatibility. The API behavior remains the same from the user's perspective.

### How was this patch tested?

Existing test cases.

### Was this patch authored or co-authored using generative AI tooling?

Co-Generated-by: Cursor with Claude 4.5 Sonnet

Closes #53043 from Yicong-Huang/SPARK-54316/refactor/consolidate-pandas-iter-serializer.

Lead-authored-by: Yicong-Huang <17627829+Yicong-Huang@users.noreply.github.com>
Co-authored-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
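
The consolidation described above, one class covering both lazy and all-at-once processing, might look roughly like the following. This is a simplified sketch using plain lists instead of pandas or Arrow batches; `GroupSerializerSketch`, `iterator_mode`, and `batch_source` are hypothetical names, and the real serializer's structure is not shown in this commit message.

```python
class GroupSerializerSketch:
    """One class handling both modes, selected by a flag."""

    def __init__(self, iterator_mode):
        self.iterator_mode = iterator_mode

    def load_group(self, batches):
        if self.iterator_mode:
            # Iterator mode: hand batches over lazily, one at a time.
            return (list(batch) for batch in batches)
        # Regular mode: consume every batch and return the group at once.
        merged = []
        for batch in batches:
            merged.extend(batch)
        return merged


def batch_source(log):
    """Yield three small batches, recording when each one is produced."""
    for i in range(3):
        log.append(i)
        yield [i, i * 10]
```

With `iterator_mode=True`, no batch is pulled from the source until the caller asks for it; with `iterator_mode=False`, the whole group is materialized before the call returns.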
### What changes were proposed in this pull request?

This PR adds a script and a supporting directory that give users a native debugging experience with VSCode (Cursor), not only for driver code but also for UDFs, workers, and the daemon.

<img width="2324" height="1010" alt="image" src="https://github.com/user-attachments/assets/57d1bbae-53c6-48e9-8910-25a47e4f0e24" />

### Why are the changes needed?

So users can debug their (local) code with the VSCode debugger.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Works locally. This does not touch Spark code; it is just a dev tool.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #53289 from gaogaotiantian/support-vscode-breakpoint.

Authored-by: Tian Gao <gaogaotiantian@hotmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
…ndlingMode` in pyspark pandas doctest

### What changes were proposed in this pull request?

After #53299, explicitly set the conf `spark.sql.execution.pandas.structHandlingMode` to `row`.

This is needed because, when Arrow optimization was previously disabled, structs were converted to Row objects by default. With Arrow optimization enabled, structs are converted to dicts, or an exception is raised if there are duplicated nested field names. To match the documented behavior after enabling Arrow by default, we explicitly set this conf to `row`.

### Why are the changes needed?

Fix the pyspark-pandas doctest and remove the skip of doctests.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI running the pyspark-pandas doctest.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #53301 from asl3/pysparkpandasdoctest.

Authored-by: Amanda Liu <amanda.liu@databricks.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
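
Why duplicated nested field names matter for one struct representation but not another can be sketched in plain Python. This is an illustration only, not PySpark's conversion code: `struct_to_row` and `struct_to_dict` are hypothetical helpers, and the tuple is a simplification of a Row object (which also carries field names). A positional container tolerates repeated names, while a dict cannot hold two values under one key.

```python
def struct_to_row(fields):
    """Positional representation: duplicate field names are harmless."""
    return tuple(value for _, value in fields)


def struct_to_dict(fields):
    """Keyed representation: a duplicate field name cannot be represented."""
    result = {}
    for name, value in fields:
        if name in result:
            raise ValueError(f"duplicate field name: {name}")
        result[name] = value
    return result


unique = [("a", 1), ("b", 2)]
duplicated = [("a", 1), ("a", 2)]

row = struct_to_row(duplicated)   # works: values kept by position
as_dict = struct_to_dict(unique)  # works: names are unique
```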
…nnect

### What changes were proposed in this pull request?

When `spark.sql.connect.enrichError.enabled` is enabled (default), `message_parameters` is retrieved from `root_error.spark_throwable` and set correctly. However, `error_class` was not being retrieved/set in the same way, causing the error class information to be lost on the PySpark Connect client side. This PR fixes it by propagating the error class correctly.

### Why are the changes needed?

Fix a bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New unit tests.

### Was this patch authored or co-authored using generative AI tooling?

Yes.

Closes #53288 from shujingyang-db/fix-spark-connect-config-error.

Authored-by: Shujing Yang <shujing.yang@databricks.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
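
The shape of this fix, reading `error_class` from the same place as `message_parameters`, can be sketched with plain dictionaries. The dictionary layout, the function names, and the `EXAMPLE_ERROR` class are all assumptions made for illustration; the actual PySpark Connect client works with protobuf messages and is not reproduced here.

```python
def build_client_error_before(root_error):
    """Hypothetical pre-fix behavior: only message_parameters is propagated."""
    throwable = root_error.get("spark_throwable") or {}
    return {
        "message_parameters": dict(throwable.get("message_parameters", {})),
        "error_class": None,  # the error class is lost
    }


def build_client_error_after(root_error):
    """Hypothetical post-fix behavior: both fields come from spark_throwable."""
    throwable = root_error.get("spark_throwable") or {}
    return {
        "message_parameters": dict(throwable.get("message_parameters", {})),
        "error_class": throwable.get("error_class"),
    }


# Hypothetical enriched error payload.
root_error = {
    "spark_throwable": {
        "error_class": "EXAMPLE_ERROR",
        "message_parameters": {"key": "some.conf"},
    }
}
before = build_client_error_before(root_error)
after = build_client_error_after(root_error)
```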
…xec`"

### What changes were proposed in this pull request?

Clean revert of d65234b. A later change will handle the case of sourceSide child nodes without `numOutputRows`, and the new implementation will be re-targeted to a later Spark release.

### Why are the changes needed?

The current implementation may grab the incorrect `numOutputRows` metric if there is an intermediary node (such as a custom Spark operator) which does not support the metric. This is because we target the first sourceSide child node with `numOutputRows`. If a SparkExtension node does not contain this metric but transforms the source table, then we could progress all the way to the source table and grab the incorrect metric.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing CI, as this is a revert.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #53293 from asl3/numsourcerowsrevert.

Authored-by: Amanda Liu <amanda.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
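
The failure mode described above can be reproduced with a toy plan tree. This is a simplified model, not Spark's physical plan classes: `Node`, `first_num_output_rows`, the node names, and the row counts are hypothetical. A walk that returns the first `numOutputRows` it finds skips over a custom operator lacking the metric and reports the source scan's count, even though that operator changed the number of rows.

```python
class Node:
    """Toy plan node with an optional metrics dict and a single child."""

    def __init__(self, name, metrics=None, child=None):
        self.name = name
        self.metrics = metrics or {}
        self.child = child


def first_num_output_rows(node):
    """Walk down the tree and return the first numOutputRows metric found."""
    while node is not None:
        if "numOutputRows" in node.metrics:
            return node.metrics["numOutputRows"]
        node = node.child
    return None


# The scan emits 100 rows; the custom operator keeps only 10 of them
# but exposes no numOutputRows metric of its own.
scan = Node("SourceScan", metrics={"numOutputRows": 100})
custom = Node("CustomExtensionFilter", child=scan)

reported = first_num_output_rows(custom)
rows_actually_passed_on = 10
```

Here `reported` is the scan's count, not the number of rows that actually left the custom operator, which is the mismatch that motivated the revert.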
See Commits and Changes for more details.
Created by
pull[bot] (v2.0.0-alpha.4)
Can you help keep this open source service alive? 💖 Please sponsor : )