[pull] master from apache:master #1108

pull · 2025-12-04T05:41:08Z

See Commits and Changes for more details.

Created by pull[bot] (v2.0.0-alpha.4)


…gate functions

### What changes were proposed in this pull request?

This PR adds comprehensive documentation for Spark SQL's sketch-based approximate functions powered by the Apache DataSketches library. The new documentation page (`sql-ref-sketch-aggregates.md`) covers:

**Function Reference:**
- **HyperLogLog (HLL) Sketch Functions**: `hll_sketch_agg`, `hll_union_agg`, `hll_sketch_estimate`, `hll_union`
- **Theta Sketch Functions**: `theta_sketch_agg`, `theta_union_agg`, `theta_intersection_agg`, `theta_sketch_estimate`, `theta_union`, `theta_intersection`, `theta_difference`
- **KLL Quantile Sketch Functions**: `kll_sketch_agg_*`, `kll_sketch_to_string_*`, `kll_sketch_get_n_*`, `kll_sketch_merge_*`, `kll_sketch_get_quantile_*`, `kll_sketch_get_rank_*`
- **Approximate Top-K Functions**: `approx_top_k_accumulate`, `approx_top_k_combine`, `approx_top_k_estimate`

**Best Practices:**
- Guidance on choosing between HLL and Theta sketches
- Accuracy vs. memory trade-offs for each sketch type
- Tips for storing and reusing sketches

**Common Use Cases and Examples:**
- Tracking daily unique users with HLL sketches (ETL workflow)
- Computing percentiles over time with KLL sketches
- Set operations with Theta sketches (intersection, difference for cohort analysis)
- Finding trending items with Top-K sketches

The PR also adds links to this new documentation page from:
- `sql-ref-functions.md` (under Aggregate-like Functions)
- `sql-ref.md` (under Functions section)
- `_data/menu-sql.yaml` (navigation menu)

### Why are the changes needed?

Spark SQL has added several sketch-based approximate functions using the Apache DataSketches library (HLL sketches in 3.5.0, Theta/KLL/Top-K sketches in 4.1.0), but there was no comprehensive documentation explaining:
- How to use these functions together in practical ETL workflows
- How to store sketches and merge them across multiple data batches
- Best practices for choosing the right sketch type and tuning accuracy parameters

This documentation fills that gap and helps users understand the full power of sketch-based analytics in Spark SQL.

### Does this PR introduce _any_ user-facing change?

Yes, this PR adds new documentation pages that are user-facing. No code changes are included.

### How was this patch tested?

Documentation-only change. The examples were verified against the existing function implementations and test cases in the codebase.

### Was this patch authored or co-authored using generative AI tooling?

Yes, code assistance with `claude-4.5-opus-high` in combination with manual editing by the author.

Closes #53297 from dtenedor/sketch-function-docs.

Authored-by: Daniel Tenedorio <daniel.tenedorio@databricks.com>
Signed-off-by: Daniel Tenedorio <daniel.tenedorio@databricks.com>

….test_with_none_and_nan`

### What changes were proposed in this pull request?

There was a bug when creating a DataFrame from an ndarray containing NaN values: NaN was incorrectly converted to null when Arrow optimization was on. It happened to be resolved in #53280.

### Why are the changes needed?

For test coverage.

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

CI.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #53305 from zhengruifeng/reenable_test_with_none_and_nan.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
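The distinction the re-enabled test guards can be illustrated without Spark: NaN is a real floating-point value, while None (null) means the value is missing, so collapsing one into the other loses information. The following is a minimal sketch in plain Python; the helper `classify` is hypothetical and is not part of PySpark or of the test itself.

```python
import math


def classify(value):
    """Label a value as 'null', 'nan', or 'number' (hypothetical helper)."""
    if value is None:
        return "null"
    if isinstance(value, float) and math.isnan(value):
        return "nan"
    return "number"


# A column holding a regular float, a NaN, and a missing value.
values = [1.0, float("nan"), None]
labels = [classify(v) for v in values]
```

A conversion that preserves the data should leave the three labels distinct; the bug described above would have made the NaN entry indistinguishable from the null one.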

### What changes were proposed in this pull request?

Optimize Py4J calls in schema inference.

### Why are the changes needed?

To fetch all configs in a single Py4J call.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #53300 from zhengruifeng/py4j_infer_schema.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
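The motivation for batching config lookups can be sketched with a toy gateway that counts round trips. Everything here is an assumption for illustration: `FakeGateway`, `get`, and `get_many` are hypothetical names and do not correspond to the Py4J or PySpark API touched by the commit.

```python
class FakeGateway:
    """Hypothetical stand-in for a cross-process bridge; counts round trips."""

    def __init__(self, confs):
        self._confs = dict(confs)
        self.calls = 0

    def get(self, key):
        # One round trip per key.
        self.calls += 1
        return self._confs[key]

    def get_many(self, keys):
        # One round trip for the whole batch of keys.
        self.calls += 1
        return {k: self._confs[k] for k in keys}


confs = {"a": "1", "b": "2", "c": "3"}
keys = ["a", "b", "c"]

one_by_one = FakeGateway(confs)
values_slow = {k: one_by_one.get(k) for k in keys}

batched = FakeGateway(confs)
values_fast = batched.get_many(keys)
```

Both approaches return the same values, but the batched variant crosses the bridge once instead of once per key.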

…lizer` with `GroupPandasUDFSerializer`

### What changes were proposed in this pull request?

This PR consolidates `GroupPandasUDFSerializer` to support both `SQL_GROUPED_MAP_PANDAS_UDF` and `SQL_GROUPED_MAP_PANDAS_ITER_UDF`, aligning with the design pattern used by `GroupArrowUDFSerializer`.

### Why are the changes needed?

When the `Iterator[pandas.DataFrame]` API was added to `groupBy().applyInPandas()` in SPARK-53614 (#52716), a new `GroupPandasIterUDFSerializer` class was created. However, this class is nearly identical to `GroupPandasUDFSerializer`, differing only in whether batches are processed lazily (iterator mode) or all at once (regular mode).

### Does this PR introduce _any_ user-facing change?

No, this is an internal refactoring that maintains backward compatibility. The API behavior remains the same from the user's perspective.

### How was this patch tested?

Existing test cases.

### Was this patch authored or co-authored using generative AI tooling?

Co-Generated-by: Cursor with Claude 4.5 Sonnet

Closes #53043 from Yicong-Huang/SPARK-54316/refactor/consolidate-pandas-iter-serializer.

Lead-authored-by: Yicong-Huang <17627829+Yicong-Huang@users.noreply.github.com>
Co-authored-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
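The idea of one class serving both the lazy and the all-at-once mode can be sketched in plain Python. This is not the PySpark implementation: `GroupSerializerSketch`, its `iterator_mode` flag, and `load_group` are hypothetical names, and lists of integers stand in for pandas batches.

```python
from typing import Iterable, Iterator, List, Union

Batch = List[int]


class GroupSerializerSketch:
    """Hypothetical single class covering both processing modes."""

    def __init__(self, iterator_mode: bool):
        self.iterator_mode = iterator_mode

    def load_group(self, batches: Iterable[Batch]) -> Union[Iterator[Batch], Batch]:
        # Generator expression: no batch is copied until it is requested.
        lazy = (list(b) for b in batches)
        if self.iterator_mode:
            # Iterator mode: hand batches to the caller one at a time.
            return lazy
        # Regular mode: consume every batch and merge into one result.
        merged: Batch = []
        for batch in lazy:
            merged.extend(batch)
        return merged


eager_result = GroupSerializerSketch(iterator_mode=False).load_group([[1, 2], [3]])
lazy_result = list(GroupSerializerSketch(iterator_mode=True).load_group([[1, 2], [3]]))
```

The two modes share the batch-reading path and differ only in the final step, which is the kind of duplication a consolidation like the one described above removes.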

### What changes were proposed in this pull request?

This PR adds a script and a supporting directory that give users a native debugging experience with VSCode (Cursor), not only for driver code but also for UDFs, workers, and the daemon.

<img width="2324" height="1010" alt="image" src="https://github.com/user-attachments/assets/57d1bbae-53c6-48e9-8910-25a47e4f0e24" />

### Why are the changes needed?

So users can debug their (local) code with the VSCode debugger.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Works locally. This does not touch Spark code; it is just a dev tool.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #53289 from gaogaotiantian/support-vscode-breakpoint.

Authored-by: Tian Gao <gaogaotiantian@hotmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>

…ndlingMode` in pyspark pandas doctest

### What changes were proposed in this pull request?

After #53299, explicitly set the conf `spark.sql.execution.pandas.structHandlingMode` to `row`.

This is needed because, when Arrow optimization was previously disabled, structs were converted to Row objects by default; with Arrow optimization enabled, they are converted to dict, or an exception is raised if there are duplicated nested field names. To match the documented behavior after enabling Arrow by default, this conf is explicitly set to `row`.

### Why are the changes needed?

To fix the pyspark-pandas doctests and remove the doctest skips.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI running the pyspark-pandas doctests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #53301 from asl3/pysparkpandasdoctest.

Authored-by: Amanda Liu <amanda.liu@databricks.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>

…nnect

### What changes were proposed in this pull request?

When `spark.sql.connect.enrichError.enabled` is enabled (default), `message_parameters` is retrieved from `root_error.spark_throwable` and set correctly. However, `error_class` was not being retrieved/set in the same way, causing the error class information to be lost on the PySpark Connect client side. This PR fixes it by propagating the error class correctly.

### Why are the changes needed?

Fix a bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New unit tests.

### Was this patch authored or co-authored using generative AI tooling?

Yes.

Closes #53288 from shujingyang-db/fix-spark-connect-config-error.

Authored-by: Shujing Yang <shujing.yang@databricks.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>

…xec`"

### What changes were proposed in this pull request?

Clean revert of d65234b. A later change will handle sourceSide child nodes without `numOutputRows` and will re-target the new implementation to a later Spark release.

### Why are the changes needed?

The current implementation may grab the incorrect `numOutputRows` metric if there is an intermediary node (such as a custom Spark operator) that does not support the metric. This is because it targets the first sourceSide child node with `numOutputRows`. If a SparkExtension node does not contain this metric but transforms the source table, the search could progress all the way to the source table and grab the incorrect metric.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing CI, as this is a revert.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #53293 from asl3/numsourcerowsrevert.

Authored-by: Amanda Liu <amanda.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
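The failure mode described in this revert can be reproduced with a toy plan chain. This is a simplified sketch, not Spark's code: `Node`, `first_num_output_rows`, and the node names are hypothetical, and a real plan is a tree rather than a single chain.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    """Hypothetical plan node with optional metrics and a single child."""
    name: str
    metrics: Dict[str, int] = field(default_factory=dict)
    child: Optional["Node"] = None


def first_num_output_rows(node: Optional[Node]) -> Optional[int]:
    """Walk down the chain and return the first numOutputRows found."""
    while node is not None:
        if "numOutputRows" in node.metrics:
            return node.metrics["numOutputRows"]
        node = node.child
    return None


# The scan reports 1000 rows; a custom operator above it filters rows
# but exposes no numOutputRows metric of its own.
scan = Node("SourceScan", {"numOutputRows": 1000})
custom_filter = Node("CustomFilter", {}, child=scan)

picked = first_num_output_rows(custom_filter)
```

Here `picked` is the scan's count, even though the custom operator may emit fewer rows, so the value reported for the source side would not reflect what actually flowed into the operation.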

dtenedor and others added 8 commits December 3, 2025 15:53

pull bot locked and limited conversation to collaborators Dec 4, 2025

pull bot added the ⤵️ pull label Dec 4, 2025

pull bot merged commit ee41857 into huangxiaopingRD:master Dec 4, 2025

github-actions bot added the CORE, SQL, DOCS, PYTHON, PANDAS API ON SPARK, and CONNECT labels Dec 4, 2025


6 participants
