forked from apache/spark
[pull] master from apache:master #133
Merged
Conversation
…er in log
### What changes were proposed in this pull request?
This PR follows up #47145 to rename the log field.
### Why are the changes needed?
`line_no` is not very intuitive, so we rename it to `line_number` explicitly.
### Does this PR introduce _any_ user-facing change?
No API change, but the user-facing log message will be improved.
### How was this patch tested?
The existing CI should pass.
### Was this patch authored or co-authored using generative AI tooling?
No
Closes #47437 from itholic/logger_followup.
Authored-by: Haejoon Lee <haejoon.lee@databricks.com>
Signed-off-by: Haejoon Lee <haejoon.lee@databricks.com>
…inted RDDs
### What changes were proposed in this pull request?
This pull request proposes logging a warning message when the `.unpersist()` method is called on RDDs that have been locally checkpointed. The goal is to inform users about the potential risks associated with unpersisting locally checkpointed RDDs without changing the current behavior of the method.
### Why are the changes needed?
Local checkpointing truncates the lineage of an RDD, preventing it from being recomputed from its source. If a locally checkpointed RDD is unpersisted, it loses its data and cannot be regenerated, potentially leading to job failures if subsequent actions or transformations are attempted on the RDD (which was seen in some user workloads). Logging a warning message helps users avoid such pitfalls and aids in debugging.
### Does this PR introduce _any_ user-facing change?
Yes, this PR adds a warning log message when `.unpersist()` is called on a locally checkpointed RDD, but it does not alter any existing behavior.
### How was this patch tested?
This PR does not change any existing behavior and therefore no tests are added.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #47391 from mingkangli-db/warning_unpersist.
Authored-by: Mingkang Li <mingkang.li@databricks.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
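A minimal PySpark sketch of the scenario that now triggers the warning; the RDD and its contents are illustrative, not taken from the patch:
```py
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(100)).map(lambda x: x * 2)
rdd.localCheckpoint()   # truncates the lineage; the data lives only in block storage
rdd.count()             # materializes the local checkpoint
rdd.unpersist()         # after this change, Spark logs a warning here:
                        # the RDD can no longer be recomputed if it is used again
```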
### What changes were proposed in this pull request?
Fix breaking change in `fromJson` method by having default param values.
### Why are the changes needed?
In order to not break clients that don't care about collations.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing UTs.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #46737 from stefankandic/fromJsonBreakingChange.
Authored-by: Stefan Kandic <stefan.kandic@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…d of `SQLContext.implicits`
### What changes were proposed in this pull request?
This PR replaces `SQLContext.implicits` with `SparkSession.implicits` in the Spark codebase.
### Why are the changes needed?
Reduce the usage of code from `SQLContext` within the internal code of Spark.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass GitHub Actions
### Was this patch authored or co-authored using generative AI tooling?
No
Closes #47457 from LuciferYang/use-sparksession-implicits.
Lead-authored-by: yangjie01 <yangjie01@baidu.com>
Co-authored-by: YangJie <yangjie01@baidu.com>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
### What changes were proposed in this pull request?
The pr aims to make `curl` retry `3 times` in `bin/mvn`.
### Why are the changes needed?
Avoid the following issues: https://github.com/panbingkun/spark/actions/runs/10067831390/job/27832101470
<img width="993" alt="image" src="https://github.com/user-attachments/assets/3fa9a59a-82cb-4e99-b9f7-d128f051d340">
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Continuous manual observation.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #47465 from panbingkun/SPARK-48987.
Authored-by: panbingkun <panbingkun@baidu.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…llations
### What changes were proposed in this pull request?
Fixes a bug where, because string interpolation works differently in Python, an accidental dollar sign ended up in the string value.
### Why are the changes needed?
To be consistent with the Scala code.
### Does this PR introduce _any_ user-facing change?
A different value will be shown to the user.
### How was this patch tested?
Unit tests.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #47463 from stefankandic/fixPythonToString.
Authored-by: Stefan Kandic <stefan.kandic@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
The PR aims to improve the docs related to `variable`, including:
- `docs/sql-ref-syntax-aux-set-var.md`: show the `primitive` error messages.
- `docs/sql-ref-syntax-ddl-declare-variable.md`: add usage of `DECLARE OR REPLACE`.
- `docs/sql-ref-syntax-ddl-drop-variable.md`: show the `primitive` error messages and fix a `typo`.
### Why are the changes needed?
Only improves docs.
### Does this PR introduce _any_ user-facing change?
Yes, makes the end-user docs clearer.
### How was this patch tested?
Manual test.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #47460 from panbingkun/SPARK-48976.
Authored-by: panbingkun <panbingkun@baidu.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…nition from `protobuf`
### What changes were proposed in this pull request?
This PR removes the unused object definition `ScalaReflectionLock` from the `protobuf` module. `ScalaReflectionLock` is a definition with `protobuf` package access scope, which was defined in SPARK-40654 | #37972 and became unused in SPARK-41639 | #39147.
### Why are the changes needed?
Clean up unused definitions.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass GitHub Actions
### Was this patch authored or co-authored using generative AI tooling?
No
Closes #47459 from LuciferYang/remove-ScalaReflectionLock.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…intenance task
### What changes were proposed in this pull request?
Currently, during the state store maintenance process, we find which old version files of the **RocksDB** state store to delete by listing all existing snapshotted version files in the checkpoint directory, every 1 minute by default. The frequent list calls in the cloud can result in high costs. To address this concern and reduce the cost associated with state store maintenance, we should aim to minimize the frequency of listing object stores inside the maintenance task. To minimize the frequency, we accumulate versions to delete and only call list when the number of versions to delete reaches a configured threshold. The changes include:
1. Adding a new configuration variable `ratioExtraVersionsAllowedInCheckpoint` in **SQLConf**. This determines the ratio of extra version files we want to retain in the checkpoint directory compared to the number of versions to retain for rollbacks (`minBatchesToRetain`).
2. Using this config and `minBatchesToRetain`, setting the `minVersionsToDelete` config inside **StateStoreConf** to `minVersionsToDelete = ratioExtraVersionsAllowedInCheckpoint * minBatchesToRetain`.
3. Using the `minSeenVersion` and `maxSeenVersion` variables in **RocksDBFileManager** to estimate the min/max version present in the directory and control deletion frequency. This is done by ensuring the number of stale versions to delete is at least `minVersionsToDelete`.
A small worked sketch of this arithmetic follows this message.
### Why are the changes needed?
Currently, maintenance operations like snapshotting, purging, deletion, and file management are done asynchronously for each data partition. We want to shift away from periodic deletion and instead rely on the estimated number of files in the checkpoint directory to reduce list calls and introduce batch deletion.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added unit tests.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #47393 from riyaverm-db/reduce-cloud-store-list-api-cost-in-maintenance.
Authored-by: Riya Verma <riya.verma@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
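A hedged, illustrative sketch of the deletion-threshold arithmetic described above; the values are assumptions, and the snake_case names only mirror the configs named in the description, not Spark internals:
```py
# Assumed example values, not Spark defaults.
min_batches_to_retain = 100                        # versions retained for rollback
ratio_extra_versions_allowed_in_checkpoint = 0.3   # new SQLConf ratio from this change

# StateStoreConf derives the deletion threshold from the two settings above.
min_versions_to_delete = ratio_extra_versions_allowed_in_checkpoint * min_batches_to_retain

# The maintenance task only lists the checkpoint directory once the estimated
# number of stale versions (based on minSeenVersion/maxSeenVersion) reaches
# this threshold, instead of listing every minute.
print(min_versions_to_delete)  # 30.0
```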
… consistent with JVM
### What changes were proposed in this pull request?
This PR proposes to make the parameter naming of `PySparkException` consistent with the JVM.
### Why are the changes needed?
The parameter names of `PySparkException` are different from `SparkException`, so there is an inconsistency when searching those parameters in error logs.
SparkException: https://github.com/apache/spark/blob/6508b1f5e18731359354af0a7bcc1382bc4f356b/common/utils/src/main/scala/org/apache/spark/SparkException.scala#L27-L33
PySparkException: https://github.com/apache/spark/blob/6508b1f5e18731359354af0a7bcc1382bc4f356b/python/pyspark/errors/exceptions/base.py#L29-L40
### Does this PR introduce _any_ user-facing change?
The error parameter names are changed:
- `error_class` -> `errorClass`
- `message_parameters` -> `messageParameters`
- `query_contexts` -> `context`
### How was this patch tested?
The existing CI should pass
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #47436 from itholic/SPARK-48961.
Authored-by: Haejoon Lee <haejoon.lee@databricks.com>
Signed-off-by: Haejoon Lee <haejoon.lee@databricks.com>
…Collation` expression itself in UT
### What changes were proposed in this pull request?
The pr aims to:
- make `checkEvaluation` directly check the `Collation` expression itself in UT, rather than `Collation(...).replacement`.
- fix a missed check in UT.
### Why are the changes needed?
When checking the `RuntimeReplaceable` expression in UT, there is no need to write as `checkEvaluation(Collation(Literal("abc")).replacement, "UTF8_BINARY")`, because it has already undergone a similar replacement internally.
https://github.com/apache/spark/blob/1a428c1606645057ef94ac8a6cadbb947b9208a6/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvalHelper.scala#L75
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- Update existing UT.
- Pass GA.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #47401 from panbingkun/SPARK-48935.
Authored-by: panbingkun <panbingkun@baidu.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Checking whether variable declarations appear only at the beginning of the BEGIN ... END block.
### Why are the changes needed?
The SQL standard states that variables can be declared only immediately after BEGIN.
### Does this PR introduce _any_ user-facing change?
Users will get an error if they try to declare a variable in a scope that is not started with BEGIN and ended with END, or if the declarations are not immediately after BEGIN.
### How was this patch tested?
Tests are in SqlScriptingParserSuite. There are 2 tests for now: one where declarations are correctly written and one where declarations are not immediately after BEGIN. There is a TODO to write a test where the declaration is located in a scope that is not BEGIN ... END.
### Was this patch authored or co-authored using generative AI tooling?
No
Closes #47404 from momcilomrk-db/check_variable_declarations.
Authored-by: Momcilo Mrkaic <momcilo.mrkaic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
… the actual StreamingQueryProgress
This reverts commit d067fc6, which reverted 042804a, essentially bringing it back. 042804a failed the 3.5 client <> 4.0 server test, but that test was turned off for cross-version testing in #47468.
### What changes were proposed in this pull request?
This PR is created after discussion in this closed one: #46886
I was trying to fix a bug (in Connect, `query.lastProgress` doesn't have `numInputRows`, `inputRowsPerSecond`, and `processedRowsPerSecond`), and we reached the conclusion that what is proposed in this PR should be the ultimate fix.
In Python, for both classic Spark and Spark Connect, the return type of `lastProgress` is `Dict` (and `recentProgress` is `List[Dict]`), but in Scala it's the actual `StreamingQueryProgress` object:
https://github.com/apache/spark/blob/1a5d22aa2ffe769435be4aa6102ef961c55b9593/sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQuery.scala#L94-L101
This API discrepancy brings some confusion: in Scala, users can do `query.lastProgress.batchId`, while in Python they have to do `query.lastProgress["batchId"]`.
This PR makes `StreamingQuery.lastProgress` return the actual `StreamingQueryProgress` (and `StreamingQuery.recentProgress` return `List[StreamingQueryProgress]`). To prevent a breaking change, we extend `StreamingQueryProgress` to be a subclass of `dict`, so existing code using dictionary access (e.g. `query.lastProgress["id"]`) is still functional.
### Why are the changes needed?
API parity
### Does this PR introduce _any_ user-facing change?
Yes, now `StreamingQuery.lastProgress` returns the actual `StreamingQueryProgress` (and `StreamingQuery.recentProgress` returns `List[StreamingQueryProgress]`).
### How was this patch tested?
Added unit test
### Was this patch authored or co-authored using generative AI tooling?
No
Closes #47470 from WweiL/bring-back-lastProgress.
Authored-by: Wei Liu <wei.liu@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
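A short hedged PySpark sketch of the two access styles that now both work; the streaming source and sink are illustrative choices, not from the patch:
```py
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.readStream.format("rate").load()      # illustrative streaming source
query = df.writeStream.format("noop").start()
query.processAllAvailable()

progress = query.lastProgress                    # now a StreamingQueryProgress
if progress is not None:
    print(progress.batchId)                      # attribute access, matching the Scala API
    print(progress["batchId"])                   # dict access still works (subclass of dict)
    print(progress.numInputRows)                 # fields previously missing in Connect
query.stop()
```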
### What changes were proposed in this pull request?
Adds support for the variant type in `InMemoryTableScan`, i.e. `df.cache()`, by supporting writing variant values to an in-memory buffer.
### Why are the changes needed?
Prior to this PR, calling `df.cache()` on a df that has a variant would fail because `InMemoryTableScan` does not support reading variant types.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added UTs
### Was this patch authored or co-authored using generative AI tooling?
No
Closes #47252 from richardc-db/variant_dfcache_support.
Authored-by: Richard Chen <r.chen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
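A minimal hedged sketch of the pattern this enables; the JSON value is illustrative and it assumes the `parse_json` SQL function available in the same development line:
```py
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# parse_json produces a variant column; caching such a DataFrame previously failed.
df = spark.sql("SELECT parse_json('{\"a\": 1}') AS v FROM range(3)")
df.cache()                  # variant values can now be written to the in-memory buffer
df.count()                  # materializes the cache
df.show(truncate=False)
```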
…ith spark session
### What changes were proposed in this pull request?
`DefaultParamsReader/Writer` handle metadata with the Spark session.
### Why are the changes needed?
In existing ML implementations, when loading/saving a model, Spark loads/saves the metadata with `SparkContext` and then loads/saves the coefficients with `SparkSession`. This PR aims to also load/save the metadata with `SparkSession`, by introducing new helper functions.
- Note I: third-party libraries (e.g. [xgboost](https://github.com/dmlc/xgboost/blob/master/jvm-packages/xgboost4j-spark/src/main/scala/org/apache/spark/ml/util/XGBoostReadWrite.scala#L38-L53)) likely depend on the existing implementation of saveMetadata/loadMetadata, so we cannot simply remove them even though they are `private[ml]`.
- Note II: this PR only handles `loadMetadata` and `saveMetadata`; there are similar cases for meta algorithms and param read/write, but I want to ignore the remaining part first, to avoid touching too many files in a single PR.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI
### Was this patch authored or co-authored using generative AI tooling?
No
Closes #47467 from zhengruifeng/ml_load_with_spark.
Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…ileStreamSink.hasMetadata
### What changes were proposed in this pull request?
This pull request proposes to move path initialization into the try-catch block in FileStreamSink.hasMetadata. Then, exceptions from invalid paths can be handled consistently like other path-related exceptions in the current try-catch block. As a result, the errors fall into the correct code branches to be handled.
### Why are the changes needed?
Bugfix for improperly handled exceptions in FileStreamSink.hasMetadata
### Does this PR introduce _any_ user-facing change?
No, an invalid path is still invalid, but it fails in the correct places
### How was this patch tested?
New test
### Was this patch authored or co-authored using generative AI tooling?
No
Closes #47471 from yaooqinn/SPARK-48991.
Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
### What changes were proposed in this pull request?
The PR aims to unify `variable`-related `SQL syntax` keywords, enabling the syntax `DECLARE (OR REPLACE)? ...` and `DROP TEMPORARY ...` to support the keyword `VAR` (not only `VARIABLE`).
### Why are the changes needed?
When `setting` variables, we support `(VARIABLE | VAR)`, but when `declaring` and `dropping` variables, we only support the keyword `VARIABLE` (not the keyword `VAR`).
<img width="597" alt="image" src="https://github.com/user-attachments/assets/07084fef-4080-4410-a74c-e6001ae0a8fa">
https://github.com/apache/spark/blob/285489b0225004e918b6e937f7367e492292815e/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4#L68-L72
https://github.com/apache/spark/blob/285489b0225004e918b6e937f7367e492292815e/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4#L218-L220
The syntax seems a bit weird and gives end-users an inconsistent experience in variable-related SQL syntax, so I propose to unify it.
### Does this PR introduce _any_ user-facing change?
Yes, enables end-users to use variable-related SQL with consistent keywords.
### How was this patch tested?
Updated existing UT.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #47469 from panbingkun/SPARK-48990.
Authored-by: panbingkun <panbingkun@baidu.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
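A hedged sketch of the now-consistent keyword usage, run through `spark.sql`; the variable name is illustrative:
```py
# After this change, VAR should work wherever VARIABLE does.
spark.sql("DECLARE OR REPLACE VAR my_var INT DEFAULT 1")
spark.sql("SET VAR my_var = 2")            # already supported before this change
spark.sql("DROP TEMPORARY VAR my_var")
```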
…t case failure with ANSI mode off
### What changes were proposed in this pull request?
This PR aims to fix the `h2` filter push-down test case failure with ANSI mode off.
### Why are the changes needed?
Fix test failure.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual test of the whole `JDBCV2Suite` with ANSI mode off and on.
1. Method One: with IDEA.
- ANSI mode off: with `SPARK_ANSI_SQL_MODE=false`
<img width="1066" alt="image" src="https://github.com/user-attachments/assets/13ec8ff4-0699-4f3e-95c4-74f53d9824fe">
- ANSI mode on: without `SPARK_ANSI_SQL_MODE` env variable
<img width="1066" alt="image" src="https://github.com/user-attachments/assets/8434bf0c-b332-4663-965c-0d17d60da78a">
2. Method Two: with commands.
- ANSI mode off
```
SPARK_ANSI_SQL_MODE=false
$ build/sbt
> project sql
> testOnly org.apache.spark.sql.jdbc.JDBCV2Suite
```
- ANSI mode on
```
UNSET SPARK_ANSI_SQL_MODE
$ build/sbt
> project sql
> testOnly org.apache.spark.sql.jdbc.JDBCV2Suite
```
Test results:
1. The issue on the current `master` branch
- with `SPARK_ANSI_SQL_MODE=false`, test failed
- without `SPARK_ANSI_SQL_MODE` env variable, test passed
2. Fixed with new test code
- with `SPARK_ANSI_SQL_MODE=false`, test passed
- without `SPARK_ANSI_SQL_MODE` env variable, test passed
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #47472 from wayneguow/fix_h2.
Authored-by: Wei Guo <guow93@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Introduce a new `clusterBy` DataFrame API in Scala. This PR adds the API for both the DataFrameWriter V1 and V2, as well as Spark Connect.
### Why are the changes needed?
Introduce more ways for users to interact with clustered tables.
### Does this PR introduce _any_ user-facing change?
Yes, it adds a new `clusterBy` DataFrame API in Scala to allow specifying the clustering columns when writing DataFrames.
### How was this patch tested?
New unit tests.
### Was this patch authored or co-authored using generative AI tooling?
No
Closes #47301 from zedtang/clusterby-scala-api.
Authored-by: Jiaheng Tang <jiaheng.tang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
… in hive-thriftserver test
### What changes were proposed in this pull request?
A follow-up of SPARK-48844 to clean up duplicated data resource files in the hive-thriftserver test.
### Why are the changes needed?
Code refactoring
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
New tests
### Was this patch authored or co-authored using generative AI tooling?
No
Closes #47480 from yaooqinn/SPARK-48844-F.
Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
### What changes were proposed in this pull request?
The PR aims to update the doc `sql/README.md`.
### Why are the changes needed?
After #41426, we have added a subproject `API` to our `SQL module`, so we need to update the doc `sql/README.md` accordingly.
### Does this PR introduce _any_ user-facing change?
Yes, makes the doc clearer and more accurate.
### How was this patch tested?
Manual test.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #47476 from panbingkun/minor_docs.
Authored-by: panbingkun <panbingkun@baidu.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…hStateExec operator
### What changes were proposed in this pull request?
Introducing the OperatorStateMetadataV2 format that integrates with the TransformWithStateExec operator. This is used to keep information about the TWS operator and will be used to enforce invariants between query runs. Each OperatorStateMetadataV2 has a pointer to the StateSchemaV3 file for the corresponding operator. Purging will be introduced in this PR: #47286
### Why are the changes needed?
This is needed for State Metadata integration with the TransformWithState operator.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Added unit tests to StateStoreSuite and TransformWithStateSuite
### Was this patch authored or co-authored using generative AI tooling?
No
Closes #47445 from ericm-db/metadata-v2.
Authored-by: Eric Marnadi <eric.marnadi@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
… of Column
### What changes were proposed in this pull request?
Allows bare literals for `__and__` and `__or__` of Column API in Spark Classic.
### Why are the changes needed?
Currently bare literals are not allowed for `__and__` and `__or__` of the Column API in Spark Classic and need to be wrapped with the `lit()` function. They should be allowed, as with other similar operators.
```py
>>> from pyspark.sql.functions import *
>>> c = col("c")
>>> c & True
Traceback (most recent call last):
...
py4j.Py4JException: Method and([class java.lang.Boolean]) does not exist
>>> c & lit(True)
Column<'and(c, true)'>
```
whereas other operators:
```py
>>> c + 1
Column<'`+`(c, 1)'>
>>> c + lit(1)
Column<'`+`(c, 1)'>
```
Spark Connect allows this.
```py
>>> c & True
Column<'and(c, True)'>
>>> c & lit(True)
Column<'and(c, True)'>
```
### Does this PR introduce _any_ user-facing change?
Yes.
### How was this patch tested?
Added the related tests.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #47474 from ueshin/issues/SPARK-48996/literal_and_or.
Authored-by: Takuya Ueshin <ueshin@databricks.com>
Signed-off-by: Takuya Ueshin <ueshin@databricks.com>
…, if they are bound to outer rows
### What changes were proposed in this pull request?
Extends previous work in #46839, allowing the grouping expressions to be bound to outer references. The most common example is `select *, (select count(*) from T_inner where cast(T_inner.x as date) = T_outer.date group by cast(T_inner.x as date))`.
Here, we group by `cast(T_inner.x as date)`, which is bound to an outer row. This guarantees that for every outer row there is exactly one value of `cast(T_inner.x as date)`, so it is safe to group on it. Previously, we required that only columns can be bound to outer expressions, thus forbidding such subqueries.
### Why are the changes needed?
Extends supported subqueries
### Does this PR introduce _any_ user-facing change?
Yes, previously failing queries are now passing
### How was this patch tested?
Query tests
### Was this patch authored or co-authored using generative AI tooling?
No
Closes #47388 from agubichev/group_by_cols.
Authored-by: Andrey Gubichev <andrey.gubichev@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
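A hedged, runnable rendering of the example above; the `T_outer`/`T_inner` tables are hypothetical, and the outer `FROM` clause is added for completeness:
```py
spark.sql("""
    SELECT *,
           (SELECT count(*)
            FROM   T_inner
            WHERE  cast(T_inner.x AS date) = T_outer.date
            GROUP  BY cast(T_inner.x AS date)) AS cnt
    FROM T_outer
""").show()
```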
### What changes were proposed in this pull request?
Support listColumns API for clustering columns.
### Why are the changes needed?
Clustering columns should be supported, just like partition and bucket columns, for listColumns API.
### Does this PR introduce _any_ user-facing change?
Yes, listColumns will now show an additional field `isCluster` to indicate whether the column is a clustering column.
Old output for `spark.catalog.listColumns`:
```
+----+-----------+--------+--------+-----------+--------+
|name|description|dataType|nullable|isPartition|isBucket|
+----+-----------+--------+--------+-----------+--------+
|   a|       null|     int|    true|      false|   false|
|   b|       null|  string|    true|      false|   false|
|   c|       null|     int|    true|      false|   false|
|   d|       null|  string|    true|      false|   false|
+----+-----------+--------+--------+-----------+--------+
```
New output:
```
+----+-----------+--------+--------+-----------+--------+---------+
|name|description|dataType|nullable|isPartition|isBucket|isCluster|
+----+-----------+--------+--------+-----------+--------+---------+
|   a|       null|     int|    true|      false|   false|    false|
|   b|       null|  string|    true|      false|   false|    false|
|   c|       null|     int|    true|      false|   false|    false|
|   d|       null|  string|    true|      false|   false|    false|
+----+-----------+--------+--------+-----------+--------+---------+
```
### How was this patch tested?
New unit tests.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #47451 from zedtang/list-clustering-columns.
Authored-by: Jiaheng Tang <jiaheng.tang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR aims to improve `MasterPage` to support a custom title.
### Why are the changes needed?
When there are multiple Spark clusters, a custom title can be more helpful than the Spark master address because it can carry semantics like the role of the cluster. In addition, the URL field on the same page already provides the Spark master information even when we use a custom title.
**BEFORE**
```
sbin/start-master.sh
```
**AFTER**
```
SPARK_MASTER_OPTS='-Dspark.master.ui.title="Projext X Staging Cluster"' sbin/start-master.sh
```
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Pass the CIs with the newly added test case.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #47491 from dongjoon-hyun/SPARK-49007.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
1. Add overloads with SparkSession of the following helper functions:
- SharedReadWrite.saveImpl
- SharedReadWrite.load
- DefaultParamsWriter.getMetadataToSave
- DefaultParamsReader.loadParamsInstance
- DefaultParamsReader.loadParamsInstanceReader
2. Deprecate the old functions
3. Apply the new functions in ML
### Why are the changes needed?
Meta algorithms save/load models with SparkSession. After this PR, all `.ml` implementations save and load models with SparkSession, while the old helper functions with `sc` are still available (just deprecated) for the eco-system.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI
### Was this patch authored or co-authored using generative AI tooling?
No
Closes #47477 from zhengruifeng/ml_meta_spark.
Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
…to the `hive-thriftserver` module to fix the Maven daily test
### What changes were proposed in this pull request?
This PR adds bouncycastle-related test dependencies to the `hive-thrift` module to fix the Maven daily test.
### Why are the changes needed?
`sql-on-files.sql` added the following statement in #47480, which caused the Maven daily test to fail:
https://github.com/apache/spark/blob/2363aec0c14ead24ade2bfa23478a4914f179c00/sql/core/src/test/resources/sql-tests/inputs/sql-on-files.sql#L10
- https://github.com/apache/spark/actions/runs/10094638521/job/27943309504
- https://github.com/apache/spark/actions/runs/10095571472/job/27943298802
```
- sql-on-files.sql *** FAILED ***
  "" did not contain "Exception" Exception did not match for query #6
  CREATE TABLE sql_on_files.test_orc USING ORC AS SELECT 1, expected: , but got: java.sql.SQLException
  org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8542.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8542.0 (TID 8594) (localhost executor driver): java.lang.NoClassDefFoundError: org/bouncycastle/jce/provider/BouncyCastleProvider
  at test.org.apache.spark.sql.execution.datasources.orc.FakeKeyProvider$Factory.createProvider(FakeKeyProvider.java:127)
  at org.apache.hadoop.crypto.key.KeyProviderFactory.get(KeyProviderFactory.java:96)
  at org.apache.hadoop.crypto.key.KeyProviderFactory.getProviders(KeyProviderFactory.java:68)
  at org.apache.orc.impl.HadoopShimsCurrent.createKeyProvider(HadoopShimsCurrent.java:97)
  at org.apache.orc.impl.HadoopShimsCurrent.getHadoopKeyProvider(HadoopShimsCurrent.java:131)
  at org.apache.orc.impl.CryptoUtils$HadoopKeyProviderFactory.create(CryptoUtils.java:158)
  at org.apache.orc.impl.CryptoUtils.getKeyProvider(CryptoUtils.java:141)
  at org.apache.orc.impl.WriterImpl.setupEncryption(WriterImpl.java:1015)
  at org.apache.orc.impl.WriterImpl.<init>(WriterImpl.java:164)
  at org.apache.orc.OrcFile.createWriter(OrcFile.java:1078)
  at org.apache.spark.sql.execution.datasources.orc.OrcOutputWriter.<init>(OrcOutputWriter.scala:49)
  at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:89)
  at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:180)
  at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:165)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:391)
  at org.apache.spark.sql.execution.datasources.WriteFilesExec.$anonfun$doExecuteWrite$1(WriteFiles.scala:107)
  at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:901)
  at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:901)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:374)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:338)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
  at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:171)
  at org.apache.spark.scheduler.Task.run(Task.scala:146)
  at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$5(Executor.scala:644)
  at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
  at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:99)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:647)
  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
  at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: java.lang.ClassNotFoundException: org.bouncycastle.jce.provider.BouncyCastleProvider
  at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:641)
  at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
  at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525)
  ... 32 more
```
Because we have configured `hadoop.security.key.provider.path` as `test:///` in the parent `pom.xml`,
https://github.com/apache/spark/blob/5ccf9ba958f492c1eb4dde22a647ba75aba63d8e/pom.xml#L3165-L3166
`KeyProviderFactory#getProviders` will use `FakeKeyProvider$Factory` to create instances of `FakeKeyProvider`.
https://github.com/apache/spark/blob/5ccf9ba958f492c1eb4dde22a647ba75aba63d8e/sql/core/src/test/resources/META-INF/services/org.apache.hadoop.crypto.key.KeyProviderFactory#L18
During the initialization of `FakeKeyProvider`, it first initializes its superclass `org.apache.hadoop.crypto.key.KeyProvider`, which leads to the loading of the `BouncyCastleProvider` class. Therefore, we need to add bouncycastle-related test dependencies in the `hive-thrift` module.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manual test with this PR.
```
build/mvn -Phive -Phive-thriftserver clean install -DskipTests
build/mvn -Phive -Phive-thriftserver clean install -Dtest=none -DwildcardSuites=org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite -pl sql/hive-thriftserver
```
```
Run completed in 6 minutes, 52 seconds.
Total number of tests run: 243
Suites: completed 2, aborted 0
Tests: succeeded 243, failed 0, canceled 0, ignored 20, pending 0
All tests passed.
```
### Was this patch authored or co-authored using generative AI tooling?
No
Closes #47496 from LuciferYang/thrift-bouncycastle.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Divide PythonStreamingDataSourceSuite into PythonStreamingDataSourceSuite, PythonStreamingDataSourceSimpleSuite and PythonStreamingDataSourceWriteSuite.
### Why are the changes needed?
It is reported that PythonStreamingDataSourceSuite runs too long. We need to divide it to allow better parallelism.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Watch CI to cover the three new tests and pass.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #47479 from siying/divide_python_stream_suite.
Authored-by: Siying Dong <siying.dong@databricks.com>
Signed-off-by: allisonwang-db <allison.wang@databricks.com>
### What changes were proposed in this pull request?
Make Column APIs and functions accept `Enum`s.
### Why are the changes needed?
`Enum`s can be accepted in Column APIs and functions using its `value`.
```py
>>> from pyspark.sql import functions as F
>>> from enum import Enum
>>> class A(Enum):
...     X = "x"
...     Y = "y"
...
>>> F.lit(A.X)
Column<'x'>
>>> F.lit(A.X) + A.Y
Column<'`+`(x, y)'>
```
### Does this PR introduce _any_ user-facing change?
Yes, Python's `Enum`s will be used as literal values.
### How was this patch tested?
Added the related tests.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #47495 from ueshin/issues/SPARK-49009/enum.
Authored-by: Takuya Ueshin <ueshin@databricks.com>
Signed-off-by: Takuya Ueshin <ueshin@databricks.com>
### What changes were proposed in this pull request?
This PR introduces an intermediate representation for Column operations. It also adds a converter from this IR to Catalyst Expression. This is a first step in sharing Column API between Classic and Connect. It is not integrated with any of the pre-existing code base.
### Why are the changes needed?
We want to share the Scala Column API between the Classic and Connect Scala DataFrame API implementations. For this we need to decouple the Column API from Catalyst.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
I added a test suite that tests the conversion from `ColumnNode` to `Expression`.
### Was this patch authored or co-authored using generative AI tooling?
No
Closes #47466 from hvanhovell/SPARK-48986.
Authored-by: Herman van Hovell <herman@databricks.com>
Signed-off-by: Herman van Hovell <herman@databricks.com>
…ement in REST Submission API
### What changes were proposed in this pull request?
This PR aims to support server-side environment variable replacement in REST Submission API.
- For example, ephemeral Spark clusters with server-side environment variables can provide backend resources and information without touching client-side applications and configurations.
- The placeholder pattern is the `{{SERVER_ENVIRONMENT_VARIABLE_NAME}}` style, like the following.
https://github.com/apache/spark/blob/163e512c53208301a8511310023d930d8b77db96/docs/configuration.md?plain=1#L694
https://github.com/apache/spark/blob/163e512c53208301a8511310023d930d8b77db96/core/src/main/scala/org/apache/spark/deploy/rest/StandaloneRestServer.scala#L233-L234
### Why are the changes needed?
A user can submit an environment-variable placeholder like `{{AWS_CA_BUNDLE}}` or `{{AWS_ENDPOINT_URL}}` in order to use server-wide environment variables of the Spark Master.
```
$ SPARK_MASTER_OPTS="-Dspark.master.rest.enabled=true" \
AWS_ENDPOINT_URL=ENDPOINT_FOR_THIS_CLUSTER \
sbin/start-master.sh
$ sbin/start-worker.sh spark://$(hostname):7077
```
```
curl -s -k -XPOST http://localhost:6066/v1/submissions/create \
--header "Content-Type:application/json;charset=UTF-8" \
--data '{
"appResource": "",
"sparkProperties": {
"spark.master": "spark://localhost:7077",
"spark.app.name": "",
"spark.submit.deployMode": "cluster",
"spark.jars": "/Users/dongjoon/APACHE/spark-merge/examples/target/scala-2.13/jars/spark-examples_2.13-4.0.0-SNAPSHOT.jar"
},
"clientSparkVersion": "",
"mainClass": "org.apache.spark.examples.SparkPi",
"environmentVariables": {
"AWS_ACCESS_KEY_ID": "A",
"AWS_SECRET_ACCESS_KEY": "B",
"AWS_ENDPOINT_URL": "{{AWS_ENDPOINT_URL}}"
},
"action": "CreateSubmissionRequest",
"appArgs": [ "10000" ]
}'
```
- http://localhost:4040/environment/

### Does this PR introduce _any_ user-facing change?
No. This is a new feature and is disabled by default via `spark.master.rest.enabled` (default: false).
### How was this patch tested?
Pass the CIs with newly added test case.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #47509 from dongjoon-hyun/SPARK-49033.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…sensitivity
### What changes were proposed in this pull request?
Currently, XML respects the case sensitivity SQLConf (default to false) in the schema inference but we lack unit tests to verify the behavior. This PR adds more unit tests to it.
### Why are the changes needed?
This is a test-only change.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
This is a test-only change.
### Was this patch authored or co-authored using generative AI tooling?
No
Closes #47494 from shujingyang-db/xml-schema-inference-case-sensitivity.
Authored-by: Shujing Yang <shujing.yang@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
… the Variant Spec
### What changes were proposed in this pull request?
This PR adds support for the [YearMonthIntervalType](https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/YearMonthIntervalType.html) and [DayTimeIntervalType](https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/DayTimeIntervalType.html) as new primitive types in the Variant spec. As part of this task, the PR adds support for casting between intervals and variants and support for interval types in all the relevant variant expressions. This PR also adds support for these types on the PySpark side.
### Why are the changes needed?
The variant spec should be compatible with all SQL Standard data types.
### Does this PR introduce _any_ user-facing change?
Yes, it allows users to cast interval types to variants and vice versa.
### How was this patch tested?
Unit tests in VariantExpressionSuite.scala and test_types.py
### Was this patch authored or co-authored using generative AI tooling?
Yes, I used perplexity.ai to get guidance on converting some Scala code to Java code and Java code to Python code.
Generated-by: perplexity.ai
Closes #47473 from harshmotw-db/variant_interval.
Authored-by: Harsh Motwani <harsh.motwani@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
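A hedged sketch of the newly allowed casts; the interval literals are illustrative and the exact display formatting of the results may differ:
```py
# Interval values can now round-trip through the variant type.
spark.sql("""
    SELECT cast(INTERVAL '1-2' YEAR TO MONTH AS variant)        AS ym_as_variant,
           cast(INTERVAL '3 04:05:06' DAY TO SECOND AS variant) AS dt_as_variant
""").show(truncate=False)
```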
… in REST Submission API
### What changes were proposed in this pull request?
Like SPARK-49033, this PR aims to support server-side `sparkProperties` replacement in REST Submission API.
- For example, ephemeral Spark clusters with server-side environment variables can provide backend resources and information without touching client-side applications and configurations.
- The placeholder pattern is the `{{SERVER_ENVIRONMENT_VARIABLE_NAME}}` style, like the following.
https://github.com/apache/spark/blob/163e512c53208301a8511310023d930d8b77db96/docs/configuration.md?plain=1#L694
https://github.com/apache/spark/blob/163e512c53208301a8511310023d930d8b77db96/core/src/main/scala/org/apache/spark/deploy/rest/StandaloneRestServer.scala#L233-L234
### Why are the changes needed?
A user can submit a placeholder like `{{AWS_ENDPOINT_URL}}` in order to use server-wide environment variables of the Spark Master.
```
$ SPARK_MASTER_OPTS="-Dspark.master.rest.enabled=true" \
AWS_ENDPOINT_URL=ENDPOINT_FOR_THIS_CLUSTER \
sbin/start-master.sh
$ sbin/start-worker.sh spark://$(hostname):7077
```
```
curl -s -k -XPOST http://localhost:6066/v1/submissions/create \
--header "Content-Type:application/json;charset=UTF-8" \
--data '{
"appResource": "",
"sparkProperties": {
"spark.master": "spark://localhost:7077",
"spark.app.name": "",
"spark.submit.deployMode": "cluster",
"spark.hadoop.fs.s3a.endpoint": "{{AWS_ENDPOINT_URL}}",
"spark.jars": "/Users/dongjoon/APACHE/spark-merge/examples/target/scala-2.13/jars/spark-examples_2.13-4.0.0-SNAPSHOT.jar"
},
"clientSparkVersion": "",
"mainClass": "org.apache.spark.examples.SparkPi",
"environmentVariables": {},
"action": "CreateSubmissionRequest",
"appArgs": [ "10000" ]
}'
```
- http://localhost:4040/environment/
### Does this PR introduce _any_ user-facing change?
No. This is a new feature and is disabled by default via `spark.master.rest.enabled` (default: false).
### How was this patch tested?
Pass the CIs with the newly added test case.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #47511 from dongjoon-hyun/SPARK-49034-2.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
pull bot pushed a commit that referenced this pull request on Sep 17, 2025
…nProtoConverter
### What changes were proposed in this pull request?
This PR refactors the `LiteralExpressionProtoConverter` to use `CatalystTypeConverters` for consistent type conversion, eliminating code duplication and improving maintainability.
**Key changes:**
1. **Simplified `LiteralExpressionProtoConverter.toCatalystExpression()`**: Replaced the large switch statement (86 lines) with a clean 3-line implementation that leverages existing conversion utilities
2. **Added TIME type support**: Added missing TIME literal type conversion in `LiteralValueProtoConverter.toScalaValue()`
### Why are the changes needed?
1. **Type conversion issues**: Some complex nested data structures (e.g., arrays of case classes) failed to convert to Catalyst's internal representation when using `expressions.Literal.create(...)`.
2. **Inconsistent behaviors**: Differences in behavior between Spark Connect and classic Spark for the same data types (e.g., Decimal).
### Does this PR introduce _any_ user-facing change?
**Yes** - Users can now successfully use `typedLit` with an array of case classes. Previously, attempting to use `typedlit(Array(CaseClass(1, "a")))` would fail (please see the code piece below for details), but now it works correctly and converts case classes to proper struct representations.
```scala
import org.apache.spark.sql.functions.typedlit
case class CaseClass(a: Int, b: String)
spark.sql("select 1").select(typedlit(Array(CaseClass(1, "a")))).collect()
// Below is the error message:
"""
org.apache.spark.SparkIllegalArgumentException: requirement failed: Literal must have a corresponding value to array<struct<a:int,b:string>>, but class GenericArrayData found.
scala.Predef$.require(Predef.scala:337)
org.apache.spark.sql.catalyst.expressions.Literal$.validateLiteralValue(literals.scala:306)
org.apache.spark.sql.catalyst.expressions.Literal.<init>(literals.scala:456)
org.apache.spark.sql.catalyst.expressions.Literal$.create(literals.scala:206)
org.apache.spark.sql.connect.planner.LiteralExpressionProtoConverter$.toCatalystExpression(LiteralExpressionProtoConverter.scala:103)
"""
```
Besides, some catalyst values (e.g., Decimal 89.97620 -> 89.976200000000000000) have changed. Please see the changes in `explain-results/` for details.
```scala
import org.apache.spark.sql.functions.typedlit
spark.sql("select 1").select(typedlit(BigDecimal(8997620, 5)),typedlit(Array(BigDecimal(8997620, 5), BigDecimal(8997621, 5)))).explain()
// Current explain() output:
"""
Project [89.97620 AS 89.97620#819, [89.97620,89.97621] AS ARRAY(89.97620BD, 89.97621BD)#820]
"""
// Expected explain() output (i.e., same as the classic mode):
"""
Project [89.976200000000000000 AS 89.976200000000000000#132, [89.976200000000000000,89.976210000000000000] AS ARRAY(89.976200000000000000BD, 89.976210000000000000BD)#133]
"""
```
### How was this patch tested?
`build/sbt "connect-client-jvm/testOnly org.apache.spark.sql.PlanGenerationTestSuite"`
`build/sbt "connect/testOnly org.apache.spark.sql.connect.ProtoToParsedPlanTestSuite"`
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Cursor 1.4.5
Closes apache#52188 from heyihong/SPARK-53438.
Authored-by: Yihong He <heyihong.cn@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
huangxiaopingRD pushed a commit that referenced this pull request on Nov 25, 2025
See Commits and Changes for more details.
Created by pull[bot]. Can you help keep this open source service alive? 💖 Please sponsor : )