Conversation

@pull pull bot commented Oct 12, 2023

See Commits and Changes for more details.



allisonwang-db and others added 4 commits October 12, 2023 13:05
### What changes were proposed in this pull request?

This PR refines the docstring of `DataFrame.show` by adding more examples.

### Why are the changes needed?

To improve the PySpark documentation.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

doctest

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #43252 from allisonwang-db/spark-45442-refine-show.

Authored-by: allisonwang-db <allison.wang@databricks.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
### What changes were proposed in this pull request?

The following XML with rowTag `book` will yield a schema with just the `_id` column, dropping the element's value:

```
 <p><book id="1">Great Book</book> </p>
```

Let's parse the value as well. The scope of this PR is to keep the row tag's `valueTag` behavior consistent with that of inner objects.
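
A minimal sketch of the intended behavior, assuming the native XML source and its default attribute prefix `_` and value-column name `_VALUE` (the column name and inferred types here are assumptions, not output copied from the PR):

```scala
// Hypothetical books.xml containing: <p><book id="1">Great Book</book></p>
val df = spark.read
  .option("rowTag", "book")
  .xml("books.xml")

df.printSchema()
// Before this PR: only the attribute column `_id` survives.
// After this PR: the element text is kept as well, e.g.
// root
//  |-- _VALUE: string (nullable = true)
//  |-- _id: long (nullable = true)
```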

### Why are the changes needed?

The semantics for attributes and `valueTag` should be consistent

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #43319 from shujingyang-db/rootlevel-valuetag.

Lead-authored-by: Shujing Yang <shujing.yang@databricks.com>
Co-authored-by: Shujing Yang <135740748+shujingyang-db@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…ala.collection.mutable.Growable`

### What changes were proposed in this pull request?
Since Scala 2.13.0, `scala.collection.generic.Growable` has been marked as deprecated. This PR changes it to `scala.collection.mutable.Growable`.
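
A small sketch of the swap; only the import changes, since the mutable trait exposes the same `+=`/`++=` operations:

```scala
import scala.collection.mutable.{ArrayBuffer, Growable}

val buf = ArrayBuffer.empty[Int]
val g: Growable[Int] = buf // previously typed as the deprecated scala.collection.generic.Growable
g += 1
g ++= Seq(2, 3)
assert(buf == ArrayBuffer(1, 2, 3))
```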

### Why are the changes needed?
Remove a deprecated API.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43347 from Hisoka-X/SPARK-45510-replace-growable.

Authored-by: Jia Fan <fanjiaeminem@qq.com>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
…n-Spark object on Spark Connect

### What changes were proposed in this pull request?

This PR proposes to raise a proper exception for `ps.sql` with a Pandas-on-Spark DataFrame on Spark Connect.

### Why are the changes needed?

To improve the error message.

### Does this PR introduce _any_ user-facing change?

No API change, but it improves the error message.
**Before**
```python
>>> psdf = ps.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> ps.sql("SELECT {col}, {col2} FROM {tbl}", col=psdf.A, col2=psdf.B, tbl=psdf)
Traceback (most recent call last):
...
pyspark.errors.exceptions.connect.AnalysisException: [TABLE_OR_VIEW_NOT_FOUND] The table or view `_pandas_api_32aa6c7b33ac442bab790cfb49f65ca1` cannot be found. Verify the spelling and correctness of the schema and catalog.
If you did not qualify the name with a schema, verify the current_schema() output, or qualify the name with the correct schema and catalog.
To tolerate the error on drop use DROP VIEW IF EXISTS or DROP TABLE IF EXISTS.; line 1 pos 17;
'Project ['A, 'B]
+- 'UnresolvedRelation [_pandas_api_32aa6c7b33ac442bab790cfb49f65ca1], [], false

JVM stacktrace:
org.apache.spark.sql.catalyst.ExtendedAnalysisException: [TABLE_OR_VIEW_NOT_FOUND] The table or view `_pandas_api_32aa6c7b33ac442bab790cfb49f65ca1` cannot be found. Verify the spelling and correctness of the schema and catalog.
...
```

**After**
```python
>>> psdf = ps.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> ps.sql("SELECT {col}, {col2} FROM {tbl}", col=psdf.A, col2=psdf.B, tbl=psdf)
Traceback (most recent call last):
...
pyspark.errors.exceptions.base.PySparkTypeError: [UNSUPPORTED_DATA_TYPE] Unsupported DataType `DataFrame`.
```

### How was this patch tested?

The existing CI should pass

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43237 from itholic/SPARK-43664.

Lead-authored-by: Haejoon Lee <haejoon.lee@databricks.com>
Co-authored-by: Haejoon Lee <44108233+itholic@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?

Due to a quirk in the parser, in some cases `IDENTIFIER(<funcStr>)(<arg>)` is not properly recognized as a function invocation.

The change is to remove the explicit IDENTIFIER-clause rule from the function invocation grammar and instead recognize
`IDENTIFIER(<arg>)` within `visitFunctionCall` (see the example below).
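
For illustration, the shape that now resolves correctly, using `upper` as an arbitrary built-in (the output is what one would expect, not copied from the PR):

```scala
spark.sql("SELECT IDENTIFIER('upper')('hello')").show()
// +------------+
// |upper(hello)|
// +------------+
// |       HELLO|
// +------------+
```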

### Why are the changes needed?

Function invocation support for IDENTIFIER is incomplete otherwise.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added new test cases to identifier-clause.sql.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #42888 from srielau/SPARK-45132.

Lead-authored-by: srielau <serge@rielau.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Upgrade Apache Kafka from 3.4.1 to 3.6.0

### Why are the changes needed?

- https://downloads.apache.org/kafka/3.6.0/RELEASE_NOTES.html
- https://downloads.apache.org/kafka/3.5.1/RELEASE_NOTES.html
- https://archive.apache.org/dist/kafka/3.5.0/RELEASE_NOTES.html

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
GitHub CI.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43348 from dengziming/kafka-3.6.0.

Authored-by: dengziming <dengziming1993@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…onversion

### What changes were proposed in this pull request?
This PR changes Java code to use pattern matching for type checking and conversion instead of explicit type-cast statements.

The change follows [JEP 394](https://openjdk.org/jeps/394); this PR does not include parts of the `hive-thriftserver` module.

Example:

```java
  if (obj instanceof String) {
    String str = (String) obj;
    System.out.println(str);
  }
```
Can be replaced with
```java
  if (obj instanceof String str) {
    System.out.println(str);
  }
```

### Why are the changes needed?
Using `JEP 394: Pattern Matching for instanceof` can bring the following benefits:

1. **Code conciseness**: By eliminating explicit type conversion and redundant variable declarations, the code becomes more concise and easy to read.
2. **Improved safety**: In the past, explicit type conversion was required, and if accidentally converted to the wrong type, a `ClassCastException` would be thrown at runtime. Now, as type checking and type conversion occur in the same step, such errors are no longer possible.
3. **Better semantics**: Previously, instanceof and type casting were two independent steps, which could lead to unclear code intentions. Now, these two steps are merged into one, making the intentions of the code clearer.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43327 from LuciferYang/jep-394.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
@github-actions github-actions bot added the CORE label Oct 12, 2023
LuciferYang and others added 2 commits October 13, 2023 01:24
…ead of `ReferenceQueue#poll`

### What changes were proposed in this pull request?
This PR replaces `refQueue.poll()` with `refQueue.remove()` in the test case `reference to sub iterator should not be available after completion` to ensure that a `PhantomReference` object can be retrieved from `refQueue`.

### Why are the changes needed?
#43325 replaces `Reference#isEnqueued` with `Reference#refersTo(null)` to eliminate the use of deprecated APIs.

However, there are some differences between `ref.isEnqueued` and `ref.refersTo(null)`.

- The `ref.isEnqueued` method checks whether this `PhantomReference` has been added to its reference queue by the garbage collector. When the garbage collector decides to reclaim an object, any `PhantomReference`s to it are added to their reference queues. So if `ref.isEnqueued` returns `true`, this `PhantomReference` has been added to the reference queue, which means the object it references has been reclaimed.

- The `ref.refersTo(null)` method checks whether this `PhantomReference` refers to the given object. In the current code, `ref.refersTo(null)` is used to check whether `ref` still refers to `sub`. If it returns `true`, `ref` no longer refers to `sub`, which means `sub` may have been reclaimed by the garbage collector; it does not, however, mean that `ref` has been added to the reference queue yet.

So we can see the following test failure in GA:

https://github.com/apache/spark/actions/runs/6484510414/job/17608536854

```
[info] - reference to sub iterator should not be available after completion *** FAILED *** (287 milliseconds)
[info]   null did not equal java.lang.ref.PhantomReference@11e8f090 (CompletionIteratorSuite.scala:67)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
[info]   at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
[info]   at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
[info]   at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
[info]   at org.apache.spark.util.CompletionIteratorSuite.$anonfun$new$3(CompletionIteratorSuite.scala:67)
```

To solve this issue, this PR replaces `refQueue.poll()` with `refQueue.remove()` to allow for waiting until `ref` is put into `refQueue` and can be retrieved from `refQueue`.
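
A minimal sketch of the difference between the two calls (not the suite's actual code):

```scala
import java.lang.ref.{PhantomReference, ReferenceQueue}

val refQueue = new ReferenceQueue[AnyRef]
var sub: AnyRef = new AnyRef
val ref = new PhantomReference[AnyRef](sub, refQueue)

sub = null
System.gc()

// poll() returns immediately, and returns null if the GC has not enqueued `ref` yet.
// remove(timeout) blocks until `ref` is enqueued (or the timeout expires),
// which is what makes the assertion deterministic.
assert(refQueue.remove(10000L) == ref)
```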

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43345 from LuciferYang/ref-remove.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
…essage

### What changes were proposed in this pull request?

- Include QueryContext in SparkThrowable proto message
- Reconstruct QueryContext for SparkThrowable exceptions on the client side

### Why are the changes needed?

- Better integration with the error framework

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

`build/sbt "connect-client-jvm/testOnly *ClientE2ETestSuite"`

### Was this patch authored or co-authored using generative AI tooling?

Closes #43352 from heyihong/SPARK-45516.

Lead-authored-by: Yihong He <yihong.he@databricks.com>
Co-authored-by: Yihong He <heyihong.cn@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
allisonwang-db and others added 3 commits October 12, 2023 17:02
### What changes were proposed in this pull request?

Currently, the `analyzeInPython` method in the `UserDefinedPythonTableFunction` object starts a Python process on the driver and runs a Python function in that process. This PR refactors this logic into a reusable runner class.

### Why are the changes needed?

To make the code more reusable.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #43340 from allisonwang-db/spark-45505-refactor-analyze-in-py.

Authored-by: allisonwang-db <allison.wang@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
…alias to current_schema()

### What changes were proposed in this pull request?

Change the column alias for `current_database()` to `current_schema()`.

### Why are the changes needed?

To better align with the preferred usage of "schema" rather than "database" in three-part namespaces.

### Does this PR introduce _any_ user-facing change?

Yes, `current_database()` column alias is now `current_schema()`.
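
Sketched effect, assuming a `SparkSession` named `spark` (the column name is the point; the value depends on the session):

```scala
spark.sql("SELECT current_database()").columns
// Before: Array(current_database())
// After:  Array(current_schema())
```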

### How was this patch tested?

Unit tests pass.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43235 from michaelzhan-db/SPARK-45418.

Authored-by: Michael Zhang <m.zhang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
1. Add a new private `compressed` method that takes a given `nnz`, since it is sometimes already known.
2. Minor change: `Array.range(0, length)` -> `Iterator.range(0, length)` to avoid an array creation.

### Why are the changes needed?
In `VectorAssembler`, `nnz` is already known before vector construction, so the scan to compute it can be skipped (see the sketch below).
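
A sketch of the idea only; the real method is private to MLlib, and the dense-vs-sparse threshold below mirrors the one used by `Vector.compressed` but is restated here as an assumption:

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}

// If nnz was counted while `values` was being assembled, no extra scan is needed:
def compressedWithKnownNnz(nnz: Int, values: Array[Double]): Vector = {
  val size = values.length
  if (1.5 * (nnz + 1.0) < size) Vectors.dense(values).toSparse // sparse is smaller here
  else Vectors.dense(values)
}
```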

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
CI.

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #43353 from zhengruifeng/ml_vec_opt.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
LuciferYang and others added 2 commits October 13, 2023 08:49
… the regular `switch` statement

### What changes were proposed in this pull request?
This PR uses enhanced `switch` expressions to replace regular `switch` statements in Spark's Java code, per [JEP 361](https://openjdk.org/jeps/361).

Example:

```java
double getPrice(String fruit) {
  switch (fruit) {
    case "Apple":
      return 1.0;
    case "Orange":
      return 1.5;
    case "Mango":
      return 2.0;
    default:
      throw new IllegalArgumentException();
   }
 }
```

Can be changed to

```java
double getPrice(String fruit) {
  return switch (fruit) {
    case "Apple" -> 1.0;
    case "Orange" -> 1.5;
    case "Mango" -> 2.0;
    default -> throw new IllegalArgumentException();
  };
}
```

This PR does not include parts of the `hive-thriftserver` module.

### Why are the changes needed?
Using `JEP 361: Switch Expressions` can bring the following benefits:

1. **More concise syntax**: `switch` can be used as an expression, not just a statement. This makes the code more concise and easier to read.

2. **Safer**: In `switch` expressions, if we forget the `break`, there will be no unexpected `fall-through` behavior. At the same time, the compiler will check whether all possible cases are covered. If not all cases are covered, the compiler will report an error.

3. **Easier to understand**: The new `switch` expression syntax is closer to our decision-making pattern in daily life, making the code easier to understand.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43349 from LuciferYang/jep-361.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
…ythonUDTF when the table argument is specified as a named argument

### What changes were proposed in this pull request?

This is a follow-up of #43042.

Fix to resolve UnresolvedPolymorphicPythonUDTF when the table argument is specified as a named argument.

### Why are the changes needed?

The Python UDTF analysis result was not applied when the table argument is specified as a named argument.

For example, for the following UDTF:

```py
# Imports for this snippet; module paths assumed for recent PySpark:
from pyspark.sql import Row
from pyspark.sql.functions import udtf
from pyspark.sql.types import IntegerType, StructType
from pyspark.sql.udtf import AnalyzeResult, OrderingColumn

@udtf
class TestUDTF:
    def __init__(self):
        self._count = 0
        self._sum = 0
        self._last = None

    @staticmethod
    def analyze(*args, **kwargs):
        return AnalyzeResult(
            schema=StructType()
            .add("count", IntegerType())
            .add("total", IntegerType())
            .add("last", IntegerType()),
            with_single_partition=True,
            order_by=[OrderingColumn("input"), OrderingColumn("partition_col")],
        )

    def eval(self, row: Row):
        # Make sure that the rows arrive in the expected order.
        if self._last is not None and self._last > row["input"]:
            raise Exception(
                f"self._last was {self._last} but the row value was {row['input']}"
            )
        self._count += 1
        self._last = row["input"]
        self._sum += row["input"]

    def terminate(self):
        yield self._count, self._sum, self._last

spark.udtf.register("test_udtf", TestUDTF)
```

The following query shows a wrong result:

```py
>>> spark.sql("""
...     WITH t AS (
...       SELECT id AS partition_col, 1 AS input FROM range(1, 21)
...       UNION ALL
...       SELECT id AS partition_col, 2 AS input FROM range(1, 21)
...     )
...     SELECT count, total, last
...     FROM test_udtf(row => TABLE(t))
...     ORDER BY 1, 2
... """).show()
+-----+-----+----+
|count|total|last|
+-----+-----+----+
|    1|    1|   1|
|    1|    1|   1|
|    1|    1|   1|
|    1|    1|   1|
|    1|    1|   1|
|    1|    1|   1|
|    1|    1|   1|
|    1|    1|   1|
|    1|    1|   1|
|    1|    1|   1|
|    1|    1|   1|
|    1|    1|   1|
|    1|    2|   2|
|    1|    2|   2|
|    1|    2|   2|
|    1|    2|   2|
|    1|    2|   2|
|    1|    2|   2|
|    1|    2|   2|
|    1|    2|   2|
+-----+-----+----+
only showing top 20 rows
```

That should equal the result without the named argument:

```py
>>> spark.sql("""
...     WITH t AS (
...       SELECT id AS partition_col, 1 AS input FROM range(1, 21)
...       UNION ALL
...       SELECT id AS partition_col, 2 AS input FROM range(1, 21)
...     )
...     SELECT count, total, last
...     FROM test_udtf(TABLE(t))
...     ORDER BY 1, 2
... """).show()
+-----+-----+----+
|count|total|last|
+-----+-----+----+
|   40|   60|   2|
+-----+-----+----+
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Modified the related tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43355 from ueshin/issues/SPARK-45266/fix.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
mayurdb and others added 4 commits October 13, 2023 10:17
### What changes were proposed in this pull request?
With [SPARK-45182](https://issues.apache.org/jira/browse/SPARK-45182), we added a fix to prevent laggard tasks from older attempts of an indeterminate stage from marking the partition as completed in the map output tracker.

When a task is completed, the DAG scheduler also notifies all the task sets of the stage about that partition being completed. Task sets will not schedule such tasks if they are not already scheduled. This is not correct for an indeterminate stage, since we want to re-run all the tasks on a re-attempt.

### Why are the changes needed?
Since the partition is not completed by the older attempts and the partition from the newer attempt also doesn't get scheduled, the stage has to be rescheduled to complete that partition. Since the stage is indeterminate, all of its partitions will be recomputed.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added a check to an existing unit test.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43326 from mayurdb/indeterminateFix.

Authored-by: mayurb <mayurb@uber.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…iter.options to take a dictionary

### What changes were proposed in this pull request?

This PR proposes to add an example of `DataFrameReader/Writer.options` taking a dictionary.

### Why are the changes needed?

For users to know how to set options with a dictionary in PySpark (a Scala analogue is sketched below).
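
For comparison, the Scala analogue of the documented Python dictionary form (the file name is a placeholder):

```scala
val df = spark.read
  .options(Map("header" -> "true", "inferSchema" -> "true"))
  .csv("data.csv")
```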

### Does this PR introduce _any_ user-facing change?

Yes, it adds an example of setting the options with a dictionary.

### How was this patch tested?

Existing doctests in this PR's CI.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43357

Closes #43358 from HyukjinKwon/SPARK-45528.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…NNAMED" so Platform can access Cleaner on Java 9+

### What changes were proposed in this pull request?

This PR adds `--add-opens=java.base/jdk.internal.ref=ALL-UNNAMED` to our JVM flags so that we can access `jdk.internal.ref.Cleaner` in JDK 9+.

### Why are the changes needed?

This allows Spark to allocate direct memory while ignoring the JVM's MaxDirectMemorySize limit. Spark uses JDK internal APIs to directly construct DirectByteBuffers while bypassing that limit, but there is a fallback path at https://github.com/apache/spark/blob/v3.5.0/common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java#L213 that is used if we cannot reflectively access the `Cleaner` API.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added a unit test in `PlatformUtilSuite`.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43344 from JoshRosen/SPARK-45508.

Authored-by: Josh Rosen <joshrosen@databricks.com>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
### What changes were proposed in this pull request?

This PR restores the [Protobuf Data Source Guide](https://spark.apache.org/docs/latest/sql-data-sources-protobuf.html#python)'s code tabs, which #40614 removed for markdown syntax fixes.

In this PR, we introduce a hidden div to hold the markdown code-block marker, making both Liquid and markdown happy.

### Why are the changes needed?

Improve doc readability and consistency.

### Does this PR introduce _any_ user-facing change?

Yes, a doc change.

### How was this patch tested?

#### Doc build

![image](https://github.com/apache/spark/assets/8326978/8aefeee0-92b2-4048-a3f6-108e4c3f309d)

#### markdown editor and view

![image](https://github.com/apache/spark/assets/8326978/283b0820-390a-4540-8713-647c40f956ac)

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #43361 from yaooqinn/SPARK-45532.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
@github-actions github-actions bot added the DOCS label Oct 13, 2023
LuciferYang and others added 6 commits October 13, 2023 22:55
### What changes were proposed in this pull request?
This PR lowers the default `-Xmx` of `build/mvn` from 4g to 3g to reduce the peak memory usage of Maven compilation.

### Why are the changes needed?
This can potentially fix the failing snapshot build: https://github.com/apache/spark/actions/runs/6502277099

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- Manual check.

Run:
```
build/mvn clean install -DskipTests -Pyarn -Pkubernetes -Pvolcano -Phive -Phive-thriftserver -Pspark-ganglia-lgpl -Phadoop-cloud
```

**Before**

Peak memory usage is at 6.1GB.

**After**

Peak memory usage is at 5GB, but the compilation time has increased by 10%.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43364 from LuciferYang/r-xmx-3g.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
… cluster when dynamic allocation disabled

### What changes were proposed in this pull request?
This PR is a follow-up of #37268, which supports stage-level task resource profiles for standalone clusters when dynamic allocation is disabled. This PR enables stage-level task resource profiles for Kubernetes clusters.

### Why are the changes needed?

Users who run Spark ML/DL workloads on Kubernetes would expect the stage-level task resource profile feature (see the sketch below).
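
For context, a task-only resource profile looks roughly like this, assuming an active `SparkContext` named `sc` (the amounts are illustrative):

```scala
import org.apache.spark.resource.{ResourceProfileBuilder, TaskResourceRequests}

val treqs = new TaskResourceRequests().cpus(1).resource("gpu", 1.0)
val rp = new ResourceProfileBuilder().require(treqs).build()

// Run a stage with the per-task requirements above:
sc.parallelize(1 to 100, 4).withResources(rp).map(_ * 2).collect()
```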

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

The current tests of #37268 also cover this PR, since both Kubernetes and standalone clusters share the same TaskSchedulerImpl class, which implements this feature. Apart from that, the existing test is modified to cover the Kubernetes cluster, and I also performed some manual tests, which are documented in the comments.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43323 from wbo4958/k8s-stage-level.

Authored-by: Bobby Wang <wbo4958@gmail.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
### What changes were proposed in this pull request?

This adds support for trust store reloading, mirroring the Hadoop implementation (see the source comments for a link). I believe reusing the existing code instead of adding a dependency is fine license-wise (see https://github.com/apache/spark/pull/42685/files#r1333667328).

### Why are the changes needed?

This helps us refresh trust stores without downtime.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added unit tests (also copied from upstream)

```
build/sbt
> project network-common
> testOnly org.apache.spark.network.ssl.ReloadingX509TrustManagerSuite
```

The rest of the changes and integration were tested as part of #42685

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #43249 from hasnain-db/spark-tls-reloading.

Authored-by: Hasnain Lakhani <hasnain.lakhani@databricks.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
### What changes were proposed in this pull request?
This PR replaces TEMP error message 3007 and fills in many missing SQLSTATEs.

### Why are the changes needed?

This is part of the ongoing effort to switch to the new error framework.

### Does this PR introduce _any_ user-facing change?

Yes, error docs will now show more SQLSTATEs.

### How was this patch tested?

Existing QA suite was run.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43342 from srielau/SPARK-45487-Fix-temp-errors.

Authored-by: srielau <serge@rielau.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
…2Zipped` to `scala.collection.LazyZip2`

### What changes were proposed in this pull request?
Since Scala 2.13.0, `scala.runtime.Tuple2Zipped` has been marked as deprecated and `scala.collection.LazyZip2` is recommended instead.
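
The replacement pattern, sketched for both the 2- and 3-arity forms:

```scala
val xs = Seq(1, 2, 3)
val ys = Seq("a", "b", "c")
val zs = Seq(true, false, true)

// Deprecated since 2.13: (xs, ys).zipped.map(...)
val pairs = xs.lazyZip(ys).map((n, s) => s * n)                       // LazyZip2
val triples = xs.lazyZip(ys).lazyZip(zs).map((n, s, b) => (n, s, b)) // LazyZip3
```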

### Why are the changes needed?
Replace `scala.runtime.Tuple2Zipped` with `scala.collection.LazyZip2`.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing test cases.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #43351 from beliefer/SPARK-45513.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Jiaan Geng <beliefer@163.com>
…dsToMetricType

### What changes were proposed in this pull request?

This PR aims to reduce the memory consumption of `LiveStageMetrics.accumIdsToMetricType`, which should help to reduce driver memory usage when running complex SQL queries that contain many operators and run many jobs.

In SQLAppStatusListener, the LiveStageMetrics.accumIdsToMetricType field holds a map which is used to look up the type of accumulators in order to perform conditional processing of a stage’s metrics.

Currently, that field is derived from `LiveExecutionData.metrics`, which contains metrics for _all_ operators used anywhere in the query. Whenever a job is submitted, we construct a fresh map containing all metrics that have ever been registered for that SQL query. If a query runs a single job, this isn't an issue: in that case, all `LiveStageMetrics` instances will hold the same immutable `accumIdsToMetricType`.

The problem arises if we have a query that runs many jobs (e.g. a complex query with many joins which gets divided into many jobs due to AQE): in that case, each job submission results in a new `accumIdsToMetricType` map being created.

This PR fixes this by changing `accumIdsToMetricType` to be a `mutable.HashMap` which is shared across all `LiveStageMetrics` instances belonging to the same `LiveExecutionData`.

The modified classes are `private` and are used only in SQLAppStatusListener, so I don't think this change poses any realistic risk of binary incompatibility for third-party code.
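
A simplified sketch of the sharing pattern (class and field shapes abbreviated, not the exact Spark code):

```scala
import scala.collection.mutable

// One map per SQL execution, updated as new metrics are registered...
class LiveExecutionData {
  val accumIdsToMetricType: mutable.HashMap[Long, String] = mutable.HashMap.empty
}

// ...and every per-stage view of the same query shares that single instance,
// instead of snapshotting a fresh immutable map on each job submission.
class LiveStageMetrics(val accumIdsToMetricType: mutable.HashMap[Long, String])
```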

### Why are the changes needed?

Addresses one contributing factor behind high driver memory / OOMs when executing complex queries.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing unit tests.

To demonstrate the memory reduction, I performed manual benchmarking and heap dump inspection using a benchmark that ran copies of a complex query: each test query launches ~200 jobs (so at least 200 stages) and contains ~3800 total operators, resulting in a huge number of metric accumulators. Prior to this PR's fix, ~3700 LiveStageMetrics instances (from multiple concurrent runs of the query) consumed a combined ~3.3 GB of heap. After this PR's fix, I observed negligible memory usage from LiveStageMetrics.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43250 from JoshRosen/reduce-accum-ids-to-metric-type-mem-overhead.

Authored-by: Josh Rosen <joshrosen@databricks.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
hasnain-db and others added 15 commits October 14, 2023 02:36
…portConf

### What changes were proposed in this pull request?

This PR adds the options introduced in #43220 to `SSLOptions` and `SparkTransportConf`.

By adding it to the `SSLOptions` we can support inheritance of options, so settings for the UI and RPC SSL settings can be shared as much as possible. The `SparkTransportConf` changes are needed to support propagating these settings.

I also make some changes to `SecurityManager` to log when this feature is enabled, and make the existing `spark.network.crypto` options mutually exclusive with these new settings (enabling both would just mean double encryption).

Lastly, make these flags propagate down to when executors are launched, and allow the passwords to be sent via environment variables (similar to how it's done for an existing secret). This ensures they are not visible in plaintext, but are still available at executor startup (otherwise the executor can never talk to the driver/worker).

### Why are the changes needed?

The propagation of these options are needed for the RPC functionality to work

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

CI

Added some unit tests which I verified passed:

```
build/sbt
> project core
> testOnly org.apache.spark.SparkConfSuite org.apache.spark.SSLOptionsSuite org.apache.spark.SecurityManagerSuite org.apache.spark.deploy.worker.CommandUtilsSuite
```

The rest of the changes and integration were tested as part of #42685

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #43238 from hasnain-db/spark-tls-ssloptions.

Authored-by: Hasnain Lakhani <hasnain.lakhani@databricks.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
… for `NioBufferedFileInputStream`

### What changes were proposed in this pull request?
This PR changes `NioBufferedFileInputStream` to use `java.lang.ref.Cleaner` instead of `finalize()`. Both are protective measures for resource cleanup, but the `finalize()` method has been deprecated since Java 9 and will be removed in the future; `java.lang.ref.Cleaner` is the recommended replacement.

https://github.com/openjdk/jdk/blob/dfacda488bfbe2e11e8d607a6d08527710286982/src/java.base/share/classes/java/lang/Object.java#L546-L568

```
     * @deprecated The finalization mechanism is inherently problematic.
     * Finalization can lead to performance issues, deadlocks, and hangs.
     * Errors in finalizers can lead to resource leaks; there is no way to cancel
     * finalization if it is no longer necessary; and no ordering is specified
     * among calls to {@code finalize} methods of different objects.
     * Furthermore, there are no guarantees regarding the timing of finalization.
     * The {@code finalize} method might be called on a finalizable object
     * only after an indefinite delay, if at all.
     *
     * Classes whose instances hold non-heap resources should provide a method
     * to enable explicit release of those resources, and they should also
     * implement {@link AutoCloseable} if appropriate.
     * The {@link java.lang.ref.Cleaner} and {@link java.lang.ref.PhantomReference}
     * provide more flexible and efficient ways to release resources when an object
     * becomes unreachable.
     *
     * @throws Throwable the {@code Exception} raised by this method
     * @see java.lang.ref.WeakReference
     * @see java.lang.ref.PhantomReference
     * @jls 12.6 Finalization of Class Instances
     */
    @Deprecated(since="9")
    protected void finalize() throws Throwable { }
```
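
The general shape of the `Cleaner` pattern, sketched with hypothetical names (not the exact change to `NioBufferedFileInputStream`); the key pitfall is that the cleanup action must not capture `this`, or the object can never become phantom-reachable:

```scala
import java.lang.ref.Cleaner
import java.nio.ByteBuffer

object CleanerHolder { val cleaner: Cleaner = Cleaner.create() }

final class BufferedStream extends AutoCloseable {
  private val buf = ByteBuffer.allocateDirect(8192)
  private val cleanable = {
    val b = buf // capture the buffer locally, not `this`
    CleanerHolder.cleaner.register(this, () => BufferedStream.freeBuffer(b))
  }
  // Explicit close remains the primary path; the Cleaner is only a safety net.
  override def close(): Unit = cleanable.clean()
}

object BufferedStream {
  private def freeBuffer(b: ByteBuffer): Unit = { /* release the underlying resource */ }
}
```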

### Why are the changes needed?
Clean up deprecated api usage

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43333 from LuciferYang/nio-buffered-fis-cleaner.

Lead-authored-by: yangjie01 <yangjie01@baidu.com>
Co-authored-by: YangJie <yangjie01@baidu.com>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
### What changes were proposed in this pull request?

This PR adds helper classes for SSL RPC communication that are needed to work around the fact that `netty` does not support zero-copy transfers.

These mirror the existing `MessageWithHeader` and `MessageEncoder` classes with very minor differences. But the differences were just enough that it didn't seem easy to refactor/consolidate, and since we don't expect these classes to change much, I hope it's OK.

### Why are the changes needed?

These are needed to support transferring `ManagedBuffer`s into a form that can be transferred by `netty` over the network, since netty's encryption support does not support zero-copy transfers.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added unit tests

```
build/sbt
> project network-common
> testOnly org.apache.spark.network.protocol.EncryptedMessageWithHeaderSuite
```

The rest of the changes and integration were tested as part of #42685

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #43244 from hasnain-db/spark-tls-helpers.

Authored-by: Hasnain Lakhani <hasnain.lakhani@databricks.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
…to 3g"

This reverts commit 3e2470d.

### What changes were proposed in this pull request?
This pr revert change of #43364.

### Why are the changes needed?
It seems to have no effect on fixing `Publish snapshot`; it still fails:
- https://github.com/apache/spark/actions/runs/6514229181/job/17696846279

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43372 from LuciferYang/revert-SPARK-45536.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
### What changes were proposed in this pull request?

There are redundant parameters in `org.apache.spark.sql.kafka010.KafkaWriter#validateQuery` and `org.apache.spark.sql.kafka010.KafkaWriter#write`; we can remove them.
### Why are the changes needed?

They are not used; removing them makes the code more concise.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

Existing tests cover it.

Closes #42198 from zhaomin1423/fix_kafka.

Authored-by: zhaomin <zhaomin1423@163.com>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
### What changes were proposed in this pull request?

1. Make the add_artifact request idempotent, i.e. subsequent requests succeed if the same content is provided. This makes retrying safer.
2. Fix the existing error handling mechanism:

Before the update, the error looked like this:
```
>>> spark.addArtifact("tmp.py", pyfile=True)
>>> spark.addArtifact("tmp.py", pyfile=True) # fails
2023-10-09 15:55:30,352 82873 DEBUG __iter__ Will retry call after 60014.279746934524 ms sleep (error: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNKNOWN
        details = ""
        debug_error_string = "UNKNOWN:Error received from peer  {grpc_message:"", grpc_status:2, created_time:"2023-10-09T15:55:30.351541+02:00"}"
>)
(this is also getting retried)
```

Now it looks like this:

```
>>> spark.addArtifact("abc.sh", file=True)
>>> spark.addArtifact("abc.sh", file=True) # passes
>>> # update file's content
>>> spark.addArtifact("abc.sh", file=True) # now fails
Traceback (most recent call last):
[...]
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.ALREADY_EXISTS
        details = "Duplicate Artifact: files/abc.sh. Artifacts cannot be overwritten."
        debug_error_string = "UNKNOWN:Error received from peer  {grpc_message:"Duplicate Artifact: files/abc.sh. Artifacts cannot be overwritten.", grpc_status:6, created_time:"2023-10-10T01:25:38.231317+02:00"}"
>

```

### Why are the changes needed?

Makes retrying more robust and adds a user-friendly error (see above).

### Does this PR introduce _any_ user-facing change?

Mostly internal improvements.

### How was this patch tested?
Unit testing, plus testing against the server.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #43314 from cdkrot/SPARK-45485.

Authored-by: Alice Sayutina <alice.sayutina@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…ing `resolveColumnsByPosition`

### What changes were proposed in this pull request?

This PR proposes to raise the exception directly instead of calling `resolveColumnsByPosition`.

### Why are the changes needed?

We can directly throw the error when resolving output columns, instead of calling `resolveColumnsByPosition` again.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The existing CI should pass.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #42762 from itholic/SPARK-42309-followup.

Lead-authored-by: Haejoon Lee <haejoon.lee@databricks.com>
Co-authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…ala.collection.LazyZip3`

### What changes were proposed in this pull request?
Since Scala 2.13.0, `scala.runtime.Tuple3Zipped` has been marked as deprecated and `scala.collection.LazyZip3` is recommended instead.

### Why are the changes needed?
Replace `scala.runtime.Tuple3Zipped` with `scala.collection.LazyZip3`.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing test cases.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #43363 from beliefer/SPARK-45514.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Jiaan Geng <beliefer@163.com>
### What changes were proposed in this pull request?

Pull up correlated subquery predicates in joins, and rewrite them into ExistenceJoins if they are not pushed down into the join inputs.

### Why are the changes needed?

This change allows correlated IN and EXISTS subqueries in join conditions (see the example below). This is valid SQL that was not yet supported by Spark SQL.
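
An example of the newly supported shape (tables `t1`, `t2`, `t3` are hypothetical):

```scala
spark.sql("""
  SELECT t1.a, t2.b
  FROM t1 JOIN t2
    ON t1.a = t2.a
   AND EXISTS (SELECT 1 FROM t3 WHERE t3.x = t1.a) -- correlated to the left child
""")
```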

### Does this PR introduce _any_ user-facing change?

Yes, previously unsupported queries become supported.

### How was this patch tested?

Added SQL tests for IN and EXISTS in join conditions, and cross-checked correctness with Postgres (except for ANTI joins, which are not supported in Postgres).

Permutations of the tests:
1. Exists / Not exists / in / not in
2. Subquery references left child / right child
3. Join type: inner / left outer
4. Transitive predicates to try invoking filter inference

Closes #42725 from andylam-db/correlated-subquery-in-join-cond.

Authored-by: Andy Lam <andy.lam@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
… is required for streaming aggregation queries in append mode

### What changes were proposed in this pull request?
Add an assert and a log message to indicate that a watermark definition is required for streaming aggregation queries in append mode.

### Why are the changes needed?
We have a check, based on the UnsupportedOperationChecker, for ensuring that watermark attributes are specified in append mode. However, in some cases we got reports where users hit this stack trace:

```
org.apache.spark.SparkException: Exception thrown in awaitResult: Job aborted due to stage failure: Task 3 in stage 32.0 failed 4 times, most recent failure: Lost task 3.3 in stage 32.0 (TID 606) (10.5.71.29 executor 0): java.util.NoSuchElementException: None.get
        at scala.None$.get(Option.scala:529)
        at scala.None$.get(Option.scala:527)
        at org.apache.spark.sql.execution.streaming.StateStoreSaveExec.$anonfun$doExecute$9(statefulOperators.scala:472)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at org.apache.spark.util.Utils$.timeTakenMs(Utils.scala:708)
        at org.apache.spark.sql.execution.streaming.StateStoreWriter.timeTakenMs(statefulOperators.scala:145)
        at org.apache.spark.sql.execution.streaming.StateStoreWriter.timeTakenMs$(statefulOperators.scala:145)
        at org.apache.spark.sql.execution.streaming.StateStoreSaveExec.timeTakenMs(statefulOperators.scala:414)
        at org.apache.spark.sql.execution.streaming.StateStoreSaveExec.$anonfun$doExecute$5(statefulOperators.scala:470)
        at org.apache.spark.sql.execution.streaming.state.package$StateStoreOps.$anonfun$mapPartitionsWithStateStore$1(package.scala:63)
        at org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:127)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:406)
```

In this case, the reason for the failure is not immediately clear. Hence we add an assert and a log message to indicate why the query failed on the executor. (A query shape that avoids the failure is sketched below.)
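
For reference, an append-mode streaming aggregation needs a watermark on the event-time column, roughly like this (the rate source is just an illustrative input):

```scala
import org.apache.spark.sql.functions.{col, window}

// The rate source provides a `timestamp` event-time column.
val events = spark.readStream.format("rate").option("rowsPerSecond", "10").load()

val counts = events
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window(col("timestamp"), "5 minutes"))
  .count()

counts.writeStream.outputMode("append").format("console").start()
```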

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing unit tests

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43370 from anishshri-db/task/SPARK-45539.

Authored-by: Anish Shrigondekar <anish.shrigondekar@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?

Complete the addition of SQLSTATEs to all named error classes.

### Why are the changes needed?

We need SQLSTATEs to classify errors and catch them in JDBC/ODBC.

### Does this PR introduce _any_ user-facing change?

Yes, SQLSTATEs are documented.

### How was this patch tested?

Run existing QA

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #43376 from srielau/SPARK-45491-tenp-errors-sqlstates-2.

Authored-by: srielau <serge@rielau.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…ame for InjectRuntimeFilter

### What changes were proposed in this pull request?
After many improvements, `InjectRuntimeFilter` is a bit complex. We need to add more comments giving more design details and rename some variables so that `InjectRuntimeFilter` has better readability and maintainability.

The core of a runtime filter is its join keys, but the suffix `Exp` is not intuitive, so it's better to use the suffix `Key` directly. Rename as follows:
`filterApplicationSideExp` -> `filterApplicationSideKey`
`filterCreationSideExp` -> `filterCreationSideKey`
`findBloomFilterWithExp` -> `findBloomFilterWithKey`
`expr` -> `joinKey`

### Why are the changes needed?
Improve the readability and maintainability.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
N/A

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #43359 from beliefer/SPARK-45531.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?

Fix a bug in pyspark connect.

`DataFrameWriterV2.overwritePartitions` sets the mode to `overwrite_partitions` [pyspark/sql/connect/readwriter.py, line 825], but `WriteOperationV2` expects `overwrite_partition` [pyspark/sql/connect/plan.py, line 1660].

### Why are the changes needed?

Make `dataframe.writeTo(table).overwritePartitions()` work.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

No test. This bug is very obvious.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #43367 from xieshuaihu/python_connect_overwrite.

Authored-by: xieshuaihu <xieshuaihu@agora.io>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…alect

### What changes were proposed in this pull request?
1. This PR adds a `dropTable` function to `JdbcDialect`, so users can override the drop-table SQL in another `JdbcDialect`, such as one for Neo4j.
Neo4j drop case:
```sql
MATCH (m:Person {name: 'Mark'})
DELETE m
```
2. Also add `getInsertStatement` for the same reason.
Neo4j insert case:
```sql
MATCH (p:Person {name: 'Jennifer'})
SET p.birthdate = date('1980-01-01')
RETURN p
```
Neo4j's query language (in fact named Cypher, or `CQL`) is not like normal SQL, but Neo4j has a JDBC driver (a hedged sketch of such a dialect follows).
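
A sketch of what such an override might look like; the exact signature of the new `dropTable` hook is assumed from this PR's description:

```scala
import org.apache.spark.sql.jdbc.JdbcDialect

case object Neo4jDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:neo4j")

  // Assumed new hook from this PR: emit Cypher instead of `DROP TABLE ...`.
  override def dropTable(table: String): String =
    s"MATCH (m:$table) DETACH DELETE m"
}
```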

### Why are the changes needed?
Make `JdbcDialect` more useful.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests.

Closes #41855 from Hisoka-X/SPARK-44262_JDBCUtils_improve.

Authored-by: Jia Fan <fanjiaeminem@qq.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
This reverts commit d1bd21a.

### What changes were proposed in this pull request?
This PR aims to revert SPARK-45502 to make the test case `KafkaSourceStressSuite` stable.

### Why are the changes needed?
The test case `KafkaSourceStressSuite` has become very unstable after SPARK-45502 was merged, with 10 of the last 22 runs failing because of it. Revert it for now; we can upgrade Kafka again after resolving the test issues.

- https://github.com/apache/spark/actions/runs/6497999347/job/17648385705
- https://github.com/apache/spark/actions/runs/6502219014/job/17660900989
- https://github.com/apache/spark/actions/runs/6502591917/job/17661861797
- https://github.com/apache/spark/actions/runs/6503144598/job/17663199041
- https://github.com/apache/spark/actions/runs/6503233514/job/17663413817
- https://github.com/apache/spark/actions/runs/6504416528/job/17666334238
- https://github.com/apache/spark/actions/runs/6509796846/job/17682130466
- https://github.com/apache/spark/actions/runs/6510877112/job/17685502094
- https://github.com/apache/spark/actions/runs/6512948316/job/17691625228
- https://github.com/apache/spark/actions/runs/6516366232/job/17699813649

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43379 from LuciferYang/Revert-SPARK-45502.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
@huangxiaopingRD huangxiaopingRD merged commit 62653b9 into huangxiaopingRD:master Oct 16, 2023
pull bot pushed a commit that referenced this pull request Jul 21, 2025
…ingBuilder`

### What changes were proposed in this pull request?

This PR aims to improve `toString` by using JEP-280 instead of `ToStringBuilder`. In addition, `Scalastyle` and `Checkstyle` rules are added to prevent a future regression.

### Why are the changes needed?

Since Java 9, `String Concatenation` has been handled better by default.

| ID | DESCRIPTION |
| - | - |
| JEP-280 | [Indify String Concatenation](https://openjdk.org/jeps/280) |

For example, this PR improves `OpenBlocks` like the following. Both Java source code and byte code are simplified a lot by utilizing JEP-280 properly.

**CODE CHANGE**
```java

- return new ToStringBuilder(this, ToStringStyle.SHORT_PREFIX_STYLE)
-   .append("appId", appId)
-   .append("execId", execId)
-   .append("blockIds", Arrays.toString(blockIds))
-   .toString();
+ return "OpenBlocks[appId=" + appId + ",execId=" + execId + ",blockIds=" +
+     Arrays.toString(blockIds) + "]";
```

**BEFORE**
```
  public java.lang.String toString();
    Code:
       0: new           #39                 // class org/apache/commons/lang3/builder/ToStringBuilder
       3: dup
       4: aload_0
       5: getstatic     #41                 // Field org/apache/commons/lang3/builder/ToStringStyle.SHORT_PREFIX_STYLE:Lorg/apache/commons/lang3/builder/ToStringStyle;
       8: invokespecial #47                 // Method org/apache/commons/lang3/builder/ToStringBuilder."<init>":(Ljava/lang/Object;Lorg/apache/commons/lang3/builder/ToStringStyle;)V
      11: ldc           #50                 // String appId
      13: aload_0
      14: getfield      #7                  // Field appId:Ljava/lang/String;
      17: invokevirtual #51                 // Method org/apache/commons/lang3/builder/ToStringBuilder.append:(Ljava/lang/String;Ljava/lang/Object;)Lorg/apache/commons/lang3/builder/ToStringBuilder;
      20: ldc           #55                 // String execId
      22: aload_0
      23: getfield      #13                 // Field execId:Ljava/lang/String;
      26: invokevirtual #51                 // Method org/apache/commons/lang3/builder/ToStringBuilder.append:(Ljava/lang/String;Ljava/lang/Object;)Lorg/apache/commons/lang3/builder/ToStringBuilder;
      29: ldc           #56                 // String blockIds
      31: aload_0
      32: getfield      #16                 // Field blockIds:[Ljava/lang/String;
      35: invokestatic  #57                 // Method java/util/Arrays.toString:([Ljava/lang/Object;)Ljava/lang/String;
      38: invokevirtual #51                 // Method org/apache/commons/lang3/builder/ToStringBuilder.append:(Ljava/lang/String;Ljava/lang/Object;)Lorg/apache/commons/lang3/builder/ToStringBuilder;
      41: invokevirtual #61                 // Method org/apache/commons/lang3/builder/ToStringBuilder.toString:()Ljava/lang/String;
      44: areturn
```

**AFTER**
```
  public java.lang.String toString();
    Code:
       0: aload_0
       1: getfield      #7                  // Field appId:Ljava/lang/String;
       4: aload_0
       5: getfield      #13                 // Field execId:Ljava/lang/String;
       8: aload_0
       9: getfield      #16                 // Field blockIds:[Ljava/lang/String;
      12: invokestatic  #39                 // Method java/util/Arrays.toString:([Ljava/lang/Object;)Ljava/lang/String;
      15: invokedynamic #43,  0             // InvokeDynamic #0:makeConcatWithConstants:(Ljava/lang/String;Ljava/lang/String;Ljava/lang/String;)Ljava/lang/String;
      20: areturn
```

### Does this PR introduce _any_ user-facing change?

No. This is a `toString` implementation improvement.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#51572 from dongjoon-hyun/SPARK-52880.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>