[pull] master from apache:master #1107
Merged
…erver output stream to files

### What changes were proposed in this pull request?
Currently, the Spark Connect test server's stdout and stderr are discarded when `SPARK_DEBUG_SC_JVM_CLIENT=false`, making it difficult to debug test failures. This PR enables log4j logging for the test Spark Connect server in all test modes (both debug and non-debug) by always configuring `log4j2.properties`.

### Why are the changes needed?
When `SPARK_DEBUG_SC_JVM_CLIENT=false`, `SparkConnectJdbcDataTypeSuite` randomly hangs because the child server process blocks on `write()` calls once the stdout/stderr pipe buffers fill up. Without anything consuming the output, the buffers reach capacity and the process blocks indefinitely. Instead of `Redirect.DISCARD`, redirect the logs into log4j files.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Tested and confirmed that log files are created in `./target/unit-tests.log` when running either:
1) `SPARK_DEBUG_SC_JVM_CLIENT=false build/sbt "connect-client-jdbc/testOnly org.apache.spark.sql.connect.client.jdbc.SparkConnectJdbcDataTypeSuite"`
2) `SPARK_DEBUG_SC_JVM_CLIENT=true build/sbt "connect-client-jdbc/testOnly org.apache.spark.sql.connect.client.jdbc.SparkConnectJdbcDataTypeSuite"`

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #53275 from vinodkc/br_redirect_stdout_stderr_to_file.

Authored-by: vinodkc <vinod.kc.in@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
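The redirect-to-file idea above can be sketched in plain Java. This is an illustrative stand-in, not the actual Spark patch: the class name, file name, and the `echo` child process are all hypothetical; the point is that giving the child's output a destination (a file) both drains the pipes and preserves the logs for debugging.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

// Hypothetical sketch: redirect a child process's merged stdout/stderr to a
// log file instead of leaving the pipes unread (which can block the child
// once the OS pipe buffer fills). "echo" stands in for the test server.
public class RedirectToFileDemo {
    // Runs a child process with its output appended to the given log file
    // and returns the captured text.
    static String runAndCapture(File logFile) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder("echo", "hello from child");
        pb.redirectErrorStream(true); // merge stderr into stdout
        pb.redirectOutput(ProcessBuilder.Redirect.appendTo(logFile));
        Process proc = pb.start();
        proc.waitFor();
        return new String(Files.readAllBytes(logFile.toPath())).trim();
    }

    public static void main(String[] args) throws Exception {
        File log = Files.createTempFile("server-stdout", ".log").toFile();
        System.out.println(runAndCapture(log));
    }
}
```

The actual patch routes the streams into log4j-managed files rather than a raw temp file, but the blocking behavior it avoids is the same.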
…nt/wait cost

### What changes were proposed in this pull request?
When `ShuffleBlockFetcherIterator` fetches data, two shuffle costs are not accounted for:
1. Network resource congestion and waiting between `fetchUpToMaxBytes` and `fetchAllHostLocalBlocks`;
2. Connection establishment congestion: when `fetchUpToMaxBytes` and `fetchAllHostLocalBlocks` send requests, creating the client may be congested.

### Why are the changes needed?
Makes the shuffle fetch wait time and request time more accurate.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
After adding a `Thread.sleep(3000)` latency to the open block request, the shuffle read metrics look like the screenshot below:
<img width="1724" height="829" alt="Screenshot 2025-11-27 17 38 26" src="https://github.com/user-attachments/assets/99f3822d-d5a7-4f4a-abfc-cc272e61667c" />

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #53245 from AngersZhuuuu/SPARK-54536.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
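The metric idea can be sketched as wrapping the "create client / send request" step with `nanoTime` bookkeeping and folding the elapsed time into a fetch-wait accumulator, so congestion while establishing connections is counted. All names here are illustrative, not Spark's actual metric fields.

```java
// Hypothetical sketch: accumulate time spent in connection setup / request
// dispatch into a wait-time counter, mirroring how shuffle read metrics
// could account for congestion the original code ignored.
public class FetchWaitTimer {
    private long fetchWaitTimeNs = 0L;

    // Times a unit of work and accumulates the elapsed nanoseconds,
    // even if the body throws.
    <T> T timed(java.util.function.Supplier<T> body) {
        long start = System.nanoTime();
        try {
            return body.get();
        } finally {
            fetchWaitTimeNs += System.nanoTime() - start;
        }
    }

    long fetchWaitTimeNs() { return fetchWaitTimeNs; }

    public static void main(String[] args) {
        FetchWaitTimer timer = new FetchWaitTimer();
        String client = timer.timed(() -> {
            try { Thread.sleep(50); } catch (InterruptedException e) { }
            return "client"; // stand-in for an established connection
        });
        System.out.println(client + " established, waited " + timer.fetchWaitTimeNs() + " ns");
    }
}
```

In the real iterator the accumulated value would feed the task's shuffle read metrics rather than a local field.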
### What changes were proposed in this pull request?
In this PR I propose to make `QueryPlanningTracker` a field of `HybridAnalyzer`.

### Why are the changes needed?
To simplify the code and support further single-pass analyzer development.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #53277 from mihailoale-db/analyzertracker.

Authored-by: mihailoale-db <mihailo.aleksic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
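The refactor pattern — turning a parameter threaded through every call into a constructor field — can be sketched in miniature. The classes and method below are hypothetical stand-ins, not the actual `HybridAnalyzer` or `QueryPlanningTracker` APIs.

```java
// Hypothetical before/after of "make the tracker a field": call sites no
// longer need to carry the tracker through every analyze() call.
public class TrackerAsFieldDemo {
    // Minimal stand-in for a planning tracker.
    static class Tracker {
        final java.util.List<String> phases = new java.util.ArrayList<>();
        void record(String phase) { phases.add(phase); }
    }

    // Before: the tracker is a parameter of each call.
    static class AnalyzerBefore {
        String analyze(String plan, Tracker tracker) {
            tracker.record("analysis");
            return plan.toUpperCase();
        }
    }

    // After: the tracker is a constructor field, simplifying call sites.
    static class AnalyzerAfter {
        private final Tracker tracker;
        AnalyzerAfter(Tracker tracker) { this.tracker = tracker; }
        String analyze(String plan) {
            tracker.record("analysis");
            return plan.toUpperCase();
        }
    }

    public static void main(String[] args) {
        Tracker tracker = new Tracker();
        System.out.println(new AnalyzerAfter(tracker).analyze("select 1"));
        System.out.println(tracker.phases);
    }
}
```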
…for Single ColFamily

### What changes were proposed in this pull request?
This PR introduces a new state partition reader, `StatePartitionReaderAllColumnFamilies`, to support offline repartitioning. It is invoked when the user sets the option `readAllColumnFamilies` to true.

The StateDataSource reader allows customers to read the rows in an operator state store using the DataFrame API, just as they would read a normal table, but it currently only supports reading one column family of the state store at a time. This change allows reading all state rows across all column families at once, so the entire state store can be read, the rows repartitioned, and the repartitioned state rows saved to the cloud. It also has a performance benefit, since each column family no longer has to be read separately. State is read at the last committed batch version.

Since each column family can have a different schema, the returned DataFrame treats the key and value rows as bytes:
- partition_key (string)
- key_bytes (binary)
- value_bytes (binary)
- column_family_name (string)

### Why are the changes needed?
See above.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
See the unit test. It verifies not only the schema, but also that the data is serialized to bytes correctly, by comparing it against a normally queried DataFrame.

### Was this patch authored or co-authored using generative AI tooling?
Yes. haiku, sonnet.

Closes #53104 from zifeif2/repartition-reader-single-cf.

Lead-authored-by: zifeif2 <zifeifeng11@gmail.com>
Co-authored-by: Ubuntu <zifei.feng@your.hostname.com>
Signed-off-by: Anish Shrigondekar <anish.shrigondekar@databricks.com>
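The four-column row shape described above can be modeled as a plain value type. This is a hypothetical Java stand-in, not the Spark StateDataSource API: key and value stay raw bytes precisely because each column family may have a different schema, so only the repartitioning layer needs to interpret them.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical model of the row the new reader returns:
// (partition_key, key_bytes, value_bytes, column_family_name).
public class StateRowDemo {
    static class StateRow {
        final String partitionKey;        // partition_key (string)
        final byte[] keyBytes;            // key_bytes (binary)
        final byte[] valueBytes;          // value_bytes (binary)
        final String columnFamilyName;    // column_family_name (string)

        StateRow(String partitionKey, byte[] keyBytes,
                 byte[] valueBytes, String columnFamilyName) {
            this.partitionKey = partitionKey;
            this.keyBytes = keyBytes;
            this.valueBytes = valueBytes;
            this.columnFamilyName = columnFamilyName;
        }
    }

    public static void main(String[] args) {
        // Illustrative contents only; real rows hold encoded state-store data.
        StateRow row = new StateRow(
            "p0",
            "user-42".getBytes(StandardCharsets.UTF_8),
            "count=7".getBytes(StandardCharsets.UTF_8),
            "default");
        System.out.println(new String(row.keyBytes, StandardCharsets.UTF_8));
        System.out.println(row.columnFamilyName);
    }
}
```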
Created by pull[bot] (v2.0.0-alpha.4)