vinodkc and others added 4 commits December 3, 2025 07:34
…erver output stream to files

### What changes were proposed in this pull request?

Currently, the Spark Connect test server's stdout and stderr are discarded when `SPARK_DEBUG_SC_JVM_CLIENT=false`, making it difficult to debug test failures.

This PR enables log4j logging for the test Spark Connect server in all test modes (both debug and non-debug) by always configuring `log4j2.properties`.

### Why are the changes needed?

When `SPARK_DEBUG_SC_JVM_CLIENT=false`, `SparkConnectJdbcDataTypeSuite` randomly hangs because the child server process blocks on `write()` calls when its stdout/stderr pipe buffers fill up. Without anything consuming the output, the buffers reach capacity and the process blocks indefinitely.

Instead of using `Redirect.DISCARD`, redirect the logs into log4j-managed files.
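
The deadlock-avoidance idea can be sketched outside of Spark. Below is a minimal Python sketch, not the actual Scala implementation; `run_with_log` is a hypothetical helper. Redirecting the child's streams to a file means no pipe buffer can ever fill up, no matter how much the server logs.

```python
import subprocess

def run_with_log(cmd, log_path):
    """Run `cmd`, appending its stdout and stderr to `log_path`.

    Redirecting to a file (rather than a pipe that nobody reads, or
    discarding the output entirely) guarantees the child never blocks
    on a full pipe buffer, and the output survives for debugging.
    """
    with open(log_path, "ab") as log:
        return subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
```

With an unread `PIPE`, a child writing more than the pipe buffer (typically 64 KiB) could hang; with a file it completes, which mirrors the switch away from discarding the test server's output.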

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Tested and confirmed that the log file is created when running either:

1) `SPARK_DEBUG_SC_JVM_CLIENT=false  build/sbt "connect-client-jdbc/testOnly org.apache.spark.sql.connect.client.jdbc.SparkConnectJdbcDataTypeSuite"`

OR

2) `SPARK_DEBUG_SC_JVM_CLIENT=true  build/sbt "connect-client-jdbc/testOnly org.apache.spark.sql.connect.client.jdbc.SparkConnectJdbcDataTypeSuite"`

In both cases the log output is written to `./target/unit-tests.log`.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #53275 from vinodkc/br_redirect_stdout_stderr_to_file.

Authored-by: vinodkc <vinod.kc.in@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…nt/wait cost

### What changes were proposed in this pull request?
When `ShuffleBlockFetcherIterator` fetches data, two shuffle costs are not accounted for:

1. Network resource congestion and wait time between `fetchUpToMaxBytes` and `fetchAllHostLocalBlocks`;
2. Connection-establishment congestion: when `fetchUpToMaxBytes` and `fetchAllHostLocalBlocks` send requests, creating the client may itself be congested.
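
The general pattern for capturing these previously uncounted phases is to time the blocking call sites and fold the elapsed time into the existing fetch-wait metric. A minimal sketch of that pattern follows (Python for brevity; `FetchWaitMetrics` and `timed` are hypothetical names, not Spark's API):

```python
import time

class FetchWaitMetrics:
    """Accumulates time spent blocked in fetch-related phases."""

    def __init__(self):
        self.fetch_wait_ns = 0

    def timed(self, fn, *args, **kwargs):
        # Wrap a blocking phase (e.g. establishing a connection or
        # waiting on network I/O) so its latency is counted toward
        # the fetch wait time, even if the wrapped call raises.
        start = time.monotonic_ns()
        try:
            return fn(*args, **kwargs)
        finally:
            self.fetch_wait_ns += time.monotonic_ns() - start
```

Wrapping both the request-sending path and the client-creation path with such a timer is what makes the reported wait time reflect all of the blocking, not just the part between the two fetch calls.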

### Why are the changes needed?
Makes the shuffle fetch wait time and request time metrics more accurate.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added a `Thread.sleep(3000)` latency to the open-block request; the shuffle read metrics then look like the screenshot below.

<img width="1724" height="829" alt="Screenshot 2025-11-27 17 38 26" src="https://github.com/user-attachments/assets/99f3822d-d5a7-4f4a-abfc-cc272e61667c" />

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #53245 from AngersZhuuuu/SPARK-54536.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In this PR I propose to make `QueryPlanningTracker` a field of `HybridAnalyzer`.

### Why are the changes needed?
To simplify the code and to support further single-pass analyzer development.
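
The shape of the refactor, shown as a hypothetical Python sketch (the real classes are Scala): instead of threading the tracker through every `apply` call, it becomes constructor state of the analyzer.

```python
class QueryPlanningTracker:
    """Hypothetical stand-in for Spark's planning tracker."""

    def __init__(self):
        self.phases = []

    def record(self, phase):
        self.phases.append(phase)

class HybridAnalyzer:
    # Before: apply(self, plan, tracker) -- the tracker was passed into
    # every call. After: it is a field, set once at construction, so
    # call sites no longer need to thread it through.
    def __init__(self, tracker):
        self.tracker = tracker

    def apply(self, plan):
        self.tracker.record("resolution")
        return plan
```

Moving a parameter that every call supplies anyway into a constructor field shortens the call chain, which matters when the analyzer is invoked from many places.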

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #53277 from mihailoale-db/analyzertracker.

Authored-by: mihailoale-db <mihailo.aleksic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…for Single ColFamily

### What changes were proposed in this pull request?

This PR introduces a new state partition reader, `StatePartitionReaderAllColumnFamilies`, to support offline repartitioning.
`StatePartitionReaderAllColumnFamilies` is invoked when the user sets the option `readAllColumnFamilies` to `true`.

We already have the StateDataSource reader, which allows customers to read the rows in an operator's state store using the DataFrame API, just as they would read a normal table. However, it currently only supports reading one column family in the state store at a time.

This change allows reading all the state rows in all the column families, so that we can repartition them at once: read the entire state store, repartition the rows, and then save the repartitioned state rows to the cloud. It also has a performance benefit, since we don't have to read each column family separately. The state is read as of the last committed batch version.

Since each column family can have a different schema, the returned DataFrame treats the key and value rows as opaque bytes:
- partition_key (string)
- key_bytes (binary)
- value_bytes (binary)
- column_family_name (string)
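
With that schema, rows from column families with different key/value schemas can be shuffled together, since the partitioner only needs the opaque bytes. A sketch of that idea in plain Python (`rows_by_partition` is an illustrative helper, not the PR's API):

```python
import zlib
from collections import defaultdict

def rows_by_partition(rows, num_partitions):
    """Group state rows into target partitions by hashing the raw key bytes.

    Each row is a dict with the four columns described above; because the
    key is opaque bytes, a single pass handles every column family at once.
    """
    parts = defaultdict(list)
    for row in rows:
        # crc32 is a stable hash, so the partition assignment is
        # deterministic across runs.
        parts[zlib.crc32(row["key_bytes"]) % num_partitions].append(row)
    return parts
```

Deserialization back into each family's real schema only has to happen after the shuffle, which is what makes the single-pass read possible.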

### Why are the changes needed?

See above

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

See the unit tests. They not only verify the schema but also validate that the data is serialized to bytes correctly, by comparing it against a normally queried DataFrame.

### Was this patch authored or co-authored using generative AI tooling?

Yes. haiku, sonnet.

Closes #53104 from zifeif2/repartition-reader-single-cf.

Lead-authored-by: zifeif2 <zifeifeng11@gmail.com>
Co-authored-by: Ubuntu <zifei.feng@your.hostname.com>
Signed-off-by: Anish Shrigondekar <anish.shrigondekar@databricks.com>
@pull pull bot merged commit df63cb7 into huangxiaopingRD:master Dec 3, 2025