Skip to content

Conversation

viirya
Copy link
Member

@viirya viirya commented Oct 19, 2025

What changes were proposed in this pull request?

This patch proposes a fix to a deadlock bug in Observation. It replaces synchronized locks with a promise to avoid deadlock happened between get and onFinish methods

Why are the changes needed?

Observation class has been evolved a few times during Spark 3.5 to Spark 4.0.0. Previously it uses locking mechanism (synchronized) between get and onFinish methods to coordinate metrics update and retrieval.

But it has a potential deadlocking bug. If get is called before ObservationListener is triggered to call onFinish, get will forever be waiting for metrics because it locks the observation object by synchronized so later onFinish call is locked out from updating the metrics.

This locking mechanism was replaced by a promise by #47921 that is a large refacroring on observation feature. But in the PR, I don’t see the deadlock bug was mentioned, and there is no bug fix PR proposed to earlier versions. So I think that the bug was not known and the fix is unintentional in Spark 4.0.0. The bug is still in Spark 3.5 branch.

Does this PR introduce any user-facing change?

No

How was this patch tested?

The deadlock bug was hit by customer and is tricky to reproducing by unit test. This patch should pass existing tests.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Oct 19, 2025
@viirya viirya changed the title [SPARK-XXXXX][SQL][3.5] Fix deadlock in Observation [SPARK-53948][SQL][3.5] Fix deadlock in Observation Oct 19, 2025
@gengliangwang
Copy link
Member

Thanks, merging to 3.5

gengliangwang pushed a commit that referenced this pull request Oct 20, 2025
### What changes were proposed in this pull request?

This patch proposes a fix to a deadlock bug in `Observation`. It replaces `synchronized` locks with a promise to avoid deadlock happened between `get` and `onFinish` methods

### Why are the changes needed?

`Observation` class has been evolved a few times during Spark 3.5 to Spark 4.0.0. Previously it uses locking mechanism (`synchronized`) between `get` and `onFinish` methods to coordinate metrics update and retrieval.

But it has a potential deadlocking bug. If `get` is called before `ObservationListener` is triggered to call `onFinish`, `get` will forever be waiting for metrics because it locks the observation object by `synchronized` so later `onFinish` call is locked out from updating the metrics.

This locking mechanism was replaced by a promise by #47921 that is a large refacroring on observation feature. But in the PR, I don’t see the deadlock bug was mentioned, and there is no bug fix PR proposed to earlier versions. So I think that the bug was not known and the fix is unintentional in Spark 4.0.0. The bug is still in Spark 3.5 branch.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

The deadlock bug was hit by customer and is tricky to reproducing by unit test. This patch should pass existing tests.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #52657 from viirya/fix_observation_deadlock.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
@viirya
Copy link
Member Author

viirya commented Oct 20, 2025

Thanks @HyukjinKwon @peter-toth @gengliangwang

@viirya viirya deleted the fix_observation_deadlock branch October 20, 2025 16:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants