Skip to content

Conversation

@AlexRiedler
Copy link
Contributor

Summary

On JVM JobCancellationException is Serializable, but job field is nullable and Transient:

@JvmField @Transient internal actual val job: Job

as a result job is null after deserialization, so when hashCode() is ran on the exception, it results in null pointer exception. This change is to fix this issue by handling nullability of the job field.

Exception from Production while running Apache Flink:

java.lang.NullPointerException: Cannot invoke "Object.hashCode()" because "this.job" is null
	at kotlinx.coroutines.JobCancellationException.hashCode(Exceptions.kt:65)
	at java.base/java.util.HashMap.hash(Unknown Source)
	at java.base/java.util.HashMap.put(Unknown Source)
	at java.base/java.util.HashSet.add(Unknown Source)
	at org.apache.flink.util.SerializedThrowable.<init>(SerializedThrowable.java:91)
	at org.apache.flink.util.SerializedThrowable.<init>(SerializedThrowable.java:93)
	at org.apache.flink.util.SerializedThrowable.<init>(SerializedThrowable.java:62)
	at org.apache.flink.runtime.executiongraph.ErrorInfo.<init>(ErrorInfo.java:72)
	at org.apache.flink.runtime.executiongraph.ErrorInfo.createErrorInfoWithNullableCause(ErrorInfo.java:52)
	at org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1154)
	at org.apache.flink.runtime.executiongraph.Execution.markFailed(Execution.java:933)
	at org.apache.flink.runtime.executiongraph.DefaultExecutionGraph.updateStateInternal(DefaultExecutionGraph.java:1396)
	at org.apache.flink.runtime.executiongraph.DefaultExecutionGraph.updateState(DefaultExecutionGraph.java:1355)
	at org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:763)
	at org.apache.flink.runtime.scheduler.SchedulerNG.updateTaskExecutionState(SchedulerNG.java:83)
	at org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:488)

@qwwdfsad
Copy link
Member

qwwdfsad commented Dec 4, 2024

Thanks! Could you please elaborate -- is this something you encountered in the wild (i.e. you have a domain where you serialize and deserialize JCE consistently) or is it something you just spotted accidentally?

@qwwdfsad qwwdfsad changed the base branch from master to develop December 4, 2024 13:54
@qwwdfsad qwwdfsad changed the base branch from develop to master December 4, 2024 13:54
@qwwdfsad
Copy link
Member

qwwdfsad commented Dec 4, 2024

Could you please also rebase your branch on top of develop (see also: https://github.com/Kotlin/kotlinx.coroutines/blob/master/CONTRIBUTING.md#submitting-prs)?

@AlexRiedler AlexRiedler changed the base branch from master to develop December 4, 2024 13:56
@AlexRiedler AlexRiedler force-pushed the job-cancellation-transient-failure branch from 8b8a0f8 to d24a35c Compare December 4, 2024 13:57
@AlexRiedler
Copy link
Contributor Author

AlexRiedler commented Dec 4, 2024

@qwwdfsad rebased onto develop (missed that stepped in the guide 😅)

Yes this was encountered in the wild while in one of our production Apache Flink Job's.

Since Apache Flink uses a distributed across system mechanism. It will serialize exceptions at job boundaries (that is between user code and framework code) across systems for error tracking and reporting; it happens to use a HashMap as part of this and puts the exception as the key (which in-turn runs hashCode). This however, results in a null pointer exception as a result since job field is Transient in Kotlin Coroutines.

More detail: this appears to happen cause we are capturing the exception in a completable future as a reason to mark the completable future as a failure.

@qwwdfsad qwwdfsad self-requested a review December 4, 2024 17:52
@qwwdfsad
Copy link
Member

qwwdfsad commented Dec 4, 2024

Thank you for the explanation!

I've also filed #4293 and started an internal discussion about this bug -- it seems to be a soundness violation, and it would be nice to at least have a warning here

@qwwdfsad qwwdfsad merged commit e670f62 into Kotlin:develop Dec 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants