Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Calling CompletionStage.asDeferred() on the same future many times causes stack overflows #4156

Closed
yzhaoa opened this issue Jun 13, 2024 · 3 comments
Labels

Comments

@yzhaoa
Copy link

yzhaoa commented Jun 13, 2024

Describe the bug

Given a shared CompletableFuture object future:

  1. Attach other transformations on future, e.g. val didRun = AtomicBoolean(); future.whenComplete { _, _ -> didRun.set(true) }
  2. Call future.asDeferred() many times. Generally, 10,000 times is sufficient to trigger the failure. The resulting Deferred<*> don't have to be used.
  3. Complete future exceptionally, e.g. future.completeExceptionally(RuntimeException()).

What happened? What should have happened instead?

Actual: Some of the Deferred<*> results would not complete. Some of the extra transformations in (1) do not run.

The exception thrown from the completed Deferred<*> objects may contain a suppressed StackOverflowError. In testing, in most of the cases there is no indication that anything failed, and the StackOverflowError can be observed by setting a breakpoint inside CompletableFuture's internal machinery.

Expected: all the Deferred<*> results should complete exceptionally, and additional transforms attached in (1) should run.

Why would you do this?

The above condition can happen when you use an asynchronous cache, e.g. Caffeine.newBuilder()...buildAsync().

In such use cases, an underlying CompletableFuture-based cache would return a cached (and shared) CompletableFuture object, and code that is using the cache would call CompletionStage.asDeferred() to consume the cached future asynchronously.

In exceptional conditions, e.g. when the backend computation gated behind the cache slows to a crawl and then starts failing, the above described condition can happen - a single CompletableFuture is returned to a bunch of cache consumers while everybody is waiting for the computation to complete. If the incoming request rate is high enough, you can easily end up with 10,000 waiters on the same CompletableFuture.

Then, when backend computation finally fails, this bug is triggered.

Note: this bug is particular insidious because it is very hard to trigger normally, and when it does, caches can behave very unexpectedly.

In the case of Caffeine, because some transforms fail to run (1 above), Caffeine does not get notified that the CompletableFuture fail and keeps returning the same failed CompletableFuture to future callers.

Why not CompletionStage.await()

Because the CompletableFuture is shared between multiple callers. Using CompletionStage.await() will allow one caller's cancellation to break all other concurrent callers waiting on the same future.

Provide a Reproducer

This is a *.main.kts. Run with kotlin <something>.main.kts <num calls>.

@file:Repository("https://repo1.maven.org/maven2")
@file:DependsOn("org.jetbrains.kotlinx:kotlinx-coroutines-core:1.8.1")

import kotlinx.coroutines.delay
import kotlinx.coroutines.future.asDeferred
import kotlinx.coroutines.runBlocking
import java.util.concurrent.CompletableFuture
import java.util.concurrent.atomic.AtomicBoolean
import kotlin.time.Duration.Companion.seconds

val numCalls = args[0].toInt()

val future = CompletableFuture<Unit>()

// Step 1
val didRun = AtomicBoolean(false)
future.whenComplete { _, _ -> didRun.set(true) }

// Step 2
val deferreds = (0 until numCalls).map { future.asDeferred() }

// Step 3
future.completeExceptionally(RuntimeException())

// There are no thread pool and executors involved above. All executions should complete on step 3.
runBlocking {
    // But just in case, wait a few seconds.
    delay(2.seconds)

    println("Expected additional transformation to run. Actual: ${didRun.get()}")
    println("Expected ${deferreds.size} Deferred<Unit> to complete exceptionally. Actual: ${deferreds.filter { it.isCompleted }.size}")
}

Versions:

  • Mac OS X Sonoma 14.4.1, M3 Pro.
  • java -version
openjdk version "21.0.3" 2024-04-16
OpenJDK Runtime Environment Homebrew (build 21.0.3)
OpenJDK 64-Bit Server VM Homebrew (build 21.0.3, mixed mode, sharing)
  • kotlin -version: Kotlin version 2.0.0-release-341 (JRE 21.0.3)
  • kotlinx-coroutines-core: 1.8.1 (above)

Results:

  • $ kotlin test.main.kts 100: no bug
$ JAVA_HOME=/opt/homebrew/Cellar/openjdk@21/21.0.3/libexec/openjdk.jdk/Contents/Home kotlin test.main.kts 100
Expected additional transformation to run. Actual: true
Expected 100 Deferred<Unit> to complete exceptionally. Actual: 100
  • $ kotlin test.main.kts 10000: bug, with various failure modes
$ JAVA_HOME=/opt/homebrew/Cellar/openjdk@21/21.0.3/libexec/openjdk.jdk/Contents/Home kotlin test.main.kts 10000
Expected additional transformation to run. Actual: true
Expected 10000 Deferred<Unit> to complete exceptionally. Actual: 8023
$ JAVA_HOME=/opt/homebrew/Cellar/openjdk@21/21.0.3/libexec/openjdk.jdk/Contents/Home kotlin test.main.kts 10000
Expected additional transformation to run. Actual: true
Expected 10000 Deferred<Unit> to complete exceptionally. Actual: 2970
$ JAVA_HOME=/opt/homebrew/Cellar/openjdk@21/21.0.3/libexec/openjdk.jdk/Contents/Home kotlin test.main.kts 10000
Expected additional transformation to run. Actual: true
Expected 10000 Deferred<Unit> to complete exceptionally. Actual: 9894
$ JAVA_HOME=/opt/homebrew/Cellar/openjdk@21/21.0.3/libexec/openjdk.jdk/Contents/Home kotlin test.main.kts 10000
Expected additional transformation to run. Actual: true
Expected 10000 Deferred<Unit> to complete exceptionally. Actual: 9995
$ JAVA_HOME=/opt/homebrew/Cellar/openjdk@21/21.0.3/libexec/openjdk.jdk/Contents/Home kotlin test.main.kts 10000
Expected additional transformation to run. Actual: false
Expected 10000 Deferred<Unit> to complete exceptionally. Actual: 8075
@yzhaoa yzhaoa added the bug label Jun 13, 2024
@yzhaoa
Copy link
Author

yzhaoa commented Jun 13, 2024

Root cause

The root cause seems to be a bad interaction between CompletionStage.asDeferred and CompletableFuture. Specifically:

  1. When future completes, it calls the CompletionStage.handle callback at .
  2. The callback calls result.completeExceptionally.
  3. Completion of result triggers cancelFutureOnCompletion at
    result.cancelFutureOnCompletion(future)
    .
  4. CompletableFuture.cancel triggers postComplete at https://github.com/openjdk/jdk/blob/jdk-21%2B26/src/java.base/share/classes/java/util/concurrent/CompletableFuture.java#L2512, which in turn runs (1) for the next Deferred, recursively.

Normally CompletableFuture runs all the callbacks iteratively, but due to triggering the cancel() function in a recursive context, it starts running callbacks recursively, and causes a stack overflow.

@yzhaoa
Copy link
Author

yzhaoa commented Jun 13, 2024

Workaround

The above issue can be worked around by breaking the recursive CompletableFuture.cancel call. For example:

fun <T> CompletableFuture<T>.safeAsDeferred(): Deferred<T> {
    val parent = this
    val safeCompletableFuture = object : CompletableFuture<T>() {
        override fun cancel(mayInterruptIfRunning: Boolean): Boolean {
            return if (!parent.isDone) {
                parent.cancel(mayInterruptIfRunning)
            } else {
                parent.isCancelled
            }
        }
    }
    parent.handle {
        result, exception ->
        if (exception == null) {
            safeCompletableFuture.complete(result)
        } else {
            safeCompletableFuture.completeExceptionally(exception)
        }
    }
    return safeCompletableFuture.asDeferred()
}

When used in place of asDeferred in the reproducer above, this fixes the issue:

$ JAVA_HOME=/opt/homebrew/Cellar/openjdk@21/21.0.3/libexec/openjdk.jdk/Contents/Home kotlin test.main.kts 10000
Expected additional transformation to run. Actual: true
Expected 10000 Deferred<Unit> to complete exceptionally. Actual: 10000

And if you don't care about cancelling the Deferred, you can simply create another chained CompletableFuture, since apparently CompletableFuture cancellations doesn't propagate upstream:

future.thenApply { it }.asDeferred()

@dkhalanskyjb
Copy link
Collaborator

Great job figuring out the problem!

I think this is worth reporting to the JDK maintainers as well. As far as I can see, nothing in the docs indicates that this may happen. Though nothing says that it won't, I think:

I think this should be fixed on the side of the JDK as well by detecting the same thread entering the cancellation procedure more than once. That said, we can work around this explicitly on our side: #4173 The downside of this workaround is that when several threads cancel coroutines related to one future, only one of them will deal with the work related to the cancellation now (as opposed to many threads sharing the load), but I don't think that's a significant issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants