#3613 uncovered a whole new class of bugs for coroutines: infinite CPU-intensive spin-loops during "unexpected" exceptions at the core of the implementation.
While this particular bug was addressed mechanically (#3634), the possibility of such bugs is still there:
- `StackOverflowError` in system-level methods (this particular issue was addressed by https://openjdk.org/jeps/270 in Java for Java's primitives)
- `OutOfMemoryError` from an arbitrary place of code that attempted an innocuous allocation
- An arbitrary programmatic bug in our own implementation
- Any other "implicit" exception (whether it's an NPE during a non-trivial data race, a `LinkageError` due to a misaligned dependency, or thread death)
Since coroutines are an application-level framework, it is close to impossible to ensure that they continue to operate bug-free and preserve all their internal invariants in the face of implicit exceptions being thrown from an arbitrary line of code, so the best we can do is to make the best effort (pun intended).
What we should do is ensure that, prior to collapsing, the system stays responsive (i.e. available for introspection with tools like jps) and graceful (i.e. it eventually deadlocks instead of intensively burning CPU and the user's pocket).
In order to do that, all our spin-lock based solutions (which, contrary to many kernel-level APIs, spin in "it shouldn't take long" scenarios rather than "this one is totally safe, recoverable and interruption-safe" ones) should degrade gracefully: first into sleep/yield/onSpinWait behaviour and, as a last resort, into full-blown thread parking.
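To make the idea of graceful degradation concrete, here is a minimal sketch of a wait loop that escalates from `Thread.onSpinWait()` to `Thread.yield()` to timed parking. `DegradingBackoff`, `awaitCondition` and all thresholds are hypothetical names and untuned values picked for illustration; this is not the actual kotlinx.coroutines implementation. It assumes a JVM target with Java 9+ (for `Thread.onSpinWait`), and it uses timed `LockSupport.parkNanos` as a stand-in for real parking because no unpark signal is wired in.

```kotlin
import java.util.concurrent.locks.LockSupport

// Hypothetical helper, NOT part of kotlinx.coroutines.
// Escalates from busy-spinning to yielding to timed parking
// the longer the awaited condition stays false.
class DegradingBackoff(
    private val spinLimit: Int = 100,            // assumed threshold, untuned
    private val yieldLimit: Int = 1_000,         // assumed threshold, untuned
    private val maxParkNanos: Long = 1_000_000L  // assumed cap of 1 ms per park
) {
    private var iterations = 0
    private var parkNanos = 1_000L

    fun pause() {
        when {
            // Stage 1: the wait "shouldn't take long", so spin politely.
            iterations < spinLimit -> {
                iterations++
                Thread.onSpinWait()
            }
            // Stage 2: give other runnable threads a chance to proceed.
            iterations < yieldLimit -> {
                iterations++
                Thread.yield()
            }
            // Stage 3: last resort. Park with exponentially growing timeouts,
            // so a wait that never completes costs almost no CPU.
            // Note: parkNanos returns immediately if the thread is interrupted;
            // this sketch does not handle that case.
            else -> {
                LockSupport.parkNanos(parkNanos)
                parkNanos = minOf(parkNanos * 2, maxParkNanos)
            }
        }
    }
}

// Hypothetical wait loop. The first check is the fast path:
// when the condition already holds, no backoff state is allocated.
inline fun awaitCondition(condition: () -> Boolean) {
    if (condition()) return
    val backoff = DegradingBackoff()
    while (!condition()) backoff.pause()
}
```

With such a helper, a spin-wait like `while (!flag.get()) {}` would become `awaitCondition { flag.get() }`. If the awaited state never arrives because of an implicit exception elsewhere, the thread ends up mostly parked and still shows up in a thread dump, instead of saturating a core.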
For now, we are aware of three such places:
- Waiting for reusability token in `DispatchedContinuation.awaitReusability` that matches racy scenarios such as "suspend (T1) resume (T2) getResult() (T1)"
- Waiting for ownership token of owner-supplied operation in `Mutex`
- Waiting for logical `expandBuffer` operation in `BufferedChannel`
The key here is to ensure that the solution is robust enough (i.e. that it actually works and makes progress even when timings are really unlucky) and doesn't obstruct the fast path (i.e. "happy path" performance is not affected).