Preventing or handling of service deadlocks #199

tillrohrmann · 2023-03-16T09:14:19Z

Problem

With Restate's current model where we execute calls to a service instance sequentially and allow a service to await other services while keeping the lock, users can build applications that deadlock. Examples are the following:

A service that awaits on a call to itself directly or indirectly (via a chain of other service calls)
Two services A and B which call each other (directly or indirectly) on the same key and await the result

The underlying problem is that we allow awaiting on a future result while holding the lock of a service instance.

Solution ideas

The current solution ideas fall into two categories: deadlock detection and prevention.

Deadlock detection

The idea would be to detect cyclic dependencies between service instances and to notify the user about it. Since a cyclic dependency can be formed by two service instances calling each other (potentially originating from different partitions), the detection algorithm will probably require some algorithm that retrieves this information from different partitions. This entails that the detection can probably not be a local decision.

Resolving the deadlock via undoing changes

Once a deadlock is detected, Restate needs to provide a way to resolve it. Ideally, the user could select an invocation which the runtime can cancel and whose changes will be rolled back so that the state stays consistent. It is unclear at this point in time how Restate could roll back state changes in the general case (e.g. assume that the service changed some state of another service x and that other calls were executed on x afterwards).

Deadlock prevention

If Restate's programming model wouldn't allow the creation of deadlocks, then there would be no need for a deadlock detection algorithm. Potential ideas could be the following:

Directed acyclic call graphs

The problem of deadlocks only arises if there exists a cyclic call graph. Hence, one idea to prevent this from happening is to ask the user to define a hierarchy of services where a service in level x is only allowed to call and await services in level y with y > x. That way, it would not be possible to create cycles.

Such a hierarchy would only be needed for keyed and singleton services since unkeyed services cannot be called again.

How to define a service hierarchy?

One idea for defining the service hierarchy is t ask the user to provide this information when registering a service. She could provide the level as part of the service registration. This would entail that the user has a global understanding of all the services that call her service and the services that her service will call.

Optimistic concurrency control

Since the deadlock situation arises when a service awaits another service while holding a lock, we could prevent it by disallowing awaiting another service while holding the lock. This could be done by introducing a new service type "sequentially executed keyed service" which cannot await. This means that such a service type will only be allowed to send non-blocking background calls. Additionally, calls to a keyed/singleton service instance will be run in parallel (holding no service instance lock anymore) using optimistic concurrency control for resolving conflicts. This means that the user needs to explicitly handle write conflicts by redoing/merging its changes.

The text was updated successfully, but these errors were encountered:

jackkleeman · 2023-06-14T07:11:37Z

NB; i have a verification test that runs into an occasional deadlock, and it feels to me that it is of the more subtle kind that might be a problem for some users, because the cycle is not within one call chain (eg 1 -> 2 -> 3 -> 1)
key 1 syncCalls key 2
key 2 syncCalls key 1
something else calls (sync or otherwise) key 1 and 2 concurrently
So basically one acquires lock on 1, the other on 2, and then both are stuck forever. I think I will have to add some cycle detection to the verif test, as the approach currently used only looks for loops within one call chain, not the whole graph.

one of the interesting things here is that by reordering the inbox you can avoid the deadlock

tillrohrmann added this to the 1A milestone Mar 16, 2023

slinkydeveloper removed this from the 1A milestone Jun 21, 2023

slinkydeveloper added the semantics System semantics and behaviour label Jun 21, 2023

tillrohrmann added api-devex Restate user facing Development model needs-refinement Issues that need further investigation labels Nov 17, 2023

slinkydeveloper added the ops-ex Restate operational experience label May 14, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Preventing or handling of service deadlocks #199

Preventing or handling of service deadlocks #199

tillrohrmann commented Mar 16, 2023

jackkleeman commented Jun 14, 2023

Preventing or handling of service deadlocks #199

Preventing or handling of service deadlocks #199

Comments

tillrohrmann commented Mar 16, 2023

Problem

Solution ideas

Deadlock detection

Resolving the deadlock via undoing changes

Deadlock prevention

Directed acyclic call graphs

How to define a service hierarchy?

Optimistic concurrency control

jackkleeman commented Jun 14, 2023