Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Preventing or handling of service deadlocks #199

Open
tillrohrmann opened this issue Mar 16, 2023 · 1 comment
Open

Preventing or handling of service deadlocks #199

tillrohrmann opened this issue Mar 16, 2023 · 1 comment
Labels
api-devex Restate user facing Development model needs-refinement Issues that need further investigation ops-ex Restate operational experience semantics System semantics and behaviour

Comments

@tillrohrmann
Copy link
Contributor

Problem

With Restate's current model where we execute calls to a service instance sequentially and allow a service to await other services while keeping the lock, users can build applications that deadlock. Examples are the following:

  • A service that awaits on a call to itself directly or indirectly (via a chain of other service calls)
  • Two services A and B which call each other (directly or indirectly) on the same key and await the result

The underlying problem is that we allow awaiting on a future result while holding the lock of a service instance.

Solution ideas

The current solution ideas fall into two categories: deadlock detection and prevention.

Deadlock detection

The idea would be to detect cyclic dependencies between service instances and to notify the user about it. Since a cyclic dependency can be formed by two service instances calling each other (potentially originating from different partitions), the detection algorithm will probably require some algorithm that retrieves this information from different partitions. This entails that the detection can probably not be a local decision.

Resolving the deadlock via undoing changes

Once a deadlock is detected, Restate needs to provide a way to resolve it. Ideally, the user could select an invocation which the runtime can cancel and whose changes will be rolled back so that the state stays consistent. It is unclear at this point in time how Restate could roll back state changes in the general case (e.g. assume that the service changed some state of another service x and that other calls were executed on x afterwards).

Deadlock prevention

If Restate's programming model wouldn't allow the creation of deadlocks, then there would be no need for a deadlock detection algorithm. Potential ideas could be the following:

Directed acyclic call graphs

The problem of deadlocks only arises if there exists a cyclic call graph. Hence, one idea to prevent this from happening is to ask the user to define a hierarchy of services where a service in level x is only allowed to call and await services in level y with y > x. That way, it would not be possible to create cycles.

Such a hierarchy would only be needed for keyed and singleton services since unkeyed services cannot be called again.

How to define a service hierarchy?

One idea for defining the service hierarchy is t ask the user to provide this information when registering a service. She could provide the level as part of the service registration. This would entail that the user has a global understanding of all the services that call her service and the services that her service will call.

Optimistic concurrency control

Since the deadlock situation arises when a service awaits another service while holding a lock, we could prevent it by disallowing awaiting another service while holding the lock. This could be done by introducing a new service type "sequentially executed keyed service" which cannot await. This means that such a service type will only be allowed to send non-blocking background calls. Additionally, calls to a keyed/singleton service instance will be run in parallel (holding no service instance lock anymore) using optimistic concurrency control for resolving conflicts. This means that the user needs to explicitly handle write conflicts by redoing/merging its changes.

@tillrohrmann tillrohrmann added this to the 1A milestone Mar 16, 2023
@jackkleeman
Copy link
Contributor

NB; i have a verification test that runs into an occasional deadlock, and it feels to me that it is of the more subtle kind that might be a problem for some users, because the cycle is not within one call chain (eg 1 -> 2 -> 3 -> 1)
key 1 syncCalls key 2
key 2 syncCalls key 1
something else calls (sync or otherwise) key 1 and 2 concurrently
So basically one acquires lock on 1, the other on 2, and then both are stuck forever. I think I will have to add some cycle detection to the verif test, as the approach currently used only looks for loops within one call chain, not the whole graph.

one of the interesting things here is that by reordering the inbox you can avoid the deadlock

@slinkydeveloper slinkydeveloper removed this from the 1A milestone Jun 21, 2023
@slinkydeveloper slinkydeveloper added the semantics System semantics and behaviour label Jun 21, 2023
@tillrohrmann tillrohrmann added api-devex Restate user facing Development model needs-refinement Issues that need further investigation labels Nov 17, 2023
@slinkydeveloper slinkydeveloper added the ops-ex Restate operational experience label May 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-devex Restate user facing Development model needs-refinement Issues that need further investigation ops-ex Restate operational experience semantics System semantics and behaviour
Projects
None yet
Development

No branches or pull requests

3 participants