Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deadlock in anyio.Lock #398

Closed
graffic opened this issue Dec 3, 2021 · 12 comments · Fixed by #399
Closed

Deadlock in anyio.Lock #398

graffic opened this issue Dec 3, 2021 · 12 comments · Fixed by #399
Labels
asyncio Involves the asyncio backend bug Something isn't working

Comments

@graffic
Copy link

graffic commented Dec 3, 2021

Related to this issue in httpcore: encode/httpcore#454

TL;DR: With the right timing, the lock acquire step can be cancelled and the lock remain locked for other tasks because when using the lock context manager, if the task is cancelled during the __aenter__ , __aexit__ is not executed.

This snippet reproduces the issue:

import asyncio
import anyio

class Holder:
    def __init__(self):
        # self.lock = asyncio.Lock()
        self.lock = anyio.Lock()

    async def work(self, index):
        print(f"Trying to acquire {index}")
        async with self.lock:
            print(f"Doing index {index}")
            await asyncio.sleep(5)


async def main_anyio():
    holder = Holder()
    t1 = asyncio.create_task(holder.work(1), name="holder 1")
    t2 = asyncio.create_task(holder.work(2), name="holder 2")

    print("Tasks created")
    await asyncio.sleep(0)

    print("Cancelling t1")
    t1.cancel()

    print("Waiting for t2")
    await asyncio.gather(t2)


if __name__ == "__main__":
    asyncio.run(main_anyio())

The issue goes away when using asyncio.Lock because that lock context manager never blocks during the acquire step. Even if the __aenter__ method in asyncio.Lock is asynchronous, no asynchronous code is executed there.

@smurfix
Copy link
Collaborator

smurfix commented Dec 3, 2021

The issue goes away when using asyncio.Lock because that lock context manager never blocks during the acquire step. Even if the __aenter__ method in asyncio.Lock is asynchronous, no asynchronous code is executed there.

? I don't understand that part. Of course the lock's context manager blocks during acquire. How else could it possibly work?

@agronholm
Copy link
Owner

AnyIO has an explicit code path that is executed in case the task is cancelled during acquire(). It checks if it has been set as the owner of the lock, and then proceeds to release the lock to the next task in line.

@graffic
Copy link
Author

graffic commented Dec 3, 2021

The issue goes away when using asyncio.Lock because that lock context manager never blocks during the acquire step. Even if the __aenter__ method in asyncio.Lock is asynchronous, no asynchronous code is executed there.

? I don't understand that part. Of course the lock's context manager blocks during acquire. How else could it possibly work?

My bad there. I mean that the implementation of __aenter__ for asyncio.Lock when you do acquire (when you are the first to acquire) does not have any line of code that awaits. It is all synchronous code. Checked for python 3.9 and 3.10,

@agronholm
Copy link
Owner

On AnyIO, the only await line is await checkpoint_if_cancelled() which checks for cancellation before attempting the synchronous acquisition. If that fails, then it waits for an event that is set when the previous holder releases the lock. If that wait is cancelled, it checks if it got the lock anyway (this can happen). If it did receive ownership of the lock, it will release the lock. Otherwise, it will just remove its waiter event from the queue.

@graffic
Copy link
Author

graffic commented Dec 3, 2021

On AnyIO, the only await line is await checkpoint_if_cancelled() which checks for cancellation before attempting the synchronous acquisition. If that fails, then it waits for an event that is set when the previous holder releases the lock. If that wait is cancelled, it checks if it got the lock anyway (this can happen). If it did receive ownership of the lock, it will release the lock. Otherwise, it will just remove its waiter event from the queue.

Yes, that is exactly the place I got when debugging. There is an await there, the task yields, and in my case the lock is not released.

https://github.com/agronholm/anyio/blob/master/src/anyio/_core/_synchronization.py#L137

What happens when you run my PoC?

@agronholm
Copy link
Owner

So what seems to happen here is this:

  1. Task 1 acquires the lock
  2. Task 1 enters cancel_shielded_checkpoint() where a task switch occurs
  3. Task 1 is cancelled
  4. Task 1 raises CancelledError while still in __aenter__(), thus causing __aexit__() to not be called

The await cancel_shielded_checkpoint() call is shielded from cancellation – but only AnyIO's level triggered cancellation, so native cancellation can still cause an exception to be raised, causing __aexit__() to not be run.

@agronholm
Copy link
Owner

Dealing with two different cancellation systems is a pain, but a pain AnyIO has to live with. We should probably add more tests to check for native cancellation behavior.

@agronholm agronholm added asyncio Involves the asyncio backend bug Something isn't working labels Dec 3, 2021
@agronholm
Copy link
Owner

So this opens up the question of how to deal with cancellations arriving in cancel_shielded_checkpoint()? Do we just let that happen and deal with it, or cancel the tasks's current cancel scope and suppress the original exception? The latter seems a bit risky when mixing anyio and native asyncio code.

@agronholm
Copy link
Owner

I wrote tests against Lock, Semaphore and CapacityLimiter and they all suffer from the same problem. Fortunately this looks like an easy fix.

@agronholm
Copy link
Owner

To be safe, I inspected all other AnyIO classes that have an __aenter__() method that acquires resources but didn't find any other code vulnerable to this issue.

@agronholm
Copy link
Owner

@graffic would you mind reviewing or at least testing the fix? I'm pretty confident about it but as policy, I try to get all new AnyIO code reviewed.

@graffic
Copy link
Author

graffic commented Dec 13, 2021

@agronholm So your fix makes the PoC pass. 👍

Checking the pull request, I see that you've added a try in the point where it could deadlock, plus releasing the lock in case of error. So I'd say that's the fix :D

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
asyncio Involves the asyncio backend bug Something isn't working
Projects
None yet
3 participants