
Failing to close a CancelScope produces super confusing errors #882

Closed
Badg opened this issue Jan 26, 2019 · 13 comments · Fixed by #1061

Badg commented Jan 26, 2019

(update: this appears to have been due to failing to close the CancelScope; see: #882 (comment))

I'm not sure either way if this is ultimately a dupe of #552, so feel free to close if so. I'm also having a hell of a hard time figuring out how to reliably reproduce this, and it may actually be an issue in trio_asyncio?

What's the issue?

During teardown, there's a KeyError in trio._core._run:CancelScope._remove_task.

Steps to reproduce

I haven't been able to repro this outside of one specific script, but I've tried a lot of different things. For the sake of brevity, I'll call that script "voidreader", since that's what I'm actually calling it and if you can surmise any secret sauce from that name then you deserve to know anyways. Background:

  • Voidreader is starting from a call to trio.run
  • It runs a separate logging system in a companion thread, also running via trio.run, but I'm pretty confident that isn't related, because I still see the error when I comment out the logging system. However, that library does some patching of the stdlib logging module, which... may or may not be relevant?
  • Voidreader starts some async context managers pretty much immediately. One of them runs a Quart server inside of a call to aio_as_trio; there's other uses of it as well
  • Voidreader is also using asks for some stuff, but I don't think that's relevant either
  • Control-C interrupts stop "just working" inside voidreader. It takes me several attempts to get it to actually exit, and then I see the traceback included below

I've tried reproing with stuff like this:

from contextlib import asynccontextmanager
import asyncio

from trio_asyncio import aio_as_trio
from trio_asyncio import trio_as_aio
import trio
import trio_asyncio


@asynccontextmanager
async def nested():
    print('fly, you fools!')
    yield
    await aio_as_trio(asyncio.sleep(7))


async def main():
    async with trio_asyncio.open_loop() as loop:
        async with nested():
            await trio.sleep(1)


if __name__ == '__main__':
    trio.run(main)

but to no avail -- note that control+C still works here, raising a KeyboardInterrupt. My hunch is that this may have something to do with the Quart stuff that's running in an asyncio context.

Traceback

Traceback (most recent call last):
  File "C:\Users\Nick\venv\foo\lib\site-packages\trio\_core\_run.py", line 1323, in run
    run_impl(runner, async_fn, args)
  File "C:\Users\Nick\venv\foo\lib\site-packages\trio\_core\_run.py", line 1471, in run_impl
    runner.task_exited(task, final_outcome)
  File "C:\Users\Nick\venv\foo\lib\site-packages\trio\_core\_run.py", line 943, in task_exited
    task._cancel_stack[-1]._remove_task(task)
  File "C:\Users\Nick\venv\foo\lib\site-packages\trio\_core\_run.py", line 203, in _remove_task
    self._tasks.remove(task)
KeyError: <Task '__main__.main' at 0x27f9eaf5b00>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Python37\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Python37\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "d:\dropbox\projekte\foo\bar\baz\__main__.py", line 36, in <module>
    trio.run(main)
  File "C:\Users\Nick\venv\foo\lib\site-packages\trio\_core\_run.py", line 1329, in run
    ) from exc
trio.TrioInternalError: internal error in trio - please file a bug!
Exception ignored in: <function Nursery.__del__ at 0x0000027F9CEE0730>
Traceback (most recent call last):
  File "C:\Users\Nick\venv\foo\lib\site-packages\trio\_core\_run.py", line 530, in __del__
AssertionError:
PS C:\Users\Nick>

Expected behavior

At the very least, a better traceback -- at one point I thought I was seeing that traceback during startup, when ultimately the problem was a recursion issue in my logging patch. While I was trying to repro the trio issue, I stumbled upon the problem in my code, fixed it, and suddenly things ran happily again -- until shutdown, that is, which is where the traceback was originating anyway. Having more information about surrounding exceptions would have been really helpful early on, to help me isolate which problems were mine vs. which might be upstream.

Of course also, I'd expect a cleaner shutdown! :)

njsmith (Member) commented Jan 27, 2019

That does look pretty weird. Apparently we have a task that thinks it's inside a certain cancel scope, but the cancel scope doesn't think that task is inside it. That really shouldn't be possible. I can't think of any way "user space" code could even mess this up, unless you're literally hacking around at private attributes of trio internal objects.

I just read through all the code in trio that manages task._cancel_stack and cancel_scope._tasks, trying to see if there were any places where they could get out of sync, and really nothing jumped out at me. There is some pretty complicated code around nursery.start that I trust less than the rest of the code, so that might be something to scrutinize more closely, but everywhere else we always update them together. So this is going to require some more careful debugging.

I don't think #552 is related – that's specifically about a user programming error where someone did await asyncio_future in a trio context, and then for some reason the code we have to catch that mistake didn't work properly.

What might be related is #837, which is another mysterious case involving a cancel scope and a KeyError (but on the global deadline dict, not the task set – though it's possible the root cause involves corruption of the task set).

I gather that you can't share your actual code. I'm not sure from your description how easily you can reproduce this. (One of the things that makes #837 difficult to track down is that it only happened once, in a program that was running 16+ hours, so it's kinda hard to dig into...) Can you reproduce it quickly and reliably? If so then I guess the next step might be to just start systematically deleting chunks of the program until it stops crashing. It's kind of an annoying way to debug, but the nice thing is that it's guaranteed to tell you something...
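
In sketch form, the bookkeeping in question works like this. Only _remove_task below matches trio's actual source (it's quoted verbatim later in this thread); _add_task is a reconstruction for illustration and may not match trio's exact code:

class CancelScope:
    def __init__(self):
        self._tasks = set()

    def _add_task(self, task):
        # The scope records the task, and the task records the scope;
        # the two structures must always be updated together.
        self._tasks.add(task)
        task._cancel_stack.append(self)

    def _remove_task(self, task):
        self._tasks.remove(task)  # KeyError here if the two ever diverge
        assert task._cancel_stack[-1] is self
        task._cancel_stack.pop()

The reported KeyError means a task reached _remove_task on a scope whose _tasks set no longer contained it.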

njsmith (Member) commented Jan 27, 2019

(BTW, just in case you weren't aware, you can now use quart without trio-asyncio. But we should still track this down :-).)
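
For reference, going trio-asyncio-free looks roughly like this with the quart-trio extension -- a hypothetical minimal sketch, where QuartTrio is the only assumed API and nothing here comes from the code in this thread:

from quart_trio import QuartTrio

app = QuartTrio(__name__)

@app.route('/')
async def index():
    return 'hello from trio-native quart'

if __name__ == '__main__':
    app.run()  # serves directly on trio; no trio-asyncio bridge needed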

Badg (Author) commented Jan 27, 2019

There is some pretty complicated code around nursery.start that I trust less than the rest of the code

This actually might be a good hunch; more on that in the code "snippet" at the bottom.

Can you reproduce it quickly and reliably?

I can reliably repro; I'll take a look tomorrow to see if I can make any better headway on what the bare minimum repro is, which hopefully I should be able to share.

you can now use quart without trio-asyncio

Indeed! I'm planning on upgrading that when I also upgrade hypercorn to 0.5.0, to avoid this gross hack -- which, incidentally, like I mentioned above, might be related. I'm not sure if this will run in isolation, because I've never tested it in isolation (and I wanted to strip out some application-specific stuff), but this is at least the general gist.

# Imports assumed by this excerpt (cnc_app and the _CNC_* globals are
# application-specific and left undefined here):
import asyncio
from contextvars import ContextVar
from typing import Any, List
from unittest.mock import patch

import trio
from trio_asyncio import aio_as_trio

QUART_ENGINE: ContextVar[List[Any]] = ContextVar('quart_engine')

class TaskEngine:
    '''Base class for things that need to manage themselves, and are
    therefore "booted" during startup.
    '''

    def __init_subclass__(cls, *args, context, **kwargs):
        super().__init_subclass__(*args, **kwargs)
        cls.ENGINE_CONTEXT = context

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._engine_nursery = None
        self._engine_token = None
        self._engine_cancel_scope = None

    async def __aenter__(self):
        previous_engines = self.ENGINE_CONTEXT.get(None)
        if previous_engines is not None:
            # Make a copy to preserve the original context
            engines = [*previous_engines, self]
        else:
            engines = [self]

        self._engine_token = self.ENGINE_CONTEXT.set(engines)
        self._engine_nursery_manager = trio.open_nursery()
        self._engine_cancel_scope = trio.open_cancel_scope().__enter__()
        self._engine_nursery = await self._engine_nursery_manager.__aenter__()
        await self._engine_nursery.start(self.run)

    async def __aexit__(self, exc_type, exc, exc_tb):
        self._engine_cancel_scope.cancel()
        await self._engine_nursery_manager.__aexit__(exc_type, exc, exc_tb)
        self._engine_cancel_scope.__exit__(exc_type, exc, exc_tb)
        self.ENGINE_CONTEXT.reset(self._engine_token)

    @classmethod
    def get(cls):
        '''Gets a single (the outermost) task engine.'''
        engines = cls.ENGINE_CONTEXT.get()
        if engines is None:
            return None
        else:
            return engines[-1]

    @classmethod
    def get_all(cls):
        '''Gets all task engines.'''
        return cls.ENGINE_CONTEXT.get([])

    async def run(self, *args, task_status=trio.TASK_STATUS_IGNORED, **kwargs):
        raise NotImplementedError('Engine subclasses need to implement run!')


class QuartEngine(TaskEngine, context=QUART_ENGINE):

    def __init__(self, *args, host, port, **kwargs):
        super().__init__(*args, **kwargs)
        self.host = host
        self.port = port

    def _patched_quart_init(self, app, config, loop):
        self.app = app
        self.config = config
        self.loop = loop

    @aio_as_trio
    async def _hypercorn_run(self, task_status):
        '''This is basically a copypasta-ish of run_single inside
        hypercorn.asyncio.run, ported to start from within a coroutine
        inside an already-running asyncio loop, and stripping out extra
        junk we don't need.
        '''
        # Keep the namespace pollution inside here please!
        from hypercorn.asyncio.lifespan import Lifespan
        from hypercorn.asyncio.run import Server
        from hypercorn.utils import create_socket

        app = self.app
        config = self.config
        loop = self.loop

        loop.set_debug(config.debug)

        lifespan = Lifespan(app, config)
        lifespan_task = asyncio.ensure_future(lifespan.handle_lifespan())

        await lifespan.wait_for_startup()

        ssl_context = config.create_ssl_context()
        sock = create_socket(config)

        server = await loop.create_server(
            lambda: Server(app, loop, config), backlog=config.backlog,
            ssl=ssl_context, sock=sock
        )

        task_status.started()
        try:
            # Biggest non-removal change is here.
            await server.serve_forever()
        finally:
            server.close()
            await server.wait_closed()
            await lifespan.wait_for_shutdown()
            lifespan_task.cancel()
            await lifespan_task

    async def run(self, task_status=trio.TASK_STATUS_IGNORED):
        global _CNC_HOST
        global _CNC_PORT

        try:
            # This sets our app, config, and loop
            with patch('quart.app.run_single', self._patched_quart_init):
                cnc_app.run(host=self.host, port=self.port)

            _CNC_HOST = self.host
            _CNC_PORT = self.port

            await self._hypercorn_run(task_status)

        finally:
            self.app._got_first_request = False

Side note: this is not my proudest code! But of special note, the nested context manager that has to manually invoke the nursery dunders might be relevant.

njsmith (Member) commented Jan 27, 2019

Umm. Yeah, that is a lot of risky-looking code, and I don't even really understand what you're trying to do, so I can't give you any clear feedback on whether it makes sense.

One thing that jumps out at me is that this bit in __aexit__ looks dangerously wrong:

        await self._engine_nursery_manager.__aexit__(exc_type, exc, exc_tb)
        self._engine_cancel_scope.__exit__(exc_type, exc, exc_tb)

It's very common that self._engine_nursery_manager.__aexit__ will raise an exception (in particular because you just cancelled it on the line before... I think it might always raise an exception?). When this happens, what you should do is feed this new exception to self._engine_cancel_scope.__exit__, but you don't... (Also btw, __aexit__ and __exit__ have a return value you have to check... though I don't think that will affect anything until you fix the bug with __aexit__ raising.) If you really have to do... whatever it is that this does :-), then I strongly recommend finding a way to use @asynccontextmanager, because it makes it MUCH simpler to compose context managers correctly.

All that said... I still don't see any obvious way that this would cause the traceback that you showed at the beginning of the thread :-(. So there's still an open question about that.
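
To make the @asynccontextmanager suggestion concrete, here's a minimal sketch of the "start a background task from a context manager" pattern. The running_engine name and shape are illustrative, not taken from the code above:

from contextlib import asynccontextmanager

import trio

@asynccontextmanager
async def running_engine(async_fn):
    # async_fn must accept a task_status kwarg and call task_status.started(),
    # as nursery.start requires.
    async with trio.open_nursery() as nursery:
        await nursery.start(async_fn)
        try:
            yield nursery
        finally:
            # Cancel the background task on the way out; the async-with
            # machinery feeds exceptions through the nursery and its cancel
            # scope correctly, with no hand-written dunder calls.
            nursery.cancel_scope.cancel()

With something like this, the whole __aenter__/__aexit__ pair collapses to entering and exiting one context manager.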

smurfix (Contributor) commented Jan 27, 2019

One thing that jumps out at me is that this bit in __aexit__ looks dangerously wrong:

Short answer: don't do that, it's tricky to get right at best.

I'd recommend handling this kind of situation with an AsyncExitStack, assuming that you can't easily do the @asynccontextmanager thing for some reason.

Badg (Author) commented Jan 27, 2019

One thing that jumps out at me is that this bit in __aexit__ looks dangerously wrong:

Thanks for the catch! For the sake of debugging, I updated it in two different ways. First:

    async def __aexit__(self, exc_type, exc, exc_tb):
        self._engine_cancel_scope.cancel()
        try:
            try:
                await self._engine_nursery_manager.__aexit__(exc_type, exc,
                                                             exc_tb)
            except BaseException:
                cancel_exc_info = sys.exc_info()
            else:
                cancel_exc_info = exc_type, exc, exc_tb
            finally:
                self._engine_cancel_scope.__exit__(*cancel_exc_info)
        finally:
            self.ENGINE_CONTEXT.reset(self._engine_token)

After that, the strange traceback was still there. However, when I more-properly updated it to...

    async def __aenter__(self):
        previous_engines = self.ENGINE_CONTEXT.get(None)
        if previous_engines is not None:
            # Make a copy to preserve the original context
            engines = [*previous_engines, self]
        else:
            engines = [self]

        self._engine_token = self.ENGINE_CONTEXT.set(engines)

        async with AsyncExitStack() as stack:
            self._engine_cancel_scope = stack.enter_context(
                trio.open_cancel_scope())
            self._engine_nursery = await stack.enter_async_context(
                trio.open_nursery())
            self._engine_exitstack = stack.pop_all()

        await self._engine_nursery.start(self.run)

    async def __aexit__(self, exc_type, exc, exc_tb):
        self._engine_cancel_scope.cancel()

        try:
            await self._engine_exitstack.aclose()
        finally:
            self.ENGINE_CONTEXT.reset(self._engine_token)

...and now a ctrl+C results in a CancelledError -> KeyboardInterrupt as expected. So I'm leaning towards the error being my code, and not Trio's, though the traceback could probably be improved to help emphasize that.

I did also notice that in the __aenter__ it appears I originally messed up the order of the following lines:

        self._engine_nursery_manager = trio.open_nursery()
        self._engine_cancel_scope = trio.open_cancel_scope().__enter__()
        self._engine_nursery = await self._engine_nursery_manager.__aenter__()

(it should instead be cancel scope __enter__, then open_nursery, then __aenter__). However, fixing that also didn't make the original traceback disappear. So, I'm at that point where "I don't know what the problem was, and I don't know what fixed it, but it's gone now".

Is it worthwhile for me to try and get the traceback back so we can figure out what was actually causing it? If so, it'll have to wait until next weekend, when I'll have some more time to revisit.

Short answer: don't do that, it's tricky to get right at best.

Like I said, it's gross code. I wrote it without an internet connection, so I was recalling the stdlib from memory and didn't remember the API for AsyncExitStack (and didn't want to waste time working it out via introspection).

That being said, I think there's a documentation problem here, because it didn't really occur to me that the cancel scope __exit__ might care about the Cancelled raised by the nursery manager until Nathaniel's comment above. It seems to me like using an async context manager to spawn a background task of some sort would be a relatively common pattern, so I think it would be a smart thing to mention somewhere (I'm just not sure where, or I'd already have a PR up for it).

njsmith (Member) commented Jan 28, 2019

I'm still super-confused about how you're managing to get that specific error, so if you do find a chance to dig into it then please let us know :-). But, you've solved your problem, and it seems very likely it was related somehow to your code breaking Python semantics that Trio relies on, so at this point it's more a fascinating mystery than an urgent fix.

By the way... I still don't understand what this engine thing is trying to do, but I think you can get rid of that cancel scope entirely? Nurseries already have a built-in cancel scope that covers the exact same code as the nursery block; you can access it as nursery.cancel_scope. (And the nursery __aenter__/__aexit__ already handle all the tricky nesting issues this creates :-).)
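
For illustration, a tiny (hypothetical) example of that built-in scope:

import trio

async def main():
    async with trio.open_nursery() as nursery:
        nursery.start_soon(trio.sleep_forever)
        await trio.sleep(1)
        # Cancels sleep_forever and everything else in the nursery:
        nursery.cancel_scope.cancel()

trio.run(main)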

Badg (Author) commented Jan 28, 2019

if you do find a chance to dig into it then please let us know

Will do! Sometime soon, I hope (famous last words).

I think you can get rid of that cancel scope entirely?

D'oh. Of course, absolutely. I'd completely forgotten about the built-in nursery cancel scope.

I still don't understand what this engine thing is trying to do

That's a much longer conversation (which isn't really relevant here). The short version is that it's a WIP/prototype of a pattern I'm trying to figure out for running in-process async background tasks while avoiding module-level singletons. There's a lot of "lessons learned the hard way" baked into that, which at some point I should probably write a blog post about or something... in all my spare time (ha!)

smurfix (Contributor) commented Jan 28, 2019

It seems to me like using an async context manager to spawn a background task of some sort would be a relatively common pattern

Right. But explicitly calling multiple sub-objects' __exit__ or __aexit__ so that no semantics get broken isn't trivial; just look at the code of ExitStack and AsyncExitStack. So we might want to add a note to our docs saying that one should use an async context manager or an AsyncExitStack for managing that, instead of invariably getting the nitty-gritty details subtly wrong.

oremanj (Member) commented May 2, 2019

I still don't see any obvious way that this would cause the traceback that you showed at the beginning of the thread :-(. So there's still an open question about that.

I've run into an error like this a number of times in the course of some of my low-level cancellation work. I'm pretty sure the sequence of events is:

  • nursery __aexit__ raises an exception
  • due to the issue that @njsmith pointed out, cancel scope __exit__ is skipped
  • the next-outer cancel scope's __exit__ runs CancelScope._remove_task:

    def _remove_task(self, task):
        self._tasks.remove(task)
        assert task._cancel_stack[-1] is self
        task._cancel_stack.pop()

  • the assert task._cancel_stack[-1] is self fails (because the inner cancel scope was never popped) and raises AssertionError
  • now a scope remains on task's cancel stack that no longer has task in its set of tasks -- the mismatch that later surfaces as the KeyError in task_exited

If you look at outcome in the task_exited frame in the traceback, you should find that it contains an AssertionError originating on the above-mentioned line of CancelScope._remove_task.

This could definitely be made less brittle, but we're planning to revamp a bunch of this infrastructure with #958, so maybe we should put this on hold for now and look again after #958 lands?
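
For anyone who wants to see the mis-nesting in isolation: the commit below mentions async generators that yield inside a cancel scope, and a sketch of that hazard looks like this. (Hedged: this is an illustration of the failure mode, not code from this issue, and the exact error you get depends on your trio version -- confusing internal errors before the fix, a single RuntimeError after it.)

import trio

async def leaky_agen():
    with trio.move_on_after(10):  # scope entered inside the generator...
        yield                     # ...which suspends before the scope exits

async def main():
    agen = leaky_agen()
    await agen.asend(None)        # runs up to the yield; scope is now open
    with trio.move_on_after(10):  # caller pushes its own scope on top
        await agen.aclose()       # generator's scope now exits out of order

trio.run(main)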

oremanj changed the title from "Trio internal error during teardown, possibly related to trio_asyncio and context managers" to "Failing to close a CancelScope produces super confusing errors" on May 2, 2019
Badg (Author) commented May 2, 2019

Ahh, I'd completely forgotten I said I'd try to get an easy repro of this -- work got especially crazy, and I'm not likely to have time anytime soon.

I'm going to defer everything to you guys on this one, because I think at the end of the day the actual problem was, well, me! But I agree that improving the messaging in the resulting errors could help make that more obvious.

pipermerriam commented

This appears to be the same bug and I have simple code that reproduces the error. Maybe this helps: python-trio/pytest-trio#82

oremanj (Member) commented May 17, 2019

There's another related issue report in #1055. A first step here might be to have TrioInternalError link to a part of the documentation that explains all of the various undefined/misguided/... things we've seen people do that result in TrioInternalErrors. The "please report a bug!" instruction is probably misleading more often than it's right -- empirically, the problem is a lot more likely to be something clever the user is doing that violates our invariants, rather than Trio misbehaving despite being used as documented.

oremanj added a commit to oremanj/trio that referenced this issue May 21, 2019
When exiting an outer cancel scope before some inner cancel scope nested inside it, Trio would previously raise a confusing chain of internal errors. One common source of this misbehavior was an async generator that yielded within a cancel scope. Now the mis-nesting condition will be noticed, and in many cases the user will receive a single RuntimeError with a useful traceback. This is a best-effort feature, and it's still possible to fool Trio if you try hard enough, such as by closing a nursery's cancel scope from within some child task of the nursery.

Fixes python-trio#882. Relevant to python-trio#1056, python-trio#264.