Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

merge_contextvars raises a KeyError #591

Open
allonhadaya-maven opened this issue Jan 24, 2024 · 43 comments
Open

merge_contextvars raises a KeyError #591

allonhadaya-maven opened this issue Jan 24, 2024 · 43 comments
Labels

Comments

@allonhadaya-maven
Copy link

During a recent update of structlog from version 20.2.0 to 21.1.0 in a large Flask + Gunicorn web app, we encountered a new error coming from contextvars.merge_contextvars:

builtins.KeyError: <ContextVar name='structlog_http.x-real-ip' default=Ellipsis at 0x7fe6b387c3b0>
Traceback (most recent call last):
  ...
  File "/api/app.py", line 148, in response_after_request
    log.debug("api_hit", **{"http.status_code": response.status_code})
  File "/usr/local/lib/python3.8/site-packages/structlog/stdlib.py", line 142, in debug
    return self._proxy_to_logger("debug", event, *args, **kw)
  File "/usr/local/lib/python3.8/site-packages/structlog/stdlib.py", line 213, in _proxy_to_logger
    return super()._proxy_to_logger(method_name, event=event, **event_kw)
  File "/usr/local/lib/python3.8/site-packages/structlog/_base.py", line 203, in _proxy_to_logger
    args, kw = self._process_event(method_name, event, event_kw)
  File "/usr/local/lib/python3.8/site-packages/structlog/_base.py", line 160, in _process_event
    event_dict = proc(self._logger, method_name, event_dict)
  File "/usr/local/lib/python3.8/site-packages/structlog/contextvars.py", line 81, in merge_contextvars
    if k.name.startswith(STRUCTLOG_KEY_PREFIX) and ctx[k] is not Ellipsis:
KeyError: <ContextVar name='structlog_http.x-real-ip' default=Ellipsis at 0x7fe6b387c3b0>

The same KeyError can be observed in our application for contextvars with different names beyond structlog_http.x-real-ip.

Updating to the latest version of structlog, v23.3.0 does not seem to have resolved the issue.


Further inspection suggests that #302 may have introduced a race condition to the merge_contextvars function when the implementation changed:

from...

def merge_contextvars(
    logger: WrappedLogger, method_name: str, event_dict: Dict[str, Any]
) -> Context:
    """
    A processor that merges in a global (context-local) context.

    Use this as your first processor in :func:`structlog.configure` to ensure
    context-local context is included in all log calls.

    .. versionadded:: 20.1.0
    """
    ctx = _get_context().copy()
    ctx.update(event_dict)

    return ctx

to...

def merge_contextvars(
    logger: WrappedLogger, method_name: str, event_dict: EventDict
) -> EventDict:
    """
    A processor that merges in a global (context-local) context.

    Use this as your first processor in :func:`structlog.configure` to ensure
    context-local context is included in all log calls.

    .. versionadded:: 20.1.0
    .. versionchanged:: 21.1.0 See toplevel note.
    """
    ctx = contextvars.copy_context()

    for k in ctx:
        if k.name.startswith(STRUCTLOG_KEY_PREFIX) and ctx[k] is not Ellipsis:
            event_dict.setdefault(k.name[len(STRUCTLOG_KEY_PREFIX) :], ctx[k])

    return event_dict

It appears possible for ctx[k] to raise a KeyError despite the fact that k is defined as for k in ctx:


We have temporarily reverted the structlog update, but would love to help develop a fix for this issue upstream. Please let us know if/how we can help with further investigation.

@hynek
Copy link
Owner

hynek commented Jan 25, 2024

Could you give me more information on you setup (like is there async or gevent involved)? Does it have some especially high load? It seems weird that a copied context could change while iterating and nobody running into it in 3 years. 🤔

@mLupine
Copy link

mLupine commented Feb 5, 2024

Chiming in to note that we're experiencing the exact same issue. Weirdly enough, we're only observing the issue in a single AWS Lambda project of ours (although structlog is used in all of them, including FastAPI and Flask-based ones, as well as other Lambdas), and only on a single contextvar that I've added to the code recently. It also doesn't happen every time structlog is initialized, suggesting that it might indeed be a race condition.

The new contextvar is added using a second call to structlog.contextvars.bind_contextvars instead of being an additional kwarg to the first call. I've just modified our code to only do a single call to bind_contextvars and will observe if it fixes the issue. I'll post an update with the results.

@hynek
Copy link
Owner

hynek commented Feb 5, 2024

@mLupine thanks for contributing to the fun murder mystery! Mind sharing your versions of Python and structlog? TIA!

@hynek
Copy link
Owner

hynek commented Feb 5, 2024

Which reminds me: @eladnam any chance any of this makes any sense to you?

@mLupine
Copy link

mLupine commented Feb 5, 2024

@mLupine thanks for contributing to the fun murder mystery! Mind sharing your versions of Python and structlog? TIA!

Sure: Python 3.11.6, structlog 23.2.0 (when the problematic contextvar was added) to 23.3.0 (now)

@hynek hynek added the bug label Feb 5, 2024
@hynek
Copy link
Owner

hynek commented Feb 6, 2024

One more question (to both): is asyncio involved in those apps at all?

We've encountered some weirdness when mixing them (since threads use thread-locals and asyncio does not).

@mLupine
Copy link

mLupine commented Feb 6, 2024

One more question (to both): is asyncio involved in those apps at all?

Not at all.

I also confirmed that moving all contextvars binds to a single bind_contextvars call does not seem to help, the issue still persists — weirdly, still with just one of all vars, and always the same one (it's just a string with a random UUID inside).

@hynek
Copy link
Owner

hynek commented Feb 6, 2024

Does the var change if you define them in a different order? 🤔 How many do you have in total?

@hynek
Copy link
Owner

hynek commented Feb 6, 2024

OK, y'all know what: pls try this instead and report back pls:

def merge_contextvars(
    logger: WrappedLogger, method_name: str, event_dict: EventDict
) -> EventDict:
    for k, v in tuple(contextvars.copy_context().items()):
        if k.name.startswith(STRUCTLOG_KEY_PREFIX) and v is not Ellipsis:
            event_dict.setdefault(k.name[STRUCTLOG_KEY_PREFIX_LEN:], v)

    return event_dict

@mLupine
Copy link

mLupine commented Feb 6, 2024

Does the var change if you define them in a different order? 🤔 How many do you have in total?

It doesn't — and when the var is removed completely and another one takes its spot, the issue goes away. There's always between 2 and 5 context vars and only that one generates errors.

OK, y'all know what: pls try this instead and report back pls

Deploying the change, will let you know if the issue persists 🫡

@mLupine
Copy link

mLupine commented Feb 6, 2024

Well, arguably it did indeed fix the issue, but it did so by bringing the entire application down, so I wouldn't exactly count this one as a win 😏

MemoryError: null
  [irrelevant part cut out]
  File "structlog/_native.py", line 134, in meth
    return self._proxy_to_logger(name, event, **kw)
  File "structlog/_base.py", line 216, in _proxy_to_logger
    args, kw = self._process_event(method_name, event, event_kw)
  File "structlog/_base.py", line 167, in _process_event
    event_dict = proc(self._logger, method_name, event_dict)
  File "infrastructure_common/logging/logging.py", line 221, in merge_contextvars
    for k, v in tuple(contextvars.copy_context().items()):

@hynek
Copy link
Owner

hynek commented Feb 7, 2024

MemoryError: null!? Holy shit? I don't think I've seen that in over 15 years of writing Python. I guess that's the same problem/race condition.

I'm not asking you to retry, but did it come immediately? Like is this error more reproducible?

@mLupine
Copy link

mLupine commented Feb 7, 2024

Yeah, I was quite surprised to see that too. In terms of reproducibility, if we were to trust Datadog, failure rate was 100%

@hynek
Copy link
Owner

hynek commented Feb 7, 2024

Can you reproduce it outside of the runtime? And/or share the how you configure structlog / how it's plugged into requests?

@hynek
Copy link
Owner

hynek commented Feb 7, 2024

Also you mentioned datadog which AFAIK does quite aggressive instrumentation – @allonhadaya-maven are you using datadog too?

@hynek
Copy link
Owner

hynek commented Feb 7, 2024

Also: technically, MemoryError means that the interpreter ran out of memory:

Raised when an operation runs out of memory but the situation may still be rescued (by deleting some objects). The associated value is a string indicating what kind of (internal) operation ran out of memory. Note that because of the underlying memory management architecture (C’s malloc() function), the interpreter may not always be able to completely recover from this situation; it nevertheless raises an exception so that a stack traceback can be printed, in case a run-away program was the cause.

Could this be some kind of leak?

@P403n1x87
Copy link

👋 @mLupine On the Datadog instrumentation point, we added a structlog integration with version 2.2.0. To check whether our instrumentation is playing a part in this, may I suggest the following steps:

  • if you are using ddtrace>=2.2.0 and have logs injection enabled, could you try with logs injection disabled
  • if you are using ddtrace>=2.2.0 and the above didn't help, could you try downgrading to ddtrace~=2.1.0

If you are using ddtrace<2.2.0 it's unlikely that the Datadog instrumentation is at fault here, but we're keen to keep looking into it in case the steps above didn't help 🙏 .

@hynek
Copy link
Owner

hynek commented Feb 8, 2024

@P403n1x87 thanks for chiming in! Apart from logging instrumentation, aren’t you doing something to contextvars too? I got the impression from a cursory search.

It doesn’t look to me like a logging thing to me, but either contextvars have a serious bug, wrong documentation, or something is interfering.

@mLupine
Copy link

mLupine commented Feb 15, 2024

Disabling Datadog injection did not, unsurprisingly, help. What I've just found might, though.

It turns out that in the project that shows those weird errors we had some obsolete structlog initialization code, which was executed separately from the unified structlog-based library we recently migrated all of our project to. This resulted in doubled structlog.configure calls with setup that we didn't really expect or want.

I've now cleaned up our code and suspect that this will be enough to make those errors go away. I'll update and close the issue if it turns out to be the case.

@hynek
Copy link
Owner

hynek commented Feb 16, 2024

Hmmm, I'm curious but multiple configures shouldn't be a problem. It's not like structlog initializes contextvars on configuration. The only changes happen in bind_contextvars and clear_contextvars. 🤔

If that really helps, I'd be curious to learn what those configs exactly did.

@mLupine
Copy link

mLupine commented Feb 17, 2024

It did not help, I can still see those errors occurring once every few hours, still only on that poor single contextvar 😏

@hynek
Copy link
Owner

hynek commented Feb 18, 2024

Can you give me:

  • your runtime (something crazy like uWSGI?? you mentioned lambda, but is here more to it?)
  • the structlog config
  • a list of variables with their types
  • whether you run clear_contextvars regularly

given only two people have reported this problem in almost three years, it must be something small/weird.

I've talked to @1st1 who kinda implemented contextvars and he also said the observed behavior of contextvars simply shouldn't happen.

@mLupine
Copy link

mLupine commented Mar 12, 2024

Hey,

Getting back to the issue after a little break.

your runtime (something crazy like uWSGI?? you mentioned lambda, but is here more to it?)

Not at all, just regular Python 3.11 Lambdas

the structlog config

Omitted some portions of the code, but here's most of the configuration function

    processors = [
        structlog.contextvars.merge_contextvars,
        structlog.stdlib.add_log_level,
        structlog.processors.StackInfoRenderer(),
        datadog_tracer_injection,
        structlog.dev.set_exc_info,
        structlog.processors.CallsiteParameterAdder(
            [CallsiteParameter.FILENAME, CallsiteParameter.FUNC_NAME, CallsiteParameter.LINENO],
        ),
        structlog.processors.TimeStamper(fmt="iso"),
        SentryProcessor(event_level=logging.ERROR, tag_keys="__all__"),
        structlog.processors.dict_tracebacks,
        structlog.processors.EventRenamer("message", "_event"),
        structlog.processors.JSONRenderer(serializer=_json_serializer),
    ]

    structlog.configure(
        cache_logger_on_first_use=True,
        wrapper_class=structlog.make_filtering_bound_logger(
            logging.INFO
        ),
        processors=processors,  # type: ignore[arg-type]
    )

a list of variables with their types

Most of the time, it's just the following two top-level variables:

request: dict
   lambda_instance_id: str
   request_id: str
   identity: str | None
   query_string: str | None
   path_parameters: str | None
   body: str | None
lambda_instance_id: str

whether you run clear_contextvars regularly

Not in lambdas, no. If a Lambda is executed on a warm instance, we just bind contextvars again before the request is handled, overwriting previous values.

@hynek
Copy link
Owner

hynek commented Mar 13, 2024

Could you please try using clear_contextvars then? I’m totally fishing in the dark here, but lambda and its potential idiosyncrasies is the only unusual thing about your setup. :(

@mLupine
Copy link

mLupine commented Mar 20, 2024

I've modified our handler decorator to use the bound_contextvars context manager so that it cleans up our custom contextvars nicely after every Lambda invocation.

And yeah, it didn't help at all ☠️

@hynek
Copy link
Owner

hynek commented Mar 20, 2024

Thank you for humoring me!

If we can't figure this out, I guess I'll have to undeprecate the thread local module. :(

@mLupine
Copy link

mLupine commented Mar 28, 2024

Does it exist somewhere in the current codebase so that I can try it out? I've studied the codebase of our project yet another time and once again found nothing that could interfere with contextvars, especially in such a weird way 😏

@hynek
Copy link
Owner

hynek commented Mar 28, 2024

Just use https://www.structlog.org/en/stable/thread-local.html and ignore the warnings (pls use only the contextvars equivalents).

I don't think it's your code, I suspect it's the environment or some kind of instrumentation. Unfortunately we haven't heard back from @allonhadaya-maven so I don't know if theirs is special in any way.

If it would happen under normal circumstances, the issue tracker would be overflowing.

@hynek hynek pinned this issue Apr 2, 2024
@mLupine
Copy link

mLupine commented Jun 3, 2024

After a few weeks after switching to thread local vars I can confirm it "solves" the issue in our Lambdas. However, just a few days ago we started noticing the same errors in one of our Flask-based applications. It doesn't share much code with our Lambdas, although the internal logging library is the same. We're still on structlog 24.1.0, and this leads me to believe that the errors might be related to something we're doing in our code. I'm just out of ideas about what that something could be and it's not exactly easy to debug given that there are like 10 errors a week despite millions of executions 😣

@hynek
Copy link
Owner

hynek commented Jun 26, 2024

I've talked to @1st1 and @elprans at PyCon and given your MemoryError(None) the main/only suspicion is a faulty memory allocator.

  • Are you sure you don't run into low-memory situations?
  • Are you using a weird libc like Alpine's musl?
  • Do you do anything else that could interfere with memory allocation at a lower level?

@MunifTanjim
Copy link

We're also facing this error with Flask. It happens very infrequently. Let me know if I can provide any useful information.

@hynek
Copy link
Owner

hynek commented Aug 21, 2024

Anything I wrote above! It smells like a runtime / CPython issue, so anything you can share about your environment, please do share. Especially if you're running on Alpine (or some similar musl distro) and double-check your memory consumption.

@rinaldo-urbint
Copy link

We were also facing this same error, and I eventually resorted to monkey-patching the merge_contextvars to skip that KeyError exceptions so our datadog doesn't get clouded. I did not, however test it with the old thread_local module. Just an FYI!

@mLupine
Copy link

mLupine commented Nov 19, 2024

Glad to hear we're not the only one. Using thread-local works fine, but the problem with that approach is that variables are, well, thread-local. This results in some data missing from multithreaded pieces of our code, but at least they don't crash, right? 😞

@MunifTanjim
Copy link

I eventually resorted to monkey-patching the merge_contextvars to skip that KeyError exceptions

We did the same.

@hynek
Copy link
Owner

hynek commented Dec 21, 2024

I can only keep begging everyone who has this problem to give me more information.

Is everyone on Datadog who has this problem? Are y’all running on musl or something like that? Are those apps memory-restrained?

A KeyError while iterating over a contextvar is “impossible”. We’re running into some weird runtime bug here, but there’s no way for me to figure it out by thinking really hard about it. 😔

@rinaldo-urbint
Copy link

Is everyone on Datadog who has this problem? Are y’all running on musl or something like that? Are those apps memory-restrained?

Hmm, memory-restrained... Could be a problem, but in my case, no we weren't using musl.

A KeyError while iterating over a contextvar is “impossible”. We’re running into some weird runtime bug here

Yea. 😞 That was also what was throwing me off completely. It shouldn't happen. I'll see if I can add any more details soon.

@mLupine
Copy link

mLupine commented Dec 23, 2024

Is everyone on Datadog who has this problem?

Yeah 😏

Are y’all running on musl or something like that?

Nope

Are those apps memory-restrained?

Yes — either Lambdas or an ECS Fargate task, but applications are nowhere close their memory limit when the issue occurs.

@rinaldo-urbint
Copy link

but applications are nowhere close their memory limit when the issue occurs.

True here as well!

@hynek
Copy link
Owner

hynek commented Dec 23, 2024

but applications are nowhere close their memory limit when the issue occurs.

True here as well!

True also as in that it’s a non-standard runtime like AWS Lambda?

@rinaldo-urbint
Copy link

True also as in that it’s a non-standard runtime like AWS Lambda?

As in there were no memory spikes. And no it was not on Lambda as well. Could it be possible with the Gunicorn threaded run that has something to do with this? 🤔

@methane
Copy link
Contributor

methane commented Dec 24, 2024

Would you provide all the relevant information about your environment in a single response?
It is very difficult to debug without reproduce.

  • Does "no Lambda" mean it can happen not on AWS provided container? Or all errors happens on AWS container (e.g. Firecracker)?
  • Do you use DataDog? Can you provide the code to integrate DataDog with structlog?

@hynek
Copy link
Owner

hynek commented Dec 24, 2024

Thank you so much for dropping by Inada-san!

Would you provide all the relevant information about your environment in a single response? It is very difficult to debug without reproduce.

💯

I wonder if we should convert this to a discussion.

  • Do you use DataDog? Can you provide the code to integrate DataDog with structlog?

We’ve had a response from DataDog in #591 (comment) and I don’t think it’s about their structlog integration. My suspicion is that it has something to do with contextvars in general – whether thru DD integration or not.

As a reminder since this is hard to follow: I’ve asked users to call tuple in items and they got… a MemoryError! #591 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

7 participants