-
-
Notifications
You must be signed in to change notification settings - Fork 231
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
merge_contextvars raises a KeyError #591
Comments
Could you give me more information on you setup (like is there async or gevent involved)? Does it have some especially high load? It seems weird that a copied context could change while iterating and nobody running into it in 3 years. 🤔 |
Chiming in to note that we're experiencing the exact same issue. Weirdly enough, we're only observing the issue in a single AWS Lambda project of ours (although structlog is used in all of them, including FastAPI and Flask-based ones, as well as other Lambdas), and only on a single contextvar that I've added to the code recently. It also doesn't happen every time structlog is initialized, suggesting that it might indeed be a race condition. The new contextvar is added using a second call to |
@mLupine thanks for contributing to the fun murder mystery! Mind sharing your versions of Python and structlog? TIA! |
Which reminds me: @eladnam any chance any of this makes any sense to you? |
Sure: Python 3.11.6, structlog 23.2.0 (when the problematic contextvar was added) to 23.3.0 (now) |
One more question (to both): is asyncio involved in those apps at all? We've encountered some weirdness when mixing them (since threads use thread-locals and asyncio does not). |
Not at all. I also confirmed that moving all contextvars binds to a single |
Does the var change if you define them in a different order? 🤔 How many do you have in total? |
OK, y'all know what: pls try this instead and report back pls: def merge_contextvars(
logger: WrappedLogger, method_name: str, event_dict: EventDict
) -> EventDict:
for k, v in tuple(contextvars.copy_context().items()):
if k.name.startswith(STRUCTLOG_KEY_PREFIX) and v is not Ellipsis:
event_dict.setdefault(k.name[STRUCTLOG_KEY_PREFIX_LEN:], v)
return event_dict |
It doesn't — and when the var is removed completely and another one takes its spot, the issue goes away. There's always between 2 and 5 context vars and only that one generates errors.
Deploying the change, will let you know if the issue persists 🫡 |
Well, arguably it did indeed fix the issue, but it did so by bringing the entire application down, so I wouldn't exactly count this one as a win 😏
|
I'm not asking you to retry, but did it come immediately? Like is this error more reproducible? |
Yeah, I was quite surprised to see that too. In terms of reproducibility, if we were to trust Datadog, failure rate was 100% |
Can you reproduce it outside of the runtime? And/or share the how you configure structlog / how it's plugged into requests? |
Also you mentioned datadog which AFAIK does quite aggressive instrumentation – @allonhadaya-maven are you using datadog too? |
Also: technically, MemoryError means that the interpreter ran out of memory:
Could this be some kind of leak? |
👋 @mLupine On the Datadog instrumentation point, we added a
If you are using |
@P403n1x87 thanks for chiming in! Apart from logging instrumentation, aren’t you doing something to contextvars too? I got the impression from a cursory search. It doesn’t look to me like a logging thing to me, but either contextvars have a serious bug, wrong documentation, or something is interfering. |
Disabling Datadog injection did not, unsurprisingly, help. What I've just found might, though. It turns out that in the project that shows those weird errors we had some obsolete structlog initialization code, which was executed separately from the unified structlog-based library we recently migrated all of our project to. This resulted in doubled I've now cleaned up our code and suspect that this will be enough to make those errors go away. I'll update and close the issue if it turns out to be the case. |
Hmmm, I'm curious but multiple configures shouldn't be a problem. It's not like structlog initializes contextvars on configuration. The only changes happen in If that really helps, I'd be curious to learn what those configs exactly did. |
It did not help, I can still see those errors occurring once every few hours, still only on that poor single contextvar 😏 |
Can you give me:
given only two people have reported this problem in almost three years, it must be something small/weird. I've talked to @1st1 who kinda implemented contextvars and he also said the observed behavior of contextvars simply shouldn't happen. |
Hey, Getting back to the issue after a little break.
Not at all, just regular Python 3.11 Lambdas
Omitted some portions of the code, but here's most of the configuration function
Most of the time, it's just the following two top-level variables:
Not in lambdas, no. If a Lambda is executed on a warm instance, we just bind contextvars again before the request is handled, overwriting previous values. |
Could you please try using clear_contextvars then? I’m totally fishing in the dark here, but lambda and its potential idiosyncrasies is the only unusual thing about your setup. :( |
I've modified our handler decorator to use the And yeah, it didn't help at all ☠️ |
Thank you for humoring me! If we can't figure this out, I guess I'll have to undeprecate the thread local module. :( |
Does it exist somewhere in the current codebase so that I can try it out? I've studied the codebase of our project yet another time and once again found nothing that could interfere with contextvars, especially in such a weird way 😏 |
Just use https://www.structlog.org/en/stable/thread-local.html and ignore the warnings (pls use only the contextvars equivalents). I don't think it's your code, I suspect it's the environment or some kind of instrumentation. Unfortunately we haven't heard back from @allonhadaya-maven so I don't know if theirs is special in any way. If it would happen under normal circumstances, the issue tracker would be overflowing. |
After a few weeks after switching to thread local vars I can confirm it "solves" the issue in our Lambdas. However, just a few days ago we started noticing the same errors in one of our Flask-based applications. It doesn't share much code with our Lambdas, although the internal logging library is the same. We're still on structlog 24.1.0, and this leads me to believe that the errors might be related to something we're doing in our code. I'm just out of ideas about what that something could be and it's not exactly easy to debug given that there are like 10 errors a week despite millions of executions 😣 |
I've talked to @1st1 and @elprans at PyCon and given your
|
We're also facing this error with Flask. It happens very infrequently. Let me know if I can provide any useful information. |
Anything I wrote above! It smells like a runtime / CPython issue, so anything you can share about your environment, please do share. Especially if you're running on Alpine (or some similar musl distro) and double-check your memory consumption. |
We were also facing this same error, and I eventually resorted to monkey-patching the |
Glad to hear we're not the only one. Using thread-local works fine, but the problem with that approach is that variables are, well, thread-local. This results in some data missing from multithreaded pieces of our code, but at least they don't crash, right? 😞 |
We did the same. |
I can only keep begging everyone who has this problem to give me more information. Is everyone on Datadog who has this problem? Are y’all running on musl or something like that? Are those apps memory-restrained? A KeyError while iterating over a contextvar is “impossible”. We’re running into some weird runtime bug here, but there’s no way for me to figure it out by thinking really hard about it. 😔 |
Hmm, memory-restrained... Could be a problem, but in my case, no we weren't using musl.
Yea. 😞 That was also what was throwing me off completely. It shouldn't happen. I'll see if I can add any more details soon. |
Yeah 😏
Nope
Yes — either Lambdas or an ECS Fargate task, but applications are nowhere close their memory limit when the issue occurs. |
True here as well! |
True also as in that it’s a non-standard runtime like AWS Lambda? |
As in there were no memory spikes. And no it was not on Lambda as well. Could it be possible with the Gunicorn threaded run that has something to do with this? 🤔 |
Would you provide all the relevant information about your environment in a single response?
|
Thank you so much for dropping by Inada-san!
💯 I wonder if we should convert this to a discussion.
We’ve had a response from DataDog in #591 (comment) and I don’t think it’s about their structlog integration. My suspicion is that it has something to do with contextvars in general – whether thru DD integration or not. As a reminder since this is hard to follow: I’ve asked users to call |
During a recent update of
structlog
from version20.2.0
to21.1.0
in a large Flask + Gunicorn web app, we encountered a new error coming fromcontextvars.merge_contextvars
:The same
KeyError
can be observed in our application for contextvars with different names beyondstructlog_http.x-real-ip
.Updating to the latest version of
structlog
,v23.3.0
does not seem to have resolved the issue.Further inspection suggests that #302 may have introduced a race condition to the
merge_contextvars
function when the implementation changed:from...
to...
It appears possible for
ctx[k]
to raise aKeyError
despite the fact thatk
is defined asfor k in ctx:
We have temporarily reverted the
structlog
update, but would love to help develop a fix for this issue upstream. Please let us know if/how we can help with further investigation.The text was updated successfully, but these errors were encountered: