-
Notifications
You must be signed in to change notification settings - Fork 84
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
crash: trying to use garbage/null pointer #831
Comments
I suspect this is the same/similar issue to the one that @goaway was investigating for a while around the SslContextManager assert. I couldn't figure out any path that would lead to that assert either other than some strange busted cleanup. One thing that I do wonder is whether iOS will terminate threads without actually letting them run to completion, but potentially let other threads run. That would explain some of these strange cases. Given that we only see this on iOS and not other platforms I suspect there is something strange like this happening. |
Yeah this should stop the event loop from exiting. Per above I suspect there is something deeper and more strange going on here but no immediate theories. |
I was concerned this might be the case given how aggressive iOS tends to be with resource management. I'll do some more research. |
@junr03 I had another thought here: does iOS send signals when it shuts things down? I wonder if Envoy is exiting strangely due to receiving SIGTERM or something like that. You might want to disable signal handling in Envoy if that is the case. |
Just got back to this. Thanks for the suggestion @mattklein123. Just to confirm you mean using https://github.com/lyft/envoy-mobile/blob/master/.bazelrc#L7 |
No, disable/delete this code: https://github.com/envoyproxy/envoy/blob/6151a69f9c0dc4aa7938d987036ec00eedb818d5/source/server/server.cc#L528-L547 (Assuming Darwin sends signals) |
gotcha. I couldn't see how the above code would affect this, as it just catches, adds tracing, and re-raises. Your link makes sense to me, thanks! |
I had a nice discussion with @mattklein123 about this issue this morning. The more we talked about it the more inclined we were to think that the thread created by And for that to be the case, the event loop must have exited. My original commentary was positing that for some reason libevent was exiting outside of the explicit |
…nals (#835) Description: Disabling signal handling in the Server::Options makes it so that the server's event dispatcher does not listen for termination signals such as SIGTERM, SIGINT, etc. Previous crashes in iOS were experienced due to out-of-band event loop exit as described in #831. Ignoring termination signals makes it more likely that the event loop will only exit due to Engine destruction. This PR introduces Envoy::MobileMainCommon, as this is the canonical way to customize how main runs, and options setup per envoyproxy/envoy#3424. The new Envoy::MobileMainCommon also does away with other logic in Envoy::MainCommon that does not apply to Envoy Mobile. Risk Level: med - low-level change in termination handling Testing: unit test to assert option is correctly set. End to end test with iOS and Android devices to ensure clean exit when the app using envoy mobile exits (and thus destructs the engine). Moreover, there is no event loop exit any longer when the simulator app receives a SIGTERM, i.e., the event dispatcher is no longer listening to SIGTERM and exiting due to them. Potentially fixes #831. Will need to verify with wider client release. Signed-off-by: Jose Nino jnino@lyft.com
Unfortunately, still occurring. |
…#1021) Description: this PR attempts to fix two related crashes. By hooking up to iOS's notification system we attempt to terminate the engine earlier, and allow all objects necessary to be available fixing #1035. Potentially fixes #831. Risk Level: med - changes termination sequence Testing: added test Signed-off-by: Jose Nino <jnino@lyft.com>
Stacktrace:
The only pointer accessed in
Http::Dispatcher::post
is theevent_dispatcher_
pointer here:https://github.com/lyft/envoy-mobile/blob/5c01e711e70c80c101087e61722862cbfbe961c9/library/common/http/dispatcher.cc#L282
The pointer is correctly initialized to nullptr, set by an initialization callback, and guarded by synchronization mechanisms. So I believe the crash is happening because the
Event::Dispatcher
the pointer points to is getting deallocated.That
Event::Dispatcher
is uniquely owned by theServer::InstanceImpl
. Which in turn (through some layers of inner classes ends up uniquely owned bymain_common_
in the Envoy Mobile codebase. The only code path where that unique pointer is reset is when the event loop exits here:https://github.com/lyft/envoy-mobile/blob/5c01e711e70c80c101087e61722862cbfbe961c9/library/common/engine.cc#L84
The only place where Envoy Mobile explicitly exits the event loop is in the Engine destructor:
https://github.com/lyft/envoy-mobile/blob/5c01e711e70c80c101087e61722862cbfbe961c9/library/common/engine.cc#L104
Currently envoy mobile has only a singleton engine that is statically instantiated. Therefore, Envoy Mobile should not be exiting the event loop unless the app is quitting, or has otherwise crashed causing static deallocation. However, from the stacktrace above it looks like:
a) The event loop has exited, causing
main_common_
to be reset and theevent_dispatcher_
pointer to become garbageb) The exit is happening without the static Engine singleton being destroyed, as evidenced by the fact that the stack trace above is able to lock the Engine's weak pointer.
I have created a test case where I force event loop exit and subsequently use the engine, resulting in a repro where the
event_dispatcher_
is garbage.Open Question
My remaining unknown is: why is the event loop exiting without the Envoy Mobile code calling
exit()
on it. My only theory is that Envoy runs the event loop withEvent::Dispatcher::RunType::Block
https://github.com/envoyproxy/envoy/blob/c2a63735145ef1ce056ca2a50927e3e6682177eb/source/server/server.cc#L592
Which is described as
// Runs the event-loop until there are no pending events.
All of the crashes show that the app had been sent to the background (and afaict the crash is only happening in iOS). So I am wondering somehow the event loop is exiting prematurely I would assume that pending timers for stats flushes, timeouts, should keep the event loop running given that there are pending events. I am not a libevent or iOS expert, but I am wondering if on mobile libevent behaves differently, and exits under some conditions which does not happen on the server.One idea I have is to allow users of the
Server::InstanceImpl
to dictate how the event loop is run, and run it withEvent::Dispatcher::RunType::Block
@Reflejo @mattklein123 I investigated this yesterday afternoon, I would appreciate if you could give it a read, and let me know what you think. @Reflejo specially around the theory of iOS causing the libevent loop to exit prematurely.
The text was updated successfully, but these errors were encountered: