Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

crash: trying to use garbage/null pointer #831

Closed
junr03 opened this issue Apr 29, 2020 · 9 comments · Fixed by #835 or #1021
Closed

crash: trying to use garbage/null pointer #831

junr03 opened this issue Apr 29, 2020 · 9 comments · Fixed by #835 or #1021

Comments

@junr03
Copy link
Member

junr03 commented Apr 29, 2020

Stacktrace:

# (sometimes a garbage pointer)
EXC_BAD_ACCESS: Attempted to dereference null pointer. 

0  Lyft                    Envoy::Http::Dispatcher::post(std::__1::function<void ()>) (Lyft)
1  Lyft                    Envoy::Http::Dispatcher::startStream(long, envoy_http_callbacks) (Lyft)
2  Lyft                    start_stream (Lyft)
3  Lyft                    -[EnvoyHTTPStreamImpl initWithHandle:callbacks:] (Lyft)
4  Lyft                    -[EnvoyEngineImpl startStreamWithCallbacks:] (Lyft)

The only pointer accessed in Http::Dispatcher::post is the event_dispatcher_ pointer here:

https://github.com/lyft/envoy-mobile/blob/5c01e711e70c80c101087e61722862cbfbe961c9/library/common/http/dispatcher.cc#L282

The pointer is correctly initialized to nullptr, set by an initialization callback, and guarded by synchronization mechanisms. So I believe the crash is happening because the Event::Dispatcher the pointer points to is getting deallocated.

That Event::Dispatcher is uniquely owned by the Server::InstanceImpl. Which in turn (through some layers of inner classes ends up uniquely owned by main_common_ in the Envoy Mobile codebase. The only code path where that unique pointer is reset is when the event loop exits here:

https://github.com/lyft/envoy-mobile/blob/5c01e711e70c80c101087e61722862cbfbe961c9/library/common/engine.cc#L84

The only place where Envoy Mobile explicitly exits the event loop is in the Engine destructor:

https://github.com/lyft/envoy-mobile/blob/5c01e711e70c80c101087e61722862cbfbe961c9/library/common/engine.cc#L104

Currently envoy mobile has only a singleton engine that is statically instantiated. Therefore, Envoy Mobile should not be exiting the event loop unless the app is quitting, or has otherwise crashed causing static deallocation. However, from the stacktrace above it looks like:

a) The event loop has exited, causing main_common_ to be reset and the event_dispatcher_ pointer to become garbage
b) The exit is happening without the static Engine singleton being destroyed, as evidenced by the fact that the stack trace above is able to lock the Engine's weak pointer.

I have created a test case where I force event loop exit and subsequently use the engine, resulting in a repro where the event_dispatcher_ is garbage.

Open Question

My remaining unknown is: why is the event loop exiting without the Envoy Mobile code calling exit() on it. My only theory is that Envoy runs the event loop with Event::Dispatcher::RunType::Block

https://github.com/envoyproxy/envoy/blob/c2a63735145ef1ce056ca2a50927e3e6682177eb/source/server/server.cc#L592

Which is described as // Runs the event-loop until there are no pending events.All of the crashes show that the app had been sent to the background (and afaict the crash is only happening in iOS). So I am wondering somehow the event loop is exiting prematurely I would assume that pending timers for stats flushes, timeouts, should keep the event loop running given that there are pending events. I am not a libevent or iOS expert, but I am wondering if on mobile libevent behaves differently, and exits under some conditions which does not happen on the server.

One idea I have is to allow users of the Server::InstanceImpl to dictate how the event loop is run, and run it with Event::Dispatcher::RunType::Block

@Reflejo @mattklein123 I investigated this yesterday afternoon, I would appreciate if you could give it a read, and let me know what you think. @Reflejo specially around the theory of iOS causing the libevent loop to exit prematurely.

@mattklein123
Copy link
Member

I suspect this is the same/similar issue to the one that @goaway was investigating for a while around the SslContextManager assert. I couldn't figure out any path that would lead to that assert either other than some strange busted cleanup.

One thing that I do wonder is whether iOS will terminate threads without actually letting them run to completion, but potentially let other threads run. That would explain some of these strange cases. Given that we only see this on iOS and not other platforms I suspect there is something strange like this happening.

@mattklein123
Copy link
Member

pending timers for stats flushes, timeouts, should keep the event loop running given that there are pending events.

Yeah this should stop the event loop from exiting. Per above I suspect there is something deeper and more strange going on here but no immediate theories.

@junr03
Copy link
Member Author

junr03 commented Apr 29, 2020

One thing that I do wonder is whether iOS will terminate threads without actually letting them run to completion, but potentially let other threads run.

I was concerned this might be the case given how aggressive iOS tends to be with resource management. I'll do some more research.

@mattklein123
Copy link
Member

@junr03 I had another thought here: does iOS send signals when it shuts things down? I wonder if Envoy is exiting strangely due to receiving SIGTERM or something like that. You might want to disable signal handling in Envoy if that is the case.

@junr03
Copy link
Member Author

junr03 commented May 4, 2020

You might want to disable signal handling in Envoy if that is the case

Just got back to this. Thanks for the suggestion @mattklein123. Just to confirm you mean using --define=signal_trace=disabled? If so, we already do this per:

https://github.com/lyft/envoy-mobile/blob/master/.bazelrc#L7

@mattklein123
Copy link
Member

No, disable/delete this code: https://github.com/envoyproxy/envoy/blob/6151a69f9c0dc4aa7938d987036ec00eedb818d5/source/server/server.cc#L528-L547

(Assuming Darwin sends signals)

@junr03
Copy link
Member Author

junr03 commented May 4, 2020

gotcha. I couldn't see how the above code would affect this, as it just catches, adds tracing, and re-raises. Your link makes sense to me, thanks!

@junr03
Copy link
Member Author

junr03 commented May 5, 2020

I had a nice discussion with @mattklein123 about this issue this morning. The more we talked about it the more inclined we were to think that the thread created by std::thread where the event loop is run has to be exiting due to catching a termination signal rather than the OS abruptly terminating the thread. The reasoning is that the crash above happens because we hit a garbage pointer that was owned by the thread in question. For that pointer to be garbage, the pointer must have been freed. In order for that pointer to be freed, the thread must have executed the instructions here:

https://github.com/lyft/envoy-mobile/blob/5c01e711e70c80c101087e61722862cbfbe961c9/library/common/engine.cc#L84

And for that to be the case, the event loop must have exited. My original commentary was positing that for some reason libevent was exiting outside of the explicit exit() call in the Envoy::Engine's destructor, and catching a termination signal as @mattklein123 suggested is a likely way for it to happen. If the thread had been "forcibly shutdown" then the memory would have leaked (and while undefined behavior could happen, it would be unlikely for us to see a crash due to a garbage pointer.

junr03 added a commit that referenced this issue May 5, 2020
…nals (#835)

Description: Disabling signal handling in the Server::Options makes it so that the server's event dispatcher does not listen for termination signals such as SIGTERM, SIGINT, etc. Previous crashes in iOS were experienced due to out-of-band event loop exit as described in #831. Ignoring termination signals makes it more likely that the event loop will only exit due to Engine destruction.

This PR introduces Envoy::MobileMainCommon, as this is the canonical way to customize how main runs, and options setup per envoyproxy/envoy#3424. The new Envoy::MobileMainCommon also does away with other logic in Envoy::MainCommon that does not apply to Envoy Mobile.

Risk Level: med - low-level change in termination handling
Testing: unit test to assert option is correctly set. End to end test with iOS and Android devices to ensure clean exit when the app using envoy mobile exits (and thus destructs the engine). Moreover, there is no event loop exit any longer when the simulator app receives a SIGTERM, i.e., the event dispatcher is no longer listening to SIGTERM and exiting due to them.

Potentially fixes #831. Will need to verify with wider client release.

Signed-off-by: Jose Nino jnino@lyft.com
@junr03
Copy link
Member Author

junr03 commented Jun 2, 2020

Unfortunately, still occurring.

@junr03 junr03 reopened this Jun 2, 2020
junr03 added a commit that referenced this issue Sep 4, 2020
…#1021)

Description: this PR attempts to fix two related crashes. By hooking up to iOS's notification system we attempt to terminate the engine earlier, and allow all objects necessary to be available fixing #1035. Potentially fixes #831.
Risk Level: med - changes termination sequence
Testing: added test

Signed-off-by: Jose Nino <jnino@lyft.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment