crash: trying to use garbage/null pointer #831

junr03 · 2020-04-29T20:24:53Z

Stacktrace:

# (sometimes a garbage pointer)
EXC_BAD_ACCESS: Attempted to dereference null pointer. 

0  Lyft                    Envoy::Http::Dispatcher::post(std::__1::function<void ()>) (Lyft)
1  Lyft                    Envoy::Http::Dispatcher::startStream(long, envoy_http_callbacks) (Lyft)
2  Lyft                    start_stream (Lyft)
3  Lyft                    -[EnvoyHTTPStreamImpl initWithHandle:callbacks:] (Lyft)
4  Lyft                    -[EnvoyEngineImpl startStreamWithCallbacks:] (Lyft)

The only pointer accessed in Http::Dispatcher::post is the event_dispatcher_ pointer here:

https://github.com/lyft/envoy-mobile/blob/5c01e711e70c80c101087e61722862cbfbe961c9/library/common/http/dispatcher.cc#L282

The pointer is correctly initialized to nullptr, set by an initialization callback, and guarded by synchronization mechanisms. So I believe the crash is happening because the Event::Dispatcher the pointer points to is getting deallocated.

That Event::Dispatcher is uniquely owned by the Server::InstanceImpl. Which in turn (through some layers of inner classes ends up uniquely owned by main_common_ in the Envoy Mobile codebase. The only code path where that unique pointer is reset is when the event loop exits here:

https://github.com/lyft/envoy-mobile/blob/5c01e711e70c80c101087e61722862cbfbe961c9/library/common/engine.cc#L84

The only place where Envoy Mobile explicitly exits the event loop is in the Engine destructor:

https://github.com/lyft/envoy-mobile/blob/5c01e711e70c80c101087e61722862cbfbe961c9/library/common/engine.cc#L104

Currently envoy mobile has only a singleton engine that is statically instantiated. Therefore, Envoy Mobile should not be exiting the event loop unless the app is quitting, or has otherwise crashed causing static deallocation. However, from the stacktrace above it looks like:

a) The event loop has exited, causing main_common_ to be reset and the event_dispatcher_ pointer to become garbage
b) The exit is happening without the static Engine singleton being destroyed, as evidenced by the fact that the stack trace above is able to lock the Engine's weak pointer.

I have created a test case where I force event loop exit and subsequently use the engine, resulting in a repro where the event_dispatcher_ is garbage.

Open Question

My remaining unknown is: why is the event loop exiting without the Envoy Mobile code calling exit() on it. My only theory is that Envoy runs the event loop with Event::Dispatcher::RunType::Block

https://github.com/envoyproxy/envoy/blob/c2a63735145ef1ce056ca2a50927e3e6682177eb/source/server/server.cc#L592

Which is described as // Runs the event-loop until there are no pending events.All of the crashes show that the app had been sent to the background (and afaict the crash is only happening in iOS). So I am wondering somehow the event loop is exiting prematurely I would assume that pending timers for stats flushes, timeouts, should keep the event loop running given that there are pending events. I am not a libevent or iOS expert, but I am wondering if on mobile libevent behaves differently, and exits under some conditions which does not happen on the server.

One idea I have is to allow users of the Server::InstanceImpl to dictate how the event loop is run, and run it with Event::Dispatcher::RunType::Block

@Reflejo @mattklein123 I investigated this yesterday afternoon, I would appreciate if you could give it a read, and let me know what you think. @Reflejo specially around the theory of iOS causing the libevent loop to exit prematurely.

The text was updated successfully, but these errors were encountered:

mattklein123 · 2020-04-29T22:14:01Z

I suspect this is the same/similar issue to the one that @goaway was investigating for a while around the SslContextManager assert. I couldn't figure out any path that would lead to that assert either other than some strange busted cleanup.

One thing that I do wonder is whether iOS will terminate threads without actually letting them run to completion, but potentially let other threads run. That would explain some of these strange cases. Given that we only see this on iOS and not other platforms I suspect there is something strange like this happening.

mattklein123 · 2020-04-29T22:18:29Z

pending timers for stats flushes, timeouts, should keep the event loop running given that there are pending events.

Yeah this should stop the event loop from exiting. Per above I suspect there is something deeper and more strange going on here but no immediate theories.

junr03 · 2020-04-29T22:52:02Z

One thing that I do wonder is whether iOS will terminate threads without actually letting them run to completion, but potentially let other threads run.

I was concerned this might be the case given how aggressive iOS tends to be with resource management. I'll do some more research.

mattklein123 · 2020-04-30T18:20:14Z

@junr03 I had another thought here: does iOS send signals when it shuts things down? I wonder if Envoy is exiting strangely due to receiving SIGTERM or something like that. You might want to disable signal handling in Envoy if that is the case.

junr03 · 2020-05-04T20:53:12Z

You might want to disable signal handling in Envoy if that is the case

Just got back to this. Thanks for the suggestion @mattklein123. Just to confirm you mean using --define=signal_trace=disabled? If so, we already do this per:

https://github.com/lyft/envoy-mobile/blob/master/.bazelrc#L7

mattklein123 · 2020-05-04T21:00:57Z

No, disable/delete this code: https://github.com/envoyproxy/envoy/blob/6151a69f9c0dc4aa7938d987036ec00eedb818d5/source/server/server.cc#L528-L547

(Assuming Darwin sends signals)

junr03 · 2020-05-04T21:14:19Z

gotcha. I couldn't see how the above code would affect this, as it just catches, adds tracing, and re-raises. Your link makes sense to me, thanks!

junr03 · 2020-05-05T19:15:19Z

I had a nice discussion with @mattklein123 about this issue this morning. The more we talked about it the more inclined we were to think that the thread created by std::thread where the event loop is run has to be exiting due to catching a termination signal rather than the OS abruptly terminating the thread. The reasoning is that the crash above happens because we hit a garbage pointer that was owned by the thread in question. For that pointer to be garbage, the pointer must have been freed. In order for that pointer to be freed, the thread must have executed the instructions here:

https://github.com/lyft/envoy-mobile/blob/5c01e711e70c80c101087e61722862cbfbe961c9/library/common/engine.cc#L84

And for that to be the case, the event loop must have exited. My original commentary was positing that for some reason libevent was exiting outside of the explicit exit() call in the Envoy::Engine's destructor, and catching a termination signal as @mattklein123 suggested is a likely way for it to happen. If the thread had been "forcibly shutdown" then the memory would have leaked (and while undefined behavior could happen, it would be unlikely for us to see a crash due to a garbage pointer.

…nals (#835) Description: Disabling signal handling in the Server::Options makes it so that the server's event dispatcher does not listen for termination signals such as SIGTERM, SIGINT, etc. Previous crashes in iOS were experienced due to out-of-band event loop exit as described in #831. Ignoring termination signals makes it more likely that the event loop will only exit due to Engine destruction. This PR introduces Envoy::MobileMainCommon, as this is the canonical way to customize how main runs, and options setup per envoyproxy/envoy#3424. The new Envoy::MobileMainCommon also does away with other logic in Envoy::MainCommon that does not apply to Envoy Mobile. Risk Level: med - low-level change in termination handling Testing: unit test to assert option is correctly set. End to end test with iOS and Android devices to ensure clean exit when the app using envoy mobile exits (and thus destructs the engine). Moreover, there is no event loop exit any longer when the simulator app receives a SIGTERM, i.e., the event dispatcher is no longer listening to SIGTERM and exiting due to them. Potentially fixes #831. Will need to verify with wider client release. Signed-off-by: Jose Nino jnino@lyft.com

junr03 · 2020-06-02T19:36:13Z

Unfortunately, still occurring.

…#1021) Description: this PR attempts to fix two related crashes. By hooking up to iOS's notification system we attempt to terminate the engine earlier, and allow all objects necessary to be available fixing #1035. Potentially fixes #831. Risk Level: med - changes termination sequence Testing: added test Signed-off-by: Jose Nino <jnino@lyft.com>

junr03 added no stalebot core crash labels Apr 29, 2020

junr03 self-assigned this Apr 29, 2020

junr03 mentioned this issue May 4, 2020

main: use custom main common to disable listening for termination signals #835

Merged

junr03 closed this as completed in #835 May 5, 2020

junr03 reopened this Jun 2, 2020

rebello95 mentioned this issue Aug 6, 2020

ios: tear down earlier based on application termination #1007

Open

junr03 mentioned this issue Aug 14, 2020

shutdown: create avenues for shutdown out of static de-initialization #1021

Merged

junr03 closed this as completed in #1021 Sep 4, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

crash: trying to use garbage/null pointer #831

crash: trying to use garbage/null pointer #831

junr03 commented Apr 29, 2020 •

edited

Loading

mattklein123 commented Apr 29, 2020

mattklein123 commented Apr 29, 2020

junr03 commented Apr 29, 2020

mattklein123 commented Apr 30, 2020

junr03 commented May 4, 2020

mattklein123 commented May 4, 2020

junr03 commented May 4, 2020

junr03 commented May 5, 2020

junr03 commented Jun 2, 2020

crash: trying to use garbage/null pointer #831

crash: trying to use garbage/null pointer #831

Comments

junr03 commented Apr 29, 2020 • edited Loading

Open Question

mattklein123 commented Apr 29, 2020

mattklein123 commented Apr 29, 2020

junr03 commented Apr 29, 2020

mattklein123 commented Apr 30, 2020

junr03 commented May 4, 2020

mattklein123 commented May 4, 2020

junr03 commented May 4, 2020

junr03 commented May 5, 2020

junr03 commented Jun 2, 2020

junr03 commented Apr 29, 2020 •

edited

Loading