Skip to content

Linux: avoid zombie process when dispatch_main is called #99

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jul 6, 2016

Conversation

dgrove-oss
Copy link
Contributor

If the main thread calls pthread_exit on Linux, the process
becomes a zombie. Modify dispatch_main to avoid this by manually
performing thread cleanup and pausing the thread instead of calling
pthread_exit. This is an updated version of an old patch for this
issue (https://lists.macosforge.org/pipermail/libdispatch-dev/2011-May/000522.html).

@MadCoder
Copy link
Contributor

MadCoder commented Jul 6, 2016

I'm not fond of that patch (and Linux still hasn't fixed that bug, really?)

  1. other TSD keys aren't cleaned up which can be a significant memory issue
  2. we already have a wasted thread in a sigsuspend() so that signals are delivered when dispatch_main() is called, maybe it's what this thread should do instead (call _dispatch_sig_thread() directly)

I looked at the glibc implementation and we do have a way out, which is that in dispatch_main() to write code that does this (be prepared to be ultimately grossed out):

void
dispatch_main(void)
{
#if HAVE_PTHREAD_MAIN_NP
    if (pthread_main_np()) {
#endif
        _dispatch_object_debug(&_dispatch_main_q, "%s", __func__);
        _dispatch_program_is_probably_callback_driven = true;
        _dispatch_ktrace0(ARIADNE_ENTER_DISPATCH_MAIN_CODE);
#ifdef __linux__
        pthread_key_t dispatch_main_key;
        pthread_key_create(&dispatch_main_key, _dispatch_sig_thread);
        pthread_setspecific(dispatch_main_key, &dispatch_main_key); // anything non NULL really
#endif
        pthread_exit(NULL);
        DISPATCH_INTERNAL_CRASH(errno, "pthread_exit() returned");
#if HAVE_PTHREAD_MAIN_NP
    }
    DISPATCH_CLIENT_CRASH(0, "dispatch_main() must be called on the main thread");
#endif
}

also in _dispatch_queue_cleanup2(), disable this code for linux:

#ifndef __linux__
    if (_dispatch_program_is_probably_callback_driven) {
        _dispatch_barrier_async_detached_f(_dispatch_get_root_queue(
                _DISPATCH_QOS_CLASS_DEFAULT, true), NULL, _dispatch_sig_thread);
        sleep(1); // workaround 6778970
    }
#endif

since we now have main() becoming the sigsuspend thread.

my code works because glibc calls the TSD destructors in order of creation, so since we create that one last, it will be the last destructor called. It is highly non portable, but we're trying to work around a linux bug to boot. That way all TSD and thread_atexit() (C++ thread_locals) will be called properly.

Again it is a HUGE kludge, but I think it's more robust (to any code) than your patch, albeit brittle to any glibc change in how TSD and __cxa_thread_atexit() work.

@MadCoder
Copy link
Contributor

MadCoder commented Jul 6, 2016

it may also be worth having a unit test on linux that does check that TSD destructors indeed are called in pthread_key_create() call order, because the day that changes, this hack no longer works. But that is the only way I can think of looking at pthread_create

@MadCoder
Copy link
Contributor

MadCoder commented Jul 6, 2016

FWIW a very simple sample program shows that the backtraces are exactly what you'd expect with that trick, it seems to work fine:

(lldb) thread list
Process 3657 stopped
* thread #1: tid = 3657, 0x00007fb3e8a823e2 libc.so.6`sigsuspend + 82, name = 'a', stop reason = trace
  thread #2: tid = 3658, 0x00007fb3e8e0724d libpthread.so.0, name = 'a'
(lldb) bt all
* thread #1: tid = 3657, 0x00007fb3e8a823e2 libc.so.6`sigsuspend + 82, name = 'a', stop reason = trace
  * frame #0: 0x00007fb3e8a823e2 libc.so.6`sigsuspend + 82
    frame #1: 0x00007fb3e8dffe62 libpthread.so.0`__nptl_deallocate_tsd + 146
    frame #2: 0x00007fb3e8a6eb69 libc.so.6`__libc_start_main + 281
    frame #3: 0x00000000004007c9 a

  thread #2: tid = 3658, 0x00007fb3e8e0724d libpthread.so.0, name = 'a'
    frame #0: 0x00007fb3e8e0724d libpthread.so.0
    frame #1: 0x00007fb3e8e000a4 libpthread.so.0`start_thread + 196
    frame #2: 0x00007fb3e8b3587d libc.so.6`clone + 109

@dgrove-oss dgrove-oss force-pushed the dispatch-main-zombie branch from 3208542 to aa824ab Compare July 6, 2016 17:22
@dgrove-oss
Copy link
Contributor Author

I wasn't fond of my patch either; yours is better. Still potentially fragile as you noted, but in a way that is much less likely to break in practice. Thanks!

I think at this point Linux considers it to be a feature, not a bug ;) At least, it still happens on my ubuntu 16.04 virtual machine -- any program that calls dispatch_main becomes a zombie until it exits.

Pull request updated.

@dgrove-oss
Copy link
Contributor Author

sorry for the spam; spoke too soon. I forgot the second piece of the patch. Fixing...

@dgrove-oss dgrove-oss force-pushed the dispatch-main-zombie branch from aa824ab to febea9e Compare July 6, 2016 17:27
#ifdef __linux__
pthread_key_t dispatch_main_key;
pthread_key_create(&dispatch_main_key, _dispatch_sig_thread);
pthread_setspecific(dispatch_main_key, &dispatch_main_key); // anything non NULL really
Copy link
Contributor

@MadCoder MadCoder Jul 6, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the // anything non null really is a bad comment.
you should explain above that the point of creating that TSD key before pthread_exit() is because glibc will call the destructor of that TSD key last, which gets the thread parked in sigsuspend hopefully after all other TSD destructors have run (there are still holes as a TSD constructor could re-use a TSD and glibc has to loop over TSD keys to find out about this, and I'm pretty sure my hack prevents this, but that's as good it can be).

I'll let you phrase that comment as a native speaker ;) but given how non trivial and horrid that hack is, it needs documentation. thanks

If the main thread calls pthread_exit on Linux, the process
becomes a zombie. Modify dispatch_main to avoid this by adding
a pthread_key destructor that calls dispatch_sig_thread
(effectively stalling pthread exit until the program really exits).

This patch relies on the TSD destructors being called in order of
creation, which is currently the case in glibc.
@dgrove-oss dgrove-oss force-pushed the dispatch-main-zombie branch from febea9e to cf9fe36 Compare July 6, 2016 17:46
@dgrove-oss
Copy link
Contributor Author

updated with comments.

@MadCoder
Copy link
Contributor

MadCoder commented Jul 6, 2016

excellent 👍

@MadCoder MadCoder merged commit 2dbf83c into swiftlang:master Jul 6, 2016
@dgrove-oss dgrove-oss deleted the dispatch-main-zombie branch July 6, 2016 18:14
das pushed a commit that referenced this pull request Aug 11, 2016
Linux: avoid zombie process when dispatch_main is called

Signed-off-by: Daniel A. Steffen <dsteffen@apple.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants