Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add "wait morphing" to improve condvar wake performance #24

Closed
nyh opened this issue Jul 18, 2013 · 2 comments
Closed

Add "wait morphing" to improve condvar wake performance #24

nyh opened this issue Jul 18, 2013 · 2 comments
Assignees

Comments

@nyh
Copy link
Contributor

nyh commented Jul 18, 2013

The existing codvar_wake_one()/all() code wakes up one/all of the threads sleeping on the mutex, and they all wake up and try to take the user mutex associated with the condvar.

This creates two performance problems:

  1. The "thundering herd" problem - When there are many threads waiting on the condvar and they are all woken with condvar_wake_all(), all these threads start to run (requiring a lot of work on scheduling, IPIs, etc.) but all of them except one will immediately go back to sleep waiting to take the user mutex. See a description of this problem and how it was solved in Linux in https://lwn.net/Articles/32746/.
  2. The "locked wakeup" problem - Often, condvar_wake is done with the user mutex locked (see a good discussion in http://www.domaigne.com/blog/computing/condvars-signal-with-mutex-locked-or-not/). Most of the BSD code we use does this, for example. When condvar_wake() wakes the thread, if it wakes up quickly enough, it can try to take the mutex before the waker released it, and need to go back to sleep. This problem is especially noticable when both threads run on a single CPU, and the scheduler decides to context-switch to the wakee when it is woken.

Both problems can be solved by a technique called wait morphing: when the condvar has an associated mutex, condvar_wake_*() does not wake up the threads, but rather move them from waiting on the condition variable, to waiting on the mutex. Each of these threads will later be woken (separately) when the mutex becomes available for it, and when it wakes up it doesn't need to take the mutex, because it was already taken for it.

It is not clear that this issue actually has a lot of performance impact on any of our use cases (once the scheduler bugs are eliminated), but it seems there's popular demand for solving it.

@ghost ghost assigned nyh Jul 18, 2013
@nyh
Copy link
Contributor Author

nyh commented Jul 18, 2013

Here are some tentative implementation ideas for the convar/mutex wait morphing:

  1. Probably we should no longer support the old spin-based mutex implementation. Doing the following changes to both it and the newer lockfree::mutex will be a waste of time.
  2. Condvar and mutex each uses a "wait record" structure that contain a link to the next record, and a thread pointer. In mutex it is linked_itemsched::thread*, in condvar it is struct ccondvar_waiter. Both of them already have the same content, but we now need them to be the same type, so convert them both to use linked_itemsched::thread* (and create an alias for it, "struct waitrrecord", pointers to which we need to use in the C-language condvar.h).
  3. Add to the waitrecord type a method wake(), which does wr->value->wake_with([&] { wr->value = nullptr; }). This is the idiom we always use to wake up a waiter (in both mutex and condvar) and it will be nice to encapsulate it in a method.
  4. Add the assumption that when a condvar is waited on concurrently, all uses must wait with the same mutex or the results will be undefined. pthread_cond_wait(3) also makes this assumption. This way, it will be enough to remember the last mutex associated with the condvar, and we won't need to remember the mutex associate with each waitrecord. This assumption can be relatively easily lifted, but I think it's not worth the extra effort.
  5. Add a new mutex operation, mutex->lockfor(waitrecord *wr). It will be similar to lock(), but rather than taking the mutex for the current thread it will take it for a different thread which is already waiting on the given waitrecord. lockfor(wr) will attempt to take the lock immediately, and if successful will wake up the given waitrecord (wr->wake() explained above), but if can't get the lock it will add the given wr (not some local wr like in lock()) to the mutex's queue.
  6. Now, condvar->wake_one/wake_all, for each wr it pops from the condvar's wait queue, instead of waking up the waitrecord as we do now, will first check if the condvar has an associate mutex. If it does, it will call mutex->lockfor(wr) and not wake up wr - it will be woken either now or later when it gets the mutex. Note how in a condvar_wake_all() the first popped wr might have a chance to be woken up immediately, but the rest will all just be queued in the mutex's wait queue and not cause any context switches.
  7. When the condvar_wait() is woken up from its wait-record (the wait_until() in condvar_wait) it now knows that it already has the user mutex! So it doesn't need to get it again.
  8. The case of timeout is somewhat special - a timeout that occurs while still waiting on the condition variable will work just as it does now. But the mutex grabbing cannot time out - a condvar_timed_wait will wait forever for the mutex (just like any mutex->lock()) does. I think this is fine, but we need to consider this.

@nyh
Copy link
Contributor Author

nyh commented Jul 30, 2013

Done, in commit aa3a624 and adjacent commits.

@nyh nyh closed this as completed Jul 30, 2013
penberg pushed a commit that referenced this issue Aug 26, 2013
Fix mincore() to deal with unmapped addresses like msync() does.

This fixes a SIGSEGV in libunwind's access_mem() when leak detector is
enabled:

   (gdb) bt
  #0  page_fault (ef=0xffffc0003ffe7008) at ../../core/mmu.cc:871
  #1  <signal handler called>
  #2  ContiguousSpace::block_start_const (this=<optimized out>, p=0x77d2f3968)
      at /usr/src/debug/java-1.7.0-openjdk-1.7.0.25-2.3.12.3.fc19.x86_64/openjdk/hotspot/src/share/vm/oops/oop.inline.hpp:411
  #3  0x00001000008ae16c in GenerationBlockStartClosure::do_space (this=0x2000001f9100, s=<optimized out>)
      at /usr/src/debug/java-1.7.0-openjdk-1.7.0.25-2.3.12.3.fc19.x86_64/openjdk/hotspot/src/share/vm/memory/generation.cpp:242
  #4  0x00001000007f097c in DefNewGeneration::space_iterate (this=0xffffc0003fb68c00, blk=0x2000001f9100, usedOnly=<optimized out>)
      at /usr/src/debug/java-1.7.0-openjdk-1.7.0.25-2.3.12.3.fc19.x86_64/openjdk/hotspot/src/share/vm/memory/defNewGeneration.cpp:480
  #5  0x00001000008aca0e in Generation::block_start (this=<optimized out>, p=<optimized out>)
      at /usr/src/debug/java-1.7.0-openjdk-1.7.0.25-2.3.12.3.fc19.x86_64/openjdk/hotspot/src/share/vm/memory/generation.cpp:251
  #6  0x0000100000b06d2f in os::print_location (st=st@entry=0x2000001f9560, x=32165017960, verbose=verbose@entry=false)
      at /usr/src/debug/java-1.7.0-openjdk-1.7.0.25-2.3.12.3.fc19.x86_64/openjdk/hotspot/src/share/vm/runtime/os.cpp:868
  #7  0x0000100000b11b5b in os::print_register_info (st=0x2000001f9560, context=0x2000001f9740)
      at /usr/src/debug/java-1.7.0-openjdk-1.7.0.25-2.3.12.3.fc19.x86_64/openjdk/hotspot/src/os_cpu/linux_x86/vm/os_linux_x86.cpp:839
  #8  0x0000100000c6cde8 in VMError::report (this=0x2000001f9610, st=st@entry=0x2000001f9560)
      at /usr/src/debug/java-1.7.0-openjdk-1.7.0.25-2.3.12.3.fc19.x86_64/openjdk/hotspot/src/share/vm/utilities/vmError.cpp:551
  #9  0x0000100000c6da3b in VMError::report_and_die (this=this@entry=0x2000001f9610)
      at /usr/src/debug/java-1.7.0-openjdk-1.7.0.25-2.3.12.3.fc19.x86_64/openjdk/hotspot/src/share/vm/utilities/vmError.cpp:984
  #10 0x0000100000b1109f in JVM_handle_linux_signal (sig=11, info=0x2000001f9bb8, ucVoid=0x2000001f9740,
      abort_if_unrecognized=<optimized out>)
      at /usr/src/debug/java-1.7.0-openjdk-1.7.0.25-2.3.12.3.fc19.x86_64/openjdk/hotspot/src/os_cpu/linux_x86/vm/os_linux_x86.cpp:528
  #11 0x000000000039f242 in call_signal_handler (frame=0x2000001f9b10) at ../../arch/x64/signal.cc:69
  #12 <signal handler called>
  #13 0x000000000057d721 in access_mem ()
  #14 0x000000000057cb1d in dwarf_get ()
  #15 0x000000000057ce51 in _ULx86_64_step ()
  #16 0x00000000004315fd in backtrace (buffer=0x1ff9d80 <memory::alloc_tracker::remember(void*, int)::bt>, size=20)
      at ../../libc/misc/backtrace.cc:16
  #17 0x00000000003b8d99 in memory::alloc_tracker::remember (this=0x1777ae0 <memory::tracker>, addr=0xffffc0004508de00, size=54)
      at ../../core/alloctracker.cc:59
  #18 0x00000000003b0504 in memory::tracker_remember (addr=0xffffc0004508de00, size=54) at ../../core/mempool.cc:43
  #19 0x00000000003b2152 in std_malloc (size=54) at ../../core/mempool.cc:723
  #20 0x00000000003b259c in malloc (size=54) at ../../core/mempool.cc:856
  #21 0x0000100001615e4c in JNU_GetStringPlatformChars (env=env@entry=0xffffc0003a4dc1d8, jstr=jstr@entry=0xffffc0004591b800,
      isCopy=isCopy@entry=0x0) at ../../../src/share/native/common/jni_util.c:801
  #22 0x000010000161ada6 in Java_java_io_UnixFileSystem_getBooleanAttributes0 (env=0xffffc0003a4dc1d8, this=<optimized out>,
      file=<optimized out>) at ../../../src/solaris/native/java/io/UnixFileSystem_md.c:111
  #23 0x000020000021ed8e in ?? ()
  #24 0x00002000001faa58 in ?? ()
  #25 0x00002000001faac0 in ?? ()
  #26 0x00002000001faa50 in ?? ()
  #27 0x0000000000000000 in ?? ()

Spotted by Avi Kivity.
penberg pushed a commit that referenced this issue Jan 13, 2014
See

  scripts/trace.py prof-wait -h

The command is using sched_wait and sched_wait_ret tracepoints to
calculate the amount of time a thread was waiting. Samples are
collected and presented in a form of call graph tree.

By default callees are closer to the root. To inverse the order pass
-r|--caller-oriented.

If there is too much output, it can be narrowed down using
--max-levels and --min-duration options.

The presented time spectrum can be narrowed down using --since and --until
options which accept timestamps.

Example:

  scripts/trace.py prof-wait --max-levels 3 trace-file

=== Thread 0xffffc0003eaeb010 ===

12.43 s (100.00%, #7696) All
 |-- 12.43 s (99.99%, #7658) sched::thread::do_wait_until
 |    |-- 10.47 s (84.22%, #6417) condvar::wait(lockfree::mutex*, unsigned long)
 |    |    condvar_wait
 |    |    |-- 6.47 s (52.08%, #6250) cv_timedwait
 |    |    |    txg_delay
 |    |    |    dsl_pool_tempreserve_space
 |    |    |    dsl_dir_tempreserve_space
 |    |    |    dmu_tx_try_assign
 |    |    |    dmu_tx_assign
 |    |    |
 |    |    |-- 2.37 s (19.06%, #24) arc_read_nolock
 |    |    |    arc_read
 |    |    |    dsl_read
 |    |    |    traverse_visitbp
 |    |    |
 |    |    |-- 911.75 ms (7.33%, #3) txg_wait_open
 |    |    |    dmu_tx_wait
 |    |    |    zfs_write
 |    |    |    vfs_file::write(uio*, int)
 |    |    |    sys_write
 |    |    |    pwritev
 |    |    |    writev
 |    |    |    __stdio_write
 |    |    |    __fwritex
 |    |    |    fwrite
 |    |    |    0x100000005a5f
 |    |    |    osv::run(std::string, int, char**, int*)

By default every thread has a separate tree, because duration is best
interpreted in the context of particular thread. There is however an
option to merge samples from all threads into one tree:
-m|--merge-threads. It may be useful if you want to inspect all paths
going in/out to/from particular function. The direction can be changed
with -r|--caller-oriented option. Function names is passed to --function
parameter.

Example: check where zfs_write() blocks:

  scripts/trace.py prof-wait -rm --function=zfs_write trace-file

7.46 s (100.00%, #7314) All
 zfs_write
 |-- 6.48 s (86.85%, #6371) dmu_tx_assign
 |    |-- 6.47 s (86.75%, #6273) dmu_tx_try_assign
 |    |    dsl_dir_tempreserve_space
 |    |    |-- 6.47 s (86.75%, #6248) dsl_pool_tempreserve_space
 |    |    |    txg_delay
 |    |    |    cv_timedwait
 |    |    |    condvar_wait
 |    |    |    condvar::wait(lockfree::mutex*, unsigned long)
 |    |    |    sched::thread::do_wait_until
 |    |    |
 |    |    |-- 87.87 us (0.00%, #24) mutex_lock
 |    |    |    sched::thread::do_wait_until
 |    |    |
 |    |    \-- 6.40 us (0.00%, #1) dsl_dir_tempreserve_impl
 |    |         mutex_lock
 |    |         sched::thread::do_wait_until
 |    |
 |    \-- 7.32 ms (0.10%, #98) mutex_lock
 |         sched::thread::do_wait_until
 |
 |-- 911.75 ms (12.22%, #3) dmu_tx_wait
 |    txg_wait_open
 |    condvar_wait
 |    condvar::wait(lockfree::mutex*, unsigned long)
 |    sched::thread::do_wait_until

Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant