Description
Spin-waiting can carry a significant power and performance cost if it is not done properly. The PAUSE instruction changed on Skylake relative to previous generations such as Haswell and Broadwell. From the "Intel® 64 and IA-32 Architectures Optimization Reference Manual", section 8.4.7: "The latency of PAUSE instruction in prior generation microarchitecture is about 10 cycles, whereas on Skylake microarchitecture it has been extended to as many as 140 cycles." "As the PAUSE latency has been increased significantly, workloads that are sensitive to PAUSE latency will suffer some performance loss." We are seeing this with spinning on Skylake Server running TechEmpower Plaintext, where an excessive amount of time is spent in spin code. It is beneficial to tune the spin count or spin cycles, and we propose a fix below.
For example, the original spin code from j_join::join is shown below.
    int spin_count = 4096 * g_num_processors;
    for (int j = 0; j < spin_count; j++)
    {
        if (color != join_struct.lock_color)
        {
            break;
        }
        YieldProcessor(); // indicate to the processor that we are spinning
    }
Assume YieldProcessor() took 10 cycles when the above code was first written and tuned. We could then bound the spin by elapsed cycles instead of by iteration count, like this:
    int spin_count = 4096 * g_num_processors;
    // Assume YieldProcessor() took 10 cycles
    // when the above code was first written and tuned
    ptrdiff_t endTime = get_cycle_count() + spin_count * 10;
    while (get_cycle_count() < endTime)
    {
        if (color != join_struct.lock_color)
        {
            break;
        }
        YieldProcessor(); // indicate to the processor that we are spinning
    }
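For anyone who wants to experiment with the idea outside the runtime, here is a minimal standalone sketch of a budget-bounded spin. It is not the GC code: `spin_until_changed`, `cpu_relax`, and `now_ticks` are hypothetical names, `std::chrono::steady_clock` (nanoseconds) stands in for `get_cycle_count()`, and `_mm_pause()` stands in for `YieldProcessor()` on x86-64 only. The budget is therefore expressed in nanoseconds here rather than in cycles.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>

#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>
// Stand-in for YieldProcessor(): emits PAUSE on x86-64.
static inline void cpu_relax() { _mm_pause(); }
#else
// Assumption: no spin hint on other targets in this sketch.
static inline void cpu_relax() {}
#endif

// Hypothetical stand-in for get_cycle_count(): monotonic time in nanoseconds.
static inline int64_t now_ticks()
{
    using namespace std::chrono;
    return duration_cast<nanoseconds>(
        steady_clock::now().time_since_epoch()).count();
}

// Spins until `flag` no longer equals `observed`, or until `budget_ticks`
// have elapsed. Returns true if the value changed, false on timeout.
// The cost of one cpu_relax() call no longer determines the total spin time.
static bool spin_until_changed(const std::atomic<int>& flag,
                               int observed,
                               int64_t budget_ticks)
{
    assert(budget_ticks >= 0);
    const int64_t end_time = now_ticks() + budget_ticks;
    do
    {
        if (flag.load(std::memory_order_acquire) != observed)
        {
            return true;
        }
        cpu_relax(); // indicate to the processor that we are spinning
    } while (now_ticks() < end_time);

    // One last check so a change right at the deadline is not missed.
    return flag.load(std::memory_order_acquire) != observed;
}
```

With this shape, a caller would pass a budget derived from the originally tuned behaviour (for example, the old spin count multiplied by the assumed per-iteration cost) and fall back to a blocking wait when the function returns false. Note that reading the clock on every iteration has its own cost, which would need to be measured.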