__attribute__((interrupt)) Handlers Dangerously Out of Spec on x86-64 #41189
Description
Bugzilla Link | 41844 |
Version | trunk |
OS | All |
Attachments | YMM2.PNG, YMM2_XSAVE.PNG, YMM2.PNG |
Reporter | LLVM Bugzilla Contributor |
CC | @topperc,@francisvm,@RKSimon,@rotateright |
Extended Description
Sorry, it's a long one. Although this issue actually gives rise to Bug 26413.
When compiling an x86-64 interrupt handler using attribute((interrupt)), Clang/LLVM generates some potentially very dangerous assembly that could cause random failures or incorrect results when using x86-64 instructions like SSE/AVX.
In the x86-64 ISA, a special instruction called XSAVE is required to save the state of AVX/SSE registers, and SSE has a similar one called FXSAVE. For restoring the state, there is XRSTOR for AVX/SSE and FXRSTOR for SSE. This is the correct way to save and restore ISA extension register state--not moving the AVX/SSE registers onto the stack, which is what Clang/LLVM currently does (and has done since 3.9.0, according to assembly readouts on gcc.godbolt.org).
You can see the behavior by compiling the following code on gcc.godbolt.org, using any version of Clang since 3.9.0 with any of the -mavx, -mavx2, -msse, etc. flags:
// start of code
#include <stdint.h>
static void some_function(uint64_t * pointer)
{
pointer++;
}
attribute((interrupt)) void handler(uint64_t * some_frame)
{
some_function(some_frame);
}
// end of code
Here's where the major problem lies (actually there are a few):
- Significantly, it looks like certain CPUs trash YMM2/XMM2 when moved with an AVX instruction during an interrupt. You can see this in YMM2.PNG (attached, see Note 1 for corrective info about the data shown in the image), which showcases a divide-by-zero interrupt that uses vmovdqu to move YMM registers onto the stack.
This is the C code I used to trigger the interrupt: note that there is nothing inbetween the asm statements and the forced division error. The divide-by-zero interrupt handler just pushes the general registers on top of the interrupt frame, subtracts the stack pointer to account for the size of the AVX registers, and uses vmovdqu to move the AVX registers into the stack memory area:
// Start of code to trigger interrupt in YMM2.PNG
__m256i_u whaty = _mm256_set1_epi32(0x17);
asm volatile("vmovdqu %[what], %%ymm1" : : [what] "m" (whaty) :);
asm volatile("vmovdqu %[what], %%ymm2" : : [what] "m" (whaty) :); // Odd behavior with YMM2. The rest are fine.
asm volatile("vmovdqu %[what], %%ymm3" : : [what] "m" (whaty) :);
asm volatile("vmovdqu %[what], %%ymm4" : : [what] "m" (whaty) :);
asm volatile("vmovdqu %[what], %%ymm5" : : [what] "m" (whaty) :);
asm volatile("vmovdqu %[what], %%ymm6" : : [what] "m" (whaty) :);
asm volatile("vmovdqu %[what], %%ymm7" : : [what] "m" (whaty) :);
volatile uint64_t c = cs / (cs >> 10); // cs is just a value that will guarantee a divide by zero error
// End of code to trigger interrupt in YMM2.PNG
YMM2_XSAVE.PNG (also attached) shows correct behavior, and it is the same readout using XSAVE to store the AVX registers to a memory area not on the stack. The data is different because I was testing something else, so the C code I used to trigger the interrupt changed to this (basically it's just the values being stored that have changed):
// Start of code to trigger interrupt in YMM2_XSAVE.PNG
__m256i_u whaty = _mm256_set1_epi32(0x17181920);
__m256i_u what2 = _mm256_set1_epi64x(0x1718192011223344);
__m256i_u what3 = _mm256_set1_epi32(0x18);
__m256i_u what9 = _mm256_set1_epi32(0x180019);
asm volatile("vmovdqu %[what], %%ymm1" : : [what] "m" (whaty) :);
asm volatile("vmovdqu %[what], %%ymm2" : : [what] "m" (what2) :);
asm volatile("vmovdqu %[what], %%ymm3" : : [what] "m" (what3) :);
asm volatile("vmovdqu %[what], %%ymm4" : : [what] "m" (whaty) :);
asm volatile("vmovdqu %[what], %%ymm5" : : [what] "m" (whaty) :);
asm volatile("vmovdqu %[what], %%ymm6" : : [what] "m" (whaty) :);
asm volatile("vmovdqu %[what], %%ymm7" : : [what] "m" (whaty) :);
asm volatile("vmovdqu %[what], %%ymm15" : : [what] "m" (what9) :);
volatile __m256i output = _mm256_bsrli_epi128(what2, 1); // To verify the quadword order is correct
volatile uint64_t c = cs / (cs >> 10); // cs is just a value that will guarantee a divide by zero error
// End of code to trigger interrupt in YMM2_XSAVE.PNG
As you can see by the differences, the CPU (an i7-7700HQ) is doing something weird in YMM2 in the first case. I have checked through the code involved and saw nothing in the program's output assembly that modified YMM2 in a way that would cause this behavior--in fact YMM2 doesn't get touched at any point between the move-to-stack and the print. Even if it turns out to be a deep semantic bug in the program (not likely, see Note 2), the stack is not a safe place for the AVX registers to reside, particularly when XSAVE exists for this very purpose. It should be evident here that restoring the registers from such a malformed state could have catastrophic impacts in programs that are using AVX.
-
Clang/LLVM outputs "movaps/vmovaps" to move the registers onto the stack. This is problematic because there is no mechanism to differentiate between interrupts and exceptions, which are offset by 8 bytes. The alignment trick with "and $-32, %rsp" causes dead space on the stack. This breaks any attempt to read the stack registers using a struct, since the size can't be guaranteed. This is essentially the crux of Bug 26413.
-
This could be avoided if Clang/LLVM honored the -mgeneral-regs-only flag, but it doesn't. At least, not when I add it to the compile line (I'm using gcc.godbolt.org to check), where it seems to be ignored. That means this could really bite people hard.
Seeing how this has been around since 3.9.0 and I'm apparently the first one to report this, I suppose it's not "release blocking" in severity, but hopefully I'm not the only one disconcerted by this.
Quick notes about the images:
Note 1: I had the quadword order backwards in YMM2.PNG, so the "6" is actually in the most significant quadword, and what looks like an address is actually in the least significant quadword. The "address" is pointing to somewhere in EfiLoaderData, which is where the executing program resides (more reason to believe it's an address). The quadword order is corrected in YMM2_XSAVE.PNG.
Note 2: Not likely to be a semantic bug since it works fine with XSAVE, and it also happens in XMM2 with movdqu in exactly the same way if using only SSE with AVX enabled: "address" and random "6" in the least and most significant quadwords of XMM2 instead of YMM2. More reason to believe the CPU is doing something weird here.