Skip to content

[X86] Fix overflow with large stack probes on x86-64 #113219

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Jan 22, 2025

Conversation

mconst
Copy link
Contributor

@mconst mconst commented Oct 21, 2024

When emitting an inline stack probe loop, we can't use SUBri to calculate the loop bound if it doesn't fit in a 32-bit (possibly sign-extended) immediate.

Fixes #113218.

Copy link

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@llvmbot
Copy link
Member

llvmbot commented Oct 21, 2024

@llvm/pr-subscribers-backend-x86

Author: None (mconst)

Changes

When emitting an inline stack probe loop, we can't use SUBri to calculate the loop bound if it doesn't fit in a 32-bit (possibly sign-extended) immediate.

Fixes #113218.


Full diff: https://github.com/llvm/llvm-project/pull/113219.diff

2 Files Affected:

  • (modified) llvm/lib/Target/X86/X86FrameLowering.cpp (+35-10)
  • (added) llvm/test/CodeGen/X86/stack-clash-huge.ll (+84)
diff --git a/llvm/lib/Target/X86/X86FrameLowering.cpp b/llvm/lib/Target/X86/X86FrameLowering.cpp
index a35b04606e595d..a083f5919837f1 100644
--- a/llvm/lib/Target/X86/X86FrameLowering.cpp
+++ b/llvm/lib/Target/X86/X86FrameLowering.cpp
@@ -798,18 +798,43 @@ void X86FrameLowering::emitStackProbeInlineGenericLoop(
                               : Is64Bit         ? X86::R11D
                                                 : X86::EAX;
 
-  BuildMI(MBB, MBBI, DL, TII.get(TargetOpcode::COPY), FinalStackProbed)
-      .addReg(StackPtr)
-      .setMIFlag(MachineInstr::FrameSetup);
-
   // save loop bound
   {
-    const unsigned BoundOffset = alignDown(Offset, StackProbeSize);
-    const unsigned SUBOpc = getSUBriOpcode(Uses64BitFramePtr);
-    BuildMI(MBB, MBBI, DL, TII.get(SUBOpc), FinalStackProbed)
-        .addReg(FinalStackProbed)
-        .addImm(BoundOffset)
-        .setMIFlag(MachineInstr::FrameSetup);
+    const uint64_t BoundOffset = alignDown(Offset, StackProbeSize);
+
+    // Can we calculate the loop bound using SUB with a 32-bit immediate?
+    // Note that the immediate gets sign-extended when used with a 64-bit
+    // register, so in that case we only have 31 bits to work with.
+    bool canUseSub =
+        Uses64BitFramePtr ? isUInt<31>(BoundOffset) : isUInt<32>(BoundOffset);
+
+    if (canUseSub) {
+      const unsigned SUBOpc = getSUBriOpcode(Uses64BitFramePtr);
+
+      BuildMI(MBB, MBBI, DL, TII.get(TargetOpcode::COPY), FinalStackProbed)
+          .addReg(StackPtr)
+          .setMIFlag(MachineInstr::FrameSetup);
+      BuildMI(MBB, MBBI, DL, TII.get(SUBOpc), FinalStackProbed)
+          .addReg(FinalStackProbed)
+          .addImm(BoundOffset)
+          .setMIFlag(MachineInstr::FrameSetup);
+    } else if (Uses64BitFramePtr) {
+      BuildMI(MBB, MBBI, DL, TII.get(X86::MOV64ri), FinalStackProbed)
+          .addImm(-BoundOffset)
+          .setMIFlag(MachineInstr::FrameSetup);
+      BuildMI(MBB, MBBI, DL, TII.get(X86::ADD64rr), FinalStackProbed)
+          .addReg(FinalStackProbed)
+          .addReg(StackPtr)
+          .setMIFlag(MachineInstr::FrameSetup);
+    } else {
+      // We're being asked to probe a stack frame that's 4 GiB or larger,
+      // but our stack pointer is only 32 bits.
+      DiagnosticInfoResourceLimit Diag(MF.getFunction(),
+                                       "probed stack frame size", BoundOffset,
+                                       0xffffffff, DS_Error, DK_ResourceLimit);
+      MF.getFunction().getContext().diagnose(Diag);
+      return;
+    }
 
     // while in the loop, use loop-invariant reg for CFI,
     // instead of the stack pointer, which changes during the loop
diff --git a/llvm/test/CodeGen/X86/stack-clash-huge.ll b/llvm/test/CodeGen/X86/stack-clash-huge.ll
new file mode 100644
index 00000000000000..03f028dfc25067
--- /dev/null
+++ b/llvm/test/CodeGen/X86/stack-clash-huge.ll
@@ -0,0 +1,84 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --no_x86_scrub_sp
+; RUN: llc -mtriple=x86_64-linux-android < %s | FileCheck -check-prefix=CHECK-X64 %s
+; RUN: llc -mtriple=i686-linux-android < %s | FileCheck -check-prefix=CHECK-X86 %s
+; RUN: llc -mtriple=x86_64-linux-gnux32 < %s | FileCheck -check-prefix=CHECK-X32 %s
+
+define i32 @foo() local_unnamed_addr #0 {
+; CHECK-X64-LABEL: foo:
+; CHECK-X64:       # %bb.0:
+; CHECK-X64-NEXT:    movabsq $-2399997952, %r11 # imm = 0xFFFFFFFF70F2F000
+; CHECK-X64-NEXT:    addq %rsp, %r11
+; CHECK-X64-NEXT:    .cfi_def_cfa_register %r11
+; CHECK-X64-NEXT:    .cfi_adjust_cfa_offset 2399997952
+; CHECK-X64-NEXT:  .LBB0_1: # =>This Inner Loop Header: Depth=1
+; CHECK-X64-NEXT:    subq $4096, %rsp # imm = 0x1000
+; CHECK-X64-NEXT:    movq $0, (%rsp)
+; CHECK-X64-NEXT:    cmpq %r11, %rsp
+; CHECK-X64-NEXT:    jne .LBB0_1
+; CHECK-X64-NEXT:  # %bb.2:
+; CHECK-X64-NEXT:    subq $1928, %rsp # imm = 0x788
+; CHECK-X64-NEXT:    .cfi_def_cfa_register %rsp
+; CHECK-X64-NEXT:    .cfi_def_cfa_offset 2399999888
+; CHECK-X64-NEXT:    movl $1, 264(%rsp)
+; CHECK-X64-NEXT:    movl $1, 28664(%rsp)
+; CHECK-X64-NEXT:    movl -128(%rsp), %eax
+; CHECK-X64-NEXT:    movl $2399999880, %ecx # imm = 0x8F0D1788
+; CHECK-X64-NEXT:    addq %rcx, %rsp
+; CHECK-X64-NEXT:    .cfi_def_cfa_offset 8
+; CHECK-X64-NEXT:    retq
+;
+; CHECK-X86-LABEL: foo:
+; CHECK-X86:       # %bb.0:
+; CHECK-X86-NEXT:    movl %esp, %eax
+; CHECK-X86-NEXT:    subl $2399997952, %eax # imm = 0x8F0D1000
+; CHECK-X86-NEXT:    .cfi_def_cfa_register %eax
+; CHECK-X86-NEXT:    .cfi_adjust_cfa_offset 2399997952
+; CHECK-X86-NEXT:  .LBB0_1: # =>This Inner Loop Header: Depth=1
+; CHECK-X86-NEXT:    subl $4096, %esp # imm = 0x1000
+; CHECK-X86-NEXT:    movl $0, (%esp)
+; CHECK-X86-NEXT:    cmpl %eax, %esp
+; CHECK-X86-NEXT:    jne .LBB0_1
+; CHECK-X86-NEXT:  # %bb.2:
+; CHECK-X86-NEXT:    subl $2060, %esp # imm = 0x80C
+; CHECK-X86-NEXT:    .cfi_def_cfa_register %esp
+; CHECK-X86-NEXT:    .cfi_def_cfa_offset 2400000016
+; CHECK-X86-NEXT:    movl $1, 392(%esp)
+; CHECK-X86-NEXT:    movl $1, 28792(%esp)
+; CHECK-X86-NEXT:    movl (%esp), %eax
+; CHECK-X86-NEXT:    movl $2400000012, %ecx # imm = 0x8F0D180C
+; CHECK-X86-NEXT:    addl %ecx, %esp
+; CHECK-X86-NEXT:    .cfi_def_cfa_offset 4
+; CHECK-X86-NEXT:    retl
+;
+; CHECK-X32-LABEL: foo:
+; CHECK-X32:       # %bb.0:
+; CHECK-X32-NEXT:    movl %esp, %r11d
+; CHECK-X32-NEXT:    subl $2399997952, %r11d # imm = 0x8F0D1000
+; CHECK-X32-NEXT:    .cfi_def_cfa_register %r11
+; CHECK-X32-NEXT:    .cfi_adjust_cfa_offset 2399997952
+; CHECK-X32-NEXT:  .LBB0_1: # =>This Inner Loop Header: Depth=1
+; CHECK-X32-NEXT:    subl $4096, %esp # imm = 0x1000
+; CHECK-X32-NEXT:    movq $0, (%esp)
+; CHECK-X32-NEXT:    cmpl %r11d, %esp
+; CHECK-X32-NEXT:    jne .LBB0_1
+; CHECK-X32-NEXT:  # %bb.2:
+; CHECK-X32-NEXT:    subl $1928, %esp # imm = 0x788
+; CHECK-X32-NEXT:    .cfi_def_cfa_register %rsp
+; CHECK-X32-NEXT:    .cfi_def_cfa_offset 2399999888
+; CHECK-X32-NEXT:    movl $1, 264(%esp)
+; CHECK-X32-NEXT:    movl $1, 28664(%esp)
+; CHECK-X32-NEXT:    movl -128(%esp), %eax
+; CHECK-X32-NEXT:    movl $2399999880, %ecx # imm = 0x8F0D1788
+; CHECK-X32-NEXT:    addq %rcx, %esp
+; CHECK-X32-NEXT:    .cfi_def_cfa_offset 8
+; CHECK-X32-NEXT:    retq
+  %a = alloca i32, i64 600000000, align 16
+  %b0 = getelementptr inbounds i32, ptr %a, i64 98
+  %b1 = getelementptr inbounds i32, ptr %a, i64 7198
+  store volatile i32 1, ptr %b0
+  store volatile i32 1, ptr %b1
+  %c = load volatile i32, ptr %a
+  ret i32 %c
+}
+
+attributes #0 =  {"probe-stack"="inline-asm"}

.setMIFlag(MachineInstr::FrameSetup);
} else {
// We're being asked to probe a stack frame that's 4 GiB or larger,
// but our stack pointer is only 32 bits.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little concerned about printing an error for valid code... maybe less of an issue here than in other contexts, but generally, stack overflows should be a runtime error.

Maybe we can just set the size of the allocation to 2^32-1, and just let the OS generate a stack overflow trap?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking at this!

Sure, trapping at runtime sounds fine, although I'd kind of like to make it an unconditional trap rather than allocating less than the user requested and relying on it to fail. (It's not likely, but you could imagine a situation where the user's code is elsewhere in the address space and the entire lower 4 GiB is available for the stack. The stack pointer would also have to be unaligned.)

Would you replace the error with a warning, or just skip it entirely and rely on the existing -Wframe-larger-than warning?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's realistic to worry about an 32-bit x86 application with an address-space larger than 4GB; I don't think it's actually possible to setup a configuration like that. But an explicit trap is probably okay too.

A warning has basically the same issues as an error: optimizations can cause it to trigger in dynamically unreachable code, and the user is left with a nonsense diagnostic that can only be fixed by blindly refactoring their code until the compiler stops complaining. So general policy is that we don't emit warnings from the backend for runtime issues.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good! I've updated the code.

@RKSimon RKSimon requested review from RKSimon and phoebewang October 28, 2024 12:27
@efriedma-quic
Copy link
Collaborator

Ping. (Current version looks okay to me, but I want a second opinion from an x86 backend expert.)

; CHECK-X32-NEXT: movl $1, 28664(%esp)
; CHECK-X32-NEXT: movl -128(%esp), %eax
; CHECK-X32-NEXT: movl $2399999880, %ecx # imm = 0x8F0D1788
; CHECK-X32-NEXT: addq %rcx, %esp
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bad instruction.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking at this!

I think this is an unrelated bug. LLVM generates the same invalid instruction even without my changes -- you can reproduce it by compiling the following code with -mx32 -fomit-frame-pointer:

void foo() {
    char x[0xa0000000];
}

https://godbolt.org/z/E41ecqPna

If you want, I can try to fix this, but maybe it might make more sense to do that as a separate PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, it's not from the modified code. Feel free to fix it or file an issue otherwise.

Comment on lines +830 to +833
// We're being asked to probe a stack frame that's 4 GiB or larger,
// but our stack pointer is only 32 bits. This might be unreachable
// code, so don't complain now; just trap if it's reached at runtime.
BuildMI(MBB, MBBI, DL, TII.get(X86::TRAP));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should add a test case for it if don't complain.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

Copy link
Contributor

@phoebewang phoebewang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

; CHECK-X32-NEXT: movl $1, 28664(%esp)
; CHECK-X32-NEXT: movl -128(%esp), %eax
; CHECK-X32-NEXT: movl $2399999880, %ecx # imm = 0x8F0D1788
; CHECK-X32-NEXT: addq %rcx, %esp
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, it's not from the modified code. Feel free to fix it or file an issue otherwise.

@efriedma-quic
Copy link
Collaborator

Ping

@phoebewang
Copy link
Contributor

@mconst do you have any update? Otherwise, I can help merge it.

@mconst
Copy link
Contributor Author

mconst commented Jan 22, 2025

Nope, I think this is ready to merge!

@phoebewang phoebewang merged commit a88f31d into llvm:main Jan 22, 2025
8 checks passed
Copy link

@mconst Congratulations on having your first Pull Request (PR) merged into the LLVM Project!

Your changes will be combined with recent changes from other authors, then tested by our build bots. If there is a problem with a build, you may receive a report in an email or a comment on this PR.

Please check whether problems have been caused by your change specifically, as the builds can include changes from many authors. It is not uncommon for your change to be included in a build that fails due to someone else's changes, or infrastructure issues.

How to do this, and the rest of the post-merge process, is covered in detail here.

If your change does cause a problem, it may be reverted, or you can revert it yourself. This is a normal part of LLVM development. You can fix your changes and open a new PR to merge them again.

If you don't get any reports, no action is required from you. Your changes are working as expected, well done!

@llvm-ci
Copy link
Collaborator

llvm-ci commented Jan 22, 2025

LLVM Buildbot has detected a new failure on builder llvm-clang-x86_64-expensive-checks-debian running on gribozavr4 while building llvm at step 6 "test-build-unified-tree-check-all".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/16/builds/12423

Here is the relevant piece of the build log for the reference
Step 6 (test-build-unified-tree-check-all) failure: test (failure)
******************** TEST 'LLVM :: CodeGen/X86/stack-clash-huge.ll' FAILED ********************
Exit Code: 2

Command Output (stderr):
--
RUN: at line 2: /b/1/llvm-clang-x86_64-expensive-checks-debian/build/bin/llc -mtriple=x86_64-linux-android < /b/1/llvm-clang-x86_64-expensive-checks-debian/llvm-project/llvm/test/CodeGen/X86/stack-clash-huge.ll | /b/1/llvm-clang-x86_64-expensive-checks-debian/build/bin/FileCheck -check-prefix=CHECK-X64 /b/1/llvm-clang-x86_64-expensive-checks-debian/llvm-project/llvm/test/CodeGen/X86/stack-clash-huge.ll
+ /b/1/llvm-clang-x86_64-expensive-checks-debian/build/bin/llc -mtriple=x86_64-linux-android
+ /b/1/llvm-clang-x86_64-expensive-checks-debian/build/bin/FileCheck -check-prefix=CHECK-X64 /b/1/llvm-clang-x86_64-expensive-checks-debian/llvm-project/llvm/test/CodeGen/X86/stack-clash-huge.ll
RUN: at line 3: /b/1/llvm-clang-x86_64-expensive-checks-debian/build/bin/llc -mtriple=i686-linux-android < /b/1/llvm-clang-x86_64-expensive-checks-debian/llvm-project/llvm/test/CodeGen/X86/stack-clash-huge.ll | /b/1/llvm-clang-x86_64-expensive-checks-debian/build/bin/FileCheck -check-prefix=CHECK-X86 /b/1/llvm-clang-x86_64-expensive-checks-debian/llvm-project/llvm/test/CodeGen/X86/stack-clash-huge.ll
+ /b/1/llvm-clang-x86_64-expensive-checks-debian/build/bin/llc -mtriple=i686-linux-android
+ /b/1/llvm-clang-x86_64-expensive-checks-debian/build/bin/FileCheck -check-prefix=CHECK-X86 /b/1/llvm-clang-x86_64-expensive-checks-debian/llvm-project/llvm/test/CodeGen/X86/stack-clash-huge.ll
RUN: at line 4: /b/1/llvm-clang-x86_64-expensive-checks-debian/build/bin/llc -mtriple=x86_64-linux-gnux32 < /b/1/llvm-clang-x86_64-expensive-checks-debian/llvm-project/llvm/test/CodeGen/X86/stack-clash-huge.ll | /b/1/llvm-clang-x86_64-expensive-checks-debian/build/bin/FileCheck -check-prefix=CHECK-X32 /b/1/llvm-clang-x86_64-expensive-checks-debian/llvm-project/llvm/test/CodeGen/X86/stack-clash-huge.ll
+ /b/1/llvm-clang-x86_64-expensive-checks-debian/build/bin/FileCheck -check-prefix=CHECK-X32 /b/1/llvm-clang-x86_64-expensive-checks-debian/llvm-project/llvm/test/CodeGen/X86/stack-clash-huge.ll
+ /b/1/llvm-clang-x86_64-expensive-checks-debian/build/bin/llc -mtriple=x86_64-linux-gnux32

# After Prologue/Epilogue Insertion & Frame Finalization
# Machine code for function foo: NoPHIs, TracksLiveness, NoVRegs, TiedOpsRewritten, TracksDebugUserValues
Frame Objects:
  fi#0: size=2400000000, align=16, at location [SP-2400000008]

bb.0 (%ir-block.0):
  successors: %bb.1(0x80000000); %bb.1(100.00%)

  $r11d = frame-setup COPY $esp
  $r11d = frame-setup SUB32ri $r11d(tied-def 0), 2399997952, implicit-def $eflags
  CFI_INSTRUCTION def_cfa_register $r11
  CFI_INSTRUCTION adjust_cfa_offset 2399997952

bb.1 (%ir-block.0):
; predecessors: %bb.1, %bb.0
  successors: %bb.1(0x40000000), %bb.2(0x40000000); %bb.1(50.00%), %bb.2(50.00%)
  liveins: $r11d
  $esp = frame-setup SUB32ri $esp(tied-def 0), 4096, implicit-def dead $eflags
  frame-setup MOV64mi32 $esp, 1, $noreg, 0, $noreg, 0
  frame-setup CMP32rr $esp, $r11d, implicit-def $eflags
  frame-setup JCC_1 %bb.1, 5, implicit $eflags

bb.2 (%ir-block.0):
; predecessors: %bb.1

  $esp = frame-setup SUB32ri $esp(tied-def 0), 1928, implicit-def dead $eflags
  CFI_INSTRUCTION def_cfa_register $rsp
  frame-setup CFI_INSTRUCTION def_cfa_offset 2399999888
  MOV32mi $esp, 1, $noreg, 264, $noreg, 1 :: (volatile store (s32) into %ir.b0, align 8)
  MOV32mi $esp, 1, $noreg, 28664, $noreg, 1 :: (volatile store (s32) into %ir.b1, align 8)
  renamable $eax = MOV32rm $esp, 1, $noreg, -128, $noreg :: (volatile load (s32) from %ir.a, align 16)
  $rcx = frame-destroy MOV32ri64 2399999880
  $esp = ADD64rr $esp(tied-def 0), $rcx, implicit-def dead $eflags
  frame-destroy CFI_INSTRUCTION def_cfa_offset 8
  RET 0, $eax
...

@mconst
Copy link
Contributor Author

mconst commented Jan 22, 2025

The buildbot failure looks like it's from the bad instruction you pointed out earlier (which actually gets detected automatically when expensive checks are enabled). I think the fix should be pretty simple -- I'll try writing up a PR for it soon.

@RKSimon
Copy link
Collaborator

RKSimon commented Jan 22, 2025

@mconst is the fix imminent? currently the expensive checks bots are failing due to this.

I'd also suggest you add -verify-machineinstrs to the RUN commands in the stack-clash test files to better expose this.

@mconst
Copy link
Contributor Author

mconst commented Jan 23, 2025

I should have the PR ready within a couple of hours at the latest, although it'll probably still take time for it to be reviewed and merged. If that's too long, feel free to revert this for now!

Sure, I'll add -verify-machineinstrs to the test files.

@mconst
Copy link
Contributor Author

mconst commented Jan 23, 2025

Okay, I've submitted the fix as #124041.

abhishek-kaushik22 pushed a commit that referenced this pull request Jan 23, 2025
`X86FrameLowering::emitSPUpdate()` assumes that 64-bit targets use a
64-bit stack pointer, but that's not true on x32.
When checking the stack pointer size, we need to look at
`Uses64BitFramePtr` rather than `Is64Bit`. This avoids generating
invalid instructions like `add esp, rcx`.

For impossibly-large stack frames (4 GiB or larger with a 32-bit stack
pointer), we were also generating invalid instructions like `mov eax,
5000000000`. The inline stack probe code already had a check for that
situation; I've moved the check into `emitSPUpdate()`, so any attempt to
allocate a 4 GiB stack frame with a 32-bit stack pointer will now trap
rather than adjusting ESP by the wrong amount. This also fixes the
"can't have 32-bit 16GB stack frame" assertion, which used to be
triggerable by user code but is now correct.

To help catch situations like this in the future, I've added
`-verify-machineinstrs` to the stack clash tests that generate large
stack frames.

This fixes the expensive-checks buildbot failure caused by #113219.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[X86] Large stack frames are miscompiled with -fstack-clash-protection
6 participants