SimplifyCFG recently gained commit ede27d8 by @nikic. It now seems to backfire in some cases:
Compile the attached bcmp.ll (generated from llvm-project/libc/src/string/bcmp.cpp) like so:
$ clang -O3 -S bcmp.ll -o bcmp.s
Then I get:
Without the patch:
# %bb.26:
movdqu -16(%rdi,%rdx), %xmm0
movdqu -16(%rsi,%rdx), %xmm1
jmp .LBB1_34
:
:
:
# %bb.33:
movdqu (%rdi,%rdx), %xmm0
movdqu (%rsi,%rdx), %xmm1
.LBB1_34: # %.loopexit
With the patch:
# %bb.25:
addq %rdx, %rdi
addq $-16, %rdi
addq %rdx, %rsi
addq $-16, %rsi
jmp .LBB1_33
:
:
:
# %bb.32:
addq %rdx, %rdi
addq %rdx, %rsi
.LBB1_33: # %.loopexit.sink.split
movdqu (%rdi), %xmm0
movdqu (%rsi), %xmm1
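Notice that the two loads are sunk just below the join point (into %.loopexit.sink.split), while the address calculations are left behind in the predecessor blocks. Without the patch, each predecessor folds the offset directly into the load's addressing mode (e.g. -16(%rdi,%rdx)). With the patch, each predecessor instead needs explicit addq instructions to materialize the address before the join. This seems to result in worse performance and increased dynamic instruction counts.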
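To make the pattern concrete, here is a hypothetical C reduction of the control flow. The function names, the block16 type and the parameters are made up for illustration. This is not the code in bcmp.cpp, and it has not been checked to reproduce the regression. The first function mirrors the listing without the patch: each predecessor does its own pair of loads. The second mirrors the shape after sinking: the predecessors only compute addresses, and a single pair of loads sits after the join (a phi of pointers at the IR level).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for one 16-byte SSE register (movdqu). */
typedef struct {
  uint64_t lo, hi;
} block16;

/* Unaligned 16-byte load; memcpy avoids alignment/aliasing UB. */
static block16 load16(const unsigned char *p) {
  block16 v;
  memcpy(&v, p, sizeof v);
  return v;
}

/* Returns 1 if the two 16-byte blocks differ, 0 otherwise. */
static int differs(block16 x, block16 y) {
  return ((x.lo ^ y.lo) | (x.hi ^ y.hi)) != 0;
}

/* Shape without the patch: loads live in each predecessor, so the
   offset can fold into the load's addressing mode. */
int window_differs_loads_in_preds(const unsigned char *a,
                                  const unsigned char *b,
                                  size_t n, int from_end) {
  block16 x, y;
  if (from_end) {
    x = load16(a + n - 16);
    y = load16(b + n - 16);
  } else {
    x = load16(a + n);
    y = load16(b + n);
  }
  return differs(x, y); /* join point */
}

/* Shape after sinking: predecessors only compute addresses; the
   loads happen once, after the join. */
int window_differs_loads_sunk(const unsigned char *a,
                              const unsigned char *b,
                              size_t n, int from_end) {
  const unsigned char *pa, *pb;
  if (from_end) {
    pa = a + n - 16;
    pb = b + n - 16;
  } else {
    pa = a + n;
    pb = b + n;
  }
  return differs(load16(pa), load16(pb)); /* loads after the join */
}
```

Both functions compute the same result. The difference is only in which shape reaches the backend. On x86-64, the first shape lets each branch use a base+index+displacement operand, while the second forces the addition to be materialized before the join, which matches the extra addq instructions in the listing with the patch. Whether a particular clang version emits either shape for these exact functions is an assumption, not something verified here.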