Skip to content

[LLVM] APInt::tcAdd has quiet poor codegen. #135486

Open
@Ralender

Description

@Ralender

Here is an example of 1 iteration of the loop
Compiled with clang trunk -O3

; start of iteration
      89: 4c 8b 14 cf                  	movq	(%rdi,%rcx,8), %r10 ; load dst[i]
      8d: 4c 8b 0c ce                  	movq	(%rsi,%rcx,8), %r9 ; load rhs[i]
      91: 48 85 d2                     	testq	%rdx, %rdx ; test if we have a carry
      94: 74 1a                        	je	0xb0 <_ZN4llvm5APInt5tcAddEPmPKmmj+0xb0>

      96: 4f 8d 4c 0a 01               	leaq	0x1(%r10,%r9), %r9 ; add with a carry
      9b: 4d 39 d1                     	cmpq	%r10, %r9
      9e: 41 0f 96 c2                  	setbe	%r10b ; put carry of next iteration in r10b
      a2: eb 13                        	jmp	0xb7 <_ZN4llvm5APInt5tcAddEPmPKmmj+0xb7>

      b0: 4d 01 d1                     	addq	%r10, %r9 ; add without carry
      b3: 41 0f 92 c2                  	setb	%r10b ; put carry of next iteration in r10b

      b7: 4c 89 0c cf                  	movq	%r9, (%rdi,%rcx,8) ; write result back to memory
; next iteration unrolled
      bb: 4c 8b 4c cf 08               	movq	0x8(%rdi,%rcx,8), %r9
      c0: 48 8b 54 ce 08               	movq	0x8(%rsi,%rcx,8), %rdx
      c5: 45 84 d2                     	testb	%r10b, %r10b
; ...

Just as a reference here is What I beleive optimal x86-64 code (at least without vectorization) looks like

; start of iteration
; carry is in CF, 
    12a1: 4a 8b 4c ce f0               	movq	-0x10(%rsi,%r9,8), %rcx ; load rhs[i]
    12a6: 4a 11 4c cf f0               	adcq	%rcx, -0x10(%rdi,%r9,8) ; lhs[i] = lhs[i] + rhs[i] + carry
; next iteration unrolled
    12ab: 4a 8b 4c ce e8               	movq	-0x18(%rsi,%r9,8), %rcx
    12b0: 4a 11 4c cf e8               	adcq	%rcx, -0x18(%rdi,%r9,8)
;...

Stuff that could be done better:

  • one side of the branch detects overflow with a cmp instead of CF
  • The branch doesn't get eliminated.
  • adc is not used.

I made a benchmark to get an idea of how big the impact is.
APInt::tcAdd (with unpredictable branch) : 9834.9 ns for 4096 iterations in average
APInt::tcAdd (with predictable branch) : 1728.4 ns for 4096 iterations in average
optimized code (no branch) : 1045.6 ns for 4096 iterations in average

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions