[LLVM] APInt::tcAdd has quiet poor codegen.

Here is an example of 1 iteration of the loop
Compiled with clang trunk -O3
```asm
; start of iteration
      89: 4c 8b 14 cf                  	movq	(%rdi,%rcx,8), %r10 ; load dst[i]
      8d: 4c 8b 0c ce                  	movq	(%rsi,%rcx,8), %r9 ; load rhs[i]
      91: 48 85 d2                     	testq	%rdx, %rdx ; test if we have a carry
      94: 74 1a                        	je	0xb0 <_ZN4llvm5APInt5tcAddEPmPKmmj+0xb0>

      96: 4f 8d 4c 0a 01               	leaq	0x1(%r10,%r9), %r9 ; add with a carry
      9b: 4d 39 d1                     	cmpq	%r10, %r9
      9e: 41 0f 96 c2                  	setbe	%r10b ; put carry of next iteration in r10b
      a2: eb 13                        	jmp	0xb7 <_ZN4llvm5APInt5tcAddEPmPKmmj+0xb7>

      b0: 4d 01 d1                     	addq	%r10, %r9 ; add without carry
      b3: 41 0f 92 c2                  	setb	%r10b ; put carry of next iteration in r10b

      b7: 4c 89 0c cf                  	movq	%r9, (%rdi,%rcx,8) ; write result back to memory
; next iteration unrolled
      bb: 4c 8b 4c cf 08               	movq	0x8(%rdi,%rcx,8), %r9
      c0: 48 8b 54 ce 08               	movq	0x8(%rsi,%rcx,8), %rdx
      c5: 45 84 d2                     	testb	%r10b, %r10b
; ...
```
Just as a reference here is What I beleive optimal x86-64 code (at least without vectorization) looks like
```asm
; start of iteration
; carry is in CF, 
    12a1: 4a 8b 4c ce f0               	movq	-0x10(%rsi,%r9,8), %rcx ; load rhs[i]
    12a6: 4a 11 4c cf f0               	adcq	%rcx, -0x10(%rdi,%r9,8) ; lhs[i] = lhs[i] + rhs[i] + carry
; next iteration unrolled
    12ab: 4a 8b 4c ce e8               	movq	-0x18(%rsi,%r9,8), %rcx
    12b0: 4a 11 4c cf e8               	adcq	%rcx, -0x18(%rdi,%r9,8)
;...
```

Stuff that could be done better:
 - one side of the branch detects overflow with a cmp instead of CF
 - The branch doesn't get eliminated.
 - adc is not used.

I made a benchmark to get an idea of how big the impact is.
 APInt::tcAdd (with unpredictable branch) : 9834.9 ns for 4096 iterations in average
 APInt::tcAdd (with predictable branch) : 1728.4 ns for 4096 iterations in average
 optimized code (no branch)    : 1045.6 ns for 4096 iterations in average


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[LLVM] APInt::tcAdd has quiet poor codegen. #135486

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[LLVM] APInt::tcAdd has quiet poor codegen. #135486

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions