Skip to content

Conversation

@jmert
Copy link
Contributor

@jmert jmert commented Jun 3, 2020

This PR has reduced in scope/changed goals. The original message is shown below, but the purpose now is to simply satisfy an old TODO in the definition of muladd for Complex arguments that mentions a "future" mulsub function. The mulsub function was never added (see #15985), and modern LLVM's optimizations are sufficient to take a sequence of muladd and negations and turn it into the corresponding fused multiply-subtract where appropriate.


PR #15980 defined muladd for Complex (and various combinations of mixed Complex-Real) arguments. It seems the same could (should?) be done for fma as well.

The original TODO comments concerning a future mulsub no longer apply since LLVM appears to be capable of reconstructing the fused multiply-subtract, as already mentioned recently in #35562 (comment):

julia> code_native(fma, Tuple{ComplexF64, ComplexF64, ComplexF64}, debuginfo=:none)
        .text
        vmovsd  8(%rsi), %xmm1          # xmm1 = mem[0],zero
        vmovsd  8(%rdx), %xmm3          # xmm3 = mem[0],zero
        vmovsd  (%rcx), %xmm4           # xmm4 = mem[0],zero
        vmovsd  (%rsi), %xmm0           # xmm0 = mem[0],zero
        vmovsd  (%rdx), %xmm2           # xmm2 = mem[0],zero
        movq    %rdi, %rax
        vfmsub231sd     %xmm3, %xmm1, %xmm4 # xmm4 = (xmm1 * xmm3) - xmm4
        vfmsub231sd     %xmm2, %xmm0, %xmm4 # xmm4 = (xmm0 * xmm2) - xmm4
        vfmadd213sd     8(%rcx), %xmm1, %xmm2 # xmm2 = (xmm1 * xmm2) + mem
        vmovsd  %xmm4, (%rdi)
        vfmadd231sd     %xmm3, %xmm0, %xmm2 # xmm2 = (xmm0 * xmm3) + xmm2
        vmovsd  %xmm2, 8(%rdi)
        retq
        nopl    (%rax,%rax)

For a motivating reason, I've been working on an algorithm where fma on real inputs is necessary for maintaining as much numerical accuracy as possible, but the mathematical function is technically defined over the complex domain as well. Without appropriate definitions of fma for Complex, the algorithm currently fails for complex arguments (or requires an extra type-based branch to fall back to muladd, which would give divergent answers when the complex input is located on the real axis).

@yuyichao
Copy link
Contributor

yuyichao commented Jun 3, 2020

Is fma well defined for complex number? Unlike muladd, fma should be well defined and shouldn't just be defined as using as many fusion as seems convinient.

  1. Is there a standard on this? If there is we should follow that.
  2. From the "definition". fma should have a single rounding to the final result (one for real and one for imaginary should be fine) but it doesn't seems like the proposed implementation has this property.

@simonbyrne
Copy link
Member

  • Is there a standard on this? If there is we should follow that.

I don't think so. The IEEE spec doesn't discuss complex numbers, and the C spec doesn't define an FMA for them.

  • From the "definition". fma should have a single rounding to the final result (one for real and one for imaginary should be fine) but it doesn't seems like the proposed implementation has this property.

Agreed: "fusing" is a necessary property of fma, and this doesn't do that.

@jmert
Copy link
Contributor Author

jmert commented Jun 3, 2020

OK, I was concerned that that might be the response, but I thought I'd give it a try.

muladd does effectively reduce to the fma operations I want on modern LLVM (when FMA is available as a architecture instruction), so I guess my generic way forward is to define a private fma/muladd function for real and complex arguments, respectively.

@simonbyrne
Copy link
Member

Doesn’t the complex muladd already do that?

We should change the TODOs now that the mulsub issue is resolved: did you want to update the PR to do that?

@jmert
Copy link
Contributor Author

jmert commented Jun 3, 2020

Doesn’t the complex muladd already do that?

Yes — what I meant was that for the real case, I want to force fma even if it's not the efficient version because the accuracy is required, but then I need an appropriate fallback path for Complex arguments. I can just multiple-dispatch my way to that with a private function.

We should change the TODOs now that the mulsub issue is resolved: did you want to update the PR to do that?

Sure, I can do that.

Modern LLVM has sufficient optimizations to translate muladds with
multiple negations into the appropriate fused-multiple-subtract
instructions on x86_64.
@jmert jmert changed the title Define fma() for Complex arguments Satisfy TODO in Complex muladd Jun 3, 2020
@jmert
Copy link
Contributor Author

jmert commented Jun 4, 2020

OK, I've replaced the original commits with a new one which simply completes the muladd chain with further negated muladds. On (at least my) x86_64 machine the complex-complex-complex case goes from a single FMA with additional packed vector instructions to 4 FMAs and a multiply. I don't have access to any arch64 processors, though I think it looks OK according to Godbolt (though I admittedly don't know much about arch processors).

@simonbyrne simonbyrne merged commit b5868b9 into JuliaLang:master Jun 4, 2020
@simonbyrne
Copy link
Member

Thanks!

@jmert jmert deleted the complex_fma branch June 4, 2020 22:52
simeonschaub pushed a commit to simeonschaub/julia that referenced this pull request Aug 11, 2020
…#36140)

Modern LLVM has sufficient optimizations to translate muladds with
multiple negations into the appropriate fused-multiple-subtract
instructions on x86_64.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants