
^(::Float, ::Integer) #42031

Merged: 12 commits merged into master from oscardssmith-faster-float-int-pow on Oct 4, 2021
Conversation

oscardssmith
Member

@oscardssmith oscardssmith commented Aug 27, 2021

Uses compensated power by squaring for Float64. For Float32, this just uses power by squaring in Float64. This gives 0.5 ULP accuracy for both types.

x32 = randn(Float32, 1024); y32 = similar(x32);
x64 = 10 .* randn(Float64, 1024); y64 = similar(x64);

Old timings:

@btime @. $y32 = $x32^101;
  52.840 μs (0 allocations: 0 bytes)
@btime @. $y64 = $x64^101;
  66.471 μs (0 allocations: 0 bytes)

New timings:

@btime @. $y32 = $x32^101;
4.619 μs (0 allocations: 0 bytes)

@btime @. $y64 = $x64^101;
  8.520 μs (0 allocations: 0 bytes)

Furthermore, this gets .5 ULP accuracy on all inputs I've tested so far, and is even faster than what I've shown for smaller powers.
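
For context, here is a minimal sketch of the approach described above (illustrative only, not the PR's actual code; pow_by_squaring64 and pow32 are made-up names): the Float32 method widens to Float64, does plain power by squaring there, and rounds back to Float32 once at the end, so the accumulated Float64 rounding error stays well below a Float32 ULP.

# Illustrative sketch, not the implementation in this PR.
function pow_by_squaring64(x::Float64, n::Integer)
    n == 0 && return 1.0
    n < 0 && return inv(pow_by_squaring64(x, -n))
    y = 1.0
    while n > 1
        isodd(n) && (y *= x)   # fold in the current bit of the exponent
        x *= x                 # square for the next bit
        n >>= 1
    end
    return x * y
end

# Hypothetical Float32 entry point: widen, compute, round back once.
pow32(x::Float32, n::Integer) = Float32(pow_by_squaring64(Float64(x), n))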

@oscardssmith added the performance (Must go faster) and maths (Mathematical functions) labels on Aug 27, 2021
@palday
Contributor

palday commented Aug 27, 2021

That's a dramatic increase in memory use (several orders of magnitude) for only halving the computational time. Also, the accuracy should be proven by numerical analysis for all inputs, with testing serving only to show that there are no known cases of a particular platform violating the assumptions of that analysis.

@oscardssmith
Member Author

Good point. The allocations were because I was incorrectly referring to a method that didn't exist (an old name for the same method), which created a type instability. Fixing that makes the result a lot faster too. Benchmarks updated.

I still need to do the analysis on this, but assuming I haven't screwed up the implementation, it should be guaranteed to be very accurate. The Float32 version performs a maximum of 60 Float64 multiplications and additions (none of which can suffer catastrophic cancellation), and the Float64 method performs power by squaring with compensated summation, which should keep the error very small.
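
The compensation mentioned above relies on the standard fma-based "two-product" transform; a small sketch (the name two_prod is illustrative, not from the PR):

# hi + lo equals a*b exactly (barring overflow/underflow): hi is the rounded
# product and lo is the rounding error recovered with a single fma.
function two_prod(a::Float64, b::Float64)
    hi = a * b
    lo = fma(a, b, -hi)
    return hi, lo
end

hi, lo = two_prod(0.1, 0.3)   # lo captures the part of the product that hi loses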

@palday
Contributor

palday commented Aug 27, 2021

The new benchmarks show no allocations, which I can't replicate locally (on both 1.6.2 and master at 6814f2b3dc) and find a little bit suspicious.

@oscardssmith
Member Author

What are you seeing?

@palday
Contributor

palday commented Aug 27, 2021

julia> @benchmark @. y64 = $x64^101
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  66.955 μs … 132.993 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     67.261 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   68.300 μs ±   4.078 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▃ ▁▇▃                                                       ▁
  ██▅████▆▅▇▇▇▆▄▆▃▅▃▃▁▁▁▁▃▁▁▃▄▃▅▃▃▅▃▃▆▅▅▅▄▅▆▆▆▆▆▆▄▅▄▅▆▅▄▄▄▁▄▅▅ █
  67 μs         Histogram: log(frequency) by time      90.3 μs <

 Memory estimate: 48 bytes, allocs estimate: 3.

julia> versioninfo()
Julia Version 1.8.0-DEV.432
Commit 6814f2b3dc* (2021-08-27 14:01 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Xeon(R) E-2288G CPU @ 3.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)

@mcabbott
Contributor

The benchmark above has $x but no $y, which gives me 3 allocations: 48 bytes. With interpolation:

julia> @btime @. $y32 = $x32 ^ 101;  # as above, but with $y
  5.201 μs (0 allocations: 0 bytes)  # nightly
  4.786 μs (0 allocations: 0 bytes)  # this PR

julia> @btime @. $y64 = $x64 ^ 101;
  8.347 μs (0 allocations: 0 bytes)
  2.250 μs (0 allocations: 0 bytes)

julia> @btime @. $y32 = $x32 ^ 7;  # lower power
  5.188 μs (0 allocations: 0 bytes)
  2.477 μs (0 allocations: 0 bytes)

julia> @btime @. $y64 = $x64 ^ 7;
  8.403 μs (0 allocations: 0 bytes)
  1.046 μs (0 allocations: 0 bytes)

julia> @btime @. $y32 = $x32 ^ 3;  # for comparison
  67.612 ns (0 allocations: 0 bytes)

julia> @btime @. $y64 = $x64 ^ 3;
  126.949 ns (0 allocations: 0 bytes)

This is on an M1 mac.
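
The allocations palday saw come from benchmarking through a non-interpolated global rather than from ^ itself; a quick way to see the difference (sketch, assumes BenchmarkTools is loaded):

using BenchmarkTools

x64 = 10 .* randn(Float64, 1024); y64 = similar(x64);

@btime @. y64  = $x64 ^ 101;   # y64 is a non-constant global: a few allocations
@btime @. $y64 = $x64 ^ 101;   # interpolated: 0 allocations, measures ^ alone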

base/math.jl (outdated review thread, resolved)
Co-authored-by: Kristoffer Carlsson <kcarlsson89@gmail.com>
@oscardssmith
Member Author

@palday I don't have a final answer for error analysis yet, but here's what I have so far.
The worst positive exponent is 2^62-1 (because for all bigger exponents, the answer is Inf or 0).

In this case, the algorithm simplifies to:

function mypow2p62m1(x)
    y = 1.0
    xnlo = ynlo = 0.0
    for _ in 1:63
        yn = x*y
        ynlo = fma(x, y , -yn) + muladd(y, xnlo, x*ynlo) # adds error of xnlo*ynlo + eps(x*ynlo) + eps(y*xnlo)
        y = yn
        xn = x * x
        xnlo = muladd(x, 2*xnlo, fma(x, x, -xn))  # adds error of xnlo^2 + eps(x*xnlo)
        x = xn
    end
    return muladd(x, y, muladd(y, xnlo, x*ynlo)) # adds error of xnlo*ynlo + eps(x*ynlo) + eps(y*xnlo)
end

Theoretically, we could get an upper bound on this error by adding up all the terms (at their maximum), but the intuitive answer is that, since each loop iteration introduces very roughly 5*eps(eps(xn)) of error, the maximal relative error (before the final operation, when we return one result that has an unavoidable 0.5 ULP) is roughly c*eps(eps(x)), where c is a constant that should be < 1000.
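
A quick empirical check of that claim (my sketch, not part of the PR): compare against a BigFloat reference and express the difference in (approximate) ULPs of the correctly rounded result.

# Approximate ULP error of the Float64 ^ method against a BigFloat reference.
function ulp_error(x::Float64, n::Integer)
    ref = Float64(big(x)^n)          # correctly rounded reference value
    return abs(x^n - ref) / eps(ref)
end

# Per the PR description, this should not exceed 0.5 on tested inputs.
maximum(ulp_error(x, 101) for x in 10 .* randn(1024))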

@KristofferC
Member

Just some references to some previous accuracy discussions w.r.t this function:

#19872, #19890, #2741, #8939

@PallHaraldsson
Contributor

PallHaraldsson commented Aug 30, 2021

FMA has been in basically all platforms, CPUs and Nvidia GPUs, since at least 2011 (and it's in the IEEE 754-2008 revision). Can't we just swallow the slowdown on older CPUs and make a policy change to use FMA (not just for this), given the huge upside on newer ones (the LTS will keep working for older platforms without a slowdown for a year or more)? Does LLVM generate slower code with the same accuracy, instead of the instruction, for older platforms?

Could we make a @fast_for_old_processors macro so we don't perform "poorly" on old platforms, if people are worried (I'm not)?

@palday
Contributor

palday commented Aug 30, 2021

@PallHaraldsson It's not uncommon for big academic compute infrastructure to have processors that are 10 years out of date, but to have a lot of them in a high-memory setup. That way you can benefit from massive parallelism, and they're not spending millions every couple of years to upgrade.

@oscardssmith added the triage (This should be discussed on a triage call) label on Sep 10, 2021
@oscardssmith
Member Author

Adding a triage label to decide whether it's acceptable for Base to require fma hardware for good performance.

@SamuelBadr
Contributor

The new implementation needs n == 0 && return one(x).

@oscardssmith
Member Author

This actually already returns 1.0 for n==0. That said, from reading the code, I'm having trouble figuring out why it works.

@PallHaraldsson
Contributor

PallHaraldsson commented Sep 13, 2021

returns 1.0

As it should, because of one(x); did you read it as one(n)? That's not the common case, so I tested, and it seems faster this way:

@inline function p3(x::Float64, n::Integer)
    if n > 0
        y = 1.0
        xnlo = ynlo = 0.0
        while n > 1
            if n&1 > 0
                yn = x*y
                ynlo = fma(x, y, -yn) + muladd(y, xnlo, x*ynlo)
                y = yn
            end
            xn = x * x
            xnlo = muladd(x, 2*xnlo, fma(x, x, -xn))
            x = xn
            n >>>= 1
        end
        !isfinite(x) && return x*y
        return muladd(x, y, muladd(y, xnlo, x*ynlo))
    end
    n < 0 && return @noinline inv(x^(-n))
    return one(n) # n == 0
end

This isn't tested with @noinline, which I understand is now allowed at call sites; it seems it could and should be applied there, since that's an unusual case.

@SamuelBadr
Contributor

This actually already returns 1.0 for n==0. That said, from reading the code, I'm having trouble figuring out why it works.

Are you sure? I'm not at a computer right now so I can't check, but I'm 99% confident x^0 gave x when I tried it before. (Which aligns with what I expect from reading the code.)

@JeffBezanson
Member

From triage:

  • Option 1: Get multiversioning for if have_fma working very soon (multiversioning pass replaces an intrinsic with a constant based on target).
  • Option 2: Only support CPUs with fma
  • Option 3: Don't care about performance on non-fma CPUs.

Some of us think a good ordering is to do option 3 first, and whenever 1 is implemented those systems will just get faster.

@oscardssmith removed the triage (This should be discussed on a triage call) label on Sep 16, 2021
@oscardssmith
Member Author

@samuel3008 I figured out what was happening. I was hitting literal_pow, which was giving different results.
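
For reference, a short illustration of the distinction (standard Julia lowering, nothing new in this PR): a literal integer exponent is lowered to Base.literal_pow, while a runtime Integer exponent dispatches to the ^ method being changed here.

x = 2.5
n = 0

x^0   # lowered to Base.literal_pow(^, x, Val(0)), which returns one(x) == 1.0
x^n   # calls ^(x, n), the method modified in this PR

# Because these are separate code paths, they can disagree unless the
# runtime method is written to match literal_pow.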

base/math.jl: three outdated review threads, resolved
@oscardssmith
Member Author

I was trying to do this as efficiently as possible. Using y == round(y) would require one round, one check on the value (abs(y) < 2^64), and then a round(Int, y). This way, when it works, it only needs one rounding.

@oscardssmith
Member Author

All tests passing. Anyone have objections to me merging in the next few days?

@oscardssmith
Member Author

Merging tomorrow unless there are objections.

@vchuravy
Member

vchuravy commented Oct 4, 2021

Don't forget to squash :)

@PallHaraldsson
Contributor

PallHaraldsson commented Oct 4, 2021

[EDIT: Again a bug, looking into it] I did extensive benchmarking of your version vs. mine, and I didn't see a benefit from your extra code for e.g. the power of three (so I would drop the special cases for that and for specific powers, to keep the machine code short, since it's inlined):

@inline function ^(x::Float64, n::Integer)
    # Doesn't seem to have any effect in the original: n == 0 && return one(x)
    x_original_squared = x*x; y = 1.0
    xnlo = ynlo = 0.0
    if n < 0
        rx = inv(x)
        isfinite(x) && (xnlo = fma(x, rx, -1.) * rx)
        x = rx
        n = -n
    end
    n -= 2 # Not needed: n==3 && return x*x*x # keep compatibility with literal_pow
    while n > 1
        if n&1 != 0
            yn = x*y
            ynlo = fma(x, y, -yn) + muladd(y, xnlo, x*ynlo)
            y = yn
        end
        xn = x * x
        xnlo = muladd(x, 2*xnlo, fma(x, x, -xn))
        x = xn
        n >>= 1
    end
    # I was thinking of moving here: n == 0 && return one(x)
    !isfinite(x) && return x_original_squared*x*y
    return muladd(x_original_squared*x, y, muladd(y, xnlo, x*ynlo))
end

@oscardssmith
Member Author

The reason to special case p=3 isn't for performance. It's so that x^3 returns the same result as p=3; x^p.
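
A small illustration of the invariant that the special case preserves (my example, not from the PR):

x = 1.2345
p = 3
@assert x^3 == x^p   # literal_pow path and runtime-Integer path must agree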

@oscardssmith
Member Author

oscardssmith commented Oct 4, 2021

The reason I added the code here rather than removing the special case for literal_pow is that there are a lot of people who rely on the fact that x^3 is really fast, and making the literal version match the results of this would make it noticeably slower. Specifically, it would expand to

x2 = x*x
x2lo = fma(x, x, -x2)
return fma(x2, x, x*x2lo)

which is a 2x increase in instructions.

@PallHaraldsson
Contributor

If you wanted to avoid fma (for ancient hardware, to avoid slowness), then my code does that for powers 2 and 3 because of "n -= 2", while yours does only for power 3. You state that a "lot of people ... rely on the fact that x^3 is really fast", which my code seems to be, but above you wrote that the "special case p=3 isn't for performance".

@oscardssmith
Member Author

I'm not doing this to avoid slow fma. I'm doing it because x*x*x is at least 2x faster than any form of x^3 that performs error compensation. Your version avoids that cost, but will produce incorrect results (more than 0.5 ULP) for n > 3, since it doesn't compensate the multiplication by x^2.

@oscardssmith oscardssmith merged commit c8ae5c6 into master Oct 4, 2021
@oscardssmith oscardssmith deleted the oscardssmith-faster-float-int-pow branch October 4, 2021 18:37
@KristofferC
Member

FWIW, it is probably a good idea to run PkgEval on PRs like this that change very low-level functions.
