Double-width tower extension part 1 #72

Merged
mratsim merged 24 commits into master from double-width-tower
Aug 20, 2020

@mratsim mratsim commented Aug 18, 2020

This introduces double-width tower extension part 1 as detailed in

  • High-Speed Software Implementation of the Optimal Ate Pairing over Barreto-Naehrig Curves
    Jean-Luc Beuchat and Jorge Enrique González Díaz and Shigeo Mitsunari and Eiji Okamoto and Francisco Rodríguez-Henríquez and Tadanori Teruya, 2010
    https://eprint.iacr.org/2010/354

which improved Fp2 operations by 30%.

The procedures have pure Nim, x86 Assembly and MULX/ADCX/ADOX variants.
Benchmarks of individual operations have been added.

Double-width operations are done by introducing an FpDbl type. Fp2Dbl and Fp6Dbl types can later be introduced to delay reductions for Fp12/pairings; this will be done in future PRs.
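
To make the idea of delaying reductions concrete, here is a minimal toy sketch in Nim. It is not Constantine's code: the 31-bit prime, the object layouts used for `Fp` and `FpDbl`, and the `dotEager`/`dotLazy` helpers are all assumptions made for illustration, with a `uint64` standing in for a double-width limb array.

```nim
# Toy model of lazy reduction. NOT Constantine's API: every name and
# layout below is hypothetical and only illustrates the principle.
const P = 2147483647'u64      # 2^31 - 1, a small Mersenne prime

type
  Fp = object                 # reduced field element, v < P
    v: uint32
  FpDbl = object              # unreduced double-width value
    v: uint64

func mulNoReduce(a, b: Fp): FpDbl =
  ## Full-width product, no modular reduction.
  FpDbl(v: uint64(a.v) * uint64(b.v))

func reduce(d: FpDbl): Fp =
  ## Single reduction from double width back to the field.
  Fp(v: uint32(d.v mod P))

func dotEager(a, b: openArray[Fp]): Fp =
  ## Reduces after every product and after every addition.
  var acc = 0'u64
  for i in 0 ..< a.len:
    acc = (acc + uint64(reduce(mulNoReduce(a[i], b[i])).v)) mod P
  Fp(v: uint32(acc))

func dotLazy(a, b: openArray[Fp]): Fp =
  ## Accumulates unreduced products and reduces once at the end.
  ## Each product is < 2^62, so at most 3 terms are summed here
  ## to stay safely below 2^64.
  var acc = FpDbl(v: 0)
  for i in 0 ..< a.len:
    acc.v += mulNoReduce(a[i], b[i]).v
  reduce(acc)

when isMainModule:
  let a = [Fp(v: 2147483646'u32), Fp(v: 123456789'u32), Fp(v: 2000000000'u32)]
  let b = [Fp(v: 2147483645'u32), Fp(v: 987654321'u32), Fp(v: 1999999999'u32)]
  doAssert dotEager(a, b).v == dotLazy(a, b).v
  echo "dot product mod P: ", dotLazy(a, b).v
```

Both functions return the same field element; the lazy variant trades several reductions for a single one, which is the effect Fp2Dbl and Fp6Dbl would aim for higher up the tower.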

This technique is used by both MCL and BLST to significantly increase signature verification / pairing speed.

The code has been added but is deactivated for now:

# TODO: GCC is adding an unexplainable 30 cycles tax to this function (~10% slow down)
#       for seemingly no reason
when true: # Single-width implementation
  # Clang 330 cycles on i9-9980XE @4.1 GHz
  var a0b0 {.noInit.}, a1b1 {.noInit.}: typeof(r.c0)
  a0b0.prod(a.c0, b.c0)                                         # [1 Mul]
  a1b1.prod(a.c1, b.c1)                                         # [2 Mul]
  r.c0.sum(a.c0, a.c1)  # r0 = (a0 + a1)                        # [2 Mul, 1 Add]
  r.c1.sum(b.c0, b.c1)  # r1 = (b0 + b1)                        # [2 Mul, 2 Add]
  r.c1 *= r.c0          # r1 = (b0 + b1)(a0 + a1)               # [3 Mul, 2 Add] - 𝔽p temporary
  r.c0.diff(a0b0, a1b1) # r0 = a0 b0 - a1 b1                    # [3 Mul, 2 Add, 1 Sub]
  r.c1 -= a0b0          # r1 = (b0 + b1)(a0 + a1) - a0b0        # [3 Mul, 2 Add, 2 Sub]
  r.c1 -= a1b1          # r1 = (b0 + b1)(a0 + a1) - a0b0 - a1b1 # [3 Mul, 2 Add, 3 Sub]
else: # Double-width implementation with lazy reduction
  # Deactivated for now Clang 360 cycles on i9-9980XE @4.1 GHz
  var a0b0 {.noInit.}, a1b1 {.noInit.}: doubleWidth(typeof(r.c0))
  var d {.noInit.}: doubleWidth(typeof(r.c0))
  const msbSet = r.c0.typeof.C.canUseNoCarryMontyMul()
  a0b0.mulNoReduce(a.c0, b.c0)     # 44 cycles - cumul 44
  a1b1.mulNoReduce(a.c1, b.c1)     # 44 cycles - cumul 88
  when msbSet:
    r.c0.sum(a.c0, a.c1)
    r.c1.sum(b.c0, b.c1)
  else:
    r.c0.sumNoReduce(a.c0, a.c1)   # 5 cycles  - cumul 93
    r.c1.sumNoReduce(b.c0, b.c1)   # 5 cycles  - cumul 98
  d.mulNoReduce(r.c0, r.c1)        # 44 cycles - cumul 142
  when msbSet:
    d -= a0b0
    d -= a1b1
  else:
    d.diffNoReduce(d, a0b0)        # 10 cycles - cumul 152
    d.diffNoReduce(d, a1b1)        # 10 cycles - cumul 162
  a0b0.diff(a0b0, a1b1)            # 18 cycles - cumul 180
  r.c0.reduce(a0b0)                # 68 cycles - cumul 248
  r.c1.reduce(d)                   # 68 cycles - cumul 316
  # Single-width [3 Mul, 2 Add, 3 Sub]
  #   3*81 + 2*14 + 3*12 = 307 theoretical cycles
  #   330 measured
  # Double-Width
  #   316 theoretical cycles
  #   365 measured
  #   Reductions can be 2x10 faster using MCL algorithm
  #   but there are still unexplained 50 cycles diff between theo and measured
  #   and unexplained 30 cycles between Clang and GCC
  #   - Function calls?
  #   - push/pop stack?
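
The two branches compute the same complex (Karatsuba-style) Fp2 product. As a hedged illustration of why they agree, the following self-contained toy uses a 31-bit prime in place of the BLS12-381 modulus and plain `uint64` values in place of Constantine's limb arrays. The `Fp2` object and the helper names `mulSchoolbook`, `mulSingleWidth` and `mulLazy` are hypothetical and are not the functions of this PR.

```nim
# Toy model only: names and representation are assumptions for illustration.
const P = 2147483647'u64   # 2^31 - 1; P mod 4 == 3, so i^2 = -1 defines Fp2

type Fp2 = object
  c0, c1: uint64           # coordinates, both kept reduced (< P)

func addMod(a, b: uint64): uint64 = (a + b) mod P
func subMod(a, b: uint64): uint64 = (a + P - b) mod P
func mulMod(a, b: uint64): uint64 = (a * b) mod P   # a, b < P so a*b < 2^62

func mulSchoolbook(a, b: Fp2): Fp2 =
  ## Reference formula with 4 multiplications.
  result.c0 = subMod(mulMod(a.c0, b.c0), mulMod(a.c1, b.c1))
  result.c1 = addMod(mulMod(a.c0, b.c1), mulMod(a.c1, b.c0))

func mulSingleWidth(a, b: Fp2): Fp2 =
  ## Every intermediate is reduced: 3 Mul, 2 Add, 3 Sub.
  let a0b0 = mulMod(a.c0, b.c0)
  let a1b1 = mulMod(a.c1, b.c1)
  let t = mulMod(addMod(a.c0, a.c1), addMod(b.c0, b.c1))
  result.c0 = subMod(a0b0, a1b1)
  result.c1 = subMod(subMod(t, a0b0), a1b1)

func mulLazy(a, b: Fp2): Fp2 =
  ## Unreduced products, only 2 reductions at the end.
  let a0b0 = a.c0 * b.c0                  # < P^2 < 2^62
  let a1b1 = a.c1 * b.c1                  # < P^2 < 2^62
  let d = (a.c0 + a.c1) * (b.c0 + b.c1)   # < 4*P^2 < 2^64, no overflow
  # d - a0b0 - a1b1 equals a0*b1 + a1*b0 >= 0, so no borrow can occur.
  result.c1 = (d - a0b0 - a1b1) mod P
  # a0b0 - a1b1 may be negative: add P^2 (a multiple of P) first.
  result.c0 = (a0b0 + P * P - a1b1) mod P

when isMainModule:
  let a = Fp2(c0: 2147483646'u64, c1: 1234567890'u64)
  let b = Fp2(c0: 987654321'u64, c1: 2147483000'u64)
  doAssert mulSingleWidth(a, b) == mulSchoolbook(a, b)
  doAssert mulLazy(a, b) == mulSchoolbook(a, b)
  echo mulLazy(a, b)
```

The toy sidesteps the carry and spare-bit questions that the `msbSet` branches handle in the real code: here the headroom comes for free because the prime is only 31 bits wide inside a 64-bit word.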

On performance measurement: the "theoretical" number of cycles is the cumulative cycle count of the individual Fp or FpDbl operations composed to implement Fp2. It is not derived from Intel's or Agner Fog's instruction tables.
We use BLS12-381 as a reference and compare with MCL JIT (status-im/nim-blscurve#47):

  • Using Clang as the reference: even an empty prod_complex (with just the temporary variable) takes 9 cycles with Clang, and GCC adds an unexplained 30 cycles on top.
  • A classic Fp2 mul implementation uses 330 cycles (theoretical 307 cycles) with Clang
  • A double-width Fp2 mul uses 365 cycles (theoretical 316 cycles) with Clang

A couple of implementation variations explain this:

  • Fp mul on BLS12-381 is 82 cycles in Constantine vs 107 cycles in MCL
  • Fp sub is 12 cycles vs 9 cycles in MCL (non-constant-time :/)
  • FpDbl subReduce is 18 cycles (no Assembly) vs 13 cycles in MCL (non-constant-time :/)
  • FpDbl mod is 68 cycles vs 59 cycles in MCL

In particular, compared with MCL, 3x25 cycles are gained on mul and 2x10 cycles are lost on mod, which makes single-width more interesting, at least for Fp2 mul.
That said, as a whole, MCL sits at 300 cycles for Fp2 mul while Constantine is at 330 cycles.
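
The theoretical totals and the per-operation gaps quoted above can be re-derived with a few lines of arithmetic. The sketch below only adds up the per-operation figures given in this PR; the constant names are made up for the example, and the 81-cycle mul figure is the one used in the code comment (the comparison list quotes 82).

```nim
# Per-operation cycle counts as quoted in this PR (BLS12-381, Clang).
# Constant names are hypothetical; only the numbers come from the text.
const
  # single-width building blocks
  fpMulCycles = 81
  fpAddCycles = 14
  fpSubCycles = 12
  # double-width building blocks
  mulNoReduceCycles = 44
  sumNoReduceCycles = 5
  diffNoReduceCycles = 10
  dblDiffCycles = 18
  reduceCycles = 68

const
  # [3 Mul, 2 Add, 3 Sub]
  singleWidthTotal = 3*fpMulCycles + 2*fpAddCycles + 3*fpSubCycles
  # 3 mulNoReduce, 2 sumNoReduce, 2 diffNoReduce, 1 diff, 2 reduce
  doubleWidthTotal = 3*mulNoReduceCycles + 2*sumNoReduceCycles +
                     2*diffNoReduceCycles + dblDiffCycles + 2*reduceCycles
  # Gaps between Constantine and MCL on the dominant operations
  mulGain = 3 * (107 - 82)   # three Fp mul in the single-width path
  modLoss = 2 * (68 - 59)    # two FpDbl reductions in the double-width path

when isMainModule:
  echo "single-width theoretical: ", singleWidthTotal            # 307
  echo "double-width theoretical: ", doubleWidthTotal            # 316
  echo "single-width unexplained: ", 330 - singleWidthTotal      # 23
  echo "double-width unexplained: ", 365 - doubleWidthTotal      # 49
  echo "gained on mul vs MCL:     ", mulGain                     # 75
  echo "lost on mod vs MCL:       ", modLoss                     # 18
```

The 49-cycle gap for the double-width path matches the roughly 50 unexplained cycles mentioned in the code comments, and the 18 cycles lost on mod is the "2x10" figure rounded.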

A full assembly implementation might be needed but inline assembly might not be enough to solve the GCC slowness issue (or we need asmNoStackFrame + manipulating the stack).

So for the future:

  • Improve the FpDbl subReduce Assembly so that it's faster than intrinsics instead of 40% slower
  • Improve the FpDbl mod Assembly to reach MCL speed. Note MCL uses AVX registers! https://github.com/herumi/mcl/blob/a145a214/src/fp_generator.hpp#L2206-L2291
  • Implement Fp2Dbl mul as full assembly to avoid the function call tax, the strange differences between theoretical and measured speed, and the extra 30 cycles on GCC vs Clang
  • Implement Fp6Dbl

@mratsim mratsim merged commit d41c653 into master Aug 20, 2020
@mratsim mratsim mentioned this pull request Sep 3, 2020
@mratsim mratsim deleted the double-width-tower branch September 4, 2020 08:58