Double-width tower extension part 1 by mratsim · Pull Request #72 · mratsim/constantine

mratsim · 2020-08-18T07:21:59Z

This introduces double-width tower extension part 1 as detailed in

High-Speed Software Implementation of the Optimal Ate Pairing over Barreto-Naehrig Curves
Jean-Luc Beuchat and Jorge Enrique González Díaz and Shigeo Mitsunari and Eiji Okamoto and Francisco Rodríguez-Henríquez and Tadanori Teruya, 2010
https://eprint.iacr.org/2010/354

which improved Fp2 operations by 30%.

The procedures have pure Nim, x86 Assembly and MULX/ADCX/ADOX variants.
Benchmarks of individual operations have been added.

Double-width operations are implemented by introducing an FpDbl type. In the future, Fp2Dbl and Fp6Dbl types can be introduced to delay reductions for Fp12/pairings; this will be done in future PRs.

This technique is used by both MCL and BLST to significantly increase signature verification / pairing speed:

MCL High-level: https://github.com/herumi/mcl/blob/a145a214/include/mcl/fp_tower.hpp
MCL JIT codegen: https://github.com/herumi/mcl/blob/a145a214/src/fp_generator.hpp
BLST High-level: https://github.com/supranational/blst/blob/2077ff37/src/fp12_tower.c
BLST Assembly: https://github.com/supranational/blst/blob/2077ff37/src/asm/mulx_mont_384-x86_64.pl#L461-L588
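
To make the lazy-reduction idea concrete, here is a minimal sketch over a toy 31-bit prime, where a plain uint64 can stand in for the double-width type. The names `Fp`, `FpDbl`, `Fp2`, `prodSingle` and `prodLazy` are illustrative assumptions and not Constantine's API, and the toy uses `mod` rather than Montgomery reduction. It only shows that keeping products unreduced gives the same Fp2 product with two reductions instead of one per Fp operation.

```nim
# Toy model of lazy reduction; all names here are hypothetical, not Constantine's API.
const P = 2147483647'u64        # 2^31 - 1: prime with P mod 4 = 3, so i^2 = -1 defines Fp2
const PSquared = P * P          # a multiple of P, added to keep differences non-negative

type
  Fp = uint64                   # reduced element, always kept in [0, P)
  FpDbl = uint64                # unreduced "double-width" value (fits in 64 bits for this toy P)
  Fp2 = object
    c0, c1: Fp                  # represents c0 + c1*i

func prodSingle(a, b: Fp2): Fp2 =
  ## Karatsuba with a modular reduction after every Fp operation.
  let a0b0: Fp = (a.c0 * b.c0) mod P
  let a1b1: Fp = (a.c1 * b.c1) mod P
  let t: Fp = (((a.c0 + a.c1) mod P) * ((b.c0 + b.c1) mod P)) mod P
  result.c0 = (a0b0 + P - a1b1) mod P
  result.c1 = (t + 2'u64 * P - a0b0 - a1b1) mod P

func prodLazy(a, b: Fp2): Fp2 =
  ## Same formula, but products stay unreduced; only two reductions remain.
  let a0b0: FpDbl = a.c0 * b.c0
  let a1b1: FpDbl = a.c1 * b.c1
  let d: FpDbl = (a.c0 + a.c1) * (b.c0 + b.c1)   # below 4*P^2, which is below 2^64
  result.c0 = (a0b0 + PSquared - a1b1) mod P     # reduction 1
  result.c1 = (d - a0b0 - a1b1) mod P            # reduction 2; d - a0b0 - a1b1 = a0*b1 + a1*b0 >= 0

let a = Fp2(c0: 3'u64, c1: 4'u64)
let b = Fp2(c0: 5'u64, c1: 6'u64)
# (3 + 4i)(5 + 6i) = -9 + 38i
doAssert prodLazy(a, b) == Fp2(c0: P - 9'u64, c1: 38'u64)
doAssert prodSingle(a, b) == prodLazy(a, b)
```

In a real 381-bit field the unreduced values need twice as many limbs, which is the reason for a dedicated double-width type.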

The code has been added but is deactivated for now:

constantine/constantine/tower_field_extensions/quadratic_extensions.nim

Lines 87 to 139 in 594fcf2

    
    # TODO: GCC is adding an unexplainable 30 cycles tax to this function (~10% slow down)
    #       for seemingly no reason
    when true: # Single-width implementation
               # Clang 330 cycles on i9-9980XE @4.1 GHz
      var a0b0 {.noInit.}, a1b1 {.noInit.}: typeof(r.c0)
      a0b0.prod(a.c0, b.c0)                                         # [1 Mul]
      a1b1.prod(a.c1, b.c1)                                         # [2 Mul]

      r.c0.sum(a.c0, a.c1)  # r0 = (a0 + a1)                        # [2 Mul, 1 Add]
      r.c1.sum(b.c0, b.c1)  # r1 = (b0 + b1)                        # [2 Mul, 2 Add]
      r.c1 *= r.c0          # r1 = (b0 + b1)(a0 + a1)               # [3 Mul, 2 Add] - 𝔽p temporary

      r.c0.diff(a0b0, a1b1) # r0 = a0 b0 - a1 b1                    # [3 Mul, 2 Add, 1 Sub]
      r.c1 -= a0b0          # r1 = (b0 + b1)(a0 + a1) - a0b0        # [3 Mul, 2 Add, 2 Sub]
      r.c1 -= a1b1          # r1 = (b0 + b1)(a0 + a1) - a0b0 - a1b1 # [3 Mul, 2 Add, 3 Sub]

    else: # Double-width implementation with lazy reduction
          # Deactivated for now Clang 360 cycles on i9-9980XE @4.1 GHz
      var a0b0 {.noInit.}, a1b1 {.noInit.}: doubleWidth(typeof(r.c0))
      var d {.noInit.}: doubleWidth(typeof(r.c0))
      const msbSet = r.c0.typeof.C.canUseNoCarryMontyMul()

      a0b0.mulNoReduce(a.c0, b.c0)     # 44 cycles - cumul 44
      a1b1.mulNoReduce(a.c1, b.c1)     # 44 cycles - cumul 88
      when msbSet:
        r.c0.sum(a.c0, a.c1)
        r.c1.sum(b.c0, b.c1)
      else:
        r.c0.sumNoReduce(a.c0, a.c1)   # 5 cycles  - cumul 93
        r.c1.sumNoReduce(b.c0, b.c1)   # 5 cycles  - cumul 98
      d.mulNoReduce(r.c0, r.c1)        # 44 cycles - cumul 142
      when msbSet:
        d -= a0b0
        d -= a1b1
      else:
        d.diffNoReduce(d, a0b0)        # 10 cycles - cumul 152
        d.diffNoReduce(d, a1b1)        # 10 cycles - cumul 162
      a0b0.diff(a0b0, a1b1)            # 18 cycles - cumul 180
      r.c0.reduce(a0b0)                # 68 cycles - cumul 248
      r.c1.reduce(d)                   # 68 cycles - cumul 316

    # Single-width [3 Mul, 2 Add, 3 Sub]
    #    3*81 + 2*14 + 3*12 = 307 theoretical cycles
    #    330 measured
    # Double-Width
    #    316 theoretical cycles
    #    365 measured
    #    Reductions can be 2x10 faster using MCL algorithm
    #    but there are still unexplained 50 cycles diff between theo and measured
    #    and unexplained 30 cycles between Clang and GCC
    #    - Function calls?
    #    - push/pop stack?

On performance measurement: the "theoretical" number of cycles is the sum of the measured cycle counts of each Fp or FpDbl operation composed to implement Fp2 multiplication. It is not derived from Intel's or Agner Fog's instruction tables.
We use BLS12-381 as a reference and compare with MCL JIT (status-im/nim-blscurve#47):

Using Clang as the baseline, even an empty prod_complex (with just the temporary variables), which takes 9 cycles with Clang, incurs an unexplained 30 extra cycles with GCC.
A classic Fp2 mul implementation takes 330 cycles (theoretical: 307 cycles) with Clang.
A double-width Fp2 mul takes 365 cycles (theoretical: 316 cycles) with Clang.
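
As a worked check of the two theoretical totals, the sketch below sums the per-operation cycle counts given in the comments of the quoted code (using the 81-cycle Fp mul figure from those comments). The constant names are made up for this illustration.

```nim
# Cycle counts copied from the comments in the quoted code (Clang, BLS12-381).
# Constant names are hypothetical, chosen for this illustration only.
const
  fpMul = 81
  fpAdd = 14
  fpSub = 12
  singleWidthTheo = 3*fpMul + 2*fpAdd + 3*fpSub   # 243 + 28 + 36

  mulNoReduce = 44
  sumNoReduce = 5
  diffNoReduce = 10
  dblDiff = 18        # a0b0.diff(a0b0, a1b1) on double-width operands
  reduce = 68
  doubleWidthTheo = 3*mulNoReduce + 2*sumNoReduce + 2*diffNoReduce + dblDiff + 2*reduce

static:
  doAssert singleWidthTheo == 307
  doAssert doubleWidthTheo == 316

# Gap between the measured cycles and the summed per-operation cycles
echo "single-width overhead: ", 330 - singleWidthTheo, " cycles"
echo "double-width overhead: ", 365 - doubleWidthTheo, " cycles"
```

The second gap is roughly the 50 unexplained cycles mentioned in the code comments.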

A few implementation differences between Constantine and MCL explain this:

Fp mul on BLS12-381 is 82 cycles in Constantine versus 107 cycles in MCL
Fp sub is 12 cycles in Constantine versus 9 cycles in MCL (which is not constant-time :/)
FpDbl subReduce is 18 cycles in Constantine (no Assembly) versus 13 cycles in MCL (which is not constant-time :/)
FpDbl mod is 68 cycles in Constantine versus 59 cycles in MCL

In particular, 3x25 cycles are gained on mul and 2x10 cycles are lost on mod, which makes single-width more interesting, at least for Fp2 mul.
That said, as a whole, MCL sits at 300 cycles for Fp2 mul while Constantine is at 330 cycles.

A full assembly implementation might be needed, but inline assembly might not be enough to solve the GCC slowness issue (or we would need asmNoStackFrame plus manual stack manipulation).

So for the future:

Improve the FpDbl subReduce Assembly so that it's faster than intrinsics instead of 40% slower
Improve the FpDbl mod Assembly to reach MCL speed. Note MCL uses AVX registers! https://github.com/herumi/mcl/blob/a145a214/src/fp_generator.hpp#L2206-L2291
Implement Fp2Dbl mul as full assembly to avoid the function call tax, the strange differences between theoretical and measured speed, and the extra 30 cycles on GCC vs Clang
Implement Fp6Dbl

mratsim added 24 commits July 25, 2020 15:09

Implement double-width field multiplication for double-width towering

0aabd6e

Fp2 mul acceleration via double-width lazy reduction (pure Nim)

8ad4c0f

Inline assembly for basic add and sub

5e18ecc

Use 2 registers instead of 12+ for ASM conditional copy

268172d

Prepare assembly for extended multiprecision multiplication support

07f9475

Add assembly for mul

4b7ba2f

initial implementation of assembly reduction

eda83de

stash current progress of assembly reduction

2195b9c

Fix clobbering issue, only P256 comparison remain buggy

f7b9943

Fix asm montgomery reduction for NIST P256 as well

fa3d094

MULX/ADCX/ADOX multi-precision multiplication

90255ca

MULX/ADCX/ADOX reduction v1

403f883

Add (deactivated) assembly for double-width substraction + rework benches

c9d3076

Add bench to nimble and deactivate double-width for now. slower than classic

594fcf2

Fix x86-32 running out of registers for mul

9b59606

Clang needs to be at v9 to support flag output constraints (Xcode 11.4.2 / OSX Catalina)

eacec57

32-bit doesn't have enough registers for ASM mul

4c2a571

Fix again Travis Clang 9 issues

37473c7

LLVM 9 is not whitelisted in travis

4fea06b

deactivated assembler with travis clang

e19783a

syntax error

5db4fa4

another

1f3cae1

...

f078e99

missing space, yeah ...

fd9b60c

mratsim merged commit d41c653 into master Aug 20, 2020

mratsim mentioned this pull request Sep 3, 2020

Endomorphism G2 #79

Merged

mratsim deleted the double-width-tower branch September 4, 2020 08:58

mratsim mentioned this pull request Feb 10, 2021

Double-precision cubic towering + pairing #158

Merged

mratsim merged 24 commits into master from double-width-tower
