…4.2 / OSX Catalina)
Merged
This introduces part 1 of the double-width tower extension, as detailed in
Jean-Luc Beuchat, Jorge Enrique González Díaz, Shigeo Mitsunari, Eiji Okamoto, Francisco Rodríguez-Henríquez and Tadanori Teruya, 2010
https://eprint.iacr.org/2010/354
which improved Fp2 operation by 30%
The procedures have pure Nim, x86 assembly, and MULX/ADCX/ADOX variants.
Benchmarks of individual operations have been added.
Double-width operations are implemented by introducing an FpDbl type. In the future, Fp2Dbl and Fp6Dbl types can be introduced to delay reductions for Fp12 and pairings; this will be done in future PRs.
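To illustrate why a double-width type lets reductions be delayed, here is a minimal Python sketch (not Constantine's Nim code). It assumes Fp2 is built as Fp[i]/(i² + 1), as is usual for BLS12-381, and uses Python integers to stand in for both Fp and FpDbl values. The names `counter`, `redc`, `fp2_mul_eager` and `fp2_mul_lazy` are hypothetical and only serve the illustration.

```python
# Toy model of lazy reduction for Fp2 multiplication.
# Assumption: Fp2 = Fp[i] / (i^2 + 1). Python ints model both reduced Fp
# values and unreduced double-width (FpDbl-like) values.

# BLS12-381 base field modulus (the sketch works for any modulus).
P = 0x1a0111ea397fe69a4b1ba7b6434bacd764774b84f38512bf6730d2a0f6b0f6241eabfffeb153ffffb9feffffffffaaab

# Counts how many double-width -> single-width reductions were performed.
counter = {"redc": 0}

def redc(x_dbl):
    """Reduce a double-width value modulo P and record the call."""
    counter["redc"] += 1
    return x_dbl % P

def fp2_mul_eager(a, b):
    """Karatsuba product where every Fp multiplication is reduced at once."""
    a0, a1 = a
    b0, b1 = b
    v0 = redc(a0 * b0)
    v1 = redc(a1 * b1)
    t = redc((a0 + a1) * (b0 + b1))
    # Modular additions/subtractions are not counted as reductions here.
    return ((v0 - v1) % P, (t - v0 - v1) % P)

def fp2_mul_lazy(a, b):
    """Same product, but the three products stay double-width and only
    the two output coordinates are reduced."""
    a0, a1 = a
    b0, b1 = b
    d0 = a0 * b0                # unreduced, double-width
    d1 = a1 * b1                # unreduced, double-width
    d2 = (a0 + a1) * (b0 + b1)  # unreduced, double-width
    # Python's % hides the fact that a fixed-width implementation must keep
    # these differences non-negative and inside the double-width range.
    c0 = redc(d0 - d1)
    c1 = redc(d2 - d0 - d1)
    return (c0, c1)

a = (P - 5, 7)
b = (11, P - 13)
assert fp2_mul_lazy(a, b) == fp2_mul_eager(a, b)
```

In this model the eager variant performs three reductions per Fp2 multiplication and the lazy variant two. Actual cycle savings depend on the relative cost of multiplication and reduction on the target CPU and compiler.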
This technique is used by both MCL and BLST to significantly increase signature verification / pairing speed.
The code has been added but left deactivated:
constantine/constantine/tower_field_extensions/quadratic_extensions.nim
Lines 87 to 139 in 594fcf2
Performance measurement: the "theoretical" number of cycles is the cumulative cycle count of the Fp and FpDbl operations composed to implement Fp2. It is not the number of cycles taken from Intel's or Agner Fog's instruction tables.
We use BLS12-381 as a reference and compare with MCL JIT (status-im/nim-blscurve#47):
For prod_complex (with just the temp variable), Clang takes 9 cycles; GCC has an unexplained 30 cycles added.
A couple of implementation variations explain this:
In particular, 3x25 cycles are gained on mul and 2x10 cycles are lost on mod, which makes single-width more interesting, at least for Fp2 mul.
That said, as a whole, MCL sits at 300 cycles for Fp2 mul while Constantine is at 330 cycles.
A full assembly implementation might be needed, as inline assembly might not be enough to solve the GCC slowness issue (or we would need asmNoStackFrame plus manual stack manipulation).
So for the future: