Skip to content

Conversation

@vneiger
Copy link
Collaborator

@vneiger vneiger commented Nov 15, 2025

Here are some enhancements of evaluation at an nmod point for nmod_poly. I don't have specific uses for this, this was done as a "warmup" for writing more efficient implementations of reduction modulo polynomials x^n - c for n >= 1 (draft started at #2470 , itself useful for the in-progress FFT #2107). But since this seems to accelerate the existing code in all cases, this might as well be merged in(?).

  1. Acceleration of existing cases (mainly through unrolling loops):
  • polynomials of small length (up to about 10) are unaffected
  • speed-up becomes substantial for length 32 (factor 2)
  • speed-up between 2.5 and 3 for large lengths

See first table below: for each modulus bitsize, the first column measures the old version, the second column measures the new one. (And some time ago, we only had the very first column, for all moduli...)

  1. Adding specific functions for evaluation at +1 and -1.
  • the main evaluation function detects these cases and chooses the relevant function depending on the bitsize of the modulus
  • independently of the modulus bitsize, the speed-up is consistently about 4, versus the best general (not specific to 1 or -1) variant we have at hand for that bitsize (see second table below).
Intel(R) Core(TM) Ultra 7 165H
length |    64 bits    |    63 bits    |   <= 62 bits
1      | 6.71    8.57  | 13.15   13.18 | 13.15   13.17
2      | 4.52    4.68  | 3.42    4.04  | 3.37    3.58
3      | 4.04    4.18  | 2.98    3.63  | 2.47    2.61
4      | 4.91    4.84  | 3.01    3.25  | 2.33    2.34
6      | 5.93    5.90  | 3.39    3.49  | 2.17    2.19
8      | 6.61    6.65  | 3.73    3.84  | 2.10    2.12
10     | 7.08    7.15  | 4.05    4.12  | 2.04    2.10
12     | 8.02    6.14  | 4.32    4.21  | 2.09    2.16
16     | 9.22    5.71  | 4.71    3.88  | 2.28    3.07
20     | 9.95    5.46  | 5.34    3.57  | 2.45    2.88
32     | 11.06   5.09  | 6.34    3.29  | 2.81    2.47
45     | 11.59   4.91  | 6.77    3.13  | 3.46    2.30
64     | 12.02   4.75  | 7.14    2.96  | 3.95    2.15
128    | 12.46   4.59  | 7.59    2.85  | 4.51    2.02
256    | 12.78   4.52  | 7.80    2.73  | 4.87    1.92
1024   | 12.90   4.41  | 7.92    2.64  | 5.09    1.87
8192   | 12.93   4.41  | 8.06    2.65  | 5.16    1.86
65536  | 12.99   4.39  | 7.98    2.63  | 5.19    1.86
200000 | 13.03   4.42  | 7.97    2.65  | 5.20    1.86
1000000| 13.04   4.48  | 8.01    2.69  | 5.22    1.92
AMD Ryzen 7 PRO 7840U
nbits = 62
length   generic precomp lazy    one     mone
1        8.07    7.51    7.69    6.11    6.23
2        6.34    4.39    4.18    4.72    5.06
3        6.38    4.36    3.40    3.15    3.61
4        6.75    4.41    3.25    2.55    2.95
6        7.76    5.28    3.92    2.23    1.90
8        8.86    5.44    3.71    1.71    1.88
10       9.96    5.74    3.70    1.45    1.53
12       7.87    5.39    3.55    1.20    1.32
16       7.02    5.03    4.13    1.27    1.32
20       6.76    4.78    3.73    1.05    1.08
32       6.37    4.47    3.24    0.88    0.92
45       6.00    4.14    2.79    0.73    0.75
64       5.66    3.93    2.58    0.75    0.74
128      5.46    3.72    2.31    0.64    0.66
256      5.34    3.62    2.21    0.60    0.59
1024     5.21    3.55    2.08    0.56    0.55
8192     5.21    3.54    2.06    0.57    0.56
65536    5.32    3.55    2.15    0.58    0.55
200000   5.44    3.55    2.08    0.57    0.55
1000000  5.24    3.56    2.07    0.58    0.55

nbits = 63
length   generic precomp one     mone
1        8.18    7.94    6.36    6.38
2        6.38    4.46    5.09    4.78
3        6.36    4.29    3.63    3.64
4        6.61    4.56    3.00    3.53
6        8.13    5.42    2.17    2.25
8        9.00    5.49    2.01    2.03
10       9.73    5.53    1.77    1.79
12       8.04    5.38    1.62    1.62
16       7.13    5.28    1.44    1.43
20       6.82    4.73    1.29    1.30
32       6.20    4.35    1.10    1.11
45       6.04    4.17    1.02    1.07
64       5.67    3.95    1.02    1.00
128      5.62    3.70    0.91    0.93
256      5.33    3.75    0.88    0.88
1024     5.26    3.55    0.86    0.86
8192     5.23    3.66    0.88    0.87
65536    5.25    3.52    0.86    0.85
200000   5.29    3.55    0.85    0.85
1000000  5.27    3.55    0.90    0.87

nbits = 64
length   generic one     mone
1        8.38    6.40    6.33
2        6.45    5.42    5.15
3        6.37    3.40    3.67
4        6.59    2.67    3.08
6        7.89    2.54    2.35
8        8.96    2.18    2.26
10       9.76    2.11    2.03
12       7.91    1.85    1.92
16       7.18    1.70    1.74
20       6.83    1.63    1.65
32       6.19    1.46    1.48
45       6.07    1.42    1.40
64       5.88    1.35    1.35
128      5.50    1.27    1.28
256      5.35    1.24    1.24
1024     5.26    1.21    1.21
8192     5.39    1.24    1.25
65536    5.23    1.22    1.23
200000   5.25    1.22    1.22
1000000  5.26    1.25    1.24

  previous version
- makes test more robust (more iterations; part of it focusing on large
  bitlengths)
@albinahlback
Copy link
Collaborator

Unrolling is useful here because compilers cannot unroll well whenever there is assembly in the loop? Because we push -funroll-loops for nmod_poly.

Can you make a comparison when using Clang? It does not use assembly for umul_ppmm.

@vneiger
Copy link
Collaborator Author

vneiger commented Nov 15, 2025

Unrolling is useful here because compilers cannot unroll well whenever there is assembly in the loop? Because we push -funroll-loops for nmod_poly.

Well, I was initially surprised to see that unrolling helped here, on such simple loops. But then, actually unrolling by hand did involve some logic: to avoid a dependency in consecutive loop iterations, I use the fourth power c^4 of the point c at which we evaluate. I guess the compiler will not introduce this kind of logic --- especially since this is an nmod power, not just c*c*c*c, and I suspect that there is maybe no plain, simple unrolling that would be efficient...? (I'm actually hoping that the explanation is not this one, because this is not the case anymore in the more general remainder modulo x^n - c: as soon as n >= 4 or so, you can unroll more trivially without using powers of c.)

Can you make a comparison when using Clang? It does not use assembly for umul_ppmm.

Sure, it's interesting to know what this gives in any case. I'll have a look and report here.

@fredrik-johansson
Copy link
Collaborator

fredrik-johansson commented Nov 16, 2025

Evaluation at 1 is just the sum of coefficients. Would it not be faster to add_ssaaaa up everything and then reduce?

Or does the compiler generate good SIMD code for the conditional subtractions?

Another idea that could be good for SIMD is to do the sum both in ulong (i.e. mod 2^64) and double simultaneously and finally combine the low and high bits.

The general case can also be reduced asymptotically to dot products with a negligible number of modular reductions.

@vneiger
Copy link
Collaborator Author

vneiger commented Nov 16, 2025

Unrolling is useful here because compilers cannot unroll well whenever there is assembly in the loop? Because we push -funroll-loops for nmod_poly.

Can you make a comparison when using Clang? It does not use assembly for umul_ppmm.

Here is the assembly with gcc and clang. Through a (too) quick inspection, I don't see a huge difference, but maybe you will.
ev_nmod.zip
Also, not visible in the files, but -funroll-loops for gcc does basically nothing for the manually unrolled version of the function, and does not do much for the "natural" version (it does a little thing but this does not look like unrolling the main loop).

I've tried to run the profile file with clang but got into an issue. prof_repeat seems to never finish, just increasing the count variable indefinitely. I tried making sure the content of the profiled part is not compiler-optimized away or something like this, but this did not help. I thought I had already used clang and prof_repeat before, without this issue. Any quick hint at what I might be doing wrong? otherwise I'll open an issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants