Accelerate single point evaluation for `nmod_poly` #2492

vneiger · 2025-11-15T18:31:28Z

Here are some enhancements of evaluation at an nmod point for nmod_poly. I don't have specific uses for this; it was done as a "warm-up" for writing more efficient implementations of reduction modulo polynomials x^n - c for n >= 1 (draft started at #2470, itself useful for the in-progress FFT #2107). But since this seems to accelerate the existing code in all cases, perhaps it might as well be merged.

Acceleration of existing cases (mainly through unrolling loops):

- polynomials of small length (up to about 10) are unaffected
- speed-up becomes substantial for length 32 (factor 2)
- speed-up between 2.5 and 3 for large lengths

See first table below: for each modulus bitsize, the first column measures the old version, the second column measures the new one. (And some time ago, we only had the very first column, for all moduli...)

Adding specific functions for evaluation at +1 and -1.

- the main evaluation function detects these cases and chooses the relevant function depending on the bitsize of the modulus
- independently of the modulus bitsize, the speed-up is consistently about 4, versus the best general (not specific to 1 or -1) variant we have at hand for that bitsize (see second table below).
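As a rough illustration of why the points +1 and -1 admit dedicated code: p(1) is the sum of the coefficients and p(-1) is the alternating sum, so no multiplication is needed. The sketch below is not the PR's code; the names `addmod`, `submod`, `eval_one` and `eval_mone` are hypothetical, and coefficients are assumed to be already reduced modulo n.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not FLINT's API): a + b mod n, for a, b < n. */
static uint64_t addmod(uint64_t a, uint64_t b, uint64_t n)
{
    uint64_t s = a + b;          /* may wrap around 2^64 when n has 64 bits */
    return (s < a || s >= n) ? s - n : s;
}

/* Hypothetical helper: a - b mod n, for a, b < n. */
static uint64_t submod(uint64_t a, uint64_t b, uint64_t n)
{
    return (a >= b) ? a - b : a - b + n;
}

/* p(1) mod n: the sum of the coefficients. */
uint64_t eval_one(const uint64_t *p, size_t len, uint64_t n)
{
    assert(n != 0);
    uint64_t s = 0;
    for (size_t i = 0; i < len; i++)
        s = addmod(s, p[i], n);
    return s;
}

/* p(-1) mod n: even-index coefficients minus odd-index coefficients. */
uint64_t eval_mone(const uint64_t *p, size_t len, uint64_t n)
{
    assert(n != 0);
    uint64_t s = 0;
    for (size_t i = 0; i < len; i++)
        s = (i % 2 == 0) ? addmod(s, p[i], n) : submod(s, p[i], n);
    return s;
}
```

For example, with coefficients {3, 5, 6, 1} and n = 7, the sum 15 reduces to 1 and the alternating sum 3 - 5 + 6 - 1 is 3.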

Intel(R) Core(TM) Ultra 7 165H
length |    64 bits    |    63 bits    |   <= 62 bits
       | old     new   | old     new   | old     new
1      | 6.71    8.57  | 13.15   13.18 | 13.15   13.17
2      | 4.52    4.68  | 3.42    4.04  | 3.37    3.58
3      | 4.04    4.18  | 2.98    3.63  | 2.47    2.61
4      | 4.91    4.84  | 3.01    3.25  | 2.33    2.34
6      | 5.93    5.90  | 3.39    3.49  | 2.17    2.19
8      | 6.61    6.65  | 3.73    3.84  | 2.10    2.12
10     | 7.08    7.15  | 4.05    4.12  | 2.04    2.10
12     | 8.02    6.14  | 4.32    4.21  | 2.09    2.16
16     | 9.22    5.71  | 4.71    3.88  | 2.28    3.07
20     | 9.95    5.46  | 5.34    3.57  | 2.45    2.88
32     | 11.06   5.09  | 6.34    3.29  | 2.81    2.47
45     | 11.59   4.91  | 6.77    3.13  | 3.46    2.30
64     | 12.02   4.75  | 7.14    2.96  | 3.95    2.15
128    | 12.46   4.59  | 7.59    2.85  | 4.51    2.02
256    | 12.78   4.52  | 7.80    2.73  | 4.87    1.92
1024   | 12.90   4.41  | 7.92    2.64  | 5.09    1.87
8192   | 12.93   4.41  | 8.06    2.65  | 5.16    1.86
65536  | 12.99   4.39  | 7.98    2.63  | 5.19    1.86
200000 | 13.03   4.42  | 7.97    2.65  | 5.20    1.86
1000000| 13.04   4.48  | 8.01    2.69  | 5.22    1.92

AMD Ryzen 7 PRO 7840U
nbits = 62
length   generic precomp lazy    one     mone
1        8.07    7.51    7.69    6.11    6.23
2        6.34    4.39    4.18    4.72    5.06
3        6.38    4.36    3.40    3.15    3.61
4        6.75    4.41    3.25    2.55    2.95
6        7.76    5.28    3.92    2.23    1.90
8        8.86    5.44    3.71    1.71    1.88
10       9.96    5.74    3.70    1.45    1.53
12       7.87    5.39    3.55    1.20    1.32
16       7.02    5.03    4.13    1.27    1.32
20       6.76    4.78    3.73    1.05    1.08
32       6.37    4.47    3.24    0.88    0.92
45       6.00    4.14    2.79    0.73    0.75
64       5.66    3.93    2.58    0.75    0.74
128      5.46    3.72    2.31    0.64    0.66
256      5.34    3.62    2.21    0.60    0.59
1024     5.21    3.55    2.08    0.56    0.55
8192     5.21    3.54    2.06    0.57    0.56
65536    5.32    3.55    2.15    0.58    0.55
200000   5.44    3.55    2.08    0.57    0.55
1000000  5.24    3.56    2.07    0.58    0.55

nbits = 63
length   generic precomp one     mone
1        8.18    7.94    6.36    6.38
2        6.38    4.46    5.09    4.78
3        6.36    4.29    3.63    3.64
4        6.61    4.56    3.00    3.53
6        8.13    5.42    2.17    2.25
8        9.00    5.49    2.01    2.03
10       9.73    5.53    1.77    1.79
12       8.04    5.38    1.62    1.62
16       7.13    5.28    1.44    1.43
20       6.82    4.73    1.29    1.30
32       6.20    4.35    1.10    1.11
45       6.04    4.17    1.02    1.07
64       5.67    3.95    1.02    1.00
128      5.62    3.70    0.91    0.93
256      5.33    3.75    0.88    0.88
1024     5.26    3.55    0.86    0.86
8192     5.23    3.66    0.88    0.87
65536    5.25    3.52    0.86    0.85
200000   5.29    3.55    0.85    0.85
1000000  5.27    3.55    0.90    0.87

nbits = 64
length   generic one     mone
1        8.38    6.40    6.33
2        6.45    5.42    5.15
3        6.37    3.40    3.67
4        6.59    2.67    3.08
6        7.89    2.54    2.35
8        8.96    2.18    2.26
10       9.76    2.11    2.03
12       7.91    1.85    1.92
16       7.18    1.70    1.74
20       6.83    1.63    1.65
32       6.19    1.46    1.48
45       6.07    1.42    1.40
64       5.88    1.35    1.35
128      5.50    1.27    1.28
256      5.35    1.24    1.24
1024     5.26    1.21    1.21
8192     5.39    1.24    1.25
65536    5.23    1.22    1.23
200000   5.25    1.22    1.22
1000000  5.26    1.25    1.24


albinahlback · 2025-11-15T20:01:40Z

Unrolling is useful here because compilers cannot unroll well whenever there is assembly in the loop? Because we push -funroll-loops for nmod_poly.

Can you make a comparison when using Clang? It does not use assembly for umul_ppmm.

vneiger · 2025-11-15T20:47:54Z

> Unrolling is useful here because compilers cannot unroll well whenever there is assembly in the loop? Because we push -funroll-loops for nmod_poly.

Well, I was initially surprised to see that unrolling helped here, on such simple loops. But unrolling by hand did involve some logic: to avoid a dependency between consecutive loop iterations, I use the fourth power c^4 of the point c at which we evaluate. I guess the compiler will not introduce this kind of logic, especially since this is an nmod power, not just c*c*c*c, and I suspect that there may be no plain, simple unrolling that would be efficient. (I'm actually hoping that this is not the explanation, because it no longer applies to the more general remainder modulo x^n - c: as soon as n >= 4 or so, you can unroll more trivially without using powers of c.)
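To make the c^4 idea concrete, here is a minimal sketch, not the code from this PR: the polynomial is split into four interleaved sub-polynomials in x^4, each evaluated by its own Horner chain with multiplier c^4, so the four chains do not depend on each other within an iteration. The names `mulmod`, `addmod`, `eval_horner` and `eval_unrolled4` are hypothetical, and the modular arithmetic goes through `unsigned __int128` (a GCC/Clang extension) for clarity rather than speed; the actual arrangement in the PR may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* a * b mod n through a 128-bit product (clear, but slow). */
static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t n)
{
    return (uint64_t) (((unsigned __int128) a * b) % n);
}

/* a + b mod n through a 128-bit sum. */
static uint64_t addmod(uint64_t a, uint64_t b, uint64_t n)
{
    return (uint64_t) (((unsigned __int128) a + b) % n);
}

/* Reference version: each iteration depends on the previous one. */
uint64_t eval_horner(const uint64_t *p, size_t len, uint64_t c, uint64_t n)
{
    assert(n != 0);
    uint64_t r = 0;
    for (size_t i = len; i-- > 0; )
        r = addmod(mulmod(r, c, n), p[i], n);
    return r;
}

/* Four independent chains r_j = sum_k p[4k+j] * c^(4k), combined at the end. */
uint64_t eval_unrolled4(const uint64_t *p, size_t len, uint64_t c, uint64_t n)
{
    assert(n != 0);
    uint64_t c2 = mulmod(c, c, n);
    uint64_t c4 = mulmod(c2, c2, n);
    size_t full = len / 4;       /* number of complete blocks of 4 */
    size_t rem = len % 4;        /* coefficients in the partial top block */
    uint64_t r0 = 0, r1 = 0, r2 = 0, r3 = 0;

    /* The partial top block initialises the chains it touches. */
    if (rem > 0) r0 = p[4 * full];
    if (rem > 1) r1 = p[4 * full + 1];
    if (rem > 2) r2 = p[4 * full + 2];

    for (size_t k = full; k-- > 0; )
    {
        r0 = addmod(mulmod(r0, c4, n), p[4 * k], n);
        r1 = addmod(mulmod(r1, c4, n), p[4 * k + 1], n);
        r2 = addmod(mulmod(r2, c4, n), p[4 * k + 2], n);
        r3 = addmod(mulmod(r3, c4, n), p[4 * k + 3], n);
    }

    /* p(c) = r0 + c * (r1 + c * (r2 + c * r3)) */
    uint64_t r = addmod(mulmod(r3, c, n), r2, n);
    r = addmod(mulmod(r, c, n), r1, n);
    return addmod(mulmod(r, c, n), r0, n);
}
```

Both functions return the same value for any input; the unrolled one simply exposes four multiplications per iteration that can be in flight at once.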

> Can you make a comparison when using Clang? It does not use assembly for umul_ppmm.

Sure, it's interesting to know what this gives in any case. I'll have a look and report here.

fredrik-johansson · 2025-11-16T01:06:44Z

Evaluation at 1 is just the sum of coefficients. Would it not be faster to add_ssaaaa up everything and then reduce?

Or does the compiler generate good SIMD code for the conditional subtractions?
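The "add everything up, then reduce" suggestion can be sketched as follows. This is only an illustration under assumptions: the function name `eval_one_delayed` is hypothetical, the two-word accumulation is written with a plain carry test instead of FLINT's `add_ssaaaa` macro, and the final reduction uses `unsigned __int128` (a GCC/Clang extension) rather than a precomputed inverse.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* p(1) mod n with a two-word accumulator (hi, lo) and one final reduction. */
uint64_t eval_one_delayed(const uint64_t *p, size_t len, uint64_t n)
{
    assert(n != 0);
    uint64_t lo = 0, hi = 0;
    for (size_t i = 0; i < len; i++)
    {
        lo += p[i];
        hi += (lo < p[i]);       /* carry out of the low word */
    }
    /* hi counts at most len carries, so it cannot overflow. */
    unsigned __int128 s = ((unsigned __int128) hi << 64) | lo;
    return (uint64_t) (s % n);
}
```

The loop body contains no comparison against n, which is the point of the suggestion: the only modular reduction happens once, after the loop.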

Another idea that could be good for SIMD is to do the sum both in ulong (i.e. mod 2^64) and double simultaneously and finally combine the low and high bits.

The general case can also be reduced asymptotically to dot products with a negligible number of modular reductions.
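One way to read the dot-product remark, sketched under stated assumptions: precompute 1, c, ..., c^(k-1), take an unreduced dot product of each block of k coefficients with these powers, and chain the blocks with c^k, so there is one reduction per block instead of one per coefficient. The function name `eval_blocked` and the block-size parameter `k` are hypothetical. To keep the overflow analysis trivial, this sketch assumes n < 2^32 and reduced coefficients (each product then fits in 64 bits and the 128-bit accumulator cannot overflow); `unsigned __int128` is a GCC/Clang extension.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* p(c) mod n using blocks of k coefficients; assumes n < 2^32 and p[i] < n. */
uint64_t eval_blocked(const uint64_t *p, size_t len, uint64_t c,
                      uint64_t n, size_t k)
{
    assert(n != 0 && n < (UINT64_C(1) << 32) && k > 0);

    uint64_t *pw = malloc(k * sizeof *pw);   /* pw[j] = c^j mod n */
    if (pw == NULL)
        abort();
    c %= n;
    pw[0] = 1 % n;
    for (size_t j = 1; j < k; j++)
        pw[j] = pw[j - 1] * c % n;           /* both factors are < 2^32 */
    uint64_t ck = pw[k - 1] * c % n;         /* c^k mod n */

    uint64_t r = 0;
    size_t nblocks = (len + k - 1) / k;
    for (size_t b = nblocks; b-- > 0; )
    {
        size_t start = b * k;
        size_t end = (len - start < k) ? len : start + k;
        /* r * c^k plus the block's dot product, accumulated unreduced */
        unsigned __int128 d = (unsigned __int128) r * ck;
        for (size_t i = start; i < end; i++)
            d += (unsigned __int128) p[i] * pw[i - start];
        r = (uint64_t) (d % n);              /* single reduction per block */
    }
    free(pw);
    return r;
}
```

With k around sqrt(len), the total number of reductions is on the order of 2*sqrt(len): about k for the table of powers and len/k for the blocks.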

vneiger · 2025-11-16T14:17:04Z

> Unrolling is useful here because compilers cannot unroll well whenever there is assembly in the loop? Because we push -funroll-loops for nmod_poly.

> Can you make a comparison when using Clang? It does not use assembly for umul_ppmm.

Here is the assembly with gcc and clang. Through a (too) quick inspection, I don't see a huge difference, but maybe you will.
ev_nmod.zip
Also, not visible in the files: with gcc, -funroll-loops does basically nothing for the manually unrolled version of the function, and does not do much for the "natural" version (it changes a little, but this does not look like unrolling the main loop).

I've tried to run the profile file with clang but ran into an issue: prof_repeat seems to never finish, just increasing the count variable indefinitely. I tried making sure the content of the profiled part is not optimized away by the compiler, but this did not help. I thought I had already used clang and prof_repeat before without this issue. Any quick hint at what I might be doing wrong? Otherwise I'll open an issue.

vneiger added 2 commits November 15, 2025 11:23

8263b5b
- speed-up nmod evaluation via unrolling loops; always faster than previous version
- makes test more robust (more iterations; part of it focusing on large bitlengths)

655cfff
- add specific, faster evaluation versions for points +1 and -1
- insert them in testing and profile
