content/english/hpc/number-theory/euclid-extended.md (+4 −4)
@@ -9,9 +9,9 @@ $$
a^{\phi(m)} \equiv 1 \pmod m
$$
-where $\phi(m)$ is [Euler's totient function](https://en.wikipedia.org/wiki/Euler%27s_totient_function) defined as the number of positive integers $x < m$ that are coprime with $m$. In particular case when $m$ is a prime, then all the $m - 1$ residues are coprime and $\phi(m) = m - 1$, yielding the Fermat's theorem.
+where $\phi(m)$ is [Euler's totient function](https://en.wikipedia.org/wiki/Euler%27s_totient_function) defined as the number of positive integers $x < m$ that are coprime with $m$. In the special case when $m$ is a prime, all the $m - 1$ residues are coprime and $\phi(m) = m - 1$, yielding Fermat's theorem.
-This lets us calculate the inverse of $a$ as $a^{\phi(m) - 1}$ if we know $\phi(m)$, but calculating it is, in turn, not so fast: you usually need to obtain the factorization of $m$. There is a more general method that works by modifying the [the Euclidean algorthm](/hpc/algorithms/gcd/).
+This lets us calculate the inverse of $a$ as $a^{\phi(m) - 1}$ if we know $\phi(m)$, but, in turn, calculating it is not so fast: you usually need to obtain the [factorization](/hpc/algorithms/factorization/) of $m$ to do it. There is a more general method that works by modifying [the Euclidean algorithm](/hpc/algorithms/gcd/).
### Algorithm
@@ -95,6 +95,6 @@ int inverse(int a) {
}
```
-Note that, unlike binary exponentiation, the running time depends on the value of $a$. For example, for this particular value of $m$ ($10^9 + 7$), the worst input happens to be 564400443, on which the algorithm performs 37 iterations and taking 250ns.
+Note that, unlike binary exponentiation, the running time depends on the value of $a$. For example, for this particular value of $m$ ($10^9 + 7$), the worst input happens to be 564400443, for which the algorithm performs 37 iterations and takes 250ns.
-**Exercise**. Try to adapt the same technique for binary GCD (it won't give performance speedup though unless you are better than me at optimization).
+**Exercise**. Try to adapt the same technique for the [binary GCD](/hpc/algorithms/gcd/#binary-gcd) (it won't give a performance speedup, though, unless you are better than me at optimization).
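
For context, the `inverse` referenced in this hunk computes $a^{-1} \bmod m$ with the extended Euclidean algorithm. A minimal self-contained sketch of the technique — variable names and structure are illustrative, not the book's verbatim code:

```c++
#include <utility>

const int M = 1e9 + 7;

// returns x such that a * x = 1 (mod M), assuming gcd(a, M) = 1;
// each iteration performs one step of the Euclidean algorithm
int inverse(int a) {
    int b = M, x = 1, y = 0;
    while (a != 1) {
        // invariants: x * a0 = a (mod M) and y * a0 = b (mod M),
        // where a0 is the original argument
        int q = b / a;
        b -= q * a;
        y -= q * x;
        std::swap(a, b);
        std::swap(x, y);
    }
    return x < 0 ? x + M : x;
}
```

The loop runs as many times as the Euclidean algorithm takes to reach 1, which is what makes the running time input-dependent, as the 37-iteration worst case above illustrates.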
content/english/hpc/number-theory/exponentiation.md (+9 −7)
@@ -3,7 +3,7 @@ title: Binary Exponentiation
weight: 2
---
-In modular arithmetic and computational algebra in general, you often need to raise a number to the $n$-th power — to do [modular division](../modular/#modular-division), perform [primality tests](../modular/#fermats-theorem), or compute some combinatorial values — and you usually want to spend fewer than $\Theta(n)$ operations calculating it.
+In modular arithmetic (and computational algebra in general), you often need to raise a number to the $n$-th power — to do [modular division](../modular/#modular-division), perform [primality tests](../modular/#fermats-theorem), or compute some combinatorial values — and you usually want to spend fewer than $\Theta(n)$ operations calculating it.
*Binary exponentiation*, also known as *exponentiation by squaring*, is a method that allows for computation of the $n$-th power using $O(\log n)$ multiplications, relying on the following observation:
@@ -54,9 +54,11 @@ u64 inverse(u64 a) {
}
```
-We use $m = 10^9+7$, which is a modulo value is commonly used in *competitive programming* to calculate checksums in combinatorial problems because it is prime (allowing inverse via binary exponentiation), sufficiently large while not overflowing `int` in addition or `long long` in multiplication, and is easy to type as `1e9 + 7`. Since it is a compile-time constant, the compiler can optimize the modulo by [replacing it with multiplication](/hpc/arithmetic/division/) (even if it is not a compile-time constant, it is still cheaper to compute the magic constants by hand once and use them for fast reduction).
+We use $m = 10^9+7$, which is a modulo value commonly used in competitive programming to calculate checksums in combinatorial problems — because it is prime (allowing inverse via binary exponentiation), sufficiently large, not overflowing `int` in addition, not overflowing `long long` in multiplication, and easy to type as `1e9 + 7`.
-The execution path and hence the running time depends on the value of $n$. For this particular $n$, the baseline implementation takes around 330ns per call. As recursion introduces some [overhead](/hpc/architecture/functions/), it makes sense to unroll it into an iterative one.
+Since we use it as a compile-time constant in the code, the compiler can optimize the modulo by [replacing it with multiplication](/hpc/arithmetic/division/) (even if it is not a compile-time constant, it is still cheaper to compute the magic constants by hand once and use them for fast reduction).
+
+The execution path — and consequently the running time — depends on the value of $n$. For this particular $n$, the baseline implementation takes around 330ns per call. As recursion introduces some [overhead](/hpc/architecture/functions/), it makes sense to unroll the implementation into an iterative procedure.
### Iterative Implementation
@@ -66,7 +68,7 @@ $$
a^{42} = a^{32+8+2} = a^{32} \cdot a^8 \cdot a^2
$$
-To calculate this product, we can iterate over the bits of $n$ maintaining two variables: the value of $a^{2^k}$ and the current product after considering $k$ lowest bits. On each step, we multiply the current product by $a^{2^k}$ if the $k$-th bit of $n$ is set, and, in either case, square $a^k$ to get $a^{2^k \cdot 2} = a^{2^{k+1}}$ that will be used on the next iteration.
+To calculate this product, we can iterate over the bits of $n$ maintaining two variables: the value of $a^{2^k}$ and the current product after considering $k$ lowest bits of $n$. On each step, we multiply the current product by $a^{2^k}$ if the $k$-th bit of $n$ is set, and, in either case, square $a^{2^k}$ to get $a^{2^k \cdot 2} = a^{2^{k+1}}$ that will be used on the next iteration.
```c++
u64 binpow(u64 a, u64 n) {
@@ -85,7 +87,7 @@ u64 binpow(u64 a, u64 n) {
The iterative implementation takes about 180ns per call. The heavy calculations are the same; the improvement mainly comes from the reduced dependency chain: `a = a * a % M` needs to finish before the loop can proceed, and it can now execute concurrently with `r = res * a % M`.
-The performance also benefits from $n$ being a constant, [making all branches predictable](/hpc/pipelining/branching/) and letting the scheduler know what needs to be executed in advance. The compiler, however, does not take advantage of it and does not unroll the `while(n) n >>= 1` loop. We can rewrite it as a `for` loop that takes constant 30 iterations:
+The performance also benefits from $n$ being a constant, [making all branches predictable](/hpc/pipelining/branching/) and letting the scheduler know what needs to be executed in advance. The compiler, however, does not take advantage of it and does not unroll the `while(n) n >>= 1` loop. We can rewrite it as a `for` loop that performs a constant 30 iterations:
```c++
u64 inverse(u64 a) {
@@ -102,6 +104,6 @@ u64 inverse(u64 a) {
}
```
-This forces the compiler to generate only the instructions we need, shoving off another 10ns and making the total running time ~170ns.
+This forces the compiler to generate only the instructions we need, shaving off another 10ns and making the total running time ~170ns.
-Note that the performance depends not only on the binary length of $n$, but also on the number of binary 1s. If $n$ is $2^{30}$, it takes around 20ns less not having to perform these off-path multiplications.
+Note that the performance depends not only on the binary length of $n$, but also on the number of binary 1s. If $n$ is $2^{30}$, it takes around 20ns less as we don't have to perform any off-path multiplications.
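
The hunks above clip the function bodies, so here is a self-contained sketch of both routines the text discusses — the iterative `binpow` and the fixed-length `inverse`. Loop details and names are assumptions rather than the book's exact code:

```c++
#include <cstdint>

typedef uint64_t u64;

const u64 M = 1e9 + 7;

u64 binpow(u64 a, u64 n) {
    u64 r = 1;
    while (n) {
        if (n & 1)
            r = r * a % M; // multiply in a^(2^k) if the k-th bit of n is set
        a = a * a % M;     // a^(2^k) -> a^(2^(k+1))
        n >>= 1;
    }
    return r;
}

// a^(-1) = a^(M - 2) (mod M) by Fermat's theorem; M - 2 < 2^30,
// so a fixed 30-iteration loop covers all its bits,
// and the compiler can fully unroll the constant trip count
u64 inverse(u64 a) {
    u64 r = 1;
    for (int i = 0; i < 30; i++) {
        if ((M - 2) >> i & 1)
            r = r * a % M;
        a = a * a % M;
    }
    return r;
}
```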
content/english/hpc/number-theory/modular.md (+4 −4)
@@ -3,10 +3,10 @@ title: Modular Arithmetic
weight: 1
---
-TODO: use it in binary exponentiation.
-
<!--
+TODO: use it in binary exponentiation.
+
In this section, we are going to discuss some preliminaries before discussing more advanced topics.
we use the 1st of January, 1970 as the start of the "Unix era," and all time computations are usually done relative to that timestamp.
@@ -19,9 +19,9 @@ And the beautiful thing about it is that remainders are small and cyclic. Think
Computers usually store time as the number of seconds that have passed since the 1st of January, 1970 — the start of the "Unix era" — and use these timestamps in all computations that have to do with time.
-We humans also keep track of time relative to some point in the past, which usually has a political or religious significance. For example, at the moment of writing, approximately 63882260594 seconds have passed since 0 AD.
+We humans also keep track of time relative to some point in the past, which usually has a political or religious significance. For example, at the moment of writing, approximately 63882260594 seconds have passed since 1 AD — a [6th-century Eastern Roman monk's best estimate](https://en.wikipedia.org/wiki/Anno_Domini) of the day Jesus Christ was born.
-But unlike computers, we do not always need *all* that information. Depending on the task at hand, the relevant part may be that it's 2 pm right now and it's time to go to dinner, or that it's Thursday, and so Subway's sub of the day is an Italian BMT. Instead of the whole timestamp, we use its *remainder* containing just the information we need: it is much easier to deal with 1- or 2-digit numbers than 11-digit ones.
+But unlike computers, we do not always need *all* that information. Depending on the task at hand, the relevant part may be that it's 2 pm right now, and it's time to go to dinner; or that it's Thursday, and so Subway's sub of the day is an Italian BMT. Instead of the whole timestamp, we use its *remainder* containing just the information we need: it is much easier to deal with 1- or 2-digit numbers than 11-digit ones.
**Problem.** Today is Thursday. What day of the week will it be exactly in a year?
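
For reference, the arithmetic behind the problem: $365 = 52 \cdot 7 + 1 \equiv 1 \pmod 7$, so the same date next year lands one weekday later — Friday (or Saturday, if a February 29 falls in between).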
content/english/hpc/number-theory/montgomery.md (+5 −5)
@@ -3,9 +3,9 @@ title: Montgomery Multiplication
weight: 4
---
-Unsurprisingly, large fractions of computations in [modular arithmetic](../modular) are often spent on calculating the modulo operation, which is as slow as general integer division and typically taking 15-20 cycles, depending on the operand size.
+Unsurprisingly, a large fraction of computation in [modular arithmetic](../modular) is often spent on calculating the modulo operation, which is as slow as [general integer division](/hpc/arithmetic/division/) and typically takes 15-20 cycles, depending on the operand size.
-The best way to deal this nuisance is to avoid modulo operation altogether, delaying or replacing it with [predication](/hpc/pipelining/branchless), which can be done when calculating sums, for example:
+The best way to deal with this nuisance is to avoid the modulo operation altogether, delaying or replacing it with [predication](/hpc/pipelining/branchless), which can be done, for example, when calculating modular sums:
```cpp
const int M = 1e9 + 7;
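
// (the diff clips the snippet here; the function below is an illustrative
// completion of the predicated-sum idea, not necessarily the book's code)
int sum(int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; i++) {
        s += a[i];                // s stays below 2 * M and fits in an int
        s = (s >= M ? s - M : s); // replaces s %= M; compiles to a cmov
    }
    return s;
}
```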
@@ -44,7 +44,7 @@ But there is another technique designed specifically for modular arithmetic, cal
Montgomery multiplication works by first transforming the multipliers into *Montgomery space*, where modular multiplication can be performed cheaply, and then transforming them back when their actual values are needed. Unlike general integer division methods, Montgomery multiplication is not efficient for performing just one modular reduction and only becomes worthwhile when there is a chain of modular operations.
-The space is defined by the modulo $n$ and a positive integer $r \ge n$ coprime to $n$. The algorithm involves division and modulo by $r$, so in practice, $r$ is chosen to be $2^m$ with $m$ being equal 32 or 64, so that these operations can be done with a right-shift and a bitwise AND respectively.
+The space is defined by the modulo $n$ and a positive integer $r \ge n$ coprime to $n$. The algorithm involves modulo and division by $r$, so in practice, $r$ is chosen to be $2^{32}$ or $2^{64}$, so that these operations can be done with a bitwise AND and a right-shift, respectively.
<!-- Therefore $n$ needs to be an odd number so that every power of $2$ will be coprime to $n$. And if it is not, we can make it odd (?). -->
@@ -54,7 +54,7 @@ $$
\bar{x} = x \cdot r \bmod n
$$
-Computing this transformation involves a multiplication and a modulo — an expensive operation that we wanted to optimize away in the first place — which is why we don't use this method for general modular multiplication and only long sequences of operations where transforming numbers to and from the Montgomery space is worth it.
+Computing this transformation involves a multiplication and a modulo — an expensive operation that we wanted to optimize away in the first place — which is why we only use this method when the overhead of transforming numbers to and from the Montgomery space is worth it, and not for general modular multiplication.
<!-- Note that the transformation is actually such a multiplication that we want to optimize, so it is still an expensive operation. However, we will only need to transform a number into the space once, perform as many operations as we want efficiently in that space and at the end transform the final result back, which should be profitable if we are doing lots of operations modulo $n$. -->
@@ -287,6 +287,6 @@ int inverse(int _a) {
}
```
-While vanilla binary exponentiation with a compiler-generated fast modulo trick requires ~170ns per `inverse` call, this implementation takes ~166ns, going down to ~158s we omit `transform` and `reduce` (a reasonable use case in modular arithmetic is for `inverse` to be used as a subprocedure in a bigger computation). This is a small improvement, but Montgomery multiplication becomes much more advantageous for SIMD applications and larger data types.
+While vanilla binary exponentiation with a compiler-generated fast modulo trick requires ~170ns per `inverse` call, this implementation takes ~166ns, going down to ~158ns if we omit `transform` and `reduce` (a reasonable use case is for `inverse` to be used as a subprocedure in a bigger modular computation). This is a small improvement, but Montgomery multiplication becomes much more advantageous for SIMD applications and larger data types.
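
For reference, the core of the technique with $r = 2^{32}$ and the same $n = 10^9 + 7$ can be sketched as follows; the helper names (`neg_inv`, `transform`, `reduce`) and the exact structure are illustrative, not necessarily the implementation benchmarked here:

```c++
#include <cstdint>

typedef uint32_t u32;
typedef uint64_t u64;

const u32 N = 1e9 + 7;

// -N^{-1} mod 2^32 via Newton's iteration:
// each step doubles the number of correct low bits
constexpr u32 neg_inv(u32 n) {
    u32 x = n;              // n * n = 1 (mod 8) for any odd n
    for (int i = 0; i < 4; i++)
        x *= 2 - n * x;     // correct mod 2^6, 2^12, 2^24, 2^48
    return -x;
}

const u32 NR = neg_inv(N);

// given x < N * 2^32, returns x / 2^32 mod N
u32 reduce(u64 x) {
    u32 m = u32(x) * NR;            // m * N = -x (mod 2^32)
    u32 t = (x + u64(m) * N) >> 32; // exact division by 2^32; t < 2N
    return t < N ? t : t - N;
}

u32 transform(u32 x) { // to Montgomery space: x -> x * 2^32 mod N
    return (u64(x) << 32) % N;
}

u32 mul(u32 a, u32 b) { // multiplication inside the space
    return reduce(u64(a) * b);
}
```

A product such as `reduce(mul(transform(a), transform(b)))` yields $a \cdot b \bmod N$; keeping values in the space across a long chain of multiplications is what amortizes the two conversions.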