-
Notifications
You must be signed in to change notification settings - Fork 8k
ext/bcmath: Use divide and conquer method in bcmul
#14376
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
90c9984 to
0edb77d
Compare
|
There was a calculation error in memory management, so will fix it. |
c78e1b3 to
600590c
Compare
|
done |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting but also very complex.
I also think we need good dedicated test cases for this.
| * | ||
| * As can see by actually doing it, the value added to [3] in this example is the | ||
| * accumulation of all (high + low - mid) when calculating 2 digits. | ||
| * In this example, the ret size is 8, so the calculation length is 4, and from 4^1.585, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do you obtain the number 1.585?
I see, this is the time complexity O(N^log2(3))... That wasn't clear to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll elaborate a bit more in my comments.
ext/bcmath/libbcmath/src/recmul.c
Outdated
| * | ||
| * At this time, if consider one portion (high + low - mid) using ab*cd as an example, | ||
| * this becomes: ac + bd - (a-b)(c-d) = ac + bd - ac + ad + bc - bd = ad + bc | ||
| * Can see that the maximum value of this is obtained when 99*99. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| * Can see that the maximum value of this is obtained when 99*99. | |
| * Can see that the maximum value of this is obtained by 99*99. |
ext/bcmath/libbcmath/src/recmul.c
Outdated
| * this becomes: ac + bd - (a-b)(c-d) = ac + bd - ac + ad + bc - bd = ad + bc | ||
| * Can see that the maximum value of this is obtained when 99*99. | ||
| * | ||
| * This law holds true regardless of the calculation length, so when considering the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
afaik "holds" already means it's true
| * This law holds true regardless of the calculation length, so when considering the | |
| * This law holds regardless of the calculation length, so when considering the |
ext/bcmath/libbcmath/src/recmul.c
Outdated
| * BC_MUL_MAX_ADD_COUNT (because the calculation length is always adjusted to the power of 2). | ||
| */ | ||
| #if SIZEOF_SIZE_T >= 8 | ||
| # define BC_REC_MUL_DO_ADJUST_EXPO 1024 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, so because BC_MUL_MAX_ADD_COUNT is around 1844 we go to the power of 2 below it, i.e. 1024. I think you should add somewhere that BC_MUL_MAX_ADD_COUNT is around 1844.
| */ | ||
| #if SIZEOF_SIZE_T >= 8 | ||
| # define BC_REC_MUL_DO_ADJUST_EXPO 1024 | ||
| # define BC_USE_REC_MUL_DIGITS 160 * 8 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How did you arrive at 160?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I simply compared the measurements with the standard version and specified the number at which rec was faster. But I forgot to include the results of that comparison...
ext/bcmath/libbcmath/src/recmul.c
Outdated
| * reaches 2. In other words, it is the sum of a geometric progression of 2 with a geometric | ||
| * ratio of 2 from 2 to N-1, where the calculation length is N. | ||
| * This sum can be calculated using the following formula, where the first term is a and the | ||
| * geometric ratio is r: a(r^(N-1) - 1)(r - 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| * geometric ratio is r: a(r^(N-1) - 1)(r - 1) | |
| * geometric ratio is r: a(r^(N-1) - 1)/(r - 1) |
ext/bcmath/libbcmath/src/recmul.c
Outdated
| * ratio of 2 from 2 to N-1, where the calculation length is N. | ||
| * This sum can be calculated using the following formula, where the first term is a and the | ||
| * geometric ratio is r: a(r^(N-1) - 1)(r - 1) | ||
| * Here, a and r are both 2, so this formula becomes: 2(2^(N-1) - 1)(2 - 1) = 2^N - 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| * Here, a and r are both 2, so this formula becomes: 2(2^(N-1) - 1)(2 - 1) = 2^N - 2 | |
| * Here, a and r are both 2, so this formula becomes: 2(2^(N-1) - 1)/(2 - 1) = 2^N - 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are a and r both 2? I would've expected a = M and r = 1/2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was intended that rearranging the numbers in reverse would result in a sequence like the comments, but the format you've written may be easier to understand.
ext/bcmath/libbcmath/src/recmul.c
Outdated
| * Share the results and the buffers used in intermediate calculations. The result is prod_arr_size. | ||
| * The buffer increases by half like n_buf, but when calc_size is 2, the required buffer size is 4. | ||
| * In other words, they are almost the same geometric progression, but the first term is 4. | ||
| * a(r^(N-1) - 1)(r - 1): 4(2^(N-1) - 1)(2 - 1) = 2^N - 2 = 2(2^N - 2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| * a(r^(N-1) - 1)(r - 1): 4(2^(N-1) - 1)(2 - 1) = 2^N - 2 = 2(2^N - 2) | |
| * a(r^(N-1) - 1)/(r - 1): 4(2^(N-1) - 1)/(2 - 1) = 2^N - 2 = 2(2^N - 2) |
ext/bcmath/libbcmath/src/recmul.c
Outdated
| n1_buf_size -= 2; | ||
| n2_buf_size -= 2; | ||
|
|
||
| BC_UINT_T *buf = safe_emalloc(prod_arr_size * 2 + calc_size * 2 - 4 + n1_buf_size + n2_buf_size, sizeof(BC_UINT_T), 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this sum overflow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. For very large numbers it will overflow...
ext/bcmath/libbcmath/src/recmul.c
Outdated
| * adjust the topmost entry | ||
| */ | ||
| if (UNEXPECTED(calc_size >= BC_REC_MUL_DO_ADJUST_EXPO)) { | ||
| prod_uint[prod_arr_real_size - 1] += prod_uint[prod_arr_real_size] * BC_MUL_UINT_OVERFLOW; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't get this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add a comment
|
I discovered that calculations can break under certain conditions. I'm currently thinking of some test cases, and I'd like to be able to detect this even with 64-bit CI. (I was calculating as a uint, thinking that there was no place where it would be affected, but underflow in some cases when it becomes a negative value is causing problems.) |
|
You may try differential fuzzing to catch some mistakes and find potential bugs, with one fuzz variant passing inputs to the old code and the other fuzz variant passing inputs to the new code. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am seriously struggling to review the code, all the BC_MUL_INT_DIGITS to BC_MUL_INT_DIGITS changes pollute the review, and if those are a prerequisite for this PR, then I would rather have them move to another PR, verified and validated, so that this PR can focus only on implementing the Karatsuba algorithm.
I am also struggling to see why there is so much code for what a pseudo implementation seems to do in not a lot of code, but maybe that's because the naming of the functions is not helping me and there are a bunch of other optimizations done at the same time that makes verifying the correctness of the algorithm difficult.
It might be better to go with a "dumb" implementation of karatsuba (and call the function that (e.g. bc_karatsuba rather than bc_rec_mul) and have the helper subroutines be clearly defined and named appropriately so that each part can be checked for correctness and then optimized in follow-up commits or PRs.
Aside: Can we move away from using "BCD" (BC Digit) when referring to BC numbers and just use actual words too.
|
|
||
| /* | ||
| * In divide-and-conquer calculations, additions are concentrated on array | ||
| * entries around half of the ret size length. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: not sure abbreviating here makes any sense.
| * entries around half of the ret size length. | |
| * entries around half of the return size length. |
| /* | ||
| * In divide-and-conquer calculations, additions are concentrated on array | ||
| * entries around half of the ret size length. | ||
| * e.g. ret size is 8, [7][6][5][4][3][2][1][0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| * e.g. ret size is 8, [7][6][5][4][3][2][1][0] | |
| * e.g. return size is 8, [7][6][5][4][3][2][1][0] |
| /* | ||
| * If n1 and n2 are both greater than this order of magnitude, use the | ||
| * divide-and-conquer method (as a result of measurement, a clear speed | ||
| * difference appears from this order of magnitude). | ||
| * If the number of digits is small, the overhead impact is large and slow. | ||
| */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are multiple algorithms for the multiplication of large integers, such as Toom–Cook/Toom-3 and Schönhage–Strassen algorithms. So be clear about what we are using from the get-go.
Also, some wording improvements, at least to me.
| /* | |
| * If n1 and n2 are both greater than this order of magnitude, use the | |
| * divide-and-conquer method (as a result of measurement, a clear speed | |
| * difference appears from this order of magnitude). | |
| * If the number of digits is small, the overhead impact is large and slow. | |
| */ | |
| /* | |
| * When n1 and n2 are both greater than this order of magnitude, | |
| * use the Karatsuba divide-and-conquer algorithm. | |
| * For smaller magnitudes the overhead of the algorithm makes it worse than the | |
| * naive long-multiplication algorithms. | |
| */ |
| * In divide-and-conquer calculations, additions are concentrated on array | ||
| * entries around half of the ret size length. | ||
| * e.g. ret size is 8, [7][6][5][4][3][2][1][0] | ||
| * In this case, addition is most concentrated on [3]. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| * In this case, addition is most concentrated on [3]. | |
| * In this case, addition is most concentrated on digit number [3]. |
| * (If normal multiplication of N digits and N digits involves multiplying one digit N^2 | ||
| * times, the Karatsuba-algorithm requires N^log2(3) times of calculation. N^log2(3) is | ||
| * approximately N^1.585.) | ||
| * there is a minimum unit calculation set of 9, so add (high + low - mid) 9 times. | ||
| * | ||
| * At this time, if consider one portion (high + low - mid) using ab*cd as an example, | ||
| * this becomes: ac + bd - (a-b)(c-d) = ac + bd - ac + ad + bc - bd = ad + bc | ||
| * Can see that the maximum value of this is obtained by 99*99. | ||
| * | ||
| * This law holds regardless of the calculation length, so when considering the | ||
| * maximum value, all mids are canceled out and can be ignored. Therefore, mid and all | ||
| * calculations that further divide mid can be ignored from the calculation results that | ||
| * are being accumulated. | ||
| * In other words, if the calculation length is N and the minimum calculation unit length | ||
| * is 2, there are N/2 high and low pairs. Therefore, the number of times the value is | ||
| * added is N times. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't it maybe just better to reference Wikipedia or some paper for an explanation of the algorithm?
One can use a permalink such that new revision don't mess up what we are linking.
e.g. https://en.wikipedia.org/w/index.php?title=Karatsuba_algorithm&oldid=1190009898
| # define BC_UINT_T uint64_t | ||
| # define BC_INT_T int64_t |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we just use typedefs for these actually rather than a macro?
| memcpy(str, &digits, sizeof(digits)); | ||
| } | ||
|
|
||
| static inline void bc_mul_convert_int_to_bcd(BC_INT_T *prod_int, size_t prodlen, size_t prod_arr_size, bc_num *prod) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
| static inline void bc_mul_convert_int_to_bcd(BC_INT_T *prod_int, size_t prodlen, size_t prod_arr_size, bc_num *prod) | |
| static inline void bc_mul_convert_int_to_bcd(BC_INT_T *prod_int, size_t prodlen, size_t prod_arr_size, bc_num *prod) |
| /* | ||
| * In divide-and-conquer calculations, determine whether the calculation length is | ||
| * such that digits should be adjusted to prevent overflow during calculation. | ||
| * Digit adjustment is performed when the calculation length is a power of | ||
| * BC_REC_MUL_DO_ADJUST_EXPO. | ||
| */ | ||
| static inline bool bc_rec_mul_near_overflow(size_t calc_arr_size) | ||
| { | ||
| if (EXPECTED(calc_arr_size < BC_REC_MUL_DO_ADJUST_EXPO)) { | ||
| return false; | ||
| } | ||
|
|
||
| while (calc_arr_size > 0) { | ||
| calc_arr_size /= BC_REC_MUL_DO_ADJUST_EXPO; | ||
| if (UNEXPECTED(calc_arr_size == 1)) { | ||
| return true; | ||
| } | ||
| } | ||
| return false; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why calculations can overflow, according to Wikipedia there is a way to always do the required steps without multiplication overflow.
See the last paragraph of https://en.wikipedia.org/w/index.php?title=Karatsuba_algorithm&oldid=1190009898#Implementation
|
The brilliant.org wiki might also be good to link for an explanation of the algorithm: https://brilliant.org/wiki/karatsuba-algorithm/ |
|
@Girgias |
|
It's too complicated, so I'll try to make the commit easier to understand. Thanks both of you for checking it out. |
|
I've reworked the PR so I'm closing this. |
It does not significantly affect the speed of computations that do not use divide and conquer, making computations with large numbers of digits faster.
I was wondering if I could make the commits smaller somehow, but since all changes depend on each other, I couldn't separate the commits...
There are some places where SIMD could be used, but I won't include it in this PR because it would make the PR too complicated.
Benchmarks
1:
2:
3:
4:
before
after