Skip to content

Conversation

@SakiTakamachi
Copy link
Member

@SakiTakamachi SakiTakamachi commented May 30, 2024

It does not significantly affect the speed of computations that do not use divide and conquer, making computations with large numbers of digits faster.

I was wondering if I could make the commits smaller somehow, but since all changes depend on each other, I couldn't separate the commits...

There are some places where SIMD could be used, but I won't include it in this PR because it would make the PR too complicated.

Benchmarks

1:

$num1 = '1.2345678';
$num2 = '2.1234567';

for ($i = 0; $i < 5000000; $i++) {
    bcmul($num1, $num2, 7);
}

2:

$num1 = '1.2345678901234567890';
$num2 = '2.12345678901234567890';

for ($i = 0; $i < 3000000; $i++) {
    bcmul($num1, $num2, 20);
}

3:

$num1 = str_repeat('1234567890', 300);
$num2 = str_repeat('9876543210', 300);

for ($i = 0; $i < 6000; $i++) {
    bcmul($num1, $num2, 0);
}

4:

$num1 = str_repeat('1234567890', 1024);
$num2 = str_repeat('9876543210', 1024);

for ($i = 0; $i < 500; $i++) {
    bcmul($num1, $num2, 0);
}

before

# hyperfine "php /bc/mul/1.php" --warmup 10
Benchmark 1: php /bc/mul/1.php
  Time (mean ± σ):     611.1 ms ±   3.3 ms    [User: 605.5 ms, System: 4.8 ms]
  Range (min … max):   607.7 ms … 619.1 ms    10 runs
 
# hyperfine "php /bc/mul/2.php" --warmup 10
Benchmark 1: php /bc/mul/2.php
  Time (mean ± σ):     507.5 ms ±   4.4 ms    [User: 503.1 ms, System: 3.5 ms]
  Range (min … max):   503.3 ms … 518.3 ms    10 runs
 
# hyperfine "php /bc/mul/3.php" --warmup 10
Benchmark 1: php /bc/mul/3.php
  Time (mean ± σ):     537.0 ms ±   7.3 ms    [User: 533.4 ms, System: 2.8 ms]
  Range (min … max):   528.3 ms … 549.0 ms    10 runs
 
# hyperfine "php /bc/mul/4.php" --warmup 10
Benchmark 1: php /bc/mul/4.php
  Time (mean ± σ):     476.2 ms ±   5.0 ms    [User: 471.6 ms, System: 3.8 ms]
  Range (min … max):   470.7 ms … 487.4 ms    10 runs

after

# hyperfine "php /bc/mul/1.php" --warmup 10
Benchmark 1: php /bc/mul/1.php
  Time (mean ± σ):     610.4 ms ±   8.1 ms    [User: 606.1 ms, System: 2.5 ms]
  Range (min … max):   601.6 ms … 628.6 ms    10 runs
 
# hyperfine "php /bc/mul/2.php" --warmup 10
Benchmark 1: php /bc/mul/2.php
  Time (mean ± σ):     502.7 ms ±   4.4 ms    [User: 498.6 ms, System: 3.2 ms]
  Range (min … max):   496.4 ms … 509.2 ms    10 runs
 
# hyperfine "php /bc/mul/3.php" --warmup 10
Benchmark 1: php /bc/mul/3.php
  Time (mean ± σ):     450.4 ms ±  14.3 ms    [User: 446.5 ms, System: 3.1 ms]
  Range (min … max):   433.8 ms … 474.1 ms    10 runs
 
# hyperfine "php /bc/mul/4.php" --warmup 10
Benchmark 1: php /bc/mul/4.php
  Time (mean ± σ):     292.2 ms ±   6.9 ms    [User: 287.3 ms, System: 4.1 ms]
  Range (min … max):   285.7 ms … 307.3 ms    10 runs

@SakiTakamachi
Copy link
Member Author

There was a calculation error in memory management, so will fix it.

@SakiTakamachi SakiTakamachi force-pushed the refactor_bcmath_mul_rec branch from c78e1b3 to 600590c Compare May 31, 2024 14:37
@SakiTakamachi
Copy link
Member Author

done

Copy link
Member

@nielsdos nielsdos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting but also very complex.
I also think we need good dedicated test cases for this.

*
* As can see by actually doing it, the value added to [3] in this example is the
* accumulation of all (high + low - mid) when calculating 2 digits.
* In this example, the ret size is 8, so the calculation length is 4, and from 4^1.585,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you obtain the number 1.585?
I see, this is the time complexity O(N^log2(3))... That wasn't clear to me

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll elaborate a bit more in my comments.

*
* At this time, if consider one portion (high + low - mid) using ab*cd as an example,
* this becomes: ac + bd - (a-b)(c-d) = ac + bd - ac + ad + bc - bd = ad + bc
* Can see that the maximum value of this is obtained when 99*99.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Can see that the maximum value of this is obtained when 99*99.
* Can see that the maximum value of this is obtained by 99*99.

* this becomes: ac + bd - (a-b)(c-d) = ac + bd - ac + ad + bc - bd = ad + bc
* Can see that the maximum value of this is obtained when 99*99.
*
* This law holds true regardless of the calculation length, so when considering the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

afaik "holds" already means it's true

Suggested change
* This law holds true regardless of the calculation length, so when considering the
* This law holds regardless of the calculation length, so when considering the

* BC_MUL_MAX_ADD_COUNT (because the calculation length is always adjusted to the power of 2).
*/
#if SIZEOF_SIZE_T >= 8
# define BC_REC_MUL_DO_ADJUST_EXPO 1024
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, so because BC_MUL_MAX_ADD_COUNT is around 1844 we go to the power of 2 below it, i.e. 1024. I think you should add somewhere that BC_MUL_MAX_ADD_COUNT is around 1844.

*/
#if SIZEOF_SIZE_T >= 8
# define BC_REC_MUL_DO_ADJUST_EXPO 1024
# define BC_USE_REC_MUL_DIGITS 160 * 8
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How did you arrive at 160?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I simply compared the measurements with the standard version and specified the number at which rec was faster. But I forgot to include the results of that comparison...

* reaches 2. In other words, it is the sum of a geometric progression of 2 with a geometric
* ratio of 2 from 2 to N-1, where the calculation length is N.
* This sum can be calculated using the following formula, where the first term is a and the
* geometric ratio is r: a(r^(N-1) - 1)(r - 1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* geometric ratio is r: a(r^(N-1) - 1)(r - 1)
* geometric ratio is r: a(r^(N-1) - 1)/(r - 1)

* ratio of 2 from 2 to N-1, where the calculation length is N.
* This sum can be calculated using the following formula, where the first term is a and the
* geometric ratio is r: a(r^(N-1) - 1)(r - 1)
* Here, a and r are both 2, so this formula becomes: 2(2^(N-1) - 1)(2 - 1) = 2^N - 2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Here, a and r are both 2, so this formula becomes: 2(2^(N-1) - 1)(2 - 1) = 2^N - 2
* Here, a and r are both 2, so this formula becomes: 2(2^(N-1) - 1)/(2 - 1) = 2^N - 2

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are a and r both 2? I would've expected a = M and r = 1/2

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was intended that rearranging the numbers in reverse would result in a sequence like the comments, but the format you've written may be easier to understand.

* Share the results and the buffers used in intermediate calculations. The result is prod_arr_size.
* The buffer increases by half like n_buf, but when calc_size is 2, the required buffer size is 4.
* In other words, they are almost the same geometric progression, but the first term is 4.
* a(r^(N-1) - 1)(r - 1): 4(2^(N-1) - 1)(2 - 1) = 2^N - 2 = 2(2^N - 2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* a(r^(N-1) - 1)(r - 1): 4(2^(N-1) - 1)(2 - 1) = 2^N - 2 = 2(2^N - 2)
* a(r^(N-1) - 1)/(r - 1): 4(2^(N-1) - 1)/(2 - 1) = 2^N - 2 = 2(2^N - 2)

n1_buf_size -= 2;
n2_buf_size -= 2;

BC_UINT_T *buf = safe_emalloc(prod_arr_size * 2 + calc_size * 2 - 4 + n1_buf_size + n2_buf_size, sizeof(BC_UINT_T), 0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this sum overflow?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. For very large numbers it will overflow...

* adjust the topmost entry
*/
if (UNEXPECTED(calc_size >= BC_REC_MUL_DO_ADJUST_EXPO)) {
prod_uint[prod_arr_real_size - 1] += prod_uint[prod_arr_real_size] * BC_MUL_UINT_OVERFLOW;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't get this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add a comment

@SakiTakamachi
Copy link
Member Author

I discovered that calculations can break under certain conditions.
This is currently only detectable in CI when testing 32bit builds.

I'm currently thinking of some test cases, and I'd like to be able to detect this even with 64-bit CI.

(I was calculating as a uint, thinking that there was no place where it would be affected, but underflow in some cases when it becomes a negative value is causing problems.)

@nielsdos
Copy link
Member

nielsdos commented Jun 2, 2024

You may try differential fuzzing to catch some mistakes and find potential bugs, with one fuzz variant passing inputs to the old code and the other fuzz variant passing inputs to the new code.

Copy link
Member

@Girgias Girgias left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am seriously struggling to review the code, all the BC_MUL_INT_DIGITS to BC_MUL_INT_DIGITS changes pollute the review, and if those are a prerequisite for this PR, then I would rather have them move to another PR, verified and validated, so that this PR can focus only on implementing the Karatsuba algorithm.

I am also struggling to see why there is so much code for what a pseudo implementation seems to do in not a lot of code, but maybe that's because the naming of the functions is not helping me and there are a bunch of other optimizations done at the same time that makes verifying the correctness of the algorithm difficult.

It might be better to go with a "dumb" implementation of karatsuba (and call the function that (e.g. bc_karatsuba rather than bc_rec_mul) and have the helper subroutines be clearly defined and named appropriately so that each part can be checked for correctness and then optimized in follow-up commits or PRs.

Aside: Can we move away from using "BCD" (BC Digit) when referring to BC numbers and just use actual words too.


/*
* In divide-and-conquer calculations, additions are concentrated on array
* entries around half of the ret size length.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: not sure abbreviating here makes any sense.

Suggested change
* entries around half of the ret size length.
* entries around half of the return size length.

/*
* In divide-and-conquer calculations, additions are concentrated on array
* entries around half of the ret size length.
* e.g. ret size is 8, [7][6][5][4][3][2][1][0]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* e.g. ret size is 8, [7][6][5][4][3][2][1][0]
* e.g. return size is 8, [7][6][5][4][3][2][1][0]

Comment on lines +48 to +53
/*
* If n1 and n2 are both greater than this order of magnitude, use the
* divide-and-conquer method (as a result of measurement, a clear speed
* difference appears from this order of magnitude).
* If the number of digits is small, the overhead impact is large and slow.
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are multiple algorithms for the multiplication of large integers, such as Toom–Cook/Toom-3 and Schönhage–Strassen algorithms. So be clear about what we are using from the get-go.

Also, some wording improvements, at least to me.

Suggested change
/*
* If n1 and n2 are both greater than this order of magnitude, use the
* divide-and-conquer method (as a result of measurement, a clear speed
* difference appears from this order of magnitude).
* If the number of digits is small, the overhead impact is large and slow.
*/
/*
* When n1 and n2 are both greater than this order of magnitude,
* use the Karatsuba divide-and-conquer algorithm.
* For smaller magnitudes the overhead of the algorithm makes it worse than the
* naive long-multiplication algorithms.
*/

* In divide-and-conquer calculations, additions are concentrated on array
* entries around half of the ret size length.
* e.g. ret size is 8, [7][6][5][4][3][2][1][0]
* In this case, addition is most concentrated on [3].
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* In this case, addition is most concentrated on [3].
* In this case, addition is most concentrated on digit number [3].

Comment on lines +72 to +87
* (If normal multiplication of N digits and N digits involves multiplying one digit N^2
* times, the Karatsuba-algorithm requires N^log2(3) times of calculation. N^log2(3) is
* approximately N^1.585.)
* there is a minimum unit calculation set of 9, so add (high + low - mid) 9 times.
*
* At this time, if consider one portion (high + low - mid) using ab*cd as an example,
* this becomes: ac + bd - (a-b)(c-d) = ac + bd - ac + ad + bc - bd = ad + bc
* Can see that the maximum value of this is obtained by 99*99.
*
* This law holds regardless of the calculation length, so when considering the
* maximum value, all mids are canceled out and can be ignored. Therefore, mid and all
* calculations that further divide mid can be ignored from the calculation results that
* are being accumulated.
* In other words, if the calculation length is N and the minimum calculation unit length
* is 2, there are N/2 high and low pairs. Therefore, the number of times the value is
* added is N times.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it maybe just better to reference Wikipedia or some paper for an explanation of the algorithm?
One can use a permalink such that new revision don't mess up what we are linking.
e.g. https://en.wikipedia.org/w/index.php?title=Karatsuba_algorithm&oldid=1190009898

Comment on lines 86 to +87
# define BC_UINT_T uint64_t
# define BC_INT_T int64_t
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we just use typedefs for these actually rather than a macro?

memcpy(str, &digits, sizeof(digits));
}

static inline void bc_mul_convert_int_to_bcd(BC_INT_T *prod_int, size_t prodlen, size_t prod_arr_size, bc_num *prod)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
static inline void bc_mul_convert_int_to_bcd(BC_INT_T *prod_int, size_t prodlen, size_t prod_arr_size, bc_num *prod)
static inline void bc_mul_convert_int_to_bcd(BC_INT_T *prod_int, size_t prodlen, size_t prod_arr_size, bc_num *prod)

Comment on lines +371 to +390
/*
* In divide-and-conquer calculations, determine whether the calculation length is
* such that digits should be adjusted to prevent overflow during calculation.
* Digit adjustment is performed when the calculation length is a power of
* BC_REC_MUL_DO_ADJUST_EXPO.
*/
static inline bool bc_rec_mul_near_overflow(size_t calc_arr_size)
{
if (EXPECTED(calc_arr_size < BC_REC_MUL_DO_ADJUST_EXPO)) {
return false;
}

while (calc_arr_size > 0) {
calc_arr_size /= BC_REC_MUL_DO_ADJUST_EXPO;
if (UNEXPECTED(calc_arr_size == 1)) {
return true;
}
}
return false;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why calculations can overflow, according to Wikipedia there is a way to always do the required steps without multiplication overflow.

See the last paragraph of https://en.wikipedia.org/w/index.php?title=Karatsuba_algorithm&oldid=1190009898#Implementation

@Girgias
Copy link
Member

Girgias commented Jun 3, 2024

The brilliant.org wiki might also be good to link for an explanation of the algorithm: https://brilliant.org/wiki/karatsuba-algorithm/

@SakiTakamachi
Copy link
Member Author

@Girgias
As you said, I think splitting up the PR because there are extensive bug fixes that I noticed after opening the PR.
Leave this PR as it is for now and change it in order with the new PR.

@SakiTakamachi
Copy link
Member Author

It's too complicated, so I'll try to make the commit easier to understand. Thanks both of you for checking it out.

@SakiTakamachi
Copy link
Member Author

I've reworked the PR so I'm closing this.

#14538

@SakiTakamachi SakiTakamachi deleted the refactor_bcmath_mul_rec branch June 13, 2024 16:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants