faster Toom-Cook 3 algorithms #265


Merged
1 commit merged into libtom:develop on May 22, 2019

Conversation

czurnieden
Contributor

Just brought the implementation up to the state of the art; nothing else changed.
It halves the cutoff point and therefore brings it into the expected region: a bit less than twice the Karatsuba cutoff point, which is where it now sits.

This code was part of #227.

@minad
Member

minad commented May 19, 2019

If what you are saying about the cutoff points holds, this is good 👍

How is test coverage? Maybe add a tiny test specifically calling s_mul_karatsuba and s_mul_toom and comparing for very large numbers?

@czurnieden
Contributor Author

czurnieden commented May 19, 2019

If what you are saying about the cutoff points holds, this is good

I have upgraded my home machine, if you remember, and since then some things work… a bit differently, to say the least.
The multiplication algorithms work as expected, but the cutoff points of the squaring algorithms went through the roof (both Tom's original ones and mine).

Can you do me a favour and, if you have access to Bionic (a virtual machine won't work here), run tune?

EDIT II:
Tried a version written by Marco Bodrato for libtommath (not usable: GPL 3!). It is highly optimized and the result is as expected: his Toom-Cook 3 squaring is nearly as fast as the Karatsuba squaring.

So is it my algorithm?
Profiling is a good idea in this case, so I ran gprof. Funny result:

 branch develop
#define MP_DEFAULT_KARATSUBA_MUL_CUTOFF 96
#define MP_DEFAULT_KARATSUBA_SQR_CUTOFF 127
#define MP_DEFAULT_TOOM_MUL_CUTOFF      373
#define MP_DEFAULT_TOOM_SQR_CUTOFF      503

 branch faster_toom_cook_three, no profiling
#define MP_DEFAULT_KARATSUBA_MUL_CUTOFF 93
#define MP_DEFAULT_KARATSUBA_SQR_CUTOFF 126
#define MP_DEFAULT_TOOM_MUL_CUTOFF      182
#define MP_DEFAULT_TOOM_SQR_CUTOFF      973

 branch faster_toom_cook_three, with profiling
#define MP_DEFAULT_KARATSUBA_MUL_CUTOFF 128
#define MP_DEFAULT_KARATSUBA_SQR_CUTOFF 127
#define MP_DEFAULT_TOOM_MUL_CUTOFF      224
#define MP_DEFAULT_TOOM_SQR_CUTOFF      261

The results of the run with profiling are almost exactly what I got with my old system. That's why I proudly talked about halving the cutoffs.
WTF?
EDIT III:
After some memory optimization

#define MP_DEFAULT_KARATSUBA_MUL_CUTOFF 92
#define MP_DEFAULT_KARATSUBA_SQR_CUTOFF 128
#define MP_DEFAULT_TOOM_MUL_CUTOFF      181
#define MP_DEFAULT_TOOM_SQR_CUTOFF      504

> summary(data)
       km               ks             tc3m            tc3s      
 Min.   : 84.00   Min.   :112.0   Min.   :165.0   Min.   :255.0  
 1st Qu.: 92.75   1st Qu.:126.0   1st Qu.:176.0   1st Qu.:500.0  
 Median : 93.00   Median :127.5   Median :180.0   Median :503.0  
 Mean   : 93.40   Mean   :126.1   Mean   :179.9   Mean   :511.4  
 3rd Qu.: 95.00   3rd Qu.:128.0   3rd Qu.:183.0   3rd Qu.:505.0  
 Max.   :102.00   Max.   :128.0   Max.   :200.0   Max.   :823.0  
> apply(data, 2, sd)
       km        ks      tc3m      tc3s 
 3.117238  3.867124  5.713744 59.342739

Mmh…
With compiler optimization changed from -O3 to -O2:

#define MP_DEFAULT_KARATSUBA_MUL_CUTOFF 157
#define MP_DEFAULT_KARATSUBA_SQR_CUTOFF 128
#define MP_DEFAULT_TOOM_MUL_CUTOFF      225
#define MP_DEFAULT_TOOM_SQR_CUTOFF      255

> summary(data)
       km            ks           tc3m            tc3s      
 Min.   :150   Min.   :128   Min.   :212.0   Min.   :254.0  
 1st Qu.:156   1st Qu.:128   1st Qu.:223.0   1st Qu.:254.0  
 Median :158   Median :128   Median :224.0   Median :254.0  
 Mean   :158   Mean   :128   Mean   :224.8   Mean   :254.5  
 3rd Qu.:160   3rd Qu.:128   3rd Qu.:226.0   3rd Qu.:255.0  
 Max.   :174   Max.   :128   Max.   :246.0   Max.   :255.0  
> apply(data, 2, sd)
       km        ks      tc3m      tc3s 
2.8850013 0.0000000 4.4387982 0.5024

*scratches head*
*strokes beard*
WTF?

Maybe add a tiny test specifically calling s_mul_karatsuba and s_mul_toom and comparing for very large numbers?

tune.c has testing abilities built in (just an mp_cmp against already generated values). I would like to call it for that purpose instead of mirroring the relevant code in test.c, but I need to take it apart for that.

I was already working on it (it's all in main, it's quite some spaghetti code and in urgent need of refactoring ;-) ) when I stumbled over the problem described above.
Quite a WTF moment, I can tell you!

@czurnieden czurnieden force-pushed the faster_toom_cook_three branch 2 times, most recently from bd63311 to 787da71 Compare May 20, 2019 17:00
@czurnieden
Contributor Author

Got it to something acceptable by reducing the number of variables to one, but the precise cause is still unknown.

New version:

> summary(data)
       km               ks             tc3m            tc3s      
 Min.   : 85.00   Min.   :111.0   Min.   :167.0   Min.   :162.0  
 1st Qu.: 92.00   1st Qu.:126.0   1st Qu.:175.0   1st Qu.:200.8  
 Median : 93.00   Median :128.0   Median :178.0   Median :213.0  
 Mean   : 93.83   Mean   :125.8   Mean   :178.7   Mean   :213.0  
 3rd Qu.: 97.00   3rd Qu.:128.0   3rd Qu.:182.0   3rd Qu.:220.0  
 Max.   :104.00   Max.   :128.0   Max.   :191.0   Max.   :253.0  
> apply(data, 2, sd)
       km        ks      tc3m      tc3s 
 3.544964  4.493261  5.179710 14.389895

@minad
Member

minad commented May 21, 2019

@czurnieden This is also still work in progress? I added a few labels to the PRs. I hope you don't mind. Feel free to remove them if you think you are done!

@minad
Member

minad commented May 21, 2019

@czurnieden I saw you mentioned gprof; I recommend perf these days, which doesn't require recompilation and has other advantages AFAIK.

@czurnieden
Contributor Author

This is also still work in progress?

I'm waiting for the approval&merging of #280 to rebase on that.
What's the problem with #280?
Oh.
*blushes*

I saw you mentioned gprof - I recommend perf these days

I'm just used to it. I don't need more than a short glimpse at the output to see what the problem/bottleneck is. And recompiling? I don't use it very often, once or twice a year at most, so that doesn't matter.

@minad
Member

minad commented May 21, 2019

I'm just used to it. I don't need more than a short glimpse at the output to see what the problem/bottleneck is. And recompiling? I don't use it very often, once or twice a year at most, so that doesn't matter.

Ah, OK. I profile much more often and like to have these CPU performance-counter infos. I like to have timings per asm instruction from time to time ;)

@czurnieden czurnieden force-pushed the faster_toom_cook_three branch from 38a137f to 8b5f4a6 Compare May 21, 2019 14:21
@czurnieden
Contributor Author

czurnieden commented May 21, 2019

With git merge it is -Xours but with git rebase it is -Xtheirs. I hope I've chosen the right one ;-)

Oh, that dreaded callgraph.txt again *grrr*

@sjaeckel sjaeckel mentioned this pull request May 21, 2019
@czurnieden czurnieden force-pushed the faster_toom_cook_three branch from 8b5f4a6 to 5303d8e Compare May 21, 2019 14:31
@minad
Member

minad commented May 21, 2019

Oh, that dreaded callgraph.txt again grrr

I thought about removing it before, but it blows up the size of my commits. I have already changed something like 16K lines :-P

@minad minad closed this May 21, 2019
@minad minad reopened this May 21, 2019
@czurnieden czurnieden force-pushed the faster_toom_cook_three branch from 5303d8e to bfe08db Compare May 21, 2019 15:07
@czurnieden
Contributor Author

I thought before about removing [callgraph.txt]

It is a "nice to have", but I'm pretty sure any of the better IDEs around can do it, and do it much better. It should be generated on demand only.

I'll take a look at it later.

@minad
Member

minad commented May 21, 2019

It is a "nice to have", but I'm pretty sure any of the better IDEs around can do it, and do it much better. It should be generated on demand only.

I would keep generating it and just gitignore it and remove it from git. But we also have tommath_class.h which is also bad, but not as bad.

@czurnieden
Contributor Author

I would keep generating it and just gitignore it and remove it from git

Was quickly done, see #283

But we also have tommath_class.h which is also bad, but not as bad.

Yeah, true.
But that needs a bit more work to get rid of.

@sjaeckel
Member

@minad fine for you to merge even with the multi-clears instead of a single one?

@czurnieden
Contributor Author

czurnieden commented May 21, 2019

even with the multi-clears instead of a single one?

There is no mp_init_size_multi?
Or init all and grow later?

@sjaeckel
Member

There is no mp_init_size_multi?

probably mp_init_multi() and mp_grow()?

@czurnieden czurnieden force-pushed the faster_toom_cook_three branch from bfe08db to bacbacb Compare May 21, 2019 16:14
@sjaeckel sjaeckel force-pushed the faster_toom_cook_three branch from bacbacb to 8056a60 Compare May 21, 2019 16:20
@minad
Member

minad commented May 21, 2019

Still change to init_multi/grow?

Just a technical question, since I wonder about it: what is causing the slowdown? More allocations? If you hit the malloc cache I wouldn't expect such a big difference. I would have expected all the time to be eaten up by the numerics code.

@czurnieden
Contributor Author

Still change to init_multi/grow?

IMO no.

I know that realloc is a bit slower than a fresh allocation, but this is quite a lot, indeed!

@minad
Member

minad commented May 21, 2019

Well, realloc can be much faster, but if it really has to reallocate it is as slow as free+malloc. However, with modern allocators allocations are usually taken from a "hot" cache and are not that slow. In particular, in your tuner you are doing the same thing many times, so there should be plenty of touched memory ready.

@minad
Member

minad commented May 21, 2019

If allocations are really an issue and mp_init_size_multi would solve that, why not consider that?

@sjaeckel
Member

Just a technical question, since I wonder about it: what is causing the slowdown? More allocations? If you hit the malloc cache I wouldn't expect such a big difference. I would have expected all the time to be eaten up by the numerics code.

I know that realloc is a bit slower than a fresh allocation, but this is quite a lot, indeed!

okay, if both of you are wondering about this I'll merge later

@minad
Member

minad commented May 21, 2019

I haven't really looked, I am only speculating (I can look later). If you allocate a lot of memory before you free anything again, it will get very slow since the OS has to supply the memory. But in this case you have many alloc/free pairs, right? How much memory are we talking about, btw? Are your blocks larger than page size, or maybe several times larger? Then malloc could do mmap/munmap directly. In particular, munmap is unbelievably slow on Linux (I once had a case where I did the munmapping in a background thread for that reason). Did you strace?

Edit: So if we do mp_init+mp_grow, it means a very small allocation plus maybe a very large one. The result should be fully dominated by mp_grow, since mp_init will just give to and take from the cache.

@sjaeckel
Member

this change brought the tune build-job seriously down from >7min to <2min?

that's pretty impressive!

@czurnieden
Contributor Author

Did you strace?

I had no time yet to do anything to look into it, most of it is based on educated guesses.
But it's interesting, so I will dig deeper, no worry.

I also need to read through the documentation, because I just updated from the old LTS to the new one, that is, I skipped one LTS release. New kernel, new libc, new everything; I'll need a couple of days to get sufficiently up to date with all the gory details, and there are a lot of entrails to dig through ;-)

this change brought the tune build-job seriously down from >7min to <2min?

You complained, I delivered ;-)

@minad
Member

minad commented May 21, 2019

this change brought the tune build-job seriously down from >7min to <2min?

You complained, I delivered ;-)

Oh no. Now we have to reorder the Travis YAML again.

@sjaeckel
Member

should I merge or leave this open until the investigation by @czurnieden is finished?

@minad
Member

minad commented May 21, 2019

My only complaints are those many labels and mp_clear. But I don't want to stand in the way here. The investigation and maybe other improvements could still happen after merging.

@sjaeckel
Member

@czurnieden you can decide whether you want to put in more effort or if it's fine like this for you, please rebase and remove the WIP label if you think it's ready :)

maybe other improvements could still happen after merging

and that's also true

@czurnieden czurnieden force-pushed the faster_toom_cook_three branch from 8056a60 to 228e487 Compare May 21, 2019 21:45
@czurnieden
Contributor Author

czurnieden commented May 21, 2019

you can decide whether you want to put in more effort or if it's fine like this for you

It works, it is fast, it is ugly.
Two out of three is good enough for me ;-)

please rebase

But I just reba…oh.
I'm really envious of the large amount of free time y'all seem to have ;-)
Rebased, will wait until Travis shows all green to remove the label.

Re. investigation: it seems the new glibc does indeed ship a new memory management. I will compile an old one (glibc 2.23) for a short test and dig through the sources if that test supports my suspicion. That will take a couple of days; compiling glibc alone is not a small task.

But the result of that investigation won't be of much relevance for my code (it would make me wonder if it were at all), and if it is: "maybe other improvements could still happen after merging". The main branch here has not been named "develop" without a reason ;-)

@sjaeckel sjaeckel merged commit e8ae3e8 into libtom:develop May 22, 2019
@minad
Member

minad commented May 22, 2019

@czurnieden I did an experiment concerning allocations, running your tuner. I managed to get rid of all allocations in mp_init and all allocations smaller than MP_MIN_PREC. With allocations: 13.2 s; without allocations: 12.4 s. So all these (often needless) small allocations we are doing during init cost us around 5%.

This is somehow what I tried:

typedef struct {
   mp_digit sd[MP_STATIC_PREC]; /* use this for small allocs */
   int used, alloc;
   mp_sign sign;
   mp_digit *dp;
} mp_int;

@minad
Member

minad commented May 22, 2019

Then I did another experiment where I replaced malloc with a small shim which takes allocations smaller than MP_PREC from a simple cache (a singly linked list). With that I got 12.6 s in one run. Not as good, but still around 5%.

I will test LD_PRELOAD'ed jemalloc next.

So far my experiments at least show that glibc's malloc is not as good as I expected for our usage pattern. I thought the glibc malloc cache was better.

@minad
Member

minad commented May 22, 2019

Jemalloc: 11.8 s
Jemalloc + MP_STATIC_PREC: 11.7 s

But take the numbers with a grain of salt: no statistics, just a rough picture. Jemalloc seems to have a very good cache, better at least than my naive one, and MP_STATIC_PREC gives nearly no advantage.

@czurnieden
Contributor Author

The Toom-Cook algorithms do not need a lot of small allocations; the numbers are quite large, so your 5% is probably all you can get without specialized memory management.
glibc's allocator has become quite good over the last 10 years; I would have guessed that you would gain much more with a speed-optimized malloc implementation.

TomsFastMath could make much better use of a built-in memory management, but there is already a discussion about that, if I remember correctly.

But I did as you told me and ran some tracing programs. The most useful for C&P'ing the results here is probably ltrace; the others (strace, and valgrind with the tools callgrind and cachegrind) have either too little or way too much information. There are more programs, I know, but I'm still in the "bracketing" stage, reducing the problem set, so ltrace is sufficient.

The first four lines, the ones with the #defines, are the output of make tune; ltrace gets a fixed-rounds tune to look at.

Single rounds with the versions of tc3mul and tc3sqr I started this branch with.
Optimization: -O3

#define MP_DEFAULT_KARATSUBA_MUL_CUTOFF 97
#define MP_DEFAULT_KARATSUBA_SQR_CUTOFF 126
#define MP_DEFAULT_TOOM_MUL_CUTOFF      189
#define MP_DEFAULT_TOOM_SQR_CUTOFF      892 <

$ ltrace -S -c ./tune -L 3 -r 10 -M 100 -m 11
% time     seconds  usecs/call     calls      function
------ ----------- ----------- --------- --------------------
 26.92    0.266543          63      4228 memset
 24.36    0.241186          63      3792 free
 24.04    0.238011          62      3792 calloc
 16.93    0.167600          63      2620 realloc
  6.66    0.065979          62      1050 memcpy
  0.92    0.009141          63       144 clock_gettime
  0.05    0.000482         120         4 __printf_chk
  0.03    0.000262          65         4 __errno_location
  0.03    0.000262          65         4 strtol
  0.01    0.000112          18         6 SYS_brk
  0.01    0.000103          20         5 SYS_mmap
  0.01    0.000089          22         4 SYS_write
  0.01    0.000081          20         4 SYS_mprotect
  0.01    0.000061          20         3 SYS_access
  0.00    0.000048          16         3 SYS_fstat
  0.00    0.000047          23         2 SYS_openat
  0.00    0.000029          14         2 SYS_close
  0.00    0.000026          26         1 SYS_munmap
  0.00    0.000019          19         1 SYS_read
  0.00    0.000014          14         1 SYS_arch_prctl
------ ----------- ----------- --------- --------------------
100.00    0.990095                 15670 total

As you can see: TC3-sqr is quite an outlier

Single rounds with the versions of tc3mul and tc3sqr I started this branch with.
Optimization: -O2
(The timings are not really relevant here; the number and kind of calls is.)

#define MP_DEFAULT_KARATSUBA_MUL_CUTOFF 160 <
#define MP_DEFAULT_KARATSUBA_SQR_CUTOFF 128
#define MP_DEFAULT_TOOM_MUL_CUTOFF      228
#define MP_DEFAULT_TOOM_SQR_CUTOFF      255

$ ltrace -S -c ./tune -L 3 -r 10 -M 100 -m 11
% time     seconds  usecs/call     calls      function
------ ----------- ----------- --------- --------------------
 36.76    0.232124          61      3792 calloc
 36.46    0.230235          60      3792 free
 25.12    0.158633          60      2620 realloc
  1.41    0.008935          62       144 clock_gettime
  0.07    0.000463         115         4 __printf_chk
  0.04    0.000271          67         4 __errno_location
  0.04    0.000265          66         4 strtol
  0.02    0.000116          19         6 SYS_brk
  0.01    0.000092          18         5 SYS_mmap
  0.01    0.000086          21         4 SYS_write
  0.01    0.000074          18         4 SYS_mprotect
  0.01    0.000063          21         3 SYS_access
  0.01    0.000048          24         2 SYS_openat
  0.01    0.000045          15         3 SYS_fstat
  0.00    0.000027          13         2 SYS_close
  0.00    0.000024          24         1 SYS_munmap
  0.00    0.000018          18         1 SYS_read
  0.00    0.000014          14         1 SYS_arch_prctl
------ ----------- ----------- --------- --------------------
100.00    0.631533                 10392 total

Here Kara-mul is the outlier. With the exception of Kara-mul, it is also the result I got with my old system. (Got a glibc 2.23 compiled *phew*, no difference. Will try it with an old kernel later.)

With the current versions and -O3

Single rounds mp_init_size

$ ltrace -S -c  ./tune -L 3 -r 10 -M 100 -m 11
% time     seconds  usecs/call     calls      function
------ ----------- ----------- --------- --------------------
 30.62    0.313655          67      4615 memset
 21.00    0.215102          68      3145 realloc
 20.16    0.206486          67      3072 free
 20.14    0.206261          67      3072 calloc
  6.93    0.070965          67      1050 memcpy
  0.96    0.009820          68       144 clock_gettime
  0.05    0.000508         127         4 __printf_chk
  0.04    0.000402         100         4 __errno_location
  0.04    0.000387          96         4 strtol
  0.01    0.000121          20         6 SYS_brk
  0.01    0.000108          21         5 SYS_mmap
  0.01    0.000089          22         4 SYS_mprotect
  0.01    0.000086          21         4 SYS_write
  0.01    0.000067          22         3 SYS_access
  0.01    0.000055          18         3 SYS_fstat
  0.00    0.000046          23         2 SYS_openat
  0.00    0.000035          17         2 SYS_close
  0.00    0.000029          29         1 SYS_munmap
  0.00    0.000019          19         1 SYS_read
  0.00    0.000019          19         1 SYS_arch_prctl
------ ----------- ----------- --------- --------------------
100.00    1.024260                 15142 total

And with mp_grow


$ ltrace -S -c ./tune -L 3 -r 10 -M 100 -m 11
% time     seconds  usecs/call     calls      function
------ ----------- ----------- --------- --------------------
 28.81    0.264157          66      3945 memset
 21.88    0.200644          65      3072 free
 21.85    0.200405          65      3072 calloc
 18.54    0.170003          68      2475 realloc
  7.66    0.070236          66      1050 memcpy
  1.04    0.009559          66       144 clock_gettime
  0.05    0.000501         125         4 __printf_chk
  0.05    0.000431         107         4 __errno_location
  0.04    0.000388          97         4 strtol
  0.01    0.000124          20         6 SYS_brk
  0.01    0.000120          24         5 SYS_mmap
  0.01    0.000096          24         4 SYS_mprotect
  0.01    0.000079          19         4 SYS_write
  0.01    0.000074          24         3 SYS_access
  0.01    0.000054          18         3 SYS_fstat
  0.01    0.000053          26         2 SYS_openat
  0.00    0.000036          18         2 SYS_close
  0.00    0.000032          32         1 SYS_munmap
  0.00    0.000023          23         1 SYS_read
  0.00    0.000018          18         1 SYS_arch_prctl
------ ----------- ----------- --------- --------------------
100.00    0.917033                 13802 total

Current version with LTM -O2, tune -O3

#define MP_DEFAULT_KARATSUBA_MUL_CUTOFF 162 <
#define MP_DEFAULT_KARATSUBA_SQR_CUTOFF 128
#define MP_DEFAULT_TOOM_MUL_CUTOFF      191 <
#define MP_DEFAULT_TOOM_SQR_CUTOFF      202

% time     seconds  usecs/call     calls      function
------ ----------- ----------- --------- --------------------
 33.86    0.209098          66      3145 realloc
 32.20    0.198796          64      3072 calloc
 32.18    0.198707          64      3072 free
  1.49    0.009174          63       144 clock_gettime
  0.08    0.000482         120         4 __printf_chk
  0.06    0.000355          88         4 __errno_location
  0.05    0.000295          73         4 strtol
  0.02    0.000102          17         6 SYS_brk
  0.01    0.000087          21         4 SYS_write
  0.01    0.000080          16         5 SYS_mmap
  0.01    0.000067          16         4 SYS_mprotect
  0.01    0.000049          16         3 SYS_access
  0.01    0.000043          14         3 SYS_fstat
  0.01    0.000033          16         2 SYS_openat
  0.01    0.000033          16         2 SYS_close
  0.00    0.000023          23         1 SYS_munmap
  0.00    0.000014          14         1 SYS_read
  0.00    0.000013          13         1 SYS_arch_prctl
------ ----------- ----------- --------- --------------------
100.00    0.617451                  9477 total

Same with -O2 (both)

#define MP_DEFAULT_KARATSUBA_MUL_CUTOFF 156 <
#define MP_DEFAULT_KARATSUBA_SQR_CUTOFF 128
#define MP_DEFAULT_TOOM_MUL_CUTOFF      193 <
#define MP_DEFAULT_TOOM_SQR_CUTOFF      195

% time     seconds  usecs/call     calls      function
------ ----------- ----------- --------- --------------------
 33.26    0.207328          65      3145 realloc
 32.57    0.203006          66      3072 calloc
 32.40    0.201967          65      3072 free
  1.48    0.009204          63       144 clock_gettime
  0.08    0.000483         120         4 __printf_chk
  0.06    0.000377          94         4 __errno_location
  0.05    0.000306          76         4 strtol
  0.02    0.000109          18         6 SYS_brk
  0.02    0.000101          20         5 SYS_mmap
  0.01    0.000083          20         4 SYS_mprotect
  0.01    0.000077          19         4 SYS_write
  0.01    0.000062          20         3 SYS_access
  0.01    0.000049          16         3 SYS_fstat
  0.01    0.000044          22         2 SYS_openat
  0.01    0.000032          16         2 SYS_close
  0.00    0.000028          28         1 SYS_munmap
  0.00    0.000017          17         1 SYS_read
  0.00    0.000017          17         1 SYS_arch_prctl
------ ----------- ----------- --------- --------------------
100.00    0.623290                  9477 total

No difference between [LTM -O2, tune -O3] and both at -O2; all well inside the error bars. It would have made me wonder if there were, but now I know that tune itself does not influence the result significantly.

The most significant outlier is TC3sqr within the very first measurement, so I'll start from there.

@minad
Member

minad commented May 23, 2019

The Toom-Cook algorithms do not need a lot of small allocation, the numbers are quite large, so your 5% is probably all you can get without a specialized memory management. GlibC's one got quite good over the last 10 years, I would have guessed that you would gain much more with a speed optimized malloc implementation.

I don't understand what you are saying here. I just tested a somewhat specialized memory management and I tested jemalloc (probably the best allocator around) and got between 5% and 10% speedup. At some point it is not possible to win more, since what is expensive is not the allocation but touching the memory (stalling the CPU and, in the worst case, mapping pages which have not been touched before).

And interestingly this is aligned with what you are measuring. There is the memset call which costs a lot of time, and for some reason it does not appear anymore in the last version. We are zeroing the upper digits quite often, and this could be a problem. At least some of the memsetting can now be disabled by disabling #255.

EDIT: I did another experiment where I disabled many of the MP_ZERO_DIGITS memsets (not all of them, so that things don't break; many functions rely on the upper digits being zero). I kept the calloc in mp_init_size, however. With this I got around 4% to 6% speedup when running tune_it. This is only due to touching the memory less often.

EDIT2: But if I recall correctly, I never saw such outliers as you do. It is always ordered like kmul<ksqr<tmul<tsqr, with somewhat reasonable numbers (like a factor of two between the different muls and the different squarings, respectively). What I found were relatively consistent speedups of around 5% for the various memory tunings (reduced number of mallocs, better jemalloc, touching memory less often).

@czurnieden
Contributor Author

I just tested somehow specialised memory managment

With "specialized" I meant an ingrained MM, carefully tailored to LTM, where LTM can even give feedback. Feedback would be useful in the example of Toom-Cook, where the (maximum) size of the variables is known in advance: they could tell the MM to keep their memory until the end, overwrite instead of doing deep copies, et cetera. And the other way around, too: rewrite LTM to make the MM's life easier. I would expect a gain of 30% and more in the case of Toom-Cook.
Having full control over the MM also means that you can tailor the size of the buckets, which can give the many-small-allocations use case a good kick. You can also get rid of all the code for the safety features and edge cases you need to implement when you do not know your input (more chances for the unsuspecting user to get holes in their feet ;-) ).

There are most likely even sillier things possible, but I'm not as up to date on the current state of MM as you are; it is a long time ago (10, 15 years?) that I had to write my own malloc, because what was available at that time was not usable for my application.

I tested jemalloc (probably the best allocator around)

Now that I had the time to look it up, there are others that claim to be better ;-)

But all of these benchmarks you can find with Google's help only answer the question "Better for what?" and that's the crux of the biscuit here.
As a general purpose allocator on the other side jemalloc is really good!

There is the memset call which costs a lot of time, and for some reason it does not appear anymore in the last version.

It is the -O3 optimization that uses memset; -O2 does not. I should have put some lines between the listings, sorry.

ltrace is not the most verbose, but others, e.g. valgrind's massif, honour their name well and have a massive amount of output, way too much for a post here.

We are zeroing the upper digits quite often and this could be a problem.

Yeah, but LTM is used for cryptography, and security does not come without cost. We can't allow data to linger in memory for longer than absolutely necessary, so that's not an option.
Or make it an option?

But if I recall correctly, I never saw any such outliers as you do.

That is a brutal one and I'm pretty sure you would recall it ;-)

The rest is all explainable: the changes in speed are as expected, and I can even explain the difference between init_size and mp_grow with -O3 now (it is the time needed for the change from the smallest bucket to an extra-large one for the latter). But that outlier? More than 300% slower?
It could even be a bug in the optimizer (outlier with -O3, normal with -O2), but I tested with a couple of GCCs and clangs.
Haven't tried the Intel compiler yet; will give it a chance tonight to check that assumption.

I could spend a whole day testing everything and the kitchen-sink but I think it's time now for the assembler-dump.
*sigh*

@minad
Member

minad commented May 23, 2019

With "specialized" I meant an ingrained MM, carefully tailored to LTM, where LTM can even give feedback. Feedback would be useful in the example of Toom-Cook, where the (maximum) size of the variables is known in advance: they could tell the MM to keep their memory until the end, overwrite instead of doing deep copies, et cetera. And the other way around, too: rewrite LTM to make the MM's life easier. I would expect a gain of 30% and more in the case of Toom-Cook.

What one could try is some kind of region allocator which just reserves a region at the beginning of some bigger operation and never frees; everything is freed at the end of the operation. This is the only optimization I can imagine which could give an advantage.

But instead of optimizing allocation behavior the lowest hanging fruit I am seeing is reducing the amount of MP_ZERO_DIGITSs.

Having full control over the MM also means that you can tailor the size of the buckets, which can give the many-small-allocs use case a good kick. You can also get rid of all the code for the safety-features and edge-cases you need to implement when you do not know your input (more chances for the unsuspecting user to get holes in their feet ;-) ).

I am pretty sure this won't help us much. The mallocs usually already use fine-grained buckets, and you might even be able to tune them via some settings. The only thing you could win is inlining some kind of fast-path code: accessing the bucket if there is something in it, and calling into the allocator on the slow path. I think the Linux kernel slab allocator does something like this.

Now that I had the time to look it up, there are others that claim to be better ;-) But all of those benchmarks you can find with Google's help only answer the question "Better for what?", and that's the crux of the biscuit here. As a general-purpose allocator, on the other hand, jemalloc is really good!

Sure, it always depends on the use case. But according to my recent experience these general-purpose allocators are already so good at selecting and returning a chunk of memory that accessing the memory is the far more expensive operation. Good allocators also ensure that you allocate recently used chunks first, since they are already hot in the cache, etc.

Where these allocators usually differ is in their fragmentation, their long-runtime behavior, and how they cope with many threads. Not relevant to our microbenchmarks ;)

Yeah, but LTM is used for cryptography, and security does not come without some cost. We can't allow data to linger in memory for longer than absolutely necessary, so that's not an option.
Or make it an option?

We could at least make it an option. It is something like a 5% cost which is basically added to every fundamental operation. From the crypto/mitigation point of view, I think it would also be OK to just leave the upper digits uninitialized or let them contain garbage, since the mp_ints probably live only for a short period of time and the lower digits still contain the probably much more valuable data. What is important is zeroing the memory before free (and this is what we are doing).
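"Zeroing before free" has a subtlety worth noting: a plain memset just before free is a dead store the optimizer may legally delete. A common workaround is to call memset through a volatile function pointer (the same trick explicit_bzero-style functions rely on). The name `secure_wipe` below is made up for this sketch and is not an LTM symbol:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* memset-before-free can be eliminated by the compiler as a dead
 * store; routing the call through a volatile function pointer keeps
 * it alive. Sketch only; real code might use explicit_bzero or
 * SecureZeroMemory where available. */
static void *(*const volatile memset_v)(void *, int, size_t) = memset;

static void secure_wipe(void *p, size_t n)
{
    memset_v(p, 0, n);
}
```

Whatever mechanism LTM uses, the point stands: the wipe must survive optimization, otherwise "zeroing before free" is zeroing in name only.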

But if we want to make it configurable, it is a bit hard. I tried it, but I am not familiar with how most of the algorithms work. I disabled some MP_ZERO_DIGITS calls, checked valgrind for uninitialized accesses, and looked a bit around the control flow, while trying not to break the testsuite.

It is the -O3 optimization that uses memset; -O2 does not. I should have put some blank lines between the listings, sorry.

Maybe enable MP_USE_MEMSET for benchmarking? That makes it clear where time is just eaten up by zeroing; that is the reason I introduced this option.
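The effect of such a compile-time switch can be sketched as follows. The `zero_digits` helper and the `mp_digit` typedef here are simplified stand-ins, not LTM's actual definitions; the point is only that the memset variant makes the zeroing cost show up under `memset` in a profile, while the loop variant charges it to the caller:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef unsigned int mp_digit;   /* stand-in; the real LTM digit type varies */

/* Sketch of an MP_USE_MEMSET-style switch: the same zeroing either as
 * an explicit loop (cost lands in the caller's profile entry) or as a
 * memset call (cost lands under memset, easy to spot with gprof/perf). */
static void zero_digits(mp_digit *d, size_t n)
{
#ifdef MP_USE_MEMSET
    memset(d, 0, n * sizeof(mp_digit));
#else
    while (n-- > 0) {
        *d++ = 0;
    }
#endif
}
```

Either branch produces the same result; only the profiling attribution (and, for large n, the speed) differs.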

I could spend a whole day testing everything and the kitchen-sink but I think it's time now for the assembler-dump.

If you are digging around further, give perf a try :)
It subsumes ltrace/strace and per-machine-instruction profiling. You can even look at the call stack inside the kernel...

@czurnieden
Contributor Author

I am pretty sure, this won't help us much.

The list was just an example of what can be done if one is willing to go to the extreme, not as suggestions for LTM ;-)

I think it would also be ok to just […]

No, LTM is not only meant for cryptography, it actually gets used in cryptographic programs, so no fiddling here, at least not with the default behaviour. The cost involved for that kind of security is normally accepted as unavoidable if it is not too much, and 5-10% is "not too much" for most people. (Thanks to MS? ;-) )

It is tempting, yes, but I wouldn't touch cryptographic code without a very good reason.

We can only make it optional and offer some kind of make non-crypto target or something in that direction, but as you rightfully noted:

it is a bit hard

Yeah, a lot of work.

Maybe enable MP_USE_MEMSET for benchmarking?

For the autotuner only:

No zeroing of buffers and digits

MP_DEFAULT_KARATSUBA_MUL_CUTOFF 92
MP_DEFAULT_KARATSUBA_SQR_CUTOFF 128
MP_DEFAULT_TOOM_MUL_CUTOFF      151 <--
MP_DEFAULT_TOOM_SQR_CUTOFF      200 <--

=================================================

No memset(3) for the zeroing of buffers and digits

MP_DEFAULT_KARATSUBA_MUL_CUTOFF 95
MP_DEFAULT_KARATSUBA_SQR_CUTOFF 127
MP_DEFAULT_TOOM_MUL_CUTOFF      167
MP_DEFAULT_TOOM_SQR_CUTOFF      215

=================================================

With memset(3) for the zeroing of buffers and digits

MP_DEFAULT_KARATSUBA_MUL_CUTOFF 98
MP_DEFAULT_KARATSUBA_SQR_CUTOFF 128
MP_DEFAULT_TOOM_MUL_CUTOFF      165
MP_DEFAULT_TOOM_SQR_CUTOFF      247 <--

For the testsuite (run 10 times)

make test_standalone
time for i in {1..100};do ./test;done

real	33m41,921s
user	33m40,540s
sys	0m0,668s

=================================================

CFLAGS=" -DMP_USE_MEMSET " make test_standalone
time for i in {1..100};do ./test;done

real	33m48,759s
user	33m47,180s
sys	0m0,780s

A bit worse with the system memset if the numbers are very large, but that drops into insignificance if we assume that the testsuite comes closer to normal use than the autotuner. You have the overhead of the function call and the checks & balances, but memset is also highly optimized.
Did you have different results?

@minad
Member

minad commented May 23, 2019

No, LTM is not only meant to, it actually gets used in cryptographic programs, so no fiddling here, at least not with the default behaviour. The cost involved for that kind of security is normally accepted as unavoidable if it is not too much and 5-10% is "not too much" for most people. (Thanks to MS? ;-) )
It is tempting, yes, but I wouldn't touch cryptographic code without a very good reason.
We can only make it optional and offer some kind of make non-crypto or something in that direction but as you rightfully noted:

For sure I would only make it optional. However, I would actually like to know what the reason is for doing this zeroing. What kind of attacks are prevented by it? If an attacker has the possibility to read leaked data from the upper digits, I assume the probability is high that they can also read leaked data from the lower digits.

The usual mitigation of zeroing before free (or even a hardened malloc that always zeros everything) is in place to prevent leaking data from a critical, cryptographic and hopefully safe part of an application to a potentially unsafe part of the application. Hardened allocators randomize the allocated addresses or segregate allocations into some kind of pools in order to prevent leaking data from one part to the other.

Zeroing the memory reduces the time in which the critical data is around. This is helpful if the data could leak in other ways (swapping, or some more subtle channels?). But in the lower digits we still have the finished computation lying around, so this argument seems kind of moot. And against swapping you should use locked memory, for example; we are not doing that by default either.

For all these reasons I concluded that zeroing the upper digits is kind of useless from a security point of view, in contrast to zeroing before free.

A bit worse with system memset if the numbers are very large, but drops into insignificance if we assume that the testsuite comes closer to normal use than the autotune. You have the overhead of the function call and the checks&balances but memset is also highly optimized.
Did you have different results?

I don't think there will be a big difference. For small memsets we lose, for big ones we win by using memset. But during profiling, using a memset call helps to see the associated costs; otherwise the cost will just be assigned to the function doing the memset manually. This is what I meant. However, as we've seen before, at -O3 loops are replaced by memset and vice versa (or by an inline rep stosb or something).

@czurnieden
Contributor Author

However I would actually like to know what the reason is to do this zeroing.

Because we have absolutely no information about, let alone influence on, the environment, we have to assume the worst (within reasonable limits, of course).

But in the lower digits we still have the finished computation lying around, so this argument seems kind of moot

If you have an operation that tries to remove information, parts of that information can still reside in the upper digits.

Oversimplified example: 123456789 % 12345701 = 12345480 (overwriting: rem(a,b,a)), but the last 9 is still in memory, and if that was a key, or carried information about the key, you have already got one ninth of it.
Yes, it is way more complicated in real life, I know.
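The oversimplified example above can be made concrete with a toy sketch. Decimal characters stand in for mp_digits here; the helper `leaked_digit` is invented for this illustration and has nothing to do with LTM's real digit representation:

```c
#include <assert.h>
#include <string.h>

/* Toy model of the leak: a 9-"digit" value is reduced in place to an
 * 8-"digit" remainder. Without zeroing the now-unused top digit, the
 * old '9' survives in the buffer. Returns that leftover digit. */
static char leaked_digit(void)
{
    char digits[10] = "123456789";   /* a = 123456789            */
    const char *rem = "12345480";    /* a % 12345701             */
    memcpy(digits, rem, strlen(rem)); /* overwrite low digits only */
    return digits[8];                /* top digit was never cleared */
}
```

The buffer now holds the remainder plus one stale digit of the original operand, which is precisely what MP_ZERO_DIGITS on the unused upper digits is there to prevent.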

But during profiling, using a memset call helps to see the associated costs.

Ah, now I see what you meant.

@minad
Member

minad commented May 23, 2019

If you have an operation that tries to remove information, parts of that information can still reside in the upper digits.

Yes, probably this is a strong enough argument. But given the other facts about memory allocation and unlocked memory, it might still not be worth it. But as I said, if we were to remove zero_digits, we would make it optional either way. And I won't do it any time soon, since I don't think it is a pressing issue and the potential for breaking stuff is too great.
