
Conversation

@kkm000 (Contributor) commented Jul 20, 2020

Closes #3192

@kkm000 (Contributor, Author) commented Jul 21, 2020

@jtrmal, if you have a few minutes, could you PTAL? I checked the MKL build, but I do not have ATLAS handy, although the changes there are minimal.

@kkm000 requested a review from jtrmal, July 21, 2020 08:33

@jtrmal (Contributor) commented Jul 21, 2020 via email

@kkm000 (Contributor, Author) commented Jul 21, 2020

@jtrmal Thanks!

@kkm000 merged commit bff2d7e into kaldi-asr:master, Jul 21, 2020
@kkm000 deleted the 2007-configure-kill-threadedmath branch, July 21, 2020 09:08

@psmit (Contributor) commented Aug 12, 2020

Unfortunately, this PR broke static MKL compilation. Originally, the link line for static MKL contained -l/opt/intel/mkl/lib/intel64/libmkl_core.a, but this has been changed to -lmkl. For static compilation the search path is not set, so configure fails with the message: *** configure failed: Cannot validate the MKL switches ***

Possible fixes are:

  1. Also set the search path (-L) for static compilation, like the non-static case does at https://github.com/kaldi-asr/kaldi/blob/master/src/configure#L283 (a rough sketch follows after this list). Not sure if that would solve the problem.
  2. Restore the old link line by changing https://github.com/kaldi-asr/kaldi/blob/master/src/configure#L295 from:

linkline+=" -l$file"

to

if ! $static; then
  linkline+=" -l$file"
else
  linkline+=" -l$libfile"
fi
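
For reference, a rough sketch of what fix 1 above might look like (untested; MKLLIBDIR is just a stand-in for however configure names the MKL library directory):

# Hypothetical: also add the MKL library directory to the search path when
# linking statically, so that the short -l names can be resolved.
if $static; then
  linkline+=" -L$MKLLIBDIR"
fi
linkline+=" -l$file"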

I'm not sure what the best way to do static linking is, but I'm happy to provide a PR if needed.

@danpovey (Contributor) commented Aug 13, 2020 via email

@kkm000 (Contributor, Author) commented Aug 21, 2020

@psmit, I've seen your PR, thanks. I am wondering what benefits you are getting from linking MKL statically, and what the binary sizes are. Do you have an approximate idea of how much this increases the binary sizes? They are about 20 GB on Linux x64 with gcc and dynamic MKL.

I think (better said, thought) that we should switch to their "single runtime library" linking; otherwise there are conflicts when calling Kaldi libraries from Python. But that is not available as a static library. Understanding your use case would be very helpful for us.

@daanzu (Contributor) commented Aug 21, 2020

@kkm000 I haven't tested this PR or @psmit's yet, but I statically link MKL to simplify distribution. My project only needs to distribute (from Kaldi) one shared library (which is used by Python via cffi) and a few CLI programs, and doing so is far simpler when linking MKL & OpenFST statically. I would hate to lose that ability.

@psmit (Contributor) commented Aug 22, 2020

@kkm000 For me, the benefit is being able to distribute very simple/lean/small docker images. During decoding we only have a single binary application, and static compilation with MKL gives a smaller result than the binary plus the MKL shared libraries. Also, if we installed the shared MKL libraries through a package manager, it would pull in a lot of shared libs that aren't even needed.

That said, static MKL compilation is sometimes a pain. We link it into a Rust binary, and using link groups isn't really supported there (unless you dive deep into the Cargo config, which is what we did).

@psmit (Contributor) commented Aug 22, 2020

One thing I now realize is that in my case I do the linking of the final binary outside the Kaldi tree. Out of "habit" I specify static MKL with the configure command... I guess things should also work if I specify "static Kaldi" + "shared MKL" during configure, and then, when I link my binary with the Kaldi static libs, specify static MKL myself...

(The only downside is that I have to make sure I have a working shared MKL on my machine to pass the ./configure step.)

I'll try to find some time next week to experiment with this. I agree that for compiling make all it rarely makes sense to use static MKL.

@kkm000 (Contributor, Author) commented Aug 22, 2020

For me, the benefit is to be able to distribute very simple/lean/small docker images. During decoding, we only have a single binary application, and static compiling with MKL gives smaller binaries than binary + mkl shared libraries.

Thanks, interesting; I did not expect that, a good observation. Are you sure that all possible kernels are linked in, i.e., AVX, AVX2 and AVX-512 at the minimum? There is a chapter in the 1300-page MKL manual dedicated to this. I read it once around 2014; it's probably time to get back to it... I should ask Mr. OK Google to read it to me as I'm drifting off to sleep!
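
One way to sanity-check that at run time, assuming the MKL_VERBOSE tracing switch is available in the MKL version being linked (the binary name below is a placeholder):

# Each traced MKL call reports the instruction set it dispatched to
# (e.g. AVX2, AVX512), and the first verbose line prints the detected CPU.
MKL_VERBOSE=1 ./some-kaldi-binary 2>&1 | grep MKL_VERBOSE | head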

Also set the search path (-L) for static compilation (like for non-static https://github.com/kaldi-asr/kaldi/blob/master/src/configure#L283). Not sure if that would solve the problem.

As is, no. But there is a little-known syntax, -l: (the same lowercase el as in -l), that you can use.

  • gcc          -Lmy/libdir -lone -ltwo is the same as gcc -Lmy/libdir -l:libone.so -l:libtwo.so
  • gcc --static -Lmy/libdir -lone -ltwo is the same as gcc -Lmy/libdir -l:libone.a  -l:libtwo.a

-l: is authoritative:

  • -lX means -l:libX.so if --static is absent, or -l:libX.a if --static is present, but
  • each of -l:libX.a and -l:libX.so means what it means.

Do not miss that -l adds the lib prefix in addition to the suffix, while -l: wants the full file name part. If no directory is given with -l:, the usual rules apply over the paths given by -L (and the implicit library paths).

I find it much more palatable than

-Wl,-Bstatic -lone -Wl,-Bdynamic -ltwo
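
For instance, a static, sequential, LP64 MKL link could look roughly like this (library set as in Intel's link-line advisor for a sequential build; the path is illustrative):

# Link MKL statically while everything else stays dynamic; no --static needed,
# since -l: names the .a files explicitly. The --start-group/--end-group pair
# lets ld resolve the circular references among the three MKL archives.
MKLLIB=/opt/intel/mkl/lib/intel64
g++ -o my-program my-program.o \
  -L$MKLLIB -Wl,--start-group \
  -l:libmkl_intel_lp64.a -l:libmkl_sequential.a -l:libmkl_core.a \
  -Wl,--end-group -lpthread -lm -ldl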

if we would install the shared MKL libraries through a package manager

Easy, but you're still looking at ~700 MB. Ah, wait. I totally forgot that you can delete the *_mc{,2,3}.so and *_mic.so kernels, the unneeded threading layers (*_thread.so, if you are building with libmkl_sequential.so, as we do, or the .a), and the *_ilp64.so matrix-index interface layer. I don't do that, but should. That will get you down to about 250-270 MB, much better!
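
A rough sketch of that trimming (a sketch only; check that nothing you ship needs the removed layers before deleting):

# Prune the parts of a shared MKL install that a sequential, LP64 Kaldi build
# never loads: the legacy *_mc* and the *_mic* kernels, the threading layers
# other than libmkl_sequential.so, and the ILP64 interface layer.
cd /opt/intel/mkl/lib/intel64
rm -f *_mc.so *_mc2.so *_mc3.so *_mic.so
rm -f *_thread.so     # keep libmkl_sequential.so
rm -f *_ilp64.so      # keep the lp64 interface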

Here's the complete recipe. The idea is to apt-get download only specific packages, and then force them into the image despite dpkg crying rivers of tears about broken dependencies. A complication is that the names of 4 packages are known in advance, but the names of their 3 dependencies are patterned differently, so some grepping of the control files of the first 4 deb packages is needed.

The Dockerfile below does these tricks in a debian image, and copies only the required part of MKL into an empty, non-runnable image with MKL alone:

https://github.com/burrmill/burrmill/blob/a42a135bade4ec36f7657976435ad7babfc38fb8/lib/build/mkl/Dockerfile

The script run in the Dockerfile may not be the easiest to follow, but I'm sure you'll figure it out:

https://github.com/burrmill/burrmill/blob/a42a135bade4ec36f7657976435ad7babfc38fb8/lib/build/mkl/build_mkl.sh

Basically, what it does is the following (sketched in shell right after this list):

  • Download the 4 required packages with the known names (lines 27-47); I figured out which ones we need. NB: these contain only the .so libraries.
  • Extract all their dependencies, grep out the 3 named intel-comp-l-all-vars-..., intel-comp-nomcu-vars-... and intel-openmp-... (folded into one egrep expression), sort them with -u and make sure there are exactly 3 (lines 49-59).
  • Download them too.
  • Install all (now 7) debs (line 73).
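
In shell terms, the steps above boil down to roughly this (an untested paraphrase; the actual package names and the exact grep live in the linked script):

# The 4 runtime .deb names are known in advance (lines 27-47 of build_mkl.sh);
# plain placeholders here.
apt-get download "$PKG1" "$PKG2" "$PKG3" "$PKG4"
# Pull each package's Depends field, keep only the three version-patterned
# helpers, and de-duplicate.
deps=$(for d in *.deb; do dpkg-deb --field "$d" Depends; done \
  | tr ',' '\n' \
  | grep -Eo 'intel-(comp-l-all-vars|comp-nomcu-vars|openmp)-[^ ),]*' \
  | sort -u)
[ "$(echo "$deps" | wc -l)" -eq 3 ]   # exactly 3 expected
apt-get download $deps
dpkg --force-depends -i ./*.deb       # install all 7 despite the unmet dependencies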

If you go back to the Dockerfile, you'll see that /opt from that image is copied in the next stage into an empty image FROM scratch. Push and keep this image; it depends only on the MKL version. You can't run a container from it, but that's ok. You can reuse it for 6 months to a year, basically until you get a CPU with a new architecture; they tweak the kernels, but skipping a release or two won't put you behind, performance-wise.

Then I do the same thing with CUDA (look in lib/cuda if you want). Again, you can keep it unchanged until you get hold of a new GPU that needs a later CUDA, or Kaldi is updated with a new arch. I call these images drones: they are unrunnable, but can be added to other images with COPY --from=mkl /opt /opt.

When it comes to making Kaldi, I augment the prepared toolchain container, going by the name cxx below, with the MKL and CUDA. cxx is just a Debian image with the apt-get installed toolchain and all the lib*-dev package dependencies, a normal, boring image.

The syntax here is likely unfamiliar, but you'll guess what's going on: the first 3 steps pull the 3 images in parallel, and, when all 3 are available, the 4th layers MKL and CUDA into cxx using this trivial Dockerfile.cudamkl.

One thing I now realize is that in my case I do the linking of the final binary outside the kaldi tree. Out of "habit" I specify static mkl with the configure command... I guess things should also work if I specify "static kaldi" + "shared mkl" during configure, and once I link my binary with the kaldi static libs, I specify static mkl myself...

I am not sure I grok the idea. Of course, unlike the .so case, you do not link a static Kaldi library; you only build an $AR archive of .o files. These contain references to whatever they call, essentially the MKL libs. The final magic happens only when you build the final binary. But I am not sure whether the linker cares if the referenced libraries were pure object archives (.a) or DLLs (.so). You can easily see with readelf how these unresolved symbols are referenced in the .o files. I do not remember whether .a vs. .so is baked into the .o file by the compiler. I guess it is.
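
A quick way to peek at that, using one of the Kaldi archives as an example (archive path and the grep pattern are illustrative):

# List the undefined (UND) symbols in the archived .o files that look like
# BLAS/MKL entry points; these are the references the final link must resolve.
readelf --symbols --wide matrix/kaldi-matrix.a | grep ' UND ' | grep -Ei 'mkl|cblas' | sort -u | head
# The same with nm: 'U' marks an undefined symbol.
nm matrix/kaldi-matrix.a | awk '$1 == "U"' | grep -Ei 'mkl|cblas' | sort -u | head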

(Only downside, I have to make sure I have a working shared MKL on my machine to pass the ./configure step).

Patch it out. A sleight of sed will take care of it as part of the build (see, e.g., this; the "sorry" comment applies to the debugging phase on the local machine, when it mangles files in your Git workdir). The test is useless for you, since you are building for a target different from your machine.

Also, keep in mind that Kaldi's build is designed to allow a build by a non-privileged user. We "hardcode" paths into all binaries, but that does not mean they cannot be shuffled around: the rpath in the binary is only the default. If ld.so cannot find the .so file there, it searches the system library paths. These are listed in the files /etc/ld.so.conf.d/*.conf. After all of them are in place, a single call to ldconfig reads them all and caches the result somewhere under /var/lib. ldd shows you the paths after this resolution; readelf shows the raw rpath directive in the .so. The .conf file is dropped into the MKL image but is irrelevant at build time (Kaldi looks for MKL in /opt/intel/mkl, so this is a rare place where I did not have to pull any tricks).
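
Concretely, after shuffling the libraries around, the whole dance is just this (paths are illustrative):

# Register the new locations with the dynamic loader and rebuild its cache;
# if a .so is not found at the rpath baked into a binary, ld.so falls back
# to these system paths.
echo /opt/intel/mkl/lib/intel64 > /etc/ld.so.conf.d/mkl.conf
echo /opt/kaldi/lib > /etc/ld.so.conf.d/kaldi.conf
ldconfig
ldd /opt/kaldi/bin/some-kaldi-binary                        # paths after resolution
readelf -d /opt/kaldi/lib/*.so | grep -Ei 'rpath|runpath'   # the raw directive in the .so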


This is the point where our paths diverge, but not entirely. After the build, I simply tarball the binaries and the kaldi-*.so libs (you can do that too, and COPY --from=previous_stage ... the same tarball, or only the libraries you need, into your target image). But when I build the VM image, I do essentially the same thing: pull the MKL and CUDA docker images, create a container without running it, extract it into /opt (docker export simply prints a tar of the container filesystem on its stdout) and kill the container. The crucially important thing is to invoke ldconfig after all the .so and ld.so.conf.d/*.conf files are extracted from the tarballs and images. Then you need not worry about the shuffled .so files any more.
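
The container-to-disk part of that is roughly the following (image and container names are made up):

# Materialize the drone's /opt on the target filesystem without ever running
# a container from the unrunnable image.
docker create --name mkl-drone my-registry/mkl:2020.1 /noop   # any command works; it is never run
docker export mkl-drone | tar -x -C / opt                     # export prints a tar of the container FS
docker rm mkl-drone
ldconfig    # once every .so and ld.so.conf.d/*.conf is in place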

I do not do that in this script, because of a technicality: this is a separate disk, and ldconfig is called at boot time, after the disk is mounted by the target VM. Since you want a working container, do it near the end of the container build instead.


I'll merge your change as is. I'm planning to work on configure anyway, and it's a perfectly valid emergency fix.

@psmit (Contributor) commented Aug 24, 2020

Thanks for merging and the extensive comment.

Regarding MKL, I'm pretty sure all the architectures are in the binary, but that is more a guess than something I checked (or read the manual for).

Many of the techniques mentioned, like multi-stage Dockerfiles, are indeed what we are using. It took us a lot of time to get our image sizes down, and for us that was back then, with static MKL.

In the near future I'm looking to benchmark things more on our side; if that happens, I'll share the results and observations here as well.

Successfully merging this pull request may close these issues.

Remove all threaded BLAS options from configure