Native code (Bindings, WebAssembly) #6
Description
An important part of the success of R and Python as data science tools is the ability to wrap native code (C/C++), so that memory- or CPU-critical algorithms can be written in C and avoid the overhead of the dynamic language.
For a great introduction to the Python data science use cases, check out this talk by Rob Story. It's a whirlwind tour of different tools designed to address different problems. Most of them rely on a C/C++ component under the hood, but all of them expose an easy-to-use Python API.
To use native code from a dynamic language you have to write bindings, which usually means writing C/C++ code that connects your language's native interface to the code you want to use. These bindings have to be compiled before anyone can use them. Some maintainers build the binaries when they publish a new release, so that users don't have to compile anything. Others rely on their users to compile the bindings themselves. I'm going to talk about the prebuilt binary use case here, as it is the most user-friendly option.
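To make the idea of calling native code concrete, here is a minimal sketch in Python. It uses the standard library's ctypes module, which loads an already-compiled shared library at runtime instead of compiling a dedicated binding, so it is a shortcut rather than the compiled-binding approach described above. The helper name native_strlen is made up for this example, and the library lookup is assumed to behave as it does on typical Linux and macOS systems.

```python
import ctypes
import ctypes.util

# Locate the C standard library. find_library may return None on some
# platforms; CDLL(None) then falls back to the symbols already loaded
# into the current process (this fallback is POSIX-only).
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Describe the native signature: size_t strlen(const char *s).
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def native_strlen(data: bytes) -> int:
    """Hypothetical wrapper: call C's strlen on a Python bytes object."""
    return libc.strlen(data)


print(native_strlen(b"hello"))  # 5
```

The argtypes and restype declarations play the role that hand-written glue code plays in a compiled binding: they tell the dynamic language how to convert its values to and from the native calling convention.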
In Python there is conda, which is designed specifically for distributing and installing prebuilt Python native bindings.
In Node there are node-pre-gyp and prebuild, both of which hook into npm install and try to download prebuilt binaries from a server the maintainer specifies, falling back to compiling them if no prebuilt binary is available.
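The download-then-fall-back strategy can be sketched in a few lines. This is only an illustration of the control flow, not how node-pre-gyp or prebuild are implemented: the function names, the file naming scheme, and the version number below are all assumptions, and the download and compile steps are passed in as plain callables so the sketch runs without a network or a compiler.

```python
import platform
from typing import Callable, Tuple


def prebuilt_name(module: str, version: str) -> str:
    """Build a hypothetical file name identifying one prebuilt binary."""
    system = platform.system().lower()
    machine = platform.machine().lower()
    return f"{module}-v{version}-{system}-{machine}.tar.gz"


def install_binding(
    name: str,
    download: Callable[[str], bytes],
    compile_from_source: Callable[[], bytes],
) -> Tuple[str, bytes]:
    """Try the prebuilt binary first; compile locally if it is unavailable."""
    try:
        return "prebuilt", download(name)
    except OSError:
        # No binary was published for this platform, or the server could
        # not be reached, so fall back to building from source.
        return "compiled", compile_from_source()


def missing(name: str) -> bytes:
    """Stand-in for a download that finds no binary for this platform."""
    raise FileNotFoundError(name)


# "1.0.0" is a placeholder version used only for this example.
how, binary = install_binding(
    prebuilt_name("mknod", "1.0.0"), missing, lambda: b"built locally"
)
print(how)  # compiled
```

Keying the file name on operating system and CPU architecture hints at why maintainers end up building one binary per platform for every release.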
To actually write bindings in Node there are a couple of third-party modules you can use to make the process easier: bindings and nan. One big advantage of nan is that it gives you a C++ compatibility layer that sits between node and your code. nan focuses on backwards compatibility as much as possible, so when the Node.js C++ API makes breaking changes, nan will hopefully be able to absorb them without making breaking changes of its own. This means that if you use nan, you hopefully shouldn't have to rewrite any C++ code when new versions of node are released.
In practice, writing native bindings for node is still quite low level, even with these helper utilities. For example, the module mknod, which exists to wrap the mknod syscall, is ~8 files and a couple hundred lines just to make this one line callable from JS: https://github.com/mafintosh/mknod/blob/master/mknod.cc#L16. Maintaining this module means compiling new versions of it when new versions of nan are released, making any necessary code changes, and uploading the prebuilt binaries.
WebAssembly
For browsers there is a proposal called WebAssembly (.wasm) that aims to standardize a way to run a compiled, low-level bytecode in a browser, safely. This will hopefully bring to browser JS apps the same advantage that native bindings bring to Node and Python: the ability to drop down to a lower-level language for performance-critical use cases.
For an introduction to WebAssembly read the following:
- http://www.2ality.com/2015/06/web-assembly.html
- https://brendaneich.com/2015/06/from-asm-js-to-webassembly/
For data scientists I think .wasm will offer:
- The ability to cross-compile native code (legacy libraries as well as new code) to work in the browser (similar to how emscripten works today, but faster and more standardized).
- The ability to do things in JavaScript that were previously not possible, such as building compact and efficient low-level memory structures (for use cases like ndarray etc). Some of this is addressed by the Typed Objects issue as well.
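As a rough illustration of what a compact low-level memory structure means in the ndarray case, the sketch below lays a 2-D view over a flat typed buffer using a shape and strides. It is written in Python with the standard array module purely for illustration; the Array2D class and its methods are invented for this example and do not correspond to any existing library's API.

```python
from array import array


class Array2D:
    """A minimal 2-D view over a flat, typed buffer (row-major by default)."""

    def __init__(self, data, rows, cols, strides=None, offset=0):
        self.data = data
        self.shape = (rows, cols)
        # Strides are counted in elements, not bytes.
        self.strides = strides if strides is not None else (cols, 1)
        self.offset = offset

    def _index(self, i, j):
        rows, cols = self.shape
        if not (0 <= i < rows and 0 <= j < cols):
            raise IndexError((i, j))
        return self.offset + i * self.strides[0] + j * self.strides[1]

    def get(self, i, j):
        return self.data[self._index(i, j)]

    def set(self, i, j, value):
        self.data[self._index(i, j)] = value

    def transpose(self):
        """Return a transposed view that shares the same buffer (no copy)."""
        rows, cols = self.shape
        swapped = (self.strides[1], self.strides[0])
        return Array2D(self.data, cols, rows, swapped, self.offset)


buf = array("d", [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])  # six doubles, stored contiguously
m = Array2D(buf, 2, 3)   # [[1, 2, 3], [4, 5, 6]]
t = m.transpose()        # 3x2 view over the same memory
t.set(2, 0, 30.0)        # writes through to the shared buffer
print(m.get(0, 2))       # 30.0
```

Because the values sit in one contiguous block and a view is just a shape plus strides, operations such as transposing need no copying, which is the kind of efficiency the bullet above is pointing at.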
I should note that I am very much not an expert on the current state of WebAssembly, and haven't been following it too closely. If any of this is inaccurate, please clarify in the comments below.
Open questions
- Can we reuse the C/C++ components from things like the PyData stack (e.g. NumPy ndarray)? This tweet is rather trolly, but as history has shown, JavaScript will find a way.
- If you have a node native module using nan, will you be able to 'compile' it into .wasm? What is the subset of C/C++ that can be used in node as a native addon module and simultaneously in the browser as a .wasm module?
- Can we combine the previous two bullet points and share native code between all three environments (Python, Node, Browsers)?
- What is the future of prebuilt binaries? Much of the work maintainers have to do to set up node-pre-gyp and prebuild is not specific to node; maintainers in any language face the same problems when maintaining a compiled binding. Can we find a way to share build farms etc. between languages?
If you have comments, questions, or clarifications, or if I missed something important, please leave a comment below.