
Workers in Node.js based on multithreaded V8

AaronLee edited this page Mar 26, 2018 · 4 revisions

TLDR

We propose a system for asynchronous multithreading in Node.js with an API conforming to the Web Worker standard. By using the Web Worker API and extending V8 and Node.js, our system provides a lightweight worker (the Lite Thread Worker) for running JavaScript in multiple threads, to take advantage of multi-core devices. V8 is modified to support multithreading and to cut down memory usage by sharing data that is not tied to a context between threads. Node is modified to support the Worker API for thread workers and to support all built-in Node.js APIs inside the thread worker. In the initial phase, we reduce the memory usage of an empty worker by more than 60% and achieve a 3X improvement in worker startup performance compared to the process worker provided by the cluster module in Node.js.

Introduction

As power consumption of devices has become a concern, the exponential growth in single-processor performance has ended in recent years, and parallelism is becoming ubiquitous. JavaScript, however, does not support multithreading, because it was originally intended for short, quick-running pieces of code. The way people use JavaScript has changed: it is now widely used in many applications, such as web pages, web apps, and Node.js. Although the cluster module in Node.js lets us spread load across multiple processes, there have been many attempts to provide multithreading capabilities in Node.js. They all seek a lightweight approach to running tasks in parallel in Node.js.

Threads-a-gogo (TAGG) complements Node.js for CPU-bound tasks with the capability of executing JavaScript in multiple threads, and many projects on GitHub are forked from it. Napa.js, provided by Microsoft, is designed to support multithreading and implements parts of the Node.js core modules in its napa workers, which makes them more powerful. These attempts have something in common: they do not support the full Node.js API in worker threads, and they all create a new V8 isolate in every worker thread to run JS code separately.

These implementations make the thread worker heavyweight, with a high startup performance cost and a high per-instance memory cost, just as the Web Worker specification warns.

We use Node.js as the basic component of our app engine, and we need a lightweight thread-worker implementation for mobile devices, which have limited hardware resources (CPU, memory, etc.). We want to use the Web Worker API, now supported by most browsers, to provide the same multithreading capability in Node.js, and we want to cut down memory consumption to make workers more lightweight. Furthermore, the worker should support the entire Node.js (core) API. We call it the "Lite Thread Worker".

Goals

  • Avoid performance regressions for code that does not use multithreading.
  • Lite thread workers are isolated, and each has its own event loop. Workers communicate via message passing.
  • All built-in modules of Node.js are supported in the lite thread worker.
  • Share immutable JavaScript objects (primitive objects), data not tied to a specific context, and some VM infrastructure used by the JS engine between threads.
  • Implement the Worker API to support multithreading in Node.js.

Works on Node.js

Node.js Multithreaded Event Loop Model

Node.js has an event-driven architecture capable of asynchronous I/O, which makes it well suited for processing a high volume of short requests. However, the "don't block the event loop" problem is inherent, because a single instance of Node.js (called a "node instance" in the old code) runs in a single thread (the single-threaded event loop model).

Figure 1.1. Single threaded event loop model

We sought a way to create multiple node instances in multiple threads (a multithreaded event loop model) to make Node.js suitable for handling a variety of tasks on a multicore machine. Each thread has a node instance, and the event loop inside each node instance runs independently on its thread. You can run tasks that require fast responses in the main thread and run long, heavy work in other worker threads. All built-in APIs are available in the worker thread. The worker in the background thread is called the "Lite Thread Worker".

Figure 1.2. Multithreaded event loop model

We create a node environment for each thread accordingly: setting up the environment for libuv, setting up the environment for V8 (including construction and initialization of a V8 instance, the isolate), and loading the environment (including executing the startup function in bootstrap_node.js). Then the event loop runs in the thread. "require" is also isolated: if you have required a module in one thread, you need to require it again in the worker thread. Threads communicate with each other using message passing.

The runtime of multithreaded Node.js is shown in Figure 1.3.

Figure 1.3. Multithreaded Node.js runtime and Node.js Architecture

V8: The V8 engine executes JavaScript code. An isolate represents an isolated instance of the V8 engine. We can create additional isolates and use them in parallel in multiple threads.

libuv: libuv provides an event loop and callback-based notifications of I/O and other activities. We can run multiple event loops as long as each runs in a different thread.

Module: A module stands for a discrete chunk of functionality in Node.js, such as buffer, fs, or http. We must make sure each module is safe when used from multiple threads. The main work to make a module thread-safe consists of analyzing its global data and handling that data accordingly. A module instance holds the data related to the module's execution in one thread.

In fact, most of the data in Node.js is not shared between threads; the whole JS environment is isolated. We protect the remaining shared native data from concurrent access with locks in Node.js.

Thread Safety in Node.js

To get the model working quickly, all the data in Node.js is divided into three categories:

  • Global mutable data

This data is shared between threads and is mutable. We use locks to protect access to it.

  • Global immutable data

This data is shared but cannot be changed after its construction, which means it is read-only during later execution and therefore thread-safe.

  • Local data

This data is local to each thread and is therefore thread-safe.

Node.js was written and tested as a single-threaded program, which causes trouble here: data may be placed in an unsuitable location, yet the system still works because it is single-threaded.

Data that was previously non-local in Node.js, including global variables and static variables, is first classified into the three sets above; then we move the data to the correct location and take any necessary action. For example, in src/string_search.h:

- static int kBadCharShiftTable[kUC16AlphabetSize];
- static int kGoodSuffixShiftTable[kBMMaxShift + 1];
- static int kSuffixTable[kBMMaxShift + 1];
+ thread_local static int kBadCharShiftTable[kUC16AlphabetSize];
+ thread_local static int kGoodSuffixShiftTable[kBMMaxShift + 1];
+ thread_local static int kSuffixTable[kBMMaxShift + 1];

The three tables above are used by the Boyer-Moore algorithm for finding patterns in strings. They must be thread-local when used from multiple threads, or the search results may be wrong.

Although "require" is isolated and you must require a module before using it in a new thread, a shared library can only be loaded once per process. We use a list to cache the loaded shared objects and a lock to protect the dlopen procedure, as below:

-  const bool is_dlopen_error = uv_dlopen(*filename, &lib);
-
-  node_module* const mp = modpending;
-  modpending = nullptr;
+  // try to find the addon module in list.
+  node_module* mp = get_addon_module(*filename);
+
+  if (mp == nullptr) {
+    Mutex::ScopedLock scoped_lock(modlist_addon_mutex);
+    mp = get_addon_module(*filename);
+    if (mp == nullptr) {
+      const bool is_dlopen_error = uv_dlopen(*filename, &lib);
     ...

It is a big job to make everything in Node.js work correctly on multiple threads, so we chose a quick and easy way to implement the multithreaded event loop model first. Since no tests exist for multithreaded Node.js, we ran the existing test cases in parallel on 64 threads in our multithreaded Node.js and got the same results as on the original single-threaded Node.js. (We filtered out test cases that would contend for the same system resource, such as HTTP ports, when run in multiple threads simultaneously.)

Worker API in Node.js

  • Worker construction and destruction
var LiteWorker = require("lite-thread-worker");
var worker = new LiteWorker("worker.js");

Create a new worker to run the worker.js script in a new thread. You must require the "lite-thread-worker" module first.

worker.terminate();

Terminate a running worker from the main thread.

close();

Inside the worker thread, the worker closes itself by calling close().

  • postMessage and onMessage
worker.postMessage(messageObject, [arrObj1, arrObj2...]);

Send a message to the worker from the main thread; the second parameter is optional.

worker.onmessage = function (e) {
   //...
};

The onmessage handler runs when a message is received from the specific worker.

postMessage(messageObject, [arrObj1, arrObj2...]);

Send a message to the main thread from the worker.

onmessage = function (e) { ... };

The onmessage handler runs when a message is received.

The running threads in our system, including the master thread and the workers, communicate with each other via postMessage and onmessage. There is no need to go into further detail: the entire Worker API in our system conforms to the Web Worker standard.

Works on V8

All the work we have done in V8 makes it suitable for asynchronous multithreading. It brings improvements to the lite thread worker in Node.js and to web workers in Chrome.

Optimized Isolation Model

The workers mentioned above are all isolated in the sense that they do not share the JavaScript global execution context; the only way to communicate is message passing (SharedArrayBuffer is an exception). They are deliberately almost shared-nothing, like actors in the actor model, so they can execute with high parallelism across multiple cores.

Figure 2.1. Isolation Model for execution in worker threads

We propose an optimized isolation model (OIM), a new multithreaded execution model for running JS scripts on separate worker threads, and provide a general framework for implementing this model in V8. OIM seeks to cut down memory usage by sharing context-free data while providing multithreading. Although sharing mutable data requires synchronization, data races are improbable because, by design, most shared data is not updated after its creation. We found the synchronization overhead here negligible with optimistic locking.

We propose an abstraction called the Global Runtime in OIM to manage the creation, destruction, and use of shared resources, including some basic virtual-machine infrastructure, objects, and unoptimized code. None of the shared resources depend on the thread context. We propose a second abstraction, the Local Runtime, to manage the private resources of a thread during the execution of JS scripts.

Figure 2.2. Optimized Isolation Model for execution in worker threads

New Architecture in V8 For Optimized Isolation Model

We propose a new architecture for the optimized isolation model, which consists of two parts: the Global Runtime Manager and the Local Runtime Manager. An isolate represents a VM instance of the V8 engine. We divide it into two parts: the global runtime manager holds the shared global area, and the local runtime manager holds the rest. A VM instance may have one instance of the global runtime manager and at least one local runtime instance during execution.

Figure 2.3 Architecture for Optimized Isolation Model in V8

The global runtime manager includes:

  1. Global shared data
  • Non-optimized code, including the bytecode of builtin functions, internal code stubs, and the user's JS code.
  • The shared compilation cache, a container object that stores shared function infos.
  • The shared string table, a container object that stores shared strings.
  • Other shared objects.
  2. Shared VM infrastructure
  • The shared heap, the place where shared objects are stored.
  • The thread manager, which controls the number of worker threads needed for a specific workload.
  • The interpreter, an execution engine for non-optimized code (bytecode compiled from source code).
  • Others.

The local runtime manager holds thread-related data, code, and modules, including the unshared parts of the V8 "Isolate". Each thread has one instance of the local runtime manager.

Shared Objects & Private Objects

We divide heap objects in the JS heap into two categories: shared objects and private objects. Shared objects may be reachable from more than one thread; private objects are reachable from only one thread.

Shared Objects

  • Primitive objects, such as strings and heap numbers, which are immutable for their lifetimes.
  • Parts of Script, parts of SharedFunctionInfo, parts of maps, etc., which are rarely updated after their creation and are not affected by the thread context.
  • The shared script list, shared script compilation cache, shared string table, etc., which are used to keep shared objects.

Shared objects must not be tied to a context. Sharing primitive objects is unproblematic because they are immutable. We do some trivial work to make the shared objects contain purely common information. For example, we move fields of SharedFunctionInfo that contain execution information, such as counters and opt_count, to the feedback vector, which carries the function information related to thread execution.
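As a rough illustration of that split (the struct layouts below are simplified stand-ins, not the real V8 object layouts):

```cpp
#include <cstdint>
#include <vector>

// Everything left in SharedFunctionInfo is fixed after compilation,
// so one instance can be read from many threads without locking.
struct SharedFunctionInfo {
  std::vector<uint8_t> bytecode;   // compiled once, then read-only
  int parameter_count;
  int function_literal_id;
};

// Per-thread execution state that previously lived on SharedFunctionInfo
// (profiler counters, optimization attempts) moves to the feedback vector,
// which each thread owns privately.
struct FeedbackVector {
  const SharedFunctionInfo* shared;  // reference into the shared heap
  int invocation_count = 0;          // mutated freely: thread-private
  int opt_count = 0;
};
```

Two threads running the same function then share one read-only SharedFunctionInfo while each mutates its own FeedbackVector.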

All of the shared objects are allocated on the shared heap.

Private Objects

Private objects are used only in one thread and are allocated on the private heap of the running thread. All JavaScript objects created by user code are private objects, except some primitive objects.

Object References

We establish the following two constraints:

  •  A shared object can only refer to shared objects.
  •  A private object created in a thread may refer to shared objects or to private objects created in the same thread.

These constraints are easy to understand: if an object is reachable from multiple threads, then every object it references is reachable from those threads too, so all objects referenced by a shared object must themselves be shared objects.

A private object created in a thread can only be accessed by that thread throughout the thread's execution. A private object may hold a reference to a shared object if needed.
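A debug-mode heap verifier could check these two constraints on every reference edge of the object graph. A minimal sketch (the field names are illustrative, not real V8 fields):

```cpp
struct HeapObject {
  bool shared;        // allocated on the shared heap?
  int owner_thread;   // meaningful only for private objects
};

// Checks the two reference constraints from the text for one edge
// `from -> to`. A debug heap verifier could run this over every edge.
bool ReferenceAllowed(const HeapObject& from, const HeapObject& to) {
  if (from.shared) {
    // Constraint 1: a shared object may only refer to shared objects.
    return to.shared;
  }
  // Constraint 2: a private object may refer to a shared object or to
  // a private object created on the same thread.
  return to.shared || to.owner_thread == from.owner_thread;
}
```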

Sharing Non-optimized Code

The bytecode held in a SharedFunctionInfo, compiled from the JS source code, is executed by V8's Ignition interpreter. Unlike optimized code, which the TurboFan compiler generates for hot functions using the type information gathered in the feedback vector, bytecode needs no type information at compile time. Since the same source string compiles to the same code, we share the bytecode compiled from scripts between threads.

We add a shared compilation cache to the global runtime to keep the shared function infos of compiled scripts, shared between threads. The shared function infos are looked up with the source string as the key, and access to the cache is protected with a safepoint mutex lock. A function at the same position in the same script yields the same SharedFunctionInfo. We use a monitor lock on each SharedFunctionInfo object as a compile lock to avoid redundant compilation of the same SharedFunctionInfo.
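A sketch of the cache plus per-object compile lock, with simplified stand-in types and a plain `std::mutex` standing in for both the safepoint mutex and the monitor lock:

```cpp
#include <mutex>
#include <string>
#include <unordered_map>

// Simplified stand-in for the real SharedFunctionInfo.
struct SharedFunctionInfo {
  explicit SharedFunctionInfo(std::string src) : source(std::move(src)) {}
  std::string source;
  bool compiled = false;
  std::mutex compile_lock;  // per-object "monitor" guarding compilation
};

static std::unordered_map<std::string, SharedFunctionInfo*> shared_cache;
static std::mutex shared_cache_mutex;  // stands in for the safepoint lock

// Looks up (or creates) the SharedFunctionInfo for a source string, then
// compiles it at most once even when many threads race on the same script.
SharedFunctionInfo* GetOrCompile(const std::string& source) {
  SharedFunctionInfo* sfi;
  {
    std::lock_guard<std::mutex> cache_lock(shared_cache_mutex);
    auto it = shared_cache.find(source);
    if (it != shared_cache.end()) {
      sfi = it->second;
    } else {
      sfi = new SharedFunctionInfo(source);
      shared_cache[source] = sfi;
    }
  }
  // The per-object lock avoids redundant compilation: the first thread
  // in compiles; later threads see `compiled` already set and return.
  std::lock_guard<std::mutex> compile(sfi->compile_lock);
  if (!sfi->compiled) {
    sfi->compiled = true;  // real code would generate bytecode here
  }
  return sfi;
}
```

Holding a per-object lock rather than the cache-wide lock during compilation means two threads compiling different scripts never block each other.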

Memory Architecture

We propose a new heap architecture for a multithreaded JS engine: one shared heap plus multiple private heaps. Shared objects are stored in the shared heap, and private objects, which can only be used by a specific thread, are stored in that thread's own private heap.

Memory management for a private heap is the same as in original V8: from the viewpoint of a single thread, the new space in the private heap holds young objects and the old space holds long-lived ones. We implement a multithreaded allocator for the shared heap. Since shared objects rarely change during their lifetimes, they are always long-lived, so the shared heap doesn't have a new space.

All GC

All GC collects garbage from the whole heap, including both the shared heap and all private heaps. It occurs when the size of the shared heap exceeds its limit or when there is not enough free space in a private heap after a local GC.

All GC requires "stop the world" (STW): the JS engine stops all running threads to execute an all GC, just as most JVMs do. When stop-the-world occurs, every thread pauses its task, and the interrupted tasks resume only after the all GC finishes. We found that shared objects are stable and long-lived, so after tuning the memory limit and footprint of the shared heap, all GC rarely occurs.

Local GC

The purpose of local GC is to collect garbage in the private heap of a specific thread. Most GCs are local GCs, because shared objects are much more stable than private objects by design. The behavior of local GC is the same as GC in single-threaded V8: we use a copying collector for the new space of the private heap and a mark-sweep-compact collector for the old space.

Figure 2.4 Marking references during local gc

Local GC traverses the whole object graph of the specific private heap, starting from the thread's roots and following references from the roots to other private objects, e.g. via instance fields. Any shared object the GC visits is ignored, and every private object it visits is marked as alive. After the marking phase, we remove the unreachable objects from the private heap.

Local GC for one thread therefore does not affect the execution of other threads. Unlike other VM implementations that support multithreading, we need not stop the whole world during a local GC.
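The marking phase can be sketched as a plain worklist traversal that stops at shared objects (types and fields are illustrative stand-ins). Skipping a shared object is safe because, by the reference constraints above, everything it references is also shared and thus outside the private heap being collected:

```cpp
#include <vector>

// Illustrative heap object: a shared flag, a mark bit, outgoing references.
struct Obj {
  bool shared;              // lives on the shared heap?
  bool marked = false;
  std::vector<Obj*> refs;   // outgoing references
};

// Marking phase of a local gc: follow references from the thread's roots,
// mark private objects alive, and ignore shared objects entirely (they are
// owned by the shared heap and only collected by an "all gc").
void MarkFromRoots(const std::vector<Obj*>& roots) {
  std::vector<Obj*> worklist(roots.begin(), roots.end());
  while (!worklist.empty()) {
    Obj* o = worklist.back();
    worklist.pop_back();
    if (o->shared || o->marked) continue;  // shared objects are skipped
    o->marked = true;
    for (Obj* ref : o->refs) worklist.push_back(ref);
  }
}
```

The sweep phase would then free every private object whose mark bit is still clear.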

Thread States and Safepoint

Stop-the-world comes up in the GC implementations of VMs for other high-level programming languages. It is simple in a single-threaded JS engine, because all JS code runs in one thread and that single thread is the whole world: when GC is triggered, no JS code can run at the same time. Note that GC is only triggered at safepoints, even when there is only one running thread.

We have to add a suspend/resume mechanism to control threads in a multithreaded environment, and we add a thread state that describes whether a thread is ready to run JS code. The thread state is one of two types:

  • Runnable: the thread is currently ready to run JS code, or is doing internal work that can immediately affect the JS world.
  • Non-runnable: the thread is doing work that does not immediately affect anything in the JS world. Perhaps it is running a native method, is suspended by a GC, or is blocked waiting on a lock. "All GC" needs to stop the world, which means every thread's state must be non-runnable during an all GC.

The thread state is updated only at safepoints. If a running thread needs to be suspended, a suspension request is sent to it; the thread makes a suspend check at its next safepoint, suspends itself, and changes its thread state from runnable to non-runnable. All GC requires that all threads are suspended (all thread states are non-runnable). A thread resumes after all suspension requests on it have been handled. This is similar to the mechanism used in Java virtual machines.
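The suspend/resume handshake described above can be sketched with a flag, a mutex, and a condition variable (all names are illustrative; a real VM keeps one such record per thread and the GC thread requests suspension of all of them):

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>

enum class ThreadState { kRunnable, kNonRunnable };

// Sketch of one thread's suspend / resume handshake.
struct ThreadControl {
  std::mutex m;
  std::condition_variable cv;
  ThreadState state = ThreadState::kRunnable;
  bool suspend_requested = false;

  // Called by the GC thread: ask the mutator to stop at its next safepoint.
  void RequestSuspend() {
    std::lock_guard<std::mutex> lock(m);
    suspend_requested = true;
  }

  // Called by the mutator at every safepoint (function entry/exit, loop
  // back edges). If a suspension was requested, flip to non-runnable and
  // block until the GC resumes us.
  void SafepointCheck() {
    std::unique_lock<std::mutex> lock(m);
    if (!suspend_requested) return;
    state = ThreadState::kNonRunnable;
    cv.notify_all();                       // tell the GC we have stopped
    cv.wait(lock, [this] { return !suspend_requested; });
    state = ThreadState::kRunnable;
  }

  // Called by the GC thread when collection is done.
  void Resume() {
    std::lock_guard<std::mutex> lock(m);
    suspend_requested = false;
    cv.notify_all();
  }
};
```

The all GC simply calls RequestSuspend() on every thread, waits until every state reads non-runnable, collects, and then calls Resume() on each.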

It is safe to collect garbage at safepoints: all live objects can be accounted for, and every object pointer is held by a handle storing the address of that pointer, so the pointer can be updated during a moving GC. The checkpoints in V8 are a subset of our safepoints, which include the entry/exit points of JS functions, loop back edges, etc. The entry/exit points of the API methods in v8.h are treated as safepoints too: since invoking those methods affects the JS world, we change the thread state there, to runnable on entry and back to the previous state on exit.

Synchronization

Mutex Lock and Safepoint Lock

  • A mutex lock is used when no heap allocation happens during the lock scope.
    We do not have to change the thread state when using a mutex lock.
  • A safepoint lock is used when heap allocation may occur during the lock scope.
    A thread should hold a safepoint lock at a safepoint where GC may be triggered if allocation fails. The thread's state is changed to non-runnable while it waits for the lock and is updated back to runnable after the thread acquires the lock.
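A safepoint lock can be sketched as a small RAII wrapper that parks the thread in the non-runnable state while it waits, so a concurrent "all GC" is never blocked by the waiter (names and the state-passing style are illustrative):

```cpp
#include <mutex>

enum class ThreadState { kRunnable, kNonRunnable };

// Sketch of a safepoint lock: unlike a plain mutex lock, the thread is
// marked non-runnable for the duration of the wait, so a stop-the-world
// GC can proceed while we are blocked.
class SafepointLock {
 public:
  SafepointLock(std::mutex& m, ThreadState& state)
      : lock_(m, std::defer_lock) {
    state = ThreadState::kNonRunnable;  // waiting: gc may proceed without us
    lock_.lock();
    state = ThreadState::kRunnable;     // acquired: we may touch the heap
  }
  // unique_lock releases the mutex on destruction.
 private:
  std::unique_lock<std::mutex> lock_;
};
```

A plain mutex lock, by contrast, leaves the state untouched, which is only safe when no allocation (and hence no GC) can happen inside the lock scope.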

If we use a mutex lock at a safepoint, a deadlock may occur: while one thread is waiting for lock A, a GC may be triggered on another thread that already holds the lock. The first thread then has no chance to respond to the GC request from the second thread, and a deadlock occurs, as shown on the left side of Figure 2.5.

Figure 2.5 Safepoint lock and suspend / resume mechanism

Synchronization for Shared Heap Objects

Primitive JS objects are immutable and read-only after creation, so we need not be concerned about thread safety when sharing them between threads. But some shared objects used only inside V8's implementation, such as SharedFunctionInfo, are shared between threads because they are not tied to the thread context. These shared objects are hidden from users and come in very few types. We add a header word to these shared heap objects to implement a thin-lock scheme, much like HotSpot does. Thin locks are cheap and are inflated only under heavy contention, which rarely happens in this system.
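The thin-lock idea can be sketched with a single atomic header word per object: zero means unlocked, otherwise the word holds the owner's thread id (a real implementation would inflate to a full monitor on contention; here we simply spin, since contention is rare for these objects by design):

```cpp
#include <atomic>
#include <cstdint>

// Thin-lock sketch: one header word per shared heap object.
struct ThinLock {
  std::atomic<uint32_t> header{0};  // 0 = unlocked, else owner thread id

  // Acquire by CAS-ing our thread id into the header. A CAS failure is
  // where a real implementation would "inflate" the lock into a full
  // monitor; this sketch just retries.
  void Lock(uint32_t thread_id) {
    uint32_t expected = 0;
    while (!header.compare_exchange_weak(expected, thread_id)) {
      expected = 0;  // CAS overwrote `expected`; reset and retry
    }
  }

  void Unlock() { header.store(0); }
};
```

The uncontended case is a single compare-and-swap, which is why thin locks are cheap enough to put on every shared heap object.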

Result & Outcome

Node.js: Single Thread Performance

Multithreaded Node.js: Multithreaded Node.js (v8.9.0) with OIM V8 (6.1.561)
Node.js: Node.js (v8.9.0)

We modified the Octane benchmark in V8 to enable it in Node.js, and we found that single-thread performance is almost unaffected by our work in V8.

Node.js: Lite Thread Worker VS Cluster

Lite Thread Worker: Multithreaded Node.js (v8.9.0) with OIM V8 (6.1.561)
Cluster Worker: Node.js (v8.9.0)

Memory Usage

We share the data not tied to the thread context between the main thread and worker threads, based on OIM V8. This saves a lot of memory.

Startup Time  

The JS functions invoked during node startup have already been compiled in the main thread; the lite thread worker reuses these shared resources, which speeds up its startup.

Message Passing

We currently use the HTML structured clone algorithm. We can improve performance by passing references to primitive objects instead of going through serialization and deserialization.

Node.js: Lite Thread Worker VS Other Thread Workers

Lite Thread Worker: Multithreaded Node.js (v8.9.0) with OIM V8 (6.1.561)
Thread Worker: Multithreaded Node.js (v8.9.0) (implemented like the others, using original V8)

Memory Usage

Our work on V8 saves 60%+ of the memory of an empty worker. When you use multiple workers to handle the same tasks with complex JavaScript code, you get substantial memory savings.

Startup Time

We start a node instance in the worker thread, which initializes the whole node environment and runs the startup functions of some builtin modules. With OIM V8, the worker reuses the unoptimized code of JavaScript functions that were already compiled during the main thread's boot-up. OIM V8 improves worker startup performance by about 2X.

Chrome: Lite Web Worker VS Web Worker

Lite Web Worker: adapted dedicated web worker in Chrome (106229fb) with OIM V8
Web Worker: dedicated web worker in Chrome (106229fb)

Memory Usage

An empty worker's memory consumption is reduced by ~30% in Chrome.

Startup Time

The small improvement in the web worker startup scenario is stable across our tests.

Benchmark

Loading large amounts of data in a worker (http://mourner.github.io/worker-data-load/).

We compile a JS function once to get its bytecode, which is then shared between threads. This saves a lot of time on this benchmark.

Napa.js: Lite Napa Worker VS Napa Worker

Lite Napa Worker: adapted Napa.js (v0.1.7) + Node.js (v8.9.0) with OIM V8
Napa Worker: Napa.js (v0.1.7) + Node.js (v8.9.0)

Memory Usage:

We used OIM V8 in Napa.js (v0.1.7), and the memory consumption of a Napa.js worker is reduced by 15%.