Splitting into multiple lazily-loaded modules

### Motivation

See motivation here:
https://github.com/rustwasm/team/issues/52

### Proposed Solution

I have implemented a (limited/hacky) prototype, based on the following components:

- A `#[wasm_split(xyz)]` function attribute macro that serves to annotate a function as a *split point*.  `xyz` is an identifier for the module that this function should be "split off" into.  The same identifier can be used multiple times, in which case multiple functions will be "split off" into the same module.  In my prototype the function must be non-async, and this macro turns it into an async function, but it wouldn't be hard to support both sync and async split points.

For example, the macro converts:

```rust
#[wasm_split(zstd)]
fn get_zstd_decoder(
    encoded_reader: Pin<Box<dyn futures::io::AsyncBufRead>>,
) -> Pin<Box<dyn futures::io::AsyncRead>> {
    Box::pin(async_compression::futures::bufread::ZstdDecoder::new(
        encoded_reader,
    ))
}
```

into

```rust
async fn get_zstd_decoder(
    __wasm_split_arg_0: Pin<Box<dyn futures::io::AsyncBufRead>>,
) -> Pin<Box<dyn futures::io::AsyncRead>> {
    thread_local! {
        static __wasm_split_loader: ::wasm_split::LazySplitLoader = unsafe { ::wasm_split::LazySplitLoader::new(__wasm_split_load_zstd) };
    }
    #[link(wasm_import_module = "./__wasm_split.js")]
    extern "C" {
        #[no_mangle]
        fn __wasm_split_load_zstd(
            callback: unsafe extern "C" fn(*const ::std::ffi::c_void, bool),
            data: *const ::std::ffi::c_void,
        ) -> ();
        #[allow(improper_ctypes)]
        #[no_mangle]
        fn __wasm_split_00zstd00_import_56925a789e8e525628ef50b9c566f070_get_zstd_decoder(
            encoded_reader: Pin<Box<dyn futures::io::AsyncBufRead>>,
        ) -> Pin<Box<dyn futures::io::AsyncRead>>;
    }
    #[allow(improper_ctypes_definitions)]
    #[no_mangle]
    pub extern "C" fn __wasm_split_00zstd00_export_56925a789e8e525628ef50b9c566f070_get_zstd_decoder(
        encoded_reader: Pin<Box<dyn futures::io::AsyncBufRead>>,
    ) -> Pin<Box<dyn futures::io::AsyncRead>> {
        Box::pin(async_compression::futures::bufread::ZstdDecoder::new(encoded_reader))
    }
    ::wasm_split::ensure_loaded(&__wasm_split_loader).await.unwrap();
    unsafe {
        __wasm_split_00zstd00_import_56925a789e8e525628ef50b9c566f070_get_zstd_decoder(
            __wasm_split_arg_0,
        )
    }
}
```

Note that the real body of the function is moved to a separate exported function (`__wasm_split_00zstd00_export_56925a789e8e525628ef50b9c566f070_get_zstd_decoder`) that is never called.  The original function body is replaced by code that ensures the module is asynchronously loaded, and then calls a separate imported function (`__wasm_split_00zstd00_import_56925a789e8e525628ef50b9c566f070_get_zstd_decoder`).  In a post-processing step, `__wasm_split_00zstd00_import_56925a789e8e525628ef50b9c566f070_get_zstd_decoder` will be changed to refer to a function that does an indirect call of `__wasm_split_00zstd00_export_56925a789e8e525628ef50b9c566f070_get_zstd_decoder`.

This effectively disconnects the call graph at this split point, which is important for the post-processing.

Then we compile and link the program using `-Clink-args=--emit-relocs`.

The post-processing reads in the linked `.wasm` file (before running wasm-bindgen, since wasm-bindgen does not preserve relocation information), identifies the split points based on the symbol names, and then determines the dependency graph of all symbols based on the relocation information.
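As a rough illustration of identifying split points from symbol names (this is not the prototype's actual code), the following sketch parses names of the form shown in the expanded macro output.  The layout `__wasm_split_00<module>00_<import|export>_<hash>_<function>` and the 32-hex-digit hash are assumptions inferred from the single example above; `SplitSymbol`, `SplitSymbolKind` and `parse_split_symbol` are hypothetical names.

```rust
/// Whether the symbol is the imported placeholder or the exported real body.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum SplitSymbolKind {
    Import,
    Export,
}

/// Parsed form of `__wasm_split_00<module>00_<kind>_<hash>_<function>`.
#[derive(Debug, PartialEq, Eq)]
pub struct SplitSymbol<'a> {
    pub module: &'a str,
    pub kind: SplitSymbolKind,
    pub hash: &'a str,
    pub function: &'a str,
}

/// Returns `None` for any symbol that does not follow the assumed layout.
pub fn parse_split_symbol(symbol: &str) -> Option<SplitSymbol<'_>> {
    let rest = symbol.strip_prefix("__wasm_split_00")?;
    // The module identifier runs up to the first `00_` delimiter.
    let (module, rest) = rest.split_once("00_")?;
    let (kind, rest) = if let Some(r) = rest.strip_prefix("import_") {
        (SplitSymbolKind::Import, r)
    } else if let Some(r) = rest.strip_prefix("export_") {
        (SplitSymbolKind::Export, r)
    } else {
        return None;
    };
    // Assumption: the hash contains no underscore, so the first `_`
    // separates it from the original function name.
    let (hash, function) = rest.split_once('_')?;
    let hash_ok = hash.len() == 32 && hash.bytes().all(|b| b.is_ascii_hexdigit());
    if module.is_empty() || function.is_empty() || !hash_ok {
        return None;
    }
    Some(SplitSymbol { module, kind, hash, function })
}
```

The loader import `__wasm_split_load_zstd` deliberately does not match this pattern, so it is not mistaken for a split point.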

Note that the dependency graph includes both functions and data symbols, since data symbols such as vtables refer to functions via the indirect function table.

We then compute the contents of the "main" module as the transitive dependencies of:
- The start function
- Any exported function

For each split module, we then compute the transitive dependencies of the real implementation function (such as `__wasm_split_00zstd00_export_56925a789e8e525628ef50b9c566f070_get_zstd_decoder`) for each split point assigned to the module.  When computing transitive dependencies here, we can stop once we encounter a symbol that is assigned to the main module.
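The two reachability passes can be sketched as below.  This is a minimal illustration rather than the prototype's code: symbols are plain integer ids, the dependency graph is assumed to have already been built from the relocation entries, and `DepGraph`, `reachable`, `main_module_symbols` and `split_module_symbols` are hypothetical names.  It also assumes the roots passed for the main module exclude the exported split point implementation functions, since those are exports in the linked file.

```rust
use std::collections::{BTreeMap, BTreeSet, VecDeque};

/// Stand-in for an index into the linked file's symbol table.
pub type SymbolId = usize;
/// `deps[s]` lists every symbol that `s` refers to through a relocation.
pub type DepGraph = BTreeMap<SymbolId, Vec<SymbolId>>;

/// Breadth-first transitive closure of `roots` that never enters `stop`.
pub fn reachable(
    deps: &DepGraph,
    roots: &[SymbolId],
    stop: &BTreeSet<SymbolId>,
) -> BTreeSet<SymbolId> {
    let mut seen = BTreeSet::new();
    let mut queue = VecDeque::new();
    for &root in roots {
        if !stop.contains(&root) && seen.insert(root) {
            queue.push_back(root);
        }
    }
    while let Some(sym) = queue.pop_front() {
        for &dep in deps.get(&sym).into_iter().flatten() {
            if !stop.contains(&dep) && seen.insert(dep) {
                queue.push_back(dep);
            }
        }
    }
    seen
}

/// Main module: everything reachable from the start function and exports.
pub fn main_module_symbols(deps: &DepGraph, roots: &[SymbolId]) -> BTreeSet<SymbolId> {
    reachable(deps, roots, &BTreeSet::new())
}

/// Split module: everything reachable from its split point implementations,
/// stopping at symbols that are already assigned to the main module.
pub fn split_module_symbols(
    deps: &DepGraph,
    split_point_impls: &[SymbolId],
    main: &BTreeSet<SymbolId>,
) -> BTreeSet<SymbolId> {
    reachable(deps, split_point_impls, main)
}
```

Stopping at main-module symbols keeps shared helpers (an allocator, for instance) in the main module instead of duplicating them into every split module.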

Symbols that are uniquely in the transitive dependencies of a single split module are assigned to that split module.  Symbols that are in the transitive dependencies of more than one split module are assigned to a separate "chunk" module identified by the set of two or more split modules that have the symbol as a transitive dependency.  Thus we may in general produce a large number of chunk modules.  Various heuristics could be used to combine them.
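One way to express this assignment, as a sketch under the assumption that each split module's transitive dependencies (with main-module symbols already excluded) are available as a set: group every symbol by the set of split modules that depend on it.  A key with one module name corresponds to that split module itself, and a key with two or more names corresponds to a chunk module.  `assign_modules` is a hypothetical name, not something taken from the prototype.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Stand-in for an index into the linked file's symbol table.
pub type SymbolId = usize;

/// Maps each distinct set of dependent split modules to the symbols owned
/// by that set.  Singleton keys are split modules; larger keys are chunks.
pub fn assign_modules(
    split_deps: &BTreeMap<String, BTreeSet<SymbolId>>,
) -> BTreeMap<BTreeSet<String>, BTreeSet<SymbolId>> {
    // First invert the input: symbol -> split modules that depend on it.
    let mut owners: BTreeMap<SymbolId, BTreeSet<String>> = BTreeMap::new();
    for (module, symbols) in split_deps {
        for &sym in symbols {
            owners.entry(sym).or_default().insert(module.clone());
        }
    }
    // Then group symbols that share exactly the same set of dependents.
    let mut assignment: BTreeMap<BTreeSet<String>, BTreeSet<SymbolId>> = BTreeMap::new();
    for (sym, modules) in owners {
        assignment.entry(modules).or_default().insert(sym);
    }
    assignment
}
```

With n split modules there can be up to 2^n - 1 distinct keys, which is why the number of chunk modules can grow large without a merging heuristic.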

The split point implementation functions, and any function that is called from more than one module, get added to the `__indirect_function_table`.

We then emit each module, using the relocation information to remap functions.  In the prototype, although we compute dependencies as if data symbols are split out, in fact all of the data segments remain in the main module, but it should be feasible to split the data as well.  Calls to functions defined in other modules are replaced by calls to a stub function that does an indirect call.  Each split module has no `start` function but has an active element that initializes a portion of the `__indirect_function_table`.
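To make the stub rule concrete, here is a small sketch of how the functions that need a table slot might be collected.  It reads the rule as "a function needs a slot if it is a split point implementation or if some call to it originates in a different module"; that reading, the flat list of call edges, and the name `table_candidates` are all assumptions for illustration, and the prototype may decide this differently.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Stand-in for a function index in the linked module.
pub type FuncId = usize;

/// Collects functions that would need a slot in the indirect function table.
///
/// `module_of` gives the output module each function was assigned to, and
/// `calls` lists (caller, callee) edges taken from the relocations.
pub fn table_candidates(
    module_of: &BTreeMap<FuncId, String>,
    calls: &[(FuncId, FuncId)],
    split_point_impls: &[FuncId],
) -> BTreeSet<FuncId> {
    // Split point implementations are always reached indirectly.
    let mut needed: BTreeSet<FuncId> = split_point_impls.iter().copied().collect();
    for &(caller, callee) in calls {
        // A call that crosses a module boundary must go through a stub.
        if let (Some(from), Some(to)) = (module_of.get(&caller), module_of.get(&callee)) {
            if from != to {
                needed.insert(callee);
            }
        }
    }
    needed
}
```

Calls that stay within one module are left as direct calls and do not add table entries.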

The supporting JavaScript for loading the module looks something like:

```javascript
import { initSync } from "./main.js";

export async function __wasm_split_load_zstd(callback_index, callback_data) {
  let mainExports = undefined;
  try {
    const response = await fetch(new URL("./zstd.wasm", import.meta.url));
    mainExports = initSync(undefined, undefined);
    const imports = {
      env: {
        memory: mainExports.memory,
      },
      __wasm_split: {
        __indirect_function_table: mainExports.__indirect_function_table,
        __stack_pointer: mainExports.__stack_pointer,
        __tls_base: mainExports.__tls_base,
      },
    };
    const module = await WebAssembly.instantiateStreaming(response, imports);
    mainExports.__indirect_function_table.get(callback_index)(
      callback_data,
      true,
    );
  } catch (e) {
    console.error("Failed to load zstd", e);
    if (mainExports === undefined) {
      mainExports = initSync(undefined, undefined);
    }
    mainExports.__indirect_function_table.get(callback_index)(
      callback_data,
      false,
    );
  }
}
```

### Alternatives

This implementation was inspired by the description of the Emscripten wasm-split tool (https://emscripten.org/docs/optimizing/Module-Splitting.html#module-splitting).  The Emscripten wasm-split tool differs in the following ways:
- It only splits into one main module and one secondary module.  The secondary module is loaded synchronously on demand.
- The split is determined based on profiling rather than explicit annotations in the code.

While the Emscripten wasm-split approach could presumably be adapted to Rust fairly easily, I think there are a lot of advantages to explicitly-annotated, asynchronously-loaded split points.

Another alternative would be to provide something closer to `dlopen`, which I think may be along the lines of what is being proposed for a WebAssembly dynamic linking mechanism.  The advantages of what I'm proposing here over a dlopen-style interface are:
- split points can be inserted at arbitrary locations, rather than only at crate boundaries,
- the `wasm_split` macro provides a very ergonomic interface,
- code and data are automatically deduplicated across modules.

### Additional Context

The current prototype implementation is basically independent of wasm-bindgen: it works with an unmodified wasm-bindgen, but the module loading depends slightly on implementation details of wasm-bindgen.

Since I think there is quite a lot of interest in this feature in the community, it would ultimately probably be better to integrate it into wasm-bindgen itself, which would allow the JavaScript code to be split along with the wasm module.

Towards that goal, I'd appreciate some guidance on whether this feature would likely be accepted, and if so, any comments on how best to integrate it.

The current prototype implementation uses wasm_encoder and wasmparser directly.  I initially attempted to use walrus but found that its abstractions didn't work very well given the need to make use of the relocation information.  Possibly walrus could be modified to provide the necessary functionality.  Alternatively, the splitting could be done first using wasm_encoder and wasmparser directly, and then the remaining wasm-bindgen processing could be done using walrus.
