
Investigate WASM as a HAL executable format #2863

Open

Description

There are a number of reasons to be interested in wasm as a distribution format for executables. This issue will track notes on the feasibility of this approach, how it could be implemented in IREE, and some open questions.

At a high level we can treat each HAL executable as a WASM binary with multiple entry points (exports), just as we do with dylibs for the LLVM AOT backend; store the wasm binary embedded in the module as we do with all other executables; and have a host-derived HAL backend that uses a wasm runtime library to load/cache/invoke the wasm. With this approach we can likely reuse the current LLVM AOT compiler target with different target/linker options almost verbatim.

IREE Implementation

Compiler

We can reuse the existing LLVM target backend with changes only to how we set up the compiler and serialize the resulting binary - LLVMAOTTarget and LLVMIRTarget are two examples that already exist.

  • Add a new LLVMWASMTarget based on LLVMAOTTarget that configures LLVM, links to a wasm module, and embeds it in an executable
  • Likely no need for a custom flatbuffer schema as the .wasm module contains all of the information we need (exports list, etc)
    • May still be useful if we need to configure/choose between wasm engines and define whether certain features are used (SIMD, wasm64, etc)

Work to link multiple executables together (#1587) such that we ideally end up with a single executable per module with multiple entry points will be very useful here to reduce overhead (only one wasm runtime needed, etc).

Runtime

A majority of the runtime work is identical to the existing dylib, llvmjit, and vmla HAL drivers. All of these share code in iree/hal/host/ for things like work scheduling.

  • Add a new wasm backend based on llvmjit or dylib
    • May want to nest like iree/hal/wasm/[runtime]/, as we'll definitely be ending up with multiple runtimes (at least JavaScriptCore for iOS and something for embedded, likely)
  • iree::hal::ExecutableCache can be implemented to support offline preparation, caching of intermediate (bitcode/native) binaries, etc
  • The iree::hal::HostExecutable subclass can hold the handles to the wasm runtime and the exported symbols
    • Modules can be initialized once directly from the provided buffers (that are mmapped or otherwise already in memory)
  • Dispatches are prepared via PrepareDispatch once per dispatch and as demonstrated here can have state that is shared across all tiles
    • This amortizes arg marshaling into the runtime as all tiles receive the same args (besides workgroup xyz)
  • Dispatches (per tile) are invoked with the shared args (buffer bindings/push constants) and the workgroup xyz of the tile as here
    • Multiple tiles from the same dispatch are dispatched concurrently from multiple threads
    • We can implement shared memory semantics in that the wasm memory space can have a region shared across tiles and use atomic ops to safely work across them; otherwise we should rely on the stack to keep each tile independent (this matches GPU behavior, so something we need to be solving there anyway)

A custom iree::hal::Allocator will be required as wasm runtimes can only access a single contiguous memory address range and we need to suballocate within that if we want to ensure zero-copy behavior. This most closely aligns with the DEVICE_LOCAL|HOST_VISIBLE memory type in that the device here (the wasm runtime) can cheaply manipulate the memory and that HOST_LOCAL|DEVICE_VISIBLE memory would require a copy to use. There are likely some other gotchas here to work through; see the open questions below and the WAMR example.

Toys

WASM Runtime Notes

There are a few big WASM-specific runtimes that have various levels of build time, runtime, and architecture support. This excludes any that simply wrap v8/JSC, as we don't need arbitrary JS, WASI, and other complex bridging layers and instead just need access to the global memory and export table. Directly using system libraries (such as JavaScriptCore) may be the only option in some environments, while platforms that allow allocating executable pages offer a lot more freedom.

There are a lot of runtimes: https://github.com/appcypher/awesome-wasm-runtimes
Many are experimental or specialized (blockchain wasm VMs, etc). I've listed the most popular/relevant/still-active ones here and excluded any (such as WAVM) that require LLVM in their deployment.

v8

Much more full-featured than we need, but also one of the fastest/most ubiquitous runtimes. Not sure if there's a minimal build that only includes what's required for wasm - the runtime+JIT+etc can be several MB.

JavaScriptCore

The only real option (besides interpreting) on iOS. Supports WebAssembly on device and in the simulator. There's no clear signal as to when SIMD will be supported (likely after the first spec MVP is published).

See open questions below; unclear how JIT is supported on appstore releases.

wasmtime

One of the bigger/more complete runtimes. Currently only targets x86-64 and aarch64 (on linux). They claim new backends are planned but the timeline is unclear.

Wasmer

WAMR

Focused on breadth of architectures and small size, and looks like a close match for our needs: x86-32/64, armv7, aarch64, mips, with/without MMU. Recompiles the WASM to a custom target-specific AOT format, which can be done either automatically or offline. Would work well with our pipeline cache model (translate and cache the AOT binary, load that via mmap for execution).

wasm3

A pure interpreter using a custom bitcode format and a threaded interpreter (like IREE's VM). (Mostly) pure C with no executable page requirement, so it'll run just about anywhere. Take the performance breakdown with a big grain of salt (it's from the beginning of the year).

  • Less developed than WAMR
  • Interpreter only (using fast threaded dispatch, so pretty good, but ~= WAMR)
  • No SIMD or other useful optimizations (mutable globals, bulk memory ops)

Open Questions

SIMD spec op availability

SIMD spec: https://github.com/WebAssembly/simd/blob/master/proposals/simd/SIMD.md
We should confirm we have access to the core NN-friendly ops that are required. There's a proposal to add integer dotprod instructions but it looks like @bjacob commented on the spec here noting that the most useful dotprod op form is still missing: WebAssembly/simd#127 (comment)

iOS

It's extremely difficult to tell, but it seems like JavaScriptCore on iOS, when used by an application, can JIT and load WebAssembly. Whether this requires special entitlements is unclear (oh Apple). Recent issues indicate that the global context has a WebAssembly object that works and that the iOS simulator supports it as well (https://trac.webkit.org/changeset/264801/webkit). Workarounds that involve using WebKit (WKWebView) are a no-go as they run JSC out of process, cannot share memory, and can only marshal strings across the API.

Multiple memory spaces

WASM was defined to support multiple memory spaces (linear regions of addressable memory) - think x86 segments (what's old is new again!). This is interesting to us as the actual fixed-size heap required for wasm can then be fixed to a maximum of our shared memory size (accessible from multiple concurrent invocations) and buffers can be passed in/out via other memories.

Unfortunately this isn't supported in MVP (or AFAICT any current runtime), though the multi-memory spec proposal is active and extends things to support what we'd need.

Without this we must ensure that all buffers are allocated from the single wasm memory region. This is not difficult to accomplish (via a custom iree::hal::Allocator) and since the same behavior is needed for GPUs it's possible we can share code (something like VMA, if not VMA itself). The allocation scheduling code we emit in the compiler can help here, as the behavior we'll want for devices with discrete memory (out-of-process CPU/GPU/TPU) is the same behavior we'll want for WASM; for example, the ringbuffer used for most transients can be allocated directly from wasm memory.

wasm64

Though provisionally spec'd, 64-bit wasm is not gaining much traction just yet. The major limitation of staying 32-bit is the 4GB address space (or possibly smaller depending on the runtime, which may use some bits for tracking). multi-memory would alleviate some of the pressure here as we could add multiple chunks, but at the point that we are streaming through GBs of data in a single dispatch we've probably got other issues. Since SPIR-V also has 32-bit limitations I think this is fine.

Metadata

Labels

  • enhancement ➕ (New feature or request)
  • hal/cpu (Runtime Host/CPU-based HAL backend)
  • next-gen ✨ (Forward-looking work built on IREE's foundation)
  • runtime (Relating to the IREE runtime library)
