Description
Feature
Currently, wasmtime maps the WebAssembly memory instructions (t.load, t.store, etc.) directly to Cranelift IR memory instructions (load, store, uloadN, etc.).
This causes problems on big-endian platforms, because the Cranelift IR instructions are implemented as native load and store instructions using the machine byte order, while the WebAssembly memory instructions are specified to always use little-endian byte order.
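To make the mismatch concrete, here is a minimal standalone sketch in plain Rust (not Cranelift code). The helper names `load_u32_le` and `load_u32_native` are made up for illustration: the first models what a WebAssembly `i32.load` must return, the second models what a native load instruction returns on the host.

```rust
/// Models the semantics WebAssembly requires: the bytes in linear memory
/// are always interpreted as little-endian.
fn load_u32_le(mem: &[u8], addr: usize) -> u32 {
    let mut bytes = [0u8; 4];
    bytes.copy_from_slice(&mem[addr..addr + 4]);
    u32::from_le_bytes(bytes)
}

/// Models a native load: the bytes are interpreted in host byte order,
/// so the result differs between little-endian and big-endian hosts.
fn load_u32_native(mem: &[u8], addr: usize) -> u32 {
    let mut bytes = [0u8; 4];
    bytes.copy_from_slice(&mem[addr..addr + 4]);
    u32::from_ne_bytes(bytes)
}
```

On a little-endian host both helpers agree. On a big-endian host the native variant returns the byte-reversed value, which is the incorrect result a directly mapped load would produce.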
Now, I initially thought that one way to solve this problem could be to treat Cranelift IR memory instructions also as always little-endian by specification. However, that does not work, because there are many other uses of these instructions that do require native byte order.
Some examples of those include:
- Memory accesses added by platform ABI code (implicit pointers for argument or return values), in particular if this needs to be compatible with native code.
- Memory accesses to values prepared by trampoline code at the boundaries of VM native code and JITted code.
- Memory accesses to parts of the VMContext that are also accessed by VM native code.
In addition, there are cases where, while not strictly necessary for correctness, it is preferable for performance reasons to use native byte order, e.g. for spill code, for accessing variables on the stack, or when implementing code such as inlined copies for small memcpy calls.
So, I believe we need some way of representing both always-little-endian memory operations (used to translate the WebAssembly instructions) and native memory operations (used for everything else).
Benefit
Enabling support for Wasmtime on big-endian platforms like IBM Z.
Implementation
My current implementation of this approach simply duplicates all Cranelift IR memory instructions to create always-LE versions. So in addition to "load" there is "load_le" etc. The full list is:
load_le
load_le_complex
store_le
store_le_complex
uload16_le
uload16_le_complex
sload16_le
sload16_le_complex
istore16_le
istore16_le_complex
uload32_le
uload32_le_complex
sload32_le
sload32_le_complex
istore32_le
istore32_le_complex
Advantages of this approach include:
- Most code that creates load/store instructions can remain unchanged; the WebAssembly translator simply always uses the new instructions.
- It's already implemented and working :-)
But there are disadvantages:
- All back-ends must implement all the new instructions (usually by just mapping them back to normal loads/stores), or else the target will stop working.
- Middle-end code operating on loads/stores (e.g. the code that recognizes and creates _complex operations) should be adapted, or else we can get performance regressions.
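As a rough illustration of what each back-end would have to do with the duplicated instructions, consider the sketch below. It is plain Rust with hypothetical types: `SketchOpcode` and `MachineAccess` are stand-ins invented for this example and do not correspond to Cranelift's actual opcode enum or to any real back-end API.

```rust
/// Hypothetical stand-in for a few IR opcodes (not Cranelift's real enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SketchOpcode {
    Load,
    LoadLe,
    Store,
    StoreLe,
}

/// Hypothetical stand-in for what a back-end would select.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MachineAccess {
    Load,
    LoadByteReversed,
    Store,
    StoreByteReversed,
}

/// On a little-endian target the `_le` opcodes map straight back to the
/// normal accesses; only a big-endian target needs byte-reversed forms.
fn lower(op: SketchOpcode, target_is_big_endian: bool) -> MachineAccess {
    match (op, target_is_big_endian) {
        (SketchOpcode::Load, _) | (SketchOpcode::LoadLe, false) => MachineAccess::Load,
        (SketchOpcode::LoadLe, true) => MachineAccess::LoadByteReversed,
        (SketchOpcode::Store, _) | (SketchOpcode::StoreLe, false) => MachineAccess::Store,
        (SketchOpcode::StoreLe, true) => MachineAccess::StoreByteReversed,
    }
}
```

The sketch also shows the cost mentioned above: every back-end needs the extra match arms, even where they collapse to the existing behaviour.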
Alternatives
There are various alternative ways this could be implemented:
A) Add an additional flag argument to load/store instructions that specifies the requested byte order. A detail question is whether the flag is:
- little-endian vs. native
- little-endian vs. big-endian
- little-endian vs. big-endian vs. native
Advantages:
- no new IR instructions required
- existing back-ends could simply ignore the flag
Disadvantages:
- it's still an IR change as that flag must be considered part of the IR (e.g. parsing IR, serialization ...)
- All creators of loads/stores (including outside of cranelift, and possibly even outside of wasmtime!) must be updated. If there is no "native" flag setting, all those updates must include finding out the native byte order somehow.
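For the three-valued variant of option A, the flag could look roughly like the following. This is only a sketch in standalone Rust; the `ByteOrder` enum and its methods are hypothetical names chosen for this example, not an existing Cranelift type.

```rust
/// Hypothetical byte-order argument for a load/store instruction.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ByteOrder {
    Little,
    Big,
    Native,
}

impl ByteOrder {
    /// Replace `Native` with the concrete byte order of the target.
    fn resolve(self, target_is_big_endian: bool) -> ByteOrder {
        match self {
            ByteOrder::Native if target_is_big_endian => ByteOrder::Big,
            ByteOrder::Native => ByteOrder::Little,
            explicit => explicit,
        }
    }

    /// True if the access must reverse bytes relative to a native access.
    fn needs_byte_swap(self, target_is_big_endian: bool) -> bool {
        match self.resolve(target_is_big_endian) {
            ByteOrder::Little => target_is_big_endian,
            ByteOrder::Big => !target_is_big_endian,
            // `resolve` never returns `Native`; this arm only keeps the match exhaustive.
            ByteOrder::Native => false,
        }
    }
}
```

With a `Native` setting, creators of ordinary loads/stores never need to know the target byte order. Without it, each of them would have to compute the equivalent of `resolve` themselves, which is the disadvantage noted above.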
B) Add an additional flag bit to the existing MemFlags
Advantages:
- no new IR (should be covered by existing serialization ...)
- can be ignored by existing back-ends
- no change to (most) creators of loads/stores necessary
Disadvantages:
- MemFlags can no longer be dropped; it becomes required for correctness to always preserve it in the middle end
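A minimal sketch of option B, assuming a single extra bit meaning "this access is little-endian". The `SketchMemFlags` type below is a hypothetical stand-in written for this example; it is not Cranelift's actual MemFlags definition, and the bit position is arbitrary.

```rust
/// Hypothetical stand-in for a flags word attached to a memory access.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
struct SketchMemFlags {
    bits: u8,
}

impl SketchMemFlags {
    /// Assumed bit: the access uses little-endian byte order.
    const LITTLE_ENDIAN: u8 = 1 << 0;

    /// Default flags: no bits set, meaning a native-order access.
    fn new() -> Self {
        Self::default()
    }

    /// Return a copy of the flags with the little-endian bit set.
    fn with_little_endian(mut self) -> Self {
        self.bits |= Self::LITTLE_ENDIAN;
        self
    }

    fn is_little_endian(self) -> bool {
        (self.bits & Self::LITTLE_ENDIAN) != 0
    }

    /// A back-end only has to act on the bit when the target is big-endian.
    fn needs_byte_swap(self, target_is_big_endian: bool) -> bool {
        self.is_little_endian() && target_is_big_endian
    }
}
```

Because the default value leaves the bit clear, existing creators of loads/stores keep their native semantics. The flip side is the one listed above: if any pass replaces the flags with the default, a little-endian access silently becomes a native one.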
C) Open-code the conversion in the WebAssembly translator
Only emit a plain "load" if the target is little-endian. Otherwise emit a load followed by a byte-swap (possibly followed by an extension). Vice versa for stores: emit a byte-swap followed by a store. This would probably require the addition of a new "bswap" Cranelift IR instruction, unless we want to open-code bswap itself as well (possible, but a bit tedious).
Advantages:
- Only "bswap" as new IR element, can be ignored by back-ends for little-endian architectures and everywhere else.
Disadvantages:
- No major ones I can see - this would be my preferred approach.
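The open-coded conversion of option C can be sketched in standalone Rust as follows. This models the semantics the translator would emit, not the translator itself: the function names are hypothetical, `swap_bytes` plays the role of the proposed "bswap" instruction, and `cfg!(target_endian = "big")` stands in for the translator's knowledge of the target.

```rust
/// i32.load: native load, then byte-swap only on a big-endian target.
fn wasm_load_u32(mem: &[u8], addr: usize) -> u32 {
    let mut bytes = [0u8; 4];
    bytes.copy_from_slice(&mem[addr..addr + 4]);
    let native = u32::from_ne_bytes(bytes); // plain native "load"
    if cfg!(target_endian = "big") {
        native.swap_bytes() // the "bswap" step
    } else {
        native
    }
}

/// i32.store: byte-swap first on a big-endian target, then native store.
fn wasm_store_u32(mem: &mut [u8], addr: usize, value: u32) {
    let native = if cfg!(target_endian = "big") {
        value.swap_bytes()
    } else {
        value
    };
    mem[addr..addr + 4].copy_from_slice(&native.to_ne_bytes());
}

/// i32.load16_s: native 16-bit load, byte-swap, and only then sign-extend.
fn wasm_load_i16_sext(mem: &[u8], addr: usize) -> i32 {
    let mut bytes = [0u8; 2];
    bytes.copy_from_slice(&mem[addr..addr + 2]);
    let native = u16::from_ne_bytes(bytes);
    let swapped = if cfg!(target_endian = "big") {
        native.swap_bytes()
    } else {
        native
    };
    swapped as i16 as i32 // the extension must follow the swap
}
```

Note the ordering in the 16-bit case: the swap operates on the narrow value and the extension comes afterwards, matching the "load, byte-swap, extension" sequence described above.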