From b7b54d596796c813a334a69478ffcd4240165aaa Mon Sep 17 00:00:00 2001 From: Edoardo Vacchi Date: Sat, 9 Mar 2024 08:39:11 +0800 Subject: [PATCH] wazevo(docs): optimizing compiler (#2065) Signed-off-by: Edoardo Vacchi --- site/content/docs/_index.md | 3 +- .../_index.md | 131 +++++ .../appendix.md | 185 +++++++ .../backend.md | 507 ++++++++++++++++++ .../frontend.md | 371 +++++++++++++ 5 files changed, 1196 insertions(+), 1 deletion(-) create mode 100644 site/content/docs/how_the_optimizing_compiler_works/_index.md create mode 100644 site/content/docs/how_the_optimizing_compiler_works/appendix.md create mode 100644 site/content/docs/how_the_optimizing_compiler_works/backend.md create mode 100644 site/content/docs/how_the_optimizing_compiler_works/frontend.md diff --git a/site/content/docs/_index.md b/site/content/docs/_index.md index e04a20d7bc..e00d8e3681 100644 --- a/site/content/docs/_index.md +++ b/site/content/docs/_index.md @@ -143,7 +143,8 @@ Notably, the interpreter and compiler in wazero's [Runtime configuration][Runtim In wazero, a compiler is a runtime configured to compile modules to platform-specific machine code ahead of time (AOT) during the creation of [CompiledModule][CompiledModule]. This means your WebAssembly functions execute natively at runtime of the embedding Go program. Compiler is faster than Interpreter, often by order of -magnitude (10x) or more, and therefore enabled by default whenever available. +magnitude (10x) or more, and therefore enabled by default whenever available. You can read more about wazero's +[optimizing compiler in the detailed documentation]({{< relref "/how_the_optimizing_compiler_works" >}}). #### Interpreter diff --git a/site/content/docs/how_the_optimizing_compiler_works/_index.md b/site/content/docs/how_the_optimizing_compiler_works/_index.md new file mode 100644 index 0000000000..9ba1e7df4d --- /dev/null +++ b/site/content/docs/how_the_optimizing_compiler_works/_index.md @@ -0,0 +1,131 @@ ++++ +title = "How the Optimizing Compiler Works" +layout = "single" ++++ + +wazero supports two modes of execution: interpreter mode and compilation mode. +The interpreter mode is a fallback mode for platforms where compilation is not +supported. Compilation mode is otherwise the default mode of execution: it +translates Wasm modules to native code to get the best run-time performance. + +Translating Wasm bytecode into machine code can take multiple forms. wazero +1.0 performs a straightforward translation from a given instruction to a native +instruction. wazero 2.0 introduces an optimizing compiler that is able to +perform nontrivial optimizing transformations, such as constant folding or +dead-code elimination, and it makes better use of the underlying hardware, such +as CPU registers. This document digs deeper into what we mean when we say +"optimizing compiler", and explains how it is implemented in wazero. + +This document is intended for maintainers, researchers, developers and in +general anyone interested in understanding the internals of wazero. + +What is an Optimizing Compiler? +------------------------------- + +Wazero supports an _optimizing_ compiler in the style of other optimizing +compilers such as LLVM's or V8's. Traditionally an optimizing +compiler performs compilation in a number of steps. 

Compare this to the **old compiler**, where compilation happens in one step or
two, depending on how you count:

```goat
      Input       +--------------+      +---------------+
 Wasm Binary ---->| DecodeModule |----->| CompileModule |----> wazero IR
                  +--------------+      +---------------+
```

That is, the module is (1) validated then (2) translated to an Intermediate
Representation (IR). The wazero IR can then be executed directly (in the case
of the interpreter) or it can be further processed and translated into native
code by the compiler. This compiler performs a straightforward translation from
the IR to native code, without any further passes. The wazero IR is not
intended for further processing beyond immediate execution or straightforward
translation.

```goat
        +---- wazero IR ----+
        |                   |
        v                   v
 +--------------+    +--------------+
 |   Compiler   |    | Interpreter  |- - - executable
 +--------------+    +--------------+
        |
   +----+-----+
   |          |
   v          v
+---------+  +---------+
|  ARM64  |  |  AMD64  |
| Backend |  | Backend | - - - - - - - - - executable
+---------+  +---------+
```

Validation and translation to an IR in a compiler are usually called the
**front-end** part of a compiler, while code-generation occurs in what we call
the **back-end** of a compiler. The front-end is the part of a compiler that is
closer to the input, and it generally involves machine-independent processing,
such as parsing and static validation. The back-end is the part of a compiler
that is closer to the output, and it generally includes machine-specific
procedures, such as code-generation.

In the **optimizing** compiler, we still decode and translate Wasm binaries to
an intermediate representation in the front-end, but we use a textbook
representation called an **SSA** or "Static Single-Assignment Form", that is
intended for further transformation.

The benefit of choosing an IR that is meant for transformation is that a lot of
optimization passes can apply directly to the IR, and thus be
machine-independent. Then the back-end can be relatively simpler, in that it
will only have to deal with machine-specific concerns.

The wazero optimizing compiler implements the following compilation passes:

* Front-End:
  - Translation to SSA
  - Optimization
  - Block Layout
  - Control Flow Analysis

* Back-End:
  - Instruction Selection
  - Register Allocation
  - Finalization and Encoding

```goat
      Input       +--------------+      +---------------+
 Wasm Binary ---->| DecodeModule |----->| CompileModule |--+
                  +--------------+      +---------------+  |
  +---------------------------------------------------------+
  |
  |    +---------------+             +---------------+
  +--->|   Front-End   |------------>|    Back-End   |
       +---------------+             +---------------+
               |                             |
               v                             v
              SSA                  Instruction Selection
               |                             |
               v                             v
         Optimization               Register Allocation
               |                             |
               v                             v
         Block Layout              Finalization/Encoding
```

Like the other engines, the implementation can be found under `engine`,
specifically in the `wazevo` sub-package. The entry-point is
`internal/engine/wazevo/engine.go`, where the interface `wasm.Engine` is
implemented.

All the passes can be dumped to the console for debugging, by enabling the
build-time flags under `internal/engine/wazevo/wazevoapi/debug_options.go`.
The flags are disabled by default and should only be enabled during debugging.
These may also change in the future.
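
From an embedder's point of view, none of these passes needs to be configured
explicitly: the optimizing compiler is selected through the regular runtime
configuration, and it is already the default on supported platforms. The
following minimal sketch uses only the public wazero API; the `module.wasm`
file and the surrounding embedding code are placeholders:

```go
package main

import (
	"context"
	"log"
	"os"

	"github.com/tetratelabs/wazero"
)

func main() {
	ctx := context.Background()

	// Explicitly select the compiler (the default whenever supported);
	// wazero.NewRuntimeConfigInterpreter() would select the fallback instead.
	r := wazero.NewRuntimeWithConfig(ctx, wazero.NewRuntimeConfigCompiler())
	defer r.Close(ctx)

	wasm, err := os.ReadFile("module.wasm") // placeholder module
	if err != nil {
		log.Fatal(err)
	}

	// CompileModule runs the front-end/back-end pipeline described above
	// ahead of time, producing native code for the module.
	compiled, err := r.CompileModule(ctx, wasm)
	if err != nil {
		log.Fatal(err)
	}
	defer compiled.Close(ctx)
}
```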

In the following we will assume all paths to be relative to the
`internal/engine/wazevo` directory, so we will omit the prefix.

## Index

- [Front-End](frontend/)
- [Back-End](backend/)
- [Appendix](appendix/)
diff --git a/site/content/docs/how_the_optimizing_compiler_works/appendix.md b/site/content/docs/how_the_optimizing_compiler_works/appendix.md
new file mode 100644
index 0000000000..c66115c2a2
--- /dev/null
+++ b/site/content/docs/how_the_optimizing_compiler_works/appendix.md
@@ -0,0 +1,185 @@
+++
title = "Appendix: Trampolines"
layout = "single"
+++

Trampolines are used to interface between the Go runtime and the generated
code, in two cases:

- when we need to **enter the generated code** from the Go runtime.
- when we need to **leave the generated code** to invoke a host function
  (written in Go).

In this section we want to complete the picture of how a Wasm function gets
translated from Wasm to executable code in the optimizing compiler, by
describing how to jump into the execution of the generated code at run-time.

## Entering the Generated Code

At run-time, user space invokes a Wasm function through the public
`api.Function` interface, using methods `Call()` or `CallWithStack()`. The
implementation of these methods, in turn, eventually invokes an ASM
**trampoline**. The signature of this trampoline in Go code is:

```go
func entrypoint(
	preambleExecutable, functionExecutable *byte,
	executionContextPtr uintptr, moduleContextPtr *byte,
	paramResultStackPtr *uint64,
	goAllocatedStackSlicePtr uintptr)
```

- `preambleExecutable` is a pointer to the generated code for the preamble (see
  below).
- `functionExecutable` is a pointer to the generated code for the function (as
  described in the previous sections).
- `executionContextPtr` is a raw pointer to the `wazevo.executionContext`
  struct. This struct is used to save the state of the Go runtime before
  entering or leaving the generated code. It also holds shared state between
  the Go runtime and the generated code, such as the exit code that is used to
  terminate execution on failure, or to suspend it to invoke host functions.
- `moduleContextPtr` is a pointer to the `wazevo.moduleContextOpaque` struct.
  Its contents are basically pointers to the module instance, module-specific
  objects, and functions. This is sometimes called "VMContext" in other Wasm
  runtimes.
- `paramResultStackPtr` is a pointer to the slice where the arguments and
  results of the function are passed.
- `goAllocatedStackSlicePtr` is an aligned pointer to the Go-allocated stack
  for holding values and call frames. For further details refer to
  [Backend § Prologue and Epilogue](../backend/#prologue-and-epilogue).

The trampoline can be found in `backend/isa/<arch>/abi_entry_<arch>.s`.

For each given architecture, the trampoline:

- moves the arguments to specific registers, to match the behavior expected by
  the entry preamble, and
- finally, jumps into the execution of the generated code for the preamble.

The **preamble** that the `entrypoint` function jumps to is generated per
function signature.

This is implemented in `machine.CompileEntryPreamble(*ssa.Signature)`.

The preamble sets the fields in the `wazevo.executionContext`.

At the beginning of the preamble:

- Set a register to point to the `*wazevo.executionContext` struct.
- Save the stack pointers, frame pointers, return addresses, etc. to that
  struct.
- Update the stack pointer to point to `paramResultStackPtr`.
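
Before looking at how the preamble consumes these values, it may help to see
where `paramResultStackPtr` comes from on the user side. The sketch below is
illustrative only: it assumes `fn` is an `api.Function` (from
`github.com/tetratelabs/wazero/api`) for the `abs` example used throughout this
document, with signature `(param i32) (result i32)`:

```go
// callAbs shows how arguments and results share a single uint64 slice when
// calling through the public API; paramResultStackPtr ends up pointing at
// this slice when the trampoline is entered.
func callAbs(ctx context.Context, fn api.Function) (int32, error) {
	stack := make([]uint64, 1)   // len = max(number of params, number of results)
	stack[0] = api.EncodeI32(-5) // write argument 0
	if err := fn.CallWithStack(ctx, stack); err != nil {
		return 0, err
	}
	// Result 0 overwrites the slot that held argument 0.
	return api.DecodeI32(stack[0]), nil
}
```

The preamble and the generated function body then pick the arguments up from
this slice, and write the results back to it, as described next.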
+ +The generated code works in concert with the assumption that the preamble has +been entered through the aforementioned trampoline. Thus, it assumes that the +arguments can be found in some specific registers. + +The preamble then assigns the arguments pointed at by `paramResultStackPtr` to +the registers and stack location that the generated code expects. + +Finally, it invokes the generated code for the function. + +The epilogue reverses part of the process, finally returning control to the +caller of the `entrypoint()` function, and the Go runtime. The caller of +`entrypoint()` is also responsible for completing the cleaning up procedure by +invoking `afterGoFunctionCallEntrypoint()` (again, implemented in +backend-specific ASM). which will restore the stack pointers and return +control to the caller of the function. + +The arch-specific code can be found in +`backend/isa//abi_entry_preamble.go`. + +[wazero-engine-stack]: https://github.com/tetratelabs/wazero/blob/095b49f74a5e36ce401b899a0c16de4eeb46c054/internal/engine/compiler/engine.go#L77-L132 +[abi-arm64]: https://tip.golang.org/src/cmd/compile/abi-internal#arm64-architecture +[abi-amd64]: https://tip.golang.org/src/cmd/compile/abi-internal#amd64-architecture +[abi-cc]: https://tip.golang.org/src/cmd/compile/abi-internal#function-call-argument-and-result-passing + + +## Leaving the Generated Code + +In "[How do compiler functions work?][how-do-compiler-functions-work]", we +already outlined how _leaving_ the generated code works with the help of a +function. We will complete here the picture by briefly describing the code that +is generated. + +When the generated code needs to return control to the Go runtime, it inserts a +meta-instruction that is called `exitSequence` in both `amd64` and `arm64` +backends. This meta-instruction sets the `exitCode` in the +`wazevo.executionContext` struct, restore the stack pointers and then returns +control to the caller of the `entrypoint()` function described above. + +As described in "[How do compiler functions +work?][how-do-compiler-functions-work]", the mechanism is essentially the same +when invoking a host function or raising an error. However, when a function is +invoked the `exitCode` also indicates the identifier of the host function to be +invoked. + +The magic really happens in the `backend.Machine.CompileGoFunctionTrampoline()` +method. This method is actually invoked when host modules are being +instantiated. It generates a trampoline that is used to invoke such functions +from the generated code. + +This trampoline implements essentially the same prologue as the `entrypoint()`, +but it also reserves space for the arguments and results of the function to be +invoked. + +A host function has the signature: + +``` +func(ctx context.Context, stack []uint64) +``` + +the function arguments in the `stack` parameter are copied over to the reserved +slots of the real stack. For instance, on `arm64` the stack layout would look +as follows (on `amd64` it would be similar): + +```goat + (high address) + SP ------> +-----------------+ <----+ + | ....... | | + | ret Y | | + | ....... | | + | ret 0 | | + | arg X | | size_of_arg_ret + | ....... | | + | arg 1 | | + | arg 0 | <----+ <-------- originalArg0Reg + | size_of_arg_ret | + | ReturnAddress | + +-----------------+ <----+ + | xxxx | | ;; might be padded to make it 16-byte aligned. + +--->| arg[N]/ret[M] | | + sliceSize| | ............ 
| | goCallStackSize + | | arg[1]/ret[1] | | + +--->| arg[0]/ret[0] | <----+ <-------- arg0ret0AddrReg + | sliceSize | + | frame_size | + +-----------------+ + (low address) +``` + +Finally, the trampoline jumps into the execution of the host function using the +`exitSequence` meta-instruction. + +Upon return, the process is reversed. + +## Code + +- The trampoline to enter the generated function is implemented by the + `backend.Machine.CompileEntryPreamble()` method. +- The trampoline to return traps and invoke host functions is generated by + `backend.Machine.CompileGoFunctionTrampoline()` method. + +You can find arch-specific implementations in +`backend/isa//abi_go_call.go`, +`backend/isa//abi_entry_preamble.go`, etc. The trampolines are found +under `backend/isa//abi_entry_.s`. + +## Further References + +- Go's [internal ABI documentation][abi-internal] details the calling convention similar to the one we use in both arm64 and amd64 backend. +- Raphael Poss's [The Go low-level calling convention on + x86-64][go-call-conv-x86] is also an excellent reference for `amd64`. + +[abi-internal]: https://tip.golang.org/src/cmd/compile/abi-internal +[go-call-conv-x86]: https://dr-knz.net/go-calling-convention-x86-64.html +[proposal-register-cc]: https://go.googlesource.com/proposal/+/master/design/40724-register-calling.md#background +[how-do-compiler-functions-work]: ../../how_do_compiler_functions_work/ + diff --git a/site/content/docs/how_the_optimizing_compiler_works/backend.md b/site/content/docs/how_the_optimizing_compiler_works/backend.md new file mode 100644 index 0000000000..76a8786551 --- /dev/null +++ b/site/content/docs/how_the_optimizing_compiler_works/backend.md @@ -0,0 +1,507 @@ ++++ +title = "How the Optimizing Compiler Works: Back-End" +layout = "single" ++++ + +In this section we will discuss the phases in the back-end of the optimizing +compiler: + +- [Instruction Selection](#instruction-selection) +- [Register Allocation](#register-allocation) +- [Finalization and Encoding](#finalization-and-encoding) + +Each section will include a brief explanation of the phase, references to the +code that implements the phase, and a description of the debug flags that can +be used to inspect that phase. Please notice that, since the implementation of +the back-end is architecture-specific, the code might be different for each +architecture. + +### Code + +The higher-level entry-point to the back-end is the +`backend.Compiler.Compile(context.Context)` method. This method executes, in +turn, the following methods in the same type: + +- `backend.Compiler.Lower()` (instruction selection) +- `backend.Compiler.RegAlloc()` (register allocation) +- `backend.Compiler.Finalize(context.Context)` (finalization and encoding) + +## Instruction Selection + +The instruction selection phase is responsible for mapping the higher-level SSA +instructions to arch-specific instructions. Each SSA instruction is translated +to one or more machine instructions. + +Each target architecture comes with a different number of registers, some of +them are general purpose, others might be specific to certain instructions. In +general, we can expect to have a set of registers for integer computations, +another set for floating point computations, a set for vector (SIMD) +computations, and some specific special-purpose registers (e.g. stack pointers, +program counters, status flags, etc.) 

In addition, some registers might be reserved by the Go runtime or the
Operating System for specific purposes, so they should be handled with special
care.

At this point in the compilation process we do not want to deal with all that.
Instead, we assume that we have a potentially infinite number of *virtual
registers* of each type at our disposal. The next phase, the register
allocation phase, will map these virtual registers to the actual registers of
the target architecture.

### Operands and Addressing Modes

As a rule of thumb, we want to map each `ssa.Value` to a virtual register, and
then use that virtual register as one of the arguments of the machine
instruction that we will generate. However, usually instructions are able to
address more than just registers: an *operand* might be able to represent a
memory address, or an immediate value (i.e. a constant value that is encoded as
part of the instruction itself).

For these reasons, instead of mapping each `ssa.Value` to a virtual register
(`regalloc.VReg`), we map each `ssa.Value` to an architecture-specific
`operand` type.

During lowering of an `ssa.Instruction`, for each `ssa.Value` that is used as
an argument of the instruction, in the simplest case, the `operand` might be
mapped to a virtual register; in other cases, the `operand` might be mapped to
a memory address, or an immediate value. Sometimes this makes it possible to
replace several SSA instructions with a single machine instruction, by folding
the addressing mode into the instruction itself.

For instance, consider the following SSA instructions:

```
    v4:i32 = Const 0x9
    v6:i32 = Load v5, 0x4
    v7:i32 = Iadd v6, v4
```

In the `amd64` architecture, in AT&T syntax, the `add` instruction adds the
first operand to the second operand, and assigns the result to the second
operand. So assuming that `v4`, `v5`, `v6`, and `v7` are mapped respectively to
the virtual registers `r4?`, `r5?`, `r6?`, and `r7?`, the lowering of the
`Iadd` instruction on `amd64` might look like this:

```asm
    ;; AT&T syntax
    add 4(%r5?), %r4?  ;; add the value at memory address [`r5?` + 4] to `r4?`
    mov %r4?, %r7?     ;; move the result from `r4?` to `r7?`
```

Notice how the load from memory has been folded into an operand of the `add`
instruction. This transformation is possible when the value produced by the
instruction being folded is not referenced by other instructions and the
instructions belong to the same `InstructionGroupID` (see [Front-End:
Optimization](../frontend/#optimization)).

### Example

At the end of the instruction selection phase, the basic blocks of our `abs`
function will look as follows (for `arm64`):

```asm
L1 (SSA Block: blk0):
	mov x130?, x2
	subs wzr, w130?, #0x0
	b.ge L2
L3 (SSA Block: blk1):
	mov x136?, xzr
	sub w134?, w136?, w130?
	mov x135?, x134?
	b L4
L2 (SSA Block: blk2):
	mov x135?, x130?
L4 (SSA Block: blk3):
	mov x0, x135?
	ret
```

Notice the introduction of the new identifiers `L1`, `L3`, `L2`, and `L4`.
These are labels that are used to mark the beginning of each basic block, and
they are the target for branching instructions such as `b` and `b.ge`.

### Code

`backend.Machine` is the interface to the backend. It has methods to translate
(lower) the IR to machine code. Again, as seen earlier in the front-end, the
term *lowering* is used to indicate translation from a higher-level
representation to a lower-level representation.

`backend.Machine.LowerInstr(*ssa.Instruction)` is the method that translates an
SSA instruction to machine code. Machine-specific implementations of this
method can be found in package `backend/isa/<arch>` where `<arch>` is either
`amd64` or `arm64`.

### Debug Flags

`wazevoapi.PrintSSAToBackendIRLowering` prints the basic blocks with the
lowered arch-specific instructions.

## Register Allocation

The register allocation phase is responsible for mapping the potentially
infinite number of virtual registers to the real registers of the target
architecture. Because the number of real registers is limited, the register
allocation phase might need to "spill" some of the virtual registers to memory;
that is, it might store their content, and then load them back into a register
when they are needed.

For a given function `f` the register allocation procedure
`regalloc.Allocator.DoAllocation(f)` is implemented in sub-phases:

- `livenessAnalysis(f)` collects the "liveness" information for each virtual
  register. The algorithm is described in [Chapter 9.2 of The SSA
  Book][ssa-book].

- `alloc(f)` allocates registers for the given function. The algorithm is
  derived from [the Go compiler's allocator][go-regalloc].

At the end of the allocation procedure, we also record the set of registers
that are **clobbered** by the body of the function. A register is clobbered if
the function body overwrites it even though the calling convention requires its
value to be preserved across the call. This information is used in the
finalization phase to determine which registers need to be saved in the
prologue and restored in the epilogue. Strictly speaking, this last step does
not belong to register allocation in a textbook meaning, but it is a necessary
step for the finalization phase.

### Liveness Analysis

Intuitively, a variable or name binding can be considered _live_ at a certain
point in a program, if its value will be used in the future.

For instance:

```
1| int f(int x) {
2|   int y = 2 + x;
3|   int z = x + y;
4|   return z;
5| }
```

Variables `x` and `y` are both live at line 3, because they are used in the
expression `x + y` on line 3; variable `z` is live at line 4, because it is
used in the return statement. However, variables `x` and `y` can be considered
_not_ live at line 4 because they are not used anywhere after line 3.

Statically, _liveness_ can be approximated by following paths backwards on the
control-flow graph, connecting the uses of a given variable to its definitions
(or its *unique* definition, assuming SSA form).

In practice, while liveness is a property of each name binding at any point in
the program, it is enough to keep track of liveness at the boundaries of basic
blocks:

- the _live-in_ set for a given basic block is the set of all bindings that are
  live at the entry of that block.
- the _live-out_ set for a given basic block is the set of all bindings that
  are live at the exit of that block. A binding is live at the exit of a block
  if it is live at the entry of a successor.

Because the CFG is a connected graph, it is enough to keep track of either
live-in or live-out sets, and then propagate the liveness information backward
or forward, respectively. In our case, we keep track of live-in sets per block;
live-outs are derived from live-ins of the successor blocks when a block is
allocated.

### Allocation

We implemented a variant of the linear scan register allocation algorithm
described in [the Go compiler's allocator][go-regalloc].
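
To make the liveness computation above concrete, here is a small,
self-contained sketch of the classic backward data-flow formulation over a toy
CFG. The types and names are invented for illustration; they are not wazero's
actual `regalloc` data structures, and the toy program is not even in SSA form
(which does not matter for the data-flow equations):

```go
package main

import "fmt"

// A toy basic block: the registers it reads before writing them
// (upward-exposed uses), the ones it defines, and its successors in the CFG.
type block struct {
	uses, defs []string
	succs      []int
}

// liveness iterates the standard equations until a fixed point is reached:
//
//	liveOut[b] = union of liveIn[s] for every successor s of b
//	liveIn[b]  = uses[b] ∪ (liveOut[b] − defs[b])
func liveness(blocks []block) []map[string]bool {
	liveIn := make([]map[string]bool, len(blocks))
	liveOut := make([]map[string]bool, len(blocks))
	for i := range blocks {
		liveIn[i], liveOut[i] = map[string]bool{}, map[string]bool{}
	}
	for changed := true; changed; {
		changed = false
		for b := len(blocks) - 1; b >= 0; b-- { // walk the blocks backwards
			out := map[string]bool{}
			for _, s := range blocks[b].succs {
				for v := range liveIn[s] {
					out[v] = true
				}
			}
			in := map[string]bool{}
			for v := range out {
				in[v] = true
			}
			for _, v := range blocks[b].defs {
				delete(in, v)
			}
			for _, v := range blocks[b].uses {
				in[v] = true
			}
			// Live sets only grow, so a size change means another iteration.
			if len(in) != len(liveIn[b]) || len(out) != len(liveOut[b]) {
				changed = true
			}
			liveIn[b], liveOut[b] = in, out
		}
	}
	return liveIn
}

func main() {
	// A diamond CFG: b0 branches to b1/b2, both jump to b3.
	//   b0: x = ...; cond = ...   b1: y = x + 1   b2: y = x - 1   b3: return y
	blocks := []block{
		{defs: []string{"x", "cond"}, succs: []int{1, 2}},
		{uses: []string{"x"}, defs: []string{"y"}, succs: []int{3}},
		{uses: []string{"x"}, defs: []string{"y"}, succs: []int{3}},
		{uses: []string{"y"}},
	}
	fmt.Println(liveness(blocks)) // x is live-in to b1 and b2; y is live-in to b3
}
```

With live-in sets available per block, live-outs can be derived from the
successors as each block is processed, which is how the allocation procedure
proceeds, as described next.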
+ +Each basic block is allocated registers in a linear scan order, and the +allocation state is propagated from a given basic block to its successors. +Then, each block continues allocation from that initial state. + +#### Merge States + +Special care has to be taken when a block has multiple predecessors. We call +this *fixing merge states*: for instance, consider the following: + +```goat { width="30%" } + .---. .---. +| BB0 | | BB1 | + '-+-' '-+-' + +----+----+ + | + v + .---. + | BB2 | + '---' +``` + +if the live-out set of a given block `BB0` is different from the live-out set +of a given block `BB1` and both are predecessors of a block `BB2`, then we need +to adjust `BB0` and `BB1` to ensure consistency with `BB2`. In practice, +abstract values in `BB0` and `BB1` might be passed to `BB2` either via registers +or via stack; fixing merge states ensures that registers and stack are used +consistently to pass values across the involved states. + +#### Spilling + +If the register allocator cannot find a free register for a given virtual +(live) register, it needs to "spill" the value to the stack to get a free +register, *i.e.,* stash it temporarily to stack. When that virtual register is +reused later, we will have to insert instructions to reload the value into a +real register. + +While the procedure proceeds with allocation, the procedure also records all +the virtual registers that transition to the "spilled" state, and inserts the +reload instructions when those registers are reused later. + +The spill instructions are actually inserted at the end of the register +allocation, after all the allocations and the merge states have been fixed. At +this point, all the other potential sources of instability have been resolved, +and we know where all the reloads happen. + +We insert the spills in the block that is the lowest common ancestor of all the +blocks that reload the value. + +#### Clobbered Registers + +At the end of the allocation procedure, the `determineCalleeSavedRealRegs(f)` +method iterates over the set of the allocated registers and compares them +to a set of architecture-specific set `CalleeSavedRegisters`. If a register +has been allocated, and it is present in this set, the register is marked as +"clobbered", i.e., we now know that the register allocator will overwrite +that value. Thus, these values will have to be spilled in the prologue. + +#### References + +Register allocation is a complex problem, possibly the most complicated +part of the backend. The following references were used to implement the +algorithm: + +- https://web.stanford.edu/class/archive/cs/cs143/cs143.1128/lectures/17/Slides17.pdf +- https://en.wikipedia.org/wiki/Chaitin%27s_algorithm +- https://llvm.org/ProjectsWithLLVM/2004-Fall-CS426-LS.pdf +- https://pfalcon.github.io/ssabook/latest/book-full.pdf: Chapter 9. for liveness analysis. +- https://github.com/golang/go/blob/release-branch.go1.21/src/cmd/compile/internal/ssa/regalloc.go + +We suggest to refer to them to dive deeper in the topic. + +### Example + +At the end of the register allocation phase, the basic blocks of our `abs` +function look as follows (for `arm64`): + +```asm +L1 (SSA Block: blk0): + mov x2, x2 + subs wzr, w2, #0x0 + b.ge L2 +L3 (SSA Block: blk1): + mov x8, xzr + sub w8, w8, w2 + mov x8, x8 + b L4 +L2 (SSA Block: blk2): + mov x8, x2 +L4 (SSA Block: blk3): + mov x0, x8 + ret +``` + +Notice how the virtual registers have been all replaced by real registers, i.e. +no register identifier is suffixed with `?`. 
This example is quite simple, and +it does not require any spill. + +### Code + +The algorithm (`regalloc/regalloc.go`) can work on any ISA by implementing the +interfaces in `regalloc/api.go`. + +Essentially: + +- each architecture exposes iteration over basic blocks of a function + (`regalloc.Function` interface) +- each arch-specific basic block exposes iteration over instructions + (`regalloc.Block` interface) +- each arch-specific instruction exposes the set of registers it defines and + uses (`regalloc.Instr` interface) + +By defining these interfaces, the register allocation algorithm can assign real +registers to virtual registers without dealing specifically with the target +architecture. + +In practice, each interface is usually implemented by instantiating a common +generic struct that comes already with an implementation of all or most of the +required methods. For instance,`regalloc.Function`is implemented by +`backend.RegAllocFunction[*arm64.instruction, *arm64.machine]`. + +`backend/isa//abi.go` (where `` is either `arm64` or `amd64`) +contains the instantiation of the `regalloc.RegisterInfo` struct, which +declares, among others +- the set of registers that are available for allocation, excluding, for + instance, those that might be reserved by the runtime or the OS +(`AllocatableRegisters`) +- the registers that might be saved by the callee to the stack + (`CalleeSavedRegisters`) + +### Debug Flags + +- `wazevoapi.RegAllocLoggingEnabled` logs detailed logging of the register + allocation procedure. +- `wazevoapi.PrintRegisterAllocated` prints the basic blocks with the register + allocation result. + +## Finalization and Encoding + +At the end of the register allocation phase, we have enough information to +finally generate machine code (_encoding_). We are only missing the prologue +and epilogue of the function. + +### Prologue and Epilogue + +As usual, the **prologue** is executed before the main body of the function, +and the **epilogue** is executed at the return. The prologue is responsible for +setting up the stack frame, and the epilogue is responsible for cleaning up the +stack frame and returning control to the caller. + +Generally, this means, at the very least: +- saving the return address +- a base pointer to the stack; or, equivalently, the height of the stack at the + beginning of the function + +For instance, on `amd64`, `RBP` is the base pointer, `RSP` is the stack +pointer: + +```goat {width="100%" height="250"} + (high address) (high address) + RBP ----> +-----------------+ +-----------------+ + | `...` | | `...` | + | ret Y | | ret Y | + | `...` | | `...` | + | ret 0 | | ret 0 | + | arg X | | arg X | + | `...` | ====> | `...` | + | arg 1 | | arg 1 | + | arg 0 | | arg 0 | + | Return Addr | | Return Addr | + RSP ----> +-----------------+ | Caller_RBP | + (low address) +-----------------+ <----- RSP, RBP +``` + +While, on `arm64`, there is only a stack pointer `SP`: + + +```goat {width="100%" height="300"} + (high address) (high address) + SP ---> +-----------------+ +------------------+ <----+ + | `...` | | `...` | | + | ret Y | | ret Y | | + | `...` | | `...` | | + | ret 0 | | ret 0 | | + | arg X | | arg X | | size_of_arg_ret. 
+ | `...` | ====> | `...` | | + | arg 1 | | arg 1 | | + | arg 0 | | arg 0 | <----+ + +-----------------+ | size_of_arg_ret | + | return address | + +------------------+ <---- SP + (low address) (low address) +``` + +However, the prologue and epilogue might also be responsible for saving and +restoring the state of registers that might be overwritten by the function +("clobbered"); and, if spilling occurs, prologue and epilogue are also +responsible for reserving and releasing the space for the spilled values. + +For clarity, we make a distinction between the space reserved for the clobbered +registers and the space reserved for the spilled values: + +- Spill slots are used to temporarily store the values that needs spilling as + determined by the register allocator. This section must have a fix height, +but its contents will change over time, as registers are being spilled and +reloaded. +- Clobbered registers are, similarly, determined by the register allocator, but + they are stashed in the prologue and then restored in the epilogue. + +The procedure happens after the register allocation phase because at +this point we have collected enough information to know how much space we need +to reserve, and which registers are clobbered. + +Regardless of the architecture, after allocating this space, the stack will +look as follows: + +```goat {height="350"} + (high address) + +-----------------+ + | `...` | + | ret Y | + | `...` | + | ret 0 | + | arg X | + | `...` | + | arg 1 | + | arg 0 | + | (arch-specific) | + +-----------------+ + | clobbered M | + | ............ | + | clobbered 1 | + | clobbered 0 | + | spill slot N | + | ............ | + | spill slot 0 | + +-----------------+ + (low address) +``` + +Note: the prologue might also introduce a check of the stack bounds. If there +is no sufficient space to allocate the stack frame, the function will exit the +execution and will try to grow it from the Go runtime. + +The epilogue simply reverses the operations of the prologue. + +### Other Post-RegAlloc Logic + +The `backend.Machine.PostRegAlloc` method is invoked after the register +allocation procedure; while its main role is to define the prologue and +epilogue of the function, it also serves as a hook to perform other, +arch-specific duty, that has to happen after the register allocation phase. + +For instance, on `amd64`, the constraints for some instructions are hard to +express in a meaningful way for the register allocation procedure (for +instance, the `div` instruction implicitly use registers `rdx`, `rax`). +Instead, they are lowered with ad-hoc logic as part of the implementation +`backend.Machine.PostRegAlloc` method. + +### Encoding + +The final stage of the backend encodes the machine instructions into bytes and +writes them to the target buffer. Before proceeding with the encoding, relative +addresses in branching instructions or addressing modes are resolved. + +The procedure encodes the instructions in the order they appear in the +function. + +### Code + +- The prologue and epilogue are set up as part of the + `backend.Machine.PostRegAlloc` method. +- The encoding is done by the `backend.Machine.Encode` method. + +### Debug Flags + +- `wazevoapi.PrintFinalizedMachineCode` prints the assembly code of the + function after the finalization phase. +- `wazevoapi.printMachineCodeHexPerFunctionUnmodified` prints a hex + representation of the function generated code as it is. 
- `wazevoapi.PrintMachineCodeHexPerFunctionDisassemblable` prints a hex
  representation of the generated code for the function in a form that can be
  disassembled.

The reason for the distinction between the last two flags is that the generated
code in some cases might not be disassemblable. The
`PrintMachineCodeHexPerFunctionDisassemblable` flag prints a hex encoding of
the generated code that can be disassembled, but cannot be executed.
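
To make the address-resolution step of the encoding phase concrete, here is a
toy, self-contained sketch of how branch targets can be turned into relative
displacements before the bytes are emitted. The one-byte pseudo opcode and the
4-byte displacement are invented for illustration and bear no relation to the
real `arm64`/`amd64` encoders:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// A toy machine instruction: a fixed encoded size, an optional label attached
// to it, and an optional branch target that still needs to be resolved.
type instr struct {
	size   int    // encoded size in bytes
	label  string // non-empty if this instruction starts a labeled block
	branch string // non-empty if this is a branch to that label
}

// encode lays the instructions out linearly, records the byte offset of every
// label, and then emits the bytes, rewriting each branch into a displacement
// relative to the end of the branch instruction itself.
func encode(prog []instr) []byte {
	offsets := map[string]int{}
	pc := 0
	for _, i := range prog {
		if i.label != "" {
			offsets[i.label] = pc
		}
		pc += i.size
	}
	var out []byte
	pc = 0
	for _, i := range prog {
		if i.branch != "" {
			disp := int32(offsets[i.branch] - (pc + i.size))
			out = append(out, 0xE9) // pseudo "branch" opcode
			out = binary.LittleEndian.AppendUint32(out, uint32(disp))
		} else {
			out = append(out, make([]byte, i.size)...) // placeholder body
		}
		pc += i.size
	}
	return out
}

func main() {
	prog := []instr{
		{size: 4},               // some instruction
		{size: 5, branch: "L2"}, // forward branch to L2
		{size: 4, label: "L1"},  // L1: some instruction
		{size: 4, label: "L2"},  // L2: some instruction
	}
	fmt.Printf("% x\n", encode(prog)) // the branch resolves to a displacement of 4
}
```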
+ +* Previous Section: [Front-End](../frontend/) +* Next Section: [Appendix: Trampolines](../appendix/) + +[ssa-book]: https://pfalcon.github.io/ssabook/latest/book-full.pdf +[go-regalloc]: https://github.com/golang/go/blob/release-branch.go1.21/src/cmd/compile/internal/ssa/regalloc.go diff --git a/site/content/docs/how_the_optimizing_compiler_works/frontend.md b/site/content/docs/how_the_optimizing_compiler_works/frontend.md new file mode 100644 index 0000000000..f64e04d661 --- /dev/null +++ b/site/content/docs/how_the_optimizing_compiler_works/frontend.md @@ -0,0 +1,371 @@ ++++ +title = "How the Optimizing Compiler Works: Front-End" +layout = "single" ++++ + +In this section we will discuss the phases in the front-end of the optimizing compiler: + +- [Translation to SSA](#translation-to-ssa) +- [Optimization](#optimization) +- [Block Layout](#block-layout) + +Every section includes an explanation of the phase; the subsection **Code** +will include high-level pointers to functions and packages; the subsection **Debug Flags** +indicates the flags that can be used to enable advanced logging of the phase. + +## Translation to SSA + +We mentioned earlier that wazero uses an internal representation called an "SSA" +form or "Static Single-Assignment" form, but we never explained what that is. + +In short terms, every program, or, in our case, every Wasm function, can be +translated in a control-flow graph. The control-flow graph is a directed graph where +each node is a sequence of statements that do not contain a control flow instruction, +called a **basic block**. Instead, control-flow instructions are translated into edges. + +For instance, take the following implementation of the `abs` function: + +```wasm +(module + (func (;0;) (param i32) (result i32) + (if (result i32) (i32.lt_s (local.get 0) (i32.const 0)) + (then + (i32.sub (i32.const 0) (local.get 0))) + (else + (local.get 0)) + ) + ) + (export "f" (func 0)) +) +``` + +This is translated to the following block diagram: + +```goat {width="100%" height="500"} + +---------------------------------------------+ + |blk0: (exec_ctx:i64, module_ctx:i64, v2:i32) | + | v3:i32 = Iconst_32 0x0 | + | v4:i32 = Icmp lt_s, v2, v3 | + | Brz v4, blk2 | + | Jump blk1 | + +---------------------------------------------+ + | + | + +---`(v4 != 0)`-+-`(v4 == 0)`---+ + | | + v v + +---------------------------+ +---------------------------+ + |blk1: () <-- (blk0) | |blk2: () <-- (blk0) | + | v6:i32 = Iconst_32 0x0 | | Jump blk3, v2 | + | v7:i32 = Isub v6, v2 | | | + | Jump blk3, v7 | | | + +---------------------------+ +---------------------------+ + | | + | | + +-`{v5 := v7}`--+--`{v5 := v2}`-+ + | + v + +------------------------------+ + |blk3: (v5:i32) <-- (blk1,blk2)| + | Jump blk_ret, v5 | + +------------------------------+ + | + {return v5} + | + v +``` + +We use the ["block argument" variant of SSA][ssa-blocks], which is also the same +representation [used in LLVM's MLIR][llvm-mlir]. In this variant, each block +takes a list of arguments. Each block ends with a branching instruction (Branch, Return, +Jump, etc...) with an optional list of arguments; these arguments are assigned +to the target block's arguments like a function. + +Consider the first block `blk0`. + +``` +blk0: (exec_ctx:i64, module_ctx:i64, v2:i32) + v3:i32 = Iconst_32 0x0 + v4:i32 = Icmp lt_s, v2, v3 + Brz v4, blk2 + Jump blk1 +``` + +You will notice that, compared to the original function, it takes two extra +parameters (`exec_ctx` and `module_ctx`): + +1. 
`exec_ctx` is a pointer to `wazevo.executionContext`. This is used to exit the
   execution in the face of traps or for host function calls.
2. `module_ctx`: a pointer to `wazevo.moduleContextOpaque`. This is used, among
   other things, to access memory.

It then takes one parameter `v2`, corresponding to the function parameter, and
it defines two variables `v3`, `v4`. `v3` is the constant 0, `v4` is the result
of comparing `v2` to `v3` using the `i32.lt_s` instruction. Then, it branches
to `blk2` if `v4` is zero, otherwise it jumps to `blk1`.

You might also have noticed that the instructions do not correspond strictly to
the original Wasm opcodes. This is because, similarly to the wazero IR used by
the old compiler, this is a custom IR.

You will also notice that, _on the right-hand side of the assignments_ of any
statement, no name occurs _twice_: this is why this form is called
**single-assignment**.

Finally, notice how `blk1` and `blk2` end with a jump to the last block `blk3`.

```
blk1: ()
  ...
  Jump blk3, v7

blk2: ()
  Jump blk3, v2

blk3: (v5:i32)
  ...
```

`blk3` takes an argument `v5`: `blk1` jumps to `blk3` with `v7` and `blk2`
jumps to `blk3` with `v2`, meaning `v5` is effectively a rename of either `v7`
or `v2`, depending on the originating block. If you are familiar with the
traditional representation of an SSA form, you will recognize that the role of
block arguments is equivalent to the role of the *Phi (Φ) function*, a special
function that returns a different value depending on the incoming edge; e.g.,
in this case: `v5 := Φ(v7, v2)`.

### Code

The relevant APIs can be found under the sub-packages `ssa` and `frontend`.
In the code, the terms *lower* or *lowering* are often used to indicate a
mapping or a translation, because such transformations usually correspond to
targeting a lower abstraction level.

- Basic Blocks are represented by the type `ssa.Block`.
- The SSA form is constructed using an `ssa.Builder`. The `ssa.Builder` is
  instantiated in the context of `wasm.Engine.CompileModule()`, more
  specifically in the method `frontend.Compiler.LowerToSSA()`.
- The mapping between Wasm opcodes and the IR happens in `frontend/lower.go`,
  more specifically in the method `frontend.Compiler.lowerCurrentOpcode()`.
- Because they are semantically equivalent, in the code, basic block parameters
  are sometimes referred to as "Phi values".

#### Instructions and Values

An `ssa.Instruction` is a single instruction in the SSA form. Each instruction
might consume zero or more `ssa.Value`s, and it usually produces a single
`ssa.Value`; some instructions may not produce any value (for instance, a
`Jump` instruction). An `ssa.Value` is an abstraction that represents a typed
name binding, and it is used to represent the result of an instruction, or the
input to an instruction.

For instance:

```
blk1: () <-- (blk0)
  v6:i32 = Iconst_32 0x0
  v7:i32 = Isub v6, v2
  Jump blk3, v7
```

`Iconst_32` takes no input value and produces value `v6`; `Isub` takes two
input values (`v6`, `v2`) and produces value `v7`; `Jump` takes one input value
(`v7`) and produces no value. All such values have the `i32` type. The wazero
SSA's type system (`ssa.Type`) allows the following types:

- `i32`: 32-bit integer
- `i64`: 64-bit integer
- `f32`: 32-bit floating point
- `f64`: 64-bit floating point
- `v128`: 128-bit SIMD vector

For simplicity, we don't have a dedicated type for pointers.
Instead, we use the `i64` type to represent pointer values, since we only
support 64-bit architectures; this differs from traditional compilers such as
LLVM, which have a dedicated pointer type.

Values and instructions are both allocated from pools to minimize memory
allocations.

### Debug Flags

- `wazevoapi.PrintSSA` dumps the SSA form to the console.
- `wazevoapi.FrontEndLoggingEnabled` dumps progress of the translation between
  Wasm opcodes and SSA instructions to the console.

## Optimization

The SSA form makes it easier to perform a number of optimizations. For
instance, we can perform constant propagation, dead code elimination, and
common subexpression elimination. These optimizations either act upon the
instructions within a basic block, or they act upon the control-flow graph as a
whole.

At a high level, consider the following basic block, derived from the previous
example:

```
blk0: (exec_ctx:i64, module_ctx:i64)
  v2:i32 = Iconst_32 -5
  v3:i32 = Iconst_32 0
  v4:i32 = Icmp lt_s, v2, v3
  Brz v4, blk2
  Jump blk1
```

It is pretty easy to see that the comparison in `v4` can be replaced by a
constant `1`, because the comparison is between two constant values (-5, 0).
Therefore, the block can be rewritten as such:

```
blk0: (exec_ctx:i64, module_ctx:i64)
  v4:i32 = Iconst_32 1
  Brz v4, blk2
  Jump blk1
```

However, we can now also see that the conditional branch is never taken (`v4`
is never zero), and that the block `blk2` is never executed, so even the branch
instruction and the constant definition `v4` can be removed:

```
blk0: (exec_ctx:i64, module_ctx:i64)
  Jump blk1
```

This is a simple example of constant propagation and dead code elimination
occurring within a basic block. However, now `blk2` is unreachable, because
there is no other edge in the graph that points to it; thus it can be removed
from the control-flow graph. This is an example of dead-code elimination that
occurs at the control-flow graph level.

In practice, because WebAssembly is a compilation target, these simple
optimizations are often unnecessary. The optimization passes implemented in
wazero are also work-in-progress and, at the time of writing, further work is
expected to implement more advanced optimizations.

### Code

Optimization passes are implemented by `ssa.Builder.RunPasses()`. An
optimization pass is just a function that takes an `ssa.Builder` as a
parameter.

Passes iterate over the basic blocks, and, for each basic block, they iterate
over the instructions. Each pass may mutate the basic block by modifying the
instructions it contains, or it might change the entire shape of the
control-flow graph (e.g. by removing blocks).

Currently, there are two dead-code elimination passes:

- `passDeadBlockEliminationOpt` acting at the block level.
- `passDeadCodeEliminationOpt` acting at the instruction level.

Notably, `passDeadCodeEliminationOpt` also assigns an `InstructionGroupID` to
each instruction. This is used to determine whether a sequence of instructions
can be replaced by a single machine instruction during the back-end phase. For
more details, see also the relevant documentation in `ssa/instructions.go`.

There are also simple constant folding passes such as `passNopInstElimination`,
which folds and deletes instructions that are essentially no-ops (e.g. shifting
by a 0 amount).

### Debug Flags

`wazevoapi.PrintOptimizedSSA` dumps the SSA form to the console after
optimization.
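
Before moving on to block layout, it may help to make the shape of such a pass
concrete. The following self-contained sketch performs the constant folding
described above over a toy instruction representation; the types are invented
for illustration and are deliberately much simpler than the actual `ssa`
package:

```go
package main

import "fmt"

// A toy SSA-like instruction: an opcode, the value IDs of its operands, and a
// constant payload used when the opcode is "Iconst".
type inst struct {
	op    string
	args  []int
	konst int32
}

// foldConstants rewrites every "Iadd" whose operands are both constants into a
// single "Iconst", mirroring the in-block constant propagation shown above.
// defs maps the value ID produced by an instruction to that instruction.
func foldConstants(defs map[int]*inst) {
	for _, ins := range defs {
		if ins.op != "Iadd" {
			continue
		}
		a, b := defs[ins.args[0]], defs[ins.args[1]]
		if a != nil && b != nil && a.op == "Iconst" && b.op == "Iconst" {
			*ins = inst{op: "Iconst", konst: a.konst + b.konst}
		}
	}
}

func main() {
	// v2 = Iconst 40; v3 = Iconst 2; v4 = Iadd v2, v3
	defs := map[int]*inst{
		2: {op: "Iconst", konst: 40},
		3: {op: "Iconst", konst: 2},
		4: {op: "Iadd", args: []int{2, 3}},
	}
	foldConstants(defs)
	fmt.Printf("v4 = %s %d\n", defs[4].op, defs[4].konst) // v4 = Iconst 42
}
```

After such a rewrite, the now-unused `Iconst` definitions are exactly the kind
of leftovers that a dead-code elimination pass like `passDeadCodeEliminationOpt`
would then remove.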
+ + +## Block Layout + +As we have seen earlier, the SSA form instructions are contained within basic +blocks, and the basic blocks are connected by edges of the control-flow graph. +However, machine code is not laid out in a graph, but it is just a linear +sequence of instructions. + +Thus, the last step of the front-end is to lay out the basic blocks in a linear +sequence. Because each basic block, by design, ends with a control-flow +instruction, one of the goals of the block layout phase is to maximize the number of +**fall-through opportunities**. A fall-through opportunity occurs when a block ends +with a jump instruction whose target is exactly the next block in the +sequence. In order to maximize the number of fall-through opportunities, the +block layout phase might reorder the basic blocks in the control-flow graph, +and transform the control-flow instructions. For instance, it might _invert_ +some branching conditions. + +The end goal is to effectively minimize the number of jumps and branches in +the machine code that will be generated later. + + +### Critical Edges + +Special attention must be taken when a basic block has multiple predecessors, +i.e., when it has multiple incoming edges. In particular, an edge between two +basic blocks is called a **critical edge** when, at the same time: +- the predecessor has multiple successors **and** +- the successor has multiple predecessors. + +For instance, in the example below the edge between `BB0` and `BB3` +is a critical edge. + +```goat { width="300" } +┌───────┐ ┌───────┐ +│ BB0 │━┓ │ BB1 │ +└───────┘ ┃ └───────┘ + │ ┃ │ + ▼ ┃ ▼ +┌───────┐ ┃ ┌───────┐ +│ BB2 │ ┗━▶│ BB3 │ +└───────┘ └───────┘ +``` + +In these cases the critical edge is split by introducing a new basic block, +called a **trampoline**, where the critical edge was. + +```goat { width="300" } +┌───────┐ ┌───────┐ +│ BB0 │──────┐ │ BB1 │ +└───────┘ ▼ └───────┘ + │ ┌──────────┐ │ + │ │trampoline│ │ + ▼ └──────────┘ ▼ +┌───────┐ │ ┌───────┐ +│ BB2 │ └────▶│ BB3 │ +└───────┘ └───────┘ +``` + +For more details on critical edges read more at + +- https://en.wikipedia.org/wiki/Control-flow_graph +- https://nickdesaulniers.github.io/blog/2023/01/27/critical-edge-splitting/ + +### Example + +At the end of the block layout phase, the laid out SSA for the `abs` function +looks as follows: + +``` +blk0: (exec_ctx:i64, module_ctx:i64, v2:i32) + v3:i32 = Iconst_32 0x0 + v4:i32 = Icmp lt_s, v2, v3 + Brz v4, blk2 + Jump fallthrough + +blk1: () <-- (blk0) + v6:i32 = Iconst_32 0x0 + v7:i32 = Isub v6, v2 + Jump blk3, v7 + +blk2: () <-- (blk0) + Jump fallthrough, v2 + +blk3: (v5:i32) <-- (blk1,blk2) + Jump blk_ret, v5 +``` + +### Code + +`passLayoutBlocks` implements the block layout phase. + +### Debug Flags + +- `wazevoapi.PrintBlockLaidOutSSA` dumps the SSA form to the console after block layout. +- `wazevoapi.SSALoggingEnabled` logs the transformations that are applied during this phase, + such as inverting branching conditions or splitting critical edges. + +
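
To make the critical-edge rule above concrete, the following self-contained
sketch finds the critical edges of a toy CFG given as a successor map. The
types are invented for illustration and are not the actual `ssa` package API:

```go
package main

import "fmt"

// criticalEdges returns every edge (from, to) whose source has more than one
// successor and whose destination has more than one predecessor: exactly the
// edges that the block layout phase splits by inserting a trampoline block.
func criticalEdges(succs map[string][]string) [][2]string {
	preds := map[string]int{}
	for _, ss := range succs {
		for _, s := range ss {
			preds[s]++
		}
	}
	var edges [][2]string
	for from, ss := range succs {
		if len(ss) < 2 {
			continue // the source must have multiple successors
		}
		for _, to := range ss {
			if preds[to] >= 2 { // ...and the destination multiple predecessors
				edges = append(edges, [2]string{from, to})
			}
		}
	}
	return edges
}

func main() {
	// The CFG from the diagrams above: BB0 -> {BB2, BB3}, BB1 -> {BB3}.
	cfg := map[string][]string{
		"BB0": {"BB2", "BB3"},
		"BB1": {"BB3"},
	}
	fmt.Println(criticalEdges(cfg)) // [[BB0 BB3]]
}
```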
+ +* Previous Section: [How the Optimizing Compiler Works](../) +* Next Section: [Back-End](../backend/) + +[ssa-blocks]: https://en.wikipedia.org/wiki/Static_single-assignment_form#Block_arguments +[llvm-mlir]: https://mlir.llvm.org/docs/Rationale/Rationale/#block-arguments-vs-phi-nodes