From b7b54d596796c813a334a69478ffcd4240165aaa Mon Sep 17 00:00:00 2001 From: Edoardo Vacchi Date: Sat, 9 Mar 2024 08:39:11 +0800 Subject: [PATCH] wazevo(docs): optimizing compiler (#2065) Signed-off-by: Edoardo Vacchi --- site/content/docs/_index.md | 3 +- .../_index.md | 131 +++++ .../appendix.md | 185 +++++++ .../backend.md | 507 ++++++++++++++++++ .../frontend.md | 371 +++++++++++++ 5 files changed, 1196 insertions(+), 1 deletion(-) create mode 100644 site/content/docs/how_the_optimizing_compiler_works/_index.md create mode 100644 site/content/docs/how_the_optimizing_compiler_works/appendix.md create mode 100644 site/content/docs/how_the_optimizing_compiler_works/backend.md create mode 100644 site/content/docs/how_the_optimizing_compiler_works/frontend.md diff --git a/site/content/docs/_index.md b/site/content/docs/_index.md index e04a20d7bc..e00d8e3681 100644 --- a/site/content/docs/_index.md +++ b/site/content/docs/_index.md @@ -143,7 +143,8 @@ Notably, the interpreter and compiler in wazero's [Runtime configuration][Runtim In wazero, a compiler is a runtime configured to compile modules to platform-specific machine code ahead of time (AOT) during the creation of [CompiledModule][CompiledModule]. This means your WebAssembly functions execute natively at runtime of the embedding Go program. Compiler is faster than Interpreter, often by order of -magnitude (10x) or more, and therefore enabled by default whenever available. +magnitude (10x) or more, and therefore enabled by default whenever available. You can read more about wazero's +[optimizing compiler in the detailed documentation]({{< relref "/how_the_optimizing_compiler_works" >}}). #### Interpreter diff --git a/site/content/docs/how_the_optimizing_compiler_works/_index.md b/site/content/docs/how_the_optimizing_compiler_works/_index.md new file mode 100644 index 0000000000..9ba1e7df4d --- /dev/null +++ b/site/content/docs/how_the_optimizing_compiler_works/_index.md @@ -0,0 +1,131 @@ ++++ +title = "How the Optimizing Compiler Works" +layout = "single" ++++ + +wazero supports two modes of execution: interpreter mode and compilation mode. +The interpreter mode is a fallback mode for platforms where compilation is not +supported. Compilation mode is otherwise the default mode of execution: it +translates Wasm modules to native code to get the best run-time performance. + +Translating Wasm bytecode into machine code can take multiple forms. wazero +1.0 performs a straightforward translation from a given instruction to a native +instruction. wazero 2.0 introduces an optimizing compiler that is able to +perform nontrivial optimizing transformations, such as constant folding or +dead-code elimination, and it makes better use of the underlying hardware, such +as CPU registers. This document digs deeper into what we mean when we say +"optimizing compiler", and explains how it is implemented in wazero. + +This document is intended for maintainers, researchers, developers and in +general anyone interested in understanding the internals of wazero. + +What is an Optimizing Compiler? +------------------------------- + +Wazero supports an _optimizing_ compiler in the style of other optimizing +compilers such as LLVM's or V8's. Traditionally an optimizing +compiler performs compilation in a number of steps. 

Compare this to the **old compiler**, where compilation happens in one step or
two, depending on how you count:

```goat
      Input       +--------------+      +---------------+
 Wasm Binary ---->| DecodeModule |----->| CompileModule |----> wazero IR
                  +--------------+      +---------------+
```

That is, the module is (1) validated then (2) translated to an Intermediate
Representation (IR). The wazero IR can then be executed directly (in the case
of the interpreter) or it can be further processed and translated into native
code by the compiler. This compiler performs a straightforward translation from
the IR to native code, without any further passes. The wazero IR is not
intended for further processing beyond immediate execution or straightforward
translation.

```goat
        +---- wazero IR ----+
        |                   |
        v                   v
 +--------------+    +--------------+
 |   Compiler   |    | Interpreter  |- - - executable
 +--------------+    +--------------+
        |
   +----+-----+
   |          |
   v          v
+---------+  +---------+
|  ARM64  |  |  AMD64  |
| Backend |  | Backend | - - - - - - - - - executable
+---------+  +---------+
```

Validation and translation to an IR in a compiler are usually called the
**front-end** part of a compiler, while code-generation occurs in what we call
the **back-end** of a compiler. The front-end is the part of a compiler that is
closer to the input, and it generally involves machine-independent processing,
such as parsing and static validation. The back-end is the part of a compiler
that is closer to the output, and it generally includes machine-specific
procedures, such as code-generation.

In the **optimizing** compiler, we still decode and translate Wasm binaries to
an intermediate representation in the front-end, but we use a textbook
representation called an **SSA** or "Static Single-Assignment Form", that is
intended for further transformation.

The benefit of choosing an IR that is meant for transformation is that a lot of
optimization passes can apply directly to the IR, and thus be
machine-independent. Then the back-end can be relatively simpler, in that it
will only have to deal with machine-specific concerns.

The wazero optimizing compiler implements the following compilation passes:

* Front-End:
  - Translation to SSA
  - Optimization
  - Block Layout
  - Control Flow Analysis

* Back-End:
  - Instruction Selection
  - Register Allocation
  - Finalization and Encoding

```goat
      Input       +--------------+      +---------------+
 Wasm Binary ---->| DecodeModule |----->| CompileModule |--+
                  +--------------+      +---------------+  |
  +---------------------------------------------------------+
  |
  |    +---------------+             +---------------+
  +--->|   Front-End   |------------>|    Back-End   |
       +---------------+             +---------------+
               |                             |
               v                             v
              SSA                  Instruction Selection
               |                             |
               v                             v
         Optimization               Register Allocation
               |                             |
               v                             v
         Block Layout              Finalization/Encoding
```

Like the other engines, the implementation can be found under `engine`,
specifically in the `wazevo` sub-package. The entry-point is
`internal/engine/wazevo/engine.go`, where the interface `wasm.Engine` is
implemented.

All the passes can be dumped to the console for debugging, by enabling the
build-time flags under `internal/engine/wazevo/wazevoapi/debug_options.go`.
The flags are disabled by default and should only be enabled during debugging.
These may also change in the future.
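
From an embedder's point of view, none of these passes needs to be configured
explicitly: the optimizing compiler is selected through the regular runtime
configuration, and it is already the default on supported platforms. The
following minimal sketch uses only the public wazero API; the `module.wasm`
file and the surrounding embedding code are placeholders:

```go
package main

import (
	"context"
	"log"
	"os"

	"github.com/tetratelabs/wazero"
)

func main() {
	ctx := context.Background()

	// Explicitly select the compiler (the default whenever supported);
	// wazero.NewRuntimeConfigInterpreter() would select the fallback instead.
	r := wazero.NewRuntimeWithConfig(ctx, wazero.NewRuntimeConfigCompiler())
	defer r.Close(ctx)

	wasm, err := os.ReadFile("module.wasm") // placeholder module
	if err != nil {
		log.Fatal(err)
	}

	// CompileModule runs the front-end/back-end pipeline described above
	// ahead of time, producing native code for the module.
	compiled, err := r.CompileModule(ctx, wasm)
	if err != nil {
		log.Fatal(err)
	}
	defer compiled.Close(ctx)
}
```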

In the following we will assume all paths to be relative to the
`internal/engine/wazevo` directory, so we will omit the prefix.

## Index

- [Front-End](frontend/)
- [Back-End](backend/)
- [Appendix](appendix/)
diff --git a/site/content/docs/how_the_optimizing_compiler_works/appendix.md b/site/content/docs/how_the_optimizing_compiler_works/appendix.md
new file mode 100644
index 0000000000..c66115c2a2
--- /dev/null
+++ b/site/content/docs/how_the_optimizing_compiler_works/appendix.md
@@ -0,0 +1,185 @@
+++
title = "Appendix: Trampolines"
layout = "single"
+++

Trampolines are used to interface between the Go runtime and the generated
code, in two cases:

- when we need to **enter the generated code** from the Go runtime.
- when we need to **leave the generated code** to invoke a host function
  (written in Go).

In this section we want to complete the picture of how a Wasm function gets
translated from Wasm to executable code in the optimizing compiler, by
describing how to jump into the execution of the generated code at run-time.

## Entering the Generated Code

At run-time, user space invokes a Wasm function through the public
`api.Function` interface, using methods `Call()` or `CallWithStack()`. The
implementation of these methods, in turn, eventually invokes an ASM
**trampoline**. The signature of this trampoline in Go code is:

```go
func entrypoint(
	preambleExecutable, functionExecutable *byte,
	executionContextPtr uintptr, moduleContextPtr *byte,
	paramResultStackPtr *uint64,
	goAllocatedStackSlicePtr uintptr)
```

- `preambleExecutable` is a pointer to the generated code for the preamble (see
  below).
- `functionExecutable` is a pointer to the generated code for the function (as
  described in the previous sections).
- `executionContextPtr` is a raw pointer to the `wazevo.executionContext`
  struct. This struct is used to save the state of the Go runtime before
  entering or leaving the generated code. It also holds shared state between
  the Go runtime and the generated code, such as the exit code that is used to
  terminate execution on failure, or to suspend it to invoke host functions.
- `moduleContextPtr` is a pointer to the `wazevo.moduleContextOpaque` struct.
  Its contents are basically pointers to the module instance, module-specific
  objects, and functions. This is sometimes called "VMContext" in other Wasm
  runtimes.
- `paramResultStackPtr` is a pointer to the slice where the arguments and
  results of the function are passed.
- `goAllocatedStackSlicePtr` is an aligned pointer to the Go-allocated stack
  for holding values and call frames. For further details refer to
  [Backend § Prologue and Epilogue](../backend/#prologue-and-epilogue).

The trampoline can be found in `backend/isa/<arch>/abi_entry_<arch>.s`.

For each given architecture, the trampoline:

- moves the arguments to specific registers, to match the behavior expected by
  the entry preamble, and
- finally, jumps into the execution of the generated code for the preamble.

The **preamble** that the `entrypoint` function jumps to is generated per
function signature.

This is implemented in `machine.CompileEntryPreamble(*ssa.Signature)`.

The preamble sets the fields in the `wazevo.executionContext`.

At the beginning of the preamble:

- Set a register to point to the `*wazevo.executionContext` struct.
- Save the stack pointers, frame pointers, return addresses, etc. to that
  struct.
- Update the stack pointer to point to `paramResultStackPtr`.
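
Before looking at how the preamble consumes these values, it may help to see
where `paramResultStackPtr` comes from on the user side. The sketch below is
illustrative only: it assumes `fn` is an `api.Function` (from
`github.com/tetratelabs/wazero/api`) for the `abs` example used throughout this
document, with signature `(param i32) (result i32)`:

```go
// callAbs shows how arguments and results share a single uint64 slice when
// calling through the public API; paramResultStackPtr ends up pointing at
// this slice when the trampoline is entered.
func callAbs(ctx context.Context, fn api.Function) (int32, error) {
	stack := make([]uint64, 1)   // len = max(number of params, number of results)
	stack[0] = api.EncodeI32(-5) // write argument 0
	if err := fn.CallWithStack(ctx, stack); err != nil {
		return 0, err
	}
	// Result 0 overwrites the slot that held argument 0.
	return api.DecodeI32(stack[0]), nil
}
```

The preamble and the generated function body then pick the arguments up from
this slice, and write the results back to it, as described next.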
+ +The generated code works in concert with the assumption that the preamble has +been entered through the aforementioned trampoline. Thus, it assumes that the +arguments can be found in some specific registers. + +The preamble then assigns the arguments pointed at by `paramResultStackPtr` to +the registers and stack location that the generated code expects. + +Finally, it invokes the generated code for the function. + +The epilogue reverses part of the process, finally returning control to the +caller of the `entrypoint()` function, and the Go runtime. The caller of +`entrypoint()` is also responsible for completing the cleaning up procedure by +invoking `afterGoFunctionCallEntrypoint()` (again, implemented in +backend-specific ASM). which will restore the stack pointers and return +control to the caller of the function. + +The arch-specific code can be found in +`backend/isa//abi_entry_preamble.go`. + +[wazero-engine-stack]: https://github.com/tetratelabs/wazero/blob/095b49f74a5e36ce401b899a0c16de4eeb46c054/internal/engine/compiler/engine.go#L77-L132 +[abi-arm64]: https://tip.golang.org/src/cmd/compile/abi-internal#arm64-architecture +[abi-amd64]: https://tip.golang.org/src/cmd/compile/abi-internal#amd64-architecture +[abi-cc]: https://tip.golang.org/src/cmd/compile/abi-internal#function-call-argument-and-result-passing + + +## Leaving the Generated Code + +In "[How do compiler functions work?][how-do-compiler-functions-work]", we +already outlined how _leaving_ the generated code works with the help of a +function. We will complete here the picture by briefly describing the code that +is generated. + +When the generated code needs to return control to the Go runtime, it inserts a +meta-instruction that is called `exitSequence` in both `amd64` and `arm64` +backends. This meta-instruction sets the `exitCode` in the +`wazevo.executionContext` struct, restore the stack pointers and then returns +control to the caller of the `entrypoint()` function described above. + +As described in "[How do compiler functions +work?][how-do-compiler-functions-work]", the mechanism is essentially the same +when invoking a host function or raising an error. However, when a function is +invoked the `exitCode` also indicates the identifier of the host function to be +invoked. + +The magic really happens in the `backend.Machine.CompileGoFunctionTrampoline()` +method. This method is actually invoked when host modules are being +instantiated. It generates a trampoline that is used to invoke such functions +from the generated code. + +This trampoline implements essentially the same prologue as the `entrypoint()`, +but it also reserves space for the arguments and results of the function to be +invoked. + +A host function has the signature: + +``` +func(ctx context.Context, stack []uint64) +``` + +the function arguments in the `stack` parameter are copied over to the reserved +slots of the real stack. For instance, on `arm64` the stack layout would look +as follows (on `amd64` it would be similar): + +```goat + (high address) + SP ------> +-----------------+ <----+ + | ....... | | + | ret Y | | + | ....... | | + | ret 0 | | + | arg X | | size_of_arg_ret + | ....... | | + | arg 1 | | + | arg 0 | <----+ <-------- originalArg0Reg + | size_of_arg_ret | + | ReturnAddress | + +-----------------+ <----+ + | xxxx | | ;; might be padded to make it 16-byte aligned. + +--->| arg[N]/ret[M] | | + sliceSize| | ............ 
| | goCallStackSize + | | arg[1]/ret[1] | | + +--->| arg[0]/ret[0] | <----+ <-------- arg0ret0AddrReg + | sliceSize | + | frame_size | + +-----------------+ + (low address) +``` + +Finally, the trampoline jumps into the execution of the host function using the +`exitSequence` meta-instruction. + +Upon return, the process is reversed. + +## Code + +- The trampoline to enter the generated function is implemented by the + `backend.Machine.CompileEntryPreamble()` method. +- The trampoline to return traps and invoke host functions is generated by + `backend.Machine.CompileGoFunctionTrampoline()` method. + +You can find arch-specific implementations in +`backend/isa//abi_go_call.go`, +`backend/isa//abi_entry_preamble.go`, etc. The trampolines are found +under `backend/isa//abi_entry_.s`. + +## Further References + +- Go's [internal ABI documentation][abi-internal] details the calling convention similar to the one we use in both arm64 and amd64 backend. +- Raphael Poss's [The Go low-level calling convention on + x86-64][go-call-conv-x86] is also an excellent reference for `amd64`. + +[abi-internal]: https://tip.golang.org/src/cmd/compile/abi-internal +[go-call-conv-x86]: https://dr-knz.net/go-calling-convention-x86-64.html +[proposal-register-cc]: https://go.googlesource.com/proposal/+/master/design/40724-register-calling.md#background +[how-do-compiler-functions-work]: ../../how_do_compiler_functions_work/ + diff --git a/site/content/docs/how_the_optimizing_compiler_works/backend.md b/site/content/docs/how_the_optimizing_compiler_works/backend.md new file mode 100644 index 0000000000..76a8786551 --- /dev/null +++ b/site/content/docs/how_the_optimizing_compiler_works/backend.md @@ -0,0 +1,507 @@ ++++ +title = "How the Optimizing Compiler Works: Back-End" +layout = "single" ++++ + +In this section we will discuss the phases in the back-end of the optimizing +compiler: + +- [Instruction Selection](#instruction-selection) +- [Register Allocation](#register-allocation) +- [Finalization and Encoding](#finalization-and-encoding) + +Each section will include a brief explanation of the phase, references to the +code that implements the phase, and a description of the debug flags that can +be used to inspect that phase. Please notice that, since the implementation of +the back-end is architecture-specific, the code might be different for each +architecture. + +### Code + +The higher-level entry-point to the back-end is the +`backend.Compiler.Compile(context.Context)` method. This method executes, in +turn, the following methods in the same type: + +- `backend.Compiler.Lower()` (instruction selection) +- `backend.Compiler.RegAlloc()` (register allocation) +- `backend.Compiler.Finalize(context.Context)` (finalization and encoding) + +## Instruction Selection + +The instruction selection phase is responsible for mapping the higher-level SSA +instructions to arch-specific instructions. Each SSA instruction is translated +to one or more machine instructions. + +Each target architecture comes with a different number of registers, some of +them are general purpose, others might be specific to certain instructions. In +general, we can expect to have a set of registers for integer computations, +another set for floating point computations, a set for vector (SIMD) +computations, and some specific special-purpose registers (e.g. stack pointers, +program counters, status flags, etc.) 

In addition, some registers might be reserved by the Go runtime or the
Operating System for specific purposes, so they should be handled with special
care.

At this point in the compilation process we do not want to deal with all that.
Instead, we assume that we have a potentially infinite number of *virtual
registers* of each type at our disposal. The next phase, the register
allocation phase, will map these virtual registers to the actual registers of
the target architecture.

### Operands and Addressing Modes

As a rule of thumb, we want to map each `ssa.Value` to a virtual register, and
then use that virtual register as one of the arguments of the machine
instruction that we will generate. However, usually instructions are able to
address more than just registers: an *operand* might be able to represent a
memory address, or an immediate value (i.e. a constant value that is encoded as
part of the instruction itself).

For these reasons, instead of mapping each `ssa.Value` to a virtual register
(`regalloc.VReg`), we map each `ssa.Value` to an architecture-specific
`operand` type.

During lowering of an `ssa.Instruction`, for each `ssa.Value` that is used as
an argument of the instruction, in the simplest case, the `operand` might be
mapped to a virtual register; in other cases, the `operand` might be mapped to
a memory address, or an immediate value. Sometimes this makes it possible to
replace several SSA instructions with a single machine instruction, by folding
the addressing mode into the instruction itself.

For instance, consider the following SSA instructions:

```
    v4:i32 = Const 0x9
    v6:i32 = Load v5, 0x4
    v7:i32 = Iadd v6, v4
```

In the `amd64` architecture, in AT&T syntax, the `add` instruction adds the
first operand to the second operand, and assigns the result to the second
operand. So assuming that `v4`, `v5`, `v6`, and `v7` are mapped respectively to
the virtual registers `r4?`, `r5?`, `r6?`, and `r7?`, the lowering of the
`Iadd` instruction on `amd64` might look like this:

```asm
    ;; AT&T syntax
    add 4(%r5?), %r4?  ;; add the value at memory address [`r5?` + 4] to `r4?`
    mov %r4?, %r7?     ;; move the result from `r4?` to `r7?`
```

Notice how the load from memory has been folded into an operand of the `add`
instruction. This transformation is possible when the value produced by the
instruction being folded is not referenced by other instructions and the
instructions belong to the same `InstructionGroupID` (see [Front-End:
Optimization](../frontend/#optimization)).

### Example

At the end of the instruction selection phase, the basic blocks of our `abs`
function will look as follows (for `arm64`):

```asm
L1 (SSA Block: blk0):
	mov x130?, x2
	subs wzr, w130?, #0x0
	b.ge L2
L3 (SSA Block: blk1):
	mov x136?, xzr
	sub w134?, w136?, w130?
	mov x135?, x134?
	b L4
L2 (SSA Block: blk2):
	mov x135?, x130?
L4 (SSA Block: blk3):
	mov x0, x135?
	ret
```

Notice the introduction of the new identifiers `L1`, `L3`, `L2`, and `L4`.
These are labels that are used to mark the beginning of each basic block, and
they are the target for branching instructions such as `b` and `b.ge`.

### Code

`backend.Machine` is the interface to the backend. It has methods to translate
(lower) the IR to machine code. Again, as seen earlier in the front-end, the
term *lowering* is used to indicate translation from a higher-level
representation to a lower-level representation.

`backend.Machine.LowerInstr(*ssa.Instruction)` is the method that translates an
SSA instruction to machine code. Machine-specific implementations of this
method can be found in package `backend/isa/<arch>` where `<arch>` is either
`amd64` or `arm64`.

### Debug Flags

`wazevoapi.PrintSSAToBackendIRLowering` prints the basic blocks with the
lowered arch-specific instructions.

## Register Allocation

The register allocation phase is responsible for mapping the potentially
infinite number of virtual registers to the real registers of the target
architecture. Because the number of real registers is limited, the register
allocation phase might need to "spill" some of the virtual registers to memory;
that is, it might store their content, and then load them back into a register
when they are needed.

For a given function `f` the register allocation procedure
`regalloc.Allocator.DoAllocation(f)` is implemented in sub-phases:

- `livenessAnalysis(f)` collects the "liveness" information for each virtual
  register. The algorithm is described in [Chapter 9.2 of The SSA
  Book][ssa-book].

- `alloc(f)` allocates registers for the given function. The algorithm is
  derived from [the Go compiler's allocator][go-regalloc].

At the end of the allocation procedure, we also record the set of registers
that are **clobbered** by the body of the function. A register is clobbered if
the function body overwrites it even though the calling convention requires its
value to be preserved across the call. This information is used in the
finalization phase to determine which registers need to be saved in the
prologue and restored in the epilogue. Strictly speaking, this last step does
not belong to register allocation in a textbook meaning, but it is a necessary
step for the finalization phase.

### Liveness Analysis

Intuitively, a variable or name binding can be considered _live_ at a certain
point in a program, if its value will be used in the future.

For instance:

```
1| int f(int x) {
2|   int y = 2 + x;
3|   int z = x + y;
4|   return z;
5| }
```

Variables `x` and `y` are both live at line 3, because they are used in the
expression `x + y` on line 3; variable `z` is live at line 4, because it is
used in the return statement. However, variables `x` and `y` can be considered
_not_ live at line 4 because they are not used anywhere after line 3.

Statically, _liveness_ can be approximated by following paths backwards on the
control-flow graph, connecting the uses of a given variable to its definitions
(or its *unique* definition, assuming SSA form).

In practice, while liveness is a property of each name binding at any point in
the program, it is enough to keep track of liveness at the boundaries of basic
blocks:

- the _live-in_ set for a given basic block is the set of all bindings that are
  live at the entry of that block.
- the _live-out_ set for a given basic block is the set of all bindings that
  are live at the exit of that block. A binding is live at the exit of a block
  if it is live at the entry of a successor.

Because the CFG is a connected graph, it is enough to keep track of either
live-in or live-out sets, and then propagate the liveness information backward
or forward, respectively. In our case, we keep track of live-in sets per block;
live-outs are derived from live-ins of the successor blocks when a block is
allocated.

### Allocation

We implemented a variant of the linear scan register allocation algorithm
described in [the Go compiler's allocator][go-regalloc].
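
To make the liveness computation above concrete, here is a small,
self-contained sketch of the classic backward data-flow formulation over a toy
CFG. The types and names are invented for illustration; they are not wazero's
actual `regalloc` data structures, and the toy program is not even in SSA form
(which does not matter for the data-flow equations):

```go
package main

import "fmt"

// A toy basic block: the registers it reads before writing them
// (upward-exposed uses), the ones it defines, and its successors in the CFG.
type block struct {
	uses, defs []string
	succs      []int
}

// liveness iterates the standard equations until a fixed point is reached:
//
//	liveOut[b] = union of liveIn[s] for every successor s of b
//	liveIn[b]  = uses[b] ∪ (liveOut[b] − defs[b])
func liveness(blocks []block) []map[string]bool {
	liveIn := make([]map[string]bool, len(blocks))
	liveOut := make([]map[string]bool, len(blocks))
	for i := range blocks {
		liveIn[i], liveOut[i] = map[string]bool{}, map[string]bool{}
	}
	for changed := true; changed; {
		changed = false
		for b := len(blocks) - 1; b >= 0; b-- { // walk the blocks backwards
			out := map[string]bool{}
			for _, s := range blocks[b].succs {
				for v := range liveIn[s] {
					out[v] = true
				}
			}
			in := map[string]bool{}
			for v := range out {
				in[v] = true
			}
			for _, v := range blocks[b].defs {
				delete(in, v)
			}
			for _, v := range blocks[b].uses {
				in[v] = true
			}
			// Live sets only grow, so a size change means another iteration.
			if len(in) != len(liveIn[b]) || len(out) != len(liveOut[b]) {
				changed = true
			}
			liveIn[b], liveOut[b] = in, out
		}
	}
	return liveIn
}

func main() {
	// A diamond CFG: b0 branches to b1/b2, both jump to b3.
	//   b0: x = ...; cond = ...   b1: y = x + 1   b2: y = x - 1   b3: return y
	blocks := []block{
		{defs: []string{"x", "cond"}, succs: []int{1, 2}},
		{uses: []string{"x"}, defs: []string{"y"}, succs: []int{3}},
		{uses: []string{"x"}, defs: []string{"y"}, succs: []int{3}},
		{uses: []string{"y"}},
	}
	fmt.Println(liveness(blocks)) // x is live-in to b1 and b2; y is live-in to b3
}
```

With live-in sets available per block, live-outs can be derived from the
successors as each block is processed, which is how the allocation procedure
proceeds, as described next.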
+ +Each basic block is allocated registers in a linear scan order, and the +allocation state is propagated from a given basic block to its successors. +Then, each block continues allocation from that initial state. + +#### Merge States + +Special care has to be taken when a block has multiple predecessors. We call +this *fixing merge states*: for instance, consider the following: + +```goat { width="30%" } + .---. .---. +| BB0 | | BB1 | + '-+-' '-+-' + +----+----+ + | + v + .---. + | BB2 | + '---' +``` + +if the live-out set of a given block `BB0` is different from the live-out set +of a given block `BB1` and both are predecessors of a block `BB2`, then we need +to adjust `BB0` and `BB1` to ensure consistency with `BB2`. In practice, +abstract values in `BB0` and `BB1` might be passed to `BB2` either via registers +or via stack; fixing merge states ensures that registers and stack are used +consistently to pass values across the involved states. + +#### Spilling + +If the register allocator cannot find a free register for a given virtual +(live) register, it needs to "spill" the value to the stack to get a free +register, *i.e.,* stash it temporarily to stack. When that virtual register is +reused later, we will have to insert instructions to reload the value into a +real register. + +While the procedure proceeds with allocation, the procedure also records all +the virtual registers that transition to the "spilled" state, and inserts the +reload instructions when those registers are reused later. + +The spill instructions are actually inserted at the end of the register +allocation, after all the allocations and the merge states have been fixed. At +this point, all the other potential sources of instability have been resolved, +and we know where all the reloads happen. + +We insert the spills in the block that is the lowest common ancestor of all the +blocks that reload the value. + +#### Clobbered Registers + +At the end of the allocation procedure, the `determineCalleeSavedRealRegs(f)` +method iterates over the set of the allocated registers and compares them +to a set of architecture-specific set `CalleeSavedRegisters`. If a register +has been allocated, and it is present in this set, the register is marked as +"clobbered", i.e., we now know that the register allocator will overwrite +that value. Thus, these values will have to be spilled in the prologue. + +#### References + +Register allocation is a complex problem, possibly the most complicated +part of the backend. The following references were used to implement the +algorithm: + +- https://web.stanford.edu/class/archive/cs/cs143/cs143.1128/lectures/17/Slides17.pdf +- https://en.wikipedia.org/wiki/Chaitin%27s_algorithm +- https://llvm.org/ProjectsWithLLVM/2004-Fall-CS426-LS.pdf +- https://pfalcon.github.io/ssabook/latest/book-full.pdf: Chapter 9. for liveness analysis. +- https://github.com/golang/go/blob/release-branch.go1.21/src/cmd/compile/internal/ssa/regalloc.go + +We suggest to refer to them to dive deeper in the topic. + +### Example + +At the end of the register allocation phase, the basic blocks of our `abs` +function look as follows (for `arm64`): + +```asm +L1 (SSA Block: blk0): + mov x2, x2 + subs wzr, w2, #0x0 + b.ge L2 +L3 (SSA Block: blk1): + mov x8, xzr + sub w8, w8, w2 + mov x8, x8 + b L4 +L2 (SSA Block: blk2): + mov x8, x2 +L4 (SSA Block: blk3): + mov x0, x8 + ret +``` + +Notice how the virtual registers have been all replaced by real registers, i.e. +no register identifier is suffixed with `?`. 
This example is quite simple, and +it does not require any spill. + +### Code + +The algorithm (`regalloc/regalloc.go`) can work on any ISA by implementing the +interfaces in `regalloc/api.go`. + +Essentially: + +- each architecture exposes iteration over basic blocks of a function + (`regalloc.Function` interface) +- each arch-specific basic block exposes iteration over instructions + (`regalloc.Block` interface) +- each arch-specific instruction exposes the set of registers it defines and + uses (`regalloc.Instr` interface) + +By defining these interfaces, the register allocation algorithm can assign real +registers to virtual registers without dealing specifically with the target +architecture. + +In practice, each interface is usually implemented by instantiating a common +generic struct that comes already with an implementation of all or most of the +required methods. For instance,`regalloc.Function`is implemented by +`backend.RegAllocFunction[*arm64.instruction, *arm64.machine]`. + +`backend/isa//abi.go` (where `` is either `arm64` or `amd64`) +contains the instantiation of the `regalloc.RegisterInfo` struct, which +declares, among others +- the set of registers that are available for allocation, excluding, for + instance, those that might be reserved by the runtime or the OS +(`AllocatableRegisters`) +- the registers that might be saved by the callee to the stack + (`CalleeSavedRegisters`) + +### Debug Flags + +- `wazevoapi.RegAllocLoggingEnabled` logs detailed logging of the register + allocation procedure. +- `wazevoapi.PrintRegisterAllocated` prints the basic blocks with the register + allocation result. + +## Finalization and Encoding + +At the end of the register allocation phase, we have enough information to +finally generate machine code (_encoding_). We are only missing the prologue +and epilogue of the function. + +### Prologue and Epilogue + +As usual, the **prologue** is executed before the main body of the function, +and the **epilogue** is executed at the return. The prologue is responsible for +setting up the stack frame, and the epilogue is responsible for cleaning up the +stack frame and returning control to the caller. + +Generally, this means, at the very least: +- saving the return address +- a base pointer to the stack; or, equivalently, the height of the stack at the + beginning of the function + +For instance, on `amd64`, `RBP` is the base pointer, `RSP` is the stack +pointer: + +```goat {width="100%" height="250"} + (high address) (high address) + RBP ----> +-----------------+ +-----------------+ + | `...` | | `...` | + | ret Y | | ret Y | + | `...` | | `...` | + | ret 0 | | ret 0 | + | arg X | | arg X | + | `...` | ====> | `...` | + | arg 1 | | arg 1 | + | arg 0 | | arg 0 | + | Return Addr | | Return Addr | + RSP ----> +-----------------+ | Caller_RBP | + (low address) +-----------------+ <----- RSP, RBP +``` + +While, on `arm64`, there is only a stack pointer `SP`: + + +```goat {width="100%" height="300"} + (high address) (high address) + SP ---> +-----------------+ +------------------+ <----+ + | `...` | | `...` | | + | ret Y | | ret Y | | + | `...` | | `...` | | + | ret 0 | | ret 0 | | + | arg X | | arg X | | size_of_arg_ret. 
+ | `...` | ====> | `...` | | + | arg 1 | | arg 1 | | + | arg 0 | | arg 0 | <----+ + +-----------------+ | size_of_arg_ret | + | return address | + +------------------+ <---- SP + (low address) (low address) +``` + +However, the prologue and epilogue might also be responsible for saving and +restoring the state of registers that might be overwritten by the function +("clobbered"); and, if spilling occurs, prologue and epilogue are also +responsible for reserving and releasing the space for the spilled values. + +For clarity, we make a distinction between the space reserved for the clobbered +registers and the space reserved for the spilled values: + +- Spill slots are used to temporarily store the values that needs spilling as + determined by the register allocator. This section must have a fix height, +but its contents will change over time, as registers are being spilled and +reloaded. +- Clobbered registers are, similarly, determined by the register allocator, but + they are stashed in the prologue and then restored in the epilogue. + +The procedure happens after the register allocation phase because at +this point we have collected enough information to know how much space we need +to reserve, and which registers are clobbered. + +Regardless of the architecture, after allocating this space, the stack will +look as follows: + +```goat {height="350"} + (high address) + +-----------------+ + | `...` | + | ret Y | + | `...` | + | ret 0 | + | arg X | + | `...` | + | arg 1 | + | arg 0 | + | (arch-specific) | + +-----------------+ + | clobbered M | + | ............ | + | clobbered 1 | + | clobbered 0 | + | spill slot N | + | ............ | + | spill slot 0 | + +-----------------+ + (low address) +``` + +Note: the prologue might also introduce a check of the stack bounds. If there +is no sufficient space to allocate the stack frame, the function will exit the +execution and will try to grow it from the Go runtime. + +The epilogue simply reverses the operations of the prologue. + +### Other Post-RegAlloc Logic + +The `backend.Machine.PostRegAlloc` method is invoked after the register +allocation procedure; while its main role is to define the prologue and +epilogue of the function, it also serves as a hook to perform other, +arch-specific duty, that has to happen after the register allocation phase. + +For instance, on `amd64`, the constraints for some instructions are hard to +express in a meaningful way for the register allocation procedure (for +instance, the `div` instruction implicitly use registers `rdx`, `rax`). +Instead, they are lowered with ad-hoc logic as part of the implementation +`backend.Machine.PostRegAlloc` method. + +### Encoding + +The final stage of the backend encodes the machine instructions into bytes and +writes them to the target buffer. Before proceeding with the encoding, relative +addresses in branching instructions or addressing modes are resolved. + +The procedure encodes the instructions in the order they appear in the +function. + +### Code + +- The prologue and epilogue are set up as part of the + `backend.Machine.PostRegAlloc` method. +- The encoding is done by the `backend.Machine.Encode` method. + +### Debug Flags + +- `wazevoapi.PrintFinalizedMachineCode` prints the assembly code of the + function after the finalization phase. +- `wazevoapi.printMachineCodeHexPerFunctionUnmodified` prints a hex + representation of the function generated code as it is. 
- `wazevoapi.PrintMachineCodeHexPerFunctionDisassemblable` prints a hex
  representation of the generated code for the function in a form that can be
  disassembled.

The reason for the distinction between the last two flags is that the generated
code in some cases might not be disassemblable. The
`PrintMachineCodeHexPerFunctionDisassemblable` flag prints a hex encoding of
the generated code that can be disassembled, but cannot be executed.
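
To make the address-resolution step of the encoding phase concrete, here is a
toy, self-contained sketch of how branch targets can be turned into relative
displacements before the bytes are emitted. The one-byte pseudo opcode and the
4-byte displacement are invented for illustration and bear no relation to the
real `arm64`/`amd64` encoders:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// A toy machine instruction: a fixed encoded size, an optional label attached
// to it, and an optional branch target that still needs to be resolved.
type instr struct {
	size   int    // encoded size in bytes
	label  string // non-empty if this instruction starts a labeled block
	branch string // non-empty if this is a branch to that label
}

// encode lays the instructions out linearly, records the byte offset of every
// label, and then emits the bytes, rewriting each branch into a displacement
// relative to the end of the branch instruction itself.
func encode(prog []instr) []byte {
	offsets := map[string]int{}
	pc := 0
	for _, i := range prog {
		if i.label != "" {
			offsets[i.label] = pc
		}
		pc += i.size
	}
	var out []byte
	pc = 0
	for _, i := range prog {
		if i.branch != "" {
			disp := int32(offsets[i.branch] - (pc + i.size))
			out = append(out, 0xE9) // pseudo "branch" opcode
			out = binary.LittleEndian.AppendUint32(out, uint32(disp))
		} else {
			out = append(out, make([]byte, i.size)...) // placeholder body
		}
		pc += i.size
	}
	return out
}

func main() {
	prog := []instr{
		{size: 4},               // some instruction
		{size: 5, branch: "L2"}, // forward branch to L2
		{size: 4, label: "L1"},  // L1: some instruction
		{size: 4, label: "L2"},  // L2: some instruction
	}
	fmt.Printf("% x\n", encode(prog)) // the branch resolves to a displacement of 4
}
```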
+ +* Previous Section: [Front-End](../frontend/) +* Next Section: [Appendix: Trampolines](../appendix/) + +[ssa-book]: https://pfalcon.github.io/ssabook/latest/book-full.pdf +[go-regalloc]: https://github.com/golang/go/blob/release-branch.go1.21/src/cmd/compile/internal/ssa/regalloc.go diff --git a/site/content/docs/how_the_optimizing_compiler_works/frontend.md b/site/content/docs/how_the_optimizing_compiler_works/frontend.md new file mode 100644 index 0000000000..f64e04d661 --- /dev/null +++ b/site/content/docs/how_the_optimizing_compiler_works/frontend.md @@ -0,0 +1,371 @@ ++++ +title = "How the Optimizing Compiler Works: Front-End" +layout = "single" ++++ + +In this section we will discuss the phases in the front-end of the optimizing compiler: + +- [Translation to SSA](#translation-to-ssa) +- [Optimization](#optimization) +- [Block Layout](#block-layout) + +Every section includes an explanation of the phase; the subsection **Code** +will include high-level pointers to functions and packages; the subsection **Debug Flags** +indicates the flags that can be used to enable advanced logging of the phase. + +## Translation to SSA + +We mentioned earlier that wazero uses an internal representation called an "SSA" +form or "Static Single-Assignment" form, but we never explained what that is. + +In short terms, every program, or, in our case, every Wasm function, can be +translated in a control-flow graph. The control-flow graph is a directed graph where +each node is a sequence of statements that do not contain a control flow instruction, +called a **basic block**. Instead, control-flow instructions are translated into edges. + +For instance, take the following implementation of the `abs` function: + +```wasm +(module + (func (;0;) (param i32) (result i32) + (if (result i32) (i32.lt_s (local.get 0) (i32.const 0)) + (then + (i32.sub (i32.const 0) (local.get 0))) + (else + (local.get 0)) + ) + ) + (export "f" (func 0)) +) +``` + +This is translated to the following block diagram: + +```goat {width="100%" height="500"} + +---------------------------------------------+ + |blk0: (exec_ctx:i64, module_ctx:i64, v2:i32) | + | v3:i32 = Iconst_32 0x0 | + | v4:i32 = Icmp lt_s, v2, v3 | + | Brz v4, blk2 | + | Jump blk1 | + +---------------------------------------------+ + | + | + +---`(v4 != 0)`-+-`(v4 == 0)`---+ + | | + v v + +---------------------------+ +---------------------------+ + |blk1: () <-- (blk0) | |blk2: () <-- (blk0) | + | v6:i32 = Iconst_32 0x0 | | Jump blk3, v2 | + | v7:i32 = Isub v6, v2 | | | + | Jump blk3, v7 | | | + +---------------------------+ +---------------------------+ + | | + | | + +-`{v5 := v7}`--+--`{v5 := v2}`-+ + | + v + +------------------------------+ + |blk3: (v5:i32) <-- (blk1,blk2)| + | Jump blk_ret, v5 | + +------------------------------+ + | + {return v5} + | + v +``` + +We use the ["block argument" variant of SSA][ssa-blocks], which is also the same +representation [used in LLVM's MLIR][llvm-mlir]. In this variant, each block +takes a list of arguments. Each block ends with a branching instruction (Branch, Return, +Jump, etc...) with an optional list of arguments; these arguments are assigned +to the target block's arguments like a function. + +Consider the first block `blk0`. + +``` +blk0: (exec_ctx:i64, module_ctx:i64, v2:i32) + v3:i32 = Iconst_32 0x0 + v4:i32 = Icmp lt_s, v2, v3 + Brz v4, blk2 + Jump blk1 +``` + +You will notice that, compared to the original function, it takes two extra +parameters (`exec_ctx` and `module_ctx`): + +1. 
`exec_ctx` is a pointer to `wazevo.executionContext`. This is used to exit the
   execution in the face of traps or for host function calls.
2. `module_ctx`: a pointer to `wazevo.moduleContextOpaque`. This is used, among
   other things, to access memory.

It then takes one parameter `v2`, corresponding to the function parameter, and
it defines two variables `v3`, `v4`. `v3` is the constant 0, `v4` is the result
of comparing `v2` to `v3` using the `i32.lt_s` instruction. Then, it branches
to `blk2` if `v4` is zero, otherwise it jumps to `blk1`.

You might also have noticed that the instructions do not correspond strictly to
the original Wasm opcodes. This is because, similarly to the wazero IR used by
the old compiler, this is a custom IR.

You will also notice that, _on the right-hand side of the assignments_ of any
statement, no name occurs _twice_: this is why this form is called
**single-assignment**.

Finally, notice how `blk1` and `blk2` end with a jump to the last block `blk3`.

```
blk1: ()
  ...
  Jump blk3, v7

blk2: ()
  Jump blk3, v2

blk3: (v5:i32)
  ...
```

`blk3` takes an argument `v5`: `blk1` jumps to `blk3` with `v7` and `blk2`
jumps to `blk3` with `v2`, meaning `v5` is effectively a rename of either `v7`
or `v2`, depending on the originating block. If you are familiar with the
traditional representation of an SSA form, you will recognize that the role of
block arguments is equivalent to the role of the *Phi (Φ) function*, a special
function that returns a different value depending on the incoming edge; e.g.,
in this case: `v5 := Φ(v7, v2)`.

### Code

The relevant APIs can be found under the sub-packages `ssa` and `frontend`.
In the code, the terms *lower* or *lowering* are often used to indicate a
mapping or a translation, because such transformations usually correspond to
targeting a lower abstraction level.

- Basic Blocks are represented by the type `ssa.Block`.
- The SSA form is constructed using an `ssa.Builder`. The `ssa.Builder` is
  instantiated in the context of `wasm.Engine.CompileModule()`, more
  specifically in the method `frontend.Compiler.LowerToSSA()`.
- The mapping between Wasm opcodes and the IR happens in `frontend/lower.go`,
  more specifically in the method `frontend.Compiler.lowerCurrentOpcode()`.
- Because they are semantically equivalent, in the code, basic block parameters
  are sometimes referred to as "Phi values".

#### Instructions and Values

An `ssa.Instruction` is a single instruction in the SSA form. Each instruction
might consume zero or more `ssa.Value`s, and it usually produces a single
`ssa.Value`; some instructions may not produce any value (for instance, a
`Jump` instruction). An `ssa.Value` is an abstraction that represents a typed
name binding, and it is used to represent the result of an instruction, or the
input to an instruction.

For instance:

```
blk1: () <-- (blk0)
  v6:i32 = Iconst_32 0x0
  v7:i32 = Isub v6, v2
  Jump blk3, v7
```

`Iconst_32` takes no input value and produces value `v6`; `Isub` takes two
input values (`v6`, `v2`) and produces value `v7`; `Jump` takes one input value
(`v7`) and produces no value. All such values have the `i32` type. The wazero
SSA's type system (`ssa.Type`) allows the following types:

- `i32`: 32-bit integer
- `i64`: 64-bit integer
- `f32`: 32-bit floating point
- `f64`: 64-bit floating point
- `v128`: 128-bit SIMD vector

For simplicity, we don't have a dedicated type for pointers.
Instead, we use the `i64` type to represent pointer values, since we only
support 64-bit architectures; this differs from traditional compilers such as
LLVM, which have a dedicated pointer type.

Values and instructions are both allocated from pools to minimize memory
allocations.

### Debug Flags

- `wazevoapi.PrintSSA` dumps the SSA form to the console.
- `wazevoapi.FrontEndLoggingEnabled` dumps progress of the translation between
  Wasm opcodes and SSA instructions to the console.

## Optimization

The SSA form makes it easier to perform a number of optimizations. For
instance, we can perform constant propagation, dead code elimination, and
common subexpression elimination. These optimizations either act upon the
instructions within a basic block, or they act upon the control-flow graph as a
whole.

At a high level, consider the following basic block, derived from the previous
example:

```
blk0: (exec_ctx:i64, module_ctx:i64)
  v2:i32 = Iconst_32 -5
  v3:i32 = Iconst_32 0
  v4:i32 = Icmp lt_s, v2, v3
  Brz v4, blk2
  Jump blk1
```

It is pretty easy to see that the comparison in `v4` can be replaced by a
constant `1`, because the comparison is between two constant values (-5, 0).
Therefore, the block can be rewritten as such:

```
blk0: (exec_ctx:i64, module_ctx:i64)
  v4:i32 = Iconst_32 1
  Brz v4, blk2
  Jump blk1
```

However, we can now also see that the conditional branch is never taken (`v4`
is never zero), and that the block `blk2` is never executed, so even the branch
instruction and the constant definition `v4` can be removed:

```
blk0: (exec_ctx:i64, module_ctx:i64)
  Jump blk1
```

This is a simple example of constant propagation and dead code elimination
occurring within a basic block. However, now `blk2` is unreachable, because
there is no other edge in the graph that points to it; thus it can be removed
from the control-flow graph. This is an example of dead-code elimination that
occurs at the control-flow graph level.

In practice, because WebAssembly is a compilation target, these simple
optimizations are often unnecessary. The optimization passes implemented in
wazero are also work-in-progress and, at the time of writing, further work is
expected to implement more advanced optimizations.

### Code

Optimization passes are implemented by `ssa.Builder.RunPasses()`. An
optimization pass is just a function that takes an `ssa.Builder` as a
parameter.

Passes iterate over the basic blocks, and, for each basic block, they iterate
over the instructions. Each pass may mutate the basic block by modifying the
instructions it contains, or it might change the entire shape of the
control-flow graph (e.g. by removing blocks).

Currently, there are two dead-code elimination passes:

- `passDeadBlockEliminationOpt` acting at the block level.
- `passDeadCodeEliminationOpt` acting at the instruction level.

Notably, `passDeadCodeEliminationOpt` also assigns an `InstructionGroupID` to
each instruction. This is used to determine whether a sequence of instructions
can be replaced by a single machine instruction during the back-end phase. For
more details, see also the relevant documentation in `ssa/instructions.go`.

There are also simple constant folding passes such as `passNopInstElimination`,
which folds and deletes instructions that are essentially no-ops (e.g. shifting
by a 0 amount).

### Debug Flags

`wazevoapi.PrintOptimizedSSA` dumps the SSA form to the console after
optimization.
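
Before moving on to block layout, it may help to make the shape of such a pass
concrete. The following self-contained sketch performs the constant folding
described above over a toy instruction representation; the types are invented
for illustration and are deliberately much simpler than the actual `ssa`
package:

```go
package main

import "fmt"

// A toy SSA-like instruction: an opcode, the value IDs of its operands, and a
// constant payload used when the opcode is "Iconst".
type inst struct {
	op    string
	args  []int
	konst int32
}

// foldConstants rewrites every "Iadd" whose operands are both constants into a
// single "Iconst", mirroring the in-block constant propagation shown above.
// defs maps the value ID produced by an instruction to that instruction.
func foldConstants(defs map[int]*inst) {
	for _, ins := range defs {
		if ins.op != "Iadd" {
			continue
		}
		a, b := defs[ins.args[0]], defs[ins.args[1]]
		if a != nil && b != nil && a.op == "Iconst" && b.op == "Iconst" {
			*ins = inst{op: "Iconst", konst: a.konst + b.konst}
		}
	}
}

func main() {
	// v2 = Iconst 40; v3 = Iconst 2; v4 = Iadd v2, v3
	defs := map[int]*inst{
		2: {op: "Iconst", konst: 40},
		3: {op: "Iconst", konst: 2},
		4: {op: "Iadd", args: []int{2, 3}},
	}
	foldConstants(defs)
	fmt.Printf("v4 = %s %d\n", defs[4].op, defs[4].konst) // v4 = Iconst 42
}
```

After such a rewrite, the now-unused `Iconst` definitions are exactly the kind
of leftovers that a dead-code elimination pass like `passDeadCodeEliminationOpt`
would then remove.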
+ + +## Block Layout + +As we have seen earlier, the SSA form instructions are contained within basic +blocks, and the basic blocks are connected by edges of the control-flow graph. +However, machine code is not laid out in a graph, but it is just a linear +sequence of instructions. + +Thus, the last step of the front-end is to lay out the basic blocks in a linear +sequence. Because each basic block, by design, ends with a control-flow +instruction, one of the goals of the block layout phase is to maximize the number of +**fall-through opportunities**. A fall-through opportunity occurs when a block ends +with a jump instruction whose target is exactly the next block in the +sequence. In order to maximize the number of fall-through opportunities, the +block layout phase might reorder the basic blocks in the control-flow graph, +and transform the control-flow instructions. For instance, it might _invert_ +some branching conditions. + +The end goal is to effectively minimize the number of jumps and branches in +the machine code that will be generated later. + + +### Critical Edges + +Special attention must be taken when a basic block has multiple predecessors, +i.e., when it has multiple incoming edges. In particular, an edge between two +basic blocks is called a **critical edge** when, at the same time: +- the predecessor has multiple successors **and** +- the successor has multiple predecessors. + +For instance, in the example below the edge between `BB0` and `BB3` +is a critical edge. + +```goat { width="300" } +┌───────┐ ┌───────┐ +│ BB0 │━┓ │ BB1 │ +└───────┘ ┃ └───────┘ + │ ┃ │ + ▼ ┃ ▼ +┌───────┐ ┃ ┌───────┐ +│ BB2 │ ┗━▶│ BB3 │ +└───────┘ └───────┘ +``` + +In these cases the critical edge is split by introducing a new basic block, +called a **trampoline**, where the critical edge was. + +```goat { width="300" } +┌───────┐ ┌───────┐ +│ BB0 │──────┐ │ BB1 │ +└───────┘ ▼ └───────┘ + │ ┌──────────┐ │ + │ │trampoline│ │ + ▼ └──────────┘ ▼ +┌───────┐ │ ┌───────┐ +│ BB2 │ └────▶│ BB3 │ +└───────┘ └───────┘ +``` + +For more details on critical edges read more at + +- https://en.wikipedia.org/wiki/Control-flow_graph +- https://nickdesaulniers.github.io/blog/2023/01/27/critical-edge-splitting/ + +### Example + +At the end of the block layout phase, the laid out SSA for the `abs` function +looks as follows: + +``` +blk0: (exec_ctx:i64, module_ctx:i64, v2:i32) + v3:i32 = Iconst_32 0x0 + v4:i32 = Icmp lt_s, v2, v3 + Brz v4, blk2 + Jump fallthrough + +blk1: () <-- (blk0) + v6:i32 = Iconst_32 0x0 + v7:i32 = Isub v6, v2 + Jump blk3, v7 + +blk2: () <-- (blk0) + Jump fallthrough, v2 + +blk3: (v5:i32) <-- (blk1,blk2) + Jump blk_ret, v5 +``` + +### Code + +`passLayoutBlocks` implements the block layout phase. + +### Debug Flags + +- `wazevoapi.PrintBlockLaidOutSSA` dumps the SSA form to the console after block layout. +- `wazevoapi.SSALoggingEnabled` logs the transformations that are applied during this phase, + such as inverting branching conditions or splitting critical edges. + +
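
To make the critical-edge rule above concrete, the following self-contained
sketch finds the critical edges of a toy CFG given as a successor map. The
types are invented for illustration and are not the actual `ssa` package API:

```go
package main

import "fmt"

// criticalEdges returns every edge (from, to) whose source has more than one
// successor and whose destination has more than one predecessor: exactly the
// edges that the block layout phase splits by inserting a trampoline block.
func criticalEdges(succs map[string][]string) [][2]string {
	preds := map[string]int{}
	for _, ss := range succs {
		for _, s := range ss {
			preds[s]++
		}
	}
	var edges [][2]string
	for from, ss := range succs {
		if len(ss) < 2 {
			continue // the source must have multiple successors
		}
		for _, to := range ss {
			if preds[to] >= 2 { // ...and the destination multiple predecessors
				edges = append(edges, [2]string{from, to})
			}
		}
	}
	return edges
}

func main() {
	// The CFG from the diagrams above: BB0 -> {BB2, BB3}, BB1 -> {BB3}.
	cfg := map[string][]string{
		"BB0": {"BB2", "BB3"},
		"BB1": {"BB3"},
	}
	fmt.Println(criticalEdges(cfg)) // [[BB0 BB3]]
}
```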
+ +* Previous Section: [How the Optimizing Compiler Works](../) +* Next Section: [Back-End](../backend/) + +[ssa-blocks]: https://en.wikipedia.org/wiki/Static_single-assignment_form#Block_arguments +[llvm-mlir]: https://mlir.llvm.org/docs/Rationale/Rationale/#block-arguments-vs-phi-nodes