| 
 | 1 | +# Summary  | 
 | 2 | +[summary]: #summary  | 
 | 3 | + | 
 | 4 | +This RFC proposes to improve control flow integrity for compiled WebAssembly code by utilizing two  | 
 | 5 | +technologies from the Arm instruction set architecture: Pointer Authentication and Branch Target  | 
 | 6 | +Identification.  | 
 | 7 | + | 
 | 8 | +# Motivation  | 
 | 9 | +[motivation]: #motivation  | 
 | 10 | + | 
 | 11 | +The [security model of WebAssembly](https://webassembly.org/docs/security/) ensures that Wasm  | 
 | 12 | +modules execute in a sandboxed environment isolated from the host runtime. One aspect of that model  | 
 | 13 | +is that it provides implicit control flow integrity (CFI) by forcing all function call targets to  | 
 | 14 | +specify a valid entry in the function index space, by using a protected call stack that is not  | 
 | 15 | +affected by buffer overflows in the module heap, and so on. As a result, in some Wasm applications  | 
 | 16 | +the runtime is able to execute untrusted code safely. However, that places the burden of ensuring  | 
 | 17 | +that the security properties are upheld on the compiler to a large extent.  | 
 | 18 | + | 
 | 19 | +On the other hand, a further aspect of the WebAssembly design is efficient execution (close to  | 
 | 20 | +native speed), which leads to a natural tendency towards sophisticated optimizing compilers.  | 
 | 21 | +Unfortunately, the additional complexity increases the risk of implementation problems and in  | 
 | 22 | +particular compromises of the security properties. For example, Cranelift has been affected by  | 
 | 23 | +issues such as CVE-2021-32629 [cve] that could make it possible to access the protected call stack  | 
 | 24 | +or memory that is private to the host runtime.  | 
 | 25 | + | 
 | 26 | +We are trying to tackle the challenge of ensuring compiler correctness with initiatives such as  | 
 | 27 | +expanding fuzzing and making it possible to apply formal verification to at least some parts of the  | 
 | 28 | +compilation process. However, it is also reasonable to consider a defense in depth strategy and to  | 
 | 29 | +evaluate mitigations for potential future issues.  | 
 | 30 | + | 
 | 31 | +Finally, Wasmtime can be used as a library and in particular embedded into an application that is  | 
 | 32 | +implemented in languages that lack some of the hardening provided by Rust such as C and C++. In that  | 
 | 33 | +case the compiled WebAssembly code could provide convenient instruction sequences for attacks that  | 
 | 34 | +subvert normal control flow and that originate from the embedder's code, even if Cranelift and  | 
 | 35 | +Wasmtime themselves lack any defects.  | 
 | 36 | + | 
 | 37 | +[cve]: https://github.com/bytecodealliance/wasmtime/security/advisories/GHSA-hpqh-2wqx-7qp5  | 
 | 38 | + | 
 | 39 | +# Proposal  | 
 | 40 | +[proposal]: #proposal  | 
 | 41 | + | 
 | 42 | +Currently this proposal focuses on the AArch64 execution environment.  | 
 | 43 | + | 
 | 44 | +## Background  | 
 | 45 | + | 
 | 46 | +The Pointer Authentication (PAuth) extension to the Arm architecture protects function returns, i.e.  | 
 | 47 | +provides back-edge CFI. It is described in section D5.1.5 of  | 
 | 48 | +[the Arm Architecture Reference Manual][arm-arm]. Some of the PAuth operations act as `NOP`  | 
 | 49 | +instructions when executed by a processor that does not support the extension.  | 
 | 50 | + | 
 | 51 | +The Branch Target Identification (BTI) extension protects other kinds of indirect branches, i.e.  | 
 | 52 | +provides forward-edge CFI. It is described in section D5.4.4. A processor implementation with BTI  | 
 | 53 | +would support PAuth as well, but not necessarily vice versa. Whether BTI applies to an executable  | 
 | 54 | +memory page or not is controlled by a dedicated page attribute. Note that the `BTI` "landing pad"  | 
 | 55 | +for indirect branches acts as a `NOP` instruction when the extension is not active (e.g. for  | 
 | 56 | +processors that do not support BTI).  | 
 | 57 | + | 
 | 58 | +Both extensions are applicable only to the AArch64 execution state and are optional, so each CFI  | 
 | 59 | +technique would be employed only if the target environment provides the necessary ISA support.  | 
 | 60 | +Wasmtime embedders need to consider a subtlety: if they cache the result of the check, the cached  | 
 | 61 | +value may reside in memory that is accessible to an attacker, who could then disable the use of  | 
 | 62 | +PAuth and BTI in subsequent code generation. Mitigating this issue is outside the scope of this  | 
 | 63 | +proposal.  | 
 | 64 | + | 
 | 65 | +The article [*Code reuse attacks: The compiler story*][code-reuse-attacks] provides an introduction  | 
 | 66 | +to the technologies.  | 
 | 67 | + | 
 | 68 | +[arm-arm]: https://developer.arm.com/documentation/ddi0487/gb/?lang=en  | 
 | 69 | +[code-reuse-attacks]: https://community.arm.com/arm-community-blogs/b/tools-software-ides-blog/posts/code-reuse-attacks-the-compiler-story  | 
 | 70 | + | 
 | 71 | +## Improved back-edge CFI with PAuth  | 
 | 72 | + | 
 | 73 | +The proposed implementation will add the `PACIASP` instruction to the beginning of every function  | 
 | 74 | +compiled by Cranelift and will replace the final return with the `RETAA` instruction.  | 
 | 75 | + | 
 | 76 | +In environments that use the DWARF format for unwinding the implementation would be modified to  | 
 | 77 | +apply the `DW_CFA_AARCH64_negate_ra_state` operation immediately after the `PACIASP` instruction.  | 
 | 78 | + | 
 | 79 | +These steps can be skipped for simple leaf functions that do not construct frame records on the  | 
 | 80 | +stack.  | 
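
The selection logic described in this section can be sketched in Rust as follows. This is a simplified model under stated assumptions, not the Cranelift implementation: `FunctionInfo`, `prologue_prefix`, and `return_instruction` are hypothetical names, and the constants are the A64 encodings of the three instructions involved.

```rust
/// A64 instruction encodings used by this sketch.
pub const PACIASP: u32 = 0xD503_233F;
pub const RETAA: u32 = 0xD65F_0BFF;
pub const RET: u32 = 0xD65F_03C0;

/// Hypothetical summary of what the backend knows about a function.
pub struct FunctionInfo {
    /// False for simple leaf functions that never spill the return address.
    pub has_frame_record: bool,
}

/// Returns true if the return address should be signed and authenticated.
fn uses_pauth(func: &FunctionInfo, pauth_enabled: bool) -> bool {
    pauth_enabled && func.has_frame_record
}

/// Instruction to place at the very start of the function, if any.
pub fn prologue_prefix(func: &FunctionInfo, pauth_enabled: bool) -> Option<u32> {
    if uses_pauth(func, pauth_enabled) {
        Some(PACIASP)
    } else {
        None
    }
}

/// Instruction that ends the function: `RETAA` authenticates and returns,
/// while plain `RET` is kept when the return address was never signed.
pub fn return_instruction(func: &FunctionInfo, pauth_enabled: bool) -> u32 {
    if uses_pauth(func, pauth_enabled) {
        RETAA
    } else {
        RET
    }
}
```

Keeping the decision in a single helper means that the prologue and the return can never disagree about whether the return address is signed.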
 | 81 | + | 
 | 82 | +## Enhanced forward-edge CFI with BTI  | 
 | 83 | + | 
 | 84 | +The proposed implementation will add the `BTI j` instruction to the beginning of every basic block  | 
 | 85 | +that is the target of an indirect branch and that is not a function prologue. Note that in the  | 
 | 86 | +AArch64 backend generated function calls always target function prologues and indirect branches that  | 
 | 87 | +do not act like function calls appear only in the implementation of the `br_table` IR operation.  | 
 | 88 | +Function prologues would be covered by the pointer authentication instructions, which also act as  | 
 | 89 | +landing pads - as discussed before, BTI support implies PAuth.  | 
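
A minimal Rust sketch of the placement rule follows, using a toy control flow representation rather than Cranelift's actual data structures. `BlockId`, `Terminator`, and `blocks_needing_bti_j` are hypothetical names. The sketch assumes that `targets` lists every block that the jump table's indirect branch can reach, and that the entry block is covered by the function prologue.

```rust
use std::collections::BTreeSet;

/// A64 encoding of `BTI j`.
pub const BTI_J: u32 = 0xD503_249F;

/// Hypothetical block identifier: the index of the block in the function.
pub type BlockId = usize;

/// Toy model of how each basic block ends.
pub enum Terminator {
    /// Direct branches never need a landing pad at their destination.
    Jump(BlockId),
    CondBranch { taken: BlockId, not_taken: BlockId },
    /// Models `br_table`: `targets` are reached through an indirect branch.
    BrTable { targets: Vec<BlockId> },
    Return,
}

/// Collects the blocks that must start with `BTI j`.
pub fn blocks_needing_bti_j(terminators: &[Terminator], entry: BlockId) -> BTreeSet<BlockId> {
    let mut pads = BTreeSet::new();
    for terminator in terminators {
        if let Terminator::BrTable { targets } = terminator {
            pads.extend(targets.iter().copied());
        }
    }
    // The function prologue already acts as a landing pad.
    pads.remove(&entry);
    pads
}
```

The prototype approach mentioned in the next paragraph corresponds to returning every block except the entry, regardless of the terminators.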
 | 90 | + | 
 | 91 | +During development one simple way to create a working prototype is to add the landing pads to the  | 
 | 92 | +beginning of every basic block, irrespective of whether it is the target of an indirect branch or  | 
 | 93 | +not. In this way it can be checked if BTI causes any issue with the rest of the runtime.  | 
 | 94 | + | 
 | 95 | +## CFI improvements to assembly, C, C++, and Rust code  | 
 | 96 | + | 
 | 97 | +Improving CFI for compiled C, C++, and Rust code with the same technologies is outside the scope of  | 
 | 98 | +this proposal, but in general it should be achievable by passing the appropriate parameters to the  | 
 | 99 | +respective compiler.  | 
 | 100 | + | 
 | 101 | +Functions implemented in assembly will get a similar treatment as generated code, i.e. they will  | 
 | 102 | +start with the `PACIASP` instruction. However, the regular return will be preserved and preceded  | 
 | 103 | +by the `AUTIASP` instruction rather than replaced with `RETAA`. The reason is that both `AUTIASP`  | 
 | 104 | +and `PACIASP` act as `NOP` instructions when executed by a processor that does not support PAuth,  | 
 | 105 | +thus making the assembly code generic.  | 
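
The reason these two instructions degrade to `NOP` can be seen from their encodings: both are allocated in the A64 `HINT` space, where unallocated values execute as `NOP`. The Rust sketch below derives the encodings from the hint number and assembles the generic sequence. The `hint` and `wrap_assembly_body` helpers are hypothetical and exist only for illustration.

```rust
/// Encodes `HINT #imm7`; the 7-bit immediate occupies bits [11:5].
pub const fn hint(imm7: u32) -> u32 {
    0xD503_201F | ((imm7 & 0x7F) << 5)
}

pub const NOP: u32 = hint(0);
pub const PACIASP: u32 = hint(25);
pub const AUTIASP: u32 = hint(29);
/// Plain return through the link register.
pub const RET: u32 = 0xD65F_03C0;

/// Wraps a function body in the sequence that also runs on processors
/// without PAuth: sign on entry, authenticate, then return normally.
pub fn wrap_assembly_body(body: &[u32]) -> Vec<u32> {
    let mut code = Vec::with_capacity(body.len() + 3);
    code.push(PACIASP);
    code.extend_from_slice(body);
    code.push(AUTIASP);
    code.push(RET);
    code
}
```

By contrast, `RETAA` lies outside the `HINT` space, so it cannot be used in code that must also run on processors without PAuth.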
 | 106 | + | 
 | 107 | +# Rationale and alternatives  | 
 | 108 | +[rationale-and-alternatives]: #rationale-and-alternatives  | 
 | 109 | + | 
 | 110 | +Since the existing implementation already uses the standard back-edge CFI techniques that are  | 
 | 111 | +preferred in the absence of special hardware support (i.e. a separate protected stack that is not  | 
 | 112 | +used for buffers that could be accessed out of bounds), the alternative is not to implement the  | 
 | 113 | +proposal, so the rationale is based mainly on the overhead being insignificant. In terms of code  | 
 | 114 | +size the impact of the back-edge CFI improvements is an additional instruction per function, or 2  | 
 | 115 | +for functions implemented in assembly.  | 
 | 116 | + | 
 | 117 | +The [Clang CFI design][clang-cfi-design] provides an idea for an alternative implementation of the  | 
 | 118 | +forward-edge CFI mechanism that is enabled by BTI. It involves instrumenting every indirect branch  | 
 | 119 | +to check if its destination is permitted. While the overhead of this approach can be reduced by  | 
 | 120 | +using efficient data structures for the destination address lookup and optionally limiting the  | 
 | 121 | +checks only to indirect function calls, it is still significantly larger than the worst-case BTI  | 
 | 122 | +overhead of one instruction per basic block. On the other hand, it does not require any  | 
 | 123 | +special hardware support, so it could be applied to all supported platforms.  | 
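
The code size figures quoted in this section can be turned into a back-of-the-envelope estimate. The Rust sketch below is only arithmetic over the stated per-function and per-block costs, using the fixed 4-byte A64 instruction size. The function names are hypothetical, and `generated_functions` is assumed to count only functions that construct a frame record.

```rust
/// Every A64 instruction is 4 bytes long.
pub const INSN_BYTES: usize = 4;

/// Back-edge CFI: one extra instruction per generated function (`PACIASP`,
/// since `RETAA` replaces the return) and two per assembly function
/// (`PACIASP` and `AUTIASP`).
pub fn back_edge_overhead_bytes(generated_functions: usize, assembly_functions: usize) -> usize {
    (generated_functions + 2 * assembly_functions) * INSN_BYTES
}

/// Forward-edge CFI, worst case: a landing pad at the start of every
/// basic block.
pub fn bti_worst_case_overhead_bytes(basic_blocks: usize) -> usize {
    basic_blocks * INSN_BYTES
}
```

For example, 1000 generated functions and 3 assembly functions add 4024 bytes for back-edge CFI, and 5000 basic blocks add at most 20000 bytes for BTI landing pads.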
 | 124 | + | 
 | 125 | +[clang-cfi-design]: https://clang.llvm.org/docs/ControlFlowIntegrityDesign.html  | 
 | 126 | + | 
 | 127 | +# Open questions  | 
 | 128 | +[open-questions]: #open-questions  | 
 | 129 | + | 
 | 130 | +- What is the performance overhead of the proposal?  | 
 | 131 | +- What technologies are available in other instruction set architectures to achieve the same goals?  | 
 | 132 | +- What hardening approaches are applicable to the fiber implementation? The fiber switching code  | 
 | 133 | +saves the values of all callee-saved registers on the stack, i.e. memory that is potentially  | 
 | 134 | +accessible to an attacker. Some of those values could be code addresses that would be used by  | 
 | 135 | +indirect branches, so should we devise a scheme to authenticate them? While the regular pointer  | 
 | 136 | +authentication instructions assume that they are operating on valid virtual addresses (which implies  | 
 | 137 | +that the most significant bits are redundant and could be repurposed), PAuth provides operations to  | 
 | 138 | +authenticate arbitrary data, which could be used in this case.  | 
 | 139 | +- Should we generate the operations that act as `NOP` instructions unconditionally instead (while  | 
 | 140 | +still choosing the shorter alternative sequences if the target supports them)? That would  | 
 | 141 | +especially help the ahead of time compilation use case, and could arguably reduce the amount of  | 
 | 142 | +testing, i.e. no need to check both with and without CFI enhancements.  | 