Description
Forgive me if I'm missing it, but is there a discussion of how the Unicode UTR 36: UTF-8 Exploits are addressed by the component model strings?
From what I can tell looking at the CanonicalABI it looks like the string lift operation is responsible for validation and trapping on "Unicode Errors".
I'm wondering what guarantees I have as a component author that lowered strings are valid UTF-8 strings from the security perspective of that report. For example, overlong string encodings in the cited UTR 36 document and in the UTF-8 Wikipedia: Invalid Sequences and Error Handling topic are specifically described as being the cause of security issues in web services (a relevant use-case for WASM components) and potentially overlooked by decoders (WASM component authors).
Some concrete questions:
-
Are there guarantees that can be documented for component authors about strings lowered into a component and errors raised for improperly formatting sequences for strings lifted?
-
Are there compliance tests for tooling / host runtimes around expected UTF-8 validation (particularly as the security issues there are relevant to server applications of WASM components).
-
The canon_lower topic has a discussion point on efficient trampolines:
Since any cross-component call necessarily transits through a statically-known canon_lower+canon_lift call pair, an AOT compiler can fuse canon_lift and canon_lower into a single, efficient trampoline.
Is there a discussion of the validation expectations of such efficient trampoline optimizations? I'd assume you would still need to run the validation passes associated with a lift on a UTF-8 string to prevent issues like overlong encoding being overlooked.
My goal in the end is to make sure I'm not doing that work twice. If there are strong guarantees clearly described about what validation is done on strings I can skip doing that work or conversely make sure it is done.