118 changes: 59 additions & 59 deletions proposals/stringref/Overview.md
@@ -202,11 +202,11 @@ value reaches any instruction in this proposal. The one exception is
### Creating strings

```
(string.new_utf8 $memory ptr:address bytes:i32)
(string.decode_from_utf8 $memory ptr:address bytes:i32)
-> str:stringref
(string.new_lossy_utf8 $memory ptr:address bytes:i32)
(string.decode_from_lossy_utf8 $memory ptr:address bytes:i32)
-> str:stringref
(string.new_wtf8 $memory ptr:address bytes:i32)
(string.decode_from_wtf8 $memory ptr:address bytes:i32)
-> str:stringref
```
Create a new string from the *`bytes`* bytes in memory at *`ptr`*.
@@ -215,22 +215,22 @@ Out-of-bounds access will trap. The maximum value for *`bytes`* is

These three instructions decode the bytes in three different ways:

* `string.new_utf8` decodes using a strict UTF-8 decoder. If the
* `string.decode_from_utf8` decodes using a strict UTF-8 decoder. If the
bytes are not valid UTF-8, trap.

* `string.new_lossy_utf8` decodes using a sloppy UTF-8 decoder: all
* `string.decode_from_lossy_utf8` decodes using a sloppy UTF-8 decoder: all
maximal subparts of an invalid subsequence are decoded as if they
were `U+FFFD` (the replacement character) instead. This instruction
will never trap due to a decoding error. See the section entitled
"U+FFFD Substitution of Maximal Subparts" in the Unicode standard,
version 14.0.0, page 126.

* `string.new_wtf8` decodes using a strict WTF-8 decoder, which is like
* `string.decode_from_wtf8` decodes using a strict WTF-8 decoder, which is like
UTF-8 but also allows isolated surrogates. If the bytes are not
valid WTF-8, trap.
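
The difference between the three decoders is easiest to see on concrete bytes. The following is a rough Python model, not part of the proposal: the function names and the `Trap` exception are hypothetical, and Python's `surrogatepass` error handler is assumed to be a reasonable stand-in for WTF-8 decoding once an extra check rejects surrogate pairs that were encoded as two three-byte sequences.

```python
class Trap(Exception):
    """Hypothetical stand-in for a WebAssembly trap."""


def decode_from_utf8(data: bytes) -> str:
    # Strict UTF-8: any invalid byte sequence traps.
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as err:
        raise Trap("invalid UTF-8") from err


def decode_from_lossy_utf8(data: bytes) -> str:
    # Lossy UTF-8: invalid subsequences become U+FFFD; decoding never traps.
    return data.decode("utf-8", errors="replace")


def decode_from_wtf8(data: bytes) -> str:
    # Assumption: "surrogatepass" approximates WTF-8 by letting isolated
    # surrogates through while still rejecting other invalid sequences.
    try:
        text = data.decode("utf-8", errors="surrogatepass")
    except UnicodeDecodeError as err:
        raise Trap("invalid WTF-8") from err
    # WTF-8 forbids a high surrogate directly followed by a low surrogate;
    # such a pair must be encoded as a single four-byte sequence instead.
    for a, b in zip(text, text[1:]):
        if "\ud800" <= a <= "\udbff" and "\udc00" <= b <= "\udfff":
            raise Trap("surrogate pair encoded as two sequences")
    return text
```

In this model the bytes `ED A0 80` (an isolated high surrogate) trap under `decode_from_utf8` but are accepted by `decode_from_wtf8`.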

@lygstate (PR author) commented on Dec 30, 2025:

> If the bytes are not valid WTF-8, trap

If that is the case, we may also want a `string.decode_from_lossy_wtf8` that replaces invalid characters with `0xEF 0xBF 0xBD`, the UTF-8 encoding of `U+FFFD` (the replacement character).
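
One possible reading of this suggestion, sketched in Python under loud assumptions: `string.decode_from_lossy_wtf8` is not defined anywhere in the proposal, the function name below is hypothetical, Python's `surrogatepass` handler stands in for a WTF-8 decoder, and the number of bytes covered by each `U+FFFD` is simply whatever range CPython's decoder reports as invalid.

```python
def decode_from_lossy_wtf8(data: bytes) -> str:
    """Hypothetical lossy WTF-8 decoder: keeps isolated surrogates,
    replaces every other invalid sequence with U+FFFD, never fails."""
    out = []
    pos = 0
    while pos < len(data):
        rest = data[pos:]
        try:
            # Assumption: "surrogatepass" lets isolated surrogates through.
            out.append(rest.decode("utf-8", errors="surrogatepass"))
            break
        except UnicodeDecodeError as err:
            # err.start and err.end are offsets into `rest`.
            out.append(rest[:err.start].decode("utf-8", errors="surrogatepass"))
            out.append("\ufffd")  # encodes to 0xEF 0xBF 0xBD in UTF-8
            pos += err.end
    return "".join(out)
```

The sketch does not reject surrogate pairs encoded as two three-byte sequences; a real instruction would have to decide whether those are replaced or combined.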

```
(string.new_wtf16 $memory ptr:address codeunits:i32)
(string.decode_from_wtf16 $memory ptr:address codeunits:i32)
-> str:stringref
```
Create a new string from the *`codeunits`* code units encoded in memory at
@@ -240,14 +240,14 @@ is 2<sup>30</sup>–1; passing a higher value traps. Each code unit is
read from memory as if with `i32.load16`, and is therefore decoded
using little-endian byte order.
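
As a rough illustration of the little-endian decoding and the size limit, here is a Python model of the WTF-16 decoder. It is a sketch only: `decode_from_wtf16`, `Trap`, and the `memory` byte string are hypothetical names, and the bounds check assumes linear memory behaves like a flat byte array.

```python
MAX_WTF16_CODE_UNITS = 2**30 - 1  # limit on code units stated in the text


class Trap(Exception):
    """Hypothetical stand-in for a WebAssembly trap."""


def decode_from_wtf16(memory: bytes, ptr: int, codeunits: int) -> str:
    if codeunits < 0 or codeunits > MAX_WTF16_CODE_UNITS:
        raise Trap("too many code units")
    end = ptr + 2 * codeunits
    if ptr < 0 or end > len(memory):
        raise Trap("out-of-bounds memory access")
    # Each code unit is two bytes, little-endian. "surrogatepass" keeps
    # isolated surrogates instead of rejecting them, as WTF-16 requires.
    return memory[ptr:end].decode("utf-16-le", errors="surrogatepass")
```

For example, the four bytes `68 00 69 00` at `ptr` decode to the two code units `h` and `i`.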

#### `string.new` size limits
#### `string.decode_from_*` size limits

Creating a string is a form of dynamic allocation and can fail. The
same implementation running on different machines can have different
behaviors. The specification can only say that byte/code-unit sizes
above a certain limit *must* fail; but for sizes within the limits, the
allocations *may* fail. If an allocation fails, the implementation must
trap. Fallible `string.new` is a possible future extension.
trap. Fallible `string.decode_from_*` is a possible future extension.

### String literals

@@ -281,7 +281,7 @@ string literal section as a future extension.

The maximum size for the WTF-8 encoding of an individual string literal
is 2<sup>31</sup>–1 bytes. Embeddings may impose their own limits which
are more restricted. But similarly to `string.new_wtf8`, instantiating
are more restricted. But similarly to `string.decode_from_wtf8`, instantiating
a module with string literals may fail due to lack of memory resources,
even if the string size is formally within the limits. However
`string.const` itself never traps when passed a valid literal offset.
@@ -331,7 +331,7 @@ is 2<sup>30</sup>-1. If an encoding would require more code units than
the limit, the result is -1.

```
(string.encode_utf8 $memory str:stringref ptr:address)
(string.encode_to_utf8 $memory str:stringref ptr:address)
-> codeunits:i32
```
Encode the contents of the string *`str`* as UTF-8 to memory at *ptr*.
@@ -340,11 +340,11 @@ written, which will be the same as returned by the corresponding
`string.measure_utf8`.

The maximum number of bytes that can be encoded at once by
`string.encode` is 2<sup>31</sup>-1. If an encoding would require more
`string.encode_to_utf8` is 2<sup>31</sup>-1. If an encoding would require more
bytes, it is as if the codepoints can't be encoded (a trap).

```
(string.encode_lossy_utf8 $memory str:stringref ptr:address)
(string.encode_to_lossy_utf8 $memory str:stringref ptr:address)
-> codeunits:i32
```
Encode the contents of the string *`str`* as UTF-8 to memory at *`ptr`*.
@@ -353,23 +353,23 @@ character) instead. Return the number of code units written, which will
be the same as returned by the corresponding `string.measure_wtf8`.

The maximum number of bytes that can be encoded at once by
`string.encode` is 2<sup>31</sup>-1. If an encoding would require more
`string.encode_to_lossy_utf8` is 2<sup>31</sup>-1. If an encoding would require more
bytes, it is as if the codepoints can't be encoded (a trap).

```
(string.encode_wtf8 $memory str:stringref ptr:address)
(string.encode_to_wtf8 $memory str:stringref ptr:address)
-> codeunits:i32
```
Encode the contents of the string *`str`* as WTF-8 to memory at *`ptr`*.
Return the number of code units written, which will be the same as
returned by the corresponding `string.measure_wtf8`.

The maximum number of bytes that can be encoded at once by
`string.encode` is 2<sup>31</sup>-1. If an encoding would require more
`string.encode_to_wtf8` is 2<sup>31</sup>-1. If an encoding would require more
bytes, it is as if the codepoints can't be encoded (a trap).
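
The three 8-bit encoders differ only in how they treat isolated surrogates, which a small Python model can make concrete. This is a sketch, not the proposal's implementation: the function names and `Trap` are hypothetical, a Python `str` stands in for a `stringref`, and the input is assumed not to contain a high surrogate immediately followed by a low surrogate.

```python
class Trap(Exception):
    """Hypothetical stand-in for a WebAssembly trap."""


def _is_surrogate(ch: str) -> bool:
    return "\ud800" <= ch <= "\udfff"


def encode_to_utf8(text: str) -> bytes:
    # Strict: an isolated surrogate cannot be encoded as UTF-8, so trap.
    try:
        return text.encode("utf-8")
    except UnicodeEncodeError as err:
        raise Trap("isolated surrogate") from err


def encode_to_lossy_utf8(text: str) -> bytes:
    # Lossy: each isolated surrogate is replaced with U+FFFD first.
    replaced = "".join("\ufffd" if _is_surrogate(ch) else ch for ch in text)
    return replaced.encode("utf-8")


def encode_to_wtf8(text: str) -> bytes:
    # WTF-8: isolated surrogates are encoded as three-byte sequences.
    return text.encode("utf-8", errors="surrogatepass")
```

Because `U+FFFD` and an isolated surrogate both take three bytes, the lossy and WTF-8 encodings of a string always have the same length, which is consistent with both being sized by `string.measure_wtf8`.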

```
(string.encode_wtf16 $memory str:stringref ptr:address)
(string.encode_to_wtf16 $memory str:stringref ptr:address)
-> codeunits:i32
```
Encode the contents of the string *`str`* as WTF-16 to memory at
@@ -380,7 +380,7 @@ Each code unit is written to memory as if stored by `i32.store16`, so
WTF-16 code units are in little-endian byte order.

The maximum number of bytes that can be encoded at once by
`string.encode` is 2<sup>31</sup>-1. If an encoding would require more
`string.encode_to_wtf16` is 2<sup>31</sup>-1. If an encoding would require more
bytes, it is as if the codepoints can't be encoded (a trap).

### Concatenation
@@ -603,26 +603,26 @@ The instructions below shall be available in WebAssembly implementations
that support both GC and stringrefs.

```
(string.new_utf8_array codeunits:$t start:i32 end:i32)
(string.decode_from_utf8_array codeunits:$t start:i32 end:i32)
if expand($t) => array i8
-> str:stringref
(string.new_lossy_utf8_array codeunits:$t start:i32 end:i32)
(string.decode_from_lossy_utf8_array codeunits:$t start:i32 end:i32)
if expand($t) => array i8
-> str:stringref
(string.new_wtf8_array codeunits:$t start:i32 end:i32)
(string.decode_from_wtf8_array codeunits:$t start:i32 end:i32)
if expand($t) => array i8
-> str:stringref
```
Create a new string from a subsequence of the *`codeunits`* bytes in a
GC-managed array, starting from offset *`start`* and continuing to but
not including *`end`*. If *`end`* is less than *`start`* or is greater
than the array length, trap. The bytes are decoded in the same way as
`string.new_utf8`, `string.new_lossy_utf8`, and `string.new_wtf8`,
`string.decode_from_utf8`, `string.decode_from_lossy_utf8`, and `string.decode_from_wtf8`,
respectively. The maximum value for *`end`*–*`start`* is
2<sup>31</sup>–1; passing a higher value traps.

```
(string.new_wtf16_array codeunits:$t start:i32 end:i32)
(string.decode_from_wtf16_array codeunits:$t start:i32 end:i32)
if expand($t) => array i16
-> str:stringref
```
@@ -634,16 +634,16 @@ for *`end`*–*`start`* is 2<sup>30</sup>–1; passing a higher value
traps.

```
(string.encode_utf8_array str:stringref array:$t start:i32)
(string.encode_to_utf8_array str:stringref array:$t start:i32)
if expand($t) => array (mut i8)
-> codeunits:i32
(string.encode_lossy_utf8_array str:stringref array:$t start:i32)
(string.encode_to_lossy_utf8_array str:stringref array:$t start:i32)
if expand($t) => array (mut i8)
-> codeunits:i32
(string.encode_wtf8_array str:stringref array:$t start:i32)
(string.encode_to_wtf8_array str:stringref array:$t start:i32)
if expand($t) => array (mut i8)
-> codeunits:i32
(string.encode_wtf16_array str:stringref array:$t start:i32)
(string.encode_to_wtf16_array str:stringref array:$t start:i32)
if expand($t) => array (mut i16)
-> codeunits:i32
```
@@ -655,8 +655,8 @@ same as the result of a the corresponding `string.measure_wtf8` or
code units in the array, trap. Note that no `NUL` terminator is ever
written.

For `string.encode_utf8_array`, trap if an isolated surrogate is seen.
For `string.encode_lossy_utf8_array`, replace isolated surrogates with
For `string.encode_to_utf8_array`, trap if an isolated surrogate is seen.
For `string.encode_to_lossy_utf8_array`, replace isolated surrogates with
`U+FFFD`.

## Binary encoding
@@ -669,21 +669,21 @@ reftype ::= ...
| 0x61 ⇒ stringview_iter ; SLEB128(-0x1f)

instr ::= ...
| 0xfb 0x80:u32 $mem:u32 ⇒ string.new_utf8 $mem
| 0xfb 0x81:u32 $mem:u32 ⇒ string.new_wtf16 $mem
| 0xfb 0x80:u32 $mem:u32 ⇒ string.decode_from_utf8 $mem
| 0xfb 0x81:u32 $mem:u32 ⇒ string.decode_from_wtf16 $mem
| 0xfb 0x82:u32 $idx:u32 ⇒ string.const $idx
| 0xfb 0x83:u32 ⇒ string.measure_utf8
| 0xfb 0x84:u32 ⇒ string.measure_wtf8
| 0xfb 0x85:u32 ⇒ string.measure_wtf16
| 0xfb 0x86:u32 $mem:u32 ⇒ string.encode_utf8 $mem
| 0xfb 0x87:u32 $mem:u32 ⇒ string.encode_wtf16 $mem
| 0xfb 0x86:u32 $mem:u32 ⇒ string.encode_to_utf8 $mem
| 0xfb 0x87:u32 $mem:u32 ⇒ string.encode_to_wtf16 $mem
| 0xfb 0x88:u32 ⇒ string.concat
| 0xfb 0x89:u32 ⇒ string.eq
| 0xfb 0x8a:u32 ⇒ string.is_usv_sequence
| 0xfb 0x8b:u32 $mem:u32 ⇒ string.new_lossy_utf8 $mem
| 0xfb 0x8c:u32 $mem:u32 ⇒ string.new_wtf8 $mem
| 0xfb 0x8d:u32 $mem:u32 ⇒ string.encode_lossy_utf8 $mem
| 0xfb 0x8e:u32 $mem:u32 ⇒ string.encode_wtf8 $mem
| 0xfb 0x8b:u32 $mem:u32 ⇒ string.decode_from_lossy_utf8 $mem
| 0xfb 0x8c:u32 $mem:u32 ⇒ string.decode_from_wtf8 $mem
| 0xfb 0x8d:u32 $mem:u32 ⇒ string.encode_to_lossy_utf8 $mem
| 0xfb 0x8e:u32 $mem:u32 ⇒ string.encode_to_wtf8 $mem
| 0xfb 0x90:u32 ⇒ string.as_wtf8
| 0xfb 0x91:u32 ⇒ stringview_wtf8.advance
| 0xfb 0x92:u32 $mem:u32 ⇒ stringview_wtf8.encode_utf8 $mem
@@ -700,14 +700,14 @@ instr ::= ...
| 0xfb 0xa2:u32 ⇒ stringview_iter.advance
| 0xfb 0xa3:u32 ⇒ stringview_iter.rewind
| 0xfb 0xa4:u32 ⇒ stringview_iter.slice
| 0xfb 0xb0:u32 [gc] ⇒ string.new_utf8_array
| 0xfb 0xb1:u32 [gc] ⇒ string.new_wtf16_array
| 0xfb 0xb2:u32 [gc] ⇒ string.encode_utf8_array
| 0xfb 0xb3:u32 [gc] ⇒ string.encode_wtf16_array
| 0xfb 0xb4:u32 [gc] ⇒ string.new_lossy_utf8_array
| 0xfb 0xb5:u32 [gc] ⇒ string.new_wtf8_array
| 0xfb 0xb6:u32 [gc] ⇒ string.encode_lossy_utf8_array
| 0xfb 0xb7:u32 [gc] ⇒ string.encode_wtf8_array
| 0xfb 0xb0:u32 [gc] ⇒ string.decode_from_utf8_array
| 0xfb 0xb1:u32 [gc] ⇒ string.decode_from_wtf16_array
| 0xfb 0xb2:u32 [gc] ⇒ string.encode_to_utf8_array
| 0xfb 0xb3:u32 [gc] ⇒ string.encode_to_wtf16_array
| 0xfb 0xb4:u32 [gc] ⇒ string.decode_from_lossy_utf8_array
| 0xfb 0xb5:u32 [gc] ⇒ string.decode_from_wtf8_array
| 0xfb 0xb6:u32 [gc] ⇒ string.encode_to_lossy_utf8_array
| 0xfb 0xb7:u32 [gc] ⇒ string.encode_to_wtf8_array

;; New section. If present, must be present only once, and right before
;; the globals section (or where the globals section would be). Each
@@ -733,11 +733,11 @@ operand allows you to elide the memory, in which case it defaults to 0.
local.get $ptr
local.get $ptr
call $strlen
string.new_utf8)
string.decode_from_utf8)
```

If the bytes being decoded aren't actually valid UTF-8, this function
will trap. Use `string.new_lossy_utf8` in contexts where replacing
will trap. Use `string.decode_from_lossy_utf8` in contexts where replacing
invalid data with `U+FFFD` is a better strategy than trapping.

### Make string from an array of WTF-8 code units in memory
@@ -746,20 +746,20 @@ invalid data with `U+FFFD` is a better strategy than trapping.
(func $string-from-wtf8n (param $ptr i32) (param $len i32) (result stringref)
local.get $ptr
local.get $len
string.new_wtf8)
string.decode_from_wtf8)
```

Note that `string.new_wtf8` (and `string.new_wtf8_array`) are always
Note that `string.decode_from_wtf8` (and `string.decode_from_wtf8_array`) are always
strict decoders: if the bytes are not valid WTF-8, the instruction
traps.

### Make string from UTF-16 in memory
### Make string from WTF-16 in memory

```wasm
(func $string-from-utf16 (param $ptr i32) (param $units i32) (result stringref)
(func $string-from-wtf16n (param $ptr i32) (param $units i32) (result stringref)
local.get $ptr
local.get $units
string.new_wtf16)
string.decode_from_wtf16)
```

This proposal doesn't distinguish between UTF-16 and WTF-16 at all;
@@ -971,7 +971,7 @@ open to considering adding more instructions.

local.get $str
local.get $ptr
string.encode_utf8 ;; push bytes written, same as $len
string.encode_to_utf8 ;; push bytes written, same as $len

local.get $ptr
i32.add
@@ -986,8 +986,8 @@ Using `string.measure_utf8` ensures that the encoded string is a valid
unicode scalar value sequence. How to handle invalid UTF-8 is up to the
user; instead of `unreachable` we could throw an exception.

Note that in this case, the subsequent `string.encode_utf8` could just
as well have been `string.encode_lossy_utf8` or `string.encode_wtf8`, as
Note that in this case, the subsequent `string.encode_to_utf8` could just
as well have been `string.encode_to_lossy_utf8` or `string.encode_to_wtf8`, as
these instructions are all the same for strings that do not contain
isolated surrogates, and we checked that there were none.

@@ -1012,7 +1012,7 @@ will encode isolated surrogates as WTF-8.
local.get $cursor
global.get $buf
i32.const 1024
string.encode_wtf8 ;; push bytes written
string.encode_to_wtf8 ;; push bytes written
local.tee $bytes
(if i32.eqz (then return)) ;; if no bytes encoded, done
local.get $bytes
@@ -1445,7 +1445,7 @@ faster than `externref`+imports:
predictable performance than e.g. an encoder implemented in JS (for
web embeddings).
4. Reading string contents, either via
`string.encode_wtf8`-then-process-inline or via `stringview_wtf16`,
`string.encode_to_wtf8`-then-process-inline or via `stringview_wtf16`,
is likely faster than calling out to JavaScript to read code units
one at a time. WebAssembly-to-JavaScript calls are cheap but not
free.
@@ -1506,8 +1506,8 @@ concrete adapter function specialized to the data representations used
by the caller and the callee. The instruction set in this proposal can
be used to implement the adapter function for passing a `stringref` as a
string; assuming that the adapter function is generated in such a way
that it has access to the target memory, `string.encode_wtf8` can
implement the copy and validation at the same time. `string.new_wtf8`
that it has access to the target memory, `string.encode_to_wtf8` can
implement the copy and validation at the same time. `string.decode_from_wtf8`
would be the implementation of getting a `stringref` from an
interface-typed string value, again assuming UTF-8 encoding for these
values.