Buffer.toString('utf8') appears to use wtf-8 #23280

Closed
@AljoschaMeyer

Description

The byte sequence 237, 166, 164 is not valid UTF-8: it encodes the surrogate code point U+D9A4, and surrogates are not Unicode scalar values. So Buffer.from([237, 166, 164]).toString('utf8') should error. Instead, it returns a string, effectively implementing WTF-8 rather than UTF-8.
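To make the claim about the byte sequence concrete, here is a minimal sketch that unpacks the three bytes by hand using the standard 3-byte UTF-8 bit layout. The helper name decodeThreeByteSequence is hypothetical and exists only for this illustration; it does no validation of its own.

```javascript
// Hypothetical helper: unpack the payload bits of a 3-byte UTF-8 pattern
// (1110xxxx 10xxxxxx 10xxxxxx) without validating the result.
function decodeThreeByteSequence(b0, b1, b2) {
  return ((b0 & 0x0f) << 12) | ((b1 & 0x3f) << 6) | (b2 & 0x3f);
}

const codePoint = decodeThreeByteSequence(237, 166, 164);

// Surrogates occupy U+D800..U+DFFF and are excluded from Unicode scalar values.
const isSurrogate = codePoint >= 0xd800 && codePoint <= 0xdfff;

console.log(codePoint.toString(16)); // d9a4
console.log(isSurrogate); // true
```

A conforming UTF-8 decoder has to reject this sequence precisely because the unpacked value falls in the surrogate range.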

Or does Buffer.toString simply not provide any validity guarantees at all, returning garbage strings if the buffer contains invalid input? In that case, please document this as expected behavior, since it makes the function useless for any use case that depends on rejecting invalid input.
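For use cases that need invalid input to be rejected, one possible workaround is the WHATWG TextDecoder with the fatal option. This is a sketch, not a statement about what Buffer.toString guarantees: it assumes a runtime where TextDecoder is available as a global and supports fatal: true (in Node.js this depends on the version and on ICU support in the build). The name decodeUtf8Strict is hypothetical.

```javascript
// Hypothetical helper, not a Node.js API: decode bytes as UTF-8 and
// throw instead of returning a string when the input is malformed.
function decodeUtf8Strict(bytes) {
  // With fatal: true, decode() throws a TypeError on invalid sequences.
  return new TextDecoder('utf-8', { fatal: true }).decode(bytes);
}

let rejected = false;
try {
  decodeUtf8Strict(Uint8Array.from([237, 166, 164]));
} catch (err) {
  rejected = err instanceof TypeError;
}
console.log(rejected); // true

// A well-formed sequence (U+20AC, the euro sign) still decodes normally.
console.log(decodeUtf8Strict(Uint8Array.from([0xe2, 0x82, 0xac])));
```

Since a Buffer is a Uint8Array, the same helper can be called with a Buffer directly.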

node -v: v10.11.0
uname -a: Linux aljoscha-laptop 4.18.10-arch1-1-ARCH #1 SMP PREEMPT Wed Sep 26 09:48:22 UTC 2018 x86_64 GNU/Linux

See also rust-lang/rust#54845

edit: This also leaks into JSON.parse, which can accept garbage strings even though ECMA-404 (the JSON standard prescribed for JSON.parse as defined in ECMAScript) only allows valid UTF-8 input.
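One way to observe JSON.parse producing a string that is not well-formed Unicode is sketched below. It is an assumption that this is the kind of case meant here; the example uses a \u escape for the lone surrogate rather than decoded buffer contents. The helper hasLoneSurrogate is hypothetical.

```javascript
// Hypothetical helper: report whether a JS string contains a surrogate
// code unit that is not part of a valid high/low pair.
function hasLoneSurrogate(str) {
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      const next = str.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // valid pair, skip the low surrogate
        continue;
      }
      return true; // high surrogate without a following low surrogate
    }
    if (unit >= 0xdc00 && unit <= 0xdfff) return true; // unpaired low surrogate
  }
  return false;
}

// JSON text whose string value is the single escape \ud9a4.
const parsed = JSON.parse('"\\ud9a4"');

console.log(parsed.length); // 1
console.log(parsed.charCodeAt(0).toString(16)); // d9a4
console.log(hasLoneSurrogate(parsed)); // true
```

A caller that needs well-formed strings would have to run a check like this after parsing.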

Labels

buffer: Issues and PRs related to the buffer subsystem.
doc: Issues and PRs related to the documentations.
encoding: Issues and PRs related to the TextEncoder and TextDecoder APIs.
help wanted: Issues that need assistance from volunteers or PRs that need help to proceed.
