The behavior for unassigned codepoint of Shift_JIS is incompatible with WHATWG spec #43962

Closed
@cola119

Version

v18.5.0

Platform

No response

Subsystem

No response

What steps will reproduce the bug?

const decoder = new TextDecoder('Shift_JIS');
const s = decoder.decode(new Uint8Array([255]));

How often does it reproduce? Is there a required condition?

Always

What is the expected behavior?

const decoder = new TextDecoder('Shift_JIS');
const s = decoder.decode(new Uint8Array([255]));
console.log(s) // '�' === '\ufffd'

According to the WHATWG Encoding Standard, a decoder in the default (non-fatal) error mode should emit � (U+FFFD) whenever it encounters a byte sequence that has no assigned code point.
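As a point of comparison, here is a minimal sketch of the two error modes that `TextDecoder` exposes. It uses UTF-8 rather than Shift_JIS because UTF-8 is supported by every build; the variable names are illustrative only. The assumption is that the same replacement behavior is what the spec expects from the Shift_JIS decoder as well.

```javascript
// Default ("replacement") error mode: a malformed byte becomes U+FFFD.
const lenient = new TextDecoder('utf-8');
const replaced = lenient.decode(new Uint8Array([0xff]));
console.log(replaced === '\ufffd'); // true

// Fatal error mode: the same input throws a TypeError instead of substituting.
const strict = new TextDecoder('utf-8', { fatal: true });
let threw = false;
try {
  strict.decode(new Uint8Array([0xff]));
} catch (err) {
  threw = err instanceof TypeError;
}
console.log(threw); // true
```

Neither mode is expected to produce `'\x1A'` for undecodable input.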

What do you see instead?

const decoder = new TextDecoder('Shift_JIS');
const s = decoder.decode(new Uint8Array([255]));
console.log(s) // '\x1A'

From my investigation, ICU intentionally uses \x1A as the substitution character for unassigned code points in Shift_JIS, and Node.js passes it through unchanged.
Conversion Data - ICU Documentation
Which substitution character is used if a character cannot be converted?
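One practical consequence is that a substituted `\x1A` cannot be told apart from a genuine 0x1A byte in the input. The sketch below shows a way to probe this; `describeDecode` is a hypothetical helper written for this illustration, not part of Node.js or ICU, and the Shift_JIS result depends on the Node.js version and on whether the build includes that encoding.

```javascript
// Hypothetical helper: decode `bytes` with the given encoding label and
// return the resulting code points as "U+XXXX" strings.
function describeDecode(label, bytes) {
  let decoder;
  try {
    decoder = new TextDecoder(label);
  } catch (err) {
    // Builds without the required ICU data reject the label with a RangeError.
    return 'unsupported';
  }
  const text = decoder.decode(Uint8Array.from(bytes));
  return Array.from(text, (ch) =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
  ).join(' ');
}

// A real 0x1A byte legitimately decodes to U+001A.
console.log(describeDecode('Shift_JIS', [0x1a]));

// An affected build reports U+001A here too; a conforming one reports U+FFFD.
console.log(describeDecode('Shift_JIS', [0xff]));
```

If both calls print the same value, the caller has no way to detect that the second input was invalid.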

Additional information

ICU provides the ucnv_setSubstChars utility to set the substitution character for any encoding, and it is already available in the ICU library that Node.js uses. I'm working on this.

Labels

confirmed-bug: Issues with confirmed bugs.
encoding: Issues and PRs related to the TextEncoder and TextDecoder APIs.