Closed
Description
Version
v18.5.0
Platform
No response
Subsystem
No response
What steps will reproduce the bug?
const decoder = new TextDecoder('Shift_JIS');
const s = decoder.decode(new Uint8Array([255]));
How often does it reproduce? Is there a required condition?
Always
What is the expected behavior?
const decoder = new TextDecoder('Shift_JIS');
const s = decoder.decode(new Uint8Array([255]));
console.log(s) // '�' === '\ufffd'
According to the WHATWG Encoding Standard, any decoder should emit � (U+FFFD)
when it encounters an unassigned code point during decoding.
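To illustrate why the byte 0xFF should end up as U+FFFD, here is a minimal sketch of how the Shift_JIS decoder in the Encoding Standard treats the first byte of a sequence, as far as I understand the spec. The helper name `classifyShiftJisFirstByte` and its return labels are hypothetical and exist only for this example; this is not how Node.js or ICU implements decoding.

```javascript
// Sketch of the first-byte step of the WHATWG Shift_JIS decoder.
// `classifyShiftJisFirstByte` is a hypothetical helper, not a Node.js or ICU API.
// Assumes `byte` is an integer in the range 0..255.
function classifyShiftJisFirstByte(byte) {
  // ASCII bytes and 0x80 map directly to a code point.
  if (byte <= 0x80) return 'single';
  // 0xA1..0xDF map to half-width katakana (U+FF61 + byte - 0xA1).
  if (byte >= 0xA1 && byte <= 0xDF) return 'single';
  // 0x81..0x9F and 0xE0..0xFC start a two-byte sequence.
  if ((byte >= 0x81 && byte <= 0x9F) || (byte >= 0xE0 && byte <= 0xFC)) {
    return 'lead';
  }
  // Everything else (0xA0, 0xFD, 0xFE, 0xFF) is a decoding error,
  // which a non-fatal TextDecoder should report as U+FFFD.
  return 'error';
}

console.log(classifyShiftJisFirstByte(0xFF)); // 'error'
```

Under this reading, 0xFF is neither a single-byte character nor a valid lead byte, so the expected result of the reproduction is '\ufffd'.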
What do you see instead?
const decoder = new TextDecoder('Shift_JIS');
const s = decoder.decode(new Uint8Array([255]));
console.log(s) // '\x1A'
From my investigation, ICU intentionally uses \x1A
for unassigned code points in the Shift_JIS encoding, and Node.js passes it through as is.
Conversion Data - ICU Documentation
Which substitution character is used if a character cannot be converted?
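Because \x1A is an invisible control character, the difference between the two outputs is easy to miss when printed directly. A small helper like the following can make it visible; `toCodePoints` is a hypothetical name used only for this sketch.

```javascript
// Hypothetical debugging helper: list the code points of a string as U+XXXX.
function toCodePoints(str) {
  // Array.from iterates by code point, so surrogate pairs stay together.
  return Array.from(str, (ch) =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'));
}

console.log(toCodePoints('\x1A'));   // [ 'U+001A' ] (what is observed)
console.log(toCodePoints('\ufffd')); // [ 'U+FFFD' ] (what is expected)
```

Passing the result of `decoder.decode(...)` to this helper shows which substitution character was actually produced.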
Additional information
ICU provides the utility ucnv_setSubstChars
to specify substitution characters for any encoding, and Node.js already has it available in its ICU library. I'm working on this.