U+FFFE and U+FFFF encoded wrongly #548

nwellnhof · 2024-04-23T17:40:42Z

cmark_utf8proc_encode_char was copied from an old version of the utf8proc project and, for whatever reason, special-cases U+FFFE and U+FFFF, so these code points are serialized as invalid UTF-8. This can be triggered when parsing numeric character references and with some renderers:

% python3 -c 'print(chr(0xFFFF))' |build/src/cmark -t commonmark |hexdump -C
00000000  ff 0a                                             |..|
00000002
% echo '&#xFFFF;' |build/src/cmark |hexdump -C
00000000  3c 70 3e ff 3c 2f 70 3e  0a                       |<p>.</p>.|
00000009

The expected UTF-8 sequence for U+FFFF is EF BF BF (and EF BF BE for U+FFFE).
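For reference, here is a minimal sketch of the generic UTF-8 encoding rules with no special case for U+FFFE or U+FFFF. It is not the cmark code or the actual patch; `encode_utf8` and `is_valid_utf8` are hypothetical helper names used only for illustration.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one Unicode scalar value as UTF-8 (illustrative helper, not cmark's code)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp < 0x80:
        # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        # U+FFFE and U+FFFF fall in this branch like any other BMP code point.
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 with Python's strict decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    print(encode_utf8(0xFFFF).hex(" "))  # expected sequence for U+FFFF
    print(encode_utf8(0xFFFE).hex(" "))  # expected sequence for U+FFFE
    print(is_valid_utf8(b"\xff"))        # the lone byte seen in the hexdump above
```

The lone `ff` byte in the hexdump output above can never appear in well-formed UTF-8, which is why the strict decoder rejects it.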

nwellnhof added a commit to nwellnhof/cmark that referenced this issue Apr 23, 2024

utf8: Fix encoding of U+FFFE and U+FFFF

80e9d1f

Fixes commonmark#548.

nwellnhof added a commit to nwellnhof/cmark that referenced this issue Apr 23, 2024

utf8: Fix encoding of U+FFFE and U+FFFF

12f5205

Fixes commonmark#548.

nwellnhof mentioned this issue Apr 23, 2024

utf8: Fix encoding of U+FFFE and U+FFFF #549

Merged

jgm closed this as completed in #549 Apr 23, 2024

jgm pushed a commit that referenced this issue Apr 23, 2024

utf8: Fix encoding of U+FFFE and U+FFFF

2f31b3c

Fixes #548.
