Description
In 2.6.2, it seems that code points that require more than 4 hex digits do not get stored correctly when given to annotate, for example the Rust crab glyph '🦀':
$ task 1 annotate -- $'\U0001f980'
When viewed in task 1 edit, this appears in my text editor as two 2-byte characters, 0xd83e and 0xdd80, which do map to the crab glyph as a UTF-16 surrogate pair. That is not what gets displayed when viewing the task or exporting, though (it shows as '��').
Note that I can paste the crab glyph into an annotation using task 1 edit and save it, and taskwarrior stores and displays it correctly.
I stumbled on the below when playing with the 16-bit values that got stored, to figure out how these surrogate values are encoded:
$ printf '\xd8\x3e\xdd\x80' | iconv -f UTF-16BE -t UTF-8
🦀
That is, in fact, the correct glyph, so the two characters that get stored are real UTF-16 surrogates. But why isn't the code point stored as UTF-8, the way it is when pasted? Note that my locale is set to en_US.UTF-8.
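For reference, the two stored values match the standard UTF-16 surrogate arithmetic for U+1F980. A minimal shell sketch of that calculation follows; the function name to_surrogates is made up for illustration, and this is not Taskwarrior code, just the arithmetic:

```shell
# Split a supplementary code point (above U+FFFF) into its UTF-16 surrogate pair.
# to_surrogates is a hypothetical helper name, used here only for illustration.
to_surrogates() {
    cp=$(( $1 ))                     # accepts hex input such as 0x1F980
    v=$(( cp - 0x10000 ))            # 20-bit offset into the supplementary range
    hi=$(( 0xD800 + (v >> 10) ))     # high surrogate carries the top 10 bits
    lo=$(( 0xDC00 + (v & 0x3FF) ))   # low surrogate carries the bottom 10 bits
    printf '%04x %04x\n' "$hi" "$lo"
}

to_surrogates 0x1F980   # prints: d83e dd80
```

So the stored pair is exactly what a UTF-16 encoder would produce for this code point, which suggests the escape is being expanded to UTF-16 somewhere and never converted to UTF-8.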
I am not sure what the correct behavior is, or whether this is a bug, as my understanding of Unicode is weak. However, when I annotate like this with a plane 1 (supplementary) code point, I expect that code point to be stored, not a pair of UTF-16 surrogate values. I know Taskwarrior can store such code points, because I can add them directly in task edit and they are stored and displayed correctly.
Also note that plane 0 (BMP) code points annotate just fine using the method described above. The problem only appears when the code point needs more than 16 bits.