unicode plane 2 codepoints not saved correctly by 'annotate' #3286

@smemsh

In 2.6.2, code points that require more than 4 hex digits (that is, above U+FFFF) do not seem to be stored correctly when given to annotate. For example, the Rust crab glyph '🦀' (U+1F980):

task 1 annotate -- $'\U0001f980'

When viewed with task 1 edit, this appears in my text editor as two 16-bit values, 0xd83e and 0xdd80. These do map to the crab glyph as a UTF-16 surrogate pair, but that is not what gets displayed when viewing or exporting the task (it shows as '��').
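For reference, the two values follow from the standard UTF-16 surrogate arithmetic. The sketch below only illustrates that encoding rule in POSIX shell arithmetic; it makes no claim about how taskwarrior handles the argument internally, and the variable names are mine.

```shell
# Split a code point above U+FFFF into its UTF-16 surrogate pair.
cp=$((0x1F980))                       # U+1F980, the crab glyph
offset=$((cp - 0x10000))              # 20-bit offset into the supplementary range
high=$((0xD800 + (offset >> 10)))     # top 10 bits give the high surrogate
low=$((0xDC00 + (offset & 0x3FF)))    # bottom 10 bits give the low surrogate
pair=$(printf '%04x %04x' "$high" "$low")
echo "$pair"                          # prints: d83e dd80
```

The result matches the two values seen in the editor, which suggests the code point is being converted to UTF-16 somewhere along the way rather than being corrupted at random.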

Note that I can paste the crab glyph into an annotation using task 1 edit and save it, and taskwarrior stores and displays it correctly.

I stumbled on the following while experimenting with the stored 16-bit values to figure out how these surrogate values are encoded:

 $ printf '\xd8\x3e\xdd\x80' | iconv -f UTF-16BE -t UTF-8
🦀

That is in fact the correct glyph, so the two values that get stored are a genuine UTF-16 surrogate pair. But why are they not stored as UTF-8, as they are when the glyph is pasted? Note that my locale is set to en_US.UTF-8.
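For comparison, this is the byte sequence I would expect to be stored for the same code point under UTF-8. It is a minimal sketch of the generic 4-byte UTF-8 encoding rule in POSIX shell arithmetic, not of anything taskwarrior does; the variable names are illustrative.

```shell
# Encode a code point in the range U+10000..U+10FFFF as 4 UTF-8 bytes.
cp=$((0x1F980))                          # U+1F980, the crab glyph
b1=$((0xF0 | (cp >> 18)))                # lead byte: 11110xxx
b2=$((0x80 | ((cp >> 12) & 0x3F)))       # continuation byte: 10xxxxxx
b3=$((0x80 | ((cp >> 6) & 0x3F)))        # continuation byte: 10xxxxxx
b4=$((0x80 | (cp & 0x3F)))               # continuation byte: 10xxxxxx
utf8=$(printf '%02x %02x %02x %02x' "$b1" "$b2" "$b3" "$b4")
echo "$utf8"                             # prints: f0 9f a6 80
```

A hex dump of the stored task data (for example with od -An -tx1) should show these four bytes if the annotation was saved as UTF-8.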

I am not sure what the correct behavior is, or whether this is a bug, as my understanding of Unicode is weak. However, when I annotate like this with a code point above U+FFFF, I expect that code point to be what gets stored, not UTF-16 surrogate values. I know taskwarrior can store such code points, because I can add them directly in task edit and they are stored and displayed correctly.

Also note that code points of up to 4 hex digits (the Basic Multilingual Plane) annotate just fine using the method described at the start. The problem only appears once the code point no longer fits in 16 bits.
