genUnicodeString generates invalid unicode #167

Open
@martyall

The Unicode character generator picks a random CodePoint in the BMP. The Unicode string generator just generates an arbitrary array of such code points and turns it into a string. It turns out that this can generate invalid Unicode via unpaired surrogates: https://unicode.org/faq/utf_bom.html#utf16-7
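To make the failure mode concrete, here is a rough sketch in Python rather than in the library's own language. The helper names `is_surrogate` and `can_encode_utf8` are hypothetical and exist only for this illustration. It shows that the surrogate block U+D800..U+DFFF sits inside the BMP, and that a string holding a lone surrogate cannot be encoded as UTF-8.

```python
# Surrogate code points occupy U+D800..U+DFFF, inside the BMP (U+0000..U+FFFF).
SURROGATE_FIRST = 0xD800
SURROGATE_LAST = 0xDFFF
BMP_SIZE = 0x10000


def is_surrogate(code_point: int) -> bool:
    """Return True if the code point lies in the surrogate block."""
    return SURROGATE_FIRST <= code_point <= SURROGATE_LAST


def can_encode_utf8(text: str) -> bool:
    """Return True if the string encodes to UTF-8 under strict error handling."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


# A one-element "string" built from a single surrogate code point.
lone_surrogate = chr(0xD800)

# Fraction of BMP code points that are surrogates: 2048 / 65536.
surrogate_share = (SURROGATE_LAST - SURROGATE_FIRST + 1) / BMP_SIZE
```

Since 2048 of the 65536 BMP code points are surrogates, a uniform pick lands on one about 3% of the time, so any reasonably long generated string is likely to contain one.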

One solution here would be to restrict the code points to avoid such cases. Another would be to figure out a more complicated but correct way to generate Unicode, which cannot be done CodePoint by CodePoint.
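The first option could look roughly like the following Python sketch. It is not the library's implementation; `gen_bmp_scalar` and `gen_unicode_string` are hypothetical names, and the `max_len` parameter is an assumption made for the example. The idea is to draw from the 63488 non-surrogate BMP values and shift anything at or above U+D800 past the surrogate block, so no rejection loop is needed.

```python
import random

SURROGATE_FIRST = 0xD800
SURROGATE_COUNT = 0x800  # U+D800..U+DFFF
BMP_SIZE = 0x10000


def gen_bmp_scalar(rng: random.Random) -> int:
    """Pick a uniformly random BMP code point that is not a surrogate."""
    n = rng.randrange(BMP_SIZE - SURROGATE_COUNT)
    # Values below the surrogate block map to themselves; the rest are
    # shifted up so they land in U+E000..U+FFFF.
    return n if n < SURROGATE_FIRST else n + SURROGATE_COUNT


def gen_unicode_string(rng: random.Random, max_len: int = 10) -> str:
    """Build a string of 0..max_len non-surrogate BMP code points."""
    length = rng.randint(0, max_len)
    return "".join(chr(gen_bmp_scalar(rng)) for _ in range(length))
```

This keeps generation code point by code point, at the cost of never producing characters outside the BMP. Covering those would need the second, more involved approach.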

For context, I discovered this while trying to write a quickcheck test for UTF-8 encoding/decoding; you can see the failing test here
