Description
The Unicode character generator picks a random `CodePoint` in the BMP. The Unicode string generator just generates an arbitrary array of such code points and turns it into a string. It turns out that this can generate invalid Unicode via unpaired surrogates (code points in the range U+D800 to U+DFFF): https://unicode.org/faq/utf_bom.html#utf16-7
One solution here would be to restrict the code points to avoid such cases; another would be to figure out a more complicated but correct way to generate Unicode, which cannot be done `CodePoint` by `CodePoint`.
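To illustrate the first option, here is a rough sketch in Python rather than in the library's own language. The names `random_bmp_scalar` and `random_unicode_string` are hypothetical and are not part of any existing generator API. It assumes we only want BMP code points and simply skip the surrogate block.

```python
import random

# Surrogate code points occupy U+D800..U+DFFF inside the BMP.
SURROGATE_START = 0xD800
SURROGATE_COUNT = 0x800
BMP_MAX = 0xFFFF


def random_bmp_scalar(rng):
    """Pick a uniformly random BMP code point that is not a surrogate.

    Hypothetical helper: draw from a range that is shorter by the size of
    the surrogate block, then shift values past the gap.
    """
    n = rng.randint(0, BMP_MAX - SURROGATE_COUNT)
    return n if n < SURROGATE_START else n + SURROGATE_COUNT


def random_unicode_string(rng, max_len=20):
    """Build a string from surrogate-free code points (hypothetical helper)."""
    length = rng.randint(0, max_len)
    return "".join(chr(random_bmp_scalar(rng)) for _ in range(length))


if __name__ == "__main__":
    rng = random.Random(0)
    s = random_unicode_string(rng)
    # Round-trips, because no lone surrogate can appear in s.
    assert s.encode("utf-8").decode("utf-8") == s

    # A lone surrogate, by contrast, cannot be encoded as UTF-8.
    try:
        chr(0xD800).encode("utf-8")
    except UnicodeEncodeError:
        print("lone surrogate rejected")
```

Shifting past the gap keeps the distribution uniform over the remaining code points, which rejection sampling would also achieve at the cost of occasional retries.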
For context, I discovered this while trying to write a QuickCheck test for UTF-8 encoding/decoding; you can see the failing test here