genUnicodeString generates invalid unicode #167

The [unicode character generator](https://github.com/purescript/purescript-strings/blob/master/src/Data/Char/Gen.purs#L10-L11) picks a random `CodePoint` in the Basic Multilingual Plane (BMP). The [unicode string generator](https://github.com/purescript/purescript-strings/blob/master/src/Data/String/Gen.purs#L16-L19) just generates an arbitrary array of such code points and turns it into a string. Because the BMP range includes the surrogate code points (U+D800 to U+DFFF), this can produce invalid Unicode in the form of unpaired surrogates: https://unicode.org/faq/utf_bom.html#utf16-7
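To make the failure mode concrete, here is a minimal TypeScript sketch of a well-formedness check for UTF-16 strings. It is only an illustration, assuming the generated string ends up as a JavaScript string at runtime (as with PureScript's default JavaScript backend). The name `isWellFormedUtf16` is hypothetical and not part of purescript-strings.

```typescript
// Hypothetical helper, not part of purescript-strings.
// Returns true when every surrogate code unit in `s` belongs to a
// high/low pair, i.e. the string contains no unpaired surrogates.
function isWellFormedUtf16(s: string): boolean {
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // A high surrogate must be immediately followed by a low surrogate.
      if (i + 1 >= s.length) return false;
      const next = s.charCodeAt(i + 1);
      if (next < 0xdc00 || next > 0xdfff) return false;
      i++; // skip the low surrogate that completes the pair
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      // A low surrogate that was not preceded by a high surrogate.
      return false;
    }
  }
  return true;
}
```

A string built from independently chosen BMP values can contain, for example, a lone `0xD800`, which this check rejects.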

One solution would be to restrict the code points so that the surrogate range is excluded. Another would be a more complicated but correct generator that does not build the string one `CodePoint` at a time.
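The first option (restricting the range) could look roughly like the following TypeScript sketch. It is not the library's implementation; `genBmpScalar` and `genStringNoSurrogates` are hypothetical names, and the `random` parameter is an assumption added so the sketch can be exercised deterministically.

```typescript
// Hypothetical sketch, not the purescript-strings implementation.
const SURROGATE_START = 0xd800;
const SURROGATE_END = 0xdfff;
const SURROGATE_COUNT = SURROGATE_END - SURROGATE_START + 1; // 2048

// Picks a BMP value uniformly from 0x0000..0xFFFF, skipping the
// surrogate range 0xD800..0xDFFF. `random` must return a number in [0, 1).
function genBmpScalar(random: () => number = Math.random): number {
  const n = Math.floor(random() * (0x10000 - SURROGATE_COUNT));
  // Values at or above the start of the gap are shifted past it.
  return n < SURROGATE_START ? n : n + SURROGATE_COUNT;
}

// Builds a string of `length` code units, none of which is a surrogate,
// so the result cannot contain an unpaired surrogate.
function genStringNoSurrogates(
  length: number,
  random: () => number = Math.random
): string {
  const units: number[] = [];
  for (let i = 0; i < length; i++) {
    units.push(genBmpScalar(random));
  }
  return String.fromCharCode(...units);
}
```

This trades coverage for simplicity: characters outside the BMP are never generated. Covering them would need the second option, where a high and a low surrogate are emitted together as one unit.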

For context, I discovered this while trying to write a QuickCheck test for UTF-8 encoding/decoding. You can see the failing test [here](https://github.com/f-o-a-m/purescript-bytestrings/pull/1/files#diff-71732b478b4808898d86c8591ad7ab46d8122c1e4facec4a9151ac49efba905dR91-R93).
