Arbitrary Strings Not Properly Encodable. #94

Open
@AlexaDeWit

Hi folks, I am working on string encoding to support cryptography in PureScript, and I am trying to isolate the cause of a problem I have run into.

I currently suspect that the cause is the way quickcheck generates strings via Arbitrary. Building a string from independently random code units (PureScript Char) does not always produce well-formed UTF-16, because surrogate code units can end up unpaired or in the wrong order. If we want to build random strings by joining random pieces, those pieces should be code points (CodePoint), each of which may span more than one code unit.
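To illustrate the suspicion above, here is a minimal sketch in TypeScript, assuming PureScript strings on the JavaScript backend behave like ordinary JavaScript UTF-16 strings. The helper name `isWellFormedUtf16` is hypothetical and not part of quickcheck or any PureScript library; it only checks that every surrogate code unit belongs to a high-then-low pair.

```typescript
// Hypothetical helper: true when every surrogate code unit in `s`
// is part of a valid (high surrogate, low surrogate) pair.
function isWellFormedUtf16(s: string): boolean {
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // High surrogate: the next unit must be a low surrogate.
      const next = i + 1 < s.length ? s.charCodeAt(i + 1) : -1;
      if (next < 0xdc00 || next > 0xdfff) return false;
      i++; // skip the low surrogate we just validated
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      // Low surrogate with no preceding high surrogate.
      return false;
    }
  }
  return true;
}

// Strings of the kind a code-unit-based generator could produce.
const loneHigh = String.fromCharCode(0xd83d);
const validPair = String.fromCharCode(0xd83d, 0xde00); // U+1F600
const reversedPair = String.fromCharCode(0xde00, 0xd83d);
```

Only `validPair` passes the check. The other two are sequences that a generator picking each code unit independently can emit, and they have no valid UTF-8 encoding.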

I think the best solution would be to redefine the arbitrary string so that it is generated from code points, not code units. Failing that, a "quick and dirty" solution would be to restrict the char range used to generate strings to U+0000 through U+D7FF, where every code point is represented by a single code unit with the same value.
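Both options can be sketched as follows, again in TypeScript and again only as an illustration, not as the quickcheck implementation. All names here (`Rand`, `makeRand`, `encodeScalar`, `randomScalar`, `stringFromCodePoints`, `stringFromRestrictedUnits`) are hypothetical. One assumption worth stating: a code-point-based generator also has to skip the surrogate range U+D800 to U+DFFF, otherwise it reintroduces the same problem.

```typescript
// Source of randomness: returns a float in [0, 1).
type Rand = () => number;

// Small deterministic generator (Park-Miller), used so examples are repeatable.
// Assumes a seed in 1..2147483646.
function makeRand(seed: number): Rand {
  let state = seed;
  return () => {
    state = (state * 48271) % 2147483647;
    return state / 2147483647;
  };
}

// Encode one Unicode scalar value as one or two UTF-16 code units.
function encodeScalar(cp: number): string {
  if (cp < 0x10000) return String.fromCharCode(cp);
  const offset = cp - 0x10000;
  return String.fromCharCode(0xd800 + (offset >> 10), 0xdc00 + (offset & 0x3ff));
}

// Pick a code point in U+0000..U+10FFFF, skipping the 2048 surrogates.
function randomScalar(rand: Rand): number {
  const n = Math.floor(rand() * (0x110000 - 0x800));
  return n < 0xd800 ? n : n + 0x800;
}

// Preferred option: `count` random code points, each encoded correctly.
function stringFromCodePoints(count: number, rand: Rand): string {
  let out = "";
  for (let i = 0; i < count; i++) out += encodeScalar(randomScalar(rand));
  return out;
}

// "Quick and dirty" option: code units limited to U+0000..U+D7FF.
function stringFromRestrictedUnits(count: number, rand: Rand): string {
  let out = "";
  for (let i = 0; i < count; i++) {
    out += String.fromCharCode(Math.floor(rand() * 0xd800));
  }
  return out;
}
```

The restricted-range version never emits a surrogate, so its output is always well-formed, but it never exercises astral characters or U+E000 to U+FFFF. The code-point version covers the full range at the cost of strings whose length in code units varies.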

I am happy to write the PR over the weekend if I find the time and no objections are raised here. I would also use it to verify my theory against the text encoding work I have already done.
