Description
Hi folks, I have ongoing work on string encoding to support cryptography in PureScript, and I am trying to isolate the cause of a problem I have run into.
I currently suspect that the cause is the way QuickCheck generates strings via Arbitrary. Generating randomly ordered code units (PureScript Char) does not comply with the way UTF-16 strings should be encoded, because it can produce surrogate code units that are not part of a valid pair. If we want to build strings by joining randomly generated pieces, those pieces should be code points, each of which is potentially made up of more than one code unit.
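To make the suspicion concrete, here is a minimal sketch in TypeScript (not PureScript, but JavaScript strings use the same UTF-16 code units). The helper name `hasLoneSurrogate` is hypothetical and not part of any library; it only illustrates how a string assembled from arbitrary code units can end up ill-formed.

```typescript
// Hypothetical helper: returns true when the string contains a surrogate
// code unit that is not part of a valid high+low surrogate pair.
function hasLoneSurrogate(s: string): boolean {
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // High surrogate: it must be followed by a low surrogate.
      const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // skip the low half of a valid pair
      } else {
        return true;
      }
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      // Low surrogate with no preceding high surrogate.
      return true;
    }
  }
  return false;
}

// U+1F600 encoded as a proper surrogate pair: well-formed.
const validPair = String.fromCharCode(0xd83d, 0xde00);
// The same two code units in the opposite order: ill-formed,
// which is what random code-unit generation can produce.
const reversedPair = String.fromCharCode(0xde00, 0xd83d);

console.log(hasLoneSurrogate(validPair));    // false
console.log(hasLoneSurrogate(reversedPair)); // true
```

An ill-formed string like `reversedPair` has no valid UTF-8 encoding, so I would not expect it to survive an encode/decode round trip unchanged.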
I think the best solution would be to redefine the arbitrary String instance so that strings are generated from code points, not code units. Failing that, a "quick and dirty" solution would be to restrict the Char range used to generate strings to U+0000 through U+D7FF, where every code point is represented by a single code unit with the same value.
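The two options could be sketched roughly as follows, again in TypeScript for illustration only. Both function names and the `rand` parameter (a function returning a number in [0, 1)) are assumptions of this sketch and do not correspond to the QuickCheck API.

```typescript
// Hypothetical sketch of option 1: pick random code points (excluding the
// surrogate range U+D800..U+DFFF) and encode each one as UTF-16 code units.
function arbitraryFromCodePoints(length: number, rand: () => number): string {
  const units: number[] = [];
  for (let i = 0; i < length; i++) {
    // 0x110000 code points minus 0x800 surrogates = 0x10F800 scalar values.
    let cp = Math.floor(rand() * 0x10f800);
    if (cp >= 0xd800) cp += 0x800; // jump over the surrogate range
    if (cp < 0x10000) {
      units.push(cp); // one code unit
    } else {
      const offset = cp - 0x10000;
      units.push(0xd800 + (offset >> 10));   // high surrogate
      units.push(0xdc00 + (offset & 0x3ff)); // low surrogate
    }
  }
  return units.map((u) => String.fromCharCode(u)).join("");
}

// Hypothetical sketch of option 2 ("quick and dirty"): restrict every
// code unit to U+0000..U+D7FF so that no surrogate can ever appear.
function arbitraryRestricted(length: number, rand: () => number): string {
  const units: number[] = [];
  for (let i = 0; i < length; i++) {
    units.push(Math.floor(rand() * 0xd800)); // 0x0000..0xD7FF inclusive
  }
  return units.map((u) => String.fromCharCode(u)).join("");
}

console.log(arbitraryFromCodePoints(5, Math.random).length); // between 5 and 10 code units
console.log(arbitraryRestricted(5, Math.random).length);     // always 5 code units
```

The first option covers the whole Unicode range at the cost of a variable number of code units per generated character; the second keeps one code unit per character but never exercises characters outside the chosen range.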
I am happy to write the PR over the weekend, if I find the time and no objections are raised here. I would also use it to verify my theory against the text encoding work I have already done.