-
Notifications
You must be signed in to change notification settings - Fork 158
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix UTF-8 decoding of lazy bytestrings #333
Conversation
I don't understand the doctest errors on GHC 8+, and on GHC 7+ the errors are that old versions of bytestring did not have |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doctests are failing in master
as well (I guess it has something to do with a new version of doctest
package?.. Dunno)
Could you please check a coverage report to ensure that all lines are well-tested?
b42b59c
to
de7c071
Compare
That's probably fine, I guess. Looks good to me except GHC < 7.6 builds. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
(Found a silly space.) |
@Lysxia could you please resolve a conflict? |
At the beginning of a new chunk we may be trying to complete a UTF-8 sequence started in the previous chunk (contained in the `undecode0` buffer). If it turns out to be invalid, we must apply the `onErr` handler to every character in that buffer. When we reach the end of the chunk, we must also be more careful about when to keep the previous buffer: a UTF-8 sequence (up to 4 bytes) can span more than two chunks, when those chunks are very short (of length 0, 1, or 2).
Fixes #330.
I found another issue that's not yet fixed: both strict and lazy
decodeUtf8With
are actually memory unsafe if you use a badonErr
argument. We allocate a destination buffer with 2x the number of bytes from the original bytestrings, but by using anonErr
function which replaces any invalid byte with aChar
which is a surrogate pair in UTF-16, it possible to blow up the size taken by aText
to 4x. In practice,onErr
is almost alwayslenientDecode
though, so perhaps a better solution than allocating more memory is to either hidedecodeUtf8With
or clamp the range ofonErr
.