-
Notifications
You must be signed in to change notification settings - Fork 158
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow caller more control in stream decoding #448
Conversation
Rather than STT, safer alternatives would be to specialize this to one specific monad (was state the original motivation?) or to use a different algorithm that does not involve STT. |
State (along with reader) was secondary. My primary motivation was to multiple monads in a stack (I use IO and continuation frequently). Specializing is not ideal, since there'd have to be a different one not just for each monad, but each combination of monads. |
STT is notably unsafe with continuations. |
I'm not very happy with the API of decodeUtf8With2 :: ByteString -> ByteString -> Either Int (Text, ByteString) Then clients are free to interpret |
Removed.
Done (though not in that exact manner). Check out the test case |
This still has the same unsafety as STT, since it's just an inlining of the STT logic. You can cause a segfault by specializing to |
Ahhh... I see what you mean, and it'd be a problem with the list monad, too. However, I noticed that So I applied the same concept around the call to the error handler. Before the error handler is called, it runs through the processes just before exiting; the text array is frozen, no longer modified, and the reference to last state value is discarded. Then after the error handler returns, it simulates |
Changing the API to not take a callback at all and instead return an Passing the |
While I understand the desire to use With the requirements and constraints of
I think I found a solution that'll fit the bill. |
This is actually close to what we were suggesting, and indeed much more satisfactory. |
Excellent! A couple of questions:
|
For the first question, I went ahead and let the library track the absolute position for the caller. |
Stream decoders added for utf-16 and utf-32, and an |
We are getting closer to the final cut off before GHC 9.4. @Lysxia could you please complete your review on this? I don't have much bandwidth at the moment. |
@Bodigrim How much time is left for the cut off? I'll allocate more cycles to shepherd this PR (sorry for dragging this @david-sledge !) but I feel like this is going to take a couple more rounds. |
I'm not sure exactly. Given that @bgamari has not chased us yet, it would probably take at least a week or more. |
Given #453 (comment), I'll cut |
This branch no longer uses simdutf anymore. Should all of it and references to it be removed, or is it worth modifying simdutf to have a |
simdutf is there for performance, so if it is no longer used that needs to be justified by benchmarks. I suspect it's going to be difficult to compete with a vectorized implementation. |
2862fed
to
d347bba
Compare
I "accepted" my own change to remove the "requested change" flag in the Github UI. Please review :) And sorry for the messy git history. I'm assuming this is going to be squashed. I added some optimizations so the benchmarks run at least as fast as before. Most of the time should be spent looping in C++, so there should be no change (confirmed by getting a screenful of "same as baseline" where "baseline" is master), except for the "tiny" benchmark where the work in entering and exiting the function dominates. Somehow the "LazyText" benchmark for the "tiny" input is consistently 25% faster even on my noisy machine. (The strict "Text" benchmark for "tiny" is "same as baseline"; it used to be 3x slower, and that's what my last round of commits fixes).
Note that the benchmarks test valid UTF-8. When there are invalid bytes, the old code performs badly (quadratically) anyway as reported in issue #495. This PR fixes #495. |
a2c7f26
to
e136cad
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just couple of suggestions, otherwise looks very good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great job!
Thanks to @david-sledge for getting this started and your great work! |
The test is due to a mismatch between Unicode versions (text = Unicode 14, GHC 8.6 = ???). We probably should disable those tests for old GHC (and remain notified when new GHC/Unicode break this test again). |
Yeah, @Lysxia could you please rebase? |
Uh, I can if that's what we want to do. But I thought it would be cleaner to squash it. In the Github UI "Rebase and merge" is grayed out but we can still just select "Squash and merge". Is that okay? |
Ah, I see. Yes, squashing is OK, please go ahead. |
Hi @david-sledge, I want to decode some utf8 from a socket, and abort early if I get any invalid data.
It looks like the suffix can also contain input that wasn't parsed because the parser needs more bytes to determine whether it has a valid code point. Is that correct? |
@414owen You can look at the third component of the returned triple In the doc of
|
I've been wanting this feature for a while, so I decided to prototype it and create a pull request.