-
-
Notifications
You must be signed in to change notification settings - Fork 727
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix conversion from wchar to char in LockingTextWriter.put #6474
fix conversion from wchar to char in LockingTextWriter.put #6474
Conversation
Fixes the easier part of issue 18789.
Thanks for your pull request and interest in making D better, @aG0aep6G! We are looking forward to reviewing it, and you should be hearing from a maintainer soon.
Please see CONTRIBUTING.md for more information. If you have addressed all reviews or aren't sure how to proceed, don't hesitate to ping us with a simple comment. Bugzilla references
Testing this PR locallyIf you don't have a local development environment setup, you can use Digger to test this PR: dub fetch digger
dub run digger -- build "master + phobos#6474" |
Yup. All green. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aside from the fact that you are adding nice checks for partially written wchar surrogate pairs, when you aren't doing the same for char encodings, this looks pretty good. All of my suggestions are optional, though I would like to see the performance improvements.
std/stdio.d
Outdated
assertThrown!UTFException(writer.put(dchar('y'))); | ||
assertThrown!UTFException(writer.put(surr)); | ||
} ()); | ||
f.close(); // No idea why this is needed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if you don't include this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The readText
below sees an empty file.
And now I see why: LockingTextWriter holds a reference counted File. When the destructor throws, the File won't be destroyed and its reference count won't be decremented. So the file only gets closed at the end of the program.
Fixed by destroying file_
explicitly before throwing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, glad you figured that out!
immutable wchar surr = "\U0001F608"w[0]; | ||
auto f = File(deleteme, "w"); | ||
assertThrown!UTFException(() { | ||
auto writer = f.lockingTextWriter(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took me a few minutes to understand that the first assertThrown above is testing the throw from the lockingTextWriter dtor. Can you put in a comment to that effect?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a comment at the end of the block.
std/stdio.d
Outdated
char[4] wbuf; | ||
immutable size = encode(wbuf, d); | ||
foreach (i; 0 .. size) | ||
trustedFPUTC(wbuf[i], handle_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a lot of acrobatics for this purpose. A few comments here:
- You only need a buffer of 1 wchar.
- You don't need a "size" for checking if there are 0 or 1 wchars in there, just store a
wchar(0)
when you aren't using it. - From my experience with iopipe, you get better performance when you avoid using members to do the decoding. This would be better off being duplicated in the function which takes whole wstrings, and you can then use a local to do all the decoding/encoding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed points 1 and 2. Hopefully, the code's clearer now.
I think I'll leave point 3 for another PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good progress on the simplification. One more thing, I thought it might look better actually if you handle the surrogate pair explicitly instead of the ?: operator
else
{
dchar d = c; // default to just translate directly
if(highSurrogate)
{
immutable wchar[2] buf = [highSurrogate, c];
d = buf[].front; // doesn't this just work? Do we need to use decodeFront?
highSurrogate = 0;
}
// .. all the rest of the stuff you have
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be cool if std.utf provided a way to decode N code units directly. For example, if decode(highSurrogate, c)
just worked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your code misses the case where c is an unpaired low surrogate. But encode throws on that, so that works. Done.
If I find the spiritual strength to continue with #6469, I should add checks for UTF-8, too, yeah. |
I suppose the difference between wchar and char is that LockingTextWriter is actually writing char, so there is a one-to-one mapping. When you are writing wchar, you can't just stop in the middle of a surrogate pair because nothing has been written. When you are writing char, you can write everything given. In any case, there is no denying that unicode is messy when it comes to streaming. |
Yeah, I wouldn't validate in the char -> char case. Better just pass it through. But there's also the char -> wchar_t case. There, the situation is the same as here, and I should check for unused chars in the buffer before starting a new code point. |
OK, I think this looks good. I'll give it a bit of time, and then merge it. @wilzbach does that label do it automatically? |
No, it's manual. |
Not sure why the bot isn't merging, but I'll do it via auto-tester. Ping @wilzbach |
Auto-merge toggled on |
Because the bot only merges if all CI pass and Jenkins failed here. |
Ah, ok. Jenkins wasn't marked as required, so I assumed it would merge. |
Yeah the default behavior has been changed a few month ago as it's better if a human judges spurious CI failures and also it's better if we don't have them at all. |
Spin-off from #6469. I'm hoping that this will be relatively easy to get to work on the different platforms.
Instead of throwing a UTFException on unpaired high surrogates, a replacement character could be written. But throwing is what happens currently for unpaired low surrogates, so I went with that.