
std.os.windows: Support UTF-8 ↔ UTF-16 conversion for Windows native console I/O #16618

Closed

Conversation

@Prcuvu Prcuvu commented Jul 30, 2023

These commits aim to eliminate text encoding problems regarding Windows native console I/O.

To-do list:

@Prcuvu Prcuvu marked this pull request as draft July 30, 2023 16:13
@Prcuvu Prcuvu changed the title std: Support UTF-8 ↔ UTF-16 conversion for Windows native console I/O std.os.windows: Support UTF-8 ↔ UTF-16 conversion for Windows native console I/O Jul 30, 2023
@Prcuvu Prcuvu force-pushed the windows-console-utf16-conversion branch from 911c601 to 1a88bec Compare July 30, 2023 18:07
@squeek502
Collaborator

Thanks for working on this! At some point I'd like to benchmark reading/writing to console against master to see what the performance cost of this approach is.

Prior attempt at the same thing: #12400

@Prcuvu Prcuvu force-pushed the windows-console-utf16-conversion branch from 1a88bec to 4ff8419 Compare August 2, 2023 00:43
@Prcuvu Prcuvu marked this pull request as ready for review August 2, 2023 00:47

Prcuvu commented Aug 2, 2023

Update: I also tried UTF-16 to UTF-8 conversion for raw mode, but I ran into ferocious edge cases, especially when reading the console input partially (leaving UTF-16 code units in the console input buffer) and then switching ENABLE_LINE_INPUT on or off with SetConsoleMode(). The Windows console API simply refused to cooperate with my buffers, so I deleted that code entirely.

It should be perfectly fine as long as people don't change the console input mode mid-read. I am optimistic that the work is effectively done.

@Prcuvu Prcuvu force-pushed the windows-console-utf16-conversion branch 3 times, most recently from 6f18d6d to d5e15c5 Compare August 3, 2023 02:09
@Prcuvu Prcuvu force-pushed the windows-console-utf16-conversion branch 2 times, most recently from 2043ba0 to fffbb92 Compare August 4, 2023 06:05

notcancername commented Aug 5, 2023

Is the hidden allocation here fine? Wouldn't this be better solved with a stack-allocated buffer and multiple writes?
Also, in my personal opinion, the old broken behavior should be used if OOM occurs.


Prcuvu commented Aug 5, 2023

Is the hidden allocation here fine? Wouldn't this be better solved with a stack-allocated buffer and multiple writes?

I did that just to keep the logic simple.
It seems that Microsoft UCRT also takes the heap allocation approach, but you've got a good point. I will try rewriting it using a fixed-size buffer.


Prcuvu commented Aug 6, 2023

Done. The temporary buffer is now 1024 bytes long, and no heap allocation is involved.
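For illustration, the fixed-buffer approach can be sketched in Python (a hypothetical model, not the PR's Zig code): convert the UTF-8 payload in bounded chunks, taking care never to split a multi-byte UTF-8 sequence across a chunk boundary, and issue one write per chunk. `BUF_SIZE` and the `write_utf16` callback are stand-ins for the 1024-byte temporary buffer and WriteConsoleW.

```python
BUF_SIZE = 1024  # stand-in for the 1024-byte temporary buffer mentioned above

def write_console_chunked(data: bytes, write_utf16) -> None:
    """Convert `data` (UTF-8) to UTF-16 in chunks of at most BUF_SIZE input
    bytes and hand each converted chunk to `write_utf16` (a stand-in for
    WriteConsoleW), so no heap allocation proportional to `data` is needed."""
    pos = 0
    while pos < len(data):
        end = min(pos + BUF_SIZE, len(data))
        # Back up so the chunk ends on a code point boundary:
        # UTF-8 continuation bytes match 0b10xxxxxx.
        while end > pos and end < len(data) and (data[end] & 0xC0) == 0x80:
            end -= 1
        chunk = data[pos:end]
        write_utf16(chunk.decode("utf-8").encode("utf-16-le"))
        pos = end
```

Because every chunk ends on a code point boundary, each chunk decodes independently and the converted chunks concatenate to the correct UTF-16 stream.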

@Prcuvu Prcuvu force-pushed the windows-console-utf16-conversion branch from 8208db5 to d4fe650 Compare August 6, 2023 03:24
@Prcuvu Prcuvu force-pushed the windows-console-utf16-conversion branch from d4fe650 to 9495568 Compare October 3, 2023 21:37

Prcuvu commented Oct 4, 2023

@squeek502 Just rebased onto master head and passed all checks. Are you free to review?


squeek502 commented Oct 4, 2023

Will try writing some benchmarks / test cases when I get a chance.

In the meantime, it might be helpful if you could take a look at #12400 and compare your implementation with its (relatively simple) implementation. What are the relevant differences? Are there things that your implementation handles that wouldn't be handled by the one in #12400?


Prcuvu commented Oct 6, 2023

What are the relevant differences? Are there things that your implementation handles that wouldn't be handled by the one in #12400?

As far as I can tell after a quick glance at #12400:

  1. std: Windows Console IO Fix #12400 adds platform-specific logic and data (a FIFO buffer for leftover UTF-8 bytes and a u16 buffer for leftover UTF-16 high surrogate) in std.fs.File while my implementation keeps everything inside std.os.windows;
  2. std: Windows Console IO Fix #12400 adds a new getClampedUtf8SizeForUtf16LeSize function in std.unicode while my implementation produces a new InvalidUtf8 error that spreads into File.UpdateDeclError in src/link.zig;
  3. My implementation handles LF ↔ CRLF conversion while std: Windows Console IO Fix #12400 does not;
  4. My implementation handles Ctrl-Z while std: Windows Console IO Fix #12400 does not.


@andrewrk andrewrk left a comment


This is not good. Let's take a step back and understand what is going on before adding such hacks to the codebase.

@@ -523,19 +527,186 @@ pub fn ReadFile(in_hFile: HANDLE, buffer: []u8, offset: ?u64, io_mode: std.io.Mo
};
break :blk &overlapped_data;
} else null;
if (kernel32.ReadFile(in_hFile, buffer.ptr, want_read_count, &amt_read, overlapped) == 0) {
var console_mode: DWORD = undefined;
const is_console_handle: bool = kernel32.GetConsoleMode(in_hFile, &console_mode) != FALSE;
Member

Really? You want to call GetConsoleMode with every call to ReadFile?

Collaborator

Some context:

However, it seems Go stores the kind as "console" in newFile to skip the GetConsoleMode on read/write:

But still seems to have a GetConsoleMode call in a (lower level?) API? I'm not familiar with Go.

Contributor Author

@Prcuvu Prcuvu Oct 7, 2023

Really? You want to call GetConsoleMode with every call to ReadFile?

No, I don't.

However, std.fs.File is merely a thin wrapper around the underlying OS file handle. If we want to remember whether the OS handle is a console handle at std.fs.File creation time, we need to either add OS-specific fields to the std.fs.File data structure, or keep additional records in std.os.windows along with a lookup scheme keyed by the std.fs.File handle (either a pointer or a unique ID), just like how C runtime libraries manage POSIX file descriptors internally.

For minimally intrusive changes to the platform-agnostic code (i.e. std.fs.File), I chose to call GetConsoleMode on every ReadFile call. I would like to hear your thoughts on the design choices before making any changes.

Comment on lines +534 to +540
// There is no reliable way to implement perfectly platform-agnostic UTF-16 to UTF-8
// conversion for raw mode, because it is impossible to know the number of pending
// code units stored in console input buffer, while in cooked mode we can rely on the
// terminating LF character. Without knowing that, ReadConsoleW() may accidentally pop
// out characters without blocking, or prompt for user input at unexpected timing.
// In the case of raw mode, redirect to kernel32.ReadFile() without conversion for now,
// just don't make things worse.
Member

This is not good enough. If it cannot be implemented perfectly using the current abstractions, then the current abstractions need to change.

Contributor Author

@Prcuvu Prcuvu Oct 7, 2023

Let me elaborate on this:

The current goal of my implementation is to make users totally unaware of UTF-16 presence even if they call std.os.windows.ReadFile directly. Users are able to pass UTF-8-encoded strings everywhere, and std.os.windows.ReadFile does the translation job under the hood. It means that I need to insert a UTF-16 → UTF-8 translation layer between std.os.windows.ReadFile and kernel32.ReadConsoleW.

Here comes the key problem that gives rise to the annoying edge cases: the user-provided buffer is meant to be UTF-8 encoded. Console input (which is UTF-16 encoded) must be converted to UTF-8 before being copied into the user-provided buffer, yet the converted UTF-8 length cannot be determined in advance. When the user-provided buffer is not large enough to hold the converted console input, there are three pieces of data we have to handle carefully:

  • UTF-8 fragments. For example, a Unicode character from console input converts into a 3-byte sequence, but the user-provided buffer is only 2 bytes long. Then ReadFile should store the first 2 bytes into the user-provided buffer and stash the final byte. Whenever ReadFile is called again, this byte should be popped first before reading anything else from the console.
  • A single UTF-16 high surrogate that does not yet form a complete Unicode code point. We should stash it separately and wait for the low surrogate from another ReadConsoleW call before converting the pair into a UTF-8 byte sequence.
  • UTF-16 code units left in ReadConsoleW's dedicated buffer. This buffer is completely opaque to user-mode code and causes serious trouble, as discussed below.
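The first two stashes can be modeled in platform-agnostic terms. Below is a hypothetical Python sketch (not the Zig implementation; error handling for unpaired surrogates is omitted, and the console's dedicated buffer cannot be modeled from user mode): each `read` call takes one batch of UTF-16 code units, standing in for one ReadConsoleW result, and a caller buffer size.

```python
class ConsoleReader:
    def __init__(self):
        self.utf8_leftover = b""   # UTF-8 bytes that didn't fit last time
        self.pending_high = None   # a lone UTF-16 high surrogate, if any

    def read(self, units: list[int], bufsize: int) -> bytes:
        """`units` stands in for one ReadConsoleW result (UTF-16 code units);
        return at most `bufsize` UTF-8 bytes, stashing whatever is left."""
        out = bytearray(self.utf8_leftover)  # pop stashed fragments first
        self.utf8_leftover = b""
        for u in units:
            if self.pending_high is not None:
                # Combine the stashed high surrogate with this low surrogate.
                cp = 0x10000 + ((self.pending_high - 0xD800) << 10) + (u - 0xDC00)
                self.pending_high = None
            elif 0xD800 <= u <= 0xDBFF:
                self.pending_high = u  # stash and wait for the next read
                continue
            else:
                cp = u
            out += chr(cp).encode("utf-8")
        result, self.utf8_leftover = bytes(out[:bufsize]), bytes(out[bufsize:])
        return result
```

For example, an emoji whose surrogate pair arrives split across two reads, into a 2-byte caller buffer, comes out as an empty read, then the first half of the 4-byte UTF-8 sequence, then the stashed remainder.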

There are two modes affecting ReadConsoleW behaviour determined by ENABLE_LINE_INPUT flag (accessed with GetConsoleMode/SetConsoleMode functions):

  • Cooked mode (ENABLE_LINE_INPUT set): If CR is encountered in the dedicated buffer, an LF is appended to the user-provided buffer and reading from console ends immediately. Otherwise, wait for console input until a CR is entered (either single key press or pasting multiple characters). Everything after the CR is left inside the dedicated buffer after the read.
  • Raw mode (ENABLE_LINE_INPUT unset): If there is at least one character available in the dedicated buffer, reading from console ends immediately. Otherwise, wait for console input (either single key press or pasting multiple characters). The dedicated buffer is flushed after the read.

Cooked mode can be emulated perfectly, as in my implementation. To emulate the raw-mode behaviour, however, we have to try flushing the UTF-8 fragments first and then call ReadConsoleW to flush the dedicated buffer. The problem arises once we have popped at least one byte from the UTF-8 fragments. Because we have already "read" something (from the UTF-8 fragments), std.os.windows.ReadFile should return immediately if ReadConsoleW's dedicated buffer is empty, but calling ReadConsoleW while the dedicated buffer is empty triggers an unwanted wait for console input. As explained above, there is no way to know whether the dedicated buffer is empty other than calling ReadConsoleW, so it is a stalemate.
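The stalemate can be shown with a toy model (hypothetical Python, purely illustrative). The fake console cannot be peeked: `read_console` either returns buffered units or "blocks"; here we merely record that it would have blocked and synthesize input so the demonstration terminates.

```python
class FakeConsole:
    def __init__(self, buffered):
        self.buffered = list(buffered)  # the opaque dedicated buffer
        self.blocked = False

    def read_console(self):
        if not self.buffered:
            self.blocked = True          # real ReadConsoleW would wait here
            return ["x"]                 # pretend the user eventually typed
        out, self.buffered = self.buffered, []
        return out

def raw_read(console, leftover):
    """Emulated raw-mode read: flush leftover fragments, then drain the console."""
    if leftover:
        # We have already 'read' something; to match raw-mode semantics we
        # should return immediately if the dedicated buffer is empty, but we
        # cannot know that without calling read_console, which may block.
        return leftover + console.read_console()
    return console.read_console()
```

With a non-empty dedicated buffer this behaves correctly, but with an empty one the read blocks even though there were leftover fragments to return, which is exactly the edge case described above.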

If there are better abstractions, please let me know.

Collaborator

Relevant commit message from Go: golang/go@610d522

The readConsole implementation has changed a bit since then, the latest is here: https://github.com/golang/go/blob/6e8caefc19cae465444775f6cd107b138a26cce7/src/internal/poll/fd_windows.go#L454-L521

There does not seem to be any special handling of ENABLE_LINE_INPUT in Rust or Go at all.

@squeek502
Collaborator

For what it's worth, here are my initial reactions to these points:

  1. std: Windows Console IO Fix #12400 adds platform-specific logic and data (a FIFO buffer for leftover UTF-8 bytes and a u16 buffer for leftover UTF-16 high surrogate) in std.fs.File while my implementation keeps everything inside std.os.windows;

This is a neutral point IMO, neither a pro nor a con.

  2. std: Windows Console IO Fix #12400 adds a new getClampedUtf8SizeForUtf16LeSize function in std.unicode while my implementation produces a new InvalidUtf8 error that spreads into File.UpdateDeclError in src/link.zig;

This makes sense to me and seems like a good change. #12400 returned error.Unexpected when failing UTF-8/UTF-16 conversion which doesn't seem like a good idea.

  3. My implementation handles LF ↔ CRLF conversion while std: Windows Console IO Fix #12400 does not;

This needs some justification. Why should Zig care about LF ↔ CRLF conversion in console input/output? It doesn't seem like Rust/Go do this AFAICT.

  4. My implementation handles Ctrl-Z while std: Windows Console IO Fix #12400 does not.

👍


Prcuvu commented Oct 8, 2023

Why should Zig care about LF ↔ CRLF conversion in console input/output? It doesn't seem like Rust/Go do this AFAICT.

For console output, I agree that converting LF to CRLF brings no special benefit, as the Windows Console Host recognizes a bare LF. The conversion makes more sense when writing to a text file; I implemented it this way only because C stdio does the same for all text-mode file handles.

But for console input, if we convert only the text encoding after ReadConsoleW, we get something like "hello, world\r\n" instead of "hello, world\n". With this conversion, users only need to handle a single form of newline character. C stdio does the same thing.
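The proposed console-input post-processing amounts to two steps, sketched here in Python (illustrative, not the Zig code): decode the UTF-16 result of ReadConsoleW, then normalize the CRLF that cooked mode appends into a single LF.

```python
def console_input_to_utf8(utf16_bytes: bytes) -> bytes:
    """Decode one ReadConsoleW result (UTF-16 LE) and normalize CRLF to LF."""
    text = utf16_bytes.decode("utf-16-le")
    return text.replace("\r\n", "\n").encode("utf-8")

raw = "hello, world\r\n".encode("utf-16-le")
assert console_input_to_utf8(raw) == b"hello, world\n"
```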


squeek502 commented Oct 8, 2023

Matching libc behavior is not a goal here. In my opinion the goals should be:

  • Do the absolute minimum possible to get Windows console IO to work with Unicode input/output (plus compatibility for things like Ctrl+Z)
  • Have the smallest impact on performance of read/write as possible (both for console and non-console IO)


Prcuvu commented Oct 9, 2023

All right, I will remove LF → CRLF conversion for console output. But I still believe it is a good idea to keep CRLF → LF conversion for console input:

But for console input, if we convert only the text encoding after ReadConsoleW, we will get something like "hello, world\r\n" instead of "hello, world\n". With this conversion, users only need to take care of a single form of new-line character.

What do you think of this?

@squeek502
Collaborator

What do you think of this?

It seems like handling this via conversion within read would be more costly and error prone than the user handling CRLF line endings at the call site (which they'll already need to do for file IO generally).


Prcuvu commented Oct 9, 2023

All right, I will remove LF ↔ CRLF conversion completely and leave the job to users, given that Zig's std.fs.File has no abstraction of "text mode" versus "binary mode".


andrewrk commented May 9, 2024

I'm sorry, I didn't review this in time, and now it has bitrotted. Furthermore, so many pull requests have stacked up that I can't keep up and I am therefore declaring Pull Request Bankruptcy and closing old PRs that now have conflicts with master branch.

If you want to reroll, you are by all means welcome to revisit this changeset with respect to the current state of master branch, and there's a decent chance your patch will be reviewed the second time around.

Either way, I'm closing this now, otherwise the PR queue will continue to grow indefinitely.

@andrewrk andrewrk closed this May 9, 2024

Prcuvu commented May 10, 2024

All right. I have been busy recently, hopefully I will come back to this PR and rebase all the changes in a month or two.
