Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regression vs legacy conhost: different behavior from ReadConsoleOutputCharacterW when surrogate pair(s) are present. #16892

Closed
chrisant996 opened this issue Mar 18, 2024 · 12 comments · Fixed by #16898
Assignees
Labels
Area-Output Related to output processing (inserting text into buffer, retrieving buffer text, etc.) Issue-Bug It either shouldn't be doing this or needs an investigation. Needs-Triage It's a new issue that the core contributor team needs to triage at the next triage meeting Priority-1 A description (P1) Product-Conhost For issues in the Console codebase

Comments

@chrisant996
Copy link

Windows Terminal version

1.19.10573.0

Windows build number

10.0.19045.4046

Other Software

This is an API issue, and affects any software that calls ReadConsoleOutputCharacterW.
The API issue only repros with Windows Terminal, not with legacy conhost (nor with ConEmu, ConsoleZ, etc).

For example clink was affected by this when also using eza or dirx (see here for details of how this was encountered during real world usage).

Steps to reproduce

When a line of text in the console screen buffer contains one or more surrogate pairs, then the behavior ReadConsoleOutputCharacterW API does not match the documented contract, and is different from the behavior with legacy conhost.

The attached repro program demonstrates the behavior:

  • It works as expected in legacy conhost (and ConEmu, ConsoleZ, etc).
  • It malfunctions in Windows Terminal.

repro_surrogate_pairs_issue.zip

Repro:

  • Use WriteConsoleW to print a line of text that includes one or more surrogate pairs.
  • Use ReadConsoleOutputCharacterW to read back the same line.

Easy demonstration program:

  1. Create a folder and extract the files from the .zip file linked above.
  2. Optional: run the signed Repro.exe file.
  3. Or, you can build the source files and run a locally built copy of the Repro.exe program.

Expected Behavior

ReadConsoleOutputCharacterW should:

  1. Fill the out param lpNumberOfCharsRead with the number of characters read (e.g. the width of the console).
  2. Fill the out param lpCharacter with characters read from the console.
  3. Return true (success).

The demo program should first write 4 lines, then verify that reading the lines back matches what was originally written.
Each line contains Unicode codepoints that correspond to certain nerdfonts icons.

Sample expected output:

OUTPUT:

-a--- 17k 13 Mar 14:38  Aaa.cpp
-a--- 11k 13 Mar 14:44 󱗀 Bbb.xml
-a--- 10k 13 Mar 14:43 󰅲 Ccc.lisp
-a--- 11k 13 Mar 14:44  Ddd.zip

RESULTS:

Line 1 len 120 matches : -a--- 17k 13 Mar 14:38  Aaa.cpp
Line 2 len 120 matches : -a--- 11k 13 Mar 14:44 󱗀 Bbb.xml
Line 3 len 120 matches : -a--- 10k 13 Mar 14:43 󰅲 Ccc.lisp
Line 4 len 120 matches : -a--- 11k 13 Mar 14:44  Ddd.zip

Actual Behavior

Only in Windows Terminal (all versions; 1.19, 1.20, 1.20 canary):

  1. Fills the out param lpNumberOfCharsRead with 0.
  2. Does not fill the out param lpCharacter.
  3. Returns true (success).

Problem 1: It reads nothing; it should read text successfully, the same as in legacy conhost.
Problem 2: It reports success; that's inaccurate since it failed to read the text that was present.

Sample actual output:

OUTPUT:

-a--- 17k 13 Mar 14:38  Aaa.cpp
-a--- 11k 13 Mar 14:44 󱗀 Bbb.xml
-a--- 10k 13 Mar 14:43 󰅲 Ccc.lisp
-a--- 11k 13 Mar 14:44  Ddd.zip

RESULTS:

Line 1 len 120 matches : -a--- 17k 13 Mar 14:38  Aaa.cpp
Line 2 len 0   DIFFERS :
Line 3 len 0   DIFFERS :
Line 4 len 120 matches : -a--- 11k 13 Mar 14:44  Ddd.zip
@chrisant996 chrisant996 added Issue-Bug It either shouldn't be doing this or needs an investigation. Needs-Triage It's a new issue that the core contributor team needs to triage at the next triage meeting labels Mar 18, 2024
@zadjii-msft
Copy link
Member

I'm thinking this is in the neighborhood of #13339

@chrisant996
Copy link
Author

chrisant996 commented Mar 18, 2024

I could potentially understand cutting the ability to read such lines. Although Windows has always been able to access the console buffer, *nix has never been able to.

At a minimum, though, the API should return failure (false) when it fails. The main problem in the case at hand wasn't even the failure to read the console text; the main issue was the API reported that it succeeded but it had actually failed.

Returning failure would have avoided the real world usage problem linked above, since the code was already actively checking for failures. Ideally the API would succeed and behave how it did in legacy conhost -- but if it won't, then returning that there was a failure lets the app handle fallback appropriately.

@lhecker
Copy link
Member

lhecker commented Mar 18, 2024

The regression is not intentional and unrelated to the intentional ones in #13339, but it is in the neighborhood. I think it's caused by #13626 which added proper support for surrogate pairs but changed the behavior of TextBufferCellIterator. It now iterates in columns instead of characters and this breaks single-cell surrogate pairs. The fact that this ever worked is mostly a coincidence because the iterator code is based on UCS2 and not UTF16, but still somewhat dealt with surrogate pairs (albeit poorly).

This internal regression causes the buffer to be filled with (for instance) 121 chars in a 120 column terminal, but you're only giving a 120 char buffer in your repro. The following code then drops the returned text and still returns S_OK. It has existed long before my changes to support surrogate pairs:

// Only copy if the whole result will fit.
if (chars.size() <= buffer.size())
{
std::copy(chars.cbegin(), chars.cend(), buffer.begin());
written = chars.size();
}

I've had a brief glance at the original conhost v1 code and it doesn't seem like this behavior existed there, so I believe this is another regression that was introduced in the original conhost v2 release.

I'm submitting a PR to fix this issue soon. It'll then correctly count the number of characters again.

@lhecker lhecker self-assigned this Mar 18, 2024
@lhecker lhecker added Product-Conhost For issues in the Console codebase Priority-1 A description (P1) Area-Output Related to output processing (inserting text into buffer, retrieving buffer text, etc.) labels Mar 18, 2024
@chrisant996
Copy link
Author

chrisant996 commented Mar 19, 2024

@lhecker ah ok that makes sense.

Thank you for working on a fix quickly!




It sounds like you're aware of everything I'm about to add, but for completeness, I'll still call these out overtly:

The trouble with 120 vs 121, is that the API signature uses the nLength param to mean two different things:

  1. The size of the buffer to receive the read content.
  2. The number of cells to read.

So, if the screen width is 120 and a caller passes 121, then on a line with no surrogate pairs the 121st character will be read from the next line. And that requires a caller to post-process the results to figure out how much of the returned content corresponds to the target line.

A caller that has a robust Unicode iterator could ask for e.g. 2 * the screen width and then parse the returned content, measuring widths (e.g. with the wcwidth functions) until they exceed the width of the screen. But in a line that contains zero width joiners (or emoji qualifiers or etc), it could require much more than 2 * the screen width. (And my projects do have a robust Unicode iterator, so anywhere they need the full actual content, they can indeed use iterative calls -- but for the most part they just want to know if a line is empty, i.e. contains only spaces.)

I recognize that it's problematic to fit Unicode into the console APIs whose original signatures aren't detailed enough to accommodate surrogates and joiners (or anything else beyond UCS2). I really like the balanced compromises that the API has made for that over the years. For example, a caller can iteratively call and analyze the returned output and piece together an accurate representation over multiple calls (or a single extra-large calls and then parse the results), regardless of how many WCHARs are required to fetch a full screen line of text. That's pretty cool, albeit complex, and gives callers the ability to operate to whatever degree of "correctness" or "thoroughness" that they wish to.

Thanks again for the speedy response and follow-up!

@lhecker
Copy link
Member

lhecker commented Mar 19, 2024

I've also been wondering if we should just replace surrogate pairs with U+FFFD to make this behave more consistent. While I also agree with your point ("gives callers the ability to [...]"), the support for surrogate pairs makes this API different from the CHAR_INFO APIs which don't ever support them.

@chrisant996
Copy link
Author

I've also been wondering if we should just replace surrogate pairs with U+FFFD to make this behave more consistent. While I also agree with your point ("gives callers the ability to [...]"), the support for surrogate pairs makes this API different from the CHAR_INFO APIs which don't ever support them.

Have there been complaints about the existing behavior over the past 20+ years that it's behaved the current way?

It would be a breaking change. Personally, I'm very happy about how it's been working for years.

But since English is my primary language, surrogate pairs are an outlier for me and only appear when certain nerdfont icons or emoji are present. It might be useful to gather opinions from software developers for whom surrogate pairs are common in normal language text.

If there are neither complaints nor feedback, then I would lean towards preserving compatibility with previous behavior.

@chrisant996
Copy link
Author

@lhecker also, if surrogate pairs would be stripped, then I assume the intent would also be to strip zero width joiners, and all combining marks, right? There's a lot of room for complications and impactful side effects.

@lhecker
Copy link
Member

lhecker commented Mar 19, 2024

Have there been complaints about the existing behavior over the past 20+ years that it's behaved the current way?

It would be a breaking change. [...]

That's true, sort of. The console so far was (in pseudocode terms) a TCHAR[height][width] buffer for the longest time. Even when TCHAR got replaced with WCHAR in conhost v2, surrogate pair support was rudimentary and only worked properly in a few select spots. The underlying buffer was still a rectangular matrix.

The reason your repro case worked at all is because that broken UTF16 implementation coincidentally counted the narrow cell with a surrogate pair as a wide cell and thus the iterator stepped over both the leading and trailing surrogate pair. In other words, before I "fixed" it, it actually thought that the row had 121 columns (as an example), even though the buffer is only 120 wide. This caused a ton of issues and regressions elsewhere.

So, while it's definitely a breaking change by definition (= change in behavior causing failures in other components), I don't believe that the exact contents that can be read via these APIs are properly specced out yet and never were. Sure, ReadConsoleOutputCharacterW happened to work coincidentally, but what about the other Win32 APIs and VT sequences?

But let me get to a much more concerning point...

@lhecker also, if surrogate pairs would be stripped, then I assume the intent would also be to strip zero width joiners, and all combining marks, right? There's a lot of room for complications and impactful side effects.

This is exactly my biggest concern right now. I'm planning to add proper Unicode support with extended grapheme clusters in the near term. That is, support for ZWJs, VS15/16, combining marks, and so on. I'm rather worried that this will make ReadConsoleOutputCharacterW behave extremely erratic, in particular in respect to the CHAR_INFO APIs which must return U+FFFD for such complex graphemes.

Thoughts?

@chrisant996
Copy link
Author

chrisant996 commented Mar 19, 2024

@lhecker that all makes sense.

I can confirm that for my own purposes, the proposed change to RCOCW() would not cause any negative side effects in any of my projects (not even for users in locales where surrogate pairs are common). And specifically, Clink will not encounter problems from such a change. Because none of my projects do anything more than look for empty lines, anymore (well, Clink can "scrape" the screen for text to use in completions, but it's ok for that to miss some things).



Something to consider for the future:

It would be fantastic to have some way to save/restore the terminal screen (or a region within the terminal screen), and preserve arbitrarily complex text and colors and styles. I don't care whether it's an escape code or an API, and it would be perfectly fine for the saved state to be opaque (e.g. if an API filled an out-param buffer with some opaque payload that only the API knows how to unpack and restore -- though of course someone would try to "crack" the payload, and the API would need to guard against fuzzing, etc).

The Alternate Screen Buffer private mode codes (e.g. CSI ? 1049 h and CSI ? 1049 l) are almost suitable, except that there's no way to copy the Normal Screen Buffer to the Alternate Screen Buffer, so while the save/restore part works they also clear the Alternate Screen Buffer and thus are useless for showing a "popup" or "overlay" over the existing terminal screen.

FWIW, I stopped using ReadConsoleOutputW once it couldn't preserve colors and styles beyond the 4-bit VGA colors. But, I strongly wish there were a way to save and restore either the whole screen or a region of the screen (including colors, styles, and full Unicode text). The problem with the guidance "apps should keep track of what they print" is that sometimes an app wants to show a temporary popup window in a region of the screen that was printed by some other app. PowerShell's progress bars is one example, the Azure CLI installer is another example (but maybe it's just another PowerShell progress bar, I'm not sure), and there are several places in my own projects where I want to save/restore a region of the screen, but no longer can. In the case of Clink, it scrolls the screen to make room for the popup at the bottom of the screen where it doesn't overlay anything else -- it's functional, but it yields an unpolished and less pleasant experience.

@lhecker
Copy link
Member

lhecker commented Mar 19, 2024

For better or worse, the question about whether to use U+FFFD or not has been answered...
The DBCS implementation has some fairly exhaustive unit tests and those tests are failing now if I properly support narrow-cell surrogate pairs. The unit tests make is very clear that the nLength parameter in ReadConsoleOutputCharacterW is intended as a column count.

I'll make it return U+FFFD for now as that was my plan for the CHAR_INFO API anyways, which have the same restriction (CHAR_INFO is explicitly per-column/row after all).

@zadjii-msft
Copy link
Member

throwaway thought before I leave for lunch: #10810 sounds a lot like what you want

@chrisant996
Copy link
Author

throwaway thought before I leave for lunch: #10810 sounds a lot like what you want

Yes @zadjii-msft that'd be a great way for save/restore to work (i.e. to use pages and copy back and forth between pages, which I think are essentially a collection of alternate screen buffers).

It's currently NYI, though, right?

And yes, Far is an excellent example of an app that relies on accurately saving/restore content printed by other apps. Any solution that's viable for Far would meet even all of the "stretch" goals that I've ever had for what I would need from a save/restore mechanism.

DHowett pushed a commit that referenced this issue Mar 20, 2024
The `nLength` parameter of `ReadConsoleOutputCharacterW` indicates
the number of columns that should be read. For single-column (narrow)
surrogate pairs this previously clipped a trailing character of the
returned string. In the major Unicode support update in #13626
surrogate pairs truly got stored as atomic units for the first time.
This now meant that a 120 column read with such codepoints resulted
in 121 characters. Other parts of conhost still assume UCS2 however,
and so this results in the entire read failing.

This fixes the issue by turning surrogate pairs into U+FFFD
which makes it UCS2 compatible.

Closes #16892

* Write U+F15C0 and read it back with `ReadConsoleOutputCharacterW`
* Read succeeds with a single U+FFFD ✅

(cherry picked from commit 373faf0)
Service-Card-Id: 92121912
Service-Version: 1.20
DHowett pushed a commit that referenced this issue Mar 20, 2024
The `nLength` parameter of `ReadConsoleOutputCharacterW` indicates
the number of columns that should be read. For single-column (narrow)
surrogate pairs this previously clipped a trailing character of the
returned string. In the major Unicode support update in #13626
surrogate pairs truly got stored as atomic units for the first time.
This now meant that a 120 column read with such codepoints resulted
in 121 characters. Other parts of conhost still assume UCS2 however,
and so this results in the entire read failing.

This fixes the issue by turning surrogate pairs into U+FFFD
which makes it UCS2 compatible.

Closes #16892

* Write U+F15C0 and read it back with `ReadConsoleOutputCharacterW`
* Read succeeds with a single U+FFFD ✅

(cherry picked from commit 373faf0)
Service-Card-Id: 92129542
Service-Version: 1.19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area-Output Related to output processing (inserting text into buffer, retrieving buffer text, etc.) Issue-Bug It either shouldn't be doing this or needs an investigation. Needs-Triage It's a new issue that the core contributor team needs to triage at the next triage meeting Priority-1 A description (P1) Product-Conhost For issues in the Console codebase
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants