fix: surrogate pair (emoji) handling for a, r, s, ~, ga commands#9929
fix: surrogate pair (emoji) handling for a, r, s, ~, ga commands#9929k1832 wants to merge 4 commits into
Conversation
0cdb8e1 to
14abd9a
Compare
164974e to
c3b1300
Compare
`position.getRight()` increments by 1 UTF-16 code unit, but emojis outside the Basic Multilingual Plane are encoded as 2 code units (a surrogate pair). Moving by 1 code unit lands between the pair, and VSCode's `validatePosition` clamps it back to the start — so the cursor effectively goes backward. Add `getSurrogateAwareRight()`/`getSurrogateAwareLeft()` helpers on the Position prototype that skip past surrogate pairs, and use them in the affected commands: - `a` (Append): cursor now lands after the emoji - `r` (ReplaceCharacter): replaces the full emoji, not half - `s` (ChangeOperator): deletes the full emoji before entering insert - `~` (ToggleCase): advances past the emoji without corrupting it - `ga` (UnicodeInfo): shows full codepoint (e.g. U+1F604) via codePointAt() instead of the half-surrogate from charCodeAt() Commands already protected (no changes needed): x/X (DeleteOperator), l/h (MoveRight/MoveLeft), y (YankOperator).
Address review feedback: pass the TextDocument instead of a pre-extracted line string, matching the rest of the Position API (prevWordStart, etc.). Easier to call and removes the chance of passing the wrong line's text.
ddafd57 to
963c1b2
Compare
|
Forgot to re-request a review... |
|
note about Screen capture showing the difference of going from space versus t (I Same thing is happening for a yanked word: the word is pasted before or after the emoji under cursor depending on this internal state. shift+p / P seems to be working fine |
There was a problem hiding this comment.
Pull request overview
This PR addresses incorrect handling of UTF-16 surrogate pairs (e.g., emoji and other non-BMP codepoints) in several character-level VSCodeVim commands by ensuring cursor movement and ranges don’t land between surrogate halves, which can corrupt text or misplace the cursor.
Changes:
- Added
Position.getSurrogateAwareRight()/Position.getSurrogateAwareLeft()helpers to skip over surrogate pairs safely. - Updated
a,r,~, andgato use surrogate-aware movement; updatedChangeOperatorto correct surrogate boundaries fors/change flows. - Added regression tests covering append/replace/toggle/change/delete behaviors with surrogate-pair characters.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| test/operator/surrogate.test.ts | Adds regression tests for surrogate-pair behavior across several commands. |
| src/common/motion/position.ts | Introduces surrogate-aware left/right helpers on Position. |
| src/actions/operator.ts | Adjusts ChangeOperator deletion range to avoid splitting surrogate pairs. |
| src/actions/motion.ts | Adds TODO notes about possibly adopting the new surrogate-aware helpers. |
| src/actions/commands/replace.ts | Uses surrogate-aware right movement to compute replacement range for r. |
| src/actions/commands/put.ts | Adds a TODO noting xp transpose is still broken for surrogate pairs. |
| src/actions/commands/insert.ts | Updates a (Append) to move right surrogate-safely before entering Insert mode. |
| src/actions/commands/actions.ts | Updates ~ to advance surrogate-safely and ga to use codePointAt() with surrogate-safe range. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
The bounds check counted UTF-16 code units while endPos is now codepoint-aware, so e.g. `2r` on a single emoji slipped past the guard and inserted an extra character instead of refusing the operation. Count codepoints in the spanned range instead, matching Vim's behavior of refusing the replace when fewer than `count` characters remain. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Thanks for the detailed report and the screen recording, @NanoUser3000! That |
What this PR does / why we need it:
When the cursor is on a character encoded as a UTF-16 surrogate pair — emojis (
😄), rare CJK characters (𩸽), mathematical symbols (𝒟,𝔸), musical symbols (𝄞), etc. — several character-level commands break.position.getRight()increments by 1 UTF-16 code unit, but these characters are 2 code units. Moving by 1 lands between the pair, and VSCode'svalidatePositionclamps it back to the start — so the cursor effectively goes backward.a(Append)r(Replace)s(Change char)~(Toggle case)ga(Unicode info)Previous fixes addressed insert mode (PR #7977 / #6046) and motions (
l/h) and operators (x/X/y), but these 5 commands were missed. This PR completes the surrogate pair handling for character-level commands and adds 12 regression tests to prevent future breakage.Adds
getSurrogateAwareRight()/getSurrogateAwareLeft()helpers on the Position prototype that skip past surrogate pairs, following the same pattern already used byMoveRight(l) /MoveLeft(h). Uses these helpers in the 5 affected commands. Also switchesgafromcharCodeAt()tocodePointAt()to report the full Unicode codepoint.Which issue(s) this PR fixes
Fixes #9931
Partially addresses #8321 — this PR fixes
rand other character-level commands reported there. Remaining issues not addressed: easymotion uses raw character arithmetic for match positioning and marker decorations;xp(transpose) on surrogate pairs is also still broken (TODO added input.ts).Special notes for your reviewer: