Skip to content

fix: surrogate pair (emoji) handling for a, r, s, ~, ga commands#9929

Open
k1832 wants to merge 4 commits into
VSCodeVim:masterfrom
k1832:fix/surrogate-pairs
Open

fix: surrogate pair (emoji) handling for a, r, s, ~, ga commands#9929
k1832 wants to merge 4 commits into
VSCodeVim:masterfrom
k1832:fix/surrogate-pairs

Conversation

@k1832
Copy link
Copy Markdown
Contributor

@k1832 k1832 commented Feb 10, 2026

What this PR does / why we need it:

When the cursor is on a character encoded as a UTF-16 surrogate pair — emojis (😄), rare CJK characters (𩸽), mathematical symbols (𝒟, 𝔸), musical symbols (𝄞), etc. — several character-level commands break.

position.getRight() increments by 1 UTF-16 code unit, but these characters are 2 code units. Moving by 1 lands between the pair, and VSCode's validatePosition clamps it back to the start — so the cursor effectively goes backward.

Command Broken behavior
a (Append) Cursor lands before the character instead of after
r (Replace) Only replaces half the character, corrupting the text
s (Change char) Only deletes half the character before entering Insert mode
~ (Toggle case) Corrupts the character into a lone surrogate
ga (Unicode info) Shows the half-surrogate value instead of the full codepoint

Previous fixes addressed insert mode (PR #7977 / #6046) and motions (l/h) and operators (x/X/y), but these 5 commands were missed. This PR completes the surrogate pair handling for character-level commands and adds 12 regression tests to prevent future breakage.

Adds getSurrogateAwareRight()/getSurrogateAwareLeft() helpers on the Position prototype that skip past surrogate pairs, following the same pattern already used by MoveRight (l) / MoveLeft (h). Uses these helpers in the 5 affected commands. Also switches ga from charCodeAt() to codePointAt() to report the full Unicode codepoint.

Which issue(s) this PR fixes

Fixes #9931
Partially addresses #8321 — this PR fixes r and other character-level commands reported there. Remaining issues not addressed: easymotion uses raw character arithmetic for match positioning and marker decorations; xp (transpose) on surrogate pairs is also still broken (TODO added in put.ts).

Special notes for your reviewer:

  • 19/19 surrogate pair tests passing (7 existing + 12 new)
  • Full test suite passes (3157 passing, 0 regressions)
  • Tests cover emojis, rare CJK (𩸽), and mathematical symbols (𝒟, 𝔸) from U+1D400 block
  • Tests require fix: add missing @types/minimatch devDependency #9928 to build when re-resolving dependencies

@k1832 k1832 changed the title Fix surrogate pair (emoji) handling for a, r, s, ~, ga commands fix: surrogate pair (emoji) handling for a, r, s, ~, ga commands Feb 10, 2026
@k1832 k1832 force-pushed the fix/surrogate-pairs branch from 0cdb8e1 to 14abd9a Compare February 10, 2026 03:14
@k1832 k1832 mentioned this pull request Feb 10, 2026
@k1832 k1832 force-pushed the fix/surrogate-pairs branch 2 times, most recently from 164974e to c3b1300 Compare February 10, 2026 03:32
@J-Fields J-Fields requested a review from Copilot March 27, 2026 02:46
Comment thread src/common/motion/position.ts Outdated
k1832 added 2 commits April 26, 2026 22:44
`position.getRight()` increments by 1 UTF-16 code unit, but emojis
outside the Basic Multilingual Plane are encoded as 2 code units
(a surrogate pair). Moving by 1 code unit lands between the pair,
and VSCode's `validatePosition` clamps it back to the start — so
the cursor effectively goes backward.

Add `getSurrogateAwareRight()`/`getSurrogateAwareLeft()` helpers on
the Position prototype that skip past surrogate pairs, and use them
in the affected commands:

- `a` (Append): cursor now lands after the emoji
- `r` (ReplaceCharacter): replaces the full emoji, not half
- `s` (ChangeOperator): deletes the full emoji before entering insert
- `~` (ToggleCase): advances past the emoji without corrupting it
- `ga` (UnicodeInfo): shows full codepoint (e.g. U+1F604) via
  codePointAt() instead of the half-surrogate from charCodeAt()

Commands already protected (no changes needed): x/X (DeleteOperator),
l/h (MoveRight/MoveLeft), y (YankOperator).
Address review feedback: pass the TextDocument instead of a
pre-extracted line string, matching the rest of the Position API
(prevWordStart, etc.). Easier to call and removes the chance of
passing the wrong line's text.
@k1832 k1832 force-pushed the fix/surrogate-pairs branch from ddafd57 to 963c1b2 Compare April 26, 2026 13:50
@k1832 k1832 requested a review from J-Fields May 14, 2026 05:38
@k1832
Copy link
Copy Markdown
Contributor Author

k1832 commented May 14, 2026

Forgot to re-request a review...

@NanoUser3000
Copy link
Copy Markdown

note about p: p working depends one additional factor
when using j to move cursor over 2-wide / surrogate pair symbol, we can go from two columns (assuming they are normal characters). the cursor over emoji may be in 1 of 2 states: "in column more to the left" or "more to the right".

Screen capture showing the difference of going from space versus t (I x the a, then do jjpukljp…). Despite using p with cursor showing over the same character, the results are different)
Screencast from 2026-05-22 18-45-15.webm

Same thing is happening for a yanked word: the word is pasted before or after the emoji under cursor depending on this internal state.

shift+p / P seems to be working fine

@J-Fields J-Fields requested review from Copilot and removed request for Copilot May 26, 2026 23:02
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses incorrect handling of UTF-16 surrogate pairs (e.g., emoji and other non-BMP codepoints) in several character-level VSCodeVim commands by ensuring cursor movement and ranges don’t land between surrogate halves, which can corrupt text or misplace the cursor.

Changes:

  • Added Position.getSurrogateAwareRight() / Position.getSurrogateAwareLeft() helpers to skip over surrogate pairs safely.
  • Updated a, r, ~, and ga to use surrogate-aware movement; updated ChangeOperator to correct surrogate boundaries for s/change flows.
  • Added regression tests covering append/replace/toggle/change/delete behaviors with surrogate-pair characters.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
test/operator/surrogate.test.ts Adds regression tests for surrogate-pair behavior across several commands.
src/common/motion/position.ts Introduces surrogate-aware left/right helpers on Position.
src/actions/operator.ts Adjusts ChangeOperator deletion range to avoid splitting surrogate pairs.
src/actions/motion.ts Adds TODO notes about possibly adopting the new surrogate-aware helpers.
src/actions/commands/replace.ts Uses surrogate-aware right movement to compute replacement range for r.
src/actions/commands/put.ts Adds a TODO noting xp transpose is still broken for surrogate pairs.
src/actions/commands/insert.ts Updates a (Append) to move right surrogate-safely before entering Insert mode.
src/actions/commands/actions.ts Updates ~ to advance surrogate-safely and ga to use codePointAt() with surrogate-safe range.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/actions/commands/replace.ts Outdated
The bounds check counted UTF-16 code units while endPos is now
codepoint-aware, so e.g. `2r` on a single emoji slipped past the guard
and inserted an extra character instead of refusing the operation. Count
codepoints in the spanned range instead, matching Vim's behavior of
refusing the replace when fewer than `count` characters remain.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@k1832
Copy link
Copy Markdown
Contributor Author

k1832 commented May 30, 2026

Thanks for the detailed report and the screen recording, @NanoUser3000! That p behavior is a separate issue from this PR's scope — this PR only covers the character-level commands a, r, s, ~, and ga. The paste positioning you're seeing (text landing before/after the emoji depending on the cursor's internal column state) shares the same surrogate-pair root cause but lives in different code paths (put.ts / cursor column tracking). I'd rather not expand this PR's surface, so I'll open a follow-up issue to track p/xp on surrogate pairs so it doesn't get lost here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Surrogate pair characters (emoji, rare CJK, etc.) break a, r, s, ~, ga commands

4 participants