Skip to content

Conversation

@h-a-n-a
Copy link

@h-a-n-a h-a-n-a commented Aug 20, 2025

Fixes a simple use case in optimizing lone surrogates. Also initialized and pinpointed some places that need a fix in future PRs.

With input:

'\uD83D' + '\uDD25'
`\uD83D` + `\uDD25`

Previously outputs:

"�d838�dd25"
"�d838�dd25"

Now outputs:

"\ud83d\udd25"
"\ud83d\udd25"

Ported some optimization fixes I made in swc-project/swc#10987 and I'm really appreciate the job you all have done. Without the previous investigation made by @overlookmotel, @Boshen and Oxc team, this will not happen.

@graphite-app
Copy link
Contributor

graphite-app bot commented Aug 20, 2025

How to use the Graphite Merge Queue

Add either label to this PR to merge it via the merge queue:

  • 0-merge - adds this PR to the back of the merge queue
  • hotfix - for urgent hot fixes, skip the queue and merge this PR next

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

@github-actions github-actions bot added A-minifier Area - Minifier A-transformer Area - Transformer / Transpiler C-bug Category - Bug labels Aug 20, 2025
@Boshen
Copy link
Member

Boshen commented Aug 20, 2025

This is a lot of code to support lone surrogates, I'll take a deeper look later.

@Boshen Boshen self-assigned this Aug 20, 2025
@codspeed-hq
Copy link

codspeed-hq bot commented Aug 20, 2025

CodSpeed Instrumentation Performance Report

Merging #13229 will not alter performance

Comparing h-a-n-a:lone-surrogates-optimization (f1b2f8c) with main (d27a04b)1

Summary

✅ 34 untouched benchmarks

Footnotes

  1. No successful run was found on main (e84ae8b) during the generation of this report, so d27a04b was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

@h-a-n-a
Copy link
Author

h-a-n-a commented Aug 20, 2025

This is a lot of code to support lone surrogates, I'll take a deeper look later.

@Boshen Thanks! That would be great and I'd love to fix all the TODOs I listed in the PR, probably in the future PRs. If you need any help, just let me know ;-)

@h-a-n-a h-a-n-a marked this pull request as ready for review August 20, 2025 11:19
Copilot AI review requested due to automatic review settings August 20, 2025 11:19
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Fixes string concatenation optimization to properly handle lone surrogates in Unicode string literals. The PR modifies the ToJsString trait to return a tuple containing both the string value and a boolean flag indicating the presence of lone surrogates, preventing unsafe optimizations that could produce incorrect output.

Key changes:

  • Modified ToJsString trait to return (Cow<'a, str>, bool) instead of just Cow<'a, str>
  • Added checks throughout the codebase to avoid optimizations when lone surrogates are present
  • Added test cases to verify correct handling of lone surrogate concatenation

Reviewed Changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
crates/oxc_ecmascript/src/to_string.rs Core trait modification to track lone surrogates in string conversion
crates/oxc_minifier/src/peephole/fold_constants.rs Updated string concatenation logic to handle lone surrogates
crates/oxc_minifier/src/peephole/substitute_alternate_syntax.rs Added lone surrogate checks in template literal substitution
crates/oxc_minifier/tests/peephole/obscure_edge_cases.rs Added test cases for lone surrogate string concatenation
crates/oxc_minifier/tests/peephole/mod.rs Added new test helper function for output verification
crates/oxc_ecmascript/src/constant_evaluation/value.rs Updated ConstantValue::String to include lone surrogate flag
crates/oxc_ecmascript/src/constant_evaluation/mod.rs Updated constant evaluation to propagate lone surrogate information
crates/oxc_ecmascript/src/constant_evaluation/call_expr.rs Added lone surrogate handling in string method optimizations
crates/oxc_minifier/src/ctx.rs Updated expression creation to handle lone surrogates
crates/oxc_minifier/src/peephole/inline.rs Updated inline optimization check for string length

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

strings
.map(|v| v.iter().map(AsRef::as_ref).collect::<Vec<_>>().join(separator.unwrap_or(",")))
// If any element contains a lone surrogate, we cannot join them as strings.
if strings.iter().any(|s| s.iter().any(|s| s.1)) {
Copy link

Copilot AI Aug 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The nested iterator chain is hard to read. Consider flattening this logic or adding a comment to explain what s.1 represents (the lone_surrogates flag).

Suggested change
if strings.iter().any(|s| s.iter().any(|s| s.1)) {
// Here, s.1 is the lone_surrogates flag returned by to_js_string.
if strings.iter().flatten().any(|(_, lone_surrogates)| *lone_surrogates) {

Copilot uses AI. Check for mistakes.
Comment on lines +153 to +156
if let Some((_, lone_surrogates)) = search_value
&& lone_surrogates
{
return None;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pattern matching syntax here is incorrect. Rust doesn't support combining a pattern match with a boolean condition in this way. This should be rewritten as:

if let Some((_, lone_surrogates)) = search_value {
    if lone_surrogates {
        return None;
    }
}

This properly separates the pattern matching from the conditional check on the extracted value.

Suggested change
if let Some((_, lone_surrogates)) = search_value
&& lone_surrogates
{
return None;
if let Some((_, lone_surrogates)) = search_value
{
if lone_surrogates
{
return None;
}
}

Spotted by Diamond

Fix in Graphite


Is this helpful? React 👍 or 👎 to let us know.

}

#[track_caller]
fn test_output(source_text: &str, expected: &str) {
Copy link
Author

@h-a-n-a h-a-n-a Aug 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a standalone test runner.

The rationale behind this is that the original test runner would generate actual: "\ud83d\udd25" and expected: "🔥" if I write

test("return '\\uD83D' + '\\uDD25'", "return '\\ud83d\\udd25'")

Although the execution result in JavaScript runtime is the same, it's not the same as it compares with each other in Rust. The expected half will be interpreted as a surrogate pair in Rust and the actual half will be an escaped surrogate pair.

@h-a-n-a
Copy link
Author

h-a-n-a commented Nov 10, 2025

FYI: we've replaced our implementation here in swc with Wtf8Buf.

See:

  1. feat(hstr): Introduce Wtf8Atom swc-project/swc#11104
  2. fix(es/ast): Fix unicode unpaired surrogates handling swc-project/swc#11144

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A-minifier Area - Minifier A-transformer Area - Transformer / Transpiler C-bug Category - Bug

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants