fix(es/codegen): Encode non-ASCII chars in regex with ascii_only option (#11155)
## Summary
Fixes #11146
This PR fixes a bug where the SWC minifier does not encode non-ASCII
characters in regular expressions when the `ascii_only` option is
enabled.
Previously, when `ascii_only: true` was set, non-ASCII characters in
strings were correctly encoded to Unicode escape sequences, but regex
patterns were left unchanged. This PR ensures regex patterns receive the
same treatment.
## Changes
### 1. Added `encode_regex_for_ascii` function
A new helper function in `crates/swc_ecma_codegen/src/lit.rs` that:
- Encodes non-ASCII characters in regex patterns to Unicode escape
sequences
- Uses `\xHH` format for characters in range `\x7f` to `\xff`
- Uses `\uHHHH` format for characters above `\xff`
- Encodes characters beyond the BMP (above U+FFFF) as surrogate pairs
for compatibility
- Preserves ASCII characters as-is for optimal performance
- Returns a borrowed string when `ascii_only` is `false` or the pattern
is pure ASCII
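
The rules above can be sketched roughly as follows. This is an illustrative reconstruction from the bullet list, not the code merged in `lit.rs`: the function name `encode_regex_sketch`, its signature, and the lowercase hex formatting (chosen to match the example output in this description) are assumptions.

```rust
use std::borrow::Cow;
use std::fmt::Write;

/// Hypothetical sketch of the encoding rules; not the actual swc helper.
fn encode_regex_sketch(pattern: &str, ascii_only: bool) -> Cow<'_, str> {
    // Fast path: nothing to encode, so borrow the input without allocating.
    if !ascii_only || pattern.is_ascii() {
        return Cow::Borrowed(pattern);
    }

    let mut out = String::with_capacity(pattern.len());
    let mut units = [0u16; 2];
    for c in pattern.chars() {
        if c.is_ascii() {
            // ASCII characters are copied through unchanged.
            out.push(c);
        } else if (c as u32) <= 0xff {
            // Non-ASCII characters up to U+00FF use the short `\xHH` form.
            write!(out, "\\x{:02x}", c as u32).unwrap();
        } else {
            // Everything else is written as UTF-16 code units, so characters
            // beyond the BMP become a surrogate pair of two `\uHHHH` escapes.
            for unit in c.encode_utf16(&mut units).iter() {
                write!(out, "\\u{:04x}", unit).unwrap();
            }
        }
    }
    Cow::Owned(out)
}
```

Returning `Cow` keeps the common case (option disabled, or an all-ASCII pattern) allocation-free, which is the point of the last bullet.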
### 2. Updated `Lit::Regex` emission logic
Modified the regex literal emission in `lit.rs:32-39` to:
- Check if `ascii_only` is enabled via `emitter.cfg.ascii_only`
- Apply the encoding function to the regex expression before writing
- Maintain the same behavior as string literal encoding
### 3. Added comprehensive unit tests
Five new tests in `crates/swc_ecma_codegen/src/tests.rs`:
- `ascii_only_regex_1`: Verifies non-ASCII chars preserved when
`ascii_only: false`
- `ascii_only_regex_2`: Verifies encoding with specific example from
issue #11146
- `ascii_only_regex_3`: Tests emoji preservation when `ascii_only:
false`
- `ascii_only_regex_4`: Tests emoji encoding when `ascii_only: true`
- `ascii_only_regex_5`: Ensures pure ASCII regex unchanged with
`ascii_only: true`
## Example
**Input:**
```javascript
/[\w@Ø-ÞÀ-Öß-öø-ÿ]/
```
**Output with `ascii_only: false`:**
```javascript
/[\w@Ø-ÞÀ-Öß-öø-ÿ]/
```
**Output with `ascii_only: true`:**
```javascript
/[\w@\xd8-\xde\xc0-\xd6\xdf-\xf6\xf8-\xff]/
```
## Test plan
- [x] All new unit tests pass
(`cargo test -p swc_ecma_codegen --lib -- ascii_only_regex`)
- [x] Code formatted with `cargo fmt --all`
- [x] Implementation follows existing patterns from `get_quoted_utf16`
function
- [x] Adheres to CLAUDE.md requirements (performance-focused,
documented, English comments)
## Related
- Issue: #11146
- Behavior is similar to how Terser handles its `ascii_only` option
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Donny/강동윤 <kdy.1997.dev@gmail.com>