Skip to content

Conversation

Copy link
Contributor

Copilot AI commented Aug 18, 2025

The oxc-transform's isolated declaration feature was incorrectly handling Windows line endings (\r\n) in comments, generating double newlines in the output. Consider:

import { isolatedDeclaration } from 'oxc-transform'

console.log([
	isolatedDeclaration('a.ts',
		// Simulates windows file line-ending.
		[
			'/**',
			' * Comment.',
			' */',
			'export const a: number = 1;'
		].join('\r\n')
	).code
])
// Before: [ '/**\n\n* Comment.\n\n*/\nexport declare const a: number;\n' ]
// After:  [ '/**\n* Comment.\n*/\nexport declare const a: number;\n' ]

Root Cause

The issue was in the comment processing logic in oxc_codegen/src/comment.rs. The split(is_line_terminator) function treats both \r and \n as separate line terminators, causing \r\n sequences to split incorrectly with empty strings between them, leading to double newlines in the output.

Solution

Implemented an allocation-free approach using a custom LineTerminatorSplitter iterator that:

  1. Handles CRLF sequences correctly by treating \r\n as a single line terminator
  2. Eliminates performance overhead from string allocations during comment processing
  3. Processes line terminators on-demand without creating temporary strings

This approach provides better performance compared to the previous normalization function while correctly handling Windows line endings.

Additional Changes

  • Converted all single-line block comments (/** comment */) to multiline format in test fixtures to better verify the CRLF fix
  • Updated corresponding test snapshots to reflect the improved comment formatting

Testing

  • Added comprehensive tests covering Windows line endings, mixed line endings, and standalone CR handling
  • All existing tests continue to pass (codegen, isolated declarations, transform)
  • Verified the exact reproduction case from the issue works correctly
  • Confirmed zero allocation overhead in comment processing

Fixes #13059.


💬 Share your feedback on Copilot coding agent for the chance to win a $200 gift card! Click here to start the survey.

Copy link
Contributor Author

Copilot AI commented Aug 18, 2025

@Dunqing 👋 This repository doesn't have Copilot instructions. With Copilot instructions, I can understand the repository better, work faster and produce higher quality PRs.

I can generate a .github/copilot-instructions.md file for you automatically. Click here to open a pre-filled issue and assign it to me. I'll write the instructions, and then tag you for review.

@graphite-app
Copy link
Contributor

graphite-app bot commented Aug 18, 2025

How to use the Graphite Merge Queue

Add either label to this PR to merge it via the merge queue:

  • 0-merge - adds this PR to the back of the merge queue
  • hotfix - for urgent hot fixes, skip the queue and merge this PR next

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

Copilot AI changed the title [WIP] oxc-transform generates \n (new line) for each \r (carriage return) Fix CRLF handling in comment processing for oxc-transform Aug 18, 2025
Copilot AI requested a review from Dunqing August 18, 2025 03:11
Copilot finished work on behalf of Dunqing August 18, 2025 03:11
@Dunqing
Copy link
Member

Dunqing commented Aug 18, 2025

@copilot

The fix isn't ideal, and it appears to impact performance. Could you try another approach without extra allocations?

Also, I prefer to change all leading single block line comments from https://github.com/oxc-project/oxc/blob/306b023302760c19dca500a40e97e246faee7d2e/crates/oxc_isolated_declarations/tests/fixtures
to multiline. That could also help us to verify the fix.

Copy link
Contributor Author

Copilot AI commented Aug 18, 2025

@copilot

The fix isn't ideal, and it appears to impact performance. Could you try another approach without extra allocations?
...

I've implemented a better approach that eliminates allocations. Instead of preprocessing comment text, I created a custom LineTerminatorSplitter iterator that handles CRLF sequences correctly during splitting without creating temporary strings. Also converted all single-line block comments to multiline format in test fixtures as requested. All tests are passing with the allocation-free solution. Commit: a2b351f

Copilot AI changed the title Fix CRLF handling in comment processing for oxc-transform Implement allocation-free CRLF handling in comment processing for oxc-transform Aug 18, 2025
Copilot finished work on behalf of Dunqing August 18, 2025 04:25
@github-actions github-actions bot added A-codegen Area - Code Generation A-isolated-declarations Isolated Declarations labels Aug 18, 2025
@codspeed-hq
Copy link

codspeed-hq bot commented Aug 18, 2025

CodSpeed Instrumentation Performance Report

Merging #13169 will not alter performance

Comparing copilot/fix-13059 (8f533aa) with main (f10ac33)1

Summary

✅ 34 untouched benchmarks

Footnotes

  1. No successful run was found on main (8f533aa) during the generation of this report, so f10ac33 was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

@Dunqing
Copy link
Member

Dunqing commented Aug 18, 2025

@copilot Please my request changes

Copy link
Contributor Author

Copilot AI commented Aug 18, 2025

@copilot Please my request changes

All requested changes have been made. The LineTerminatorSplitter now has a clear example, unnecessary comments have been removed, and all extra test/example files have been deleted. The core CRLF fix remains intact and all tests are passing. Commit: d287669

Copilot AI requested a review from Dunqing August 18, 2025 08:41
Copilot finished work on behalf of Dunqing August 18, 2025 08:41
@Dunqing
Copy link
Member

Dunqing commented Aug 18, 2025

@copilot
Everything seems good now! Please fix clippy issues and change to a proper PR title that fits PR convention, and then wait for merging.

@Dunqing
Copy link
Member

Dunqing commented Aug 18, 2025

@copilot see #13169 (comment)

Copy link
Contributor Author

Copilot AI commented Aug 18, 2025

@copilot
Everything seems good now! Please fix clippy issues and change to a proper PR title that fits PR convention, and then wait for merging.

Fixed all clippy issues and updated PR title to follow conventional commit format: "fix(codegen): implement allocation-free CRLF handling in comment processing". All tests are passing and the implementation is ready for merge. Commit: cee6da6

@copilot see #13169 (comment)

All requested changes completed - clippy issues fixed and PR title updated to proper convention. Ready for merge. Commit: cee6da6

Copilot finished work on behalf of Dunqing August 18, 2025 09:16
@Dunqing Dunqing changed the title Implement allocation-free CRLF handling in comment processing for oxc-transform fix(codegen): correct CRLF handling in comment processing Aug 18, 2025
@github-actions github-actions bot added the C-bug Category - Bug label Aug 18, 2025
@Dunqing Dunqing marked this pull request as ready for review August 18, 2025 09:43
@Dunqing Dunqing requested review from Boshen and Copilot August 18, 2025 09:43
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes incorrect Windows line ending (\r\n) handling in comment processing within the oxc-transform isolated declaration feature. The issue was causing double newlines in generated output when processing comments containing CRLF sequences.

Key changes include:

  • Implemented a custom LineTerminatorSplitter iterator to handle CRLF sequences as single line terminators
  • Converted single-line block comments to multiline format in test fixtures
  • Updated test snapshots to reflect corrected comment formatting

Reviewed Changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.

File Description
crates/oxc_codegen/src/comment.rs Added custom iterator to properly handle CRLF line endings in comment processing
crates/oxc_isolated_declarations/tests/fixtures/*.ts Converted single-line block comments to multiline format for better test coverage
crates/oxc_isolated_declarations/tests/snapshots/*.snap Updated test snapshots to reflect improved comment formatting

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

@Boshen Boshen self-assigned this Aug 18, 2025
@overlookmotel
Copy link
Member

Just to point out that this could be considerably faster.

  1. Iterate over bytes not chars

Check each byte for \r, \n, or 0xE2 (which is first byte of both PS and LS). Cold branch for 0xE2, since irregular whitespace is extremely uncommon.

  1. Process in batches of 16 bytes using chunks_exact

Then compiler will vectorize inner loop over each batch of 16 bytes checking for \r, \n, or 0xE2 to a few SIMD ops.

print_str_escaping_script_close_tag uses this trick (though it makes it more complicated by also applying optimization of using pointers instead of indices).

/// Push str into the buffer, escaping `</script` to `<\/script`.
#[inline]
pub fn print_str_escaping_script_close_tag(&mut self, s: &str) {
// `</script` will be very rare. So we try to make the search as quick as possible by:
// 1. Searching for `<` first, and only checking if followed by `/script` once `<` is found.
// 2. Searching longer strings for `<` in chunks of 16 bytes using SIMD, and only doing the
// more expensive byte-by-byte search once a `<` is found.
let bytes = s.as_bytes();
let mut consumed = 0;
#[expect(clippy::unnecessary_safety_comment)]
// Search range of bytes for `</script`, byte by byte.
//
// Bytes between `ptr` and `last_ptr` (inclusive) are searched for `<`.
// If `<` is found, the following 7 bytes are checked to see if they're `/script`.
//
// SAFETY:
// * `ptr` and `last_ptr` must be within bounds of `bytes`.
// * `last_ptr` must be greater or equal to `ptr`.
// * `last_ptr` must be no later than 8 bytes before end of string.
// i.e. safe to read 8 bytes at `end_ptr`.
let mut search_bytes = |mut ptr: *const u8, last_ptr| {
loop {
// SAFETY: `ptr` is always less than or equal to `last_ptr`.
// `last_ptr` is within bounds of `bytes`, so safe to read a byte at `ptr`.
let byte = unsafe { *ptr.as_ref().unwrap_unchecked() };
if byte == b'<' {
// SAFETY: `ptr <= last_ptr`, and `last_ptr` points to no later than
// 8 bytes before end of string, so safe to read 8 bytes from `ptr`
let slice = unsafe { slice::from_raw_parts(ptr, 8) };
if is_script_close_tag(slice) {
// Push str up to and including `<`. Skip `/`. Write `\/` instead.
// SAFETY:
// `consumed` is initially 0, and only updated below to be after `/`,
// so in bounds, and on a UTF-8 char boundary.
// `index` is on `<`, so `index + 1` is in bounds and a UTF-8 char boundary.
// `consumed` is always less than `index + 1` as it's set on a previous round.
unsafe {
let index = ptr.offset_from_unsigned(bytes.as_ptr());
let before = bytes.get_unchecked(consumed..=index);
self.code.print_bytes_unchecked(before);
// Set `consumed` to after `/`
consumed = index + 2;
}
self.print_str("\\/");
// Note: We could advance `ptr` by 8 bytes here to skip over `</script`,
// but this branch will be very rarely taken, so it's better to keep it simple
}
}
if ptr == last_ptr {
break;
}
// SAFETY: `ptr` is less than `last_ptr`, which is in bounds, so safe to increment `ptr`
ptr = unsafe { ptr.add(1) };
}
};
// Search string in chunks of 16 bytes
let mut chunks = bytes.chunks_exact(16);
for (chunk_index, chunk) in chunks.by_ref().enumerate() {
#[expect(clippy::missing_panics_doc, reason = "infallible")]
let chunk: &[u8; 16] = chunk.try_into().unwrap();
// Compiler vectorizes this loop to a few SIMD ops
let mut contains_lt = false;
for &byte in chunk {
if byte == b'<' {
contains_lt = true;
}
}
if contains_lt {
// Chunk contains at least one `<`.
// Find them, and check if they're the start of `</script`.
//
// SAFETY: `index` is byte index of start of chunk.
// We search bytes starting with first byte of chunk, and ending with last byte of chunk.
// i.e. `index` to `index + 15` (inclusive).
// If this chunk is towards the end of the string, reduce the range of bytes searched
// so the last byte searched has at least 7 further bytes after it.
// i.e. safe to read 8 bytes at `last_ptr`.
cold_branch(|| unsafe {
let index = chunk_index * 16;
let remaining_bytes = bytes.len() - index;
let last_offset = cmp::min(remaining_bytes - 8, 15);
let ptr = bytes.as_ptr().add(index);
let last_ptr = ptr.add(last_offset);
search_bytes(ptr, last_ptr);
});
}
}
// Search last chunk byte-by-byte.
// Skip this if less than 8 bytes remaining, because less than 8 bytes can't contain `</script`.
let last_chunk = chunks.remainder();
if last_chunk.len() >= 8 {
let ptr = last_chunk.as_ptr();
// SAFETY: `last_chunk.len() >= 8`, so `- 8` cannot wrap.
// `last_chunk.as_ptr().add(last_chunk.len() - 8)` is in bounds of `last_chunk`.
let last_ptr = unsafe { ptr.add(last_chunk.len() - 8) };
search_bytes(ptr, last_ptr);
}
// SAFETY: `consumed` is either 0, or after `/`, so on a UTF-8 char boundary, and in bounds
unsafe {
let remaining = bytes.get_unchecked(consumed..);
self.code.print_bytes_unchecked(remaining);
}
}

AI might be able to handle (1), but I'm not sure if it'd be capable of (2).

@Boshen Boshen added the 0-merge Merge with Graphite Merge Queue label Aug 18, 2025
Copy link
Member

Boshen commented Aug 18, 2025

Merge activity

The oxc-transform's isolated declaration feature was incorrectly handling Windows line endings (`\r\n`) in comments, generating double newlines in the output. Consider:

```typescript
import { isolatedDeclaration } from 'oxc-transform'

console.log([
	isolatedDeclaration('a.ts',
		// Simulates windows file line-ending.
		[
			'/**',
			' * Comment.',
			' */',
			'export const a: number = 1;'
		].join('\r\n')
	).code
])
// Before: [ '/**\n\n* Comment.\n\n*/\nexport declare const a: number;\n' ]
// After:  [ '/**\n* Comment.\n*/\nexport declare const a: number;\n' ]
```

## Root Cause

The issue was in the comment processing logic in `oxc_codegen/src/comment.rs`. The `split(is_line_terminator)` function treats both `\r` and `\n` as separate line terminators, causing `\r\n` sequences to split incorrectly with empty strings between them, leading to double newlines in the output.

## Solution

Implemented an allocation-free approach using a custom `LineTerminatorSplitter` iterator that:

1. Handles CRLF sequences correctly by treating `\r\n` as a single line terminator
2. Eliminates performance overhead from string allocations during comment processing
3. Processes line terminators on-demand without creating temporary strings

This approach provides better performance compared to the previous normalization function while correctly handling Windows line endings.

## Additional Changes

- Converted all single-line block comments (`/** comment */`) to multiline format in test fixtures to better verify the CRLF fix
- Updated corresponding test snapshots to reflect the improved comment formatting

## Testing

- Added comprehensive tests covering Windows line endings, mixed line endings, and standalone CR handling
- All existing tests continue to pass (codegen, isolated declarations, transform)
- Verified the exact reproduction case from the issue works correctly
- Confirmed zero allocation overhead in comment processing

Fixes #13059.

<!-- START COPILOT CODING AGENT TIPS -->
---

💬 Share your feedback on Copilot coding agent for the chance to win a $200 gift card! Click [here](https://survey.alchemer.com/s3/8343779/Copilot-Coding-agent) to start the survey.
@graphite-app graphite-app bot force-pushed the copilot/fix-13059 branch from 12da4fe to 8f533aa Compare August 18, 2025 11:50
@graphite-app graphite-app bot merged commit 8f533aa into main Aug 18, 2025
24 checks passed
@graphite-app graphite-app bot deleted the copilot/fix-13059 branch August 18, 2025 11:55
@graphite-app graphite-app bot removed the 0-merge Merge with Graphite Merge Queue label Aug 18, 2025
graphite-app bot pushed a commit that referenced this pull request Aug 19, 2025
Follow-on after #13169.

Implement the first optimization mentioned in #13169 (comment). Iterate over string byte-by-byte rather than char-by-char.

It's amazing how bad Rust is at string operations. I tried it without unsafe code at first, but Rust inserts checks for whether a slice falls on a UTF-8 char boundary on every single operation, even though it's obvious from the context that these checks can never fail. It made the assembly x4 longer, which is no good as this is meant to be a tight loop.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A-codegen Area - Code Generation A-isolated-declarations Isolated Declarations C-bug Category - Bug

Projects

None yet

Development

Successfully merging this pull request may close these issues.

oxc-transform generates \n (new line) for each \r (carriage return)

4 participants