Conversation

@overlookmotel
Member

@overlookmotel overlookmotel commented Aug 18, 2025

Follow-on after #13169.

Implement the first optimization mentioned in #13169 (comment). Iterate over string byte-by-byte rather than char-by-char.

It's amazing how bad Rust is at string operations. I tried it without unsafe code at first, but Rust inserts a check that the slice falls on a UTF-8 char boundary on every single slicing operation, even though it's obvious from the context that these checks can never fail. That made the assembly 4x longer, which is no good as this is meant to be a tight loop.
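To make the char-boundary point concrete, here is a minimal sketch, not the code from this PR. The function name `split_at_first_lf` is hypothetical. It finds an ASCII line feed by scanning bytes and slices with `str::get_unchecked`, which skips the boundary check that safe indexing such as `&text[..index]` performs.

```rust
/// Hypothetical helper for illustration only.
/// Splits `text` at the first `\n`, returning the text before and after it.
/// Returns `None` if `text` contains no `\n`.
pub fn split_at_first_lf(text: &str) -> Option<(&str, &str)> {
    // Scan raw bytes. No UTF-8 decoding is needed to find an ASCII byte.
    let index = text.as_bytes().iter().position(|&b| b == b'\n')?;

    // SAFETY: `index` is the position of an ASCII byte (`\n`) within `text`.
    // ASCII bytes never appear inside a multi-byte UTF-8 sequence, so both
    // `index` and `index + 1` are char boundaries, and both are `<= text.len()`.
    let (before, after) = unsafe {
        (text.get_unchecked(..index), text.get_unchecked(index + 1..))
    };
    Some((before, after))
}
```

The safe equivalent would use `&text[..index]` and `&text[index + 1..]`, which return the same slices but keep the boundary checks unless the compiler can prove them redundant.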

@github-actions github-actions bot added A-codegen Area - Code Generation C-performance Category - Solution not expected to change functional behavior, only performance labels Aug 18, 2025

@overlookmotel overlookmotel marked this pull request as ready for review August 18, 2025 14:10
Copilot AI review requested due to automatic review settings August 18, 2025 14:11
@codspeed-hq

codspeed-hq bot commented Aug 18, 2025

CodSpeed Instrumentation Performance Report

Merging #13190 will not alter performance

Comparing 08-18-perf_codegen_faster_splitting_comments_into_lines (e3bfff1) with main (ada4e84) [1]

Summary

✅ 34 untouched benchmarks

Footnotes

  1. No successful run was found on main (e3bfff1) during the generation of this report, so ada4e84 was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

@overlookmotel overlookmotel force-pushed the 08-18-perf_codegen_faster_splitting_comments_into_lines branch from 6e4329a to e1bf875 Compare August 18, 2025 14:17
@overlookmotel overlookmotel marked this pull request as draft August 18, 2025 14:19
Contributor

Copilot AI left a comment

Pull Request Overview

This PR optimizes the performance of splitting comments into lines by iterating over UTF-8 bytes instead of Unicode characters. The changes implement a byte-based approach to identify line terminators while properly handling CRLF sequences and Unicode line separators (LS and PS).

  • Replaced character-by-character iteration with byte-by-byte processing for better performance
  • Added support for Unicode line separators (LS and PS) in addition to CR/LF
  • Removed the position field from the iterator struct in favor of modifying the text slice directly
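The approach summarized above can be sketched roughly as follows. This is an illustrative reconstruction under stated assumptions, not the merged implementation: the type name `LineSplitter` and its `rest` field are hypothetical, and the sketch uses safe slicing for clarity, so it still pays for the boundary checks that the PR avoids with unsafe code. It also assumes that a trailing line terminator yields a final empty line, which may differ from the real behavior.

```rust
/// Hypothetical line splitter for illustration; not the code from the PR.
pub struct LineSplitter<'a> {
    /// Text not yet yielded. `None` once the final line has been returned.
    /// There is no separate position field: the slice itself shrinks.
    rest: Option<&'a str>,
}

impl<'a> LineSplitter<'a> {
    pub fn new(text: &'a str) -> Self {
        Self { rest: Some(text) }
    }
}

impl<'a> Iterator for LineSplitter<'a> {
    type Item = &'a str;

    fn next(&mut self) -> Option<&'a str> {
        let text = self.rest?;
        let bytes = text.as_bytes();
        for (i, &b) in bytes.iter().enumerate() {
            // Byte length of the line terminator starting at `i`, or 0 if none.
            let term_len = match b {
                b'\n' => 1,
                // CRLF counts as a single terminator.
                b'\r' if bytes.get(i + 1) == Some(&b'\n') => 2,
                b'\r' => 1,
                // LS (U+2028) is E2 80 A8 and PS (U+2029) is E2 80 A9 in UTF-8.
                0xE2 if bytes.get(i + 1) == Some(&0x80)
                    && matches!(bytes.get(i + 2), Some(&0xA8) | Some(&0xA9)) => 3,
                _ => 0,
            };
            if term_len > 0 {
                // `i` and `i + term_len` are char boundaries, so these safe
                // slices never panic, but the boundary checks still run.
                self.rest = Some(&text[i + term_len..]);
                return Some(&text[..i]);
            }
        }
        // No terminator left: the remainder is the last line.
        self.rest = None;
        Some(text)
    }
}
```

For example, `LineSplitter::new("a\r\nb\rc\u{2028}d").collect::<Vec<_>>()` produces `["a", "b", "c", "d"]`.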

@overlookmotel overlookmotel force-pushed the 08-18-perf_codegen_faster_splitting_comments_into_lines branch from e1bf875 to bffe03c Compare August 18, 2025 14:49
@overlookmotel overlookmotel marked this pull request as ready for review August 18, 2025 14:53
@graphite-app graphite-app bot added the 0-merge Merge with Graphite Merge Queue label Aug 19, 2025
@graphite-app
Contributor

graphite-app bot commented Aug 19, 2025

Merge activity

@graphite-app graphite-app bot force-pushed the 08-18-perf_codegen_faster_splitting_comments_into_lines branch from bffe03c to e3bfff1 Compare August 19, 2025 00:33
@graphite-app graphite-app bot merged commit e3bfff1 into main Aug 19, 2025
24 checks passed
@graphite-app graphite-app bot deleted the 08-18-perf_codegen_faster_splitting_comments_into_lines branch August 19, 2025 00:38
@graphite-app graphite-app bot removed the 0-merge Merge with Graphite Merge Queue label Aug 19, 2025
@hyrious

hyrious commented Aug 20, 2025

@overlookmotel
Member Author

Unfortunately not. From the docs:

Note that any carriage return (\r) not immediately followed by a line feed (\n) does not split a line. These carriage returns are thereby included in the produced lines.

We need to split on \r, \n, \r\n, and also irregular Unicode line breaks <LS> and <PS>.
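The quoted note matches the documentation of the standard library's `str::lines`. Assuming that is the method being discussed, the small sketch below shows the mismatch; the wrapper name `std_lines` is hypothetical.

```rust
/// Hypothetical wrapper for illustration: collects the output of `str::lines`.
pub fn std_lines(text: &str) -> Vec<&str> {
    text.lines().collect()
}

/// Demonstrates which terminators `str::lines` does and does not split on.
pub fn demo() {
    // `\n` and `\r\n` both split lines.
    assert_eq!(std_lines("a\r\nb\nc"), vec!["a", "b", "c"]);
    // A lone `\r` does not split, and stays in the produced line.
    assert_eq!(std_lines("a\rb"), vec!["a\rb"]);
    // LS (U+2028) and PS (U+2029) do not split either.
    assert_eq!(std_lines("a\u{2028}b\u{2029}c"), vec!["a\u{2028}b\u{2029}c"]);
}
```

Because lone `\r`, `<LS>`, and `<PS>` are left inside the lines, `str::lines` alone cannot provide the splitting described above.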
