Convert lexer to iterate over bytes instead of chars #3291

Closed · 5 tasks done
overlookmotel opened this issue May 15, 2024 · 1 comment · Fixed by #4304
Labels: A-parser (Area - Parser) · C-performance (Category - Solution not expected to change functional behavior, only performance) · E-Help Wanted (Experience level - For the experienced collaborators)

overlookmotel (Collaborator) commented May 15, 2024

Why the lexer is slower than it could be

The Chars iterator is really slow. The lexer should iterate byte-by-byte rather than char-by-char.

In almost all cases, we're only matching against ASCII characters anyway (e.g. self.next_eq('.')), so calculating the Unicode char is completely pointless: all we care about is whether the byte is an ASCII '.'. Surprisingly, the compiler seems unable to see this, and doesn't optimize out the char decoding itself. It generates a lot of pointless and slow code.
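
As a standalone illustration of the difference (this is plain &str code, not the lexer's actual API):

// Standalone illustration (not oxc's Lexer API): checking whether input
// starts with an ASCII '.' the char-based way vs the byte-based way.
fn starts_with_dot_chars(src: &str) -> bool {
  // Decodes a full Unicode scalar value just to compare it with '.'.
  src.chars().next() == Some('.')
}

fn starts_with_dot_bytes(src: &str) -> bool {
  // A single byte load and compare. '.' is ASCII, and UTF-8 continuation
  // bytes are always >= 0x80, so this can never match mid-character.
  src.as_bytes().first() == Some(&b'.')
}

fn main() {
  assert!(starts_with_dot_chars(".foo"));
  assert!(starts_with_dot_bytes(".foo"));
  assert!(!starts_with_dot_bytes("é.foo"));
}

The byte version gives the compiler nothing to decode, which is the shape of code the lexer wants for ASCII-only checks.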

How to make it faster

Source was introduced with the intention of easing the transition to byte-by-byte iteration through the source code.

#2352 got some heavy perf improvements from using it (along with other tricks), but I have not been able to find the time to complete the work.

The following APIs should be converted to take/return u8 and use Source::peek_byte instead of Source::peek_char (see the sketch after these lists):

  • peek
  • peek2
  • next_eq

Usages of the following APIs should be refactored unless they're directly preceded by a peek:

  • next_char
  • consume_char
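
A minimal, self-contained sketch of the byte-based shape these helpers could take (the Source struct, its fields, and the method bodies here are hypothetical, for illustration only, not oxc's real types):

// Hypothetical sketch: a byte-based next_eq built on a peek_byte primitive.
struct Source<'a> {
  bytes: &'a [u8],
  pos: usize,
}

impl<'a> Source<'a> {
  // Peek at the next byte without consuming it.
  fn peek_byte(&self) -> Option<u8> {
    self.bytes.get(self.pos).copied()
  }

  // Consume the next char if its first byte is the ASCII byte `b`.
  fn next_eq(&mut self, b: u8) -> bool {
    debug_assert!(b.is_ascii());
    if self.peek_byte() == Some(b) {
      // An ASCII byte is always a complete UTF-8 char, so advancing by one
      // byte cannot leave `pos` in the middle of a character.
      self.pos += 1;
      true
    } else {
      false
    }
  }
}

fn main() {
  let mut source = Source { bytes: ".foo".as_bytes(), pos: 0 };
  assert!(source.next_eq(b'.'));
  assert!(!source.next_eq(b'.'));
}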

Suggested implementation

Source::next_byte is an unsafe function, as it can break the invariant that Source must always be positioned on a UTF-8 character boundary (i.e. not in the middle of a multi-byte Unicode char). It's preferable not to use it, to avoid littering the lexer with unsafe code.
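
As a standalone illustration of that invariant (using the standard str::is_char_boundary, not the lexer's API):

fn main() {
  let s = "a€b"; // '€' occupies bytes 1..4 in UTF-8
  assert!(s.is_char_boundary(0));  // before 'a'
  assert!(s.is_char_boundary(1));  // before '€'
  assert!(!s.is_char_boundary(2)); // inside '€': not a valid position for Source
  assert!(s.is_char_boundary(4));  // before 'b'
}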

However, Source::next_char is written to be much more transparent to the compiler than Chars::next is. So the compiler is able to optimize safe code like:

let dot_was_consumed = match lexer.source.peek_byte() {
  b'.' => {
    lexer.consume_char().unwrap();
    true
  }
  _ => false,
};

down to the equivalent of:

let dot_was_consumed = match lexer.source.peek_byte() {
  b'.' => {
    unsafe {
      // This is a single assembly instruction
      lexer.source.set_position(lexer.source.position().add(1));
    }
    true
  }
  _ => false,
};

(at least that's what I remember from a few months ago when I checked it with Godbolt).

(originally mentioned in #3250 (comment))

overlookmotel added the C-performance (Category - Solution not expected to change functional behavior, only performance) label May 15, 2024
overlookmotel assigned Boshen and unassigned Boshen May 15, 2024
overlookmotel added the A-parser (Area - Parser) label May 15, 2024
Boshen added the E-Help Wanted (Experience level - For the experienced collaborators) label May 17, 2024
overlookmotel (Collaborator, Author) commented:
Closed in #4298 and #4304. The performance gain was sadly not very large, but every little helps, and if you don't try, you don't know...
