Add isascii to normalize_identifier for 25% reduction in parse time #228

ndinsmore · 2023-03-23T16:02:47Z

I was trying to use JuliaSyntax in some string benchmarking and noticed a lot of calls to utf8_normalization.

On master, isascii is now blazing fast, so I thought that checking whether a string is ASCII before doing any UTF-8 normalization would speed things up, since Julia code is mostly ASCII.
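The idea can be sketched roughly as follows. This is only an illustration, not the code from this PR: `normalize_identifier_sketch` is a hypothetical name, and `Unicode.normalize(str, :NFC)` is used as a stand-in for whatever normalization JuliaSyntax actually applies. The fast path relies on the fact that pure-ASCII strings are already in normalized form.

```julia
using Unicode

# Hypothetical stand-in for normalize_identifier; the name and the use of
# NFC normalization are assumptions made for this illustration.
function normalize_identifier_sketch(str::AbstractString)
    # ASCII text is unchanged by Unicode normalization, so a cheap isascii
    # check lets most identifiers skip the expensive utf8proc call entirely.
    return isascii(str) ? String(str) : Unicode.normalize(str, :NFC)
end

normalize_identifier_sketch("getindex")   # ASCII: returned without normalization
normalize_identifier_sketch("e\u0301")    # non-ASCII: goes through normalization
```

Only identifiers containing non-ASCII characters pay for the normalization call.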

I benchmarked by parsing base/abstractarray.jl which is the largest file in the codebase.

Before PR:

julia> @benchmark t= JuliaSyntax.parseall(JuliaSyntax.SyntaxNode,$f)
BenchmarkTools.Trial: 291 samples with 1 evaluation.
 Range (min … max):  15.019 ms … 25.913 ms  ┊ GC (min … max): 0.00% … 36.31%
 Time  (median):     16.389 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   17.184 ms ±  1.963 ms  ┊ GC (mean ± σ):  4.12% ±  7.50%

       ▂▂▂█▂
  ▃▅▆▇▇█████▇▆▅█▄▃▄▁▄▃▄▁▁▂▃▃▁▁▃▁▆▃▃▃▄▁▄▂▃▃▂▃▃▁▁▃▃▁▂▁▁▂▁▁▁▂▁▁▂ ▃
  15 ms           Histogram: frequency by time        23.9 ms <

 Memory estimate: 7.97 MiB, allocs estimate: 171744.

After PR:

julia> @benchmark t= JuliaSyntax.parseall(JuliaSyntax.SyntaxNode,$f)
BenchmarkTools.Trial: 373 samples with 1 evaluation.
 Range (min … max):  11.547 ms … 32.907 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     12.722 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   13.422 ms ±  2.064 ms  ┊ GC (mean ± σ):  3.20% ± 6.65%

          ▇█▃
  ▂▂▂▃▅█▇████▆▄▄▄▃▃▄▃▃▂▂▂▃▂▃▃▃▂▄▃▄▂▂▃▂▃▂▃▃▁▁▁▂▂▂▂▁▁▁▁▁▁▁▁▂▁▂▂ ▃
  11.5 ms         Histogram: frequency by time        18.6 ms <

 Memory estimate: 6.48 MiB, allocs estimate: 97785.

c42f · 2023-03-23T19:43:02Z

Nice!! I assume you refer to JuliaLang/julia#48568

Does this also help with the older isascii in Julia 1.9?

ndinsmore · 2023-03-24T11:44:59Z

The old isascii could only do about 2 bytes per cycle in Julia <1.10; the new version does 20+ bytes per cycle. The UTF-8 code was slow, so I assume the old version should still offer a good benefit.

KristofferC · 2023-03-24T11:52:02Z

> The old isascii could only do about 2 bytes a cycle in <1.10, The new version will do 20+bytes/cycle.

Most identifiers are quite short though, and the chunk size in that PR is 1024, so I am not sure you will get the full speed out of it on your average Julia identifier. Would be interesting to benchmark.
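A rough way to check this on short strings might look like the sketch below. It uses only Base's `@elapsed` rather than BenchmarkTools, the identifier list is made up for illustration, and `Unicode.normalize(s, :NFC)` is again an assumed stand-in for the normalization JuliaSyntax performs, so the timings are indicative at best.

```julia
using Unicode

# Made-up sample of short, ASCII-only identifiers (an assumption, not measured data).
idents = ["x", "i", "length", "getindex", "similar", "AbstractArray", "checkbounds"]

always_normalize(s) = Unicode.normalize(s, :NFC)
ascii_fast_path(s) = isascii(s) ? s : Unicode.normalize(s, :NFC)

# Apply f to every identifier `reps` times; summing the byte lengths keeps
# the calls from being optimized away.
function total_length(f, strs, reps)
    n = 0
    for _ in 1:reps, s in strs
        n += ncodeunits(f(s))
    end
    return n
end

# Warm up so compilation time is not included in the measurement.
total_length(always_normalize, idents, 1)
total_length(ascii_fast_path, idents, 1)

t_slow = @elapsed total_length(always_normalize, idents, 100_000)
t_fast = @elapsed total_length(ascii_fast_path, idents, 100_000)
println("always normalize: ", t_slow, " s; isascii fast path: ", t_fast, " s")
```

Running the same script on 1.9 and on master would show how much of the gain depends on the newer isascii.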

ndinsmore · 2023-03-24T20:12:34Z

I had meant to write "on larger strings". Nonetheless, it does still speed things up by ~25%. We could likely craft an isascii that is tailored for smaller strings.

I should note that the benchmarking was also done with JuliaLang/julia#48887 applied, but that shouldn't affect the back-to-back comparison.

The thing is that when I profiled this, utf8proc_decompose was taking up the majority of the TreeNode processing time; now it doesn't even show up in the profile.

KristofferC · 2023-03-24T20:14:24Z

I totally agree that this is the right thing to do. I am just not sure the performance win is due to the improvements to isascii on master. There might be the same win in 1.9, for example.

c42f · 2023-03-25T00:51:19Z

I agree this probably works and is beneficial on 1.9. I was just hoping you'd go ahead and do the benchmark for me ;-)

I ask because JuliaSyntax isn't just for Julia 1.10+. It's tested and works all the way back to Julia 1.0, as needed to support its use in the VSCode tooling.

Anyway, I checked and it's also a big improvement on 1.9:

julia> f = read(joinpath(Sys.BINDIR, Base.DATAROOTDIR, "julia", "base", "abstractarray.jl"), String);

# before
julia> @benchmark JuliaSyntax.parseall(JuliaSyntax.SyntaxNode,$f)
BenchmarkTools.Trial: 360 samples with 1 evaluation.
 Range (min … max):  12.632 ms … 22.863 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     12.998 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   13.906 ms ±  1.894 ms  ┊ GC (mean ± σ):  5.84% ± 9.83%

  ▅██▇▅▂   ▂                                                   
  ██████▇█▇███▄▄▁▁▁▄▁▁▁▁▁▁▄▁▁▁▁▁▄▇▇▇▇▇▇▇█▆▆▄▆▇█▆▁▆█▁▁▆▆▄▄▁▁▁▄ ▇
  12.6 ms      Histogram: log(frequency) by time      19.6 ms <

 Memory estimate: 8.03 MiB, allocs estimate: 172102.

# after
julia> @benchmark JuliaSyntax.parseall(JuliaSyntax.SyntaxNode,$f)
BenchmarkTools.Trial: 496 samples with 1 evaluation.
 Range (min … max):   9.157 ms … 15.936 ms  ┊ GC (min … max): 0.00% … 35.99%
 Time  (median):      9.444 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   10.075 ms ±  1.461 ms  ┊ GC (mean ± σ):  5.73% ± 10.30%

  ▄▇█▇▅▃▁                                                      
  ███████▇▇█▅▇▅▅▇▅▄▁▁▄▄▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▆█▅▇██▆▇▇▁▇▆▁▇▇▆▄▆▄▅▄ ▇
  9.16 ms      Histogram: log(frequency) by time      14.4 ms <

 Memory estimate: 6.53 MiB, allocs estimate: 98538.

codecov · 2023-03-25T00:56:53Z

Codecov Report

Merging #228 (320738a) into main (56c33b2) will not change coverage.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main     #228   +/-   ##
=======================================
  Coverage   96.31%   96.31%           
=======================================
  Files          15       15           
  Lines        3913     3913           
=======================================
  Hits         3769     3769           
  Misses        144      144

Impacted Files	Coverage Δ
src/literal_parsing.jl	`97.87% <100.00%> (ø)`


PallHaraldsson · 2023-04-10T14:30:32Z

Do you know if the same improvement (automatically) applies in the default Julia parser? Can (and should) it otherwise be added there easily, until yours becomes the default, and maybe even be backported?

c42f · 2023-04-11T03:11:02Z

Julia's reference parser already uses an equivalent optimization internally (see the C code in fl_accum_julia_symbol).

Add isascii to normalize_identifier

0d4fdb6

minor stylistic clarification

320738a

c42f merged commit 6b947db into JuliaLang:main Mar 25, 2023

PallHaraldsson mentioned this pull request Apr 10, 2023

New Julia benchmark nordlow/compiler-benchmark#28

Open

KristofferC mentioned this pull request Apr 26, 2023

Make a global dictionary const #255

Merged
