Skip to content

Complete rewrite of tokenization/parsing #72

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 52 commits into from
Aug 6, 2022
Merged

Complete rewrite of tokenization/parsing #72

merged 52 commits into from
Aug 6, 2022

Conversation

Turnerj
Copy link
Member

@Turnerj Turnerj commented Apr 16, 2022

This is a massively breaking change internally and to anything that depended on the internal tokenization/validation system. It is changed in many fundamental ways.

The general public interface is about the same as it was with some more minor binary-breaking changes to methods and constructors.

Closes #42, closes #71 and allows access to the data in #13

Before any changes

Method Mean Error StdDev Op/s Gen 0 Gen 1 Gen 2 Allocated
FromStream 833.1 us 14.59 us 13.65 us 1,200.3 182.6172 - - 561 KB
FromString 779.2 us 15.37 us 21.04 us 1,283.3 182.6172 - - 562 KB

After the initial tokenization replacement

Method Mean Error StdDev Op/s Gen 0 Gen 1 Gen 2 Allocated
FromStream 178.1 us 3.33 us 3.11 us 5,614.1 22.4609 - - 69 KB
FromString 170.9 us 1.40 us 1.17 us 5,850.9 11.4746 - - 35 KB

After FromStream actually processing it via a stream

Method Mean Error StdDev Op/s Gen 0 Gen 1 Gen 2 Allocated
FromStream 262.0 us 4.68 us 4.15 us 3,817.3 11.7188 - - 36 KB
FromString 174.2 us 3.28 us 3.07 us 5,739.6 11.4746 - - 35 KB

After FromString sharing the same code path as FromStream

Method Mean Error StdDev Op/s Gen 0 Gen 1 Gen 2 Allocated
FromStream 268.0 us 4.58 us 4.29 us 3,731.8 11.7188 - - 36 KB
FromString 191.6 us 3.74 us 4.16 us 5,218.7 11.4746 - - 35 KB

Final Benchmark Results for RobotsFileParser

Method Mean Error StdDev Op/s Gen 0 Gen 1 Gen 2 Allocated
FromStream 122.58 us 2.034 us 2.498 us 8,158.1 11.7188 - - 36 KB
FromString 61.23 us 0.775 us 0.725 us 16,331.7 11.4746 - - 35 KB

FromStream: 95% faster, 94% fewer allocations
FromString: 92% faster, 94% fewer allocations


Before any changes

Method Mean Error StdDev Op/s Gen 0 Gen 1 Gen 2 Allocated
FromRules 51.36 us 0.713 us 0.632 us 19,468.8 13.4888 - - 41 KB

Initial LINQ-heavy rewrite

Method Mean Error StdDev Op/s Gen 0 Gen 1 Gen 2 Allocated
FromRules 11.00 us 0.070 us 0.062 us 90,949.7 3.5248 - - 11 KB

Post-LINQ optimizations and performance improvements

Method Mean Error StdDev Op/s Gen 0 Gen 1 Gen 2 Allocated
FromRules 4.168 us 0.0825 us 0.0883 us 239,922.8 0.8011 - - 2 KB

FromRules: 92% faster, 95% fewer allocations

@Turnerj Turnerj added the enhancement New feature or request label Apr 16, 2022
Turnerj added 12 commits July 25, 2022 21:24
This is a massively breaking change internally and to anything that depended on the internal tokenization/validation system. It is changed in many fundamental ways.

The general public interface is about the same as it was with some more minor binary-breaking changes to methods and constructors.
This makes some of the helper methods in NoRobotsRfcHelper redundant but also simplifies handling of tokens in any code that uses RobotsFileTokenReader.
We can do this because there is no official spec here. We're not typically expecting there to be new lines in the data we received though.
@Turnerj Turnerj force-pushed the replace-tokenizer branch from cf7557f to 84a8d97 Compare July 25, 2022 11:54
@Turnerj
Copy link
Member Author

Turnerj commented Jul 30, 2022

Performance improvements for PathComparisonUtility tested with .NET 6.

Before

Method Mean Error StdDev Op/s Gen 0 Gen 1 Gen 2 Allocated
Simple 265.1 ns 5.03 ns 7.83 ns 3,772,531.2 0.0305 - - 96 B
Wildcard 341.4 ns 2.78 ns 2.60 ns 2,929,084.7 0.0558 - - 176 B
MustEndWith 353.1 ns 4.08 ns 3.82 ns 2,831,954.1 0.0458 - - 144 B

After

Method Mean Error StdDev Op/s Gen 0 Gen 1 Gen 2 Allocated
Simple 85.06 ns 1.528 ns 1.429 ns 11,756,811.4 - - - -
Wildcard 128.39 ns 1.107 ns 0.924 ns 7,788,603.9 - - - -
MustEndWith 131.07 ns 1.487 ns 1.318 ns 7,629,488.6 - - - -

Results
Simple: 3.1x faster
Wildcard: 2.65x faster
MustEndWith: 2.69x faster

All with zero allocations!

@Turnerj
Copy link
Member Author

Turnerj commented Aug 1, 2022

Current progress has a little bit of a regression - guessing it still is about that hand off to ProcessLine etc

Method Mean Error StdDev Op/s Gen 0 Gen 1 Gen 2 Allocated
FromStream 285.5 us 4.46 us 3.96 us 3,502.3 11.7188 - - 36 KB
FromString 203.1 us 2.17 us 1.81 us 4,924.2 11.4746 - - 35 KB

@Turnerj
Copy link
Member Author

Turnerj commented Aug 3, 2022

After a bunch of back and forth, I realised rather than processing the characters directly in a span, I could process the bytes directly to avoid a bunch of overhead from going back and forth.

Method Mean Error StdDev Op/s Gen 0 Gen 1 Gen 2 Allocated
FromStream 199.3 us 3.93 us 6.00 us 5,017.7 11.7188 - - 36 KB
FromString 130.2 us 1.44 us 1.20 us 7,680.2 11.2305 - - 35 KB

This avoids a bunch of UTF-8 conversion interop
@Turnerj
Copy link
Member Author

Turnerj commented Aug 4, 2022

Items left to do:

  • Fix CI builds (will need a preview version of the SDK due to the preview version of C# being used)
  • Go through the RobotsPageParser for optimizations and benchmarking

@Turnerj
Copy link
Member Author

Turnerj commented Aug 5, 2022

Initial benchmark for RobotsPageParser. The file isn't anywhere near as big as the file RobotsFileParser needs to process but we really don't expect many rules for individual pages anyway.

Method Mean Error StdDev Op/s Gen 0 Gen 1 Gen 2 Allocated
FromRules 11.00 us 0.070 us 0.062 us 90,949.7 3.5248 - - 11 KB

@Turnerj
Copy link
Member Author

Turnerj commented Aug 6, 2022

More efficient processing of directives by using a mutable structure and dropping all of the LINQ processing.

Method Mean Error StdDev Op/s Gen 0 Gen 1 Gen 2 Allocated
FromRules 4.556 us 0.0497 us 0.0440 us 219,504.6 0.8011 - - 2 KB

FromRules is now ~59% faster and allocate ~82% less

@Turnerj Turnerj merged commit afcf3cc into main Aug 6, 2022
@Turnerj Turnerj deleted the replace-tokenizer branch August 6, 2022 13:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Disallow: not working without a space Enhancing Tokenization and Parsing
1 participant