@syedazeez337

Replace the hash-based std::unordered_map vocabulary storage with bitset flags for O(1) lookups. This reduces compilation time from 4618 ms to 2195 ms (-52%) and vocabulary lookup CPU overhead from 31.3% to 6.5% (-79%). Includes comprehensive testing across 40+ real-world schemas and full test suite validation with zero regressions.
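
For readers unfamiliar with the technique, here is a minimal sketch of the general idea, not the code in this pull request. It assumes the set of known vocabularies is small and fixed, so each one can be assigned a bit position up front. The names `Vocabulary`, `VocabularySet`, and `vocabulary_from_uri` are hypothetical and chosen only for illustration; the URIs are the standard JSON Schema 2020-12 vocabulary identifiers. The URI string is compared once when a schema is read, and every later membership check becomes a single bit test with no hashing or string comparison.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical list of known vocabularies (illustrative, not the project's
// actual list). Each enumerator doubles as a bit position in the flag set.
enum class Vocabulary : std::size_t {
  Core = 0,
  Applicator,
  Validation,
  Unevaluated,
  Count // Number of known vocabularies; must stay last
};

// A fixed-size set of vocabularies backed by std::bitset.
class VocabularySet {
public:
  // Mark a vocabulary as enabled
  void enable(const Vocabulary vocabulary) { flags_.set(index(vocabulary)); }

  // Constant-time membership check: one bit test, no hashing
  bool contains(const Vocabulary vocabulary) const {
    return flags_.test(index(vocabulary));
  }

  // Number of enabled vocabularies
  std::size_t size() const { return flags_.count(); }

private:
  static std::size_t index(const Vocabulary vocabulary) {
    const auto position = static_cast<std::size_t>(vocabulary);
    assert(position < static_cast<std::size_t>(Vocabulary::Count));
    return position;
  }

  std::bitset<static_cast<std::size_t>(Vocabulary::Count)> flags_;
};

// Translate a vocabulary URI into its flag. This is the only place where
// strings are compared; unknown URIs yield std::nullopt.
std::optional<Vocabulary> vocabulary_from_uri(const std::string_view uri) {
  if (uri == "https://json-schema.org/draft/2020-12/vocab/core") {
    return Vocabulary::Core;
  }
  if (uri == "https://json-schema.org/draft/2020-12/vocab/applicator") {
    return Vocabulary::Applicator;
  }
  if (uri == "https://json-schema.org/draft/2020-12/vocab/validation") {
    return Vocabulary::Validation;
  }
  if (uri == "https://json-schema.org/draft/2020-12/vocab/unevaluated") {
    return Vocabulary::Unevaluated;
  }
  return std::nullopt;
}
```

A caller would convert each URI with `vocabulary_from_uri` while reading the schema, call `enable` on the result, and use `contains` on the hot path. The trade-off in this sketch is that custom or unknown vocabularies cannot be represented by a flag and would need a separate fallback.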

@jviotti
Member

jviotti commented Nov 10, 2025

Looks interesting! Once CI passes, we will send the benchmark results. Right now the checks are failing on formatting. Make sure to run "make" on your machine to re-format the code automatically using ClangFormat.

@syedazeez337 syedazeez337 force-pushed the feature/bitset-vocabulary-optimization branch 6 times, most recently from b9d603a to b730b7f Compare November 11, 2025 09:10
Signed-off-by: Azeez Syed <syedazeez337@gmail.com>
@syedazeez337 syedazeez337 force-pushed the feature/bitset-vocabulary-optimization branch from b730b7f to 2c5d61c Compare November 11, 2025 09:45
@jviotti
Member

jviotti commented Nov 11, 2025

Oh, the repo benchmarks don't run on forks... I forgot we hit an upstream issue with this: https://github.com/sourcemeta/blaze/blob/main/.github/workflows/ci.yml#L145-L147.

Can you post the before/after results (i.e. with and without this commit) for "make benchmark"?

@syedazeez337
Author
