Reduce size of Bytes #20

Open · wants to merge 1 commit into main

Conversation

@emilyyyylime

I was randomly looking at the source of this package for inspiration on an unrelated project which required a compressed trie data structure, and I soon noticed that the Bytes struct, whose size constitutes the bottleneck for the maximum size of words that can be hyphenated without allocation, had a significant amount of extraneous storage.

The array::IntoIter type stores two usize indices in order to implement DoubleEndedIterator, and we stored an extra usize just for the length of the iterator, which can already be computed from the iterator itself.
Switching to storing the array inline and reimplementing the iterator methods lowers this overhead by 16 bytes (from three usizes to just one). This doesn't even make the code look much lower-level.
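
Roughly, the change amounts to something like the sketch below. The names, the capacity, and the right-alignment trick are illustrative assumptions to show how a single index can drive the iterator; they are not necessarily the exact hypher internals.

```rust
use std::array;

const N: usize = 47; // illustrative inline capacity

// Before: the iterator state alone is the array plus two usize cursors
// (front and back, required to implement DoubleEndedIterator), and we
// kept a third usize for a length that is derivable from the cursors.
struct BytesBefore {
    iter: array::IntoIter<u8, N>, // [u8; N] + 2 * usize
    len: usize,                   // redundant
}

// After: the array is stored inline, and a single cursor suffices if
// the content is kept right-aligned, so iteration always ends at N.
struct BytesAfter {
    data: [u8; N],
    cursor: usize,
}

impl BytesAfter {
    fn new(bytes: &[u8]) -> Self {
        assert!(bytes.len() <= N);
        let mut data = [0; N];
        data[N - bytes.len()..].copy_from_slice(bytes);
        Self { data, cursor: N - bytes.len() }
    }
}

impl Iterator for BytesAfter {
    type Item = u8;
    fn next(&mut self) -> Option<u8> {
        let byte = *self.data.get(self.cursor)?;
        self.cursor += 1;
        Some(byte)
    }
}
```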

We may further reduce memory usage by realising that our indices have low bounds, so a single u8 suffices for the index.
By making that index a NonZeroU8, we also remove memory overhead when the alloc feature is enabled, thanks to niche optimisation.
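
The niche effect is easy to observe with std::mem::size_of. In the sketch below, Inline and Repr are hypothetical stand-ins for an inline-or-allocated representation behind the alloc feature, not hypher's actual types; the assumption is that zero is never a valid cursor value.

```rust
use std::mem::size_of;
use std::num::NonZeroU8;

// Hypothetical inline variant. Because the cursor can never be zero,
// the zero bit pattern is a free "niche" for an enum discriminant.
struct Inline {
    data: [u8; 47],
    cursor: NonZeroU8,
}

// Stand-in for an inline-or-heap representation behind `alloc`.
enum Repr {
    Inline(Inline),
    Heap(Vec<u8>),
}

fn main() {
    // Guaranteed by the language: Option's tag hides in the zero niche.
    assert_eq!(size_of::<Option<NonZeroU8>>(), size_of::<NonZeroU8>());

    // In practice rustc applies the same trick to Repr, so the enum
    // needs no tag byte beyond its largest variant (plus padding).
    println!("Inline: {} bytes, Repr: {} bytes", size_of::<Inline>(), size_of::<Repr>());
}
```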

Previously, the max word size had been documented as 40 bytes in some places and 41 in others, whilst in practice it was only 38, as two bytes are reserved for matching the start and end of words.
With this change, the (true) max increases to 45 bytes, as alignment restrictions mean any maximum length between 38 and 45 (inclusive) results in the same size for the Bytes struct.

The previous size of Bytes was 72 bytes, bringing the size of Syllables to 96. The new sizes are 48 and 72, respectively.
If we were to push Bytes back up to its previous size of 72, we could store as many as 69 bytes of word without allocating.
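
These numbers are easy to sanity-check. Assuming the inline storage holds the word plus two delimiter bytes and one index byte, padded to an 8-byte boundary (my reading of the figures above, not something stated in the source):

```rust
// Payload = word + 2 delimiter bytes + 1 index byte, rounded up to 8.
fn padded_size(word_len: usize) -> usize {
    (word_len + 2 + 1).next_multiple_of(8)
}

fn main() {
    // Every max word length from 38 to 45 lands on the same 48 bytes.
    for len in 38..=45 {
        assert_eq!(padded_size(len), 48);
    }
    // And 69 bytes of word is exactly what the old 72-byte budget fits.
    assert_eq!(padded_size(69), 72);
}
```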

Additionally, I opted to expose a new public constant MAX_WORD_LEN to avoid a future desync between the actual max length and the documented one, especially for downstream users.
I also added tests to ensure that constant is correct. The old code passes those tests for a MAX_WORD_LEN value of 38.
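
In spirit, the constant and its test look something like the following; the capacity derivation mirrors the sketch above and is my own illustration, while the real test exercises hypher's hyphenation entry point.

```rust
const INLINE_SIZE: usize = 48; // padded size of Bytes, as above
const DELIMITERS: usize = 2;   // start and end markers
const INDEX: usize = 1;        // the single u8 cursor

/// The longest word, in bytes, that can be hyphenated without allocating.
pub const MAX_WORD_LEN: usize = INLINE_SIZE - DELIMITERS - INDEX;

#[cfg(test)]
mod tests {
    #[test]
    fn max_word_len_matches_layout() {
        assert_eq!(super::MAX_WORD_LEN, 45);
    }
}
```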

I haven't found a performance regression on my machine using cargo bench -p hypher-bench. I'm not sure if that's how that extra crate is meant to be used; usually I just see benchmarks specified in the bench section of Cargo.toml.

@emilyyyylime (Author)

Sorry, oversight on my part there.

@laurmaedje requested a review from @saecki on June 12, 2025 at 08:08
@saecki (Member) left a comment

Thank you! I quite like this optimization. I've left a few little nitpicky comments, but otherwise I think it looks good!

@laurmaedje (Member) left a comment

just two minor nitpicks. otherwise, lgtm!

@emilyyyylime (Author)

rust-analyzer trolling me
