Skip to content

Conversation

@brian-pane
Copy link

No description provided.

@brian-pane
Copy link
Author

I'm posting this for discussion because it produces mixed results on my test system:

  • Improvement in cycles and wall time at compression levels 2 and 9
  • Regression in instructions (but no change in cycles or wall time) at levels 3-8.

Whether it helps is probably dependent on the CPU and the input data.

Comment on lines +195 to +196
if best_len < core::mem::size_of::<u64>() {
let scan_val = u64::from_ne_bytes(
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also tried if best_len < 32 { followed by the equivalent AVX2 implementation (similar to how compare32 processes each 32=byte chunk), but it was slower overall.

Copy link
Member

@folkertdev folkertdev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a slightly higher instruction count (correctly predicted branches?) is totally acceptable if we get a performance boost, especially because levels 3,4, 7 and 8 are rarely used in practice.

And the gains at level 2 and 9 are very real, so I think this is OK. Thanks!

@folkertdev folkertdev merged commit 4ca72cd into trifectatechfoundation:main Jan 14, 2025
20 checks passed
@brian-pane brian-pane deleted the longest-match64 branch April 1, 2025 22:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants