Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TMBitset: Bit Twiddling Hacks #598

Open
yakra opened this issue Jul 27, 2023 · 0 comments
Open

TMBitset: Bit Twiddling Hacks #598

yakra opened this issue Jul 27, 2023 · 0 comments

Comments

@yakra
Copy link
Contributor

yakra commented Jul 27, 2023

https://graphics.stanford.edu/~seander/bithacks.html

insert, lookup, etc.

Use bitwise ops instead of modulus & division, e.g.:
index % 8 -> index & 7
index / 8 -> index >> 3
This is an oversimplification of course, as TMBitset uses the ubits datum. Making a couple new constexprs is simple enough.
Of course, all this probably doesn't matter in practice; the compiler almost certainly already optimizes all this.

count() or size()

A count() or size() function akin to unordered_set::size() isn't implemented, as I just haven't needed one yet.
I've thought about iterating byte-by-byte and using a lookup table, similar to this, but there are better methods available.
Low priority though; no need to worry about this until a need arises in production.
Until then, if I just want a bit count for debug code (which I'm not worried about optimizing), I can just use std::distance and call it a day. No need for a new member function.
x86 has POPCNT; I believe there's an intrinsic for that.

iteration

There's some pretty interesting stuff here.
Incrementing TMBitset::iterator involves more iteration under the hood.
In the very first draft, it iterated through all the bits in each unit of the data, and when it hit a 1, iteration "paused" and our iterator became visible to the outside world.
Later I improved on this by iterating only as far as the leftmost 1. Bit shifting right as we go along, once this 1 disappears, the value of our bits hits 0, and we can move along to the next unit.
There's still room for improvement though. Consider the worst (32-bit) case, a unit of 0xC0000000. Only one 1, all the way left. We don't get to skip ahead early, and must iterate through 31 0s to get there. That's pretty inefficient. It'd be great to know where the next 1 is, and skip straight to it.
This is where "Count the consecutive zero bits (trailing) on the right" comes in.
During development I tried a couple methods involving lookup tables, but didn't hit on the idea of reducing table size via modulus or multiplication. No better performance. That, and a 32 KB table for uint16_t units is starting to get a bit big. 4GB for uint32_t? Forget about it! In contrast, the lookup table used for uint32_ts by the multiplication technique can fit in half a cache line.
What I'm doing now is comparable to the linear search method.
There are a few possibilities to play around with, though as this video with a variation on the parallel technique notes, there's an x86 instruction for this. Clang and gcc have the __builtin_ctz intrinsic. It appears that XCode uses Clang, so this intrinsic should still work, just doing whatever it has to do for @jteresco's M1 chip. If we're worried about supporting other compilers, conditional compilation is a possibility.

All that said, the improvement this stuff provides will be tiny.
Especially with subgraph generation, there are much bigger things effecting efficiency.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant