Request for a replication of observations #229
Replies: 4 comments
-
For a 1 million sieve size, you would use either 1 MiB (one byte per entry) or 128 KiB (one bit per entry). My CPU has 256 KiB of L2 cache, so using only one bit per entry lets the entire sieve fit in L2 with half of it to spare, whereas with one byte per entry it does not. I suspect that this is the cause. Try to run your program under …
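Just to make the arithmetic concrete, here is a minimal sketch (my own, not from the code under discussion) that checks whether each representation would fit in a 256 KiB L2:

```cpp
#include <cstdio>

int main() {
    constexpr std::size_t sieve_size = 1'000'000;   // one entry per candidate number
    constexpr std::size_t l2_bytes   = 256 * 1024;  // 256 KiB L2, as on the CPU above

    // One byte per entry (vector<char>): 1,000,000 bytes, roughly 977 KiB.
    std::size_t byte_field = sieve_size;

    // One bit per entry (vector<bool> packs 8 entries per byte): roughly 122 KiB.
    std::size_t bit_field = (sieve_size + 7) / 8;

    std::printf("byte field: %zu bytes, fits in L2: %s\n",
                byte_field, byte_field <= l2_bytes ? "yes" : "no");
    std::printf("bit  field: %zu bytes, fits in L2: %s\n",
                bit_field, bit_field <= l2_bytes ? "yes" : "no");
}
```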
-
Thanks! That makes sense: the performance hit that I'm seeing is an accumulation of L2 cache misses. Thinking about it from the cache perspective may also explain some other weird behavior that I was seeing last night. So, this may be a perfect example of trying to trade off more memory for less time, only to find that more memory IS more time.
-
If you want to see this monstrosity that appears to work, it is here:
-
On the other hand, as your sieve size increases, the parallelization improvements start to outweigh the cache problems — especially once the total dataset gets so large that even one bit per value overflows the cache. In my own tests, at 10,000,000 values the parallel version just started to outperform the linear version (by about 15%), and that grew to 2.4x better at 1,000,000,000 values. Edit: Oh, and you can keep the memory savings of the bit packing if you change the way you parallelize things: break things up by range of values tested rather than by prime factor, as in the sketch below.
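To illustrate what I mean by splitting on the range of values tested, here is a rough sketch (the function name and structure are mine, not taken from the code under discussion): each thread sieves its own contiguous range into a private, bit-packed segment, so the memory savings are kept and no two threads ever write to the same storage.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <thread>
#include <vector>

// Sketch: find the small primes up to sqrt(n) serially, then hand each thread a
// contiguous range [lo, hi) with its own private bit field to cross off.
std::vector<std::vector<bool>> sieve_by_range(std::uint64_t n, unsigned threads) {
    std::uint64_t limit = static_cast<std::uint64_t>(std::sqrt(double(n))) + 1;
    std::vector<bool> small(limit + 1, true);
    std::vector<std::uint64_t> primes;
    for (std::uint64_t p = 2; p <= limit; ++p) {
        if (!small[p]) continue;
        primes.push_back(p);
        for (std::uint64_t m = p * p; m <= limit; m += p) small[m] = false;
    }

    std::vector<std::vector<bool>> segments(threads);
    std::vector<std::thread> pool;
    std::uint64_t chunk = (n + threads) / threads;  // ceil((n + 1) / threads)

    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&, t] {
            std::uint64_t lo = t * chunk;
            std::uint64_t hi = std::min(n + 1, lo + chunk);
            if (lo >= hi) return;                  // nothing left for this thread
            std::vector<bool>& seg = segments[t];
            seg.assign(hi - lo, true);             // private bit field for this range
            if (lo == 0) { seg[0] = false; if (hi > 1) seg[1] = false; }
            for (std::uint64_t p : primes) {
                std::uint64_t start = std::max(p * p, ((lo + p - 1) / p) * p);
                for (std::uint64_t m = start; m < hi; m += p)
                    seg[m - lo] = false;           // only this thread touches seg
            }
        });
    }
    for (auto& th : pool) th.join();
    return segments;  // segments[t][i] is true if (t * chunk + i) survived the sieve
}
```

Because each segment is owned by exactly one thread, the read-modify-write on packed bits is harmless: there is no contention to recover from, only the usual per-thread bit twiddling.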
-
My first thought in trying to make a parallel sieve algorithm was to trade memory for time. To that end, I changed the bit field that the program was operating on into a byte field: since the processor can address individual bytes, you don't need a read-modify-write when flagging a number as composite. It becomes just a write, which has no hazards.
Which is to say, I changed vector<bool> to vector<char>.
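For reference, a minimal sketch of the two writes being compared (the function names are mine, just for illustration):

```cpp
#include <vector>

// Bit field: vector<bool> packs flags into bits, so clearing one flag is a
// read-modify-write of the word that contains it.
void mark_composite_bits(std::vector<bool>& sieve, std::size_t n) {
    sieve[n] = false;   // load word, clear one bit, store word
}

// Byte field: vector<char> gives each flag its own addressable byte, so the
// same operation is a single store with no read-modify-write hazard.
void mark_composite_bytes(std::vector<char>& sieve, std::size_t n) {
    sieve[n] = 0;       // plain byte store
}
```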
What I noticed, though, and what I would like verified, is that the negative performance impact from this change is so large that no performance increase from multi-threading could ever recover from it. Does everyone else see that? Or, is my code just absolute garbage?
Thanks