bitfield optimization of phase2 #120
Conversation
…fies loops that iterate files, by not having them read larger chunks, but considering one entry at a time.
src/bitfield_index.hpp
Outdated
struct bitfield_index
{
    // cache the number of set bits evey n bits. This is n. For a bitfield of
    // size 2^32, this means a 2 MiB index
Suggested comment change:
// Cache the number of set bits every kIndexBucket bits.
// For 2^32 entries, this means a 200 KiB index.
bitfield_index const idx(bitfield);
CHECK(idx.lookup(1048576 - 3, 1) == std::pair<uint64_t, uint64_t>{0,1});
CHECK(idx.lookup(1048576 - 2, 1) == std::pair<uint64_t, uint64_t>{1,1});
}
Suggest boundary condition tests for bitfield_index:
// One past bucket
idx.lookup(pos=0, offset=kIndexBucket)
// One past bitfield.size()
// Init a bitfield_index with a bitfield that has .size == kIndexBucket-1, kIndexBucket, kIndexBucket+1
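A possible shape for those tests (a sketch only, not code from this PR; it assumes kIndexBucket is reachable as bitfield_index::kIndexBucket, that bitfield has a set() member, and that lookup()'s first component counts the set bits before pos, as the existing CHECKs suggest; include paths are illustrative):

```cpp
#include <cstdint>
#include "catch.hpp"
#include "bitfield.hpp"
#include "bitfield_index.hpp"

TEST_CASE("bitfield_index straddles a bucket boundary")
{
    int64_t const n = bitfield_index::kIndexBucket;

    // bitfield sizes just below, at, and just above one bucket
    for (int64_t const size : {n - 1, n, n + 1}) {
        bitfield bf(size);
        bf.set(0);
        bf.set(size - 1);
        bitfield_index const idx(bf);

        // from the first set bit to the last one, i.e. (almost) one full bucket
        auto const r = idx.lookup(0, size - 1);
        CHECK(r.first == 0u);   // the first set bit has no set bits before it
        CHECK(r.second == 1u);  // exactly one set bit is counted over the span,
                                // matching the existing lookup(pos, 1) == {rank, 1} checks
    }
}
```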
I used to have tests for unset bits, checking that they would throw. But this is a pretty hot function, so I backed out that check. Now it's up to the caller to only use valid positions and offsets (hence the asserts).
src/bitfield_index.hpp
Outdated
uint64_t const bucket = pos / kIndexBucket;

assert(bucket < index_.size());
assert(pos < bitfield_.size());
assert(pos < bitfield_.size());
I see. So, it's okay for pos >= kIndexBucket?
Yes, they are different domains. kIndexBucket is the interval at which counts of set bits are precomputed; the index of those precomputed counts has bitfield_.size() / kIndexBucket entries. It's OK for pos to be greater than bucket * kIndexBucket, but only as long as it's < bitfield_.size().
{
    uint64_t counter = 0;
    auto it = bitfield_.begin();
    index_.reserve(bitfield_.size() / kIndexBucket);
index_.reserve(bitfield_.size() / kIndexBucket);
Should this be index_.reserve((bitfield_.size() / kIndexBucket) + 1);, or is the index not used for the last bucket, where the bucket size < kIndexBucket?
no, I think this is right. This provides an index for the number of set bits at the start of every kIndexBucket bits. So it rounds down.
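To make the two domains concrete, here is a small, self-contained toy (not the actual bitfield_index code; the name, the std::vector<bool> storage, and the 1024 bucket size are made up for illustration): the index has one entry per kIndexBucket bits, while positions range over the whole bitfield.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy rank index: one precomputed count of set bits per kIndexBucket bits.
struct toy_rank_index
{
    static constexpr std::size_t kIndexBucket = 1024;

    explicit toy_rank_index(std::vector<bool> const& bits) : bits_(bits)
    {
        // one entry per bucket start; reserve rounds down, which is only a
        // capacity hint (push_back still adds an entry for a trailing partial bucket)
        index_.reserve(bits_.size() / kIndexBucket);
        std::uint64_t counter = 0;
        for (std::size_t i = 0; i < bits_.size(); ++i) {
            if (i % kIndexBucket == 0) index_.push_back(counter);
            counter += bits_[i] ? 1 : 0;
        }
    }

    // number of set bits in [0, pos). pos may be anywhere in the bitfield
    // (pos < bits_.size()); only pos / kIndexBucket must be a valid index_ slot.
    std::uint64_t rank(std::size_t pos) const
    {
        assert(pos < bits_.size());
        std::size_t const bucket = pos / kIndexBucket;   // index_ domain
        std::uint64_t counter = index_[bucket];          // precomputed prefix count
        for (std::size_t i = bucket * kIndexBucket; i < pos; ++i)
            counter += bits_[i] ? 1 : 0;                 // at most kIndexBucket - 1 steps
        return counter;
    }

    std::vector<bool> const& bits_;
    std::vector<std::uint64_t> index_;
};
```

The real bitfield_index counts the tail with popcnt rather than bit by bit (hence the -mtune discussion further down), and its lookup() returns a pair rather than a single count, but the bucket arithmetic is the same idea.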
This is large, so I'm sending my first wave of comments now.
src/phase2.hpp
Outdated
// table 7, where we don't use the sort_manager; then it's used as a write
// cache for the table, as we update it.

// As the last step, we compact table 1. At that point the halfes are also
halfes -> halves or halfs
this comment is removed in a later commit, so I think I'll leave it misspelled. If I updated this commit, it would make the changes harder to review, since they wouldn't come in as commits on top of what's already here.
src/phase2.hpp
Outdated
@@ -149,100 +157,100 @@ std::vector<uint64_t> RunPhase2(
        0,
        strategy_t::quicksort);

    if (table_index == 7) {
        // we don't use the sort manager in this case, so we can use the
        // memort as a write buffer instead
memort
same thing here. It's supposed to say "memory", but this comment is removed in a later commit, when the memory is heap allocated instead.
src/disk.hpp
Outdated
NeedReadCache();
// all allocations need 7 bytes head-room, since
// SliceInt64FromBytes() may overrun by 7 bytes
if (read_buffer_start_ <= begin && read_buffer_start_ + read_buffer_size_ > begin + length + 7) {
    // if the read is entirely inside the buffer, just memcopy it out
just memcopy it out -> just pass a pointer to it
void Read(uint64_t begin, uint8_t *memcache, uint64_t length) override
uint8_t const* Read(uint64_t begin, uint64_t length) override
What if the requested read is bigger than read_ahead? Since we pass a pointer out, a valid read can only be as large as that buffer.
we only read one entry at a time, and I think the largest entry is 7 bytes. I suppose there could be an assert though
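For reference, a sketch of what that assert could look like, wrapped in a toy in-memory reader (illustrative only, not the PR's BufferedDisk; field names are made up, and the 4 MiB figure matches the read-ahead size mentioned in the benchmark comment further down):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

struct toy_buffered_reader
{
    static constexpr uint64_t read_ahead = 4 * 1024 * 1024;  // assumed cache size

    explicit toy_buffered_reader(std::vector<uint8_t> file) : file_(std::move(file)) {}

    uint8_t const* Read(uint64_t const begin, uint64_t const length)
    {
        // a returned pointer can only cover what fits in the cache, including
        // the 7 bytes of head-room SliceInt64FromBytes() may overrun by
        assert(length + 7 <= read_ahead);
        assert(begin + length <= file_.size());

        if (begin < cache_start_ || begin + length + 7 > cache_start_ + cache_.size()) {
            // refill the cache starting at `begin`
            uint64_t const n = std::min<uint64_t>(read_ahead, file_.size() - begin);
            cache_.assign(file_.begin() + begin, file_.begin() + begin + n);
            cache_start_ = begin;
        }
        return cache_.data() + (begin - cache_start_);
    }

    std::vector<uint8_t> file_;   // stands in for the file on disk
    std::vector<uint8_t> cache_;  // the read-ahead buffer
    uint64_t cache_start_ = 0;
};
```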
@@ -42,11 +42,11 @@ FetchContent_Declare(
FetchContent_MakeAvailable(pybind11-src)

IF (CMAKE_BUILD_TYPE STREQUAL "RELEASE")
set (CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O3")
set (CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -O3")
set (CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O3 -mtune=native")
@arvidn isn't this a potential problem if you want to create binaries to distribute?
it would be, if we built such a binary distribution on an exotic machine. The problem with not tuning it at all is that GCC doesn't even think it has access to popcnt, which the bitfield_index relies on.
@aqk I had a few typos in my additional unit test that catch (for some reason) didn't report, so I didn't notice until CI ran and detected a non-zero return code. I've fixed that now. I want to squash all the fixup commits into the changes they belong to, but to make my changes easier to review, I've left them in as new commits on top for now. Please take a look, then I'll squash.
I ran some large benchmarks overnight (probably a lot larger than they had to be). This was on macOS, on a fairly powerful laptop plotting to a slow spinning disk connected over USB, so it was most definitely I/O bound. I wanted to avoid running a test so small that it would fit in RAM, and I don't have all that much free space on the internal SSD. But it might be nice to get an idea of what the numbers look like on a more reasonable setup, with a large SSD as the plotting target.

Total time with current master: 9:17:47.59. This patch was 6 minutes and 33 seconds faster. Not a whole lot given a > 9 hour runtime. This is the timing output from the runs, starting with current master:
Now, for this patch:
That's only a 1.18% reduction in runtime. Given that it's still performing 12% fewer writes (at the expense of 2% more reads) and it's not making things slower (at least not dramatically), I think it's still a viable change. I suspect that one reason it's not improving runtime all that much on a spinning disk is that it (probably) increases seek times, since phase 3 now reads from two separate sort managers, meaning it will keep two read cursors on disk that it will have to alternate between. Right now I set the disk read-ahead (and write) buffers to 4 MiB, as that's the minimum extent size on ext4 (iirc). I'll experiment with making that larger, say, 8 or 16 MiB.
I ran another test of this patch with a 16 MiB read and write cache. It took almost 1 hour longer than the two previous tests. I'm thinking there's probably a lot of noise in the way I'm testing this, most likely introduced by the spinning disk. Interestingly, almost all of the additional time happened in phase 1, which should be unaffected by my patch anyway. There may be some opportunity for improvements in phase 1 on spinning disks.
(branch force-pushed from 95b58cd to af2bee9)
squashed fixups
This avoids some sort-on-disk in phase 2 (back-propagation) by recording which entries in table_index are used by table_index + 1. Phase 2 alternates between two bitfields, current_bitfield and previous_bitfield: while a table is scanned, the entries it references in the table below are recorded in one bitfield, and the bitfield built when the table above was processed determines which of the current table's own entries survive; at the end of each table the roles are swapped (current_bitfield = previous_bitfield).
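To make the compaction and remapping idea concrete, here is a small, self-contained toy (made-up tables, entry layout, and names; it is not the PR's phase 2 code, and it doesn't show the alternation of the two bitfields across tables):

```cpp
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

// Each entry of an upper table references rows pos and pos+offset of the
// table below it (as the plot tables do).
struct Entry { uint64_t pos; uint64_t offset; };

int main()
{
    std::vector<Entry> upper = { {0, 2}, {5, 1}, {5, 3} };  // uses lower rows 0,2,5,6,8
    std::size_t const lower_size = 10;

    // pass 1: record which rows of the lower table are actually used
    std::vector<bool> used(lower_size, false);
    for (Entry const& e : upper) {
        used[e.pos] = true;
        used[e.pos + e.offset] = true;
    }

    // prefix counts of set bits: new_pos[i] = number of used rows before row i.
    // (bitfield_index precomputes these every kIndexBucket bits instead of for
    // every row, and fills in the rest with popcounts.)
    std::vector<uint64_t> new_pos(lower_size, 0);
    for (std::size_t i = 1; i < lower_size; ++i)
        new_pos[i] = new_pos[i - 1] + (used[i - 1] ? 1 : 0);

    // pass 2: rewrite the upper table against the compacted lower table,
    // keeping the same (pos, offset) encoding
    for (Entry& e : upper) {
        uint64_t const p = new_pos[e.pos];
        uint64_t const q = new_pos[e.pos + e.offset];
        e = Entry{p, q - p};
    }

    for (Entry const& e : upper)
        std::cout << "pos=" << e.pos << " offset=" << e.offset << "\n";
    // lower rows 0,2,5,6,8 survive and become 0,1,2,3,4, so this prints:
    // pos=0 offset=1, pos=2 offset=1, pos=2 offset=2
}
```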
In order to hand over the tables in the correct order, sorted as the current phase 2 does, the "write-back" step is really done into a SortManager object. This object is then passed on to phase 3, as if it were a Disk.

Enabling a SortManager to implement the Disk interface required a few changes. The Disk interface now returns a pointer to the entry when reading from it. This is how SortManager returns data, and I envision it is how we will actually return data if we transition to memory-mapped files too.
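As a small illustration of that interface change (both Read() signatures are quoted in the disk.hpp diff above), a caller written against the pointer-returning form no longer needs its own scratch buffer; EntryReader here is a hypothetical stand-in, not the actual Disk class:

```cpp
#include <cstdint>

// Hypothetical stand-in for the pointer-returning interface.
struct EntryReader
{
    virtual ~EntryReader() = default;
    // new style: return a pointer to the entry's bytes instead of copying
    // them into a caller-provided buffer
    virtual uint8_t const* Read(uint64_t begin, uint64_t length) = 0;
};

// Whether the reader is a buffered file or a sort manager handing out entries
// from its buckets, the caller just reads through the returned pointer.
uint64_t first_byte(EntryReader& reader, uint64_t entry_pos, uint64_t entry_size)
{
    uint8_t const* entry = reader.Read(entry_pos * entry_size, entry_size);
    return entry[0];
}
```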
To allow passing a SortManager from one phase to the next, they now heap allocate their memory, rather than relying on being passed a buffer that is later reused.

Another optimization of the algorithm described above is that the last table, table 1, isn't sorted; it just needs to be compacted. Instead of running a read and write-back pass over table 1, there's yet another abstraction implementing the Disk interface: FilteredDisk. It wraps a normal file along with a bitfield and provides an object that appears to be a compacted file, i.e. all the entries not set in the bitfield are skipped when reading from the file. This saves another read and write pass over table 1.
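A toy version of that idea (illustrative only; the real FilteredDisk wraps a file and a bitfield rather than vectors, and exposes the Disk interface): a filtered view that presents only the entries whose bit is set, as if the underlying storage had already been compacted.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy filtered view over fixed-size entries: entry i of the view is the i-th
// entry of `data` whose bit is set in `keep`. Nothing is rewritten on "disk".
struct toy_filtered_view
{
    toy_filtered_view(std::vector<uint8_t> const& data,
                      std::vector<bool> const& keep,
                      std::size_t entry_size)
        : data_(data), keep_(keep), entry_size_(entry_size) {}

    // pointer to the i-th surviving entry (linear scan for simplicity; a real
    // implementation read sequentially would just advance a cursor instead)
    uint8_t const* entry(std::size_t i) const
    {
        std::size_t seen = 0;
        for (std::size_t pos = 0; pos < keep_.size(); ++pos) {
            if (!keep_[pos]) continue;
            if (seen++ == i) return data_.data() + pos * entry_size_;
        }
        return nullptr;  // fewer than i+1 surviving entries
    }

    std::vector<uint8_t> const& data_;
    std::vector<bool> const& keep_;
    std::size_t entry_size_;
};
```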
The last change that was necessary to unify the interfaces of SortManager and Disk was to provide a BufferedDisk. It wraps a plain file on disk but implements a read and write cache. This simplifies the loops in phase 2 and phase 3, where they no longer need to do their own read caching. Similarly, the SortManager no longer needs to do its own write caching when writing to buckets.

The intention of this patch is to reduce disk I/O. Enabling disk I/O logging and testing:
Yields this result:

This can be compared to the current master I/O, with the same command:

That's a reduction of writes by 12.07% and an increase of reads by 2.63%.
When the full working set fits in RAM (i.e. no actual disk I/O is performed) this patch runs slower. This is most likely because the sorting of the buckets in phase 3 is done twice, since phase 3 performs two read passes over the tables. It's expected to be faster when running against an actual disk, plotting a file much larger than available RAM.