bitfield optimization of phase2 #120

Merged: 11 commits into master from bitfield-cleanup, Nov 9, 2020

Conversation

@arvidn (Contributor) commented Nov 4, 2020

This avoids some of the on-disk sorting in phase 2 (back-propagation) by recording which entries in table_index are used by table_index + 1. Phase 2 now looks like this (a sketch follows the list):

  • loop from table 7..1:
    • scan current table:
      • if table < 7, skip any entries not set in current_bitfield
      • record the indices we find in the current table by setting the corresponding bit in previous_bitfield
    • if table < 7: scan current table (again):
      • compact table by skipping fields whose bit is not set in current_bitfield
      • write back each pos and offset with an adjusted position, based on the new (compacted) locations in previous_bitfield.
    • current_bitfield = previous_bitfield
    • clear all bits in previous_bitfield
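
For illustration, here is a minimal in-memory sketch of that loop. This is not the chiapos code: the entry layout, the table representation and all the on-disk SortManager/FilteredDisk machinery are simplified away, and each entry is assumed to reference entries pos and pos + offset of the previous table. Table 7 is never filtered, but its pos/offset still get rewritten.

#include <cstdint>
#include <utility>
#include <vector>

struct Entry { uint64_t pos; uint64_t offset; };   // references entries pos and pos + offset of the previous table
using Table = std::vector<Entry>;

// tables[1] .. tables[7]; tables[0] unused
void backpropagate(std::vector<Table>& tables)
{
    std::vector<bool> current_bitfield;            // which entries of the current table survive
    for (int t = 7; t > 1; --t) {
        Table const& table = tables[t];
        std::vector<bool> previous_bitfield(tables[t - 1].size(), false);

        // pass 1: mark every entry of table t-1 that a surviving entry of table t points at
        for (uint64_t i = 0; i < table.size(); ++i) {
            if (t < 7 && !current_bitfield[i]) continue;
            previous_bitfield[table[i].pos] = true;
            previous_bitfield[table[i].pos + table[i].offset] = true;
        }

        // rank[i] = number of set bits below i, i.e. where entry i of table t-1 lands once compacted
        std::vector<uint64_t> rank(previous_bitfield.size());
        uint64_t counter = 0;
        for (uint64_t i = 0; i < previous_bitfield.size(); ++i) {
            rank[i] = counter;
            if (previous_bitfield[i]) ++counter;
        }

        // pass 2: compact table t and rewrite pos/offset against the compacted table t-1.
        // In the patch this output goes into a SortManager (or, for table 7, back into the
        // table's own buffer) instead of a plain vector.
        Table out;
        for (uint64_t i = 0; i < table.size(); ++i) {
            if (t < 7 && !current_bitfield[i]) continue;
            uint64_t const p1 = rank[table[i].pos];
            uint64_t const p2 = rank[table[i].pos + table[i].offset];
            out.push_back({p1, p2 - p1});
        }
        tables[t] = std::move(out);

        current_bitfield = std::move(previous_bitfield);
    }

    // table 1 is never rewritten, only compacted; the patch does this lazily via FilteredDisk
    Table compacted;
    for (uint64_t i = 0; i < tables[1].size(); ++i)
        if (current_bitfield[i]) compacted.push_back(tables[1][i]);
    tables[1] = std::move(compacted);
}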

In order to hand over the tables in the correct order, sorted the way the current phase 2 sorts them, the "write-back" step actually writes into a SortManager object. This object is then passed on to phase 3 as if it were a Disk.

Enabling a SortManager to implement the Disk interface required a few changes. The Disk interface now returns a pointer to the entry when reading from it. This is how SortManager returns data, and I envision it is also how we would return data if we transition to memory-mapped files.
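
Roughly, the interface now has this shape (a sketch; the Read signature matches the change quoted further down in this review, while the Write signature and the rest are illustrative):

#include <cstdint>

struct Disk
{
    // before: void Read(uint64_t begin, uint8_t* memcache, uint64_t length);
    // after: the implementation owns the buffer and hands out a pointer into it
    virtual uint8_t const* Read(uint64_t begin, uint64_t length) = 0;
    virtual void Write(uint64_t begin, uint8_t const* src, uint64_t length) = 0;
    virtual ~Disk() = default;
};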

To allow passing a SortManager from one phase to the next, they now heap-allocate their memory, rather than relying on being passed a buffer that is later reused.

Another optimization of the algorithm described above is that the last table, table 1, isn't sorted; it just needs to be compacted. Instead of running a read and write-back pass over table 1, there's yet another abstraction implementing the Disk interface: FilteredDisk. It wraps a normal file along with a bitfield and provides an object that appears to be a compacted file, i.e. all the entries whose bits are not set in the bitfield are skipped when reading from the file. This saves another read and write pass over table 1.
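
The idea, sketched over in-memory entries rather than a file and byte offsets (names and details are assumptions, not the actual FilteredDisk code):

#include <cstdint>
#include <utility>
#include <vector>

// Present a "compacted" view of a table without rewriting it: entry i of the
// view is the i-th entry whose bit is set in the bitfield.
class FilteredView
{
public:
    FilteredView(std::vector<uint64_t> entries, std::vector<bool> keep)
        : entries_(std::move(entries)), keep_(std::move(keep)) {}

    uint64_t Read(uint64_t i) const
    {
        // naive scan for clarity; phase 3 reads sequentially, so the real thing
        // only needs to advance a cursor over the bitfield
        uint64_t seen = 0;
        for (uint64_t j = 0; j < keep_.size(); ++j) {
            if (!keep_[j]) continue;
            if (seen == i) return entries_[j];
            ++seen;
        }
        return 0;   // i out of range; not handled in this sketch
    }

private:
    std::vector<uint64_t> entries_;
    std::vector<bool> keep_;
};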

The last change that was necessary to unify the interfaces of SortManager and Disk was to provide a BufferedDisk. It wraps a plain file on disk but implements a read and write cache. This simplifies the loops in phase 2 and phase 3, where they no longer need to do their own read caching. Similarly, the SortManager no longer needs to do its own write caching when writing to buckets.
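
A sketch of the read side of that idea (the buffer size, names and plain-stdio file handling are assumptions, not the actual BufferedDisk):

#include <cassert>
#include <cstdint>
#include <cstdio>
#include <vector>

class BufferedReader
{
public:
    explicit BufferedReader(std::FILE* f, uint64_t buffer_size = 4 * 1024 * 1024)
        : file_(f), buf_(buffer_size) {}

    // returns a pointer into the internal buffer, valid until the next call
    uint8_t const* Read(uint64_t begin, uint64_t length)
    {
        assert(length <= buf_.size());   // callers read one small entry at a time
        if (begin < buf_start_ || begin + length > buf_start_ + buf_len_) {
            // cache miss: refill the read-ahead buffer starting at `begin`
            // (fseek's long offset is a simplification; large files need fseeko or similar)
            std::fseek(file_, static_cast<long>(begin), SEEK_SET);
            buf_len_ = std::fread(buf_.data(), 1, buf_.size(), file_);
            buf_start_ = begin;
        }
        return buf_.data() + (begin - buf_start_);
    }

private:
    std::FILE* file_;
    std::vector<uint8_t> buf_;
    uint64_t buf_start_ = 0;
    uint64_t buf_len_ = 0;
};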

The intention of this patch is to reduce disk I/O. Enabling disk I/O logging and testing:

./ProofOfSpace create -k 24 -s 8192

Yields this result:

total-read:        5863072979
total-write:       5307037029

This can be compared to the current master I/O, with the same command:

total-read:        5713090499
total-write:       6035790148

That's a reduction of writes by 12.07% and an increase of reads by 2.63%.
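
Spelled out, the percentages come from:

writes: (6035790148 - 5307037029) / 6035790148 = 728753119 / 6035790148 ≈ 12.07% fewer writes
reads:  (5863072979 - 5713090499) / 5713090499 = 149982480 / 5713090499 ≈ 2.63% more reads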

When the full working set fits in RAM (i.e. no actual disk I/O is performed) this patch runs slower. This is most likely because the sorting of the buckets in phase 3 is done twice, since phase 3 performs two read passes over the tables. It's expected to be faster when running against an actual disk, plotting a file much larger than available RAM.

…fies loops that iterate files, by not having them read larger chunks, but considering one entry at a time.
struct bitfield_index
{
// cache the number of set bits evey n bits. This is n. For a bitfield of
// size 2^32, this means a 2 MiB index

Contributor:

Suggested comment change:

// Cache the number of set bits every kIndexBucket bits.
// For 2^32 entries, this means a 200KiB index.

bitfield_index const idx(bitfield);
CHECK(idx.lookup(1048576 - 3, 1) == std::pair<uint64_t, uint64_t>{0,1});
CHECK(idx.lookup(1048576 - 2, 1) == std::pair<uint64_t, uint64_t>{1,1});
}

@aqk (Contributor) commented Nov 5, 2020:

Suggest boundary condition tests for bitfield_index:

// One past bucket
idx.lookup(pos=0, offset=kIndexBucket)

// One past bitfield.size()

// Init a bitfield_index with a bitfield that has .size == kIndexBucket-1, kIndexBucket, kIndexBucket+1

@arvidn (Contributor Author):

I used to have tests for unset bits, checking that they would throw. But this is a pretty hot function, so I backed out that check. Now it's up to the caller to only use valid positions and offsets (hence the asserts).

uint64_t const bucket = pos / kIndexBucket;

assert(bucket < index_.size());
assert(pos < bitfield_.size());

Contributor:

assert(pos < bitfield_.size());

I see. So, it's okay for pos >= kIndexBucket?

@arvidn (Contributor Author):

Yes, they are different domains. kIndexBucket is the interval between precomputed counts of set bits; the number of such precomputed counts (the size of the index) is bitfield_.size() / kIndexBucket.

It's OK for pos to be greater than bucket * kIndexBucket, but only as long as it's < bitfield_.size().

{
uint64_t counter = 0;
auto it = bitfield_.begin();
index_.reserve(bitfield_.size() / kIndexBucket);

Contributor:

index_.reserve(bitfield_.size() / kIndexBucket);

Should this be index_.reserve((bitfield_.size() / kIndexBucket) + 1);, or is the index not used for the last bucket, where the bucket size < kIndexBucket?

@arvidn (Contributor Author):

no, I think this is right. This provides an index for the number of set bits at the start of every kIndexBucket bits. So it rounds down.
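
To make the two threads above concrete, here is a self-contained sketch of the scheme (names, the bucket size and the std::vector<bool> representation are illustrative, not the chiapos bitfield_index). The index stores a running count of set bits at every kIndexBucket boundary, so a lookup only scans the tail of one bucket; pos ranges over the whole bitfield, while kIndexBucket merely selects which precomputed count to start from. chiapos' lookup(pos, offset) then corresponds roughly to the pair { rank(pos), rank(pos + offset) - rank(pos) }.

#include <cassert>
#include <cstdint>
#include <vector>

constexpr uint64_t kIndexBucket = 1024;          // illustrative value

struct rank_index
{
    explicit rank_index(std::vector<bool> const& bits) : bits_(bits)
    {
        // in this sketch, reserve() is only a capacity hint, so rounding down is
        // harmless; the loop still pushes an entry for a final partial bucket
        index_.reserve(bits_.size() / kIndexBucket);
        uint64_t counter = 0;
        for (uint64_t i = 0; i < bits_.size(); ++i) {
            if (i % kIndexBucket == 0) index_.push_back(counter);  // set bits below this bucket
            if (bits_[i]) ++counter;
        }
    }

    // number of set bits strictly below pos
    uint64_t rank(uint64_t pos) const
    {
        assert(pos < bits_.size());                  // pos is bounded by the bitfield size...
        uint64_t const bucket = pos / kIndexBucket;  // ...kIndexBucket only selects the bucket
        assert(bucket < index_.size());
        uint64_t count = index_[bucket];
        for (uint64_t i = bucket * kIndexBucket; i < pos; ++i)
            count += bits_[i];                       // the real code popcounts whole 64-bit words here
        return count;
    }

    std::vector<bool> const& bits_;
    std::vector<uint64_t> index_;
};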

@aqk (Contributor) left a comment:

This is large, so I'm sending my first wave of comments now.

src/phase2.hpp Outdated
// table 7, where we don't use the sort_manager; then it's used as a write
// cache for the table, as we update it.

// As the last step, we compact table 1. At that point the halfes are also

Contributor:

halfes -> halves or halfs

@arvidn (Contributor Author):

This comment is removed in a later commit, so I think I'll leave it misspelled. If I update this commit, I think it would make it harder to review, since changes won't come in as commits on top of what's here.

src/phase2.hpp Outdated
@@ -149,100 +157,100 @@ std::vector<uint64_t> RunPhase2(
0,
strategy_t::quicksort);

if (table_index == 7) {
// we don't use the sort manager in this case, so we can use the
// memort as a write buffer instead

Contributor:

memort

@arvidn (Contributor Author):

Same thing here. It's supposed to say "memory", but this comment is removed in a later commit, when the memory is heap-allocated instead.

src/disk.hpp Outdated
NeedReadCache();
// all allocations need 7 bytes head-room, since
// SliceInt64FromBytes() may overrun by 7 bytes
if (read_buffer_start_ <= begin && read_buffer_start_ + read_buffer_size_ > begin + length + 7) {
// if the read is entirely inside the buffer, just memcopy it out

Contributor:

just memcopy it out -> just pass a pointer to it


void Read(uint64_t begin, uint8_t *memcache, uint64_t length) override
uint8_t const* Read(uint64_t begin, uint64_t length) override

Contributor:

What if the requested read is bigger than uint64_t read_ahead?
Since we pass a pointer out, a valid read can only be as large as that buffer.

@arvidn (Contributor Author):

We only read one entry at a time, and I think the largest entry is 7 bytes. I suppose there could be an assert, though.

@@ -42,11 +42,11 @@ FetchContent_Declare(
FetchContent_MakeAvailable(pybind11-src)

IF (CMAKE_BUILD_TYPE STREQUAL "RELEASE")
set (CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O3")
set (CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -O3")
set (CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O3 -mtune=native")

Contributor:

@arvidn isn't this a potential problem if you want to create binaries to distribute?

@arvidn (Contributor Author):

It would be, if we built such a binary distribution on an exotic machine. The problem with not tuning at all is that GCC doesn't even assume it has access to popcnt, which bitfield_index relies on.

@arvidn (Contributor Author) commented Nov 6, 2020:

@aqk I had a few typos in my additional unit test that Catch (for some reason) didn't report, so I didn't notice until CI ran and detected a non-zero return code.

I've fixed that now. I want to squash all the fixup commits into the changes they belong to, but to make my changes easier to review, I've left them as new commits on top for now. Please take a look, then I'll squash.

@arvidn (Contributor Author) commented Nov 6, 2020:

I ran some large benchmarks overnight (probably a lot larger than they had to be).

This is on macOS, on a fairly powerful laptop, against a slow spinning disk connected over USB, so this was most definitely I/O bound. I wanted to avoid running a test so small that it would fit in RAM, and I don't have all that much free space on the internal SSD. But it would be nice to get an idea of what the numbers look like on a more reasonable setup with a large SSD as the plotting target.

I ran ./ProofOfSpace create -u 32 -b 4096 -k 31 --threads 8.

I timed these runs with time, but I also piped the output to tee to log it, along with 2>&1. I think this messed up the output from time and I lost the real time (I got the user and system CPU time as well as the total time).

total time with current master: 9:17:47.59
total time with this patch: 9:11:14.91

6 minutes and 33 seconds faster. Not a whole lot given a > 9 hour runtime.

This is the timing output from the runs, master first:

Time for phase 1 = 10833.506 seconds. CPU (116.590%) Fri Nov  6 03:50:20 2020
Time for phase 2 = 7490.284 seconds. CPU (36.240%) Fri Nov  6 05:55:10 2020
Time for phase 3 = 14189.635 seconds. CPU (44.280%) Fri Nov  6 09:51:39 2020
Time for phase 4 = 949.447 seconds. CPU (57.620%) Fri Nov  6 10:07:29 2020
Total time = 33462.874 seconds. CPU (66.270%) Fri Nov  6 10:07:29 2020

Now, for this patch:

Time for phase 1 = 9960.252 seconds. CPU (127.730%) Thu Nov  5 17:42:43 2020
Time for phase 2 = 6451.230 seconds. CPU (53.280%) Thu Nov  5 19:30:14 2020
Time for phase 3 = 15829.239 seconds. CPU (56.150%) Thu Nov  5 23:54:03 2020
Time for phase 4 = 826.582 seconds. CPU (66.470%) Fri Nov  6 00:07:49 2020
Total time = 33067.307 seconds. CPU (77.410%) Fri Nov  6 00:07:49 2020

That's only a 1.18 % reduction in runtime. Given that it's still performing 12% fewer writes (at the expense of 2% more reads) and it's not making things slower (at least not dramatically), I think it's still a viable change.

I suspect that one reason it's not improving runtime all that much on a spinning disk is that it (probably) increases seek times since phase 3 now reads from two separate sort managers, meaning it will keep two read cursors on disk that it will have to alternate between.

Right now I set the disk read-ahead (and write) buffers to 4 MiB, as that's the minimum extent size on ext4 (iirc). I'll experiment with making that larger, say, 8 or 16 MiB.

@arvidn (Contributor Author) commented Nov 6, 2020:

I ran another test of this patch with a 16 MiB read and write cache. It took almost 1 hour longer than the two previous tests. I'm thinking there's probably a lot of noise in the way I'm testing this, most likely introduced by the spinning disk.

Interestingly, almost all of the additional time was in phase 1, which should be unaffected by my patch anyway. There may be some opportunity for improvements in phase 1 on spinning disks.

@arvidn (Contributor Author) commented Nov 9, 2020:

squashed fixups

@hoffmang9 merged commit 74b2fb2 into master on Nov 9, 2020
@hoffmang9 deleted the bitfield-cleanup branch on November 9, 2020 15:42