@holiman (Contributor) commented Nov 4, 2019

This PR contains a massive refactoring of the downloader and queue area. It's not quite ready to be merged yet; I'd like to see how the tests perform.

Todo: add some more unit tests covering the resultStore implementation and the queue.

Throttling

Previously, we had a doneQueue, a map where we kept track of all downloaded items (receipts, block bodies). This map was updated when deliveries came in and cleaned when results were pulled from the resultCache. It was quite finicky, and modifications to how the download functioned were dangerous: if these were not kept in check, it was possible for the doneQueue to blow up.

It was also quite resource-intensive, with a lot of counting and cross-checking going on between the various pools and queues.

This has now been reworked, so that

  • the resultCache maintains (as before) a slice of *fetchResult items, with a length of blockCacheLimit * 2.
  • the resultCache also knows that it should only consider the first 75% of available slots to be up for filling. Thus, when a reserve request comes in (we want to give a task to a peer), the resultCache checks whether the proposed download task is in that priority segment; otherwise, it flags for throttling.
  • once results are fetched for processing and removed from the internal slice, the priority segment moves forward organically, and new data becomes eligible for fetching.

This means I could drop all donePool thingies, which simplified things a bit.
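
A rough sketch of the throttling idea, for illustration only (resultStoreSketch, reserve and resultOffset are invented names, not the PR's actual API): a reserve request is served when the requested block maps into the first 75% of slots, and flagged for throttling otherwise.

```go
package main

import (
	"fmt"
	"sync"
)

// resultStoreSketch is a simplified stand-in for the PR's resultCache: a fixed
// slice of result slots plus a throttle threshold covering the first 75%.
type resultStoreSketch struct {
	lock              sync.Mutex
	items             []interface{} // stand-in for []*fetchResult
	resultOffset      uint64        // block number that maps to items[0]
	throttleThreshold uint64        // number of slots eligible for filling (75% of len(items))
}

func newResultStoreSketch(size int) *resultStoreSketch {
	return &resultStoreSketch{
		items:             make([]interface{}, size),
		throttleThreshold: uint64(size) * 3 / 4, // only the first 75% of slots are fillable
	}
}

// reserve reports whether a download task for the given block number may be
// handed out now, or whether the requester should be throttled because the
// task falls outside the priority segment.
func (r *resultStoreSketch) reserve(blockNumber uint64) (throttle bool, err error) {
	r.lock.Lock()
	defer r.lock.Unlock()

	index := int64(blockNumber) - int64(r.resultOffset)
	if index < 0 || index >= int64(len(r.items)) {
		return false, fmt.Errorf("block %d outside the result cache window", blockNumber)
	}
	// Slots beyond the priority segment exist, but handing them out now would
	// let the cache run far ahead of processing, so flag for throttling instead.
	return uint64(index) >= r.throttleThreshold, nil
}

func main() {
	store := newResultStoreSketch(128)          // threshold is 96
	throttle, _ := store.reserve(10)            // well inside the first 75%
	fmt.Println("throttle block 10:", throttle) // false
	throttle, _ = store.reserve(100)             // beyond slot 96
	fmt.Println("throttle block 100:", throttle) // true
}
```

As results at the front are delivered and pruned, the offset advances, so the 75% window slides forward without any separate done-tracking.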

Concurrency

Previously, the queue maintained one lock to rule them all. Now, the resultCache has its own lock and can handle concurrency internally. This means that body and receipt fetch/delivery can happen simultaneously, and also that verification (sha-ing) of the bodies/receipts doesn't block other threads waiting for the lock.

Previously, I think setting Pending on the fetchResult was somewhat racy. This has been fixed.
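
As an illustration of the kind of race being avoided (a sketch with invented names; the PR's actual fetchResult bookkeeping may differ), the outstanding-work flags can be updated atomically so that concurrent body and receipt deliveries never race on a plain field:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// fetchResultSketch mimics a fetch result whose outstanding-work flags are
// manipulated atomically, so body and receipt deliveries can mark completion
// concurrently without racing on a plain int field.
type fetchResultSketch struct {
	pending int32 // bit 0: body outstanding, bit 1: receipts outstanding
}

const (
	bodyType    = int32(1 << 0)
	receiptType = int32(1 << 1)
)

// setDelivered clears the corresponding bit atomically; safe from concurrent callers.
func (f *fetchResultSketch) setDelivered(kind int32) {
	for {
		old := atomic.LoadInt32(&f.pending)
		if atomic.CompareAndSwapInt32(&f.pending, old, old&^kind) {
			return
		}
	}
}

// allDone reports whether both body and receipts have been delivered.
func (f *fetchResultSketch) allDone() bool {
	return atomic.LoadInt32(&f.pending) == 0
}

func main() {
	res := &fetchResultSketch{pending: bodyType | receiptType}

	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); res.setDelivered(bodyType) }()    // body delivery
	go func() { defer wg.Done(); res.setDelivered(receiptType) }() // receipt delivery
	wg.Wait()

	fmt.Println("all done:", res.allDone()) // true
}
```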

Tests

The downloader tests failed quite often: when receipts were added in the backend, the headers (ownHeaders) were deleted and moved into ancientHeaders. If this happened quickly enough, the next batch of headers errored with 'unknown parent'. This has been fixed so that the backend also queries ancientHeaders for header existence.
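
A minimal sketch of that test-backend fix, assuming the headers live in two hash-keyed maps (the struct and method shown here are illustrative, not the actual test code):

```go
package main

import (
	"fmt"

	"github.com/ethereum/go-ethereum/common"
	"github.com/ethereum/go-ethereum/core/types"
)

// downloadTesterSketch mimics the test backend described above: headers live in
// ownHeaders until receipts arrive, after which they migrate to ancientHeaders.
type downloadTesterSketch struct {
	ownHeaders     map[common.Hash]*types.Header
	ancientHeaders map[common.Hash]*types.Header
}

// HasHeader reports header existence, consulting the ancient store too, so a
// header that was just migrated no longer triggers a spurious "unknown parent".
func (dl *downloadTesterSketch) HasHeader(hash common.Hash, number uint64) bool {
	if _, ok := dl.ownHeaders[hash]; ok {
		return true
	}
	_, ok := dl.ancientHeaders[hash]
	return ok
}

func main() {
	header := &types.Header{Number: common.Big1}
	dl := &downloadTesterSketch{
		ownHeaders:     map[common.Hash]*types.Header{},
		ancientHeaders: map[common.Hash]*types.Header{header.Hash(): header},
	}
	fmt.Println(dl.HasHeader(header.Hash(), 1)) // true, found via the ancient store
}
```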

Minor changes

  • Set the incoming response time earlier in the flow, so it doesn't have to wait for obtaining locks before being set. This should make the RTT measurements a bit closer to the truth.
  • The idle check used a bubble-sort-style algorithm; this has been replaced (see the sketch after this list).
  • The fetcher did a lot of useless work, iterating in the block filter (for every block) and recalculating hashes over and over again. This has been simplified.
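
For the idle-check bullet above, a sketch of the kind of replacement one might use, ordering idle peers with the standard library instead of a hand-rolled bubble sort (the peer type and field names are invented for illustration):

```go
package main

import (
	"fmt"
	"sort"
)

// peerSketch stands in for a downloader peer with a measured throughput.
type peerSketch struct {
	id         string
	throughput float64
}

// sortByThroughput orders idle peers best-first using sort.Slice.
func sortByThroughput(peers []*peerSketch) {
	sort.Slice(peers, func(i, j int) bool {
		return peers[i].throughput > peers[j].throughput
	})
}

func main() {
	peers := []*peerSketch{
		{id: "a", throughput: 1.2},
		{id: "b", throughput: 3.4},
		{id: "c", throughput: 2.1},
	}
	sortByThroughput(peers)
	for _, p := range peers {
		fmt.Println(p.id, p.throughput) // b, c, a
	}
}
```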

@holiman (Contributor, Author) commented Nov 4, 2019

This is now doing a fast-sync on the benchmarkers: https://geth-bench.ethdevops.io/d/Jpk-Be5Wk/dual-geth?orgId=1&var-exp=mon08&var-master=mon09&var-percentile=50&from=1572876506497&to=now

@holiman force-pushed the refactor_downloader branch 4 times, most recently from 7d58a97 to eed7fe6 on November 13, 2019 at 11:53
@holiman (Contributor, Author) commented Nov 13, 2019

Finally got greenlit by Travis. Will do one more fast-sync benchmark and post results.

@fjl changed the title from "Refactor downloader" to "eth/downloader: refactor downloader queue" on Nov 14, 2019
@holiman (Contributor, Author) commented Nov 14, 2019

Fast-sync done (https://geth-bench.ethdevops.io/d/Jpk-Be5Wk/dual-geth?orgId=1&from=1573721066023&to=1573752540000&var-exp=mon06&var-master=mon07&var-percentile=50), some graphs below (this PR in yellow):
[Screenshots: "Dual Geth" Grafana sync graphs]


Also, totally unrelated, it's interesting to see that there's a 10x write amplification on leveldb (750 GB written, 75 GB stored), and a perfect 1x on ancients:
[Screenshot: "Dual Geth" Grafana graph]

throttleThreshold := uint64((common.StorageSize(blockCacheMemory) + q.resultSize - 1) / q.resultSize)
q.resultCache.SetThrottleThreshold(throttleThreshold)
// log some info at certain times
if time.Now().Second()&0xa == 0 {
@holiman (Contributor, Author) commented:

bleh

A Member commented:

indeed :P

delete(q.receiptDonePool, hash)
closed = q.closed
q.lock.Unlock()
results = q.resultCache.GetCompleted(maxResultsProcess)
A Contributor commented:

I think the condition variable should be in resultStore. This means closed needs to move into the resultStore as well.

@holiman (Contributor, Author) replied:

Well, that totally makes sense, but it means an even larger refactor. Then closed would have to move in there, and the things that call Signal would need to somehow trigger that via the resultStore.
Let's leave that for a future refactor (I'd be happy to continue iterating on the downloader).
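
For illustration, a minimal sketch of what such a future refactor could look like, with the condition variable and the closed flag owned by the result store (made-up type and method names; not part of this PR):

```go
package main

import (
	"fmt"
	"sync"
)

// resultStoreWithCond sketches the suggested refactor: the store owns both the
// condition variable and the closed flag, so callers block on the store itself
// instead of on the queue's lock.
type resultStoreWithCond struct {
	lock    sync.Mutex
	cond    *sync.Cond
	results []string // stand-in for completed fetch results
	closed  bool
}

func newResultStoreWithCond() *resultStoreWithCond {
	r := &resultStoreWithCond{}
	r.cond = sync.NewCond(&r.lock)
	return r
}

// Deliver adds a completed result and wakes a waiter.
func (r *resultStoreWithCond) Deliver(res string) {
	r.lock.Lock()
	r.results = append(r.results, res)
	r.lock.Unlock()
	r.cond.Signal()
}

// Close marks the store as closed and wakes all waiters so they can bail out.
func (r *resultStoreWithCond) Close() {
	r.lock.Lock()
	r.closed = true
	r.lock.Unlock()
	r.cond.Broadcast()
}

// GetCompleted blocks until results are available or the store is closed.
func (r *resultStoreWithCond) GetCompleted() []string {
	r.lock.Lock()
	defer r.lock.Unlock()
	for len(r.results) == 0 && !r.closed {
		r.cond.Wait()
	}
	out := r.results
	r.results = nil
	return out
}

func main() {
	store := newResultStoreWithCond()
	go store.Deliver("block-1")
	fmt.Println(store.GetCompleted()) // [block-1]
}
```

Waiters blocked in GetCompleted are woken either by a delivery or by Close, which is the signalling that currently lives in the queue.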

@holiman force-pushed the refactor_downloader branch 3 times, most recently from cfd3f18 to a0f30ba on December 4, 2019 at 08:48
@karalabe self-assigned this Jan 7, 2020
@holiman mentioned this pull request Jan 14, 2020
}

// EmptyBody returns true if there is no additional 'body' to complete the header
// that is: no transactions and no uncles
A Member commented:

.

return h.TxHash == EmptyRootHash && h.UncleHash == EmptyUncleHash
}

// EmptyReceipts returns true if there are no receipts for this header/block
A Member commented:

.

headers := packet.(*headerPack).headers
if len(headers) != 1 {
p.log.Debug("Multiple headers for single request", "headers", len(headers))
p.log.Info("Multiple headers for single request", "headers", len(headers))
A Member commented:

I think we should possibly raise this to Warn

headers := packer.(*headerPack).headers
if len(headers) != 1 {
p.log.Debug("Multiple headers for single request", "headers", len(headers))
p.log.Info("Multiple headers for single request", "headers", len(headers))
A Member commented:

I think we should possibly raise this to Warn

header := d.lightchain.GetHeaderByHash(h) // Independent of sync mode, header surely exists
if header.Number.Uint64() != check {
p.log.Debug("Received non requested header", "number", header.Number, "hash", header.Hash(), "request", check)
p.log.Info("Received non requested header", "number", header.Number, "hash", header.Hash(), "request", check)
A Member commented:

I think we should possibly raise this to Warn

delay = n
}
headers = headers[:n-delay]
ignoredHeaders = delay
A Member commented:

Probably simpler if you replace delay altogether with ignoredHeaders, instead of defining a new delay variable and then just assigning it at the end.

@adamschmideg added this to the 1.9.14 milestone Apr 7, 2020
@holiman force-pushed the refactor_downloader branch from 99d1503 to 060e2c0 on April 7, 2020 at 13:15
@holiman (Contributor, Author) commented Apr 7, 2020

Rebased

@holiman (Contributor, Author) commented Jun 26, 2020

Closing in favour of #21263

@holiman closed this Jun 26, 2020