-
Notifications
You must be signed in to change notification settings - Fork 20.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
les: implement request distributor, fix blocking issues #3660
Conversation
@zsfelfoldi, thanks for your PR! By analyzing the history of the files in this pull request, we identified @obscuren, @fjl and @bas-vk to be potential reviewers. |
thanks for the bug report, I will fix it soon. |
c421755
to
16a3244
Compare
fixed |
Not a deadlock, the server was disconnected. Did it happen because of your 'eth' command? |
Does it say "Rolled back 2048 headers..." like on the picture above? That means server disconnection during syncing. Which is bad too if an 'eth' command causes that, but it is a different issue. |
|
Please try it again... there is still some weird issue when resuming a cancelled syncing (after header rollback), I will debug that later, but the really nasty bug that caused blocking has been found. |
Thank you! The issue seems to be resolved - at least I couldn't overwhelm console with "eth" calls. Rollbacks happen - yet sync resumes and full sync happens, eventually. So, all in all - this nasty blocking bug has been resolved. |
Thanks for catching it, fixed. Now I am carefully checking all my locks, callbacks and goroutines :) |
aee030f
to
d61ed3f
Compare
57e550e
to
53e7cae
Compare
also fixed a bug that caused flow control errors and random peer disconnections. |
While overall blocking is gone indeed, there's one minor issue with a delay, rather than a block. I've done the following experiment:
I've tracked the problem to https://github.com/ethereum/go-ethereum/blob/master/light/txpool.go#L303: func (pool *TxPool) eventLoop() {
for ev := range pool.events.Chan() {
switch ev.Data.(type) {
case core.ChainHeadEvent:
pool.mu.Lock()
ctx, _ := context.WithTimeout(context.Background(), blockCheckTimeout)
head := pool.chain.CurrentHeader()
txc, _ := pool.setNewHead(ctx, head)
m, r := txc.getLists()
pool.relay.NewHead(pool.head, m, r)
pool.homestead = pool.config.IsHomestead(head.Number)
pool.signer = types.MakeSigner(pool.config, head.Number)
pool.mu.Unlock()
// <<<<<< delay is solved if I add time.Sleep(0) or runtime.Gosched()
}
}
} So, Once I added |
@farazdagi thanks for reporting these issues, I've switched the status of the PR back to "in progress" until I address them. |
53e7cae
to
f93567b
Compare
f93567b
to
3d221cc
Compare
@farazdagi I have investigated the mentioned issues and found the following:
Hopefully the remaining performance and responsiveness issues will be fixed by the new (and actually simpler) light tx pool which will (again, hopefully) be included in the 1.6 release. |
@zsfelfoldi thank you for providing more details. I've mentioned this on Gitter, copying it here:
So, unfortunately, for us it this remaining performance and responsiveness issues were show stoppers (as on mobile LES is already understandably slower than that on desktop). What we did is simply excluded any updates that happened from c57c54c I literally cherry-picked every other commits, but ones related to LES, and we have non blocking and almost up to date Geth (everything is up to date, except for LES). I am not sure how c57c54c is unrelated to the issue - I see no blocking before that commit, I see no blocking if I update everything else bug LES. It is possible that commit has been reverted, and then something else re-introduced the issue, I am not sure. At the moment we have our hackish way to keep updated Geth and outdated LES 😞 - once I have more time, I will look further into issue, as the original issue I've reported (#3639) is definitely not fixed by this PR, so hopefully it gets addressed down the road. |
5b8fd05
to
ee21f96
Compare
ee21f96
to
b5afe61
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We did an interactive review.
les/distributor.go
Outdated
waitBefore(uint64) (time.Duration, float64) | ||
canSendOrdered() bool | ||
orderedSend(f func()) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed, please remove the need for 'ordered sends' by updating flow control counters when the request is actually sent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realized that I'd really want to keep the ordered send because the order actually affects performance, the optimal thing to do is to keep the queue priority and always send the oldest requests first. I know this is a minor nuance but with the recent additions to the protocol there are cases where the order of reply messages counts too, so I believe in the long run this is the cleanest solution. So I made the ordered send a general thing and put it into the common package.
les/distributor.go
Outdated
case retry := <-d.loopChn: | ||
// the main loop distinguishes two kinds of signals: | ||
// - "retry" is sent to re-check whether previously unsendable requests can now be sent to one of the peers | ||
// - "next" is sent when there are more requests to be processed/sent |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove the retry mechanism.
les/distributor.go
Outdated
r.queuePrev.queueNext = r | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might be able to remove this part if you use container/list.
les/distributor.go
Outdated
getCost: getCost, | ||
canSend: canSend, | ||
request: request, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please change callers to use &distReq{getCost: func(...) uint64 { ... }, ...}
instead of calling newDistReq. Doing this will add "labels" to all the func literals.
les/distributor.go
Outdated
} | ||
if bestReq == nil { | ||
} else { | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need this empty condition anymore ;)
les/distributor.go
Outdated
if send != nil { | ||
peer.orderedSend(send) | ||
} | ||
chn <- peer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
req.sentChn <- peer
close(req.sentChn)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I realized that remove sets sentChn to nil so there was actually a reason for this.
les/distributor.go
Outdated
} | ||
fmt.Println(" order", r.reqOrder, "prev", p) | ||
r = r.queueNext | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove this select clause.
Thank you for your contribution! Your commits seem to not adhere to the repository coding standards
Please check the contribution guidelines for more details. This message was auto-generated by https://gitcop.com |
@fjl I did the requested changes (except two of them, see my comments). |
common/execqueue.go
Outdated
// Stop stops the exec queue. | ||
func (q *ExecQueue) Stop() { | ||
atomic.StoreInt32(&q.stop, 1) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add this as an unexported utility in package les instead. We don't want to add new features to package common anymore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
9a82a89
to
e802e27
Compare
e802e27
to
9d363f3
Compare
defer self.chainmu.Unlock() | ||
defer func() { | ||
self.chainmu.Unlock() | ||
time.Sleep(time.Millisecond * 10) // ugly hack; do not hog chain lock in case syncing is CPU-limited by validation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is indeed ugly and we should remove it. Can you explain why this helps with the lock?
} | ||
go func() { | ||
time.Sleep(wait) | ||
d.loopChn <- struct{}{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a better idea for this, will submit after the merge.
This PR implements a new, cleaner and more efficient LES request distribution mechanism that also avoids starvation issues by prioritizing requests that have been created earlier. Also, it fixes a minor issue in light.TxPool.