
les: implement request distributor, fix blocking issues #3660

Merged · 2 commits · Mar 22, 2017

Conversation

@zsfelfoldi
Contributor

This PR implements a new, cleaner, and more efficient LES request distribution mechanism that also avoids starvation by prioritizing requests that were created earlier. It also fixes a minor issue in light.TxPool.
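For illustration, here is a minimal, self-contained sketch of oldest-first request distribution; the type and function names are hypothetical and this is not the PR's actual API:

package main

import "fmt"

// request is a hypothetical stand-in for a queued LES request.
type request struct {
	order uint64 // creation order; lower means older
	send  func()
}

// distributor keeps requests in creation order and always tries the
// oldest sendable one first, so newer requests cannot starve older ones.
type distributor struct {
	queue     []*request
	nextOrder uint64
}

func (d *distributor) enqueue(send func()) {
	d.queue = append(d.queue, &request{order: d.nextOrder, send: send})
	d.nextOrder++
}

// dispatch scans from the front (oldest first) and sends the first
// request that can currently be accepted by a peer.
func (d *distributor) dispatch(canSend func(*request) bool) {
	for i, req := range d.queue {
		if canSend(req) {
			d.queue = append(d.queue[:i], d.queue[i+1:]...)
			req.send()
			return
		}
	}
}

func main() {
	d := &distributor{}
	for i := 0; i < 3; i++ {
		i := i // capture loop variable
		d.enqueue(func() { fmt.Println("sent request", i) })
	}
	d.dispatch(func(*request) bool { return true }) // prints "sent request 0"
}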

@mention-bot

@zsfelfoldi, thanks for your PR! By analyzing the history of the files in this pull request, we identified @obscuren, @fjl and @bas-vk to be potential reviewers.

@mchusovlianov
Contributor

mchusovlianov commented Feb 10, 2017

I've tested this PR a bit (I just ran ./build/bin/geth --light --testnet console and tried to dump the eth object by typing eth ENTER):
Issue 1: [screenshot]
Issue 2: [screenshot]

@zsfelfoldi
Contributor Author

thanks for the bug report, I will fix it soon.

@zsfelfoldi
Contributor Author

fixed

@mchusovlianov
Contributor

It looks like it now has a deadlock:
[screenshot]

[screenshot]

@zsfelfoldi
Contributor Author

Not a deadlock, the server was disconnected. Did it happen because of your 'eth' command?

@farazdagi
Contributor

farazdagi commented Feb 10, 2017

  • I can confirm the same: if you type 'eth' N times, at some (not very high) N everything blocks and nothing happens; it smells a lot like a deadlock
  • the same rollback is reported in my console too:
    [screenshot]

@zsfelfoldi
Contributor Author

Does it say "Rolled back 2048 headers..." like in the picture above? That means a server disconnection during syncing, which is bad too if an 'eth' command causes it, but it is a different issue.

@farazdagi
Contributor

farazdagi commented Feb 10, 2017

  • yep, it always (at least in my anecdotal experience) happens in sequence: "eth" strings start to pile up, then the "rolled back" message appears, and while I can still type, no response from the server ever comes
  • Additionally, when I had another console attached in a separate window, it also locks up when the "eth DOS" happens in the first window. I wanted that separate window to watch the peers (to investigate why the server disconnect occurs).

@zsfelfoldi
Contributor Author

Please try it again... there is still some weird issue when resuming a cancelled sync (after header rollback), which I will debug later, but the really nasty bug that caused the blocking has been found.

@farazdagi
Contributor

Thank you! The issue seems to be resolved - at least I couldn't overwhelm the console with "eth" calls. Rollbacks still happen, yet the sync resumes and eventually completes.

So, all in all, this nasty blocking bug has been resolved.

@farazdagi
Contributor

I've done more tests and unfortunately got the following panic:
[screenshot]

@zsfelfoldi
Contributor Author

Thanks for catching it, fixed. Now I am carefully checking all my locks, callbacks and goroutines :)

@zsfelfoldi
Contributor Author

Also fixed a bug that caused flow control errors and random peer disconnections.

@farazdagi
Contributor

While the overall blocking is indeed gone, there's one minor issue: a delay rather than a block.

I've done the following experiment:

  • wiped out chaindata (so that the test starts from the trusted CHT and needs to sync all the way from it)
  • sent a couple of transactions
    • sometimes those transactions validate just fine even if the actual nonce (in the latest block) is higher: it can happen that when the tx is sent, the chain state doesn't yet know about those higher nonces
    • curiously enough, the first transaction proceeds fine (of course it never gets included in the blockchain, as peers reject it)
    • however, the second takes much longer to complete (minutes instead of seconds)

I've tracked the problem to https://github.com/ethereum/go-ethereum/blob/master/light/txpool.go#L303:

func (pool *TxPool) eventLoop() {
	for ev := range pool.events.Chan() {
		switch ev.Data.(type) {
		case core.ChainHeadEvent:
			pool.mu.Lock()
			ctx, _ := context.WithTimeout(context.Background(), blockCheckTimeout)
			head := pool.chain.CurrentHeader()
			txc, _ := pool.setNewHead(ctx, head)
			m, r := txc.getLists()
			pool.relay.NewHead(pool.head, m, r)
			pool.homestead = pool.config.IsHomestead(head.Number)
			pool.signer = types.MakeSigner(pool.config, head.Number)
			pool.mu.Unlock()
			// <<<<<< delay is solved if I add time.Sleep(0) or runtime.Gosched()
		}
	}
}

So, eventLoop acquires the lock, and preemption seems to get greatly delayed, i.e. another goroutine was starving while trying to acquire the lock in txpool.Add() - hence the delay.

Once I added runtime.Gosched() to allow for manual yielding and context switching, no more delays occurred.
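For illustration, here is a minimal, self-contained sketch of the starvation pattern described above (how pronounced the delay is depends on the Go version, GOMAXPROCS, and scheduler details):

package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

func main() {
	var mu sync.Mutex
	acquired := make(chan struct{})

	// This goroutine stands in for the caller of txpool.Add():
	// it just wants the lock once.
	go func() {
		mu.Lock()
		close(acquired)
		mu.Unlock()
	}()

	start := time.Now()
	for {
		// This loop stands in for eventLoop: it re-acquires the
		// mutex almost immediately after releasing it.
		mu.Lock()
		mu.Unlock()
		runtime.Gosched() // yield; without this the waiter may be delayed

		select {
		case <-acquired:
			fmt.Println("waiter got the lock after", time.Since(start))
			return
		default:
		}
	}
}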

@zsfelfoldi
Contributor Author

@farazdagi thanks for reporting these issues, I've switched the status of the PR back to "in progress" until I address them.

@zsfelfoldi
Contributor Author

@farazdagi I have investigated the mentioned issues and found the following:

  • if you send a transaction before syncing up, the light transaction pool kills performance by downloading each block until it finds the sent transaction. It also wastes some CPU while syncing, even if no tx has been sent. This problem will be fixed by a subsequent PR.
  • serving data from the local db before syncing is started DOES work if everything is there, or at least I could not reproduce a scenario where it does not. The comments above the Retrieve function were outdated; db checking is done in the access functions found in light/odr_util.go.
  • the commit c57c54c you mentioned only affected the server mode and has been rolled back since, so it cannot be related to these problems.

Hopefully the remaining performance and responsiveness issues will be fixed by the new (and actually simpler) light tx pool which will (again, hopefully) be included in the 1.6 release.

@farazdagi
Contributor

@zsfelfoldi thank you for providing more details. I've mentioned this on Gitter, copying it here:

in my tests #3660 still doesn't fix the blocking. I did the following: 1. reset chaindata 2. start node 3. in another window attach to that node 4. try eth.getBalance() and eth.estimateGas() 5. blocking/lags occur.

So, unfortunately, for us these remaining performance and responsiveness issues were show-stoppers (on mobile, LES is already understandably slower than on desktop). What we did was simply exclude any updates from c57c54c onwards: I literally cherry-picked every commit except the ones related to LES, and we have a non-blocking and almost up-to-date Geth (everything is up to date except LES).

I am not sure how c57c54c is unrelated to the issue - I see no blocking before that commit, and I see no blocking if I update everything else but LES. It is possible that the commit has been reverted and something else then re-introduced the issue; I am not sure.

At the moment we have our hackish way of keeping an updated Geth with an outdated LES 😞 - once I have more time, I will look further into the issue, as the original issue I reported (#3639) is definitely not fixed by this PR, so hopefully it gets addressed down the road.

@fjl left a comment

We did an interactive review.

waitBefore(uint64) (time.Duration, float64)
canSendOrdered() bool
orderedSend(f func())
}
Contributor

As discussed, please remove the need for 'ordered sends' by updating flow control counters when the request is actually sent.

Contributor Author

I realized that I'd really like to keep the ordered send because the order actually affects performance: the optimal thing is to keep the queue priority and always send the oldest requests first. I know this is a minor nuance, but with the recent additions to the protocol there are cases where the order of reply messages counts too, so I believe this is the cleanest solution in the long run. So I made the ordered send a general thing and put it into the common package.
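As a rough illustration of the 'ordered send' idea (not the PR's actual implementation; the names are made up), a single goroutine draining a buffered channel executes queued send functions strictly in arrival order:

package main

import (
	"fmt"
	"sync"
)

// execQueue executes queued functions one at a time, in the order they
// were added, which preserves the ordering of outgoing messages.
type execQueue struct {
	funcs chan func()
}

func newExecQueue(capacity int) *execQueue {
	q := &execQueue{funcs: make(chan func(), capacity)}
	go func() {
		for f := range q.funcs {
			f() // executed strictly in queue order
		}
	}()
	return q
}

// queue schedules f for execution; it reports false if the queue is full.
func (q *execQueue) queue(f func()) bool {
	select {
	case q.funcs <- f:
		return true
	default:
		return false
	}
}

func main() {
	q := newExecQueue(16)
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		i := i // capture loop variable
		wg.Add(1)
		q.queue(func() { fmt.Println("send", i); wg.Done() })
	}
	wg.Wait() // prints send 0, send 1, send 2, in order
}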

case retry := <-d.loopChn:
// the main loop distinguishes two kinds of signals:
// - "retry" is sent to re-check whether previously unsendable requests can now be sent to one of the peers
// - "next" is sent when there are more requests to be processed/sent
Contributor

Please remove the retry mechanism.

r.queuePrev.queueNext = r
}
}
}
Contributor

You might be able to remove this part if you use container/list.
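For reference, a sketch of what that might look like with container/list from the standard library (the distReq field shown is assumed for illustration):

package main

import (
	"container/list"
	"fmt"
)

// distReq stands in for the distributor's request type.
type distReq struct {
	reqOrder uint64
}

func main() {
	queue := list.New()

	// PushBack preserves creation order since reqOrder is assigned
	// monotonically; list.Element replaces the hand-rolled
	// queuePrev/queueNext pointers.
	e1 := queue.PushBack(&distReq{reqOrder: 1})
	queue.PushBack(&distReq{reqOrder: 2})

	// Removing from the middle is a single call instead of pointer surgery.
	queue.Remove(e1)

	for e := queue.Front(); e != nil; e = e.Next() {
		fmt.Println("order", e.Value.(*distReq).reqOrder)
	}
}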

getCost: getCost,
canSend: canSend,
request: request,
}
Contributor

Please change callers to use &distReq{getCost: func(...) uint64 { ... }, ...} instead of calling newDistReq. Doing this will add "labels" to all the func literals.
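A sketch of the requested struct-literal style (distPeer and the field signatures are assumed from the snippets above; the bodies are placeholders):

package main

// distPeer is a placeholder for the distributor's peer interface.
type distPeer interface{}

// distReq mirrors the three callback fields visible in the snippet above.
type distReq struct {
	getCost func(distPeer) uint64
	canSend func(distPeer) bool
	request func(distPeer) func()
}

func main() {
	// Func literals assigned in a struct literal get distinct labels in
	// stack traces and profiles, which is the benefit mentioned above.
	rq := &distReq{
		getCost: func(dp distPeer) uint64 { return 1 },
		canSend: func(dp distPeer) bool { return true },
		request: func(dp distPeer) func() {
			return func() { /* perform the actual send here */ }
		},
	}
	_ = rq
}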

}
if bestReq == nil {
} else {
}
Contributor

You don't need this empty condition anymore ;)

if send != nil {
peer.orderedSend(send)
}
chn <- peer
Contributor

req.sentChn <- peer
close(req.sentChn)

Contributor Author

No, I realized that remove sets sentChn to nil so there was actually a reason for this.

}
fmt.Println(" order", r.reqOrder, "prev", p)
r = r.queueNext
}
Contributor

Please remove this select clause.

@GitCop

GitCop commented Mar 18, 2017

Thank you for your contribution! Your commits seem to not adhere to the repository coding standards

  • Commit: 4f3a9eea6aff8d15aa7ce328af343e32c6bc653e

  • Commits must be prefixed with the package(s) they modify

  • Commit: 73f68d563c1fa1d0e01cedcebbe91726b2ace6de

  • Commits must be prefixed with the package(s) they modify

  • Commit: 04067f3fffb9a0ef4bf36604ec197be79cb8c337

  • Commits must be prefixed with the package(s) they modify

  • Commit: 9a82a89c338754e1b5bd0d3542ca9499ba473d7b

  • Commits must be prefixed with the package(s) they modify

Please check the contribution guidelines for more details.


This message was auto-generated by https://gitcop.com

@zsfelfoldi
Contributor Author

@fjl I made the requested changes (except for two of them, see my comments).

// Stop stops the exec queue.
func (q *ExecQueue) Stop() {
atomic.StoreInt32(&q.stop, 1)
}
Contributor

Please add this as an unexported utility in package les instead. We don't want to add new features to package common anymore.

Contributor Author

done

defer self.chainmu.Unlock()
defer func() {
self.chainmu.Unlock()
time.Sleep(time.Millisecond * 10) // ugly hack; do not hog chain lock in case syncing is CPU-limited by validation
@fjl (Contributor), Mar 22, 2017

This is indeed ugly and we should remove it. Can you explain why this helps with the lock?

}
go func() {
time.Sleep(wait)
d.loopChn <- struct{}{}
Contributor

I have a better idea for this, will submit after the merge.
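One common alternative to the goroutine-plus-Sleep pattern (not necessarily the idea hinted at here) is time.AfterFunc, which runs the callback once after the delay:

package main

import "time"

func main() {
	loopChn := make(chan struct{})
	wait := 50 * time.Millisecond

	// The timer's callback runs in its own goroutine after wait elapses,
	// replacing the explicit go func + time.Sleep.
	time.AfterFunc(wait, func() {
		loopChn <- struct{}{}
	})

	<-loopChn
}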

@fjl merged commit 525116d into ethereum:master on Mar 22, 2017