Re-implement standard traits for socket using poll + recv #462
Conversation
Looks straightforward to me.
I think I am a bit lost with the new signature of `add_waiter_single` and what we're trying to accomplish in that loop.
It used to be a simple push + pop, so if there's more logic now, I think that warrants a comment.
```diff
@@ -409,7 +409,7 @@ impl StreamWriter {
                 .upgrade()
                 .unwrap()
                 .fdatasync(self.file.as_ref().unwrap().as_raw_fd());
-            source.add_waiter_single(cx.waker().clone());
+            source.add_waiter_single(cx.waker());
```
@HippoBaro please review this, since we had issues before with `add_waiter_single` vs multiple.
```diff
-        inner.wakers.waiters.pop();
-        inner.wakers.waiters.push(waker);
+        let waiters = &mut inner.wakers.waiters;
+        match waiters.first_mut() {
```
Can you explain the purpose behind this?
From what I can see, this is meant to replace the stored `Waker` only if it's different from the one already in place. This looks wrong, because `Waker::will_wake` makes zero guarantees; i.e., it could return `true` when two wakers would wake different tasks.
Maybe the usage is safe and this is sound, but a comment would be great so we can understand what you're trying to do. This code is susceptible to subtle bugs, and we introduced some pretty nasty ones here before 😬
While I was trying to understand the difference between glommio's and tokio's socket implementations, I found that tokio uses the same trick to avoid cloning a waker when possible. So I modified `add_waiter_single` in the same way, to exclude noise from the benchmark. It turns out it makes little difference whether we have this change or not. I kept it, however, because it is very likely that the existing waker and the one being added will wake the same task.
At some point I think it will be worth understanding why we're still behind in the 1-client case, since improving performance there is likely to lift all boats. But it is very encouraging that this work moves us in the right direction for the multi-client case. How many CPUs did you have in this benchmark, btw?
In the 1-client case, the code between each task run executes more frequently, so its complexity has a stronger influence on performance than in the multi-client case. In glommio, a preempt timer is armed each time before running a task queue, a user timer is armed if there is one, and the latency ring is linked to the main ring before going to sleep. These are all done with io_uring, none of them can be submitted in a batch, and an io_uring enter is triggered every time. All of this consumes CPU cycles and postpones the processing of incoming IO events. In the poll + epoll implementation there is one more thing: the epoll fd needs to be linked to the main ring. Although it saves the cost of submitting SQEs for polling, it did even worse in the 1-client case. In the reactor-on-epoll implementation, the rings are registered to epoll, which is waited on when sleeping, so the linking business is eliminated and the results for the 1-client case improve. The percentage number could be improved up to ~93% if the preempt timer were removed, but that way we would lose the ability to preempt. (Perhaps it could be optimized by making …)
There's 1 thread for the client and 1 for the server in the echo benchmark. There are 2 tokio threads shared by the client and the echo server, and 1 thread for the proxy server in the proxy benchmark.
It's ultimately fine, because the 1-client cases are generally uninteresting and I'd gladly sacrifice performance there for other cases that make sense. But something doesn't add up: this is work we do on idle. Let me think about this a bit more, but this should not block you in any way.
Multishot poll seems like the solution to this, but unfortunately it has to be optional until it becomes more mainstream. I'm working on adding that, but will wait for this to land before integrating it.
I have high hopes that we can create fast-poll based APIs that are the fastest of them all -- but we will see if that can be achieved. In theory, registered-buffers fast-poll would be incredibly powerful for proxying if it can be done right. I have a design document, and I actually think (for the first time) it might be possible. It's based on an email Jens himself wrote about the workflow that might enable us to accomplish this without too much excessive traffic on the rings. Great work btw! I guess we shouldn't be surprised that traits that were designed around epoll would force a usage pattern that favors epoll-style designs. I'll do a closer review later, but so far it looks good.
@bryandmc do you still have any major concerns here? Any objections to merging?
@thirstycrow if I understand correctly, the biggest performance gain comes from just not using io_uring at all and going with epoll?
Also #458 should be handled by now. |
@glommer - looks good in my opinion. This doesn't add any epoll or anything like that; it just does the poll + nonblocking recv. So I say we ship it... it's a strong basis for the further work we've discussed in a few of these issues.
EDIT: technically, epoll works best for these types of traits, and I think that makes sense if you consider their design and how epoll works. They were designed to work with epoll specifically.
@glommer Yes, epoll matches better with the existing traits. But the combination of …
I am okay with merging this, but the branch is out-of-sync with master. @thirstycrow can you rebase and make sure each commit passes fmt/clippy/tests individually? I would prefer to rebase this instead of squashing since this is not a trivial change :)
Greetings @thirstycrow! It looks like your PR added a new or changed an existing dependency, and CI has failed to validate your changes.
Thank you!
@HippoBaro It is rebased now. The commits other than the first one were used to fix fmt/clippy issues, so I simply squashed them. @glommer Rushed dispatches are commented out in the last commit. I tested this version in a long-running benchmark of our application, and no deadlock happened. Please let me know if you want it rolled back.
I find it annoying sometimes when …
Agree about the license. I asked @HippoBaro to review that. I am afraid the lack of rushed dispatch may hurt throughput-oriented benchmarks with a single socket, but for now let's lean on the data that we do have, which indicates that its removal helps.
This is a continuation of #451. In this PR, the standard traits for `GlommioStream` are re-implemented using poll + recv. Performance of the `echo` and `proxy` benchmarks improved compared to the current master branch, but the result could be even better if the invocations of `rush_dispatch` could be removed from `poll_recv`. However, to do that, #458 needs to be addressed first; otherwise there is a good probability for the proxy benchmark to deadlock in the 1000-client cases.

I also tested two epoll-based implementations, which give the best performance of all. Their efficiency comes from edge-triggered epoll: no additional overhead is introduced after the socket fd is registered with the epoll. In the poll + recv implementation, by contrast, an additional SQE is needed each time we poll the socket fd, because poll on io_uring works in oneshot mode.

There may be a chance for a fast-poll based implementation to surpass the epoll ones in some scenarios. But I believe that could only happen after #458 is handled appropriately, so we can avoid `rush_dispatch` in socket IO.
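The poll + recv pattern described above can be sketched as follows. This is a minimal illustration using std nonblocking sockets, with a hypothetical `try_recv` helper (not glommio's actual API); the comment marks where an io_uring-based reactor would submit the oneshot poll SQE, which is exactly the per-poll cost the edge-triggered epoll variants avoid.

```rust
use std::io::{ErrorKind, Read};
use std::net::{TcpListener, TcpStream};

/// Illustrative sketch of the poll + recv pattern (hypothetical helper,
/// not glommio's actual API): try a nonblocking recv first, and only when
/// it returns WouldBlock does readiness polling need to be armed.
fn try_recv(stream: &mut TcpStream, buf: &mut [u8]) -> Option<usize> {
    match stream.read(buf) {
        Ok(n) => Some(n),
        Err(e) if e.kind() == ErrorKind::WouldBlock => {
            // In an io_uring-based reactor this is where a POLL_ADD SQE
            // would be submitted (oneshot mode, so one SQE per poll) and
            // the task's waker stored. With edge-triggered epoll, the fd
            // is registered only once and no per-poll work is needed.
            None
        }
        Err(e) => panic!("recv failed: {}", e),
    }
}

fn main() -> std::io::Result<()> {
    // Loopback socket pair for the demonstration.
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;
    let _client = TcpStream::connect(addr)?;
    let (mut server, _) = listener.accept()?;
    server.set_nonblocking(true)?;

    let mut buf = [0u8; 64];
    // Nothing has been written yet, so the read reports WouldBlock and a
    // poll would have to be armed before retrying.
    assert!(try_recv(&mut server, &mut buf).is_none());
    Ok(())
}
```

The recv-first ordering matters: when data is already buffered, the read succeeds immediately and no poll SQE is ever submitted, so the oneshot-poll overhead is only paid on an actually idle socket.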