
Re-implement standard traits for socket using poll + recv #462

Merged (3 commits) on Nov 23, 2021

Conversation

thirstycrow
Contributor

This is a continuation of #451. In this PR, the standard traits for GlommioStream are re-implemented using poll + recv. Performance of the echo and proxy benchmarks improved compared to the current master branch, but the results could be even better if the invocations of rush_dispatch could be removed from poll_recv. However, to do that, #458 needs to be addressed first; otherwise there is a good probability that the proxy benchmark will deadlock in the 1000-client case.

I also tested two epoll-based implementations, which give the best performance of all. Their efficiency comes from edge-triggered epoll: no additional overhead is introduced once the socket fd is registered with epoll. In the poll + recv implementation, by contrast, an additional SQE is needed each time we poll the socket fd, because poll on io_uring works in oneshot mode.

There may be a chance for a fast-poll based implementation to surpass the epoll ones in some scenarios. But I believe that could only happen after #458 is handled appropriately, so we can avoid rush_dispatch in socket I/O.

[Screenshot: benchmark results, 2021-11-15]
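As a rough illustration of the oneshot-poll overhead described above, here is a hypothetical Rust sketch (not glommio's actual code; `MockRing`, `RecvState`, and `poll_recv` are invented names): every time a non-blocking recv would block, a fresh POLL_ADD SQE must be submitted, whereas an edge-triggered epoll registration pays its setup cost only once.

```rust
use std::collections::VecDeque;

/// A mock ring that only counts submitted poll SQEs; completions are
/// simulated via a queue. Purely illustrative, not glommio's reactor.
#[derive(Default)]
struct MockRing {
    poll_sqes_submitted: usize,
    ready_events: VecDeque<()>,
}

impl MockRing {
    fn submit_poll_add(&mut self) {
        // Oneshot semantics: each readiness wait re-arms the poll,
        // costing one more SQE.
        self.poll_sqes_submitted += 1;
        self.ready_events.push_back(());
    }
}

enum RecvState {
    /// The socket may be readable; try a non-blocking recv first.
    TryRecv,
    /// recv would have blocked; a oneshot poll SQE is in flight.
    WaitingForPoll,
}

struct Stream {
    state: RecvState,
}

impl Stream {
    /// One iteration of the poll + recv pattern: recv first, fall back
    /// to a poll SQE when no data is available. `data_available` mocks
    /// the kernel's socket buffer.
    fn poll_recv(&mut self, ring: &mut MockRing, data_available: bool) -> Option<&'static str> {
        match self.state {
            RecvState::TryRecv => {
                if data_available {
                    Some("data") // recv succeeded without touching the ring
                } else {
                    ring.submit_poll_add(); // oneshot: a fresh SQE every time
                    self.state = RecvState::WaitingForPoll;
                    None // Poll::Pending in the real future
                }
            }
            RecvState::WaitingForPoll => {
                if ring.ready_events.pop_front().is_some() {
                    self.state = RecvState::TryRecv;
                }
                None
            }
        }
    }
}
```

In this model, N "socket not ready" episodes cost N poll SQEs, which is the per-wait overhead that an edge-triggered epoll registration avoids.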

Collaborator

@glommer glommer left a comment


Looks straightforward to me.

I think I am a bit lost with the new signature of add_waiter_single and what we're trying to accomplish in that loop.

It used to be a simple push + pop, so if there's more logic now, I think that warrants a comment.

@@ -409,7 +409,7 @@ impl StreamWriter {
.upgrade()
.unwrap()
.fdatasync(self.file.as_ref().unwrap().as_raw_fd());
source.add_waiter_single(cx.waker().clone());
source.add_waiter_single(cx.waker());
Collaborator


@HippoBaro please review this, since we had issues before with add_waiter_single vs multiple.

inner.wakers.waiters.pop();
inner.wakers.waiters.push(waker);
let waiters = &mut inner.wakers.waiters;
match waiters.first_mut() {
Collaborator


Can you explain the purpose behind this ?

Member


From what I can see, this is meant to add the Waker only if it's different from the one already in place.
This looks wrong because Waker::will_wake makes zero guarantees; i.e., it could return true when two wakers would wake different tasks.

Maybe the usage is safe and this is sound, but a comment would be great so we can understand what you're trying to do. This code is fragile, and we introduced some pretty nasty bugs here before 😬

Contributor Author

@thirstycrow thirstycrow Nov 16, 2021


While I was trying to understand the difference between glommio's and tokio's socket implementations, I found that tokio uses the same trick to avoid cloning a waker when possible. So I modified add_waiter_single in the same way to exclude that noise from the benchmark. It turns out it made little difference either way. I kept it, however, because it is very likely that the existing waker and the one being added wake the same task.
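For reference, the will_wake trick under discussion can be sketched like this (a hypothetical reconstruction; `Wakers`, `add_waiter_single`, and `noop_waker` here are illustrative, not glommio's exact code): the clone is skipped whenever the stored waker reports it would wake the same task as the incoming one.

```rust
use std::task::{RawWaker, RawWakerVTable, Waker};

/// Single-waiter storage, sketched after the code in the diff above.
#[derive(Default)]
struct Wakers {
    waiters: Vec<Waker>,
}

impl Wakers {
    /// Store `waker` as the single waiter, skipping the clone when the
    /// existing waiter already claims to wake the same task. Note the
    /// caveat from the review: will_wake makes no hard guarantee, so
    /// this is an optimization, not a correctness argument.
    fn add_waiter_single(&mut self, waker: &Waker) {
        match self.waiters.first_mut() {
            Some(existing) => {
                if !existing.will_wake(waker) {
                    *existing = waker.clone();
                }
            }
            None => self.waiters.push(waker.clone()),
        }
    }
}

/// A no-op waker for demonstration only; real wakers come from the executor.
fn noop_waker() -> Waker {
    fn clone(_: *const ()) -> RawWaker {
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}
```

Because will_wake compares the waker's data pointer and vtable, two wakers cloned from the same task's waker typically compare equal, which is what makes skipping the clone worthwhile on the hot path.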

@glommer
Collaborator

glommer commented Nov 15, 2021

At some point I think it will be worth understanding why we're still behind in the 1-client case, since improving performance there is likely to lift all boats.

But it is very encouraging that this work moves us in the right direction for the multi-client case.

How many CPUs did you have in this benchmark, btw ?

@thirstycrow
Contributor Author

thirstycrow commented Nov 16, 2021

In the 1-client case, the code in between each task run is executed more frequently, so its complexity has a stronger influence on performance than in the multi-client case. In glommio, a preempt timer is armed each time before running a task queue, a user timer is armed if there is one, and the latency ring is linked to the main ring before going to sleep. These are all done with io_uring. None of them can be submitted in batch, so io_uring_enter is triggered every time. All of this consumes CPU cycles and postpones the processing of incoming I/O events.

In the poll + epoll implementation, there is one more thing: the epoll fd needs to be linked to the main ring. Although this saves the cost of submitting SQEs for polling, it did even worse in the 1-client case.

In the reactor-on-epoll implementation, the rings are registered with epoll, which is waited on when sleeping, so the linking business is eliminated. Thus the results for the 1-client case are improved. The percentage number could be improved up to ~93% if the preempt timer is removed, but then we would lose the ability to preempt. (Perhaps it could be optimized by making preempt_timer_duration optional, and setting it to None for the default task queue.)
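The parenthetical suggestion above could look roughly like this hypothetical sketch (`TaskQueueConfig` and `timer_sqes_per_cycle` are invented names, not glommio's API): a queue with no preempt duration simply skips arming the timer, saving one SQE (and the accompanying io_uring_enter) per scheduling cycle.

```rust
use std::time::Duration;

/// Illustrative per-queue configuration: None disables preemption
/// for this queue at the cost of losing the ability to preempt it.
struct TaskQueueConfig {
    preempt_timer_duration: Option<Duration>,
}

impl TaskQueueConfig {
    /// The default queue suggested above: no preempt timer.
    fn default_queue() -> Self {
        Self { preempt_timer_duration: None }
    }

    /// A queue that keeps preemption, paying for the timer SQE.
    fn preemptible(preempt: Duration) -> Self {
        Self { preempt_timer_duration: Some(preempt) }
    }

    /// Timer SQEs that must be armed before running this queue once.
    fn timer_sqes_per_cycle(&self) -> usize {
        self.preempt_timer_duration.map_or(0, |_| 1)
    }
}
```

The trade-off is exactly the one stated in the comment: zero timer traffic per cycle for the default queue, in exchange for giving up preemption on it.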

@thirstycrow
Contributor Author

There is 1 thread for the client and 1 for the server in the echo benchmark. In the proxy benchmark, there are 2 tokio threads shared by the client and the echo server, and 1 thread for the proxy server.

@glommer
Collaborator

glommer commented Nov 16, 2021

It's ultimately fine, because the 1-client cases are generally uninteresting and I'd gladly sacrifice performance there for other cases that make sense.

But something doesn't add up: this is work we do when idle. Let me think about this a bit more, but it should not block you in any way.

@bryandmc
Collaborator

> I also tested two epoll-based implementations, which give the best performance of all. Their efficiency comes from edge-triggered epoll: no additional overhead is introduced once the socket fd is registered with epoll. In the poll + recv implementation, by contrast, an additional SQE is needed each time we poll the socket fd, because poll on io_uring works in oneshot mode.

Multishot poll seems like the solution to this, but unfortunately it has to be optional until it becomes more mainstream. I'm working on adding that, but will wait for this to land before integrating it.

> There may be a chance for a fast-poll based implementation to surpass the epoll ones in some scenarios. But I believe that could only happen after #458 is handled appropriately, so we can avoid rush_dispatch in socket I/O.

I have high hopes that we can create fast-poll based APIs that are the fastest of them all, but we will see if that can be achieved. In theory, registered buffers with fast-poll would be incredibly powerful for proxying if it can be done right. I have a design document, and I actually think (for the first time) it might be possible. It's based on an email Jens himself wrote about a workflow that might enable us to accomplish this without too much excessive traffic on the rings.

Great work, btw! I guess we shouldn't be surprised that traits that were designed around epoll would force a usage pattern that favors epoll-style designs. I'll do a closer review later, but so far it looks good.

glommio/benches/echo.rs (two review threads, outdated and resolved)
@glommer
Collaborator

glommer commented Nov 20, 2021

@bryandmc do you still have any major concerns here?
I like the direction and would like to see some progress here.

Any objections to merge?

@glommer
Collaborator

glommer commented Nov 20, 2021

@thirstycrow if I understand correctly, the biggest performance win comes from not using io_uring at all and going with epoll?
That's not what's going on in this patch, is it?

@glommer
Collaborator

glommer commented Nov 20, 2021

Also #458 should be handled by now.

Collaborator

@bryandmc bryandmc left a comment


@glommer - looks good in my opinion. This doesn't add any epoll or anything like that; it just does the poll + nonblocking recv. So I say we ship it... it's a strong basis for the further work we've discussed in a few of these issues.

EDIT: technically epoll works best for these types of traits, and I think that makes sense if you consider their design and how epoll works. They were designed to work with epoll specifically.

@thirstycrow
Contributor Author

@glommer Yes, epoll matches the existing traits better. But the combination of fast-poll + multishot poll looks more promising. Once that is realized, I think it's fine to have a slightly less efficient alternative for the old traits. So I'll keep the epoll-based implementations just for comparing benchmark results, and I'm not going to submit them for the moment.

@HippoBaro
Member

I am okay with merging this, but the branch is out of sync with master. @thirstycrow can you rebase and make sure each commit passes fmt/clippy/tests individually? I would prefer to rebase this instead of squashing, since this is not a trivial change :)

@github-actions

Greetings @thirstycrow!

It looks like your PR added a new dependency or changed an existing one, and CI has failed to validate your changes.
Some possible reasons this could happen:

  • One of the dependencies you added uses a restricted license. See deny.toml for a list of licenses we allow;
  • You added or updated a dependency and didn't update the LICENSE-3rdparty.csv file. To do so, run the following and commit the changes:
$ rm -f Cargo.lock
$ cargo install cargo-license
$ cargo license --all-features -a -j --no-deps -d | jq -r '(["Component","Origin","License","Copyright"]) as $cols | map(. as $row | ["name", "repository", "license", "authors"] | map($row[.])) as $rows | $cols, $rows[] | @csv' > LICENSE-3rdparty.csv

Thank you!

@thirstycrow thirstycrow changed the title WIP: re-implement standard traits for socket using poll + recv Re-implement standard traits for socket using poll + recv Nov 23, 2021
@thirstycrow
Contributor Author

@HippoBaro It is rebased now. The commits other than the first one just fixed fmt/clippy issues, so I simply squashed them.

@glommer Rushed dispatches are commented out in the last commit. I tested this version in a long-running benchmark of our application, and no deadlock happened. Please let me know if you want it rolled back.

@thirstycrow
Contributor Author

thirstycrow commented Nov 23, 2021

It is sometimes annoying when LICENSE-3rdparty.csv is made outdated by updated dependency crates. Maybe it would be better if this file were maintained as a build artifact instead of as part of the source code.

@bryandmc bryandmc merged commit 019371b into DataDog:master Nov 23, 2021
@glommer
Collaborator

glommer commented Nov 23, 2021

Agree about the license. I asked @HippoBaro to review that.

I am afraid the lack of rushed dispatch may hurt throughput-oriented benchmarks with a single socket, but for now let's lean on the data we do have, which indicates that their removal helps.
