
Re-implement standard traits for socket using poll + recv #462

Merged (3 commits) on Nov 23, 2021

Conversation

thirstycrow
Contributor

This is a continuation of #451. In this PR, the standard traits for GlommioStream are re-implemented using poll + recv. Performance of the echo and proxy benchmarks improved compared to the current master branch, but the results could be even better if the invocations of rush_dispatch could be removed from poll_recv. However, to do that, #458 needs to be addressed first; otherwise there is a good probability that the proxy benchmark will deadlock in the 1000-client case.

I also tested two epoll-based implementations, which give the best performance of all. Their efficiency comes from edge-triggered epoll: no additional overhead is introduced once the socket fd is registered with epoll. In the poll + recv implementation, by contrast, an additional SQE is needed each time we poll the socket fd, because poll on io_uring works in oneshot mode.

There may be a chance for a fast-poll based implementation to surpass the epoll ones in some scenarios. But I believe that could only happen after #458 is handled appropriately, so we can avoid rush_dispatch in socket I/O.

[Screenshot: benchmark results, 2021-11-15]
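As a rough illustration of the oneshot-poll overhead described above, here is a hypothetical Rust sketch (not glommio's actual code; `MockRing`, `RecvState`, and `poll_recv` are invented names): every time a non-blocking recv would block, a fresh POLL_ADD SQE must be submitted, whereas an edge-triggered epoll registration pays its setup cost only once.

```rust
use std::collections::VecDeque;

/// A mock ring that only counts submitted poll SQEs; completions are
/// simulated via a queue. Purely illustrative, not glommio's reactor.
#[derive(Default)]
struct MockRing {
    poll_sqes_submitted: usize,
    ready_events: VecDeque<()>,
}

impl MockRing {
    fn submit_poll_add(&mut self) {
        // Oneshot semantics: each readiness wait re-arms the poll,
        // costing one more SQE.
        self.poll_sqes_submitted += 1;
        self.ready_events.push_back(());
    }
}

enum RecvState {
    /// The socket may be readable; try a non-blocking recv first.
    TryRecv,
    /// recv would have blocked; a oneshot poll SQE is in flight.
    WaitingForPoll,
}

struct Stream {
    state: RecvState,
}

impl Stream {
    /// One iteration of the poll + recv pattern: recv first, fall back
    /// to a poll SQE when no data is available. `data_available` mocks
    /// the kernel's socket buffer.
    fn poll_recv(&mut self, ring: &mut MockRing, data_available: bool) -> Option<&'static str> {
        match self.state {
            RecvState::TryRecv => {
                if data_available {
                    Some("data") // recv succeeded without touching the ring
                } else {
                    ring.submit_poll_add(); // oneshot: a fresh SQE every time
                    self.state = RecvState::WaitingForPoll;
                    None // Poll::Pending in the real future
                }
            }
            RecvState::WaitingForPoll => {
                if ring.ready_events.pop_front().is_some() {
                    self.state = RecvState::TryRecv;
                }
                None
            }
        }
    }
}
```

In this model, N "socket not ready" episodes cost N poll SQEs, which is the per-wait overhead that an edge-triggered epoll registration avoids.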

Collaborator

@glommer glommer left a comment


Looks straightforward to me.

I think I am a bit lost with the new signature of add_waiter_single and what we're trying to accomplish in that loop.

It used to be a simple push + pop, so if there's more logic now, I think that warrants a comment.

@@ -409,7 +409,7 @@ impl StreamWriter {
.upgrade()
.unwrap()
.fdatasync(self.file.as_ref().unwrap().as_raw_fd());
source.add_waiter_single(cx.waker().clone());
source.add_waiter_single(cx.waker());
Collaborator


@HippoBaro please review this, since we had issues before with add_waiter_single vs multiple.

inner.wakers.waiters.pop();
inner.wakers.waiters.push(waker);
let waiters = &mut inner.wakers.waiters;
match waiters.first_mut() {
Collaborator


Can you explain the purpose behind this ?

Member


From what I can see, this is meant to add the Waker only if it's different from the one already in place.
This looks wrong because Waker::will_wake makes zero guarantees; i.e., it could return true when two wakers would wake different tasks.

Maybe the usage is safe and this is sound, but a comment would be great so we can understand what you're trying to do. This code is fragile, and we introduced some pretty nasty bugs here before 😬

Contributor Author

@thirstycrow thirstycrow Nov 16, 2021


While I was trying to understand the difference between glommio's and tokio's socket implementations, I found that tokio uses the same trick to avoid cloning a waker when possible. So I modified add_waiter_single in the same way to exclude that noise from the benchmark. It turns out it made little difference either way. I kept it, however, because it is very likely that the existing waker and the one being added wake the same task.
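For reference, the will_wake trick under discussion can be sketched like this (a hypothetical reconstruction; `Wakers`, `add_waiter_single`, and `noop_waker` here are illustrative, not glommio's exact code): the clone is skipped whenever the stored waker reports it would wake the same task as the incoming one.

```rust
use std::task::{RawWaker, RawWakerVTable, Waker};

/// Single-waiter storage, sketched after the code in the diff above.
#[derive(Default)]
struct Wakers {
    waiters: Vec<Waker>,
}

impl Wakers {
    /// Store `waker` as the single waiter, skipping the clone when the
    /// existing waiter already claims to wake the same task. Note the
    /// caveat from the review: will_wake makes no hard guarantee, so
    /// this is an optimization, not a correctness argument.
    fn add_waiter_single(&mut self, waker: &Waker) {
        match self.waiters.first_mut() {
            Some(existing) => {
                if !existing.will_wake(waker) {
                    *existing = waker.clone();
                }
            }
            None => self.waiters.push(waker.clone()),
        }
    }
}

/// A no-op waker for demonstration only; real wakers come from the executor.
fn noop_waker() -> Waker {
    fn clone(_: *const ()) -> RawWaker {
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}
```

Because will_wake compares the waker's data pointer and vtable, two wakers cloned from the same task's waker typically compare equal, which is what makes skipping the clone worthwhile on the hot path.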

@glommer
Collaborator

glommer commented Nov 15, 2021

At some point I think it will be worth understanding why we're still behind in the 1-client case, since improving performance there is likely to lift all boats.

But it is very encouraging that this work moves us in the right direction for the multi-client case.

How many CPUs did you have in this benchmark, btw ?

@thirstycrow
Contributor Author

thirstycrow commented Nov 16, 2021

In the 1-client case, the code in between each task run is executed more frequently, so its complexity has a stronger influence on performance than in the multi-client case. In glommio, a preempt timer is armed each time before running a task queue, a user timer is armed if there is one, and the latency ring is linked to the main ring before going to sleep. These are all done with io_uring. None of them can be submitted in batch, so io_uring_enter is triggered every time. All of this consumes CPU cycles and postpones the processing of incoming I/O events.

In the poll + epoll implementation, there is one more thing: the epoll fd needs to be linked to the main ring. Although this saves the cost of submitting SQEs for polling, it did even worse in the 1-client case.

In the reactor-on-epoll implementation, the rings are registered with epoll, which is waited on when sleeping, so the linking business is eliminated. Thus the results for the 1-client case are improved. The percentage number could be improved up to ~93% if the preempt timer is removed, but then we would lose the ability to preempt. (Perhaps it could be optimized by making preempt_timer_duration optional, and setting it to None for the default task queue.)
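The parenthetical suggestion above could look roughly like this hypothetical sketch (`TaskQueueConfig` and `timer_sqes_per_cycle` are invented names, not glommio's API): a queue with no preempt duration simply skips arming the timer, saving one SQE (and the accompanying io_uring_enter) per scheduling cycle.

```rust
use std::time::Duration;

/// Illustrative per-queue configuration: None disables preemption
/// for this queue at the cost of losing the ability to preempt it.
struct TaskQueueConfig {
    preempt_timer_duration: Option<Duration>,
}

impl TaskQueueConfig {
    /// The default queue suggested above: no preempt timer.
    fn default_queue() -> Self {
        Self { preempt_timer_duration: None }
    }

    /// A queue that keeps preemption, paying for the timer SQE.
    fn preemptible(preempt: Duration) -> Self {
        Self { preempt_timer_duration: Some(preempt) }
    }

    /// Timer SQEs that must be armed before running this queue once.
    fn timer_sqes_per_cycle(&self) -> usize {
        self.preempt_timer_duration.map_or(0, |_| 1)
    }
}
```

The trade-off is exactly the one stated in the comment: zero timer traffic per cycle for the default queue, in exchange for giving up preemption on it.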

@thirstycrow
Contributor Author

There is 1 thread for the client and 1 for the server in the echo benchmark. In the proxy benchmark, there are 2 tokio threads shared by the client and the echo server, and 1 thread for the proxy server.

@glommer
Collaborator

glommer commented Nov 16, 2021

It's ultimately fine, because the 1-client cases are generally uninteresting and I'd gladly sacrifice performance there for other cases that make sense.

But something doesn't add up: this is work we do when idle. Let me think about this a bit more, but it should not block you in any way.

@bryandmc
Collaborator

> I also tested two epoll-based implementations, which give the best performance of all. Their efficiency comes from edge-triggered epoll: no additional overhead is introduced once the socket fd is registered with epoll. In the poll + recv implementation, by contrast, an additional SQE is needed each time we poll the socket fd, because poll on io_uring works in oneshot mode.

Multishot poll seems like the solution to this, but unfortunately it has to be optional until it becomes more mainstream. I'm working on adding that, but will wait for this to land before integrating it.

> There may be a chance for a fast-poll based implementation to surpass the epoll ones in some scenarios. But I believe that could only happen after #458 is handled appropriately, so we can avoid rush_dispatch in socket I/O.

I have high hopes that we can create fast-poll based APIs that are the fastest of them all, but we will see if that can be achieved. In theory, registered buffers with fast-poll would be incredibly powerful for proxying if it can be done right. I have a design document, and I actually think (for the first time) it might be possible. It's based on an email Jens himself wrote about a workflow that might enable us to accomplish this without too much excessive traffic on the rings.

Great work, btw! I guess we shouldn't be surprised that traits that were designed around epoll would force a usage pattern that favors epoll-style designs. I'll do a closer review later, but so far it looks good.

glommio/benches/echo.rs (two review threads, outdated and resolved)
@glommer
Collaborator

glommer commented Nov 20, 2021

@bryandmc do you still have any major concerns here?
I like the direction and would like to see some progress here.

Any objections to merge?

@glommer
Collaborator

glommer commented Nov 20, 2021

@thirstycrow if I understand correctly, the biggest performance win comes from not using io_uring at all and going with epoll?
That's not what's going on in this patch, is it?

@glommer
Collaborator

glommer commented Nov 20, 2021

Also #458 should be handled by now.

Collaborator

@bryandmc bryandmc left a comment


@glommer - looks good in my opinion. This doesn't add any epoll or anything like that; it just does the poll + nonblocking recv. So I say we ship it... it's a strong basis for the further work we've discussed in a few of these issues.

EDIT: technically epoll works best for these types of traits, and I think that makes sense if you consider their design and how epoll works. They were designed to work with epoll specifically.

@thirstycrow
Contributor Author

@glommer Yes, epoll matches the existing traits better. But the combination of fast-poll + multishot poll looks more promising. Once that is realized, I think it's fine to have a slightly less efficient alternative for the old traits. So I'll keep the epoll-based implementations just for comparing benchmark results, and I'm not going to submit them for the moment.

@HippoBaro
Member

I am okay with merging this, but the branch is out of sync with master. @thirstycrow can you rebase and make sure each commit passes fmt/clippy/tests individually? I would prefer to rebase this instead of squashing, since this is not a trivial change :)

@github-actions

Greetings @thirstycrow!

It looks like your PR added a new dependency or changed an existing one, and CI has failed to validate your changes.
Some possible reasons this could happen:

  • One of the dependencies you added uses a restricted license. See deny.toml for a list of licenses we allow;
  • You added or updated a dependency and didn't update the LICENSE-3rdparty.csv file. To do so, run the following and commit the changes:
$ rm -f Cargo.lock
$ cargo install cargo-license
$ cargo license --all-features -a -j --no-deps -d | jq -r '(["Component","Origin","License","Copyright"]) as $cols | map(. as $row | ["name", "repository", "license", "authors"] | map($row[.])) as $rows | $cols, $rows[] | @csv' > LICENSE-3rdparty.csv

Thank you!

@thirstycrow thirstycrow changed the title WIP: re-implement standard traits for socket using poll + recv Re-implement standard traits for socket using poll + recv Nov 23, 2021
@thirstycrow
Contributor Author

@HippoBaro It is rebased now. The commits other than the first one just fixed fmt/clippy issues, so I simply squashed them.

@glommer Rushed dispatches are commented out in the last commit. I tested this version in a long-running benchmark of our application, and no deadlock happened. Please let me know if you want it rolled back.

@thirstycrow
Contributor Author

thirstycrow commented Nov 23, 2021

It is sometimes annoying when LICENSE-3rdparty.csv is made outdated by updated dependency crates. Maybe it would be better if this file were maintained as a build artifact instead of as part of the source code.

@bryandmc bryandmc merged commit 019371b into DataDog:master Nov 23, 2021
@glommer
Collaborator

glommer commented Nov 23, 2021

Agree about the license. I asked @HippoBaro to review that.

I am afraid the lack of rushed dispatch may hurt throughput-oriented benchmarks with a single socket, but for now let's lean on the data we do have, which indicates that their removal helps.
