
Replace tokio-uring with io-uring #993

Merged
merged 14 commits into main from io-uring on Aug 15, 2024
Conversation

@Jake-Shadle (Collaborator)
This is a fairly major change: swapping out tokio-uring for the lower-level io-uring, which has some upsides and downsides.

Upsides

In tokio-uring, every UDP recv_from and send_to performs 3 heap allocations (maybe even more in other parts of the code?), which is extremely wasteful for a proxy that can be sending and receiving many thousands of packets a second. Moving to io-uring means we must take responsibility for the lifetimes of the memory the kernel reads and writes during I/O, but it also means we can minimize or eliminate allocations since we have the full context. For example, the QCMP loop now doesn't use the heap at all, reusing stack allocations instead.

Additionally, the current code that forwards packets downstream or upstream only ever sends 1 packet at a time per worker/session. The new code takes advantage of not being async/await by sending up to a few thousand packets concurrently, removing a (probably minor) throughput bottleneck.

Downsides

A lot more code, some of which is unsafe, though slightly less than it could have been, since the session and packet_router now share the same implementation. The non-Linux code is also now separated: the io-uring loop is not async, so we can no longer pretend the code is the same between Linux and non-Linux, which also contributes to the code increase.

Overall, it's simply more complicated than the old code, but it does give us tighter control.

@Jake-Shadle Jake-Shadle force-pushed the io-uring branch 8 times, most recently from ab6b74a to ce225a3 Compare August 8, 2024 12:16
@Jake-Shadle Jake-Shadle enabled auto-merge (squash) August 8, 2024 12:50
@XAMPPRocky (Collaborator)

Can you run some benchmarks compared to main and include the results?

src/codec/qcmp.rs — 4 resolved review threads (outdated)
@quilkin-bot (Collaborator)

Build Succeeded 🥳

Build Id: e2e97722-c167-429b-82e2-4c22c253940e

The following development images have been built, and will exist for the next 30 days:

To build this version:

git fetch git@github.com:googleforgames/quilkin.git pull/993/head:pr_993 && git checkout pr_993
cargo build

@XAMPPRocky (Collaborator)

I've reviewed the code and it LGTM. Once we have some benchmark results, so we know whether and by how much this improves performance, I'll approve.

@Jake-Shadle (Collaborator, Author)

The read-write benchmarks are non-functional even on main, so I need to figure out why they're broken and fix them first.

@Jake-Shadle (Collaborator, Author)

main:

Aggregated Function Time : count 100000 avg 0.0062609285 +/- 0.01791 min 0.002268194 max 0.753708539 sum 626.092851
# range, mid point, percentile, count
>= 0.00226819 <= 0.003 , 0.0026341 , 0.52, 519
> 0.003 <= 0.004 , 0.0035 , 16.39, 15875
> 0.004 <= 0.005 , 0.0045 , 62.38, 45983
> 0.005 <= 0.006 , 0.0055 , 70.01, 7631
> 0.006 <= 0.007 , 0.0065 , 80.66, 10655
> 0.007 <= 0.008 , 0.0075 , 87.31, 6644
> 0.008 <= 0.009 , 0.0085 , 90.74, 3432
> 0.009 <= 0.01 , 0.0095 , 93.59, 2852
> 0.01 <= 0.011 , 0.0105 , 95.11, 1523
> 0.011 <= 0.012 , 0.0115 , 96.19, 1079
> 0.012 <= 0.014 , 0.013 , 97.25, 1057
> 0.014 <= 0.016 , 0.015 , 97.89, 642
> 0.016 <= 0.018 , 0.017 , 98.31, 414
> 0.018 <= 0.02 , 0.019 , 98.59, 280
> 0.02 <= 0.025 , 0.0225 , 99.00, 410
> 0.025 <= 0.03 , 0.0275 , 99.32, 324
> 0.03 <= 0.035 , 0.0325 , 99.53, 207
> 0.035 <= 0.04 , 0.0375 , 99.69, 165
> 0.04 <= 0.045 , 0.0425 , 99.80, 111
> 0.045 <= 0.05 , 0.0475 , 99.90, 94
> 0.05 <= 0.06 , 0.055 , 99.93, 36
> 0.06 <= 0.07 , 0.065 , 99.94, 8
> 0.07 <= 0.08 , 0.075 , 99.94, 4
> 0.7 <= 0.753709 , 0.726854 , 100.00, 55
# target 50% 0.00473084
# target 75% 0.00646851
# target 90% 0.00878467
# target 99% 0.0250617
# target 99.9% 0.0508333
Error cases : count 55 avg 0.75112423 +/- 0.0009158 min 0.750025763 max 0.753708539 sum 41.3118325
# range, mid point, percentile, count
>= 0.750026 <= 0.753709 , 0.751867 , 100.00, 55
# target 50% 0.751833
# target 75% 0.752771
# target 90% 0.753333
# target 99% 0.753671
# target 99.9% 0.753705
Sockets used: 59 (for perfect no error run, would be 4)
Total Bytes sent: 2400000, received: 2398680
udp OK : 99945 (99.9 %)
udp timeout : 55 (0.1 %)
All done 100000 calls (plus 0 warmup) 6.261 ms avg, 635.6 qps

pr:

Aggregated Function Time : count 100000 avg 0.0062134501 +/- 0.01246 min 0.002205052 max 0.752862069 sum 621.345007
# range, mid point, percentile, count
>= 0.00220505 <= 0.003 , 0.00260253 , 0.34, 342
> 0.003 <= 0.004 , 0.0035 , 15.26, 14916
> 0.004 <= 0.005 , 0.0045 , 61.80, 46545
> 0.005 <= 0.006 , 0.0055 , 70.47, 8672
> 0.006 <= 0.007 , 0.0065 , 78.83, 8352
> 0.007 <= 0.008 , 0.0075 , 84.10, 5272
> 0.008 <= 0.009 , 0.0085 , 87.35, 3250
> 0.009 <= 0.01 , 0.0095 , 91.30, 3955
> 0.01 <= 0.011 , 0.0105 , 93.55, 2244
> 0.011 <= 0.012 , 0.0115 , 94.95, 1403
> 0.012 <= 0.014 , 0.013 , 96.67, 1718
> 0.014 <= 0.016 , 0.015 , 97.62, 955
> 0.016 <= 0.018 , 0.017 , 98.16, 539
> 0.018 <= 0.02 , 0.019 , 98.51, 344
> 0.02 <= 0.025 , 0.0225 , 99.07, 560
> 0.025 <= 0.03 , 0.0275 , 99.40, 333
> 0.03 <= 0.035 , 0.0325 , 99.61, 206
> 0.035 <= 0.04 , 0.0375 , 99.73, 122
> 0.04 <= 0.045 , 0.0425 , 99.86, 133
> 0.045 <= 0.05 , 0.0475 , 99.92, 58
> 0.05 <= 0.06 , 0.055 , 99.96, 41
> 0.06 <= 0.07 , 0.065 , 99.97, 9
> 0.07 <= 0.08 , 0.075 , 99.97, 5
> 0.08 <= 0.09 , 0.085 , 99.97, 1
> 0.7 <= 0.752862 , 0.726431 , 100.00, 25
# target 50% 0.00474642
# target 75% 0.00654179
# target 90% 0.00967029
# target 99% 0.0244018
# target 99.9% 0.0483621
Error cases : count 25 avg 0.75080686 +/- 0.0007781 min 0.750041919 max 0.752862069 sum 18.7701715
# range, mid point, percentile, count
>= 0.750042 <= 0.752862 , 0.751452 , 100.00, 25
# target 50% 0.751393
# target 75% 0.752128
# target 90% 0.752568
# target 99% 0.752833
# target 99.9% 0.752859
Sockets used: 29 (for perfect no error run, would be 4)
Total Bytes sent: 2400000, received: 2399400
udp OK : 99975 (100.0 %)
udp timeout : 25 (0.0 %)
All done 100000 calls (plus 0 warmup) 6.213 ms avg, 638.9 qps

This basically lines up with what I expected: the two are very close in the simplest 1<->1 case, but I would expect the difference to grow a bit with more clients and servers. It's at least not worse.

@Jake-Shadle Jake-Shadle merged commit 8d44088 into main Aug 15, 2024
12 checks passed
@markmandel markmandel added kind/feature New feature or request area/performance Anything to do with Quilkin being slow, or making it go faster. and removed kind/other labels Aug 16, 2024
@Jake-Shadle Jake-Shadle deleted the io-uring branch August 16, 2024 11:04