Conversation

@philipch07
Contributor

@philipch07 philipch07 commented Oct 4, 2025

Description

Adds RACK to SCTP (see the paper that this is based on here: https://icnp20.cs.ucr.edu/proceedings/nipaa/RACK%20for%20SCTP.pdf)

I would also like to add the variable burst mitigation as described in the above paper (section 5a), but that will be done in a separate PR as it doesn't require RACK to be implemented.

Reference issue

Closes #389

Upd: expanded version of the paper above: https://duepublico2.uni-due.de/servlets/MCRFileNodeServlet/duepublico_derivate_00073893/Diss_Weinrank.pdf#page=97

@codecov

codecov bot commented Oct 4, 2025

Codecov Report

❌ Patch coverage is 88.15331% with 34 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.32%. Comparing base (012d2b4) to head (bb8671d).

Files with missing lines Patch % Lines
association.go 88.15% 17 Missing and 17 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #390      +/-   ##
==========================================
+ Coverage   81.85%   82.32%   +0.46%     
==========================================
  Files          51       51              
  Lines        3439     3717     +278     
==========================================
+ Hits         2815     3060     +245     
- Misses        482      498      +16     
- Partials      142      159      +17     
Flag Coverage Δ
go 82.32% <88.15%> (+0.46%) ⬆️


@philipch07
Contributor Author

@enobufs Hi, I think you've previously tried to implement RACK for SCTP. Would you happen to have any pointers on whether my current work looks like it's on the right track, or any notes/comments on things you had in mind? If you have a branch from previous work that I could use as a reference, that would also be very appreciated!

@philipch07 philipch07 force-pushed the pch07/rack-sctp branch 2 times, most recently from eb3bd88 to 686e5ec Compare October 7, 2025 01:23
@philipch07 philipch07 marked this pull request as ready for review October 7, 2025 01:50
@philipch07
Contributor Author

To test this I used this branch's SCTP version in the following examples:

Each of the examples (with the listed versions) behaved as expected, so I think it's looking good so far :)

@Sean-Der
Member

@philipch07 this is beyond amazing. You are doing something phenomenal and making it look easy :)

Anything I can do to help? In the back of your head, think about writing a blog post:

  • You get the credit you deserve
  • It gets people excited and teaches the next generation why this stuff is fun :)

@JoeTurki JoeTurki requested review from at-wat and sirzooro October 22, 2025 19:58
@philipch07 philipch07 force-pushed the pch07/rack-sctp branch 4 times, most recently from 6c3e8ba to 3914804 Compare October 22, 2025 21:10
@philipch07
Contributor Author

Thank you @Sean-Der! I think the only thing I need right now regarding this PR is more reviews and testing (ideally in real applications with some latency). I'm excited but also cautious about this since it seems to be a pretty desired feature!

The next thing I have in mind at the moment for RACK would be #394 (adaptive burst mitigation) which only further improves RACK's performance. I hope that the two features would make for a compelling upgrade for users :)

I'm also brainstorming some ideas for the blog post as you've mentioned but it may be tied closer to open source contributing as a whole instead of just about RACK (especially since this type of feature is a bit less flashy). With some time and luck, I think I can create an interesting narrative about my motives, interests, and journey through the contributions that I've made so far!

@philipch07
Contributor Author

Also, I performed a short test with Linux's traffic control in my WSL setup to simulate 500ms latency for outbound traffic (so anything going from server -> client would be delayed), using the data-channels example from webrtc (latest version, v4.1.6). It didn't drop any messages in either direction, whether I sent a burst of messages to the server (all of which arrived instantly) or intermittent messages.

@ValorZard

@philipch07 Could you test this with https://github.com/pion/example-webrtc-applications/tree/master/ebiten-game

@ValorZard

Also, this is off topic, but @philipch07 are you in the Discord server? I'd like to talk to you more casually about SCTP/datachannel stuff.

@Sean-Der
Member

Sean-Der commented Oct 23, 2025

this type of feature is a bit less flashy

This is the flashiest thing that could happen. If you can put up a graph that shows 'Under these network conditions file transfers happen X faster' and call out stuff that use DataChannels + Go (WebTorrent and Tor Snowflake) that is a huge thing.

I would love to introduce you to companies with a flashy intro of 'This is the person that made WebRTC DataChannels n% faster'. This, and the ICE Renomination work that @kevmo314 is doing, have, I think, the biggest chance of instantly making people's WebRTC sessions better with no effort on their side.

@philipch07
Contributor Author

@philipch07 Could you test this with https://github.com/pion/example-webrtc-applications/tree/master/ebiten-game

Just finished testing this and confirmed that this works! I didn't test it with added latency though.

This is the flashiest thing that could happen. If you can put up a graph that shows 'Under these network conditions file transfers happen X faster' and call out stuff that use DataChannels + Go (WebTorrent and Tor Snowflake) that is a huge thing.

That's a fantastic idea! It makes me think of creating some sort of visual simulation to show the impact of a RACK vs. non-RACK implementation in practice!

I would love to introduce you to companies with a flashy intro of 'This is the person that made WebRTC Datachannels n% faster'. This and ICE Renomination that @kevmo314 working on I think have biggest change of just instantly making peoples WebRTC sessions better with no effort on their side.

That makes sense to me, and I agree that "free" performance with 0 changes/settings toggled is always a win for a user :)

@philipch07
Contributor Author

Whoops forgot to @Sean-Der in the above comment^

@kevmo314

Chiming in to say that this is really cool! The paper is really interesting, so cool to see this in pion.

Member

@JoeTurki JoeTurki left a comment


This is really good. I reviewed it over several days, and there are no obvious issues in the code logic.
Really excited to see the benchmark.
Thank you a lot <3

@philipch07 philipch07 force-pushed the pch07/rack-sctp branch 4 times, most recently from 8fc1911 to d7f2b58 Compare October 23, 2025 22:53
Member

@at-wat at-wat left a comment


Looks great to have better retransmission control!

I'm trying to understand the original TCP-RACK algorithm and the SCTP-RACK paper, but I'm not done yet.
I've added a comment based on TCP-RACK's RFC.

srtt := a.rtoMgr.setNewRTT(rtt)
a.srtt.Store(srtt)

// RACK (NOT RFC 4960): track minRTT and latest delivered *original* send time
Member


RFC8985 (TCP RACK) says

The sender SHOULD track a windowed min-filtered estimate of recent RTT measurements that can adapt when migrating to significantly longer paths rather than tracking a simple global minimum of all RTT measurements.

However, the current implementation seems to track the global minimum RTT. Could that cause a problem when the network delay changes dynamically (i.e., when the network delay increases after the minimum RTT was measured)?

https://datatracker.ietf.org/doc/html/rfc8985#section-6.2-1

Contributor Author


I was thinking about this for a while (when I wasn't able to test it), and if my current understanding is correct, then I believe that the scenario you described:

  1. Start with a low network delay
  2. Send something (file, msg, etc.)
  3. Increase network delay
  4. Monitor output

would lead to an over-aggressive rate of retransmission, which could cause a high amount of traffic.

I just finished some really basic testing for this and ended up with some logs. You can find the logging/printing approach in this sctp branch: https://github.com/pion/sctp/commits/pch07/test-rack-logs/ and my logs from the data-channel example in this webrtc branch: https://github.com/pion/webrtc/commits/pch07/test-sctp-rack/.

I think the current behavior is likely problematic. Thanks for catching it, and let me know if you find anything else! I'm going to try to come up with some way to properly calculate the minRtt with a windowed approach like RFC 8985 suggests.

Contributor Author


(Also thank you for reviewing!!)

Contributor Author


@JoeTurki helped to write a test to go from lowRTT -> highRTT (012d2b4) which worked perfectly with global min, but it was very broken when going from highRTT -> lowRTT with global min.

My commit 244b97f has fixed this now, as confirmed by @JoeTurki: #390 (comment).

Thank you for catching this! My approach was to use a monotonic deque to handle the window, with a default timeout of 30s (configurable). I tried to find what timeout other implementations use but couldn't readily find one. If you or anyone else has ideas for what the default should be, please let me know!!

association.go Outdated
a.rackTimer.Stop()
}

a.rackTimer = time.AfterFunc(dur, func() {
Member


I'm a little curious how frequently the timers are created and how much that affects performance.

Contributor Author


I couldn't get pprof working on my setup, but from looking at the source with @JoeTurki, AfterFunc seems pretty wasteful here, and the timer approach can be implemented better. I will replace this with goroutines and channels. Thanks for pointing this out!

Contributor Author


I forgot to comment that I fixed this in my latest commit: bb8671d. Thanks for finding this and let me know what you think!

@philipch07
Contributor Author

I just found a slightly expanded version of the paper that we've all been looking at 🤦: https://duepublico2.uni-due.de/servlets/MCRFileNodeServlet/duepublico_derivate_00073893/Diss_Weinrank.pdf#page=97

From what I can tell, the core information is mostly the same, but it has a much-expanded section 5.

I just updated the PR description to include it as well.

@sirzooro
Contributor

@philipch07 nice work! I was busy earlier this week and did not have time to look at it yet. I am reading the paper now.

@JoeTurki
Member

JoeTurki commented Oct 25, 2025

@philipch07 Hello! I added a test for the condition you asked me to test in Discord DMs (high RTT -> low RTT). After you rebase, you can un-comment this part:

sctp/vnet_test.go

Lines 944 to 957 in 012d2b4

    // we can uncomment this to check the FR stats.
    // I tested it and it works fine on the pch07/rack-sctp branch.
    // cs := <-clientStatsCh
    // ss := <-serverStatsCh
    //
    // if assert.True(t, cs.ok, "client assoc/stats unavailable") {
    //     assert.LessOrEqual(t, cs.fr, uint64(2),
    //         "client fast retransmits should be low")
    // }
    // if assert.True(t, ss.ok, "server assoc/stats unavailable") {
    //     assert.LessOrEqual(t, ss.fr, uint64(2),
    //         "server fast retransmits should be low")
    // }
}

Your implementation handles it very well:

  1. No missing data.
  2. No excessive fast retransmits.

I'll also test for the other way around.

@JoeTurki
Member

Going from low RTT to high RTT caused packets to drop, but that's fixed after commit 244b97f.

It needed to go to extreme values though, and it was more of a manual test. I'll try to get it into CI, but it's very hard to write in a way that's not flaky and finishes in under ~20s.

before:

--- FAIL: TestRACK_RTTSwitch_Regression (41.00s)
    vnet_test.go:887: 
                Error Trace:    /home/joe/work/pion/sctp/vnet_test.go:887
                Error:          regression
                Test:           TestRACK_RTTSwitch_Regression
                Messages:       missing echoes: got=61 want=200 missing=[61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199]
FAIL
FAIL    github.com/pion/sctp    41.006s
FAIL

After:

[joe@nixos:~/work/pion/sctp]$ go test -run ^TestRACK_RTTSwitch_Regression$ github.com/pion/sctp -count 1
ok  	github.com/pion/sctp	36.082s

Thank you.

@JoeTurki
Member

Also, there are many network conditions that make the current SCTP implementation drop packets, but this implementation handles them very well. Really amazing work.


Development

Successfully merging this pull request may close these issues.

Add RACK to SCTP

7 participants