
Add rate limiting queue for packet-in #2015

Merged: 2 commits, Apr 7, 2021

Conversation

GraysonWu (Contributor):

Until we add a rate limiting mechanism in OVS, we use a RateLimitingQueue for packet-in events in the Antrea agent.
I have tested it using hping3 to simulate a packet flood, and the agent handles packet-in messages under the rate limit.

antoninbas (Contributor) left a comment:

@GraysonWu please update the rate as suggested and provide CPU usage data with & without your change; otherwise it's hard to evaluate the value of the PR and whether it actually addresses the issue. Also provide the parameters you use to run hping3 and generate traffic.

Comment on lines 65 to 69
if reason == uint8(PacketInReasonTF) {
featurePacketIn.packetInQueue = workqueue.NewNamedRateLimitingQueue(&workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Inf, 10)}, string(reason))
} else {
featurePacketIn.packetInQueue = workqueue.NewNamedRateLimitingQueue(&workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(5), 10)}, string(reason))
}
Contributor:

I think a rate limit of 5 is a bit too conservative, how about 100? With a burst size of 200?
I feel like we can also use this for Traceflow
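For reference, the suggested numbers would plug into the same construction as the diff above like this (just a sketch of the suggestion, not the final code):

featurePacketIn.packetInQueue = workqueue.NewNamedRateLimitingQueue(
    &workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(100), 200)},
    string(reason))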

@@ -108,7 +114,7 @@ func (c *client) subscribeFeaturePacketIn(featurePacketIn *featureStartPacketIn)
return nil
}

func (c *client) parsePacketIn(packetInQueue workqueue.Interface, packetHandlerReason uint8) {
func (c *client) parsePacketIn(packetInQueue workqueue.RateLimitingInterface, packetHandlerReason uint8) {
Contributor:

Can we call packetInQueue.Forget(obj) after packetInQueue.Done(obj), even though it is a no-op (and add a comment to mention that it is a no-op for the BucketRateLimiter implementation)?
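A minimal sketch of the resulting consumer steps, assuming the usual workqueue pattern (obj and packetInQueue follow the names used in this thread, not necessarily the final code):

obj, quit := packetInQueue.Get()
if quit {
    return
}
// handle the packet-in message here ...
packetInQueue.Done(obj)
// Forget is a no-op for the BucketRateLimiter implementation, but calling it
// keeps the code correct if a per-item rate limiter is ever used instead.
packetInQueue.Forget(obj)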

@@ -71,7 +77,7 @@ func (f *featureStartPacketIn) ListenPacketIn() {
case pktIn := <-f.subscribeCh:
// Ensure that the queue doesn't grow too big. This is NOT to provide an exact guarantee.
if f.packetInQueue.Len() < packetInQueueSize {
f.packetInQueue.Add(pktIn)
f.packetInQueue.AddRateLimited(pktIn)
Member:

As far as I can tell from the code, AddRateLimited adds a waitFor item to a channel with 1000 slots when there is no rate limiter token available, so it will block when the channel is full. And it doesn't drop any items even when the rate limit is exceeded.
Maybe the length check against the queue can avoid appending more items once it exceeds the length of the queue. However, the length of the rate limiting queue only counts the items that are ready to be processed (their rate limiting wait time has expired), not the items in the waitingForAddCh channel. If the consumers process the ready items very quickly, it could happen that there are 1000 waiting items in waitingForAddCh and 0 ready items in the queue, with a waiting item popped from the channel and pushed to the queue every 10ms (if the rate limit is 100). Each incoming packet-in event would then block for 10ms when calling AddRateLimited, and it would not be handled properly, as it would only be handled after 10s (1000 * 10ms), by which time the Traceflow request or connection has already expired.

Contributor:

We need to drop the packets here. Otherwise the backpressure will propagate to ofnet, and we will essentially block the loop that receives all messages from the switch (including bundle acknowledgements?): https://github.com/wenyingd/ofnet/blob/171b6795a2da8d488d38ae318ca3ce043481fc59/ofctrl/ofSwitch.go#L306

I am not a big fan of the current architecture: unless I am missing something, I think we have one channel / queue too many (subscribeCh and packetInQueue) and that everything could be done with a single channel.

My preference would be to change and simplify the code as follows:

  • remove packetInQueue altogether
  • make subscribeCh a buffered channel
  • change PacketRcvd (ofctrl_bridge.go) to drop the packet if the subscribe channel is full (avoid backpressure to ofnet)
  • when calling SubscribePacketIn, provide a rate limiter to control how fast packets can be dequeued by the consumer (this ensures low CPU usage for the consumer)

maybe we can define a simple queue like this one:

// (assumes the usual imports: "time", "golang.org/x/time/rate", and the ofctrl package from ofnet)
type PacketInQueue struct {
    rateLimiter *rate.Limiter
    packetsCh   chan *ofctrl.PacketIn
}

func NewPacketInQueue(size int, r rate.Limit) *PacketInQueue {
    return &PacketInQueue{rateLimiter: rate.NewLimiter(r, 1), packetsCh: make(chan *ofctrl.PacketIn, size)}
}

// AddOrDrop enqueues the packet if the channel has room and drops it otherwise,
// so that backpressure never propagates to ofnet.
func (q *PacketInQueue) AddOrDrop(packet *ofctrl.PacketIn) bool {
    select {
    case q.packetsCh <- packet:
        return true
    default:
        // channel is full
        return false
    }
}

// GetRateLimited waits until the rate limiter allows the next packet to be
// dequeued, then returns it; it returns nil when stopCh is closed.
func (q *PacketInQueue) GetRateLimited(stopCh <-chan struct{}) *ofctrl.PacketIn {
    when := q.rateLimiter.Reserve().Delay()
    t := time.NewTimer(when)
    defer t.Stop()
    select {
    case <-stopCh:
        return nil
    case <-t.C:
    }
    select {
    case <-stopCh:
        return nil
    case packet := <-q.packetsCh:
        return packet
    }
}

// receiver side:
go func() {
    for {
        packet := q.GetRateLimited(stopCh)
        if packet == nil {
            return
        }
        // call all registered handlers
    }
}()

Of course we should continue to have one instance of PacketInQueue per packet type so that Traceflow packets get their own queue.
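For example, initialization could look roughly like this (names and the 100 packets/s value are illustrative, reusing packetInQueueSize from the existing code):

// One PacketInQueue per packet-in reason, so a flood of NetworkPolicy
// packet-in messages cannot starve or delay Traceflow packets.
tfPacketInQueue := NewPacketInQueue(packetInQueueSize, rate.Limit(100))
npPacketInQueue := NewPacketInQueue(packetInQueueSize, rate.Limit(100))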

Contributor (Author):

After my test (hping3 --flood 10.10.1.114 -p 80 -2), there is no significant difference in CPU usage whether or not we use this rate limiting queue. I guess continuously calling AddRateLimited at a high rate also costs a lot of CPU? I think @antoninbas's simplified code looks good, and we can drop packets that exceed the rate this way. To make sure Traceflow won't be choked or dropped by other packet-in messages, maybe giving Traceflow its own queue would be better?

Contributor:

As I wrote at the end, yes Traceflow should have its own queue

Contributor (Author):

My bad, I missed the statement at the end.
Then let me try whether this approach can save CPU usage.

GraysonWu (Contributor, Author) commented Apr 1, 2021:

Pushed with what @antoninbas suggested here: #2015 (comment). Tested using hping3 --flood 10.10.1.114 -p 80 -2.
With a channel buffer size of 200 and a rate limit of 100 (burst 1), CPU usage:

top - 23:58:23 up 5 days, 21:21,  0 users,  load average: 5.88, 2.67, 1.39
Tasks:   3 total,   1 running,   2 sleeping,   0 stopped,   0 zombie
%Cpu(s): 39.9 us, 44.9 sy,  0.0 ni,  0.8 id,  0.0 wa,  0.0 hi, 14.4 si,  0.0 st
MiB Mem :   1987.6 total,     71.8 free,    554.7 used,   1361.0 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   1246.2 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
      1 root      20   0 1340984  45516  24148 S  57.8   2.2   1:04.76 antrea-agent
     68 root      20   0    4240   2908   2324 S   0.0   0.1   0:00.01 bash
     77 root      20   0    6092   2468   1956 R   0.0   0.1   0:00.01 top

With a channel buffer size of 200 and a rate limit of rate.Inf (burst 1), CPU usage:

top - 00:10:30 up 5 days, 21:34,  0 users,  load average: 5.63, 2.54, 1.61
Tasks:   3 total,   1 running,   2 sleeping,   0 stopped,   0 zombie
%Cpu(s): 42.6 us, 47.3 sy,  0.0 ni,  0.5 id,  0.0 wa,  0.0 hi,  9.5 si,  0.0 st
MiB Mem :   1987.6 total,     75.5 free,    595.8 used,   1316.3 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   1213.8 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
      1 root      20   0 1342260  75116  22680 S  73.8   3.7   1:07.24 antrea-agent
     66 root      20   0    4240   2284   1832 S   0.0   0.1   0:00.01 bash
     75 root      20   0    6092   3136   2624 R   0.0   0.2   0:00.00 top

antoninbas (Contributor):

Thanks @GraysonWu. At least we are making things a bit better in terms of both CPU & memory :)
TBH I was expecting worse results without the rate limiting. Are you using NP logging or NP with a reject action for your benchmark? Would you consider starting another hping from a different Pod as well?
Another benefit IMO of the change is that we will no longer block the message handling loop in ofnet because of the unbuffered packet-in receive channel, which could have affected other operations.

GraysonWu (Contributor, Author):

It's NP with a reject action. Let me try running more hping instances to see the result.

antoninbas (Contributor):

@GraysonWu could you try logging as well?

GraysonWu (Contributor, Author):
Sure.

GraysonWu force-pushed the rate-limit-queue branch 2 times, most recently from 4911d6a to 9222dd5, on April 1, 2021 04:22
codecov-io commented Apr 1, 2021:

Codecov Report

Merging #2015 (e5066af) into main (08ea67c) will decrease coverage by 23.13%.
The diff coverage is 83.87%.


@@             Coverage Diff             @@
##             main    #2015       +/-   ##
===========================================
- Coverage   65.12%   41.99%   -23.14%     
===========================================
  Files         197      256       +59     
  Lines       17407    18773     +1366     
===========================================
- Hits        11336     7883     -3453     
- Misses       4883     9794     +4911     
+ Partials     1188     1096       -92     
Flag Coverage Δ
kind-e2e-tests 41.99% <83.87%> (-14.18%) ⬇️
unit-tests ?

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
pkg/agent/openflow/client.go 46.80% <ø> (-16.86%) ⬇️
pkg/ovs/openflow/ofctrl_bridge.go 50.00% <82.60%> (+2.17%) ⬆️
pkg/agent/openflow/packetin.go 59.25% <87.50%> (-3.54%) ⬇️
pkg/controller/networkpolicy/crd_utils.go 0.00% <0.00%> (-88.61%) ⬇️
pkg/controller/networkpolicy/endpoint_querier.go 2.85% <0.00%> (-88.58%) ⬇️
...g/controller/networkpolicy/clusternetworkpolicy.go 0.00% <0.00%> (-87.15%) ⬇️
pkg/controller/networkpolicy/status_controller.go 0.00% <0.00%> (-86.85%) ⬇️
...kg/controller/networkpolicy/antreanetworkpolicy.go 0.00% <0.00%> (-85.00%) ⬇️
pkg/controller/networkpolicy/clustergroup.go 1.37% <0.00%> (-84.74%) ⬇️
pkg/agent/util/iptables/lock.go 0.00% <0.00%> (-81.82%) ⬇️
... and 179 more

antoninbas (Contributor) left a comment:

code looks good to me, let's see if @tnqn has an objection to the change

Review comments on pkg/agent/openflow/packetin.go and pkg/agent/openflow/client.go (outdated, resolved)
antoninbas (Contributor) left a comment:

I found an issue. It will affect the measurements you did previously, so I believe they need to be redone.

Review comment on pkg/ovs/openflow/ofctrl_bridge.go (resolved)
antoninbas previously approved these changes on Apr 5, 2021
antoninbas (Contributor) left a comment:

Ignore my earlier comment, it was incorrect. I just checked the Go documentation :)
LGTM

Review comments on pkg/agent/openflow/packetin.go (outdated, resolved)
GraysonWu (Contributor, Author):

/test-all
