[Follow Up] Antrea Network Policy Audit Logging Deduplication and Unit Tests #2578
Conversation
```diff
@@ -110,7 +111,7 @@ func TestDropPacketDedupLog(t *testing.T) {
 }

 func TestDropPacketMultiDedupLog(t *testing.T) {
-	antreaLogger, mockAnpLogger := newTestAntreaPolicyLogger()
+	antreaLogger, mockAnpLogger := newTestAntreaPolicyLogger(100 * time.Millisecond)
 	ob, expected := newLogInfo("Drop")

 	go sendMultiplePackets(antreaLogger, ob, 10)
```
I think @tnqn had a concern about the stability of this test. Maybe for this test we can send a packet every 40ms so that there is a bigger margin of error? We can also use a virtual clock for the tests that we advance manually, but that can be a bit tedious to do.
I tried running the test 100 times locally with enough CPU resources and it didn't fail a single time.
I tried limiting CPU to 30%, then it failed 3/100 times:
```shell
# cgcreate -g cpu:/cpulimited
# cgset -r cpu.cfs_quota_us=30000 cpulimited
# cgexec -g cpu:cpulimited go test -race -count 100 -run TestDropPacketMultiDedupLog ./pkg/agent/controller/networkpolicy/
```
```
--- FAIL: TestDropPacketMultiDedupLog (0.22s)
    audit_logging_test.go:130:
        Error Trace: audit_logging_test.go:130
        Error:       "2021/08/12 AntreaPolicyIngressRule AntreaNetworkPolicy:default/test Drop 0 SRC: 0.0.0.0 DEST: 1.1.1.1 60 TCP [8 packets in 100.404692ms]\n" does not contain "AntreaPolicyIngressRule AntreaNetworkPolicy:default/test Drop 0 SRC: 0.0.0.0 DEST: 1.1.1.1 60 TCP [9"
        Test:        TestDropPacketMultiDedupLog
--- FAIL: TestDropPacketMultiDedupLog (0.22s)
    audit_logging_test.go:130:
        Error Trace: audit_logging_test.go:130
        Error:       "2021/08/12 AntreaPolicyIngressRule AntreaNetworkPolicy:default/test Drop 0 SRC: 0.0.0.0 DEST: 1.1.1.1 60 TCP [6 packets in 100.735445ms]\n" does not contain "AntreaPolicyIngressRule AntreaNetworkPolicy:default/test Drop 0 SRC: 0.0.0.0 DEST: 1.1.1.1 60 TCP [9"
        Test:        TestDropPacketMultiDedupLog
--- FAIL: TestDropPacketMultiDedupLog (0.22s)
    audit_logging_test.go:130:
        Error Trace: audit_logging_test.go:130
        Error:       "2021/08/12 AntreaPolicyIngressRule AntreaNetworkPolicy:default/test Drop 0 SRC: 0.0.0.0 DEST: 1.1.1.1 60 TCP [8 packets in 100.463209ms]\n" does not contain "AntreaPolicyIngressRule AntreaNetworkPolicy:default/test Drop 0 SRC: 0.0.0.0 DEST: 1.1.1.1 60 TCP [9"
        Test:        TestDropPacketMultiDedupLog
```
I'm not sure how much CPU resource GitHub Actions can provide. But I'm fine with keeping it as is and fixing it when we see an actual failure.
Found a test failure in another PR: https://github.com/antrea-io/antrea/pull/2579/checks?check_run_id=3308830375, so GitHub Actions doesn't provide enough CPU resources and the test needs to be adjusted.
```
--- FAIL: TestDropPacketMultiDedupLog (0.22s)
    audit_logging_test.go:118:
        Error Trace: audit_logging_test.go:118
        Error:       "2021/08/12 AntreaPolicyIngressRule AntreaNetworkPolicy:default/test Drop 0 SRC: 0.0.0.0 DEST: 1.1.1.1 60 TCP [6 packets in 100.0035ms]\n" does not contain "AntreaPolicyIngressRule AntreaNetworkPolicy:default/test Drop 0 SRC: 0.0.0.0 DEST: 1.1.1.1 60 TCP [9"
```
Just found this PR's own Windows unit test failed too, so it's quite flaky on Windows instances.
Since the log result depends on CPU resources, does it make sense to alter the testing for MultiDedupLog a little? Testing that there are at least two log messages and that the sum of their packet counts equals the requested total, instead of a non-deterministic message distribution? Thanks!
that makes sense to me, thanks @qiyueyao
it's still sensitive to the clock, but we can ensure that there is a comfortable margin
Maybe you can generate two consecutive records that are supposed to be merged, wait for a while, then generate another record that is not supposed to be merged. That still verifies both the merge and the non-merge case and should not be too clock-sensitive. For example:
```go
antreaLogger, mockAnpLogger := newTestAntreaPolicyLogger(100 * time.Millisecond)
...
go func() {
	antreaLogger.logDedupPacket(ob)
	antreaLogger.logDedupPacket(ob)
	time.Sleep(500 * time.Millisecond)
	antreaLogger.logDedupPacket(ob)
}()
...
```
Thanks, I used the suggested solution and tested with 40% CPU power and 100 runs; it works fine.
Codecov Report
```
@@            Coverage Diff             @@
##             main    #2578      +/-   ##
==========================================
+ Coverage   60.35%   65.16%    +4.81%
==========================================
  Files         283      283
  Lines       22455    25856     +3401
==========================================
+ Hits        13552    16850     +3298
+ Misses       7472     7457       -15
- Partials     1431     1549      +118
```
Force-pushed from 994ae04 to d51b2ec
LGTM
/test-all
```diff
 	numPackets := 10
 	go func() {
 		sendMultiplePackets(antreaLogger, ob, numPackets)
 		time.Sleep(500 * time.Millisecond)
 		antreaLogger.LogDedupPacket(ob)
 	}()

 	actual := <-mockAnpLogger.logged
-	assert.Contains(t, actual, expectedLogWithCount(expected, 9))
+	assert.Contains(t, actual, expectedLogWithCount(expected, numPackets))
```
IIRC, the original purpose of this test when I suggested it was to make sure that logs are never buffered for more than bufferLength. In other words, if bufferLength is 100ms, and 11 packets are sent at a 10ms interval, the last packet should be in a different log message. The new version of the test doesn't seem to verify that anymore, because we sleep for 500ms (which is greater than the bufferLength of 100ms) before the last packet. Am I missing something?
If a log is generated at a time close to the bufferLength, I guess the test will still be flaky, as the goroutine may not get scheduled immediately when the timer expires. For example, the timer is set to expire after 100ms, but the goroutine is scheduled after 111ms, so a log generated at 110ms will be counted. Similarly, the goroutine that generates a log every 10ms may not get scheduled immediately when the ticker expires, leading to fewer records generated in 100ms. So the test generates logs much earlier than the timer expires, or much later, to avoid flakiness.
What I was suggesting is a packet every 50ms (t=0ms, t=50ms, t=100ms, t=150ms) and just checking that there are 2 log entries. Do you think 50ms is not long enough?
I tried sending 4 packets every 50ms, waiting 500ms to receive the logs, then comparing the number of logs to the number of packets. Also tested with 50% CPU power; it works.
@antoninbas 50ms is long enough, but do we expect the third entry (t=100ms) to be merged into the first log or the second log? I think it's non-deterministic, as it's on the boundary of bufferLength. The test would have to accept either (2,2) or (3,1). What I suggested is to not generate a log on the boundary, so the test can just check (2,1) deterministically. It doesn't matter whether the first interval is 1ms or 50ms, as long as it's not close to the boundary, and it's totally fine to reduce the second interval from 500ms to 150ms.
I see the test currently checks that there are at least 2 logs and that the sum matches. I think that doesn't cover the scenario where the bufferLength doesn't work at all: the test would still pass, because (1, 1, 1, 1) would be considered valid. Maybe it should check there are exactly 2 logs and the sum matches if there is a log on the boundary.
> 50ms is long enough, but do we expect the third entry (t=100ms) to be merged to the first log or the second log?

as you pointed out, both cases should be valid

> Maybe it should check there are exactly 2 logs and the sum matches if there is a log on the boundary.

yes, that's what I had in mind
Thanks for the suggestions, I updated the test case to send 4 packets every 40ms, checking that there are exactly 2 logs with a packet sum of 4.
With 50ms I could not guarantee exactly 2 logs: it could be 1 or 2 logs for 3 packets, or 2 or 3 logs for 4 packets. 40ms works because the time margin is enough for exactly 2 logs to be generated.
Force-pushed from 85c5549 to fa5b19e
Signed-off-by: Qiyue Yao <yaoq@vmware.com>
Force-pushed from fa5b19e to 285af08
```go
	mockAnpLogger.mu.Lock()
	defer mockAnpLogger.mu.Unlock()
```
I don't think we need mockAnpLogger.mu at all: channels are thread-safe
I tried without the mutex, and it gave this error.
I think it is because I used the mutex when sending to the channel to protect the map, so we need a lock here to avoid a race condition.
```
Write at 0x00c00008a730 by goroutine 77:
  runtime.closechan()
      /usr/local/go/src/runtime/chan.go:357 +0x0
  antrea.io/antrea/pkg/agent/controller/networkpolicy.TestDropPacketMultiDedupLog.func1()
      /usr/src/antrea.io/antrea/pkg/agent/controller/networkpolicy/audit_logging_test.go:127 +0x5e

Previous read at 0x00c00008a730 by goroutine 76:
  runtime.chansend()
      /usr/local/go/src/runtime/chan.go:158 +0x0
  antrea.io/antrea/pkg/agent/controller/networkpolicy.(*mockLogger).Write()
      /usr/src/antrea.io/antrea/pkg/agent/controller/networkpolicy/audit_logging_test.go:47 +0x1c6
  log.(*Logger).Output()
      /usr/local/go/src/log/log.go:181 +0x3ac
  log.(*Logger).Printf()
      /usr/local/go/src/log/log.go:188 +0x8f
  antrea.io/antrea/pkg/agent/controller/networkpolicy.(*AntreaPolicyLogger).terminateLogKey()
      /usr/src/antrea.io/antrea/pkg/agent/controller/networkpolicy/audit_logging.go:90 +0x191
  antrea.io/antrea/pkg/agent/controller/networkpolicy.(*AntreaPolicyLogger).logAfterTimer()
      /usr/src/antrea.io/antrea/pkg/agent/controller/networkpolicy/audit_logging.go:81 +0xc4
```
actually this is because the race detector believes there is no ordering guarantee between the channel sends and the channel close operation: golang/go#27769. Or it believes that a send and the close can happen concurrently. The mutex helps with the second one, so since the error goes away, I guess that's what the problem was. There's probably a way to write the code differently so that it's correct even without using a mutex, but I believe the current version is good enough for test code. I'll let @tnqn take a look, but this looks good to me.
LGTM
/test-all
This is a follow-up PR for Antrea Network Policy Audit Logging Deduplication and Unit Tests to address comments:
- `antreaPolicyEnabled` and `loggingEnabled` for extensible usage.
- `AntreaPolicyLogger` method naming.
- Moved `bufferLength` to a member of `AntreaPolicyLogger` to avoid confusion.