[Follow Up] Antrea Network Policy Audit Logging Deduplication and Unit Tests #2578
Conversation
```diff
@@ -110,7 +111,7 @@ func TestDropPacketDedupLog(t *testing.T) {
 }

 func TestDropPacketMultiDedupLog(t *testing.T) {
-	antreaLogger, mockAnpLogger := newTestAntreaPolicyLogger()
+	antreaLogger, mockAnpLogger := newTestAntreaPolicyLogger(100 * time.Millisecond)
 	ob, expected := newLogInfo("Drop")

 	go sendMultiplePackets(antreaLogger, ob, 10)
```
I think @tnqn had a concern about the stability of this test. Maybe for this test we can send a packet every 40ms so that there is a bigger margin of error? We can also use a virtual clock for the tests that we advance manually, but that can be a bit tedious to do.
I tried running the test 100 times locally with enough CPU resources and it didn't fail a single time.
I tried limiting CPU to 30%, then it failed 3/100 times:
```shell
# cgcreate -g cpu:/cpulimited
# cgset -r cpu.cfs_quota_us=30000 cpulimited
# cgexec -g cpu:cpulimited go test -race -count 100 -run TestDropPacketMultiDedupLog ./pkg/agent/controller/networkpolicy/
```
```
--- FAIL: TestDropPacketMultiDedupLog (0.22s)
    audit_logging_test.go:130:
        Error Trace: audit_logging_test.go:130
        Error:       "2021/08/12 AntreaPolicyIngressRule AntreaNetworkPolicy:default/test Drop 0 SRC: 0.0.0.0 DEST: 1.1.1.1 60 TCP [8 packets in 100.404692ms]\n" does not contain "AntreaPolicyIngressRule AntreaNetworkPolicy:default/test Drop 0 SRC: 0.0.0.0 DEST: 1.1.1.1 60 TCP [9"
        Test:        TestDropPacketMultiDedupLog
--- FAIL: TestDropPacketMultiDedupLog (0.22s)
    audit_logging_test.go:130:
        Error Trace: audit_logging_test.go:130
        Error:       "2021/08/12 AntreaPolicyIngressRule AntreaNetworkPolicy:default/test Drop 0 SRC: 0.0.0.0 DEST: 1.1.1.1 60 TCP [6 packets in 100.735445ms]\n" does not contain "AntreaPolicyIngressRule AntreaNetworkPolicy:default/test Drop 0 SRC: 0.0.0.0 DEST: 1.1.1.1 60 TCP [9"
        Test:        TestDropPacketMultiDedupLog
--- FAIL: TestDropPacketMultiDedupLog (0.22s)
    audit_logging_test.go:130:
        Error Trace: audit_logging_test.go:130
        Error:       "2021/08/12 AntreaPolicyIngressRule AntreaNetworkPolicy:default/test Drop 0 SRC: 0.0.0.0 DEST: 1.1.1.1 60 TCP [8 packets in 100.463209ms]\n" does not contain "AntreaPolicyIngressRule AntreaNetworkPolicy:default/test Drop 0 SRC: 0.0.0.0 DEST: 1.1.1.1 60 TCP [9"
        Test:        TestDropPacketMultiDedupLog
```
I'm not sure how much CPU resource GitHub Actions can provide. But I'm fine with keeping it as is and fixing it when we see an actual failure.
Found a test failure in another PR: https://github.com/antrea-io/antrea/pull/2579/checks?check_run_id=3308830375, so GitHub Actions doesn't provide enough CPU resources and the test needs to be adjusted.
```
--- FAIL: TestDropPacketMultiDedupLog (0.22s)
    audit_logging_test.go:118:
        Error Trace: audit_logging_test.go:118
        Error:       "2021/08/12 AntreaPolicyIngressRule AntreaNetworkPolicy:default/test Drop 0 SRC: 0.0.0.0 DEST: 1.1.1.1 60 TCP [6 packets in 100.0035ms]\n" does not contain "AntreaPolicyIngressRule AntreaNetworkPolicy:default/test Drop 0 SRC: 0.0.0.0 DEST: 1.1.1.1 60 TCP [9"
```
Just found this PR's own Windows unit test failed too, so it's quite flaky on Windows instances.
Since the log result depends on CPU resources, does it make sense to alter the testing for MultiDedupLog a little? Testing that there are at least two log messages and that the sum of their packet counts equals the requested total, instead of a non-deterministic message distribution? Thanks!
that makes sense to me, thanks @qiyueyao
it's still sensitive to the clock, but we can ensure that there is a comfortable margin
Maybe you can generate two consecutive records that are supposed to be merged, wait for a while, then generate another record that is not supposed to be merged. That still verifies both the merge and the non-merge case and should not be too clock-sensitive. For example:
```go
antreaLogger, mockAnpLogger := newTestAntreaPolicyLogger(100 * time.Millisecond)
...
go func() {
	antreaLogger.logDedupPacket(ob)
	antreaLogger.logDedupPacket(ob)
	time.Sleep(500 * time.Millisecond)
	antreaLogger.logDedupPacket(ob)
}()
...
```
Thanks, I used the suggested solution and tested with 40% CPU power and 100 runs; it works fine.
Codecov Report
```
@@            Coverage Diff             @@
##             main    #2578      +/-   ##
==========================================
+ Coverage   60.35%   65.16%    +4.81%
==========================================
  Files         283      283
  Lines       22455    25856     +3401
==========================================
+ Hits        13552    16850     +3298
+ Misses       7472     7457       -15
- Partials     1431     1549      +118
```
Force-pushed from 994ae04 to d51b2ec
LGTM
/test-all
```diff
 	numPackets := 10
 	go func() {
 		sendMultiplePackets(antreaLogger, ob, numPackets)
 		time.Sleep(500 * time.Millisecond)
 		antreaLogger.LogDedupPacket(ob)
 	}()

 	actual := <-mockAnpLogger.logged
-	assert.Contains(t, actual, expectedLogWithCount(expected, 9))
+	assert.Contains(t, actual, expectedLogWithCount(expected, numPackets))
```
IIRC, the original purpose of this test when I suggested it was to make sure that logs are never buffered for more than bufferLength. In other words, if bufferLength is 100ms, and 11 packets are sent at a 10ms interval, the last packet should be in a different log message. The new version of the test doesn't seem to verify that anymore, because we sleep for 500ms (which is greater than the bufferLength of 100ms) before the last packet. Am I missing something?
If a log is generated at a time close to the bufferLength, I guess the test will still be flaky, as the goroutine may not get scheduled immediately when the timer expires. For example, the timer is set to expire after 100ms, but the goroutine is scheduled after 111ms, so a log generated at 110ms will be counted. Similarly, the goroutine that generates a log every 10ms may not get scheduled immediately when the ticker expires, leading to fewer records generated in 100ms. So the test generates logs much earlier than the timer expires, or much later, to avoid flakiness.
What I was suggesting is a packet every 50ms (t=0ms, t=50ms, t=100ms, t=150ms) and just checking that there are 2 log entries. Do you think 50ms is not long enough?
I tried sending 4 packets every 50ms, waiting 500ms to receive the logs, then comparing the number of logs to the number of packets. Also tested with 50% CPU power; it works.
@antoninbas 50ms is long enough, but do we expect the third entry (t=100ms) to be merged into the first log or the second log? I think it's non-deterministic, as it's on the boundary of bufferLength. The test would have to accept either (2,2) or (3,1). What I suggested is to not generate a log on the boundary, so the test can just check (2,1) deterministically. It doesn't matter whether the first interval is 1ms or 50ms, as long as it's not close to the boundary, and it's totally fine to reduce the second interval from 500ms to 150ms.
I see the test currently checks that there are at least 2 logs and that the sum matches. I think that doesn't cover the scenario where the bufferLength doesn't work at all: the test would still pass, because (1, 1, 1, 1) would be considered valid. Maybe it should check there are exactly 2 logs and the sum matches if there is a log on the boundary.
> 50ms is long enough, but do we expect the third entry (t=100ms) to be merged to the first log or the second log?

as you pointed out, both cases should be valid

> Maybe it should check there are exactly 2 logs and the sum matches if there is a log on the boundary.

yes, that's what I had in mind
Thanks for the suggestions, I updated the test case to send 4 packets every 40ms, checking that there are exactly 2 logs with a packet sum of 4.
With 50ms I could not guarantee exactly 2 logs: it could be 1 or 2 logs for 3 packets, or 2 or 3 logs for 4 packets. 40ms works because the time margin is enough for exactly 2 logs to be generated.
Force-pushed from 85c5549 to fa5b19e
Signed-off-by: Qiyue Yao <yaoq@vmware.com>
Force-pushed from fa5b19e to 285af08
```go
	mockAnpLogger.mu.Lock()
	defer mockAnpLogger.mu.Unlock()
```
I don't think we need mockAnpLogger.mu at all: channels are thread-safe
I tried without the mutex, and it gave this error.
I think it is because I used the mutex when sending to the channel to protect the map, so we need a lock here to avoid a race condition.
```
Write at 0x00c00008a730 by goroutine 77:
  runtime.closechan()
      /usr/local/go/src/runtime/chan.go:357 +0x0
  antrea.io/antrea/pkg/agent/controller/networkpolicy.TestDropPacketMultiDedupLog.func1()
      /usr/src/antrea.io/antrea/pkg/agent/controller/networkpolicy/audit_logging_test.go:127 +0x5e

Previous read at 0x00c00008a730 by goroutine 76:
  runtime.chansend()
      /usr/local/go/src/runtime/chan.go:158 +0x0
  antrea.io/antrea/pkg/agent/controller/networkpolicy.(*mockLogger).Write()
      /usr/src/antrea.io/antrea/pkg/agent/controller/networkpolicy/audit_logging_test.go:47 +0x1c6
  log.(*Logger).Output()
      /usr/local/go/src/log/log.go:181 +0x3ac
  log.(*Logger).Printf()
      /usr/local/go/src/log/log.go:188 +0x8f
  antrea.io/antrea/pkg/agent/controller/networkpolicy.(*AntreaPolicyLogger).terminateLogKey()
      /usr/src/antrea.io/antrea/pkg/agent/controller/networkpolicy/audit_logging.go:90 +0x191
  antrea.io/antrea/pkg/agent/controller/networkpolicy.(*AntreaPolicyLogger).logAfterTimer()
      /usr/src/antrea.io/antrea/pkg/agent/controller/networkpolicy/audit_logging.go:81 +0xc4
```
actually this is because the race detector believes there is no ordering guarantee between the channel sends and the channel close operation: golang/go#27769. Or it believes that a send and the close can happen concurrently. The mutex helps with the second one, so since the error goes away, I guess that's what the problem was. There's probably a way to write the code differently so that it's correct even without using a mutex, but I believe the current version is good enough for test code. I'll let @tnqn take a look, but this looks good to me.
LGTM
/test-all
This is a follow-up PR for Antrea Network Policy Audit Logging Deduplication and Unit Tests to address comments:
- `antreaPolicyEnabled` and `loggingEnabled` for extensible usage.
- `AntreaPolicyLogger` method naming.
- Moved `bufferLength` to a member of `AntreaPolicyLogger` to avoid confusion.