Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Follow Up]Antrea Network Policy Audit Logging Deduplication and Unit Tests #2578

Merged
merged 6 commits into from
Aug 20, 2021

Conversation

qiyueyao
Copy link
Contributor

This is a follow up PR for Antrea Network Policy Audit Logging Deduplication and Unit Tests to address comments.

  • Converted back to include both antreaPolicyEnabled and loggingEnabled for extensible usage.
  • Changed AntreaPolicyLogger method naming.
  • Moved bufferLength to a member of AntreaPolicyLogger to avoid confusion.

pkg/agent/controller/networkpolicy/audit_logging.go Outdated Show resolved Hide resolved
pkg/agent/controller/networkpolicy/audit_logging_test.go Outdated Show resolved Hide resolved
@@ -110,7 +111,7 @@ func TestDropPacketDedupLog(t *testing.T) {
}

func TestDropPacketMultiDedupLog(t *testing.T) {
antreaLogger, mockAnpLogger := newTestAntreaPolicyLogger()
antreaLogger, mockAnpLogger := newTestAntreaPolicyLogger(100 * time.Millisecond)
ob, expected := newLogInfo("Drop")

go sendMultiplePackets(antreaLogger, ob, 10)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @tnqn had a concern about the stability of this test. Maybe for this test we can send a packet every 40ms so that there is a bigger margin of error? We can also use a virtual clock for the tests that we advance manually, but that can be a bit tedious to do.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried running the test 100 times locally with enough CPU resources and it didn't fail a single time.
I tried limitting CPU to 30% then it failed 3/100 times.

# cgcreate -g cpu:/cpulimited
# cgset -r cpu.cfs_quota_us=30000 cpulimited
# cgexec -g cpu:cpulimited go test -race -count 100 -run TestDropPacketMultiDedupLog ./pkg/agent/controller/networkpolicy/
--- FAIL: TestDropPacketMultiDedupLog (0.22s)
    audit_logging_test.go:130:
                Error Trace:    audit_logging_test.go:130
                Error:          "2021/08/12 AntreaPolicyIngressRule AntreaNetworkPolicy:default/test Drop 0 SRC: 0.0.0.0 DEST: 1.1.1.1 60 TCP [8 packets in 100.404692ms]\n" does not contain "AntreaPolicyIngressRule AntreaNetworkPolicy:default/test Drop 0 SRC: 0.0.0.0 DEST: 1.1.1.1 60 TCP [9"
                Test:           TestDropPacketMultiDedupLog
--- FAIL: TestDropPacketMultiDedupLog (0.22s)
    audit_logging_test.go:130:
                Error Trace:    audit_logging_test.go:130
                Error:          "2021/08/12 AntreaPolicyIngressRule AntreaNetworkPolicy:default/test Drop 0 SRC: 0.0.0.0 DEST: 1.1.1.1 60 TCP [6 packets in 100.735445ms]\n" does not contain "AntreaPolicyIngressRule AntreaNetworkPolicy:default/test Drop 0 SRC: 0.0.0.0 DEST: 1.1.1.1 60 TCP [9"
                Test:           TestDropPacketMultiDedupLog
--- FAIL: TestDropPacketMultiDedupLog (0.22s)
    audit_logging_test.go:130:
                Error Trace:    audit_logging_test.go:130
                Error:          "2021/08/12 AntreaPolicyIngressRule AntreaNetworkPolicy:default/test Drop 0 SRC: 0.0.0.0 DEST: 1.1.1.1 60 TCP [8 packets in 100.463209ms]\n" does not contain "AntreaPolicyIngressRule AntreaNetworkPolicy:default/test Drop 0 SRC: 0.0.0.0 DEST: 1.1.1.1 60 TCP [9"
                Test:           TestDropPacketMultiDedupLog

I'm not sure how much CPU resources github action can provide. But I'm fine with keeping it as is and fixing it when we see actual failure.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found a test failure in another PR: https://github.com/antrea-io/antrea/pull/2579/checks?check_run_id=3308830375, so Github action doesn't provide enough CPU resources and the test needs to be adjusted.

--- FAIL: TestDropPacketMultiDedupLog (0.22s)
    audit_logging_test.go:118: 
        	Error Trace:	audit_logging_test.go:118
        	Error:      	"2021/08/12 AntreaPolicyIngressRule AntreaNetworkPolicy:default/test Drop 0 SRC: 0.0.0.0 DEST: 1.1.1.1 60 TCP [6 packets in 100.0035ms]\n" does not contain "AntreaPolicyIngressRule AntreaNetworkPolicy:default/test Drop 0 SRC: 0.0.0.0 DEST: 1.1.1.1 60 TCP [9"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just found this PR's own windows unit test failed too, so it's quite flaky on windows instances.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the log result depends on CPU resources, does it make sense to alter the testing for MultiDedupLog a little? Testing that there are at least two log messages, and the sum of these msg equal to requested? Instead of non-determinant msg distribution. Thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that makes sense to me, thanks @qiyueyao
it's still sensitive to the clock, but we can ensure that there is a comfortable margin

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you can generate two continuous records that are supposed to be merged, wait for a while, generate another record that is not supposed to be merged, which can still verify the merge and unmerge case and should not be too clock sensitive. For example:

antreaLogger, mockAnpLogger := newTestAntreaPolicyLogger(100 * time.Millisecond)
...
go func() {
	antreaLogger.logDedupPacket(ob)
	antreaLogger.logDedupPacket(ob)
	time.Sleep(500*time.Millisecond)
	antreaLogger.logDedupPacket(ob)
}()
...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I used the suggested solution and tested with 40% cpu power & 100 counts, it works fine.

@antoninbas antoninbas requested a review from tnqn August 11, 2021 23:10
@qiyueyao qiyueyao changed the title [Follow Up]Antrea Network Policy Audit Logging Deduplication and Unit Tests [Follow Up][WIP]Antrea Network Policy Audit Logging Deduplication and Unit Tests Aug 11, 2021
@codecov-commenter
Copy link

codecov-commenter commented Aug 12, 2021

Codecov Report

Merging #2578 (d3b8891) into main (fdf3034) will increase coverage by 4.81%.
The diff coverage is 94.44%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2578      +/-   ##
==========================================
+ Coverage   60.35%   65.16%   +4.81%     
==========================================
  Files         283      283              
  Lines       22455    25856    +3401     
==========================================
+ Hits        13552    16850    +3298     
+ Misses       7472     7457      -15     
- Partials     1431     1549     +118     
Flag Coverage Δ
e2e-tests 55.64% <86.11%> (?)
kind-e2e-tests 47.36% <83.87%> (+<0.01%) ⬆️
unit-tests 41.97% <67.74%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...ntroller/networkpolicy/networkpolicy_controller.go 71.06% <66.66%> (+1.60%) ⬆️
...kg/agent/controller/networkpolicy/audit_logging.go 96.36% <100.00%> (+0.71%) ⬆️
pkg/agent/controller/networkpolicy/packetin.go 71.70% <100.00%> (+0.57%) ⬆️
pkg/controller/egress/ipallocator/allocator.go 67.82% <0.00%> (-15.16%) ⬇️
pkg/controller/networkpolicy/endpoint_querier.go 77.64% <0.00%> (-13.79%) ⬇️
pkg/apis/controlplane/v1beta1/conversion.go 72.44% <0.00%> (-11.89%) ⬇️
pkg/legacyapis/core/v1alpha2/register.go 69.23% <0.00%> (-10.77%) ⬇️
pkg/controller/egress/controller.go 76.76% <0.00%> (-10.44%) ⬇️
pkg/apis/stats/register.go 71.42% <0.00%> (-10.39%) ⬇️
pkg/legacyapis/stats/register.go 71.42% <0.00%> (-10.39%) ⬇️
... and 270 more

tnqn
tnqn previously approved these changes Aug 18, 2021
Copy link
Member

@tnqn tnqn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tnqn
Copy link
Member

tnqn commented Aug 18, 2021

/test-all

Comment on lines 121 to 129
numPackets := 10
go func() {
sendMultiplePackets(antreaLogger, ob, numPackets)
time.Sleep(500 * time.Millisecond)
antreaLogger.LogDedupPacket(ob)
}()

actual := <-mockAnpLogger.logged
assert.Contains(t, actual, expectedLogWithCount(expected, 9))
assert.Contains(t, actual, expectedLogWithCount(expected, numPackets))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, the original purpose of this test when I suggested it was to make sure that logs are never buffered for more than bufferLength. In other words, if bufferLength is 100ms, and 11 packets are sent at a 10ms interval, the last packet should be in a different log message. The new version of the test doesn't seems to verify that anymore, because we sleep for 500ms (which is greater than the bufferLength of 100ms) before the last packet. Am I missing something?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a log is generated at a time close to the bufferLength, I guess the test will still be flaky as the goroutine may not get scheduled immediately when the timer expires. For example, the timer is set to expire after 100ms, but the goroutine is scheduled after 111ms, then a log generated at 110ms will be counted. Similarly, the goroutine that generates a log every 10ms may not get scheduled immediately when the ticker expires, leading to less records generated in 100ms. So the test generates logs much earlier than the timer expires or much later after it to avoid flaky.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I was suggesting is a packet every 50ms (t=0ms, t=50ms, t=100ms, t=150ms) and just checking that there are 2 log entries. Do you think 50ms is not long enough?

Copy link
Contributor Author

@qiyueyao qiyueyao Aug 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to send 4 packets every 50ms, wait for 500ms to received the logs, then compare number of logs&number of packets. Also tested with 50% cpu power, works.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@antoninbas 50ms is long enough, but do we expect the third entry (t=100ms) to be merged to the first log or the second log? I think it's not deterministic as it's on the boundary of bufferLength. The test will have to check it gets (2,2) or (3,1). What I suggested is to not generate a log on the boundary, so the test can just check (2,1) deterministically. It doesn't matter the first interval is 1ms or 50ms as long as it's not close to the boundary, and it's totally fine to reduce the second interval from 500ms to 150ms.
I see currently the test checks whether there are at least 2 logs and the sum matches. I think it doesn't cover a scenario that even the bufferLength doesn't work, the test will pass as (1, 1, 1, 1) will be considered valid. Maybe it should check there are exactly 2 logs and the sum matches if there is a log on the boundary.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

50ms is long enough, but do we expect the third entry (t=100ms) to be merged to the first log or the second log?

as you pointed out, both cases should be valid

Maybe it should check there are exactly 2 logs and the sum matches if there is a log on the boundary.

yes, that's what I had in mind

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestions, I updated the testcase to be 4 packets every 40ms, checking there are exactly 2 logs and sum of 4.
Because if 50ms, then I could not make sure there are exactly 2 logs, either 1/2 for 3 packets, or 2/3 for 4 packets. 40ms works as the time margin is enough for 2 logs to be generated.

Signed-off-by: Qiyue Yao <yaoq@vmware.com>
Signed-off-by: Qiyue Yao <yaoq@vmware.com>
Signed-off-by: Qiyue Yao <yaoq@vmware.com>
Signed-off-by: Qiyue Yao <yaoq@vmware.com>
Signed-off-by: Qiyue Yao <yaoq@vmware.com>
pkg/agent/controller/networkpolicy/audit_logging_test.go Outdated Show resolved Hide resolved
pkg/agent/controller/networkpolicy/audit_logging_test.go Outdated Show resolved Hide resolved
Comment on lines 88 to 89
mockAnpLogger.mu.Lock()
defer mockAnpLogger.mu.Unlock()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need mockAnpLogger.mu at all: channels are thread-safe

Copy link
Contributor Author

@qiyueyao qiyueyao Aug 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried without mutex, it gave this error.
I think it is because I used mutex when sending to channel to protect the map, so we need a lock here to avoid race condition.

Write at 0x00c00008a730 by goroutine 77:
  runtime.closechan()
      /usr/local/go/src/runtime/chan.go:357 +0x0
  antrea.io/antrea/pkg/agent/controller/networkpolicy.TestDropPacketMultiDedupLog.func1()
      /usr/src/antrea.io/antrea/pkg/agent/controller/networkpolicy/audit_logging_test.go:127 +0x5e

Previous read at 0x00c00008a730 by goroutine 76:
  runtime.chansend()
      /usr/local/go/src/runtime/chan.go:158 +0x0
  antrea.io/antrea/pkg/agent/controller/networkpolicy.(*mockLogger).Write()
      /usr/src/antrea.io/antrea/pkg/agent/controller/networkpolicy/audit_logging_test.go:47 +0x1c6
  log.(*Logger).Output()
      /usr/local/go/src/log/log.go:181 +0x3ac
  log.(*Logger).Printf()
      /usr/local/go/src/log/log.go:188 +0x8f
  antrea.io/antrea/pkg/agent/controller/networkpolicy.(*AntreaPolicyLogger).terminateLogKey()
      /usr/src/antrea.io/antrea/pkg/agent/controller/networkpolicy/audit_logging.go:90 +0x191
  antrea.io/antrea/pkg/agent/controller/networkpolicy.(*AntreaPolicyLogger).logAfterTimer()
      /usr/src/antrea.io/antrea/pkg/agent/controller/networkpolicy/audit_logging.go:81 +0xc4```

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually this is because the race detector believes there is no ordering guarantee between the channel sends and the channel close operation: golang/go#27769. Or believes that a send and the close can happen concurrently. The mutex helps with the second one, so I guess since the error goes away that's what the problem was. There's probably a way to write the code differently so that it's correct even without using a mutex, but I believe the current version is good enough for test code. I'll let @tnqn take a look but this looks good to me.

pkg/agent/controller/networkpolicy/audit_logging_test.go Outdated Show resolved Hide resolved
Signed-off-by: Qiyue Yao <yaoq@vmware.com>
Copy link
Member

@tnqn tnqn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tnqn
Copy link
Member

tnqn commented Aug 20, 2021

/test-all

@antoninbas antoninbas merged commit 1550786 into antrea-io:main Aug 20, 2021
@qiyueyao qiyueyao deleted the LogDedupFollowup branch September 14, 2021 15:54
@qiyueyao qiyueyao changed the title [Follow Up][WIP]Antrea Network Policy Audit Logging Deduplication and Unit Tests [Follow Up]Antrea Network Policy Audit Logging Deduplication and Unit Tests Sep 14, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants