Advanced Audit tests flaking #60719

liggitt · 2018-03-02T17:37:50Z

Started flaking yesterday: https://storage.googleapis.com/k8s-gubernator/triage/index.html?test=Advanced%20Audit

Seems to coincide with #60237

/assign @tallclair @crassirostris

need to triage this ASAP to know if we need to roll back that PR

liggitt · 2018-03-02T18:27:07Z

/milestone clear

liggitt · 2018-03-02T18:27:14Z

/milestone v1.10

BenTheElder · 2018-03-02T18:29:26Z

/milestone v1.10
(should not work, lack of perms)

k8s-ci-robot · 2018-03-02T18:29:27Z

@BenTheElder: You must be a member of the kubernetes-milestone-maintainers github team to set the milestone.

In response to this:

> /milestone v1.10
> (should not work, lack of perms)

tallclair · 2018-03-02T19:17:49Z

From apiserver log for run ci-kubernetes-e2e-gci-gce/22334:

I0302 17:08:26.869897       1 flags.go:27] FLAG: --audit-log-batch-buffer-size="10000"
I0302 17:08:26.869904       1 flags.go:27] FLAG: --audit-log-batch-max-size="400"
I0302 17:08:26.869910       1 flags.go:27] FLAG: --audit-log-batch-max-wait="30s"
I0302 17:08:26.869919       1 flags.go:27] FLAG: --audit-log-batch-throttle-burst="15"
I0302 17:08:26.869925       1 flags.go:27] FLAG: --audit-log-batch-throttle-enable="true"
I0302 17:08:26.869931       1 flags.go:27] FLAG: --audit-log-batch-throttle-qps="10"
I0302 17:08:26.869941       1 flags.go:27] FLAG: --audit-log-format="json"
I0302 17:08:26.869947       1 flags.go:27] FLAG: --audit-log-maxage="0"
I0302 17:08:26.869953       1 flags.go:27] FLAG: --audit-log-maxbackup="0"
I0302 17:08:26.869960       1 flags.go:27] FLAG: --audit-log-maxsize="2000000000"
I0302 17:08:26.869966       1 flags.go:27] FLAG: --audit-log-mode="batch"
I0302 17:08:26.869972       1 flags.go:27] FLAG: --audit-log-path="/var/log/kube-apiserver-audit.log"
I0302 17:08:26.869979       1 flags.go:27] FLAG: --audit-policy-file="/etc/audit_policy.config"
I0302 17:08:26.869986       1 flags.go:27] FLAG: --audit-webhook-batch-buffer-size="10000"
I0302 17:08:26.869992       1 flags.go:27] FLAG: --audit-webhook-batch-initial-backoff="10s"
I0302 17:08:26.869999       1 flags.go:27] FLAG: --audit-webhook-batch-max-size="400"
I0302 17:08:26.870005       1 flags.go:27] FLAG: --audit-webhook-batch-max-wait="30s"
I0302 17:08:26.870012       1 flags.go:27] FLAG: --audit-webhook-batch-throttle-burst="15"
I0302 17:08:26.870018       1 flags.go:27] FLAG: --audit-webhook-batch-throttle-enable="false"
I0302 17:08:26.870024       1 flags.go:27] FLAG: --audit-webhook-batch-throttle-qps="10"
I0302 17:08:26.870031       1 flags.go:27] FLAG: --audit-webhook-config-file=""
I0302 17:08:26.870037       1 flags.go:27] FLAG: --audit-webhook-initial-backoff="10s"
I0302 17:08:26.870044       1 flags.go:27] FLAG: --audit-webhook-mode="batch"

I'm pretty sure this is what's happening: the default audit mode changed to batch for the logging backend, but the audit test expects the logs to be there immediately.

Recommended Action:

1. Revert the defaults on the audit flags. I don't think we can change the default behavior, for backwards-compatibility reasons.
2. Add retry logic to the audit test so that it can handle async logs.
3. (Optional) Update the e2e configuration to use batch mode with a very short timeout.

If we roll back #60237, it will cause problems with the scale tests that were re-enabled in kubernetes/test-infra#7000.

krzyzacy · 2018-03-02T19:24:28Z

http://k8s-testgrid.appspot.com/sig-release-master-blocking#gci-gce
starts to look pretty bad
priority/critical-urgent
cc @jberkus

tallclair · 2018-03-02T19:31:01Z

Actually, the forward fix would break scalability anyway. I'll send a rollback.

liggitt · 2018-03-05T06:26:58Z

/status in-progress

crassirostris · 2018-03-05T09:42:23Z

@tallclair

> Actually, the forward fix would break scalability anyway. I'll send a rollback.

No, because audit logging tests are not enabled in scalability tests: there's DisabledForLargeClusters in the test name.

crassirostris · 2018-03-05T09:42:57Z

I think the e2e test fix is better in this case

crassirostris · 2018-03-05T14:57:25Z

Filed #60794 to fix the problem

I still believe that using buffered audit logging is a better default, since slowing down all api requests is a pretty serious drawback and should be enabled consciously

crassirostris · 2018-03-05T19:04:08Z

@liggitt @tallclair If you're fine with forward-fixing the problem in e2es, I'd prefer that approach; it's the easiest. The fix is in #60794.

tallclair · 2018-03-05T20:00:37Z

> I still believe that using buffered audit logging is a better default, since slowing down all api requests is a pretty serious drawback and should be enabled consciously

If we make batch mode the default, I think we should adjust the default parameters to include:

- Very low timeout (1s?). I don't think there's any reason to collect large batches.
- Only a single goroutine (do we have the right params for this?). This is important to reduce write contention, and to keep log lines in order.

I'm OK moving forward with the test fix, but I'd like to fix those defaults before 1.10 is cut, if possible.

crassirostris · 2018-03-06T14:04:55Z

> I'm OK moving forward with the test fix, but I'd like to fix those defaults before 1.10 is cut, if possible.

I agree that defaults should change. What about changing the batch size to 1, so that each log entry is written asynchronously in its own goroutine? Anyways, this or very low timeout SGTM, since I don't think that the second option is feasible right now.

Anyways, let's continue the discussion in #60739, since it's unrelated to the flaking.

@tallclair

Automatic merge from submit-queue (batch tested with PRs 60630, 60794). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Add retrying to audit logging e2e tests

Fixes #60719

Adds retrying to the audit logging e2e tests so they work when audit logging is in batch mode and actual writing is delayed.

```release-note
NONE
```

/cc @tallclair @liggitt @sttts

liggitt · 2018-03-13T21:38:01Z

still failing in upgrade test... e2e fix likely needs backport to 1.9 e2es
http://k8s-testgrid.appspot.com/sig-release-master-upgrade#gce-1.9-master-upgrade-master

crassirostris · 2018-03-13T21:52:51Z

Good point! Created #61134 to address it

k8s-github-robot · 2018-03-15T08:38:46Z

[MILESTONENOTIFIER] Milestone Issue Needs Attention

@crassirostris @liggitt @tallclair @kubernetes/sig-auth-misc

Action Required: This issue has not been updated since Mar 13. Please provide an update.

Note: This issue is marked as priority/critical-urgent, and must be updated every 1 day during code freeze.

Example update:

ACK.  In progress
ETA: DD/MM/YYYY
Risks: Complicated fix required

Issue Labels

sig/auth: Issue will be escalated to these SIGs if needed.
priority/critical-urgent: Never automatically move issue out of a release milestone; continually escalate to contributor and SIG through all available channels.
kind/bug: Fixes a bug discovered during the current release.

@sttts

…-#60794-upstream-release-1.9

Automatic merge from submit-queue.

Automated cherry pick of #60794: Add retrying to audit logging e2e tests

Cherry pick of #60794 on release-1.9.

Fixes #60719, since audit logging behavior has changed in 1.10. Purely e2e change, so no release note.

#60794: Add retrying to audit logging e2e tests

```release-note
NONE
```

/cc @sttts @liggitt

liggitt · 2018-03-15T20:22:53Z

fixed in upgrade test by #61134

k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Mar 2, 2018

k8s-ci-robot assigned crassirostris and tallclair Mar 2, 2018

k8s-ci-robot removed the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Mar 2, 2018

liggitt added the kind/bug Categorizes issue or PR as related to a bug. label Mar 2, 2018

liggitt added this to the v1.10 milestone Mar 2, 2018

liggitt mentioned this issue Mar 2, 2018

Audit use buffered backend #60237

Merged

liggitt added the kind/flake Categorizes issue or PR as related to a flaky test. label Mar 2, 2018

k8s-ci-robot removed this from the v1.10 milestone Mar 2, 2018

krzyzacy added this to the v1.10 milestone Mar 2, 2018

k8s-github-robot added the milestone/needs-attention label Mar 2, 2018

This was referenced Mar 2, 2018

Revert "Audit use buffered backend" #60727

Closed

Fix default auditing options. #60739

Merged

liggitt mentioned this issue Mar 5, 2018

fix advanced audit e2e test #60773

Closed

k8s-ci-robot added the status/in-progress label Mar 5, 2018

k8s-github-robot removed the milestone/needs-attention label Mar 5, 2018

liggitt mentioned this issue Mar 5, 2018

[sig-auth] Advanced Audit should audit API calls flake #60777

Closed

crassirostris mentioned this issue Mar 5, 2018

Add retrying to audit logging e2e tests #60794

Merged

jberkus mentioned this issue Mar 5, 2018

1.10 Issue Burndown kubernetes/sig-release#86

Closed

k8s-github-robot closed this as completed in #60794 Mar 6, 2018

tallclair mentioned this issue Mar 6, 2018

Advanced Auditing 1.10 umbrella bug #58083

Closed

11 tasks

liggitt reopened this Mar 13, 2018

crassirostris mentioned this issue Mar 13, 2018

Automated cherry pick of #60794: Add retrying to audit logging e2e tests #61134

Merged

k8s-github-robot added the milestone/needs-attention label Mar 15, 2018

liggitt closed this as completed Mar 15, 2018
