[Umbrella Issue] Auditing improvements

EDIT: Opting to treat this as an umbrella issue instead of a placeholder to noodle on ideas

An umbrella issue to capture ideas and suggestions to improve our audit process.

Currently:
- we have a hand-written bash script that dumps information about services we care about
- we have a prowjob that runs this script periodically as a service account, which we've tried to give just enough read access via a custom IAM role
- the results of this script are submitted via PR by the prowjob
- humans manually review the PRs (usually me)
  - review comments are used to confirm expected changes, ideally with links to issue comments or PRs, e.g.
    - "This thing changed because someone ran scripts when that PR merged"
    - "This was me doing what I described in `link_to_issue_comment`"
  - review comments and followup issues are used to ask questions, or clean up things that need to be cleaned up, e.g.
    - "Looks like a new GCP feature rolled out, we'll want to disable these"
    - "Hey `@foo` did you change something manually here?"
    - "This is way too much noise, let's make our audit script ignore this"
- the job runs 4 times a day 

Some problems with this:
- I just rattled the above off the top of my head. This is poorly documented, a.k.a. `audit/README.md` needs work
- It takes too long to dump information, currently [about 100m for the job to run][audit-testgrid]
- The review burden is high, it's 100% manual right now
- Our audit output format (status) is not easily reconciled with our (uh, lack of) input format (spec) (ref: https://github.com/kubernetes/k8s.io/issues/516#issuecomment-766157452)
- All of the above means the feedback cycle is... too long
- We're using a completely home-rolled thing that we need to maintain ourselves, but like... two people have ever touched it
- We're not getting exhaustive dumps, so for all I know there are mysterious things lurking out there

----
TODO: flesh these out into issues? or just track a list here

Our audit results are not easily reconciled:
- What changes are live that aren't in source?
  - ... reconciling this now requires lots of focused human review of bash, and even then I'm not sure I'd trust it
  - smaller updates from the audit script will help with this
- Can we reduce the toil involved in updating source with previously untracked changes?
  - at the moment a human needs to know which script(s) to update, and how
  - if we used some other tooling (e.g. terraform, crossplane) would it make sense to try dumping in that format?
  - could we recognize common cases?
- What changes are in source that aren't live?
  - at the moment a human needs to figure this out
- Can we reduce the toil in making them live?
  - at the moment a human needs to know which script(s) to manually run
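
The two questions above ("live but not in source" and "in source but not live") can be framed as a set difference over resource identifiers. A minimal sketch, assuming both sides can be reduced to flat lists of resource names; the `reconcile` helper and the example names are hypothetical and not part of the current audit tooling:

```python
from typing import Iterable, List, NamedTuple


class Drift(NamedTuple):
    live_only: List[str]    # found in the audit dump, missing from source
    source_only: List[str]  # declared in source, not found in the audit dump


def reconcile(declared: Iterable[str], live: Iterable[str]) -> Drift:
    """Compare declared resource names against names found in an audit dump."""
    declared_set, live_set = set(declared), set(live)
    return Drift(
        live_only=sorted(live_set - declared_set),
        source_only=sorted(declared_set - live_set),
    )


if __name__ == "__main__":
    # Hypothetical resource names, for illustration only.
    drift = reconcile(
        declared=["buckets/example-a", "buckets/example-b"],
        live=["buckets/example-b", "buckets/example-c"],
    )
    print("live but not in source:", drift.live_only)
    print("in source but not live:", drift.source_only)
```

This only catches resources that exist on one side and not the other. Detecting changed settings on a resource that exists on both sides would need a per-resource comparison, which depends on having a spec format to compare against.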

We can't audit or dump everything due to IAM issues:
- instead of a bunch of pre-defined roles, can we aggregate into a custom role?
- do we even need a custom role? why not simply assign `roles/viewer` at the org level?
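
The "aggregate into a custom role" idea amounts to taking the union of the permissions in the predefined roles and writing that into one role definition. A rough sketch, assuming each role's permissions have already been fetched (e.g. from `gcloud iam roles describe`); the role names and permission lists below are placeholder input, `aggregate_role` is a hypothetical helper, and the output keys follow the fields of an IAM custom role definition:

```python
import json
from typing import Dict, List


def aggregate_role(title: str, roles: Dict[str, List[str]]) -> dict:
    """Merge the permissions of several roles into one custom role definition."""
    permissions = sorted({perm for perms in roles.values() for perm in perms})
    return {
        "title": title,
        "description": "Aggregated from: " + ", ".join(sorted(roles)),
        "stage": "GA",
        "includedPermissions": permissions,
    }


if __name__ == "__main__":
    # Placeholder role names and permission lists, for illustration only.
    predefined = {
        "roles/example.one": ["storage.buckets.get", "storage.buckets.list"],
        "roles/example.two": ["storage.buckets.list", "compute.instances.list"],
    }
    print(json.dumps(aggregate_role("Example Auditor", predefined), indent=2))
```

Not every permission is supported in custom roles, so the merged list would likely need filtering before it could be applied.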

Auditing dumps are too slow:
- Try using Cloud Asset Inventory instead, like one or both of:
  - `gcloud asset`
  - `gcloud resource-config bulk-export` - https://github.com/kubernetes/k8s.io/issues/1981
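
For the `gcloud asset` option, one search call per scope could stand in for many per-service dump commands. A hedged sketch that only builds and runs the command; `build_search_command` and `search_all_resources` are hypothetical helpers, the scope and asset type in the example are placeholders, and whether the audit service account has enough access to run this is untested here:

```python
import json
import subprocess
from typing import List, Optional


def build_search_command(scope: str, asset_types: Optional[List[str]] = None) -> List[str]:
    """Build a `gcloud asset search-all-resources` invocation for one scope."""
    cmd = ["gcloud", "asset", "search-all-resources", f"--scope={scope}", "--format=json"]
    if asset_types:
        cmd.append("--asset-types=" + ",".join(asset_types))
    return cmd


def search_all_resources(scope: str, asset_types: Optional[List[str]] = None) -> list:
    """Run the search and parse the JSON output (requires gcloud and credentials)."""
    result = subprocess.run(
        build_search_command(scope, asset_types),
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


if __name__ == "__main__":
    # Placeholder scope; prints the command instead of running it.
    print(" ".join(build_search_command(
        "projects/example-project", ["storage.googleapis.com/Bucket"]
    )))
```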

Bugs with our audit script right now:
- noise reduction: ensure we don't dump etags for all resources - https://github.com/kubernetes/k8s.io/issues/2062
- missing resources: BigQuery datasets - https://github.com/kubernetes/k8s.io/issues/2029
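
For the etag noise, one option is to strip noisy keys from the dumped JSON before it is committed. A minimal sketch, assuming the dumps are JSON and that `etag` can appear at any nesting depth; `strip_keys` is a hypothetical helper and the sample policy is made up:

```python
import json
from typing import Any, FrozenSet


def strip_keys(value: Any, noisy: FrozenSet[str] = frozenset({"etag"})) -> Any:
    """Return a copy of a parsed JSON value with noisy keys removed at every depth."""
    if isinstance(value, dict):
        return {k: strip_keys(v, noisy) for k, v in value.items() if k not in noisy}
    if isinstance(value, list):
        return [strip_keys(item, noisy) for item in value]
    return value


if __name__ == "__main__":
    # Made-up dump content, for illustration only.
    dumped = {
        "etag": "example-etag",
        "bindings": [{"role": "roles/viewer", "members": ["user:someone@example.com"]}],
        "nested": {"etag": "another-etag", "name": "example"},
    }
    print(json.dumps(strip_keys(dumped), indent=2, sort_keys=True))
```

Sorting keys on output also helps keep diffs small if the API returns fields in a different order between runs.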

/wg k8s-infra
/area infra/auditing
/area access
/priority important-longterm
/kind cleanup

cc @dims @thockin @cblecker @hh 

[audit-testgrid]: https://testgrid.k8s.io/wg-k8s-infra-k8sio#ci-k8sio-audit&graph-metrics=test-duration-minutes

[Umbrella Issue] Auditing improvements #1657