EDIT: Opting to treat this as an umbrella issue instead of a placeholder to noodle on ideas
An umbrella issue to capture ideas and suggestions to improve our audit process.
Currently:
- we have a hand-written bash script that dumps information about services we care about
- we have a prowjob that runs this script periodically, as a serviceaccount that we've tried to give just enough read access to via a custom IAM role
- the results of this script are submitted via PR by the prowjob
- humans manually review the PRs (usually me)
- review comments are used to confirm expected changes, ideally with links to issue comments or PRs, e.g.
  - "This thing changed because someone ran scripts when that PR merged"
  - "This was me doing what I described in link_to_issue_comment"
- review comments and followup issues are used to ask questions, or clean up things that need to be cleaned up, e.g.
  - "Looks like a new GCP feature rolled out, we'll want to disable these"
  - "Hey @foo did you change something manually here?"
  - "This is way too much noise, let's make our audit script ignore this"
- the job runs 4 times a day
Some problems with this:
- I just rattled the above off the top of my head. This is poorly documented, a.k.a. audit/README.md needs work
- It takes too long to dump information: the job currently takes about 100 minutes to run
- The review burden is high, it's 100% manual right now
- Our audit output format (status) is not easily reconciled with our (uh, lack of) input format (spec) (ref: Refactor infra/gcp/... #516 (comment))
- All of the above means the feedback cycle is... too long
- We're using a completely home-rolled thing that we need to maintain ourselves, but like... two people have ever touched it
- We're not getting exhaustive dumps, so for all I know there are mysterious things lurking out there
TODO: flesh these out into issues? or just track a list here
Our audit results are not easily reconciled:
- What changes are live that aren't in source?
- ... reconciling this now requires lots of focused human review of bash, and even then I'm not sure I'd trust it
- smaller updates from audit script will help this
- Can we reduce the toil involved in updating source with previously untracked changes?
- at the moment a human needs to know which script(s) to update, and how
- if we used some other tooling (e.g. terraform, crossplane) would it make sense to try dumping in that format?
- could we recognize common cases?
- What changes are in source that aren't live?
- at the moment a human needs to figure this out
- Can we reduce the toil in making them live?
- at the moment a human needs to know which script(s) to manually run
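
To make the two questions above concrete, here is a rough sketch of what a spec-vs-status comparison could look like. It assumes both sides can be reduced to a mapping of role name to set of members, which is not how our bash scripts or audit dumps are structured today; the `reconcile` helper and the sample data are hypothetical.

```python
def reconcile(spec, status):
    """Compare desired bindings (spec) with observed bindings (status).

    Both arguments map a role name to a set of members.
    Returns (live_not_in_source, source_not_live).
    """
    live_not_in_source = {}
    source_not_live = {}
    for role in sorted(set(spec) | set(status)):
        wanted = spec.get(role, set())
        found = status.get(role, set())
        if found - wanted:
            # Present in the live dump, but nobody wrote it down in source.
            live_not_in_source[role] = sorted(found - wanted)
        if wanted - found:
            # Declared in source, but never applied (or since removed).
            source_not_live[role] = sorted(wanted - found)
    return live_not_in_source, source_not_live


# Hypothetical data, for illustration only.
spec = {"roles/viewer": {"group:auditors@example.com"}}
status = {
    "roles/viewer": {"group:auditors@example.com", "user:someone@example.com"},
    "roles/editor": {"user:someone@example.com"},
}

extra, missing = reconcile(spec, status)
print("live but not in source:", extra)
print("in source but not live:", missing)
```

The hard part is not the diff itself but getting a machine-readable spec to diff against in the first place.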
We can't audit or dump everything due to IAM issues:
- instead of a bunch of pre-defined roles, can we aggregate into a custom role?
- do we even need a custom role? why not simply assign `roles/viewer` at the org level?
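
As a sketch of the "aggregate into a custom role" idea: if we had the JSON output of `gcloud iam roles describe` for each pre-defined role (which lists permissions under `includedPermissions`), merging them is a set union. The role names, the truncated permission lists, and the `merge_permissions` helper below are made up for illustration, and some permissions may not be allowed in custom roles at all, so the result would need checking before use.

```python
def merge_permissions(role_definitions):
    """Union the includedPermissions of several role definitions."""
    permissions = set()
    for definition in role_definitions:
        permissions.update(definition.get("includedPermissions", []))
    return sorted(permissions)


# Hypothetical, heavily truncated role definitions.
role_definitions = [
    {
        "name": "roles/example.storageReader",
        "includedPermissions": ["storage.buckets.get", "storage.buckets.list"],
    },
    {
        "name": "roles/example.datasetReader",
        "includedPermissions": ["bigquery.datasets.get", "storage.buckets.list"],
    },
]

# Shape of a custom role definition; the title is an assumption.
custom_role = {
    "title": "Audit Viewer (example)",
    "stage": "GA",
    "includedPermissions": merge_permissions(role_definitions),
}
print(custom_role["includedPermissions"])
```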
Auditing dumps are too slow:
- Try using Cloud Asset Inventory instead, like one or both of:
  - `gcloud asset`
  - `gcloud resource-config bulk-export`
    - GCP gcloud bulk-export as an audit trail #1981
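
A rough sketch of what the Cloud Asset Inventory route might look like, assuming we shell out to `gcloud asset search-all-resources` with JSON output and summarize the result by `assetType`. The helper names, the scope value, and the sample output are hypothetical, and nothing here has been run against our org.

```python
import json
from collections import Counter


def build_search_command(scope):
    """Build the gcloud invocation; pass the list to subprocess.run to execute it."""
    return [
        "gcloud", "asset", "search-all-resources",
        f"--scope={scope}",
        "--format=json",
    ]


def count_by_asset_type(output):
    """Count resources per assetType in the command's JSON output."""
    resources = json.loads(output)
    return Counter(r.get("assetType", "unknown") for r in resources)


# Hypothetical, truncated sample of what the command might return.
sample_output = json.dumps([
    {"name": "//storage.googleapis.com/example-a", "assetType": "storage.googleapis.com/Bucket"},
    {"name": "//storage.googleapis.com/example-b", "assetType": "storage.googleapis.com/Bucket"},
    {"name": "//bigquery.googleapis.com/projects/example/datasets/d", "assetType": "bigquery.googleapis.com/Dataset"},
])

print(build_search_command("organizations/ORG_ID"))
print(count_by_asset_type(sample_output))
```

A per-type count like this would also make it easier to spot resource types the current script never dumps.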
Bugs with our audit script right now:
- noise reduction: ensure we don't dump etags for all resources - audit: ensure no etags for dumped resources #2062
- missing resources: BigQuery datasets - BigQuery datasets are not reported in the daily audit report #2029
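
For the etag noise, a minimal sketch of the idea, assuming the dumped resources are available as parsed JSON. Our audit script is bash, so this is not how it works today; the `strip_etags` helper and the sample policy are illustrative only.

```python
def strip_etags(value):
    """Return a copy of a parsed JSON value with every "etag" key removed."""
    if isinstance(value, dict):
        return {k: strip_etags(v) for k, v in value.items() if k != "etag"}
    if isinstance(value, list):
        return [strip_etags(v) for v in value]
    return value


# Hypothetical IAM policy dump, for illustration only.
policy = {
    "bindings": [
        {"role": "roles/viewer", "members": ["group:auditors@example.com"]},
    ],
    "etag": "BwXexample=",
    "version": 1,
}

print(strip_etags(policy))
```

Because the function returns a new structure rather than mutating its input, the raw dump stays available if we ever need the etag for debugging.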
/wg k8s-infra
/area infra/auditing
/area access
/priority important-longterm
/kind cleanup