[Umbrella Issue] Auditing improvements #1657

Open
@spiffxp

Description

EDIT: Opting to treat this as an umbrella issue instead of a placeholder for noodling on ideas

An umbrella issue to capture ideas and suggestions to improve our audit process.

Currently:

  • we have a hand-written bash script that dumps information about services we care about
  • we have a prowjob that runs this script periodically, as a service account that we've tried to grant just enough read access via a custom IAM role
  • the results of this script are submitted via PR by the prowjob
  • humans manually review the PRs (usually me)
    • review comments are used to confirm expected changes, ideally with links to issue comments or PRs, e.g.
      • "This thing changed because someone ran scripts when that PR merged"
      • "This was me doing what I described in link_to_issue_comment"
    • review comments and followup issues are used to ask questions, or clean up things that need to be cleaned up, e.g.
      • "Looks like a new GCP feature rolled out, we'll want to disable these"
      • "Hey @foo did you change something manually here?"
      • "This is way too much noise, let's make our audit script ignore this"
  • the job runs 4 times a day
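
The "too much noise, let's make our audit script ignore this" case above could look something like the following. This is only a sketch: it assumes the dumps are JSON, and the `NOISY_KEYS` set and the sample policies are hypothetical, not taken from the real audit script.

```python
import json

# Hypothetical list of fields to ignore; the real one would grow out of review feedback.
NOISY_KEYS = {"etag", "updateTime"}


def strip_noise(value, noisy_keys=NOISY_KEYS):
    """Recursively drop noisy keys from parsed JSON so they never reach the diff."""
    if isinstance(value, dict):
        return {k: strip_noise(v, noisy_keys)
                for k, v in value.items() if k not in noisy_keys}
    if isinstance(value, list):
        return [strip_noise(v, noisy_keys) for v in value]
    return value


def normalize(raw_json):
    """Return a stable text form: noise removed, keys sorted, fixed indentation."""
    return json.dumps(strip_noise(json.loads(raw_json)), indent=2, sort_keys=True)


# Two made-up dumps that differ only in etag and key order.
before = '{"etag": "BwXa", "bindings": [{"role": "roles/viewer", "members": ["user:a@example.com"]}]}'
after = '{"bindings": [{"members": ["user:a@example.com"], "role": "roles/viewer"}], "etag": "BwXb"}'

print(normalize(before) == normalize(after))  # True
```

Writing the normalized form to the audit directory would mean a PR only shows up when something other than the ignored fields changed.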

Some problems with this:

  • I just rattled the above off the top of my head. This is poorly documented, a.k.a. audit/README.md needs work
  • It takes too long to dump information: the job currently takes about 100 minutes to run
  • The review burden is high, it's 100% manual right now
  • Our audit output format (status) is not easily reconciled with our (uh, lack of) input format (spec) (ref: Refactor infra/gcp/... #516 (comment))
  • All of the above means the feedback cycle is... too long
  • We're using a completely home-rolled thing that we need to maintain ourselves, but, like, only two people have ever touched it
  • We're not getting exhaustive dumps, so for all I know there are mysterious things lurking out there

TODO: flesh these out into issues? or just track a list here

Our audit results are not easily reconciled:

  • What changes are live that aren't in source?
    • ... reconciling this now requires lots of focused human review of bash, and even then I'm not sure I'd trust it
    • smaller updates from the audit script will help with this
  • Can we reduce the toil involved in updating source with previously untracked changes?
    • at the moment a human needs to know which script(s) to update, and how
    • if we used some other tooling (e.g. terraform, crossplane) would it make sense to try dumping in that format?
    • could we recognize common cases?
  • What changes are in source that aren't live?
    • at the moment a human needs to figure this out
  • Can we reduce the toil in making them live?
    • at the moment a human needs to know which script(s) to manually run

We can't audit or dump everything due to IAM issues:

  • instead of a bunch of pre-defined roles, can we aggregate into a custom role?
  • do we even need a custom role? why not simply assign roles/viewer at the org level?
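
A sketch of the "aggregate into a custom role" idea, for discussion only. It assumes the permissions of each predefined role have already been fetched somehow (for example from the `includedPermissions` field of a role description); the role contents below are small illustrative subsets, and the title and description are made up. The output dict uses the field names of a custom role definition (`title`, `description`, `stage`, `includedPermissions`).

```python
import json


def aggregate_roles(role_permissions, title, description):
    """Merge the permissions of several roles into one custom role definition."""
    # Union across all roles, deduplicated and sorted for stable diffs.
    permissions = sorted(set().union(*role_permissions.values()))
    return {
        "title": title,
        "description": description,
        "stage": "GA",
        "includedPermissions": permissions,
    }


# Illustrative subsets only, not the full permission lists of these roles.
predefined = {
    "roles/compute.viewer": [
        "compute.instances.get",
        "compute.instances.list",
        "resourcemanager.projects.get",
    ],
    "roles/storage.objectViewer": [
        "storage.objects.get",
        "storage.objects.list",
        "resourcemanager.projects.get",
    ],
}

# Hypothetical title and description.
role = aggregate_roles(predefined, "Audit Viewer", "Read-only access for the audit job")
print(json.dumps(role, indent=2))
```

Whether this beats simply assigning roles/viewer at the org level is exactly the open question above; the sketch only shows that the aggregation itself is cheap.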

Auditing dumps are too slow:

Bugs with our audit script right now:

/wg k8s-infra
/area infra/auditing
/area access
/priority important-longterm
/kind cleanup

cc @dims @thockin @cblecker @hh

Labels

  • area/access: Define who has access to what via IAM bindings, role bindings, policy, etc.
  • area/audit: Audit of project resources, audit followup issues, code in audit/
  • kind/cleanup: Categorizes issue or PR as related to cleaning up code, process, or technical debt.
  • lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.
  • priority/backlog: Higher priority than priority/awaiting-more-evidence.
  • sig/k8s-infra: Categorizes an issue or PR as relevant to SIG K8s Infra.