Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Umbrella Issue] Auditing improvements #1657

Open
spiffxp opened this issue Feb 16, 2021 · 11 comments
Open

[Umbrella Issue] Auditing improvements #1657

spiffxp opened this issue Feb 16, 2021 · 11 comments
Labels
area/access Define who has access to what via IAM bindings, role bindings, policy, etc. area/audit Audit of project resources, audit followup issues, code in audit/ kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. priority/backlog Higher priority than priority/awaiting-more-evidence. sig/k8s-infra Categorizes an issue or PR as relevant to SIG K8s Infra.
Milestone

Comments

@spiffxp
Copy link
Member

spiffxp commented Feb 16, 2021

EDIT: Opting to treat this as an umbrella issue instead of placeholder to noodle on ideas

An umbrella issue to capture ideas and suggestions to improve our audit process.

Currently:

  • we have a hand-written bash script that dumps information about services we care about
  • we have a prowjob that runs this script periodically, as a serviceaccount that we've tried to give just enough read access to via a custom IAM role
  • the results of this script are submitted via PR by the prowjob
  • humans manually review the PRs (usually me)
    • review comments are used to confirm expected changes, ideally with links to issue comments or PRs, e.g.
      • "This thing changed because someone ran scripts when that PR merged"
      • "This was me doing what I described in link_to_issue_comment"
    • review comments and followup issues are used to ask questions, or clean up things that need to be cleaned up, e.g.
      • "Looks like a new GCP feature rolled out, we'll want to disable these"
      • "Hey @foo did you change something manually here?"
      • "This is way too much noise, let's make our audit script ignore this"
  • the job runs 4 times a day

Some problems with this:

  • I just rattled the above off the top of my head. This is poorly documented, a.k.a audit/README.md needs work
  • It takes too long to dump information, currently about 100m for the job to run
  • The review burden is high, it's 100% manual right now
  • Our audit output format (status) is not easily reconciled with our (uh, lack of) input format (spec) (ref: Refactor infra/gcp/... #516 (comment))
  • All of the above means the feedback cycle is... too long
  • We're using a completely home-rolled thing that we need to maintain ourselves, but like... two people have ever touched it
  • We're not getting exhaustive dumps, so for all I know there are mysterious things lurking out there

TODO: flesh these out into issues? or just track a list here

Our audit results are not easily reconciled:

  • What changes are live that aren't in source?
    • ... reconciling this now requires lots of focused human review of bash, and even then I'm not sure I'd trust it
    • smaller updates from audit script will help this
  • Can we reduce the toil involved in updating source with previously untracked changes?
    • at the moment a human needs to know which script(s) to update, and how
    • if we used some other tooling (e.g. terraform, crossplane) would it make sense to try dumping in that format?
    • could we recognize common cases?
  • What changes are in source that aren't live?
    • at the moment a human needs to figure this out
  • Can we reduce the toil in making them live?
    • at the moment a human needs to know which script(s) to manually run

We can't audit or dump everything due to IAM issues:

  • instead of a bunch of pre-defined roles, can we aggregate into a custom role?
  • do we even need a custom role? why not simply assign roles/viewer at the org level?

Auditing dumps are too slow:

Bugs with our audit script right now:

/wg k8s-infra
/area infra/auditing
/area access
/priority important-longterm
/kind cleanup

cc @dims @thockin @cblecker @hh

@k8s-ci-robot k8s-ci-robot added wg/k8s-infra area/audit Audit of project resources, audit followup issues, code in audit/ area/access Define who has access to what via IAM bindings, role bindings, policy, etc. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. labels Feb 16, 2021
@hh
Copy link
Member

hh commented Feb 24, 2021

I'm getting close to having our existing job run hourly.
I'm not sure where we would want them to run as post-submits. (any change to k8s.io?)
I assume your wanting that to be able to type the changes via PRs into audit updates.

@spiffxp
Copy link
Member Author

spiffxp commented Mar 4, 2021

Glanced at Cloud Asset Inventory to look into something else

Two things that disqualify it gcloud asset search-all-resources as a general-purpose solution for us, I think:

But, gcloud asset search-all-iam-policies dumps IAM policy bindings really quickly, and for most of the resources we commonly we use:

# a prow cluster
$ gcloud asset search-all-iam-policies --scope=projects/k8s-infra-prow-build-trusted --format="value(resource)" | cut -d/ -f4-
projects/k8s-infra-prow-build-trusted/serviceAccounts/prow-deployer@k8s-infra-prow-build-trusted.iam.gserviceaccount.com
projects/k8s-infra-prow-build-trusted/serviceAccounts/gcb-builder@k8s-infra-prow-build-trusted.iam.gserviceaccount.com
projects/k8s-infra-prow-build-trusted/datasets/usage_metering_prow_build_trusted
projects/k8s-infra-prow-build-trusted/serviceAccounts/prow-build-trusted@k8s-infra-prow-build-trusted.iam.gserviceaccount.com
projects/k8s-infra-prow-build-trusted

# an e2e project
$ gcloud asset search-all-iam-policies --scope=projects/k8s-infra-e2e-boskos-001 --format="value(resource)" | cut -d/ -f4-
kubernetes-staging-485128143e-asia
kubernetes-staging-485128143e-eu
kubernetes-staging-485128143e
projects/k8s-infra-e2e-boskos-00

# a staging project
$ gcloud asset search-all-iam-policies --scope=projects/k8s-staging-e2e-test-images --format="value(resource)" | cut -d/ -f4-
k8s-staging-e2e-test-images-gcb
k8s-staging-e2e-test-images
artifacts.k8s-staging-e2e-test-images.appspot.com
projects/k8s-staging-e2e-test-images

So if nothing else, and excluding secrets, I could see this being useful to quickly audit/reconcile IAM polices across the org.

A next step would be to look at what sort of info is availabe from gcloud asset export

@spiffxp
Copy link
Member Author

spiffxp commented May 18, 2021

#1981 covers exploring cloud alpha resource-config bulk-export

@spiffxp
Copy link
Member Author

spiffxp commented May 19, 2021

kubernetes/test-infra#22239 should update the audit job to only bump open PRs if there are new changes in the audit directory, which will hopefully cut down on open PRs with long trails of force-pushes that don't actually change the files that have been reviewed.

@spiffxp spiffxp changed the title Auditing improvements [Umbrella Issue] Auditing improvements May 20, 2021
@spiffxp
Copy link
Member Author

spiffxp commented Aug 17, 2021

It currently takes about ballpark 80 minutes to perform a full audit: https://testgrid.k8s.io/wg-k8s-infra-k8sio#ci-k8sio-audit&width=20&graph-metrics=test-duration-minutes

I think we can do better.

gcloud asset list is a thing now, if we want to try munging the yaml / json that dumps into the same format we're currently using, or make our yaml / json match its format

@spiffxp
Copy link
Member Author

spiffxp commented Aug 17, 2021

/milestone v1.23
/priority backlog
That said I think speeding this up may be less important than moving things over.

@k8s-ci-robot k8s-ci-robot added the priority/backlog Higher priority than priority/awaiting-more-evidence. label Aug 17, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.23 milestone Aug 17, 2021
@spiffxp
Copy link
Member Author

spiffxp commented Aug 17, 2021

/remove-priority important-longterm

@k8s-ci-robot k8s-ci-robot removed the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Aug 17, 2021
@k8s-ci-robot k8s-ci-robot added sig/k8s-infra Categorizes an issue or PR as relevant to SIG K8s Infra. and removed wg/k8s-infra labels Sep 29, 2021
@ameukam
Copy link
Member

ameukam commented Dec 14, 2021

/milestone v1.24

@k8s-ci-robot k8s-ci-robot modified the milestones: v1.23, v1.24 Dec 14, 2021
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 14, 2022
@ameukam
Copy link
Member

ameukam commented Mar 15, 2022

/remove-lifecycle stale
/lifecycle frozen
/milestone clear

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 15, 2022
@k8s-ci-robot k8s-ci-robot removed this from the v1.24 milestone Mar 15, 2022
@ameukam
Copy link
Member

ameukam commented Mar 3, 2024

/milestone v1.32

@k8s-ci-robot k8s-ci-robot added this to the v1.32 milestone Mar 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/access Define who has access to what via IAM bindings, role bindings, policy, etc. area/audit Audit of project resources, audit followup issues, code in audit/ kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. priority/backlog Higher priority than priority/awaiting-more-evidence. sig/k8s-infra Categorizes an issue or PR as relevant to SIG K8s Infra.
Projects
Status: No status
Development

No branches or pull requests

5 participants