Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

trivy-operator: Pull Images via CRI #101

Open
isugimpy opened this issue Jun 6, 2022 · 16 comments
Open

trivy-operator: Pull Images via CRI #101

isugimpy opened this issue Jun 6, 2022 · 16 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. priority/backlog Higher priority than priority/awaiting-more-evidence. target/kubernetes Issues relating to kubernetes cluster scanning

Comments

@isugimpy
Copy link

isugimpy commented Jun 6, 2022

Reached out via Twitter and was redirected here.

From my understanding, the operator is pulling container images based on the image path it scrapes from live objects on the cluster. However, it's pulling them directly, bypassing configuration on the host. Personally, I'd like the option to utilize the CRI on the actual host so that we can be sure the image that gets pulled actually matches what's running for real.

To cite a specific example, the clusters I manage are running CRI-O, and utilize the registry mirror functionality to do rewrites of the image pulls live. This allows us to attempt to pull from a cluster-local image cache first, with fallback to our canonical registry if that fails. Additionally, we use the rewrites to ensure that a pull from something like Docker Hub is forcefully redirected to pull through our canonical registry so we can cache and scan images. By having Trivy respect the behavior of the node it runs on, we can better ensure that the images it scans are accurately pulled.

@isugimpy isugimpy added the kind/feature Categorizes issue or PR as related to a new feature. label Jun 6, 2022
@chen-keinan chen-keinan added the priority/backlog Higher priority than priority/awaiting-more-evidence. label Jun 13, 2022
@itaysk
Copy link

itaysk commented Jun 13, 2022

Thanks for opening an issue @isugimpy
Trivy tries to scan images from local image store if available, and if not it will pull them from registry. The interaction with local image store is similar to docker export. Today Trivy can do this local scanning with Docker and (recently) containerd. We looked into interacting with CRI but it seems that CRI doesn't have an export equivalent method [1] [2]. Instead, we could look into supporting CRI-O directly if that makes sense, but then you will have to mount the socket into the Trivy scanner pod for it to work. Another approach as suggested by @erikgb in #118 which is to still pull images, but at least be more accurate by taking the image digest from the running Pod status.

@isugimpy
Copy link
Author

If it ensures accuracy, having a single pod that mounts in the socket from its host doesn't seem unreasonable at all. Would love to see that. Not being able to ensure that we can pull and scan the images from the same host that they'd be pulled from to be actually run is, in my opinion, a blocker to being able to use the operator in a live environment.

@erikgb
Copy link
Contributor

erikgb commented Jun 13, 2022

If it ensures accuracy, having a single pod that mounts in the socket from its host doesn't seem unreasonable at all. Would love to see that. Not being able to ensure that we can pull and scan the images from the same host that they'd be pulled from to be actually run is, in my opinion, a blocker to being able to use the operator in a live environment.

I agree, but image name + image digest is the canonical refererence to an image. So if using digest, you don't need to mount the host socket. At least we should have an option to NOT mount anything from the host.

@isugimpy
Copy link
Author

You're absolutely correct that the combo is canonical, but if there's no other way to resolve that canonical reference due to a rewrite, it'd leave things in undefined behavior territory. Your point makes a ton of sense in the 90+% case. I just worry about it for the smaller percentage of users that take advantage of this feature (containerd also offers the rewrite capability, so this isn't limited specifically to CRI-O).

@itaysk itaysk removed the priority/backlog Higher priority than priority/awaiting-more-evidence. label Jun 15, 2022
@itaysk
Copy link

itaysk commented Jun 15, 2022

I just tried to run nginx from ECR just to see how how the Pod Status looks like, and it has the following:

containerID: containerd://366a98117285434b8869ff692a020a136b01b7f6444aaa6e83a158ded431584b
image: public.ecr.aws/nginx/nginx:stable
imageID: public.ecr.aws/nginx/nginx@sha256:76503f7e2c5cf0910640e70264588f264016acdcdb91c99bc7b007efa971c708

Seems like if a rewrite was happening, the image/imageID registry reference would reflect the rewritten one. Is it not the case? (if you could confim @isugimpy)

@itaysk
Copy link

itaysk commented Jun 15, 2022

Also I wanted to point out a couple of additional options to address the original question:

  1. There's an option to specify the mirror as configuration (somehow lost in the docs):

Mirror for the registry , e.g. trivy.registry.mirror.index.docker.io: mirror.io would use mirror.io to get images originated from index.docker.io

  1. There's an experimental (also undocumented) feature to scan workloads by scanning the running container filesystem (vs the container image) using the trivy.command = filesystem configuration. It's currently broken due to FS scanning doesn't work with Trivy version >= 0.23.0 #49. But I mention here to here what people think about this approach.

@isugimpy
Copy link
Author

Seems like if a rewrite was happening, the image/imageID registry reference would reflect the rewritten one. Is it not the case? (if you could confim @isugimpy)

From one of my clusters:

- containerID: cri-o://defd9a0b99597b814d7f3dc74c125cbdb918d585da43a60cbfbcfafa1658365a
  image: quay.io/superq/chrony-exporter:v0.1.0
  imageID: quay.io/superq/chrony-exporter@sha256:30b6f76b72fe11e769ee3e18d5468d8fbb9a9467640689258d7100eee81db18b

I can confirm with 100% certainty that despite this pointing at quay.io, it was pulled from our internal passthrough cache via mirror rewrites at runtime. CRI-O (and containerd as I recall) doesn't present the result of the rewrite to the Kubelet.

  1. There's an option to specify the mirror as configuration (somehow lost in the docs):

This definitely helps the cause. It's not ideal to have to duplicate the mirror config, but it definitely would work. The one bummer is that it doesn't have fallback to a secondary mirror available. That's hardly a dealbreaker though. If the cluster-local cache is unavailable for some reason, I've got bigger problems than if Trivy is scanning.

  1. There's an experimental (also undocumented) feature to scan workloads by scanning the running container filesystem (vs the container image) using the trivy.command = filesystem configuration. It's currently broken due to FS scanning doesn't work with Trivy version >= 0.23.0 #49. But I mention here to here what people think about this approach.

Filesystem could work, it's an intriguing approach! Wouldn't that require a daemonset though, since not every image is present on every host?

@itaysk
Copy link

itaysk commented Jun 18, 2022

I can confirm with 100% certainty that despite this pointing at quay.io, it was pulled from our internal passthrough cache via mirror rewrites at runtime. CRI-O (and containerd as I recall) doesn't present the result of the rewrite to the Kubelet.

Thanks for checking. For our knowledge, to configure CRI-O for mirroring, are you using this configuration? https://github.com/containers/image/blob/main/registries.conf#L68

@itaysk
Copy link

itaysk commented Jun 18, 2022

Filesystem could work, it's an intriguing approach! Wouldn't that require a daemonset though, since not every image is present on every host?

This feature is meant to be used with vulnerabilityReports.scanJobsInSameNamespace configuration, which makes the scan job run in the namespace of the scanned workload instead of in the trivy-operator dedicated namespace. Running in the target ns is presumed to succeed in starting a new workload (the scan job) with the scanned image (permissions wise).

The feature originally suggested in this issue - scanning the image from local store by mounting the socket into Trivy container - has stricter requirement - of scheduling the scan job on the same node as the scanned workload.

@chen-keinan
Copy link
Contributor

chen-keinan commented Jun 19, 2022

@isugimpy related PR #150 that solve the fs scanning issue

@chen-keinan chen-keinan added target/kubernetes Issues relating to kubernetes cluster scanning priority/backlog Higher priority than priority/awaiting-more-evidence. labels Jun 28, 2022
@chen-keinan
Copy link
Contributor

@isugimpy will trivy-operator with mode:fileSystem, scanning will solve this issue ?

@erikgb
Copy link
Contributor

erikgb commented Oct 9, 2022

@isugimpy will trivy-operator with mode:fileSystem, scanning will solve this issue ?

No, at least not for us. As Openshift SCC does not allow an unprivileged user to run containers as the root user, ref.:

# For filesystem scanning, Trivy needs to run as the root user
# runAsUser: 0

@JAORMX
Copy link
Contributor

JAORMX commented Oct 10, 2022

@erikgb trivy-operator should anyways not run with the restricted SCC, it should use at least anyuid. If it's running with a low privileged SCC it should be fixed in the OLM packaging.

@erikgb
Copy link
Contributor

erikgb commented Oct 10, 2022

@erikgb trivy-operator should anyways not run with the restricted SCC, it should use at least anyuid. If it's running with a low privileged SCC it should be fixed in the OLM packaging.

But it is not the operator that is the problem, but the jobs it schedules.

@JAORMX
Copy link
Contributor

JAORMX commented Oct 10, 2022

@erikgb I see... well, adding the needed annotation to schedule them with a higher privilege SCC shouldn't be too problematic. But, either way, permissions to use that SCC would need to be added to the packaging though (it's an RBAC rule).

@avishefi
Copy link

Related discussion: aquasecurity/trivy#4957

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes issue or PR as related to a new feature. priority/backlog Higher priority than priority/awaiting-more-evidence. target/kubernetes Issues relating to kubernetes cluster scanning
Projects
None yet
Development

No branches or pull requests

6 participants