docs(design): Scan Container Images with Trivy Filesystem Scanner #830

Merged: 2 commits, Jan 11, 2022
18 changes: 9 additions & 9 deletions docs/design/README.md
@@ -8,17 +8,17 @@ created for different purposes. Mainly to brainstorm how Starboard works.

## Overview

| File | Description |
|-------|-------------|
| [managing_access_to_security_reports.md] | Managing Access to Security Reports |
| [design_scan_by_image_digest.png] | Design of vulnerability scanning by image digest (ContainerStatus vs PodSpec). |
| [design_starboard_at_scale.png] | Design of Starboard Operator at scale with more efficient worker queue. |
| [design_vulnerability_scanning_2.0.png] | Design of most efficient vulnerability scanning that you can imagine. |
| [explain_starboard_rescan_jitter.png] | Explain a preferred way to rescan (evenly distributed vs bursty events). |
| File | Description |
|------------------------------------------|-----------------------------------------------------------------------------------------|
| [design_trivy_file_system_scanner.md] | Scan Container Images with Trivy Filesystem Scanner |
| [design_scan_by_image_digest.png] | Design of vulnerability scanning by image digest (ContainerStatus vs PodSpec). |
| [design_starboard_at_scale.png] | Design of Starboard Operator at scale with more efficient worker queue. |
| [design_vulnerability_scanning_2.0.png] | Design of most efficient vulnerability scanning that you can imagine. |
| [explain_starboard_rescan_jitter.png] | Explain a preferred way to rescan (evenly distributed vs bursty events). |
| [explain_starboard_cli_init.png]        | Explain which K8s API objects are created when the `starboard init` command is executed. |
| [design_namespace_security_report.pdf] | Design of a security report generated by Starboard CLI for a given namespace. |
| [design_namespace_security_report.pdf] | Design of a security report generated by Starboard CLI for a given namespace. |

[managing_access_to_security_reports.md]: ./managing_access_to_security_reports.md
[design_trivy_file_system_scanner.md]: ./design_trivy_file_system_scanner.md
[design_scan_by_image_digest.png]: ./design_scan_by_image_digest.png
[design_starboard_at_scale.png]: ./design_starboard_at_scale.png
[design_vulnerability_scanning_2.0.png]: ./design_vulnerability_scanning_2.0.png
197 changes: 197 additions & 0 deletions docs/design/design_trivy_file_system_scanner.md
@@ -0,0 +1,197 @@
# Scan Container Images with Trivy Filesystem Scanner

Authors: [Devendra Turkar], [Daniel Pacak]

## Overview

Starboard currently uses Trivy in [Standalone] or [ClientServer] mode to scan container images and generate
VulnerabilityReports by pulling the images from remote registries. Starboard scans a specified K8s workload by running
the Trivy executable as a K8s Job. This approach implies that Trivy does not have access to images cached by the
container runtime on cluster nodes. Therefore, to scan images from private registries, Starboard reads ImagePullSecrets
specified on workloads or on service accounts used by the workloads, and passes them to the Trivy executable as the
`TRIVY_USERNAME` and `TRIVY_PASSWORD` environment variables.
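
As a rough illustration of the current approach, the scan container of such a Job might receive the credentials as
shown below. This is only a sketch: the Secret name `scan-registry-credentials` and its keys are hypothetical, the image
reference is the one used in the example later in this document, and the Job actually generated by Starboard may be
shaped differently.

```yaml
# Hypothetical fragment of a scan Job's Pod template (current, registry-pulling approach).
containers:
  - name: nginx
    image: aquasec/trivy:0.19.2
    env:
      # Credentials derived from the workload's ImagePullSecret.
      - name: TRIVY_USERNAME
        valueFrom:
          secretKeyRef:
            name: scan-registry-credentials   # hypothetical Secret name
            key: username                     # hypothetical key
      - name: TRIVY_PASSWORD
        valueFrom:
          secretKeyRef:
            name: scan-registry-credentials   # hypothetical Secret name
            key: password                     # hypothetical key
    command:
      - trivy
      - image
      - --format
      - json
      # Trivy pulls this image from the remote registry using the credentials above.
      - example.registry.com/nginx:1.16
```

The point of the sketch is that Trivy itself authenticates to the registry, so any credential source other than an
ImagePullSecret is invisible to the scan Job.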

ImagePullSecrets are not the only way to provide registry credentials. The following alternatives are not
currently supported by Starboard:
1. Pre-pulled images
2. [Configuring nodes to authenticate to a private registry]
3. Vendor-specific or local extensions, for example the methods described in [AWS ECR Private registry authentication].

Even though we could resolve some of the above-mentioned limitations with hostPath volume mounts to the container
runtime socket, this would have its own disadvantages that we are trying to avoid. For example, scan Jobs would need
more permissions to be scheduled, and Starboard would need additional information about the cluster's infrastructure,
such as the location of the container runtime socket.
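
For comparison, the hostPath alternative that this proposal avoids could look roughly like the fragment below. It is
a sketch under assumptions: the socket path `/var/run/docker.sock` is only valid for nodes that run Docker at that
location, and the container and volume names are hypothetical.

```yaml
# Sketch of the hostPath alternative (NOT what this proposal recommends).
volumes:
  - name: runtime-socket                # hypothetical volume name
    hostPath:
      # Assumed socket location; it differs between container runtimes and node setups.
      path: /var/run/docker.sock
      type: Socket
containers:
  - name: scanner                       # hypothetical container name
    image: aquasec/trivy:0.19.2
    volumeMounts:
      # Mounting the runtime socket gives the scan Pod broad access to the node.
      - name: runtime-socket
        mountPath: /var/run/docker.sock
```

Both disadvantages mentioned above are visible here: the Pod needs permission to use hostPath volumes, and the socket
path has to be known in advance for every node.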

## Solution

### TL;DR

Use Trivy filesystem scanning to scan container images. The main idea discussed in this proposal is to schedule a
scan Job on the same cluster node where the scanned workload is running. This allows Trivy to scan the filesystem of
the container image that is already cached on that node, without pulling the image from a remote registry. What's more,
Trivy can scan container images from private registries without being given registry credentials (as an ImagePullSecret
or in any other proprietary way).

### Deep Dive

To scan a container image of a given K8s workload, Starboard will create a corresponding container in a scan Job and
override its entrypoint to invoke the Trivy filesystem scanner.

This approach requires the Trivy executable to be downloaded and made available to the entrypoint. We'll do that by
adding an init container to the scan Job. This init container will use the Trivy container image to copy the Trivy
executable out to the emptyDir volume, which will be shared with the other containers.

Review comments on this paragraph:

> This could maybe be done with a CSI driver as well.

> (Contributor) I'm not sure that I understood this comment. Could you elaborate on how we can use a CSI driver in this
> case?

> Just thinking: instead of copying the file from an init container to an emptyDir, a CSI driver in ephemeral mode
> could mount the file into the container.

> (@kfox1111, Jan 12, 2022) May not be worth the effort, but could be like:

    kind: Pod
    apiVersion: v1
    metadata:
      name: my-csi-app-inline-volume
    spec:
      containers:
        - name: my-frontend
          image: busybox
          command: [ "sleep", "100000" ]
          volumeMounts:
          - mountPath: "/trivy"
            name: my-csi-volume
      volumes:
      - name: my-csi-volume
        csi:
          driver: trivy

Another init container is required to download the Trivy vulnerability database and save it to the mounted shared
volume.

Review comment on this paragraph:

> Same.

Finally, the scan container will use the shared volume with the Trivy executable and the Trivy database to perform the
actual filesystem scan. (See the provided [Example](#example) for a better idea of all containers defined by a scan Job
and how they share data via the emptyDir volume.)

> Note that the second init container is required in [Standalone] mode, which is the only mode supported by the Trivy
> filesystem scanner at the time of writing this proposal.

We further restrict scan Jobs to run on the same node where the scanned Pod is running, and to never pull images from
remote registries, by setting the `imagePullPolicy` to `Never`. To determine the node for a scan Job, Starboard will
list active Pods controlled by the scanned workload. If the list is not empty, it will take the node name from the
first Pod; otherwise it will ignore the workload.
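
To make the node lookup more concrete, the abridged Pod below shows where the node name would be read from. It is
a hypothetical object: the Pod name is invented, and treating a Pod in the `Running` phase as "active" is an assumption
rather than something this proposal defines.

```yaml
# Abridged, hypothetical Pod controlled by the scanned workload.
# Only the fields relevant to choosing a node for the scan Job are shown.
apiVersion: v1
kind: Pod
metadata:
  name: nginx-5d59d67564-x2k8p    # hypothetical Pod name
  namespace: poc-ns
  labels:
    app: nginx
spec:
  # The value Starboard would copy into the scan Job's Pod template.
  nodeName: kind-control-plane
  containers:
    - name: nginx
      image: example.registry.com/nginx:1.16
status:
  phase: Running                  # assumed meaning of an "active" Pod
```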

### Example

Let's assume that there's an `nginx` Deployment in the `poc-ns` namespace. It runs the `example.registry.com/nginx:1.16`
container image from the private registry `example.registry.com`. Registry credentials are stored in the
`private-registry` ImagePullSecret. (Alternatively, the ImagePullSecret can be attached to a service account referred to
by the Deployment.)

```yaml
---
apiVersion: v1
kind: Namespace
metadata:
  name: poc-ns
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: nginx
  name: nginx
  namespace: poc-ns
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      imagePullSecrets:
        - name: private-registry
      containers:
        - name: nginx
          image: example.registry.com/nginx:1.16
```

To scan the `nginx` container of the `nginx` Deployment, Starboard will create the following scan Job in the
`starboard-system` namespace and observe it until it's Completed or Failed.

```yaml
---
apiVersion: batch/v1
kind: Job
metadata:
  name: scan-vulnerabilityreport-ab3134
  namespace: starboard-system
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      # Explicit nodeName indicates our intention to schedule a scan pod
      # on the same cluster node where the nginx workload is running.
      # This could also imply considering taints and tolerations and other
      # properties respected by K8s scheduler.
      nodeName: kind-control-plane
      volumes:
        - name: scan-volume
          emptyDir: { }
      initContainers:
        # The trivy-get-binary init container is used to copy out the trivy executable
        # binary from the upstream Trivy container image, i.e. aquasec/trivy:0.19.2,
        # to a shared emptyDir volume.
        - name: trivy-get-binary
          image: aquasec/trivy:0.19.2
          command:
            - cp
            - -v
            - /usr/local/bin/trivy
            - /var/starboard/trivy
          volumeMounts:
            - name: scan-volume
              mountPath: /var/starboard
        # The trivy-download-db container is using trivy executable binary
        # from the previous step to download Trivy vulnerability database
        # from GitHub releases page.
        # This won't be required once Trivy supports ClientServer mode
        # for the fs subcommand.
        - name: trivy-download-db
          image: aquasec/trivy:0.19.2
          command:
            - /var/starboard/trivy
            - --download-db-only
            - --cache-dir
            - /var/starboard/trivy-db
          volumeMounts:
            - name: scan-volume
              mountPath: /var/starboard
      containers:
        # The nginx container is based on the container image that
        # we want to scan with Trivy. However, it has overwritten command (entrypoint)
        # to invoke trivy file system scan. The scan results are output to stdout
        # in JSON format, so we can parse them and store as VulnerabilityReport.
        - name: nginx
          image: example.registry.com/nginx:1.16
          # To scan image layers cached on a cluster node without pulling
          # it from a remote registry.
          imagePullPolicy: Never
          securityContext:
            # Trivy must run as root, so we set UID here.
            runAsUser: 0
          command:
            - /var/starboard/trivy
            - --cache-dir
            - /var/starboard/trivy-db
            - fs
            - --format
            - json
            - /
          volumeMounts:
            - name: scan-volume
              mountPath: /var/starboard
```

Notice that the scan Job does not use the registry credentials stored in the `private-registry` ImagePullSecret at all.
Also, the `imagePullPolicy` for the `nginx` container is set to `Never` to avoid pulling the image from the
`example.registry.com/nginx` repository, which requires authentication. Finally, the `nodeName` property is explicitly
set to `kind-control-plane` to make sure that the scan Job is scheduled on the same node as a Pod controlled by the
`nginx` Deployment. (We assume that there is at least one Pod controlled by the `nginx` Deployment and that it was
scheduled on the `kind-control-plane` node.)

Trivy must run as root, so the scan Job defines a `securityContext` with the `runAsUser` property set to UID `0`.

## Remarks

1. We cannot scan K8s workloads scaled down to 0 replicas because we cannot infer on which cluster node a scan Job
   should run. (In general, a node name is only set on a running Pod.) But once a workload is scaled up, Starboard
   Operator will receive the update event and will have another chance to scan it.
2. It's hard to identify Pods managed by the CronJob controller, so we'll skip them.
3. The Trivy filesystem command does not work in [ClientServer] mode. Therefore, this solution is subject to the limits
   of the [Standalone] mode. We plan to extend the Trivy filesystem command to work in ClientServer mode and improve the
   implementation of Starboard once it's available.
4. Trivy must run as root, and this may be blocked by some admission controllers such as PodSecurityPolicy.
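
As an illustration of the last remark, a PodSecurityPolicy along the lines of the sketch below would reject the scan
Job's Pod, because the Pod explicitly requests `runAsUser: 0`. The policy name is hypothetical, and the sketch assumes
a cluster where the `policy/v1beta1` PodSecurityPolicy API is still served and where this policy is the one applied to
the scan Pod.

```yaml
# Hypothetical PodSecurityPolicy that forbids running containers as root.
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted-nonroot        # hypothetical policy name
spec:
  privileged: false
  runAsUser:
    # Pods that set runAsUser to 0, like the scan Job above, fail validation.
    rule: MustRunAsNonRoot
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  volumes:
    # emptyDir is needed for the shared scan-volume.
    - emptyDir
    - secret
```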

[Devendra Turkar]: https://github.com/deven0t
[Daniel Pacak]: https://github.com/danielpacak
[Standalone]: https://aquasecurity.github.io/starboard/v0.13.2/integrations/vulnerability-scanners/trivy/#standalone
[ClientServer]: https://aquasecurity.github.io/starboard/v0.13.2/integrations/vulnerability-scanners/trivy/#clientserver
[Configuring nodes to authenticate to a private registry]: https://kubernetes.io/docs/concepts/containers/images/#configuring-nodes-to-authenticate-to-a-private-registry
[AWS ECR Private registry authentication]: https://docs.aws.amazon.com/AmazonECR/latest/userguide/registry_auth.html