This repository was archived by the owner on Nov 1, 2022. It is now read-only.

Flux 1.22.2+ - kustomize build commands causing flux pod evictions due to disk space consumption in tmp directory #3500

Closed
@mmcaya

Description

Describe the bug

After upgrading to flux 1.22.2, our k8s clusters immediately saw a spike in flux pod disk consumption because /tmp is not properly cleaned up after sync loops that run the kustomize build commands in our .flux.yaml.
Disk space was quickly consumed, leading to flux pod evictions due to disk pressure.

Initial investigation suggests the update from #3381 is errantly or prematurely cancelling the sync loop, leaving orphaned data in the /tmp directory that kustomize normally clears itself when it completes execution.

I haven't traced through the entire code execution path, but the context passed to kustomize build (or whatever is in the generators section of the .flux.yaml) via execCommand (link below) already had a timeout based on the same sync timeout setting that the PR noted above uses. The same context is therefore now wrapped with a timeout twice.

See: https://github.com/fluxcd/flux/blob/master/pkg/manifests/configfile.go#L492

Issue is not present when reverting to 1.22.1.

I've also tried both kustomize 3.8.4 and 3.8.10 locally with flux 1.22.2 to rule out kustomize as the culprit (it was also upgraded in flux 1.22.2), and so far I have seen the issue with both versions.

To Reproduce

Steps to reproduce the behaviour:

  1. Provide Flux install instructions
    Current flux settings:
    spec:
      containers:
      - args:
        - --log-format=fmt
        - --ssh-keygen-dir=/var/fluxd/keygen
        - --k8s-secret-name=XXX
        - --memcached-hostname=XXX
        - --sync-state=git
        - --sync-timeout=10m
        - --memcached-service=XXX
        - --git-url=git@github.com:XXX
        - --git-branch=release
        - --git-path=overlays/XXX
        - --git-readonly=false
        - --git-user=Weave Flux
        - --git-email=support@weave.works
        - --git-set-author=false
        - --git-poll-interval=5m
        - --git-timeout=20s
        - --sync-interval=5m
        - --git-ci-skip=false
        - --git-label=XXX-flux-sync
        - --manifest-generation=true
        - --registry-poll-interval=5m
        - --registry-rps=200
        - --registry-burst=125
        - --registry-trace=false
  2. Provide a GitHub repository with Kubernetes manifests

Sample manifests to reproduce issue:

.
├── bases
│   └── aws-load-balancer-controller
│       └── kustomization.yaml
└── overlays
    ├── aws-load-balancer-controller
    │   └── kustomization.yaml
    └── .flux.yaml

.flux.yaml

version: 1
patchUpdated:
  generators:
    - command: kustomize build aws-load-balancer-controller
  patchFile:  flux-patch.yaml

bases/aws-load-balancer-controller/kustomization.yaml

resources:
  - github.com/kubernetes-sigs/aws-load-balancer-controller/config/default/?ref=v2.2.1

overlays/aws-load-balancer-controller/kustomization.yaml

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ../../bases/aws-load-balancer-controller

Expected behavior

Generator commands should not exit prematurely because a context timeout tied to a parent function's cancel call fires; they should rely on the context timeout and error handling already present in execCommand.
As of now, we are pinned to version 1.22.1 until this issue is patched or we are able to start our migration to flux v2.

Logs

Sample tmp dir output from a live 1.22.2 instance

Right after startup

/home/flux # date
Fri Jul  9 19:57:07 UTC 2021
/home/flux # du -hs /tmp
72.2M   /tmp

After initial sync (~4 minutes later), already containing orphaned data

/home/flux # date
Fri Jul  9 20:01:01 UTC 2021

/home/flux # du -hs /tmp/*
532.0K  /tmp/flux-gitclone621756422
1.3M    /tmp/flux-working107927848
8.0K    /tmp/getter106386092
155.4M  /tmp/getter405598756
12.0K   /tmp/kustomize-937816237
12.0K   /tmp/kustomize-979226010

/home/flux # du -hs /tmp
157.2M  /tmp

30 minutes later, with orphaned data growth

/home/flux # date
Fri Jul  9 20:32:48 UTC 2021

/home/flux # du -hs /tmp/*
544.0K  /tmp/flux-gitclone621756422
1.3M    /tmp/flux-working082270969
8.0K    /tmp/getter013765201
8.0K    /tmp/getter106386092
155.4M  /tmp/getter198844007
8.0K    /tmp/getter246770219
155.4M  /tmp/getter329558908
8.0K    /tmp/getter388119449
155.4M  /tmp/getter405598756
8.0K    /tmp/getter494579190
8.0K    /tmp/getter588493785
8.0K    /tmp/getter616006697
100.3M  /tmp/getter679815951
8.0K    /tmp/getter932683696
480.0K  /tmp/kustomize-770103074
12.0K   /tmp/kustomize-937816237
12.0K   /tmp/kustomize-979226010

/home/flux # du -hs /tmp
583.5M  /tmp

For comparison, here is data from flux 1.22.1 working as expected on the same git repo, with no data being orphaned:

Initial startup

/home/flux # date
Fri Jul  9 21:12:56 UTC 2021
/home/flux # du -hs /tmp
1.8M    /tmp
/home/flux # du -hs /tmp/*
532.0K  /tmp/flux-gitclone740260762
1.3M    /tmp/flux-working391222833

~30 minutes later

/home/flux # date
Fri Jul  9 21:45:35 UTC 2021
/home/flux # du -hs /tmp
1.8M    /tmp
/home/flux # du -hs /tmp/*
544.0K  /tmp/flux-gitclone740260762
1.3M    /tmp/flux-working323903471

~50 minutes later

/home/flux # date
Fri Jul  9 22:02:51 UTC 2021
/home/flux # du -hs /tmp/*
544.0K  /tmp/flux-gitclone740260762
1.3M    /tmp/flux-working208442515
/home/flux # du -hs /tmp
1.8M    /tmp

Additional context

  • Flux version: 1.22.2+
  • Kubernetes version: 1.18, 1.19
