
Webhooks get context deadline exceeded #1094

Open
mbrancato opened this issue Sep 4, 2024 · 19 comments

Comments

@mbrancato

mbrancato commented Sep 4, 2024

Pulling this in from Slack for tracking - after some time I've been unable to reliably resolve this. I am running the VM Operator on 6+ clusters but am only seeing this frequently on a few of them, and I could not identify any correlation as to why.

After upgrading to v0.47.3 (helm chart 0.34.7), I'm getting these errors from server-side apply:

Internal error occurred: failed calling webhook "vmagent.victoriametrics.com": failed to call webhook: Post "https://victoria-metrics-operator.monitoring.svc/validate-operator-victoriametrics-com-v1beta1-vmagent?timeout=10s": context deadline exceeded

I tried disabling the vmagent webhook, and then I started to get the same error with other resources like vmalert. There are no errors in the logs that seem directly related.

I've been able to confirm that we can connect to the service, it just seems to not respond to the webhook posts.
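
For example, a basic reachability check along these lines succeeds (just a sketch, not my exact command - curlimages/curl is simply an arbitrary image that ships curl):

% kubectl -n monitoring run webhook-check --rm -it --restart=Never \
    --image=curlimages/curl -- \
    curl -vk https://victoria-metrics-operator.monitoring.svc/validate-operator-victoriametrics-com-v1beta1-vmagent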

@AndrewChubatiuk
Contributor

hey @mbrancato
what tool are you using for chart deployment?

@mbrancato
Author

@AndrewChubatiuk we are using the helm-controller, which under the hood uses real helm. It's not like ArgoCD, which templates charts - I've run into multiple issues in the past with templating of charts. The helm-controller uses HelmRelease CRDs to define releases.

% helm -n flux-system status victoria-metrics-operator
NAME: victoria-metrics-operator
LAST DEPLOYED: Wed Sep  4 17:30:47 2024
NAMESPACE: monitoring
STATUS: deployed
REVISION: 14
TEST SUITE: None
NOTES:
victoria-metrics-operator has been installed. Check its status by running:
  kubectl --namespace monitoring get pods -l "app.kubernetes.io/instance=victoria-metrics-operator"

@AndrewChubatiuk
Contributor

are you using the default values for the chart?

@mbrancato
Author

Here is my whole release definition @AndrewChubatiuk. What is under spec.values is basically the values.yaml.

I added admissionWebhooks.enabled: false yesterday after the upgrade just to stop getting errors from a slack integration during reconciliation attempts.

---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: victoria-metrics-operator
  namespace: flux-system
spec:
  chart:
    spec:
      chart: victoria-metrics-operator
      sourceRef:
        kind: HelmRepository
        name: victoria-metrics
      version: 0.34.7
  interval: 1m0s
  targetNamespace: monitoring
  releaseName: victoria-metrics-operator
  timeout: "600s"

  values:
    image:
      tag: v0.47.3
    crd:
      create: true

    # Disabled webhooks that were added somewhere around 0.34.0 due to context timeouts
    # https://github.com/VictoriaMetrics/operator/issues/1094
    admissionWebhooks:
      enabled: false

    replicaCount: 1

    rbac:
      create: true

    operator:
      disable_prometheus_converter: false
      prometheus_converter_add_argocd_ignore_annotations: false
      enable_converter_ownership: false
      useCustomConfigReloader: false

    serviceAccount:
      create: true

    extraArgs:
      loggerFormat: json
      loggerJSONFields: "ts:timestamp,msg:message,level:severity"

    resources:
      limits:
        cpu: 120m
        memory: 320Mi
      requests:
        cpu: 80m
        memory: 120Mi

@mbrancato
Author

And here it is read back from helm - same data:

% helm -n flux-system get values victoria-metrics-operator       
USER-SUPPLIED VALUES:
admissionWebhooks:
  enabled: false
crd:
  create: true
extraArgs:
  loggerFormat: json
  loggerJSONFields: ts:timestamp,msg:message,level:severity
image:
  tag: v0.47.3
operator:
  disable_prometheus_converter: false
  enable_converter_ownership: false
  prometheus_converter_add_argocd_ignore_annotations: false
  useCustomConfigReloader: false
rbac:
  create: true
replicaCount: 1
resources:
  limits:
    cpu: 120m
    memory: 320Mi
  requests:
    cpu: 80m
    memory: 120Mi
serviceAccount:
  create: true

@AndrewChubatiuk
Contributor

can it be related to a firewall or any connectivity issues? are other webhooks working properly?

@mbrancato
Author

Other admission webhooks work fine. I triple-checked - we're running GKE, so we do have the 8443 and 9443 ports open from the master nodes. This did behave differently, though - the failure is a context deadline exceeded - so I'm tempted to test running it with a much longer timeout somehow and see if it ever completes. It succeeds on some of our clusters, but about half fail with this error.

@AndrewChubatiuk
Contributor

From the error message you've posted:

Internal error occurred: failed calling webhook "vmagent.victoriametrics.com": failed to call webhook: Post "https://victoria-metrics-operator.monitoring.svc/validate-operator-victoriametrics-com-v1beta1-vmagent?timeout=10s": context deadline exceeded

the webhook service is on port 443. Do you have it open as well?

@AndrewChubatiuk
Contributor

hey @mbrancato
please try the latest operator version - the webhook port on the service was changed to 9443 there
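
You can verify which port the service exposes with something like:

% kubectl -n monitoring get svc victoria-metrics-operator -o jsonpath='{.spec.ports}{"\n"}'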

@mbrancato
Author

@AndrewChubatiuk

I first re-enabled the webhook in a test cluster and saw the same behavior. I then edited the webhook config and upped the timeout to 30 seconds. That too failed to help.

So I upgraded the operator chart to 0.34.8, but I saw no change. I'm still getting these errors - not consistently, but frequently. Our kustomize operator does a server-side dry run every minute or two, but it might be 2 minutes in between these logs, or it might be 10 minutes.

Internal error occurred: failed calling webhook "vmagent.victoriametrics.com": failed to call webhook: Post "https://victoria-metrics-operator.monitoring.svc:9443/validate-operator-victoriametrics-com-v1beta1-vmagent?timeout=10s": context deadline exceeded

I also tried deleting the webhook config:
kubectl delete validatingwebhookconfigurations victoria-metrics-operator-admission
And then forced a helm upgrade and let it be re-created, and then restarted the operator pod to let it re-register webhooks. Admittedly that "helped", but the issue does not fully go away.

Is there any way to add request logging to the webhooks? I looked and there didn't seem to be any debug logs set.

I did confirm in firewall logs that the traffic and calls to the webhook are allowed. I didn't change any rules, I just enabled logging. Note the pod IP: 10.28.16.160

% kubectl -n monitoring get pod -l app.kubernetes.io/instance=victoria-metrics-operator -o wide
NAME                                         READY   STATUS    RESTARTS   AGE   IP             NODE                                       NOMINATED NODE   READINESS GATES
victoria-metrics-operator-546c6b487f-bknbz   1/1     Running   0          32m   10.28.16.160   gke-cluster-vpc1-midmem-v4-dcf4d3cf-bbm1   <none>           <none>

And the allow log from the master node to that pod:
(screenshot: firewall allow-log entry from the master node to pod 10.28.16.160)

I do think the traffic is hitting the operator, but that it might just get stuck and periodically not respond. It is very inconsistent - for example, it hasn't happened in the last 25 minutes. I will leave the upgraded chart and firewall logs enabled in my test environment for now.

@AndrewChubatiuk
Contributor

AndrewChubatiuk commented Sep 11, 2024

we're using the helm lookup function to see whether a secret with webhook certificates is already present in the K8s cluster. This might be the cause of the issue you're having: kustomize doesn't communicate with the cluster during template rendering, so it might recreate the secret on each run. The solution for you is to either disable webhooks or use cert-manager.
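
The pattern in the chart is roughly this (a minimal sketch; the secret name here is illustrative, not necessarily the chart's exact one):

{{- /* sketch: reuse the webhook cert secret if it already exists in the cluster */}}
{{- /* note: lookup returns an empty map when rendering without cluster access (e.g. plain `helm template`), so a new cert would be generated on every such run */}}
{{- $existing := lookup "v1" "Secret" .Release.Namespace "victoria-metrics-operator-validation" }}
{{- if $existing }}
  tls.crt: {{ index $existing.data "tls.crt" }}
  tls.key: {{ index $existing.data "tls.key" }}
{{- else }}
  {{- /* otherwise generate a fresh self-signed certificate, e.g. via genSelfSignedCert */}}
{{- end }}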

@mbrancato
Author

mbrancato commented Sep 11, 2024

@AndrewChubatiuk kustomize is only used to define the HelmRelease resource (CRD). It really is helm that runs the upgrade.

I wouldn't expect a context deadline error from certificate problems or connection failures. I would only expect that if a firewall silently dropped packets, but that doesn't apply here.

edit: kustomize is also used to apply the VMAgent resources etc.; that is what is triggering the failed webhooks.

edit2: just to clarify - helm is not templating the chart the way Argo does.

@mbrancato
Author

As an update, I haven't seen any more alerts of dry-run applies on VMAgent and other resources causing failed webhooks. The process that seems to work (commands sketched below) is to:

  • delete the existing validating webhook config
  • perform the helm upgrade
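
Roughly these commands (using Flux in my case; a plain helm upgrade should be equivalent):

# 1. drop the existing validating webhook config
% kubectl delete validatingwebhookconfigurations victoria-metrics-operator-admission

# 2. force the helm upgrade so the chart recreates it with a fresh cert
% flux reconcile helmrelease victoria-metrics-operator -n flux-system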

@mbrancato
Author

mbrancato commented Sep 13, 2024

@AndrewChubatiuk I tried deploying again, and I'm still getting a lot of timeouts.

The good news is I'm able to manually reproduce the slow validation using curl. The performance here varies a lot, and the timeout alerts I'm getting are fairly infrequent, but it's enough to be a problem.

curl -k -X POST "https://victoria-metrics-operator.monitoring.svc:9443/validate-operator-victoriametrics-com-v1beta1-vmagent?timeout=10s" \
  -H "Content-Type: application/json" \
  -d '{
    "apiVersion": "admission.k8s.io/v1",
    "kind": "AdmissionReview",
    "request": {
      "uid": "b0d0cc74-565a-4d0e-acd6-eb2b82e41126",
      "kind": {
        "group": "operator.victoriametrics.com",
        "version": "v1beta1",
        "kind": "VMAgent"
      },
      "resource": {
        "group": "operator.victoriametrics.com",
        "version": "v1beta1",
        "resource": "vmagents"
      },
      "namespace": "monitoring",
      "operation": "CREATE",
      "object": {
        "apiVersion": "operator.victoriametrics.com/v1beta1",
        "kind": "VMAgent",
        "metadata": {
          "name": "vmagent",
          "namespace": "monitoring"
        },
        "spec": {
          "image": {
            "tag": "v1.103.0"
          },
          "selectAllByDefault": true,
          "scrapeInterval": "30s",
          "replicaCount": 1,
          "shardCount": 4,
          "logFormat": "json",
          "extraArgs": {
            "memory.allowedPercent": "40",
            "promscrape.cluster.memberLabel": "vmagent",
            "remoteWrite.forceVMProto": "true,true",
            "loggerJSONFields": "ts:timestamp,msg:message,level:severity",
            "promscrape.maxScrapeSize": "32MB"
          },
          "resources": {
            "requests": {
              "cpu": "5",
              "memory": "10Gi"
            }
          },
          "remoteWrite": [
            {
              "url": "http://myvm1:8080/insert/0/prometheus/api/v1/write",
              "sendTimeout": "4m",
              "inlineUrlRelabelConfig": [
                { "action": "drop_metrics", "regex": "^jsjdsds_[^:]*" },
                { "action": "drop_metrics", "regex": "^jsdujf_[^:]*$" },
                { "action": "drop_metrics", "regex": "^igsifjsf_[^:]*$" },
                { "action": "drop_metrics", "regex": "^ufshufs_$" },
                { "action": "drop_metrics", "regex": "^uwyfdjf_[^:]*$" },
                { "action": "drop_metrics", "regex": "^jsfhdhyf_[^:]*$" },
                { "action": "drop_metrics", "regex": "^event_handler_[^:]*$" },
                { "action": "drop_metrics", "regex": "^sfuhfe_[^:]*$" },
                { "action": "drop_metrics", "regex": "^node_[^:]*$" },
                { "action": "drop_metrics", "regex": "^adbfjwd_[^:]*$" },
                { "action": "drop_metrics", "regex": "^oiufjwdh_[^:]*$" }
              ]
            },
            {
              "url": "http://myvm2:8080/insert/0/prometheus/api/v1/write",
              "sendTimeout": "4m",
              "streamAggrConfig": {
                "keepInput": false,
                "rules": [
                  {
                    "match": [ "{__name__=~\"^sdfsf_.+\"}" ],
                    "interval": "30s",
                    "outputs": [ "total" ],
                    "without": [ "id", "name", "instance", "pod", "node" ],
                    "staleness_interval": "5m"
                  },
                  {
                    "match": [ "{__name__=~\"^ifsufhdf_.+\"}" ],
                    "interval": "30s",
                    "outputs": [ "total" ],
                    "without": [ "uid", "pod", "container_id" ],
                    "staleness_interval": "5m"
                  },
                  {
                    "match": [ "{__name__=~\"^iwefjsdf_.+\"}" ],
                    "interval": "30s",
                    "outputs": [ "total" ],
                    "without": [ "pod", "instance" ],
                    "staleness_interval": "5m"
                  },
                  {
                    "match": [ "{__name__=~\"^sdfwefw_.+\"}" ],
                    "interval": "30s",
                    "outputs": [ "total", "total_prometheus" ],
                    "without": [ "pod", "instance", "controller_pod" ],
                    "staleness_interval": "5m"
                  },
                  {
                    "match": [ "{__name__=~\"^ysdfsf_.+\"}" ],
                    "interval": "30s",
                    "outputs": [ "total" ],
                    "without": [ "pod" ],
                    "staleness_interval": "5m"
                  },
                  {
                    "match": [ "{__name__=~\"^kdodjf_.+\"}" ],
                    "interval": "30s",
                    "outputs": [ "total" ],
                    "without": [ "pod", "instance" ],
                    "staleness_interval": "5m"
                  },
                  {
                    "match": [ "{__name__=~\"^hdjksdf_.+\"}" ],
                    "interval": "30s",
                    "outputs": [ "total", "sum_samples", "max", "total_prometheus" ],
                    "without": [ "pod", "instance" ],
                    "staleness_interval": "5m"
                  },
                  {
                    "match": [ "{__name__=~\"^dsdjfwe_.+\"}" ],
                    "interval": "30s",
                    "outputs": [ "total", "total_prometheus" ],
                    "without": [ "pod", "instance" ],
                    "staleness_interval": "5m"
                  },
                  {
                    "match": [ "{__name__=~\"^lssd_.+\"}" ],
                    "interval": "30s",
                    "outputs": [ "total" ],
                    "without": [ "pod", "instance" ],
                    "staleness_interval": "5m"
                  },
                  {
                    "match": [ "{__name__=~\"^jwewjefjwf_.+\"}" ],
                    "interval": "30s",
                    "outputs": [ "total", "total_prometheus" ],
                    "without": [ "pod", "instance" ],
                    "staleness_interval": "5m"
                  },
                  {
                    "match": [ "{__name__=~\"^hdfhwfw_.+\"}" ],
                    "interval": "30s",
                    "outputs": [ "total" ],
                    "without": [ "pod", "instance" ],
                    "staleness_interval": "5m"
                  }
                ]
              }
            }
          ],
          "remoteWriteSettings": {
            "flushInterval": "15s",
            "queues": 32
          },
          "externalLabels": {
            "rfwfewf": "ifwefwf",
            "idfjugu": "jfugrs",
            "sdiwenf": "dsfegwef"
          },
          "serviceAccountName": "my-svc-acct",
          "containers": [
            {
              "name": "urnferjsdf",
              "image": "lfdnwe/urnferjsdf:0.0.2",
              "env": [
                { "name": "FOO_OSDFSDF", "value": "juef" },
                { "name": "FOO_OISDFJ", "value": "oijfogjw" },
                { "name": "FOO_TWGSDF", "value": "urbweyfb" },
                { "name": "FOO_WGSSDJG", "value": "nsbhqkf" },
                { "name": "FOO_JQWJGTB", "value": "iusdfuh" }
              ]
            }
          ]
        }
      },
      "dryRun": true
    }
  }'

Timing this with curl, it's pretty slow:

{"kind":"AdmissionReview","apiVersion":"admission.k8s.io/v1","response":{"uid":"b0d0cc74-565a-4d0e-acd6-eb2b82e41126","allowed":true,"status":{"metadata":{},"code":200}}}

real	0m9.470s
user	0m0.011s
sys	0m0.006s
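
(The timings are just the request wrapped in the shell's time builtin; review.json here is a hypothetical file holding the AdmissionReview payload above:)

% time curl -sk -X POST \
    "https://victoria-metrics-operator.monitoring.svc:9443/validate-operator-victoriametrics-com-v1beta1-vmagent?timeout=10s" \
    -H "Content-Type: application/json" \
    -d @review.json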

If I drop this from the remoteWrite section, it becomes extremely fast to validate:

            {
              "url": "http://myvm1:8080/insert/0/prometheus/api/v1/write",
              "sendTimeout": "4m",
              "inlineUrlRelabelConfig": [
                { "action": "drop_metrics", "regex": "^jsjdsds_[^:]*" },
                { "action": "drop_metrics", "regex": "^jsdujf_[^:]*$" },
                { "action": "drop_metrics", "regex": "^igsifjsf_[^:]*$" },
                { "action": "drop_metrics", "regex": "^ufshufs_$" },
                { "action": "drop_metrics", "regex": "^uwyfdjf_[^:]*$" },
                { "action": "drop_metrics", "regex": "^jsfhdhyf_[^:]*$" },
                { "action": "drop_metrics", "regex": "^event_handler_[^:]*$" },
                { "action": "drop_metrics", "regex": "^sfuhfe_[^:]*$" },
                { "action": "drop_metrics", "regex": "^node_[^:]*$" },
                { "action": "drop_metrics", "regex": "^adbfjwd_[^:]*$" },
                { "action": "drop_metrics", "regex": "^oiufjwdh_[^:]*$" }
              ]
            },

Timing after it is removed:

{"kind":"AdmissionReview","apiVersion":"admission.k8s.io/v1","response":{"uid":"9dfd4ec3-587a-40d6-968c-9dafe3b286cb","allowed":true,"status":{"metadata":{},"code":200}}}

real	0m0.018s
user	0m0.009s
sys	0m0.003s

@f41gh7
Collaborator

f41gh7 commented Sep 13, 2024

Thanks for the detailed investigation!

It's related to VictoriaMetrics issue VictoriaMetrics/VictoriaMetrics#6911

@hagen1778
Contributor

@mbrancato could you try this vmagent image built from VictoriaMetrics/VictoriaMetrics#6934 with the regex fix?

@mbrancato
Author

> @mbrancato could you try this vmagent image built from VictoriaMetrics/VictoriaMetrics#6934 with the regex fix?

@hagen1778 is vmagent the correct component to update here? The webhook targets the operator binary, and that appears to be what is slow.

@hagen1778
Contributor

I'm confused. @AndrewChubatiuk @f41gh7 does the operator just parse those regexps from the relabeling configs, or does it do some matching?

@f41gh7
Collaborator

f41gh7 commented Sep 16, 2024

> I'm confused. @AndrewChubatiuk @f41gh7 does the operator just parse those regexps from the relabeling configs, or does it do some matching?

The operator performs logic validation of the relabel config rules (aka dry-run). It reuses the vmagent config parsing libs.
