Webhooks get context deadline exceeded #1094
hey @mbrancato

@AndrewChubatiuk we are using the helm-controller, which under the hood uses real helm.
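Not from the original exchange: since the flux helm-controller drives the Helm SDK directly, the release it creates shows up with ordinary Helm tooling, which is an easy way to confirm that "real helm" is in play.

```sh
# The helm-controller records a regular Helm release, so standard Helm
# commands can see it; the exact namespace depends on the storage settings.
helm list -A | grep victoria-metrics-operator
```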
Are you using default values for the chart?
Here is my whole release definition, @AndrewChubatiuk. What is under `values` is what I added:
```yaml
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: victoria-metrics-operator
  namespace: flux-system
spec:
  chart:
    spec:
      chart: victoria-metrics-operator
      sourceRef:
        kind: HelmRepository
        name: victoria-metrics
      version: 0.34.7
  interval: 1m0s
  targetNamespace: monitoring
  releaseName: victoria-metrics-operator
  timeout: "600s"
  values:
    image:
      tag: v0.47.3
    crd:
      create: true
    # Disabled webhooks that were added somewhere around 0.34.0 due to context timeouts
    # https://github.com/VictoriaMetrics/operator/issues/1094
    admissionWebhooks:
      enabled: false
    replicaCount: 1
    rbac:
      create: true
    operator:
      disable_prometheus_converter: false
      prometheus_converter_add_argocd_ignore_annotations: false
      enable_converter_ownership: false
      useCustomConfigReloader: false
    serviceAccount:
      create: true
    extraArgs:
      loggerFormat: json
      loggerJSONFields: "ts:timestamp,msg:message,level:severity"
    resources:
      limits:
        cpu: 120m
        memory: 320Mi
      requests:
        cpu: 80m
        memory: 120Mi
```
And here is the result of reading the values back from Helm: same data.
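The Helm output itself wasn't captured above; a sketch of how the deployed values could be read back for comparison (release name from the HelmRelease above; the namespace may differ depending on where helm-controller stores the release):

```sh
# User-supplied values for the deployed release.
helm get values victoria-metrics-operator -n monitoring
# Including chart defaults:
helm get values victoria-metrics-operator -n monitoring --all
```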
Can it be related to a firewall or other connectivity issues? Are other webhooks working properly?
Other admission webhooks work fine. I triple-checked: we're running GKE, so we do have ports 8443 and 9443 open from the master nodes. This did behave differently with a
From the error message you've posted, the webhook service is on port 443. Do you have it opened as well?
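Not part of the original exchange, but a quick way to confirm which port the webhook is actually exposed on, and whether the GKE control plane is allowed to reach it, might look like this. The service and namespace names are taken from the curl command later in the thread; the firewall rule name, network, source range, and node tag are hypothetical placeholders.

```sh
# Check which port the admission webhook config points at and how the
# operator service maps it to the container port.
kubectl get validatingwebhookconfigurations -o wide | grep -i victoria
kubectl -n monitoring get svc victoria-metrics-operator \
  -o jsonpath='{range .spec.ports[*]}{.name}{"\t"}{.port}{" -> "}{.targetPort}{"\n"}{end}'

# On GKE private clusters, the control plane only reaches nodes on 443/10250
# by default; webhooks served on other ports need an explicit firewall rule.
gcloud compute firewall-rules create allow-master-to-webhook \
  --network my-vpc \
  --source-ranges 172.16.0.0/28 \
  --allow tcp:8443,tcp:9443 \
  --target-tags my-gke-nodes
```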
hey @mbrancato
we're using helm
@AndrewChubatiuk kustomize is only used to define the HelmRelease resource (CRD). It really is helm doing the install under the hood. I wouldn't suspect a context error for certificate errors or connection failures; I would only expect that if a firewall silently dropped packets, but that doesn't apply here.

edit: kustomize is also used to apply the HelmRelease manifest.

edit2: just to clarify, there is no templating going on with helm like Argo does.
As an update, I haven't seen any more alerts of dry-run apply on
@AndrewChubatiuk I tried deploying again, and I'm still getting a lot of timeouts. The good news is that I'm able to manually reproduce the slow validation using curl:

```sh
curl -k -X POST "https://victoria-metrics-operator.monitoring.svc:9443/validate-operator-victoriametrics-com-v1beta1-vmagent?timeout=10s" \
  -H "Content-Type: application/json" \
  -d '{
  "apiVersion": "admission.k8s.io/v1",
  "kind": "AdmissionReview",
  "request": {
    "uid": "b0d0cc74-565a-4d0e-acd6-eb2b82e41126",
    "kind": { "group": "operator.victoriametrics.com", "version": "v1beta1", "kind": "VMAgent" },
    "resource": { "group": "operator.victoriametrics.com", "version": "v1beta1", "resource": "vmagents" },
    "namespace": "monitoring",
    "operation": "CREATE",
    "object": {
      "apiVersion": "operator.victoriametrics.com/v1beta1",
      "kind": "VMAgent",
      "metadata": { "name": "vmagent", "namespace": "monitoring" },
      "spec": {
        "image": { "tag": "v1.103.0" },
        "selectAllByDefault": true,
        "scrapeInterval": "30s",
        "replicaCount": 1,
        "shardCount": 4,
        "logFormat": "json",
        "extraArgs": {
          "memory.allowedPercent": "40",
          "promscrape.cluster.memberLabel": "vmagent",
          "remoteWrite.forceVMProto": "true,true",
          "loggerJSONFields": "ts:timestamp,msg:message,level:severity",
          "promscrape.maxScrapeSize": "32MB"
        },
        "resources": { "requests": { "cpu": "5", "memory": "10Gi" } },
        "remoteWrite": [
          {
            "url": "http://myvm1:8080/insert/0/prometheus/api/v1/write",
            "sendTimeout": "4m",
            "inlineUrlRelabelConfig": [
              { "action": "drop_metrics", "regex": "^jsjdsds_[^:]*" },
              { "action": "drop_metrics", "regex": "^jsdujf_[^:]*$" },
              { "action": "drop_metrics", "regex": "^igsifjsf_[^:]*$" },
              { "action": "drop_metrics", "regex": "^ufshufs_$" },
              { "action": "drop_metrics", "regex": "^uwyfdjf_[^:]*$" },
              { "action": "drop_metrics", "regex": "^jsfhdhyf_[^:]*$" },
              { "action": "drop_metrics", "regex": "^event_handler_[^:]*$" },
              { "action": "drop_metrics", "regex": "^sfuhfe_[^:]*$" },
              { "action": "drop_metrics", "regex": "^node_[^:]*$" },
              { "action": "drop_metrics", "regex": "^adbfjwd_[^:]*$" },
              { "action": "drop_metrics", "regex": "^oiufjwdh_[^:]*$" }
            ]
          },
          {
            "url": "http://myvm2:8080/insert/0/prometheus/api/v1/write",
            "sendTimeout": "4m",
            "streamAggrConfig": {
              "keepInput": false,
              "rules": [
                { "match": [ "{__name__=~\"^sdfsf_.+\"}" ], "interval": "30s", "outputs": [ "total" ], "without": [ "id", "name", "instance", "pod", "node" ], "staleness_interval": "5m" },
                { "match": [ "{__name__=~\"^ifsufhdf_.+\"}" ], "interval": "30s", "outputs": [ "total" ], "without": [ "uid", "pod", "container_id" ], "staleness_interval": "5m" },
                { "match": [ "{__name__=~\"^iwefjsdf_.+\"}" ], "interval": "30s", "outputs": [ "total" ], "without": [ "pod", "instance" ], "staleness_interval": "5m" },
                { "match": [ "{__name__=~\"^sdfwefw_.+\"}" ], "interval": "30s", "outputs": [ "total", "total_prometheus" ], "without": [ "pod", "instance", "controller_pod" ], "staleness_interval": "5m" },
                { "match": [ "{__name__=~\"^ysdfsf_.+\"}" ], "interval": "30s", "outputs": [ "total" ], "without": [ "pod" ], "staleness_interval": "5m" },
                { "match": [ "{__name__=~\"^kdodjf_.+\"}" ], "interval": "30s", "outputs": [ "total" ], "without": [ "pod", "instance" ], "staleness_interval": "5m" },
                { "match": [ "{__name__=~\"^hdjksdf_.+\"}" ], "interval": "30s", "outputs": [ "total", "sum_samples", "max", "total_prometheus" ], "without": [ "pod", "instance" ], "staleness_interval": "5m" },
                { "match": [ "{__name__=~\"^dsdjfwe_.+\"}" ], "interval": "30s", "outputs": [ "total", "total_prometheus" ], "without": [ "pod", "instance" ], "staleness_interval": "5m" },
                { "match": [ "{__name__=~\"^lssd_.+\"}" ], "interval": "30s", "outputs": [ "total" ], "without": [ "pod", "instance" ], "staleness_interval": "5m" },
                { "match": [ "{__name__=~\"^jwewjefjwf_.+\"}" ], "interval": "30s", "outputs": [ "total", "total_prometheus" ], "without": [ "pod", "instance" ], "staleness_interval": "5m" },
                { "match": [ "{__name__=~\"^hdfhwfw_.+\"}" ], "interval": "30s", "outputs": [ "total" ], "without": [ "pod", "instance" ], "staleness_interval": "5m" }
              ]
            }
          }
        ],
        "remoteWriteSettings": { "flushInterval": "15s", "queues": 32 },
        "externalLabels": { "rfwfewf": "ifwefwf", "idfjugu": "jfugrs", "sdiwenf": "dsfegwef" },
        "serviceAccountName": "my-svc-acct",
        "containers": [
          {
            "name": "urnferjsdf",
            "image": "lfdnwe/urnferjsdf:0.0.2",
            "env": [
              { "name": "FOO_OSDFSDF", "value": "juef" },
              { "name": "FOO_OISDFJ", "value": "oijfogjw" },
              { "name": "FOO_TWGSDF", "value": "urbweyfb" },
              { "name": "FOO_WGSSDJG", "value": "nsbhqkf" },
              { "name": "FOO_JQWJGTB", "value": "iusdfuh" }
            ]
          }
        ]
      }
    },
    "dryRun": true
  }
}'
```

Timing this with curl, it's pretty slow:
If I drop this from the VMAgent spec, the timing after it's removed:
Thanks for the detailed investigation! It's related to VictoriaMetrics issue VictoriaMetrics/VictoriaMetrics#6911.
@mbrancato could you try this vmagent image built from VictoriaMetrics/VictoriaMetrics#6934 with the regex fix?
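The exact image reference isn't preserved in this thread. Assuming a test build is published somewhere, one way to try it is to override spec.image on the VMAgent resource and let the operator roll it out (the repository and tag below are placeholders, not the actual build from #6934):

```sh
# Point the VMAgent custom resource at a test image; the operator then
# updates the managed vmagent deployment with it.
kubectl -n monitoring patch vmagent vmagent --type merge \
  -p '{"spec":{"image":{"repository":"example.org/placeholder/vmagent","tag":"test-regex-fix"}}}'
```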
@hagen1778 is
I'm confused. @AndrewChubatiuk @f41gh7 does the operator just parse those regexps from the relabeling configs, or does it do some matching as well?
The operator performs logical validation of relabel config rules (aka dry-run). It reuses vmagent's config parsing libraries.
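Not part of the thread, but since the validation reuses vmagent's parsing code, the parsing cost can in principle be measured outside the webhook. A rough sketch, assuming vmagent's -dryRun flag validates -remoteWrite.urlRelabelConfig files as documented; the rule is copied from the report above:

```sh
# Write one of the reported drop_metrics rules to a file and let vmagent
# validate it without starting; timing this isolates the relabel-config
# parsing cost from everything the operator adds on top.
cat > /tmp/url-relabel.yaml <<'EOF'
- action: drop_metrics
  regex: "^jsjdsds_[^:]*"
EOF

time vmagent -dryRun \
  -remoteWrite.url=http://localhost:8428/api/v1/write \
  -remoteWrite.urlRelabelConfig=/tmp/url-relabel.yaml
```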
Pulling this in from Slack for tracking: after some time, I've been unable to reliably resolve this. I am running the VM Operator on 6+ clusters but only seeing this frequently on a few of them, and I could not identify any correlation as to why.
After upgrading to v0.47.3 (helm chart 0.34.7) using the helm chart, I'm getting these errors from server-side apply:

I tried disabling the vmagent webhook, and then I started to get the same error with other resources like vmalert. There are no errors in the logs that seem directly related.

I've been able to confirm that we can connect to the service; it just seems to not respond to the webhook posts.
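As a side note (not from the original report), a basic in-cluster reachability check against the webhook port could look like the sketch below; the service name and port come from the curl reproduction earlier in the thread, and the probe pod name is arbitrary.

```sh
# Open a TLS connection to the webhook port from a throwaway pod; a fast
# handshake followed by a hanging POST matches "we can connect, but it
# doesn't respond to the webhook posts".
kubectl -n monitoring run webhook-probe --rm -it --restart=Never \
  --image=curlimages/curl -- \
  curl -vk --max-time 10 \
  https://victoria-metrics-operator.monitoring.svc:9443/
```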