MON-1666: CMO deployment: pass enabled-remote-write #1416

Closed

Conversation

jan--f
Contributor

@jan--f jan--f commented Oct 6, 2021

in order to switch telemeter over to Prometheus remote write.

Signed-off-by: Jan Fajerski <jfajersk@redhat.com>

  • I added CHANGELOG entry for this change.
  • No user facing changes, so no entry in CHANGELOG was needed.

@openshift-ci
Contributor

openshift-ci bot commented Oct 6, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jan--f

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 6, 2021
@arajkumar
Contributor

/retest

@jan--f
Contributor Author

jan--f commented Oct 7, 2021

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 7, 2021
@jan--f
Contributor Author

jan--f commented Oct 7, 2021

Generally this seems to work. However some wrinkles need to be ironed out.

cc @simonpasquier @ianbillett

@bill3tt

bill3tt commented Oct 11, 2021

Let me repost our DM here for completeness...

The default value of the limit_bytes flag is 512k - in prod we set it to 5.1M.
IIRC we don't actually use the receive endpoint in telemeter at the moment; we use the upload endpoint, which is a telemeter-specific thing. See the /receive panel in this dashboard

@simonpasquier
Contributor

simonpasquier commented Oct 13, 2021

I've noticed from the logs that Prometheus sends metadata by default but I presume that we don't want this for telemeter. I believe that it should be turned off explicitly in the RemoteWrite spec.

@simonpasquier
Contributor

IIUC the /api/v1/receive endpoint has a size limit of 15k, while Prometheus could send up to 10,000 samples. That would explain the "request too big" errors.
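For illustration only, the kind of check that produces such an error can be sketched as below. This is not telemeter's code: the function name, the 413 status, and the reading of "15k" as 15 * 1024 bytes are all assumptions made for the example.

```python
from http import HTTPStatus


def check_request_size(body: bytes, limit_bytes: int) -> HTTPStatus:
    """Hypothetical size guard: reject bodies larger than limit_bytes."""
    if len(body) > limit_bytes:
        # Assumed status code; the thread only says "request too big".
        return HTTPStatus.REQUEST_ENTITY_TOO_LARGE
    return HTTPStatus.OK


limit_bytes = 15 * 1024           # assumed interpretation of "15k"
small_body = b"x" * 10_000        # fits under the limit
large_body = b"x" * 20_000        # exceeds the limit

small_status = check_request_size(small_body, limit_bytes)
large_status = check_request_size(large_body, limit_bytes)
print(small_status.value, large_status.value)
```

With a guard like this, any batch whose encoded size exceeds the limit is rejected regardless of how many samples it holds, which is why the limit and the samples-per-send setting need to be considered together.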

@jan--f
Contributor Author

jan--f commented Oct 13, 2021

I've noticed from the logs that Prometheus sends metadata by default but I presume that we don't want this for telemeter. I believe that it should be turned off explicitly in the RemoteWrite spec.

Yeah, makes sense. Tbh I'm not 100% sure yet what the impact is and whether telemeter can actually make use of this metadata. I'll investigate more, but until then let's turn it off.
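A minimal sketch of what turning it off could look like in a remote write entry, rendered as JSON (which is also valid YAML). The `metadataConfig.send` field follows the Prometheus Operator RemoteWriteSpec as I understand it; the URL is a placeholder, not the real telemeter endpoint, and the change actually made in this PR may differ.

```python
import json

# Hypothetical remote write entry; only metadataConfig.send is the point here.
remote_write = {
    "url": "https://telemeter.example.com/api/v1/receive",  # placeholder URL
    "metadataConfig": {"send": False},  # do not ship metric metadata
}

rendered = json.dumps({"remoteWrite": [remote_write]}, indent=2)
print(rendered)
```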

Signed-off-by: Jan Fajerski <jfajersk@redhat.com>
@jan--f
Contributor Author

jan--f commented Oct 14, 2021

2021-10-14-152738_1620x898
Going by Prometheus's prometheus_remote_storage_bytes_total metric, the requests should be around 70kB each (we have roughly 500kB over 7 requests and ~413kB over 6 requests).

Should we start with a 128kB request limit on the telemeter side?
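The estimate is simple division; a quick sketch of the arithmetic, using the rounded figures quoted in this comment (variable names are illustrative):

```python
# (total kB sent, number of requests), as quoted in the comment above.
observations = [(500, 7), (413, 6)]

# Average request size for each observation window.
avg_sizes_kb = [total / count for total, count in observations]
for (total, count), avg in zip(observations, avg_sizes_kb):
    print(f"{total} kB over {count} requests -> about {avg:.1f} kB per request")

# How much room the proposed limit leaves over the larger average.
proposed_limit_kb = 128
headroom = proposed_limit_kb / max(avg_sizes_kb)
print(f"A {proposed_limit_kb} kB limit leaves about {headroom:.1f}x headroom")
```

Both windows land close to 70kB per request, so a 128kB limit would leave a bit less than 2x headroom over what was observed in CI.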

@matej-g

matej-g commented Oct 14, 2021

Should we start with a 128kB request limit on the telemeter side?

Nice investigation 👍 That limit sounds ample and reasonable.

@simonpasquier
Contributor

Looking at the number of sent samples, we're at about 2k samples per minute. Knowing that the remote write is configured with a maximum number of samples per send = 10k and a batch deadline of 1m, it means that in the CI runs, we never reach the 10k limit. "Real" environments might generate more samples (e.g. more OLM operators = more telemetry data) and we may hit the 10k samples per send limit, meaning larger requests. I think that we should account for it by increasing the request limit on the telemeter server side (even more than 128k) and/or reducing the number of samples per send.

image
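To get a feel for how large a full batch could be, here is a back-of-the-envelope sketch. It assumes (and this is only an assumption) that each CI request carried roughly one minute's worth of samples, about 2k, and that request size scales linearly with the sample count. Remote write payloads are compressed, so real sizes will not scale exactly linearly. The ~70kB figure comes from the earlier measurement in this thread; all names are illustrative.

```python
observed_request_kb = 70        # approximate request size seen in CI
observed_samples = 2_000        # assumed samples per request (one 1m batch)
max_samples_per_send = 10_000   # configured remote write limit
proposed_limit_kb = 128         # limit suggested earlier in the thread

# Rough cost of one sample on the wire, then scale up to a full batch.
bytes_per_sample = observed_request_kb * 1000 / observed_samples
worst_case_kb = bytes_per_sample * max_samples_per_send / 1000

print(f"about {bytes_per_sample:.0f} bytes per sample")
print(f"a full {max_samples_per_send}-sample batch: about {worst_case_kb:.0f} kB")
print(f"exceeds {proposed_limit_kb} kB limit: {worst_case_kb > proposed_limit_kb}")
```

Under these assumptions a full batch would be well above 128kB, which is the concern raised above: either the server-side limit goes up or the samples-per-send setting comes down.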

@jan--f
Contributor Author

jan--f commented Oct 15, 2021

Agreed, the idea is to set a "reasonable" default in telemeter and deploy that in the staging environment. Then for production the limit will be set explicitly and will likely be a lot higher. This would be similar to how the upload endpoint is treated (512kB default and 5.1MB in production).

@jan--f
Contributor Author

jan--f commented Nov 10, 2021

/retest

@jan--f
Contributor Author

jan--f commented Nov 15, 2021

/retest

@jan--f
Contributor Author

jan--f commented Nov 16, 2021

/retest

@openshift-ci
Contributor

openshift-ci bot commented Nov 16, 2021

@jan--f: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-agnostic-operator | 27fabbc | link | true | /test e2e-agnostic-operator |
| ci/prow/e2e-agnostic | 27fabbc | link | true | /test e2e-agnostic |
| ci/prow/e2e-aws-single-node | 27fabbc | link | false | /test e2e-aws-single-node |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 27, 2022
@jan--f
Contributor Author

jan--f commented Mar 15, 2022

/close
Needs further research

@openshift-ci openshift-ci bot closed this Mar 15, 2022
@openshift-ci
Contributor

openshift-ci bot commented Mar 15, 2022

@jan--f: Closed this PR.

In response to this:

/close
Needs further research

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • do-not-merge/hold: Indicates that a PR should not merge because someone has issued a /hold command.
  • lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.
6 participants