[receiver/otlpreceiver] Support Rate Limiting #6725

Open
blakeroberts-wk opened this issue Dec 9, 2022 · 8 comments · Fixed by #9357
Comments

@blakeroberts-wk
Contributor

Is your feature request related to a problem? Please describe.

The OpenTelemetry specification outlines throttling behavior for both gRPC and HTTP; however, the OTLP receiver does not currently implement this (optional) part of the specification.

Right now, if a processor is under pressure, its only option is to return an error that causes the receiver to tell the client that the request failed and is not retryable.

Describe the solution you'd like

It would be neat if the receiver offered an error implementation that could be returned to it to signal that it should send an appropriately formatted response telling the client that the request was rate limited. The format of the response should follow the semantic convention (e.g., the HTTP receiver should return a status code of 429 and set the "Retry-After" header).

For example, the OTLP receiver could export the following error implementation:

package errors

import "time"

// ErrorRateLimited signals that a request was rejected due to rate limiting
// and carries the suggested backoff before the client retries.
type ErrorRateLimited struct {
	Backoff time.Duration
}

func (e *ErrorRateLimited) Error() string {
	return "Too Many Requests"
}

// ErrRateLimited is a sentinel for callers that only need to signal rate
// limiting without suggesting a backoff.
var ErrRateLimited error = &ErrorRateLimited{}

// NewErrRateLimited returns a rate-limit error carrying the given backoff.
func NewErrRateLimited(backoff time.Duration) error {
	return &ErrorRateLimited{
		Backoff: backoff,
	}
}

Any processor or exporter in the pipeline could return (optionally wrapping) this error:

import "go.opentelemetry.io/collector/receiver/otlpreceiver"

func (p *processor) ConsumeTraces(ctx context.Context, td ptrace.Traces) error {
	return otlpreceiver.NewErrRateLimited(time.Minute)
}

Then when handling errors from the pipeline, the receiver could check for this error:

import (
	"errors"
	"net/http"
	"strconv"
	"time"
)

var (
	err error
	w   http.ResponseWriter
)

errRateLimited := &ErrorRateLimited{}
if errors.As(err, &errRateLimited) {
	// OTLP/HTTP throttling: respond 429 and advertise the backoff in seconds.
	w.Header().Set("Retry-After", strconv.FormatInt(int64(errRateLimited.Backoff/time.Second), 10))
	w.WriteHeader(http.StatusTooManyRequests)
}
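
For the gRPC receiver, a similar check could map the error onto the status shape the spec uses for throttling: RESOURCE_EXHAUSTED with a RetryInfo detail carrying the suggested delay. A sketch only, assuming the same ErrorRateLimited type is in scope:

import (
	"errors"

	"google.golang.org/genproto/googleapis/rpc/errdetails"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
	"google.golang.org/protobuf/types/known/durationpb"
)

// toThrottledStatus is illustrative only: it converts the rate-limit error
// into the gRPC status used for throttling responses.
func toThrottledStatus(err error) *status.Status {
	rl := &ErrorRateLimited{}
	if !errors.As(err, &rl) {
		return status.New(codes.Internal, err.Error())
	}
	st := status.New(codes.ResourceExhausted, "Too Many Requests")
	if detailed, detailErr := st.WithDetails(&errdetails.RetryInfo{
		RetryDelay: durationpb.New(rl.Backoff),
	}); detailErr == nil {
		return detailed
	}
	return st
}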

Describe alternatives you've considered

To accomplish rate limiting, a fork of the OTLP receiver will be used. Here are the changes: main...blakeroberts-wk:opentelemetry-collector:otlpreceiver-rate-limiting.

Additional context

The above example changes include the addition of an internal histogram metric which records server latency (http.server.duration or rpc.server.duration) to allow monitoring of the collector's latency, throughput, and error rate. This portion of the changes is not necessary to support rate limiting.

There is an open issue regarding rate limiting (#3509); however, the approach suggested there seems to involve Redis, which goes beyond what I believe is necessary for the OTLP receiver to support rate limiting.

@atoulme
Contributor

atoulme commented Dec 17, 2022

Can you clarify that this is not only for OTLP but would be applicable to any pipeline with an exporter able to return this error?

@blakeroberts-wk
Contributor Author

Yeah, that's a good point. The collector could have some general errors or receiver/errors package that any receiver (or possibly even scrapers?) could look for in the return value from its next consumer. One point to keep in mind, though, is that the shape of the response in this case follows the OTel specification, but any non-OTLP receiver looking for these errors could handle them in accordance with its own specification, if any.
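
A rough sketch of what such a shared package might look like (the package and function names here are hypothetical, not an existing collector API):

// Package pipelineerror is a hypothetical home for errors that any receiver
// (or scraper) can inspect in the return value of its next consumer.
package pipelineerror

import (
	"errors"
	"time"
)

// RateLimited signals that downstream components refused the data and
// suggests how long the client should wait before retrying.
type RateLimited struct {
	Backoff time.Duration
}

func (e *RateLimited) Error() string { return "rate limited" }

// AsRateLimited reports whether err carries a rate-limit signal and, if so,
// returns the suggested backoff.
func AsRateLimited(err error) (time.Duration, bool) {
	rl := &RateLimited{}
	if errors.As(err, &rl) {
		return rl.Backoff, true
	}
	return 0, false
}

The OTLP receivers would translate this into a 429 or RESOURCE_EXHAUSTED response, while other receivers could map it onto whatever their own protocol defines, if anything.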

@blakeroberts-wk
Contributor Author

#9357 does not fully implement the OTel specification regarding OTLP/HTTP throttling: there is no way to set the Retry-After header.

@mx-psi reopened this Mar 27, 2024
@mx-psi
Member

mx-psi commented Mar 27, 2024

@TylerHelmuth can you take a look?

@TylerHelmuth
Member

The Retry-After header is optional. If the server has a recommendation for how the client should retry, it can be set, but the server is not required to provide this recommendation (and often may not be able to give a good one).

If the client receives an HTTP 429 or an HTTP 503 response and the “Retry-After” header is not present in the response, then the client SHOULD implement an exponential backoff strategy between retries.

The 429/503 response codes are enough to get an OTLP client to start enacting a retry strategy.

Reading through the issue again I agree that we could introduce more to the collector to allow components to explicitly define how they want clients to retry in known, controlled scenarios. For that use case, this issue is not completed yet.
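
To illustrate the quoted behavior, client-side retry timing might look roughly like this (constants and names are illustrative, not what the collector's exporters actually use):

import (
	"math/rand"
	"time"
)

// nextBackoff honors a server-provided Retry-After when present; otherwise it
// backs off exponentially with full jitter, capped at a maximum delay.
func nextBackoff(attempt int, retryAfter time.Duration) time.Duration {
	if retryAfter > 0 {
		return retryAfter
	}
	const (
		base = 1 * time.Second
		max  = 2 * time.Minute
	)
	d := base << attempt // 1s, 2s, 4s, ...
	if d <= 0 || d > max {
		d = max // guard against overflow and cap the delay
	}
	return time.Duration(rand.Int63n(int64(d))) // full jitter in [0, d)
}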

@blakeroberts-wk
Contributor Author

@TylerHelmuth Thank you for your response.

To support your analysis with personal experience: I have a custom processor that limits the number of unique trace IDs per service per minute. In this case, it is possible to determine the appropriate duration after which it should be permissible that the service resubmit their trace data.

Allowing an OTLP client to use exponential backoff is sufficient but not optimal. By optimal I mean a solution that, within system limits, minimizes the time between when a service creates a span or trace and when that span or trace becomes queryable in a backend storage system, and minimizes the resources (CPU, memory, network, I/O) required to report it from the originating service to that storage system. However, in most cases the benefit of this optimization will be small, if not negligible.
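
For context, a rough sketch of how such a processor could compute an exact backoff with a fixed one-minute window (illustrative only, not the actual processor; NewErrRateLimited is the hypothetical API proposed above):

import (
	"context"
	"sync"
	"time"

	"go.opentelemetry.io/collector/consumer"
	"go.opentelemetry.io/collector/pdata/pcommon"
	"go.opentelemetry.io/collector/pdata/ptrace"
	"go.opentelemetry.io/collector/receiver/otlpreceiver"
)

// traceIDLimiter caps the number of unique trace IDs seen per service within
// a fixed one-minute window. When the cap is exceeded, the backoff is exactly
// the time remaining in the current window.
type traceIDLimiter struct {
	next        consumer.Traces
	limit       int
	mu          sync.Mutex
	windowStart time.Time
	seen        map[string]map[pcommon.TraceID]struct{}
}

func (p *traceIDLimiter) ConsumeTraces(ctx context.Context, td ptrace.Traces) error {
	p.mu.Lock()
	defer p.mu.Unlock()

	now := time.Now()
	if now.Sub(p.windowStart) >= time.Minute {
		p.windowStart = now
		p.seen = map[string]map[pcommon.TraceID]struct{}{}
	}

	rss := td.ResourceSpans()
	for i := 0; i < rss.Len(); i++ {
		rs := rss.At(i)
		svc := "unknown"
		if v, ok := rs.Resource().Attributes().Get("service.name"); ok {
			svc = v.Str()
		}
		ids := p.seen[svc]
		if ids == nil {
			ids = map[pcommon.TraceID]struct{}{}
			p.seen[svc] = ids
		}
		sss := rs.ScopeSpans()
		for j := 0; j < sss.Len(); j++ {
			spans := sss.At(j).Spans()
			for k := 0; k < spans.Len(); k++ {
				ids[spans.At(k).TraceID()] = struct{}{}
			}
		}
		if len(ids) > p.limit {
			// The service should retry once the current window resets.
			return otlpreceiver.NewErrRateLimited(time.Minute - now.Sub(p.windowStart))
		}
	}
	return p.next.ConsumeTraces(ctx, td)
}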

@cforce
Contributor

cforce commented Sep 4, 2024

@blakeroberts-wk Can you open-source this custom processor that limits the number of unique trace IDs per service per minute?
