Implement Prometheus metrics reader constructor #2988
Conversation
Force-pushed from 1a24c7e to e8e1bb2
Codecov Report
@@ Coverage Diff @@
## master #2988 +/- ##
==========================================
- Coverage 95.98% 95.97% -0.01%
==========================================
Files 224 224
Lines 9731 9747 +16
==========================================
+ Hits 9340 9355 +15
Misses 323 323
- Partials 68 69 +1
Continue to review full report at Codecov.
Force-pushed from e8e1bb2 to 56b3963
Signed-off-by: albertteoh <albert.teoh@logz.io>
Force-pushed from 56b3963 to d9c1b2d
conn.Close()

return &MetricsReader{
	url: fmt.Sprintf("http://%s/api/v1/query_range", hostPort),
This URL will be used to query metrics for the configured Prometheus-compatible backend, i.e. Prometheus, M3, Thanos, etc.
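For context, a minimal sketch of how a query_range URL built this way could be exercised with the standard library alone; the PromQL expression, host:port, time range, and step below are placeholder values, not taken from this PR:

package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"time"
)

// queryRange issues a GET against a Prometheus-compatible query_range endpoint
// and returns the raw response body.
func queryRange(baseURL, promQL string, start, end time.Time, step time.Duration) (string, error) {
	params := url.Values{}
	params.Set("query", promQL)
	params.Set("start", fmt.Sprintf("%d", start.Unix()))
	params.Set("end", fmt.Sprintf("%d", end.Unix()))
	params.Set("step", fmt.Sprintf("%d", int64(step.Seconds()))) // step in seconds

	resp, err := http.Get(baseURL + "?" + params.Encode())
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	hostPort := "localhost:9090" // placeholder backend address
	baseURL := fmt.Sprintf("http://%s/api/v1/query_range", hostPort)
	body, err := queryRange(baseURL, `up`, time.Now().Add(-time.Hour), time.Now(), time.Minute)
	fmt.Println(body, err)
}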
return nil, fmt.Errorf("no prometheus query host:port provided")
}
errs := make([]error, 0)
for _, hostPort := range hostPorts {
This is kind of an odd behavior: why do we need to loop through hosts looking for a reachable one? Why wouldn't they all be reachable? Or, to put it another way, what's the point of providing unreachable host:port pairs as input?
If these addresses are somehow "unstable", then they might as well become unreachable after the reader is created.
Thanks for pointing this out @yurishkuro.
The initial motivation was to maintain config consistency with the ES and Cassandra span stores, which both support configuring multiple server URLs, presumably for redundancy. But yes, good call that the selected address could become unreachable after the reader is created.
This constructor behaviour was based on following ES's lead in how it initiates a new client by health checking each server: https://github.com/olivere/elastic/blob/v6.2.35/client.go#L1153.
However, the piece I'm missing, which you correctly picked up on, is maintaining this pool of connections and finding the first healthy connection on each request, for example: https://github.com/olivere/elastic/blob/v6.2.35/client.go#L1301
The trade-off here is that it's a fair bit more complicated.
What direction do you suggest we take? Have a simple single host:port configuration? Or support a pool of connections that we select on each request?
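For reference, a rough sketch of the "first reachable host wins" behaviour being discussed; the dial-based health check and error aggregation here are assumptions to illustrate the trade-off, not the exact constructor code in this PR:

package main

import (
	"fmt"
	"net"
	"time"
)

// pickReachableHostPort returns the first host:port that accepts a TCP
// connection, mirroring the loop discussed above. If none respond, the
// collected errors are returned together.
func pickReachableHostPort(hostPorts []string, timeout time.Duration) (string, error) {
	var errs []error
	for _, hostPort := range hostPorts {
		conn, err := net.DialTimeout("tcp", hostPort, timeout)
		if err != nil {
			errs = append(errs, fmt.Errorf("host %q unreachable: %w", hostPort, err))
			continue
		}
		conn.Close()
		return hostPort, nil // first healthy host wins; it may still fail later
	}
	return "", fmt.Errorf("no reachable prometheus host:port: %v", errs)
}

func main() {
	hostPort, err := pickReachableHostPort([]string{"localhost:9090"}, 2*time.Second)
	fmt.Println(hostPort, err)
}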
Does Prom only expose an HTTP API, or is there a gRPC version? gRPC would've taken care of the connection pooling.
To the best of my knowledge, Prometheus only exposes an HTTP API.
I found some notes in Prometheus docs referring to using gRPC for remote storage reads/writes in the future, but that's not related to the query API.
Another alternative is to use the Prometheus Go client, but it means including another dependency.
Worth noting: this client takes a single URL: https://github.com/prometheus/client_golang/blob/master/api/client.go#L78
Given we're only using a single Prometheus endpoint, /api/v1/query_range, I feel the Prometheus client offers more than we need and may be a bit of overkill.
At the same time, we can adopt this client's approach of just a single connection URL, which I assume would be more idiomatic given it's the official Prometheus client.
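For comparison, a sketch of what using the official client could look like; the address, query, and range values are placeholders, and the exact QueryRange signature (e.g. whether warnings are returned) depends on the client_golang version:

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promapi "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Single connection URL, as in the official client's api.Config.
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		panic(err)
	}
	v1api := promapi.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// QueryRange hits /api/v1/query_range under the hood.
	result, warnings, err := v1api.QueryRange(ctx, `up`, promapi.Range{
		Start: time.Now().Add(-time.Hour),
		End:   time.Now(),
		Step:  time.Minute,
	})
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result)
}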
If we want to proceed with the Prom client library, I think it would be best to modify our metrics data model to match Prometheus' to avoid mapping between OpenTelemetry's and Prometheus' data models. I guess this might mean using OpenMetrics, unless there's a good reason to stick with OpenTelemetry's.
Otherwise, should we maintain the simple http client as we have now, configured with a single Prom address?
I don't have a strong opinion. Which of the two formats is more mature at this point? I thought OTLP was still changing its proto.
We just discussed this in the Jaeger call. General agreement was that OpenMetrics would be a better choice:
- OTLP is still changing, and we already took a copy of the data model which might be different now
- OTLP is supposed to be fully compatible with OpenMetrics by the end of the year
- OpenMetrics is the de facto Prometheus format, already used by many different backends
- OpenMetrics is stable
- OTEL will support OTLP<->OpenMetrics conversion in the future so we would still be able to implement support for OTLP-native backends
Okay, thanks for raising this in the Jaeger call. Those reasons make sense so I'll make the switch over to OpenMetrics.
It also sounds like we're okay with using the Prom client for HTTP queries (even though it's in alpha), but please correct me if my assumption is wrong.
It also sounds like we're okay with using the Prom client for HTTP queries
sgtm
Force-pushed from c3d1c33 to 7e8fb2e
PR updated to use prom client and OpenMetrics. Did an initial e2e test with a POC version and the prom client looks okay so far. Still a bit of work involved to convert from the prom client's data model to OpenMetrics, and couldn't find a library that does this conversion, which was a little surprising. It wasn't too difficult in any case.
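As a rough illustration of the conversion work described above, a sketch mapping a Prometheus range-query result (model.Matrix) into an OpenMetrics-style family; the Label/MetricPoint/Metric/MetricFamily structs here are simplified placeholders, not the generated OpenMetrics types used in the PR:

package metricsconv // hypothetical package name

import (
	"time"

	"github.com/prometheus/common/model"
)

// Placeholder shapes standing in for the generated OpenMetrics types;
// field names are illustrative, not the real schema.
type Label struct{ Name, Value string }

type MetricPoint struct {
	Timestamp time.Time
	Value     float64
}

type Metric struct {
	Labels       []Label
	MetricPoints []MetricPoint
}

type MetricFamily struct {
	Name    string
	Metrics []Metric
}

// toMetricFamily converts a Prometheus range-query result (model.Matrix)
// into the OpenMetrics-style family defined above.
func toMetricFamily(name string, matrix model.Matrix) MetricFamily {
	mf := MetricFamily{Name: name}
	for _, stream := range matrix {
		m := Metric{}
		for k, v := range stream.Metric {
			m.Labels = append(m.Labels, Label{Name: string(k), Value: string(v)})
		}
		for _, pair := range stream.Values {
			m.MetricPoints = append(m.MetricPoints, MetricPoint{
				Timestamp: pair.Timestamp.Time(),
				Value:     float64(pair.Value),
			})
		}
		mf.Metrics = append(mf.Metrics, m)
	}
	return mf
}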
// The OpenMetrics protobuf schema which defines the protobuf wire format.
// Ensure to interpret "required" as semantically required for a valid message.
// All string fields MUST be UTF-8 encoded strings.
// Based on: https://github.com/OpenObservability/OpenMetrics/blob/v1.0.0/proto/openmetrics_data_model.proto
Is it possible to auto-generate this file from openmetrics_data_model.proto? If not, please summarize the manual changes required.
Looking at the file itself, it seems the main IDL definitions are unchanged from OpenMetrics; we're only adding some preamble?
The package name and gogoproto options below were added; they're required to marshal the metrics data over the wire in the response. A note was added in the README for posterity.
It should be possible to auto-generate this file; off the top of my head, perhaps by copying the file from a submodule and doing some string replacements. Is there a better approach, or is that getting a bit too unwieldy, in which case manual instructions would be better?
Describing the additions in the README is a bit removed. If there is a concrete, single block of text that's been added compared to the official file, I would mark it with comments in the file directly. When looking at the source file, it's convenient to have it be self-descriptive, e.g.
// This file is a copy of {path to source}, with the following additions
// * ...
// * ...
// -- BEGIN ADDITIONS
...
// -- END ADDITIONS
@@ -18,38 +18,53 @@ import (
	"context"
	"time"

	"github.com/prometheus/client_golang/api"
	"github.com/prometheus/client_golang/api/prometheus/v1"
"github.com/prometheus/client_golang/api/prometheus/v1" | |
prom_api "github.com/prometheus/client_golang/api/prometheus/v1" |
or something like that, v1 is a terrible package name
// TODO: Implement me
return nil, nil
return metrics.MetricFamily{}, nil
Any reason to return MetricFamily by value instead of a pointer?
No strong reason, it's just a continuation from the API definition which I defined as not nullable.
Perhaps it's better to pass by pointer for performance reasons?
And the reason for defining MetricFamily as not nullable is to act as a forcing function to return an empty result both when there is no data and when there is an error, which I think is more meaningful (as it also contains metadata) than nil.
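To make the trade-off concrete, a small self-contained sketch of the by-value approach; the MetricFamily struct and getLatencies function are hypothetical stand-ins, not the actual types in this PR:

package main

import (
	"errors"
	"fmt"
)

// Placeholder standing in for the generated OpenMetrics MetricFamily.
type MetricFamily struct {
	Name    string
	Help    string
	Metrics []float64
}

// Returning by value: on error the caller still receives a usable (empty)
// MetricFamily rather than a nil pointer, so metadata can be populated.
func getLatencies(fail bool) (MetricFamily, error) {
	if fail {
		return MetricFamily{Name: "service_latencies"}, errors.New("query failed")
	}
	return MetricFamily{Name: "service_latencies", Metrics: []float64{0.25}}, nil
}

func main() {
	mf, err := getLatencies(true)
	// Even on error, mf is safe to read; no nil check needed.
	fmt.Println(mf.Name, err)
}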
Signed-off-by: albertteoh <albert.teoh@logz.io>
Force-pushed from 3b21650 to 949c7d8
Signed-off-by: albertteoh <albert.teoh@logz.io>
func TestNewMetricsReaderInvalidAddress(t *testing.T) {
	logger := zap.NewNop()
	reader, err := NewMetricsReader(logger, "\n", defaultTimeout)
	const wantErrMsg = `parse "http://\n": net/url: invalid control character in URL`
It's better not to depend on exact error messages from the stdlib; instead, wrap the error with our own message and assert only on our own string, e.g.:
require.Error(t, err)
assert.Contains(t, err.Error(), expectedMsg)
yup, I'll address this in the next PR, thanks!
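A sketch of what that follow-up could look like, assuming the constructor wraps the URL-parse error with its own message; the package name, wrapping text, and exact constructor snippet in the comment are illustrative:

package metricsstore // assumed package name

import (
	"testing"

	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
	"go.uber.org/zap"
)

// In the constructor, something along these lines:
//
//	if _, err := url.Parse("http://" + hostPort); err != nil {
//		return nil, fmt.Errorf("failed to initialize prometheus client: %w", err)
//	}

func TestNewMetricsReaderInvalidAddress(t *testing.T) {
	reader, err := NewMetricsReader(zap.NewNop(), "\n", defaultTimeout)
	require.Error(t, err)
	// Assert only on our own wrapping message, not the stdlib error text.
	assert.Contains(t, err.Error(), "failed to initialize prometheus client")
	assert.Nil(t, reader)
}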
Signed-off-by: albertteoh <albert.teoh@logz.io>
Which problem is this PR solving?
Short description of the changes