-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add native OTEL ES exporter #2295
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2295 +/- ##
==========================================
- Coverage 96.21% 95.64% -0.57%
==========================================
Files 203 205 +2
Lines 10291 10529 +238
==========================================
+ Hits 9901 10070 +169
- Misses 334 391 +57
- Partials 56 68 +12
Continue to review full report at Codecov.
|
cmd/opentelemetry/app/exporter/elasticsearchexporter/esclient/client.go
Outdated
Show resolved
Hide resolved
cmd/opentelemetry/app/exporter/elasticsearchexporter/esclient/es6client.go
Outdated
Show resolved
Hide resolved
cmd/opentelemetry/app/exporter/elasticsearchexporter/esclient/es6client.go
Outdated
Show resolved
Hide resolved
cmd/opentelemetry/app/exporter/elasticsearchexporter/estranslator/modeltranslator.go
Outdated
Show resolved
Hide resolved
cmd/opentelemetry/app/exporter/elasticsearchexporter/estranslator/modeltranslator_test.go
Outdated
Show resolved
Hide resolved
} | ||
|
||
func (s storageWrapper) WriteSpan(span *model.Span) error { | ||
converter := dbmodel.FromDomain{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to be created for each span?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a workaround, the purpose of the storage test is the correctness. Although It would be better to convert from jaeger to OTEL and then use the remain interface method.
cmd/opentelemetry/app/exporter/elasticsearchexporter/spanstore.go
Outdated
Show resolved
Hide resolved
// WriteTraces writes traces to the storage | ||
func (w *EsSpanWriter) WriteTraces(_ context.Context, traces pdata.Traces) (int, error) { | ||
atomic.AddInt64(&counter, int64(traces.SpanCount())) | ||
fmt.Printf("counter: %d\n", counter) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To remove?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will remove the prints once it's ready. Just ignore them for now
return len(spans), err | ||
} | ||
indexName := w.spanIndexName.get(model.EpochMicrosecondsAsTime(span.StartTime)) | ||
operationsToSpanIndices = append(operationsToSpanIndices, i) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this repeated twice? on line 150
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The array holds a mapping from bulk operation to a span index number.
It will be used to construct the retry error of the bulk operation fails. It's there twice because service operation might also fail and it has a different bulk ID.
@@ -366,3 +376,23 @@ func loadToken(path string) (string, error) { | |||
} | |||
return strings.TrimRight(string(b), "\r\n"), nil | |||
} | |||
|
|||
// LoadTagsFromFile loads tags from a file | |||
func LoadTagsFromFile(filePath string) ([]string, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the moment I don't think it is possible to specify a list of tags? Only via the file? If so, possibly this is something we may want to add to the OTel config for the exporter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could consider doing that, perhaps in a separate issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cmd/opentelemetry/app/exporter/elasticsearchexporter/esclient/es7client.go
Outdated
Show resolved
Hide resolved
cmd/opentelemetry/app/exporter/elasticsearchexporter/esclient/client.go
Outdated
Show resolved
Hide resolved
cmd/opentelemetry/app/exporter/elasticsearchexporter/esclient/es7client.go
Outdated
Show resolved
Hide resolved
cmd/opentelemetry/app/exporter/elasticsearchexporter/esclient/es7client.go
Show resolved
Hide resolved
} | ||
defer response.Body.Close() | ||
if response.StatusCode >= 400 { | ||
return nil, fmt.Errorf("bulk request failed with code %d", response.StatusCode) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there cases where a retry could be used? Like, a 502
or 503
could probably be retried in a few milliseconds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for the retries we want to use a queued retry processor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about 3xx-class responses?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ES client should retry with the provided URL.
cmd/opentelemetry/app/exporter/elasticsearchexporter/estranslator/modeltranslator.go
Outdated
Show resolved
Hide resolved
cmd/opentelemetry/app/exporter/elasticsearchexporter/factory_test.go
Outdated
Show resolved
Hide resolved
serviceIndexName: newIndexNameProvider("jaeger-service", params.IndexPrefix, params.UseReadWriteAliases), | ||
translator: estranslator.NewTranslator(params.Tags.AllAsFields, tagsKeysAsFields, params.GetTagDotReplacement()), | ||
serviceCache: cache.NewLRUWithOptions( | ||
100000, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this deserves a comment :) Where did this default come from? What are the implications? If there are more than 100_000 services, what happens (other than the chief architect being forcefully admitted to the psych ward)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where did this default come from?
The default is copy-paste from the old ES impl.
If there are more than 100_000 services, what happens (
Based on the name of the cache, the oldest used item will be evicted. I don't expect anybody running more than 100k services ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Murphy's law dictates that someone will run into this limit. For that pool soul's sake, could you add a log message? These kinds of limitations/bugs are the hardest to sort out...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The size of the cache does not have affect on the correctness of the implementation. 100k is used in the current implementation without any issues.
cmd/opentelemetry/app/exporter/elasticsearchexporter/spanstore.go
Outdated
Show resolved
Hide resolved
for i, d := range blk.Items { | ||
// ... so for any HTTP status above 201 ... | ||
// | ||
if d.Index.Status > 201 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this what the ES semantics dictate? Technically, everything in the 2xx range is "success". Notoriously, 204 is used in a few systems out there instead of 200, where a response isn't really required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what all the tutorials and code examples from the official repository do, so I guess it should be like this.
cmd/opentelemetry/app/exporter/elasticsearchexporter/esclient/client.go
Outdated
Show resolved
Hide resolved
cmd/opentelemetry/app/exporter/elasticsearchexporter/esclient/elasticsearch_ping.go
Outdated
Show resolved
Hide resolved
cmd/opentelemetry/app/exporter/elasticsearchexporter/esclient/elasticsearch_ping.go
Outdated
Show resolved
Hide resolved
cmd/opentelemetry/app/exporter/elasticsearchexporter/esclient/elasticsearch_ping.go
Outdated
Show resolved
Hide resolved
if err != nil { | ||
return 0, err | ||
} | ||
esVersion, err := strconv.Atoi(string(pingResponse.Version.Number[0])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this looking at the first char? Meaning it will break on v10.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, there are other parts of the code where the new version will fail
cmd/opentelemetry/app/exporter/elasticsearchexporter/esclient/es7client.go
Show resolved
Hide resolved
} | ||
var transport http.RoundTripper | ||
httpTransport := &http.Transport{ | ||
Proxy: http.ProxyFromEnvironment, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer creating objects only once. Does this Proxy setting not apply to L313?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is legacy code from this file, I will not refactor it in this PR. I have just moved it to a separate func so I can reuse it in other parts. The implementation indeed looks like that it could be refactored.
// #nosec G402 | ||
TLSClientConfig: &tls.Config{InsecureSkipVerify: c.TLS.SkipHostVerify}, | ||
} | ||
if c.TLS.CAPath != "" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this feels a bit convoluted & duplicated - why wouldn't this path be covered by L308?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see the previous comment #2295 (comment)
I cannot rebase and merge until #2319 is resolved as this PR needs changes in the core. |
The code is in the final state. The missing item from TODO:
it's not critical. The query configuration options are not exposed in the OTEL config. I will consider doing refactoring in a follow up PR. |
cmd/opentelemetry/app/exporter/elasticsearchexporter/spanstore.go
Outdated
Show resolved
Hide resolved
Signed-off-by: Pavol Loffay <ploffay@redhat.com>
Signed-off-by: Pavol Loffay <ploffay@redhat.com>
* Reimplement OTEL elasticsearch exporter Signed-off-by: Pavol Loffay <ploffay@redhat.com> * Fix typo in comment Signed-off-by: Pavol Loffay <ploffay@redhat.com>
This PR reimplements OpenTelemetry Elasticsarch exporter. Only span exporter/writer is implemented the query still uses the current ES storage implementation.
Notable changes:
exporter/jaegerelasticsearch
Other changes to the codebase:
TODOs:
config.Configuration
Considerations:
Performance
From the perf tests I have done the new writer performs well (with batch processor). It seems to be slightly faster than the current all-in-one with similar settings.
Jaeger OTEL all-in-one with ES exporter and batch processor
Jaeger all-in-one
--es.bulk.size=10000000000000
,--collector.queue-size=300000