Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add native OTEL ES exporter #2295

Merged
merged 2 commits into from
Jul 1, 2020

Conversation

pavolloffay
Copy link
Member

@pavolloffay pavolloffay commented Jun 18, 2020

This PR reimplements OpenTelemetry Elasticsarch exporter. Only span exporter/writer is implemented the query still uses the current ES storage implementation.

Notable changes:

  • new span writer implementation versioned in exporter/jaegerelasticsearch
  • used ES client is https://github.com/elastic/go-elasticsearch
  • OTEL data model is directly transformed to Jaeger ES model
  • writer uses ES Bulk API
  • batching and retries capabilities will be used as OTEL processors (batch, queued_retry)
  • ES client is abstracted and specific implementation is created based on the supplied ES version or derived from ES ping API at startup. This allows us to easily add support for new ES versions
  • added integration tests

Other changes to the codebase:

  • storage integration test is more generic - it can be executed from any package

TODOs:

  • ES clients tests
  • enable integration storage test on travis
  • ES exporter should probably accept its own params struct and not config.Configuration

Considerations:

Performance

From the perf tests I have done the new writer performs well (with batch processor). It seems to be slightly faster than the current all-in-one with similar settings.

docker run -it --rm -e "ES_JAVA_OPTS=-Xms2g -Xmx2g" -p 9200:9200 -p 9300:9300 -e "http.host=0.0.0.0" -e "discovery.type=single-node" --name=elasticsearch docker.elastic.co/elasticsearch/elasticsearch-oss:6.8.4

Jaeger OTEL all-in-one with ES exporter and batch processor

processors:
  batch:
    send_batch_size: 1000
    timeout: 200ms

2020-06-17 10:05:26 INFO  Main - Expected number of spans 100000, actual 100000 stored in 39s
2020-06-17 10:24:00 INFO  Main - Expected number of spans 100000, actual 100000 stored in 39s
2020-06-17 10:27:46 INFO  Main - Expected number of spans 300000, actual 300000 stored in 105s
2020-06-17 10:31:17 INFO  Main - Expected number of spans 300000, actual 300000 stored in 96s
2020-06-17 10:37:41 INFO  Main - Expected number of spans 300000, actual 300000 stored in 111s
2020-06-17 10:55:03 INFO  Main - Expected number of spans 300000, actual 300000 stored in 115s
2020-06-17 11:04:25 INFO  Main - Expected number of spans 300000, actual 300000 stored in 164s
2020-06-17 11:10:55 INFO  Main - Expected number of spans 300000, actual 300000 stored in 121s

Jaeger all-in-one --es.bulk.size=10000000000000, --collector.queue-size=300000

2020-06-17 11:20:35 INFO  Main - Expected number of spans 300000, actual 300000 stored in 181s
2020-06-17 11:24:38 INFO  Main - Expected number of spans 300000, actual 300000 stored in 166s
2020-06-17 11:29:54 INFO  Main - Expected number of spans 300000, actual 300000 stored in 172s
2020-06-17 11:49:06 INFO  Main - Expected number of spans 300000, actual 300000 stored in 165s

@codecov
Copy link

codecov bot commented Jun 18, 2020

Codecov Report

Merging #2295 into master will decrease coverage by 0.56%.
The diff coverage is 93.33%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2295      +/-   ##
==========================================
- Coverage   96.21%   95.64%   -0.57%     
==========================================
  Files         203      205       +2     
  Lines       10291    10529     +238     
==========================================
+ Hits         9901    10070     +169     
- Misses        334      391      +57     
- Partials       56       68      +12     
Impacted Files Coverage Δ
plugin/storage/integration/trace_compare.go 36.58% <ø> (ø)
plugin/storage/integration/integration.go 80.86% <92.85%> (ø)
plugin/storage/es/factory.go 100.00% <100.00%> (ø)
cmd/query/app/server.go 90.90% <0.00%> (-2.28%) ⬇️
cmd/flags/admin.go 77.77% <0.00%> (-1.59%) ⬇️
plugin/storage/badger/spanstore/reader.go 96.79% <0.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7c9e1d3...369d1af. Read the comment docs.

}

func (s storageWrapper) WriteSpan(span *model.Span) error {
converter := dbmodel.FromDomain{}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be created for each span?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a workaround, the purpose of the storage test is the correctness. Although It would be better to convert from jaeger to OTEL and then use the remain interface method.

// WriteTraces writes traces to the storage
func (w *EsSpanWriter) WriteTraces(_ context.Context, traces pdata.Traces) (int, error) {
atomic.AddInt64(&counter, int64(traces.SpanCount()))
fmt.Printf("counter: %d\n", counter)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To remove?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will remove the prints once it's ready. Just ignore them for now

return len(spans), err
}
indexName := w.spanIndexName.get(model.EpochMicrosecondsAsTime(span.StartTime))
operationsToSpanIndices = append(operationsToSpanIndices, i)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this repeated twice? on line 150

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The array holds a mapping from bulk operation to a span index number.

It will be used to construct the retry error of the bulk operation fails. It's there twice because service operation might also fail and it has a different bulk ID.

@@ -366,3 +376,23 @@ func loadToken(path string) (string, error) {
}
return strings.TrimRight(string(b), "\r\n"), nil
}

// LoadTagsFromFile loads tags from a file
func LoadTagsFromFile(filePath string) ([]string, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the moment I don't think it is possible to specify a list of tags? Only via the file? If so, possibly this is something we may want to add to the OTel config for the exporter?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could consider doing that, perhaps in a separate issue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

}
defer response.Body.Close()
if response.StatusCode >= 400 {
return nil, fmt.Errorf("bulk request failed with code %d", response.StatusCode)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there cases where a retry could be used? Like, a 502 or 503 could probably be retried in a few milliseconds.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for the retries we want to use a queued retry processor.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about 3xx-class responses?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ES client should retry with the provided URL.

serviceIndexName: newIndexNameProvider("jaeger-service", params.IndexPrefix, params.UseReadWriteAliases),
translator: estranslator.NewTranslator(params.Tags.AllAsFields, tagsKeysAsFields, params.GetTagDotReplacement()),
serviceCache: cache.NewLRUWithOptions(
100000,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this deserves a comment :) Where did this default come from? What are the implications? If there are more than 100_000 services, what happens (other than the chief architect being forcefully admitted to the psych ward)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where did this default come from?

The default is copy-paste from the old ES impl.

If there are more than 100_000 services, what happens (

Based on the name of the cache, the oldest used item will be evicted. I don't expect anybody running more than 100k services ;)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Murphy's law dictates that someone will run into this limit. For that pool soul's sake, could you add a log message? These kinds of limitations/bugs are the hardest to sort out...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The size of the cache does not have affect on the correctness of the implementation. 100k is used in the current implementation without any issues.

for i, d := range blk.Items {
// ... so for any HTTP status above 201 ...
//
if d.Index.Status > 201 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this what the ES semantics dictate? Technically, everything in the 2xx range is "success". Notoriously, 204 is used in a few systems out there instead of 200, where a response isn't really required.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what all the tutorials and code examples from the official repository do, so I guess it should be like this.

pkg/es/config/config.go Outdated Show resolved Hide resolved
.travis.yml Outdated Show resolved Hide resolved
if err != nil {
return 0, err
}
esVersion, err := strconv.Atoi(string(pingResponse.Version.Number[0]))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this looking at the first char? Meaning it will break on v10.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, there are other parts of the code where the new version will fail

}
var transport http.RoundTripper
httpTransport := &http.Transport{
Proxy: http.ProxyFromEnvironment,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer creating objects only once. Does this Proxy setting not apply to L313?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is legacy code from this file, I will not refactor it in this PR. I have just moved it to a separate func so I can reuse it in other parts. The implementation indeed looks like that it could be refactored.

// #nosec G402
TLSClientConfig: &tls.Config{InsecureSkipVerify: c.TLS.SkipHostVerify},
}
if c.TLS.CAPath != "" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this feels a bit convoluted & duplicated - why wouldn't this path be covered by L308?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see the previous comment #2295 (comment)

plugin/storage/integration/grpc_test.go Outdated Show resolved Hide resolved
@pavolloffay
Copy link
Member Author

I cannot rebase and merge until #2319 is resolved as this PR needs changes in the core.

@pavolloffay
Copy link
Member Author

The code is in the final state. The missing item from TODO:

ES exporter should probably accept its own params struct and not config.Configuration

it's not critical. The query configuration options are not exposed in the OTEL config. I will consider doing refactoring in a follow up PR.

Signed-off-by: Pavol Loffay <ploffay@redhat.com>
Signed-off-by: Pavol Loffay <ploffay@redhat.com>
@pavolloffay pavolloffay merged commit 00b6e96 into jaegertracing:master Jul 1, 2020
morlay pushed a commit to morlay/jaeger that referenced this pull request Jul 2, 2020
* Reimplement OTEL elasticsearch exporter

Signed-off-by: Pavol Loffay <ploffay@redhat.com>

* Fix typo in comment

Signed-off-by: Pavol Loffay <ploffay@redhat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants