Add query service with OTLP #3086

Merged: 16 commits, Jul 13, 2021

Member

@pavolloffay pavolloffay commented Jun 10, 2021

Signed-off-by: Pavol Loffay p.loffay@gmail.com

Supersedes: #3051
Depends on: jaegertracing/jaeger-idl#76

This PR adds support for a new query API exposed over gRPC and REST (via grpc-gateway). The API definition is in jaegertracing/jaeger-idl#76. It mimics the current gRPC query API but uses the OTLP model.

IDs (trace, span, parent) are encoded as hex strings in JSON (see the spec: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/protocol/otlp.md#otlphttp), and hex IDs are also accepted as query parameters, e.g. when getting a trace by ID. IDs are defined as byte arrays in proto, and by default proto serializes them as base64. This implementation uses gogo's custom type feature (which overrides marshalling) to work around this. I am fine with continuing to use gogo and migrating to a custom marshaller/vtprotobuf once OTEL does.
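
To illustrate the hex-versus-base64 difference described above, here is a minimal sketch that uses only the Go standard library. It is not the gogo custom type used in this PR: `TraceID`, `hexJSON`, and `base64JSON` are hypothetical names introduced for illustration. It shows a 16-byte ID type that overrides JSON marshalling to emit hex, next to the default `encoding/json` behaviour for a plain `[]byte`, which is base64.

```go
package main

import (
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// TraceID is a hypothetical 16-byte ID type used only for this sketch;
// the PR itself relies on a gogo custom type instead.
type TraceID [16]byte

// MarshalJSON writes the ID as a lowercase hex string.
func (t TraceID) MarshalJSON() ([]byte, error) {
	return json.Marshal(hex.EncodeToString(t[:]))
}

// UnmarshalJSON parses a hex string of exactly 16 bytes back into the ID.
func (t *TraceID) UnmarshalJSON(data []byte) error {
	var s string
	if err := json.Unmarshal(data, &s); err != nil {
		return err
	}
	b, err := hex.DecodeString(s)
	if err != nil {
		return err
	}
	if len(b) != len(t) {
		return fmt.Errorf("trace ID must be %d bytes, got %d", len(t), len(b))
	}
	copy(t[:], b)
	return nil
}

// hexJSON marshals the ID through the custom MarshalJSON above.
func hexJSON(id TraceID) string {
	out, _ := json.Marshal(id)
	return string(out)
}

// base64JSON marshals the same bytes as a plain []byte,
// which encoding/json encodes as base64.
func base64JSON(id TraceID) string {
	out, _ := json.Marshal(id[:])
	return string(out)
}

func main() {
	var id TraceID
	if err := json.Unmarshal([]byte(`"000000000000000061677de41fa1e1e5"`), &id); err != nil {
		panic(err)
	}
	fmt.Println("hex:   ", hexJSON(id))
	fmt.Println("base64:", base64JSON(id))
}
```

The same idea applies to span and parent IDs, which would be 8 bytes instead of 16.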

Example REST requests:

curl -ivX GET -H "Content-Type: application/json" http://localhost:16686/v3/traces/000000000000000061677de41fa1e1e5
curl -ivX GET -H "Content-Type: application/json" http://localhost:16686/v3/traces\?query.serviceName\=frontend
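
For callers that prefer Go over curl, the two request URLs above could be built roughly as follows. This is only a sketch: `traceByIDURL` and `findTracesURL` are hypothetical helper names, and the paths and the `query.serviceName` parameter are taken directly from the curl examples rather than from the API definition.

```go
package main

import (
	"fmt"
	"net/url"
)

// traceByIDURL builds the "get trace by ID" URL from the first curl example.
func traceByIDURL(base, traceID string) string {
	return base + "/v3/traces/" + url.PathEscape(traceID)
}

// findTracesURL builds the search URL from the second curl example.
func findTracesURL(base, service string) string {
	q := url.Values{}
	q.Set("query.serviceName", service)
	return base + "/v3/traces?" + q.Encode()
}

func main() {
	base := "http://localhost:16686"
	fmt.Println(traceByIDURL(base, "000000000000000061677de41fa1e1e5"))
	fmt.Println(findTracesURL(base, "frontend"))
}
```

The resulting strings can be passed to `http.Get` or `http.NewRequest` from `net/http`.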

Notable changes

  • translator from Jaeger model to OTLP
  • new GRPC handler
  • gRPC gateway registration
  • copy of JSONPb (the reason for the copy is described in the code)

The GraphQL API will not be used because:

  • serialization of protobuf is problematic
    • protobuf uses JSONPb and not standard JSON annotations
    • a custom marshaller could be used, but it breaks selecting object fields: the API would always return full-size objects
    • we could define a separate model that mimics JSONPb serialization, but this adds unnecessary maintenance overhead and we might run into other serialization incompatibilities as well.
  • all GraphQL libs seem to be supported by a single maintainer/company

TODOs

@pavolloffay
Member Author

@yurishkuro, @joe-elliott could you please have a quick look, especially at the proto definition?

@pavolloffay pavolloffay force-pushed the otel-query-api branch 2 times, most recently from b558778 to d59f0b3 Compare June 17, 2021 11:50
@pavolloffay pavolloffay marked this pull request as ready for review June 30, 2021 09:57
@pavolloffay pavolloffay requested a review from a team as a code owner June 30, 2021 09:57
@pavolloffay pavolloffay force-pushed the otel-query-api branch 3 times, most recently from be579e5 to 5e2b9cb Compare June 30, 2021 13:10
codecov bot commented Jun 30, 2021

Codecov Report

Merging #3086 (0b215b3) into master (e7d7eb7) will decrease coverage by 0.03%.
The diff coverage is 95.51%.

@@            Coverage Diff             @@
##           master    #3086      +/-   ##
==========================================
- Coverage   95.92%   95.89%   -0.04%     
==========================================
  Files         236      239       +3     
  Lines       10238    10525     +287     
==========================================
+ Hits         9821    10093     +272     
- Misses        348      355       +7     
- Partials       69       77       +8     
Impacted Files Coverage Δ
cmd/query/app/grpc_handler.go 98.62% <ø> (ø)
cmd/query/app/server.go 94.07% <72.72%> (-1.76%) ⬇️
cmd/query/app/apiv3/grpc_gateway.go 86.66% <86.66%> (ø)
cmd/query/app/apiv3/grpc_handler.go 88.23% <88.23%> (ø)
cmd/query/app/apiv3/otlp_translator.go 100.00% <100.00%> (ø)
cmd/query/app/static_handler.go 96.77% <0.00%> (-1.62%) ⬇️
plugin/storage/badger/spanstore/reader.go 95.37% <0.00%> (-0.72%) ⬇️
... and 1 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Makefile Outdated
$(PROTO_INTERMEDIATE_DIR)/trace/v1/trace.proto

# Revert changes in OTEL proto and modify only package
# The goal here is to import opentelemetry.proto.trace.v1.ResourceSpans in the query service
Contributor

Why do you need to change the package again? Can't you have the service in one package and the OTLP proto in another?

Member Author

I have added more comments to the makefile

cmd/query/app/otel/grpc_gateway_test.go Outdated
cmd/query/app/otel/grpc_handler.go Outdated
cmd/query/app/otel/grpc_handler.go Outdated
cmd/query/app/otel/otlp_translator.go Outdated
cmd/query/app/otel/otlp_translator.go Outdated
cmd/query/app/otel/otlp_translator.go Outdated
cmd/query/app/otel/otlp_translator_test.go Outdated
// See the License for the specific language governing permissions and
// limitations under the License.

package otel
Contributor

I think otlp would be more accurate.

Member Author

I would use otlp for the package name where the otlp is compiled. Not for Jaeger's query service. I will use apiv3.

@pavolloffay
Member Author

PR updated. It's ready for another round.

@pavolloffay pavolloffay force-pushed the otel-query-api branch 2 times, most recently from ae53586 to 89fc0b9 Compare July 1, 2021 13:21
# Target proto-prepare-otel modifies OTEL proto to use import path jaeger.proto.*
# The modification is needed because OTEL collector already uses opentelemetry.proto.*
# and two compiled protobuf types cannot have the same import path. The root cause is that the compiled OTLP
# in the collector is in a private package, hence it cannot be used in Jaeger.
Member

Isn't there a ticket in OTEL to export that package? Or do we not want to use it?

Member Author

I didn't see any open issue. It has been private from the get-go.

Member

@joe-elliott joe-elliott Jul 1, 2021

We have a similar process in Tempo which is here:
https://github.com/grafana/tempo/blob/main/Makefile#L129

We have also since forked the collector and written a script to just rename internal => external so we can have access to more of the otel guts. We will probably move to using this soon instead of our current setup.
https://github.com/grafana/opentelemetry-collector/tree/0.29-grafana

> I didn't see any open issue. It has been private from the get-go.

It used to be exposed b/c Tempo vendored it initially. They moved it internal and we asked about keeping it exposed. They opted to keep it internal to protect themselves from people forming a dependency on it.

Member Author

I didn't see it being open, but good to know. Perhaps they have reasons to keep it internal. They even spent a lot of time on the (ugly) pdata package to make that happen.

@pavolloffay
Member Author

PR updated and coverage is back on green. Please review again.

@pavolloffay
Member Author

I gave it a bit more love and added cancelation for the grpc gateway

resourceSpans, ok := spansByLibrary[res]
if !ok {
resourceSpans = map[instrumentationLibrary]*v1.InstrumentationLibrarySpans{}
resourceSpans[library] = &v1.InstrumentationLibrarySpans{
Contributor

this function looks like it's going to do the majority of the work when fetching traces; would it be worth initializing slice capacities where we know the expected sizes, to avoid memory reallocation? For example, here we could add:

Spans: make([]*v1.Span, len(spans))

same with rss and rs.InstrumentationLibrarySpans below, as well as with some of the other j...ToOTLP funcs which may have the opportunity for this simple optimization.

Member Author

make([]*v1.Span, len(spans))

Does not seem correct. The function iterates over Jaeger spans and creates the resource spans slice and the instrumentation library spans slice. The sizes of these two slices are not known beforehand.

Contributor

ah okay, although I thought the sizes for rss and rs.InstrumentationLibrarySpans could be determined up front as len(spansByLibrary) & len(libMap) respectively.
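
The preallocation idea discussed in this thread can be sketched as follows. The types here (`resourceSpans`, `librarySpans`) and the function `toResourceSpans` are simplified, hypothetical stand-ins, not the PR's actual translator code. The sketch also shows why the three-argument form of `make` matters when the slice is later filled with `append`: `make([]T, n)` creates `n` zero-valued elements, and `append` then adds after them.

```go
package main

import "fmt"

// resourceSpans and librarySpans are simplified, hypothetical stand-ins
// for the OTLP types discussed in the thread.
type librarySpans struct {
	library string
	spans   []string
}

type resourceSpans struct {
	resource  string
	libraries []librarySpans
}

// toResourceSpans converts the grouped map into slices, reserving capacity
// up front because both sizes are known once the grouping is complete.
func toResourceSpans(spansByLibrary map[string]map[string][]string) []resourceSpans {
	rss := make([]resourceSpans, 0, len(spansByLibrary)) // length 0, capacity known
	for res, libMap := range spansByLibrary {
		rs := resourceSpans{
			resource:  res,
			libraries: make([]librarySpans, 0, len(libMap)),
		}
		for lib, spans := range libMap {
			rs.libraries = append(rs.libraries, librarySpans{library: lib, spans: spans})
		}
		rss = append(rss, rs)
	}
	return rss
}

// lengthPitfall shows what happens when make is given a length instead of
// a capacity: the appended element lands after n zero values.
func lengthPitfall(n int) int {
	s := make([]*librarySpans, n) // n nil pointers already present
	s = append(s, &librarySpans{})
	return len(s) // n + 1
}

func main() {
	grouped := map[string]map[string][]string{
		"frontend": {"libA": {"span1", "span2"}, "libB": {"span3"}},
		"backend":  {"libA": {"span4"}},
	}
	rss := toResourceSpans(grouped)
	fmt.Println(len(rss), cap(rss))
	fmt.Println(lengthPitfall(2))
}
```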

return rss
}

type instrumentationLibrary struct {
Contributor

I learnt from an earlier review that types should be at the top of file.

Member Author

I think it mostly applies to exported types. This is a helper type that is bound to the unexported function below.

sort.Strings(keys)
sBuilder := strings.Builder{}
sBuilder.Grow(stringTagsLen)
for k, v := range tagMap {
Contributor

considering this scenario where span0 has a tag:

key = "a"
val = "bc"

and span1 has a tag:
key = "ab"
val = "c"

could these two result in the same hash? if so, what is the consequence?

similarly, for cases where the value is an empty string:

span0: a="", b="c" -> abc
span1: a="b", c="" -> abc

please ignore if these scenarios won't realistically happen :)

Member Author

corner case :). Then the uniqueness in the map will depend only on the service name which is fine.

Contributor

why is resource used as the key in spansByLibrary if uniqueness depends only on the service name?

Member Author

Uniqueness depends not only on the service name but also on the string tags.
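
The collision scenarios raised in this thread can be reproduced with a small sketch. It is not the PR's code: `naiveKey`, `prefixedKey`, and `sortedKeys` are hypothetical helpers. `naiveKey` concatenates sorted keys and values directly, which is an assumption about the approach under review. `prefixedKey` shows one possible alternative, length-prefixing each key and value, which keeps the boundaries unambiguous.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// sortedKeys returns the map keys in a deterministic order.
func sortedKeys(tags map[string]string) []string {
	keys := make([]string, 0, len(tags))
	for k := range tags {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

// naiveKey concatenates keys and values with no separator, so the
// boundary between a key and its value is lost.
func naiveKey(tags map[string]string) string {
	var b strings.Builder
	for _, k := range sortedKeys(tags) {
		b.WriteString(k)
		b.WriteString(tags[k])
	}
	return b.String()
}

// prefixedKey writes "<len>:<text>" for every key and value, so two
// different tag sets cannot produce the same string.
func prefixedKey(tags map[string]string) string {
	var b strings.Builder
	for _, k := range sortedKeys(tags) {
		v := tags[k]
		b.WriteString(strconv.Itoa(len(k)))
		b.WriteByte(':')
		b.WriteString(k)
		b.WriteString(strconv.Itoa(len(v)))
		b.WriteByte(':')
		b.WriteString(v)
	}
	return b.String()
}

func main() {
	span0 := map[string]string{"a": "bc"}
	span1 := map[string]string{"ab": "c"}
	fmt.Println(naiveKey(span0) == naiveKey(span1))       // true: both are "abc"
	fmt.Println(prefixedKey(span0) == prefixedKey(span1)) // false: "1:a2:bc" vs "2:ab1:c"
}
```

Whether the extra bytes are worth it depends on how likely such tag sets are in practice, which is the trade-off the thread settles as a corner case.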

cmd/query/app/apiv3/otlp_translator.go
cmd/query/app/apiv3/otlp_translator.go Outdated
cmd/query/app/apiv3/otlp_translator.go
@pavolloffay
Member Author

can somebody re-review?

cmd/query/app/apiv3/otlp_translator.go
cmd/query/app/apiv3/otlp_translator.go Outdated
}
if tag, ok := kvs.FindByKey(tracetranslator.TagStatusCode); ok {
statusExists = true
if code, err := getStatusCodeValFromTag(tag); err == nil {
Contributor

although I agree that we should not return an error to the API caller or even return early in this function if there is an error in translating status codes, would it be worth logging (even on debug) the error so the problem can be addressed?

I imagine failure to map a jaeger span status code to OTLP status code should be more likely the exception than the norm, right?

Member Author

@pavolloffay pavolloffay Jul 13, 2021

It should be an exceptional state. This is based on the OTELcol translator, which does not log the error either. Adding a logger is not a problem, but it would pollute the translator API a bit.

"github.com/jaegertracing/jaeger/model"
commonv1 "github.com/jaegertracing/jaeger/proto-gen/otel/common/v1"
resourcev1 "github.com/jaegertracing/jaeger/proto-gen/otel/resource/v1"
v1 "github.com/jaegertracing/jaeger/proto-gen/otel/trace/v1"
Contributor

related to #2988 (comment), I think something like otlptrace.ResourceSpans is more readable than v1.ResourceSpans. Perhaps the same for commonv1 and resourcev1.

likewise, I think it would be easier to read jaeger.Span than model.Span.

Member Author

I have renamed v1 to tracev1.

cmd/query/app/apiv3/otlp_translator.go
cmd/query/app/apiv3/otlp_translator.go Outdated
albertteoh previously approved these changes Jul 13, 2021
Contributor

@albertteoh albertteoh left a comment

besides needing to rebase, lgtm

Signed-off-by: Pavol Loffay <p.loffay@gmail.com>
@pavolloffay pavolloffay merged commit 29b6016 into jaegertracing:master Jul 13, 2021
@vprithvi vprithvi added this to the Release 1.25.0 milestone Aug 5, 2021
6 participants