Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Auto agent fails to export histogram without boundaries #359

Closed
punya opened this issue Jul 25, 2024 · 10 comments
Closed

Auto agent fails to export histogram without boundaries #359

punya opened this issue Jul 25, 2024 · 10 comments
Assignees
Labels
priority: p3 question Further information is requested

Comments

@punya
Copy link
Contributor

punya commented Jul 25, 2024

Description

We are using auto instrumentation via the opentelemetry-javaagent (v1.33.4) and the auto exporter (exporter-auto-0.31.0-alpha-shaded). We instrument our app with the following command:

ENTRYPOINT ["java", \
  "-javaagent:opentelemetry-javaagent.jar", \
  "-Dotel.javaagent.extensions=exporter-auto-0.31.0-alpha-shaded.jar", \
  "-Dotel.traces.exporter=google_cloud_trace", \
  "-Dotel.metrics.exporter=google_cloud_monitoring", \
  "-Dotel.logs.exporter=none", \
  "-Dotel.service.name=my-service", \
  "-Dotel.javaagent.logging=application", \
  "-Dorg.jboss.logging.provider=slf4j", \
  "-Dlogback.configurationFile=/app/logback.xml",\
  "-cp", \
  "app.jar", \
  "com.example.app.Main" \
]

When deploying our app to GCP, it seems to collect metrics correctly (at least some statistics show up for http.server.duration in Google Cloud Monitoring), but every minute our app prints a warning that it failed to write some TimeSeries (see below for the exact stack trace).

How can we find out which metric could not be written (since at least the http.server.duration seems to work fine) and/or how can we avoid the warning?

Steps to Reproduce

Create a java app, instrument it with the agent + auto exporter and deploy it to GCP. Wait a couple of minutes until the app has started and check its logs.

Expected Result

There should be no warnings.

Actual Result

We get warnings in the logs that some TimeSeries cannot be exported:

Exporter threw an Exception
shadow.com.google.api.gax.rpc.InvalidArgumentException: shadow.io.grpc.StatusRuntimeException: INVALID_ARGUMENT: One or more TimeSeries could not be written: Field timeSeries[7].points[0].distributionValue had an invalid value: Distribution |explicit_buckets.bounds| does not have at least one entry.
    at shadow.com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:92)
    at shadow.com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:98)
    at shadow.com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:66)
    at shadow.com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure(GrpcExceptionCallable.java:97)
    at shadow.com.google.api.core.ApiFutures$1.onFailure(ApiFutures.java:84)
    at shadow.com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1127)
    at shadow.com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
    at shadow.com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1286)
    at shadow.com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1055)
    at shadow.com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:807)
    at shadow.io.grpc.stub.ClientCalls$GrpcFuture.setException(ClientCalls.java:568)
    at shadow.io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:538)
    at shadow.io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
    at shadow.io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
    at shadow.io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
    at shadow.com.google.api.gax.grpc.ChannelPool$ReleasingClientCall$1.onClose(ChannelPool.java:570)
    at shadow.io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:574)
    at shadow.io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:72)
    at shadow.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:742)
    at shadow.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:723)
    at shadow.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
    at shadow.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)
caused by: shadow.io.grpc.StatusRuntimeException: INVALID_ARGUMENT: One or more TimeSeries could not be written: Field timeSeries[7].points[0].distributionValue had an invalid value: Distribution |explicit_buckets.bounds| does not have at least one entry.
    at shadow.io.grpc.Status.asRuntimeException(Status.java:537)
    ... 14 common frames elided"

Environment information

Environment

JDK: Temurin 17
OS: Ubuntu 24.04
Undertow 2.3.12

Originally posted by @nioncode in open-telemetry/opentelemetry-collector-contrib#34250

@psx95
Copy link
Contributor

psx95 commented Jul 26, 2024

@nioncode
I've seen similar issue before, based on a prior investigation, I suspect the empty bucket histogram is being emitted by the GC metrics from JVM runtime.

Debugging logs (Java Agent 1.33.0)

type is HISTOGRAM
Encountered Histogram: ImmutableHistogramData{aggregationTemporality=CUMULATIVE, points=[ImmutableHistogramPointData{getStartEpochNanos=1717194640362180000, getEpochNanos=1717194880366632000, getAttributes={action="end of minor GC", gc="G1 Young Generation"}, getSum=0.053000000000000005, getCount=3, hasMin=true, getMin=0.016, hasMax=true, getMax=0.02, getBoundaries=[], getCounts=[3], getExemplars=[]}]}

As you can see getBoundaries is empty.

Debugging logs (Java Agent 2.5.0)

type=HISTOGRAM, data=ImmutableHistogramData{aggregationTemporality=CUMULATIVE, points=[ImmutableHistogramPointData{getStartEpochNanos=1719439739709025000, getEpochNanos=1719439799715232000, getAttributes={jvm.gc.action="end of minor GC", jvm.gc.name="G1 Young Generation"}, getSum=0.037, getCount=4, hasMin=true, getMin=0.006, hasMax=true, getMax=0.011, getBoundaries=[0.01, 0.1, 1.0, 10.0], getCounts=[2, 2, 0, 0, 0], getExemplars=[]}]}}

The boudaries are present here.

Possible Workaround:

  • Try updating the Java Agent to v2.x

Note: There were breaking changes between 1.x and 2.x of the Java Agent.

@psx95 psx95 self-assigned this Jul 27, 2024
@nioncode
Copy link

Thanks for the quick troubleshooting! I guess we'll filter out the warning then, since we cannot update to Java Agent v2, since Google Cloud Trace seems to only support v1 properly (I think it only parses the old http attributes from the traces and not the new opentelemetry attributes), but maybe we're just doing things wrong here and Cloud Trace should already support Java Agent v2 (and the latest opentelemetry http spec)?

@aabmass
Copy link
Contributor

aabmass commented Jul 29, 2024

Google Cloud Trace seems to only support v1 properly (I think it only parses the old http attributes from the traces and not the new opentelemetry attributes),

@nioncode could you explain a bit more what you mean by this? We can pass on feedback to the Cloud Trace team if you can share specifically what is wrong

@nioncode
Copy link

@aabmass I did a few requests to my service with java agent v1 and v2 and here are the attributes that were recorded in Cloud Trace for the top-level http post request:

Java Agent v1.33.4

/http/response/size	120
/http/status_code	500
/tid	51
cloud.platform	gcp_compute_engine
cloud.region	europe-west3
g.co/agent	opentelemetry-java 1.39.0; google-cloud-trace-exporter 0.31.0
g.co/r/gce_instance/instance_id	4934311152902825950
gcp.gce.instance.hostname	my-host.europe-west3-a.c.project-id.internal
host.arch	amd64
host.id	4934311152902825950
host.name	my-host
net.host.name	1.2.3.4
net.protocol.version	1.1
net.sock.host.addr	172.17.0.2
net.sock.host.port	443
net.sock.peer.addr	5.6.7.8
net.sock.peer.port	37162
os.description	Linux 6.8.0-1011-gcp
os.type	linux
otel.scope.name	io.opentelemetry.undertow-1.4
otel.scope.version	1.33.4-alpha
process.command_args	[/opt/java/openjdk/bin/java,-javaagent:opentelemetry-javaagent.jar,-Dotel.javaagent.extensions=exporter-auto-0.31.0-alpha-shaded.jar,-Dotel.traces.exporter=google_cloud_trace,-Dotel.metrics.exporter=google_cloud_monitoring,-Dotel.logs.exporter=none,-Dotel.
process.executable.path	/opt/java/openjdk/bin/java
process.pid	1
process.runtime.description	Eclipse Adoptium OpenJDK 64-Bit Server VM 17.0.10+7
process.runtime.name	OpenJDK Runtime Environment
service.instance.id	a96d0ff6-3ac1-4918-bb64-9bf2b82030ec
service.name	my-service
telemetry.sdk.language	java
telemetry.sdk.name	opentelemetry
thread.name	XNIO-1 I/O-1
user_agent.original	curl/8.1.1

Java Agent v2.6.0

client.address	5.6.7.8
cloud.account.id	my-gcp-project
cloud.availability_zone	europe-west3-a
cloud.provider	gcp
container.id	18acfc2525afeac329773e2fcd7a325243738ff8f351404dd42cb2c6c4c4b7f8
g.co/agent	opentelemetry-java 1.40.0; google-cloud-trace-exporter 0.31.0
g.co/r/gce_instance/instance_id	4373258415385174554
g.co/r/gce_instance/zone	europe-west3-a
gcp.gce.instance.hostname	my-host.europe-west3-a.c.project-id.internal
gcp.gce.instance.name	my-host
host.arch	amd64
host.id	4373258415385174554
http.request.method	POST
http.response.status_code	500
network.peer.address	5.6.7.8
network.peer.port	34520
network.protocol.version	1.1
os.type	linux
otel.scope.name	io.opentelemetry.undertow-1.4
process.command_args	[/opt/java/openjdk/bin/java,-javaagent:opentelemetry-javaagent.jar,-Dotel.javaagent.extensions=exporter-auto-0.31.0-alpha-shaded.jar,-Dotel.traces.exporter=google_cloud_trace,-Dotel.metrics.exporter=google_cloud_monitoring,-Dotel.logs.exporter=none,-Dotel.
process.executable.path	/opt/java/openjdk/bin/java
process.pid	1
process.runtime.description	Eclipse Adoptium OpenJDK 64-Bit Server VM 17.0.10+7
process.runtime.name	OpenJDK Runtime Environment
process.runtime.version	17.0.10+7
service.instance.id	53ce0cd4-5dff-404f-97d7-4f5b220b8799
service.name	my-service
telemetry.sdk.language	java
telemetry.sdk.version	1.40.0
url.path	/api/v1/myResource
url.scheme	https
user_agent.original	curl/8.1.1

Only the trace from the v1 agent is properly labeled with a red error in Cloud Trace (I assume it parses the /http/status_code for this and ignores the http.response.status_code of the v2 trace). So it seems like Cloud Trace does not support the newer opentelemetry http spec fields.

Apart from that it also seems like that for both, javaagent v1 and v2, sometimes some fields are missing (I did the same request 3 times and I always see a different set of attributes in Cloud Trace). Maybe there is some kind of maximum trace size that can be recorded and that's why random fields are dropped (depending on the order in the json payload I assume)?

@psx95
Copy link
Contributor

psx95 commented Aug 6, 2024

Hi @nioncode,
Thanks for reporting the issue with the Trace UI!
Also, Seems like there is a limit on the maximum limit on the number of trace attributes - According to the docs, max number of attributes per span is 32, which could be the reason some attributes are being dropped.
I will take a look into both these issues.

You mentioned in your comment that "Cloud Trace only support v1 properly" - is there any other issue you faced with v2 except that the spans with an error did not render with the proper UI ?

@nioncode
Copy link

nioncode commented Aug 7, 2024

Also, Seems like there is a limit on the maximum limit on the number of trace attributes - According to the docs, max number of attributes per span is 32, which could be the reason some attributes are being dropped.

Good catch. Can we control which attributes are recorded by default in the agent via an environment variable to remove some of those that we don't need? Alternatively, can you increase the limit in Cloud Trace to support all the default attributes (plus a few more as a buffer)? Otherwise it is a bit confusing for customers why the default settings don't work correctly with Cloud Trace.

You mentioned in your comment that "Cloud Trace only support v1 properly" - is there any other issue you faced with v2 except that the spans with an error did not render with the proper UI ?

No, I think the only problem that we faced is that the UI does not visualize the http status for v2 correctly and then we assumed that it only works 'properly' with v1 (without checking/knowing if there are any other downsides).

@psx95
Copy link
Contributor

psx95 commented Aug 7, 2024

Hi @nioncode ,
Currently, there is no way to increase the limit in Cloud Trace, one possible workaround could be to use the OpenTelemetry Collector as a middleware to add/remove attributes. This is not an ideal solution, but it could work.

@dashpole dashpole added question Further information is requested priority: p3 labels Aug 14, 2024
@dashpole
Copy link
Contributor

Closing based on #359 (comment)

@nioncode
Copy link

nioncode commented Oct 1, 2024

@dashpole is there any issue that we can subscribe to to get notified once Cloud Trace supports the new opentelemetry convention and / or supports more span attributes?

@dashpole
Copy link
Contributor

dashpole commented Oct 1, 2024

We are aware of this, but i'm afraid I can't share anything publicly right now. When changes are announced, it will be posted to the Cloud Trace release notes: https://cloud.google.com/trace/docs/release-notes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority: p3 question Further information is requested
Projects
None yet
Development

No branches or pull requests

5 participants