100% cpu usage by OpenCensus.Disruptor-0 #1599

danielnorberg · 2018-11-20T07:53:33Z

Please answer these questions before submitting a bug report.

What version of OpenCensus are you using?

0.17.0

What JVM are you using (`java -version`)?

java version "11.0.1" 2018-10-16 LTS
Java(TM) SE Runtime Environment 18.9 (build 11.0.1+13-LTS)
Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11.0.1+13-LTS, mixed mode)

What did you do?

Normal opencensus use - low volume workload.

What did you expect to see?

Low cpu usage.

What did you see instead?

The OpenCensus.Disruptor-0 thread spinning and using a whole core even when the application is otherwise idle.

This manifested after we upgraded from Java 8 to Java 11.

The text was updated successfully, but these errors were encountered:

songy23 · 2018-11-21T22:41:30Z

@danielnorberg @steveniemitz I've made a new SNAPSHOT release which included #1605. Can you try it and see if it resolved this issue?

The SNAPSHOT release is under https://oss.sonatype.org/content/repositories/snapshots/io/opencensus/. For example https://oss.sonatype.org/content/repositories/snapshots/io/opencensus/opencensus-api/0.18.0-SNAPSHOT/.

To access it, please add the following lines to ~/.m2/settings.xml:

<profiles>
  <profile>
     <id>allow-snapshots</id>
        <activation><activeByDefault>true</activeByDefault></activation>
     <repositories>
       <repository>
         <id>snapshots-repo</id>
         <url>https://oss.sonatype.org/content/repositories/snapshots</url>
         <releases><enabled>false</enabled></releases>
         <snapshots><enabled>true</enabled></snapshots>
       </repository>
     </repositories>
   </profile>
</profiles>

Then you should be able to use 0.18.0-SNAPSHOT as the OpenCensus version in your gradle/pom file.

danielnorberg · 2018-11-22T05:12:28Z

With opencensus-impl-0.18.0-20181121.224929-2.jar the OpenCensus.Disruptor-0 thread cpu usage has dropped down to around %10 for me on java 11.

bogdandrutu · 2018-11-22T19:22:16Z

@danielnorberg good to hear that. I think we should increase from 1ms to 4 or 8ms the sleeping time. Need to run some benchmarks to prove does not affect the high qps jobs. If you can try to increase that to 4 and 8 and see the impact would be great.

danielnorberg · 2018-11-26T02:27:54Z

@bogdandrutu I tried increasing the sleep from 1ms to 4ms and 8ms and got ~4% and ~2% cpu usage, respectively, when idle.

bogdandrutu · 2018-11-26T03:14:56Z

Do you think for the moment until we do a better fix 4% is acceptable? If yes I can do a PR or you can send me a PR. Thanks

steveniemitz · 2018-11-26T16:58:16Z

I've been running a custom build that replaced SleepingWaitStrategy with BlockingWaitStrategy, which completely removes the CPU overhead of the disruptor thread. However is does add a very slight overhead to the actual tracing calls (since they now need to synchronize/notify the disruptor). For my use-case this overhead was acceptable and better than burning a lot of CPU.

bogdandrutu · 2018-11-26T20:04:48Z

@danielnorberg @steveniemitz that would be another option. we can do that by having a manifest file to configure the "low-overhead" version which uses the BlockingWaitStrategy (or an environment variable). Do you prefer this? If so what do you think is better manifest or env variable? Also I will work to build a better implementation that will not require this.

steveniemitz · 2018-11-26T20:20:38Z

I do like the idea of being able to configure this more easily. It seems like the SleepingWaitStrategy does have some kinks in it that could be worked out though with some more tuning? It seems like even 4% is a very high overhead for this that most service owners wouldn't want. It also seems like things running in GKE (or AppEngine) are significantly more effected than things that aren't, are there specific kernel modifications there that are causing these issues?

EDIT: To directly address your question, I think I did see a PR around here that allowed configuring the strategy via a ServiceLoader (and manifest). That'd probably be the best solution.

bogdandrutu · 2018-11-26T23:38:25Z

Filed an issue on disruptor to allow us to support a better waiting strategy (this is for my long term solution) LMAX-Exchange/disruptor#246. This is mostly an FYI and to keep track of the progress here.

@steveniemitz that PR has not made too much progress. Happy to start a new PR for this if you think that can be a good short term solution (maybe can still be useful after I can implement my long term solution).

danielnorberg · 2018-11-28T04:31:40Z

Allowing users to choose the appropriate wait strategy seems like it would be appreciated by some, although I suspect many application developers might not be familiar enough with disruptor and busy-wait/park/yield strategies, contention costs etc to be comfortable with making this choice. I guess most users will want a non-surprising default that performs well enough in most cases.

Defaulting to BlockingWaitStrategy seems like a good choice that would mitigate the unexpected cpu usage of the consumer thread for applications with low-volume workloads. The need for producer threads to synchronize is unfortunate though. Seeing as we have a single consumer thread, perhaps we can get away with a cheaper signaling mechanism. Took a stab at that here #1618.

to avoid excessive CPU usage reported in census-instrumentation/opencensus-java#1599

xocasdashdash · 2020-04-03T11:45:55Z

We've recently managed to hit a very similar issue to this one at my company.
The difference is that we're running java8, the issue showed up when we updated from jdk:8u232 to jdk:8u242.

We're going to try to fix by updating the opencensus library.

songy23 added this to the Release 0.18.0 milestone Nov 21, 2018

songy23 added the bug label Nov 21, 2018

songy23 assigned bogdandrutu Nov 21, 2018

songy23 mentioned this issue Nov 21, 2018

Release 0.18.0 #1578

Closed

This was referenced Nov 21, 2018

Update disruptor to sleep 1ms and avoid busy waiting. #1605

Merged

CPU Hovers at 20-30% after OpenCensus has been added to a GCP AppEngine Flex service #1099

Closed

danielnorberg mentioned this issue Nov 28, 2018

lock-free single thread wait strategy #1618

Closed

felixbarny mentioned this issue Jan 24, 2019

Consider more CPU friendly disruptor wait strategies elastic/apm-agent-java#443

Closed

songy23 added the P2 label Apr 26, 2019

songy23 removed this from the Release 0.18.0 milestone Apr 26, 2019

gitlw pushed a commit to dgraph-io/dgraph4j that referenced this issue Jun 28, 2019

Moving the dependency io.opencensus.* into a test dependency only,

04ecb54

to avoid excessive CPU usage reported in census-instrumentation/opencensus-java#1599

gitlw mentioned this issue Jun 28, 2019

Moving the dependency io.opencensus.* into a test dependency only, dgraph-io/dgraph4j#89

Merged

gitlw pushed a commit to dgraph-io/dgraph4j that referenced this issue Jun 28, 2019

Moving the dependency io.opencensus.* into a test dependency only, (#89)

dac5aac

to avoid excessive CPU usage reported in census-instrumentation/opencensus-java#1599

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

100% cpu usage by OpenCensus.Disruptor-0 #1599

100% cpu usage by OpenCensus.Disruptor-0 #1599

danielnorberg commented Nov 20, 2018

songy23 commented Nov 21, 2018

danielnorberg commented Nov 22, 2018

bogdandrutu commented Nov 22, 2018

danielnorberg commented Nov 26, 2018

bogdandrutu commented Nov 26, 2018

steveniemitz commented Nov 26, 2018

bogdandrutu commented Nov 26, 2018

steveniemitz commented Nov 26, 2018 •

edited

Loading

bogdandrutu commented Nov 26, 2018

danielnorberg commented Nov 28, 2018

xocasdashdash commented Apr 3, 2020

100% cpu usage by OpenCensus.Disruptor-0 #1599

100% cpu usage by OpenCensus.Disruptor-0 #1599

Comments

danielnorberg commented Nov 20, 2018

What version of OpenCensus are you using?

What JVM are you using (java -version)?

What did you do?

What did you expect to see?

What did you see instead?

songy23 commented Nov 21, 2018

danielnorberg commented Nov 22, 2018

bogdandrutu commented Nov 22, 2018

danielnorberg commented Nov 26, 2018

bogdandrutu commented Nov 26, 2018

steveniemitz commented Nov 26, 2018

bogdandrutu commented Nov 26, 2018

steveniemitz commented Nov 26, 2018 • edited Loading

bogdandrutu commented Nov 26, 2018

danielnorberg commented Nov 28, 2018

xocasdashdash commented Apr 3, 2020

What JVM are you using (`java -version`)?

steveniemitz commented Nov 26, 2018 •

edited

Loading