Skip to content
This repository has been archived by the owner on Dec 23, 2023. It is now read-only.

100% cpu usage by OpenCensus.Disruptor-0 #1599

Open
danielnorberg opened this issue Nov 20, 2018 · 11 comments
Open

100% cpu usage by OpenCensus.Disruptor-0 #1599

danielnorberg opened this issue Nov 20, 2018 · 11 comments
Assignees
Labels

Comments

@danielnorberg
Copy link

Please answer these questions before submitting a bug report.

What version of OpenCensus are you using?

0.17.0

What JVM are you using (java -version)?

java version "11.0.1" 2018-10-16 LTS
Java(TM) SE Runtime Environment 18.9 (build 11.0.1+13-LTS)
Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11.0.1+13-LTS, mixed mode)

What did you do?

Normal opencensus use - low volume workload.

What did you expect to see?

Low cpu usage.

What did you see instead?

The OpenCensus.Disruptor-0 thread spinning and using a whole core even when the application is otherwise idle.

This manifested after we upgraded from Java 8 to Java 11.

@songy23
Copy link
Contributor

songy23 commented Nov 21, 2018

@danielnorberg @steveniemitz I've made a new SNAPSHOT release which included #1605. Can you try it and see if it resolved this issue?

The SNAPSHOT release is under https://oss.sonatype.org/content/repositories/snapshots/io/opencensus/. For example https://oss.sonatype.org/content/repositories/snapshots/io/opencensus/opencensus-api/0.18.0-SNAPSHOT/.

To access it, please add the following lines to ~/.m2/settings.xml:

<profiles>
  <profile>
     <id>allow-snapshots</id>
        <activation><activeByDefault>true</activeByDefault></activation>
     <repositories>
       <repository>
         <id>snapshots-repo</id>
         <url>https://oss.sonatype.org/content/repositories/snapshots</url>
         <releases><enabled>false</enabled></releases>
         <snapshots><enabled>true</enabled></snapshots>
       </repository>
     </repositories>
   </profile>
</profiles>

Then you should be able to use 0.18.0-SNAPSHOT as the OpenCensus version in your gradle/pom file.

@danielnorberg
Copy link
Author

With opencensus-impl-0.18.0-20181121.224929-2.jar the OpenCensus.Disruptor-0 thread cpu usage has dropped down to around %10 for me on java 11.

@bogdandrutu
Copy link
Contributor

@danielnorberg good to hear that. I think we should increase from 1ms to 4 or 8ms the sleeping time. Need to run some benchmarks to prove does not affect the high qps jobs. If you can try to increase that to 4 and 8 and see the impact would be great.

@danielnorberg
Copy link
Author

@bogdandrutu I tried increasing the sleep from 1ms to 4ms and 8ms and got ~4% and ~2% cpu usage, respectively, when idle.

@bogdandrutu
Copy link
Contributor

Do you think for the moment until we do a better fix 4% is acceptable? If yes I can do a PR or you can send me a PR. Thanks

@steveniemitz
Copy link
Contributor

I've been running a custom build that replaced SleepingWaitStrategy with BlockingWaitStrategy, which completely removes the CPU overhead of the disruptor thread. However is does add a very slight overhead to the actual tracing calls (since they now need to synchronize/notify the disruptor). For my use-case this overhead was acceptable and better than burning a lot of CPU.

@bogdandrutu
Copy link
Contributor

@danielnorberg @steveniemitz that would be another option. we can do that by having a manifest file to configure the "low-overhead" version which uses the BlockingWaitStrategy (or an environment variable). Do you prefer this? If so what do you think is better manifest or env variable? Also I will work to build a better implementation that will not require this.

@steveniemitz
Copy link
Contributor

steveniemitz commented Nov 26, 2018

I do like the idea of being able to configure this more easily. It seems like the SleepingWaitStrategy does have some kinks in it that could be worked out though with some more tuning? It seems like even 4% is a very high overhead for this that most service owners wouldn't want. It also seems like things running in GKE (or AppEngine) are significantly more effected than things that aren't, are there specific kernel modifications there that are causing these issues?

EDIT: To directly address your question, I think I did see a PR around here that allowed configuring the strategy via a ServiceLoader (and manifest). That'd probably be the best solution.

@bogdandrutu
Copy link
Contributor

Filed an issue on disruptor to allow us to support a better waiting strategy (this is for my long term solution) LMAX-Exchange/disruptor#246. This is mostly an FYI and to keep track of the progress here.

@steveniemitz that PR has not made too much progress. Happy to start a new PR for this if you think that can be a good short term solution (maybe can still be useful after I can implement my long term solution).

@danielnorberg
Copy link
Author

Allowing users to choose the appropriate wait strategy seems like it would be appreciated by some, although I suspect many application developers might not be familiar enough with disruptor and busy-wait/park/yield strategies, contention costs etc to be comfortable with making this choice. I guess most users will want a non-surprising default that performs well enough in most cases.

Defaulting to BlockingWaitStrategy seems like a good choice that would mitigate the unexpected cpu usage of the consumer thread for applications with low-volume workloads. The need for producer threads to synchronize is unfortunate though. Seeing as we have a single consumer thread, perhaps we can get away with a cheaper signaling mechanism. Took a stab at that here #1618.

@xocasdashdash
Copy link

We've recently managed to hit a very similar issue to this one at my company.
The difference is that we're running java8, the issue showed up when we updated from jdk:8u232 to jdk:8u242.

We're going to try to fix by updating the opencensus library.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests

5 participants