Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

100% CPU caused by thread spike (in Fibers, Meteor) #20083

Closed
KoenLav opened this issue Apr 16, 2018 · 19 comments
Closed

100% CPU caused by thread spike (in Fibers, Meteor) #20083

KoenLav opened this issue Apr 16, 2018 · 19 comments
Labels
v8 engine Issues and PRs related to the V8 dependency.

Comments

@KoenLav
Copy link

KoenLav commented Apr 16, 2018

Version: 8.11.1
Platform: any

We have been experiencing 100% CPU in production, presumably caused by the use of Fibers in Node.js by Meteor and the way the data lookup for threads is implemented in V8.

While this issue probably should not be fixed in this repository it seemed worthwhile to track it here as well, in order to find out whether more users of Node.js are suffering from this issue.

As @kentonv describes the issue in the Meteor repository:
V8 uses a linked list to map thread IDs to some thread-local data. Fibers aren't threads, but the fibers package fakes out V8 into thinking that fibers are threads, so this applies to fibers too. If a large number of fibers are created, the list becomes long, and so every lookup into the table becomes slow, because a lookup in a linked list is O(n). These lookups happen frequently, slowing down everything. Using a hash table makes the lookups O(1).

Issue on Meteor repository:
meteor/meteor#9796

Issue on Fibers repository:
laverdet/node-fibers#371

Commit which fixes the issue in Node.js:
0b88256

V8 issue:
https://bugs.chromium.org/p/v8/issues/detail?id=5338

Proposed contribution which fixes the issue in V8:
Not available yet.

@devsnek
Copy link
Member

devsnek commented Apr 16, 2018

node already has the patch, and v8 is oss, why not just submit the patch to them? this also seems like the wrong place to be asking.

@devsnek devsnek added the v8 engine Issues and PRs related to the V8 dependency. label Apr 16, 2018
@KoenLav
Copy link
Author

KoenLav commented Apr 16, 2018

@devsnek

"While this issue probably should not be fixed in this repository it seemed worthwhile to track it here as well, in order to find out whether more users of Node.js are suffering from this issue."

I've created a PR in the V8 repo: v8/v8#24

@devsnek
Copy link
Member

devsnek commented Apr 16, 2018

@KoenLav please note v8 doesn't accept prs to that repo, they use gerrit. you can follow this guide to learn how to submit a patch: http://dev.chromium.org/developers/contributing-code

i'm going to close this for now, feel free to ping me if you have any other questions.

@devsnek devsnek closed this as completed Apr 16, 2018
@apapirovski
Copy link
Member

ping @nodejs/v8 just in case. Although seems like this is on the right track as is.

@KoenLav
Copy link
Author

KoenLav commented Apr 16, 2018

@devsnek no problem to leave this here closed (people should still be able to find it more easily).

@KoenLav
Copy link
Author

KoenLav commented Apr 24, 2018

@devsnek the V8 patch has landed (https://chromium-review.googlesource.com/c/v8/v8/+/1014407). Any indication of when it will be available in Node.js?

@ofrobots
Copy link
Contributor

@KoenLav I have requested an upstream merge for V8 6.7 and 6.6 so that that can end up in Node 10. Let's see if upstream is willing to back merge. Otherwise I'm +1 on floating it as a patch here otherwise.

@devsnek
Copy link
Member

devsnek commented Apr 24, 2018

@KoenLav you can run git node v8 backport b49206ded97c4eaac7c273ce004d840a0185d40e (using https://github.com/nodejs/node-core-utils/) in a checkout of node and then open a pr

@ofrobots
Copy link
Contributor

@KoenLav you can run git node v8 backport b49206ded97c4eaac7c273ce004d840a0185d40e (using https://github.com/nodejs/node-core-utils/) in a checkout of node and then open a pr

Yes, but wait for a disposition upstream first.

@KoenLav
Copy link
Author

KoenLav commented Apr 26, 2018

@devsnek @ofrobots awaiting merge from upstream first.

Any chance of getting this applied in Nodejs 8.* as well?

@ofrobots
Copy link
Contributor

Yes, that is possible. The LTS policy is that a fix has to be released on the current branch for a bit before we back-port to LTS branches. This helps shake any potential instability. The process as I see it, from here:

  1. If upstream approves a merge, we will pick up the fix on the next V8 update.
  2. If upstream doesn't want to merge back, we can float the fix it on master. The release team will back-port and release on 10.

If all looks good, this will subsequently be considered for a merge back to older release branches.

@KoenLav
Copy link
Author

KoenLav commented May 3, 2018

@ofrobots as the merge has been approved for V8 6.7, what would be the proper course of action from here?

@hashseed
Copy link
Member

hashseed commented May 3, 2018

Our fuzzers found a crash so I would like to understand what's going on there before moving forward with this. @ofrobots has been CC'ed on the issue.

@KoenLav
Copy link
Author

KoenLav commented May 3, 2018

@hashseed link?

@hashseed
Copy link
Member

hashseed commented May 3, 2018

@KoenLav unfortunately I can't share the link as it is security sensitive by default. Kenton has access to it though.

@KoenLav
Copy link
Author

KoenLav commented May 3, 2018

@Hasheed would be happy to try and shed some light if you CC ;)

@kentonv
Copy link

kentonv commented May 13, 2018

It appears to have been a problem with the fuzzer setup, not an actual bug in the code.

@KoenLav
Copy link
Author

KoenLav commented May 13, 2018

@kentonv good to hear!

Have they communicated any ETA on merging this commit to V8 6.7? Or is it already done?

@ofrobots
Copy link
Contributor

I've opened the 6.7 merge change upstream. Once there is LGTM, this can land upstream, and will be picked up by Node.js when we move up to V8 6.7 (~ end of the month).

For V8 6.6, we need to float the patch here. I've opened PR here: #20727

ofrobots added a commit to ofrobots/node that referenced this issue May 14, 2018
Original commit message:
  ThreadDataTable: Change global linked list to per-Isolate hash map.

  For use cases with a large number of threads or a large number of isolates (or
  both), ThreadDataTable can be a major performance bottleneck due to O(n)
  lookup time of the linked list. Switching to a hash map reduces this to O(1).

  Example 1: Sandstorm.io, a Node.js app that utilizes "fibers", was observed
  spending the majority of CPU time iterating over the ThreadDataTable.
  See: https://sandstorm.io/news/2016-09-30-fiber-bomb-debugging-story

  Example 2: Cloudflare's Workers engine, a high-multi-tenancy web server
  framework built on V8 (but not Node), creates large numbers of threads and
  isolates per-process. It saw a 34x improvement in throughput when we applied
  this patch.

  Cloudflare has been using a patch in production since the Workers launch which
  replaces the linked list with a hash map -- but still global.

  This commit builds on that but goes further and creates a separate hash map
  and mutex for each isolate, with the table being a member of the Isolate
  class. This avoids any globals and should reduce lock contention.

  Bug: v8:5338
  Change-Id: If0d11509afb2e043b888c376e36d3463db931b47
  Reviewed-on: https://chromium-review.googlesource.com/1014407
  Reviewed-by: Yang Guo <yangguo@chromium.org>
  Commit-Queue: Yang Guo <yangguo@chromium.org>
  Cr-Commit-Position: refs/heads/master@{nodejs#52753}

PR-URL: nodejs#20727
Ref: nodejs#20083
ofrobots added a commit that referenced this issue May 17, 2018
Original commit message:
  ThreadDataTable: Change global linked list to per-Isolate hash map.

  For use cases with a large number of threads or a large number of
  isolates (or both), ThreadDataTable can be a major performance
  bottleneck due to O(n) lookup time of the linked list. Switching to a
  hash map reduces this to O(1).

  Example 1: Sandstorm.io, a Node.js app that utilizes "fibers", was
  observed spending the majority of CPU time iterating over the
  ThreadDataTable.
  See: https://sandstorm.io/news/2016-09-30-fiber-bomb-debugging-story

  Example 2: Cloudflare's Workers engine, a high-multi-tenancy web
  server framework built on V8 (but not Node), creates large numbers of
  threads and isolates per-process. It saw a 34x improvement in
  throughput when we applied this patch.

  Cloudflare has been using a patch in production since the Worker
  launch which replaces the linked list with a hash map -- but still
  global.

  This commit builds on that but goes further and creates a separate
  hash map and mutex for each isolate, with the table being a member of
  the Isolate class. This avoids any globals and should reduce lock
  contention.

  Bug: v8:5338
  Change-Id: If0d11509afb2e043b888c376e36d3463db931b47
  Reviewed-on: https://chromium-review.googlesource.com/1014407
  Reviewed-by: Yang Guo <yangguo@chromium.org>
  Commit-Queue: Yang Guo <yangguo@chromium.org>
  Cr-Commit-Position: refs/heads/master@{#52753}

PR-URL: #20727
Ref: #20083

Refs: #20083
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-By: James M Snell <jasnell@gmail.com>
MylesBorins pushed a commit that referenced this issue May 22, 2018
Original commit message:
  ThreadDataTable: Change global linked list to per-Isolate hash map.

  For use cases with a large number of threads or a large number of
  isolates (or both), ThreadDataTable can be a major performance
  bottleneck due to O(n) lookup time of the linked list. Switching to a
  hash map reduces this to O(1).

  Example 1: Sandstorm.io, a Node.js app that utilizes "fibers", was
  observed spending the majority of CPU time iterating over the
  ThreadDataTable.
  See: https://sandstorm.io/news/2016-09-30-fiber-bomb-debugging-story

  Example 2: Cloudflare's Workers engine, a high-multi-tenancy web
  server framework built on V8 (but not Node), creates large numbers of
  threads and isolates per-process. It saw a 34x improvement in
  throughput when we applied this patch.

  Cloudflare has been using a patch in production since the Worker
  launch which replaces the linked list with a hash map -- but still
  global.

  This commit builds on that but goes further and creates a separate
  hash map and mutex for each isolate, with the table being a member of
  the Isolate class. This avoids any globals and should reduce lock
  contention.

  Bug: v8:5338
  Change-Id: If0d11509afb2e043b888c376e36d3463db931b47
  Reviewed-on: https://chromium-review.googlesource.com/1014407
  Reviewed-by: Yang Guo <yangguo@chromium.org>
  Commit-Queue: Yang Guo <yangguo@chromium.org>
  Cr-Commit-Position: refs/heads/master@{#52753}

PR-URL: #20727
Ref: #20083

Refs: #20083
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-By: James M Snell <jasnell@gmail.com>
ofrobots added a commit to ofrobots/node that referenced this issue Jun 28, 2018
This is the v8.x backport of nodejs#20727.

Original commit message:
  ThreadDataTable: Change global linked list to per-Isolate hash map.

  For use cases with a large number of threads or a large number of
  isolates (or both), ThreadDataTable can be a major performance
  bottleneck due to O(n) lookup time of the linked list. Switching to a
  hash map reduces this to O(1).

  Example 1: Sandstorm.io, a Node.js app that utilizes "fibers", was
  observed spending the majority of CPU time iterating over the
  ThreadDataTable.
  See: https://sandstorm.io/news/2016-09-30-fiber-bomb-debugging-story

  Example 2: Cloudflare's Workers engine, a high-multi-tenancy web
  server framework built on V8 (but not Node), creates large numbers of
  threads and isolates per-process. It saw a 34x improvement in
  throughput when we applied this patch.

  Cloudflare has been using a patch in production since the Worker
  launch which replaces the linked list with a hash map -- but still
  global.

  This commit builds on that but goes further and creates a separate
  hash map and mutex for each isolate, with the table being a member of
  the Isolate class. This avoids any globals and should reduce lock
  contention.

  Bug: v8:5338
  Change-Id: If0d11509afb2e043b888c376e36d3463db931b47
  Reviewed-on: https://chromium-review.googlesource.com/1014407
  Reviewed-by: Yang Guo <yangguo@chromium.org>
  Commit-Queue: Yang Guo <yangguo@chromium.org>
  Cr-Commit-Position: refs/heads/master@{nodejs#52753}

Backport-PR-URL: nodejs#21529
PR-URL: nodejs#20727
Refs: nodejs#20083
Refs: nodejs#20083
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-By: James M Snell <jasnell@gmail.com>
rvagg pushed a commit that referenced this issue Aug 16, 2018
This is the v8.x backport of #20727.

Original commit message:
  ThreadDataTable: Change global linked list to per-Isolate hash map.

  For use cases with a large number of threads or a large number of
  isolates (or both), ThreadDataTable can be a major performance
  bottleneck due to O(n) lookup time of the linked list. Switching to a
  hash map reduces this to O(1).

  Example 1: Sandstorm.io, a Node.js app that utilizes "fibers", was
  observed spending the majority of CPU time iterating over the
  ThreadDataTable.
  See: https://sandstorm.io/news/2016-09-30-fiber-bomb-debugging-story

  Example 2: Cloudflare's Workers engine, a high-multi-tenancy web
  server framework built on V8 (but not Node), creates large numbers of
  threads and isolates per-process. It saw a 34x improvement in
  throughput when we applied this patch.

  Cloudflare has been using a patch in production since the Worker
  launch which replaces the linked list with a hash map -- but still
  global.

  This commit builds on that but goes further and creates a separate
  hash map and mutex for each isolate, with the table being a member of
  the Isolate class. This avoids any globals and should reduce lock
  contention.

  Bug: v8:5338
  Change-Id: If0d11509afb2e043b888c376e36d3463db931b47
  Reviewed-on: https://chromium-review.googlesource.com/1014407
  Reviewed-by: Yang Guo <yangguo@chromium.org>
  Commit-Queue: Yang Guo <yangguo@chromium.org>
  Cr-Commit-Position: refs/heads/master@{#52753}

Backport-PR-URL: #21529
PR-URL: #20727
Refs: #20083
Refs: #20083
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-By: James M Snell <jasnell@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
v8 engine Issues and PRs related to the V8 dependency.
Projects
None yet
Development

No branches or pull requests

6 participants