
Fix network buffering performance problems #669

Merged 6 commits into tulios:master from the network-bufferring branch on Mar 20, 2020

Conversation

@Slava (Contributor) commented Mar 18, 2020

We use Kafka.js at Figma, and we have been observing systematic performance problems with our Node.js services that consume from Kafka using this library.

During one of these incidents we took a CPU profile and identified a performance issue in the library, specifically the code performing network reads and low-level protocol parsing.

The culprit is the use of Buffer.concat on every chunk read from the network, resulting in O(n^2) work as the old buffer is copied into a new, larger buffer each time. The effect is more pronounced when messages are large; we noticed it when our services lagged behind and started fetching large batches from Kafka.

The proposed fix is to defer Buffer.concat until it is actually needed, accumulating a list of Buffer references in the meantime (appending a reference to an array is a well-optimized operation in V8).

As a first-time contributor to Kafka.js, I realize this patch might not be in perfect shape for merging as is, but I am happy to discuss with the maintainers how to fix this issue upstream.
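To illustrate the difference, here is a minimal sketch of the idea (not the actual KafkaJS code; the handler and variable names are assumed for the example):

```js
// Before: every incoming chunk re-copies everything buffered so far,
// so reading n bytes in small chunks does O(n^2) copying overall.
let buffer = Buffer.alloc(0)
function onDataQuadratic(rawData) {
  buffer = Buffer.concat([buffer, rawData]) // copies the old buffer each time
}

// After: keep references to the chunks and only concatenate once a full
// response has arrived, so each byte is copied a single time.
const chunks = []
let bytesBuffered = 0
let bytesNeeded = Infinity // in practice, taken from the response size header

function onDataLinear(rawData) {
  chunks.push(rawData) // appending a reference is cheap
  bytesBuffered += Buffer.byteLength(rawData)

  if (bytesBuffered >= bytesNeeded) {
    const payload = Buffer.concat(chunks, bytesBuffered) // single copy
    chunks.length = 0
    bytesBuffered = 0
    // hand `payload` to the protocol decoder here
  }
}
```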

this.bufferQueue.bytesTotal += Buffer.byteLength(rawData)

const bytesTotal = Buffer.byteLength(this.buffer) + this.bufferQueue.bytesTotal
if (this.bufferQueue.bytesAwaiting <= bytesTotal) {
Collaborator


Can you invert this statement, please? It would read a lot clearer as "if there are more bytes to read, return" than the way it is now.
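For reference, the inverted form the reviewer is asking for would look roughly like this (a sketch using the names from the diff above, not the final merged code):

```js
const bytesBuffered = Buffer.byteLength(this.buffer) + this.bufferQueue.bytesTotal

// Early return: if we still need more bytes, wait for the next chunk.
if (this.bufferQueue.bytesAwaiting > bytesBuffered) {
  return
}

// Otherwise a full response has arrived; drain the queue and decode it.
```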

Contributor Author


Got it, I rewrote the patch to make it easier to follow.

@Nevon (Collaborator) commented Mar 19, 2020

I will properly review this and provide some real feedback as soon as I can. The TL;DR: cool!

@ankon (Contributor) commented Mar 19, 2020

This looks cool. Do you have any indication of the gains from this?

@Nevon (Collaborator) commented Mar 19, 2020

We use Kafka.js at Figma

Cool! 🚀 Big fan of what you do.

The issue with copying buffers is a known one, which I started addressing in #475, but I didn't have time to drive it to the finish line. Your approach is a bit simpler, which should make it easier to get in, even if it doesn't address every place where there's excessive buffer copying (protocol encoding and decoding is another).

It would be really nice if you were able to share some kind of before and after benchmarks, so that we can see if it actually has an impact (it should, but it would be great to see regardless).

On the actual fix itself, it looks pretty good to me. Ideally I would like to encapsulate this in a separate RequestBufferQueue class or something that could hide the details of the array of buffers, how it calculates how many bytes are remaining, and how to drain the queue, but it also isn't that much logic, so maybe it can stay where it is for now. What do you think?
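A hypothetical shape for that encapsulation, just to illustrate the suggestion (this class does not exist in the codebase; the method names are made up for the example):

```js
class RequestBufferQueue {
  constructor() {
    this.chunks = []
    this.bytesBuffered = 0
    this.bytesNeeded = 0
  }

  // Record how many bytes the next response needs before it can be decoded.
  expect(bytesNeeded) {
    this.bytesNeeded = bytesNeeded
  }

  // Store a reference to the chunk; no copying happens here.
  push(chunk) {
    this.chunks.push(chunk)
    this.bytesBuffered += Buffer.byteLength(chunk)
  }

  hasCompleteResponse() {
    return this.bytesBuffered >= this.bytesNeeded
  }

  // Concatenate exactly once, when the full response is available.
  drain() {
    const payload = Buffer.concat(this.chunks, this.bytesBuffered)
    this.chunks = []
    this.bytesBuffered = 0
    return payload
  }
}
```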

@Slava (Contributor, Author) commented Mar 19, 2020

I think the logic of keeping bytesNeeded and bytesBuffered is very specific to how the data is fed into the Decoder, so it would make sense to keep the logic close to it.

As for metrics, we only observed this behavior in production, since it is so dependent on the way Kafka chooses to batch data.

In our case, once a process started lagging behind, its CPU usage would spike from ~15% to 80-100% and stay there indefinitely. After we deployed the fix, CPU usage immediately stabilized back to ~15% and stayed there. We use single-core ECS containers.

@Slava force-pushed the network-bufferring branch 2 times, most recently from 3789061 to 034bab1 on March 20, 2020 01:42
@Slava (Contributor, Author) commented Mar 20, 2020

I reworked the patch again to add more clarity and pass the tests.

@Nevon (Collaborator) commented Mar 20, 2020

You need to merge in the latest master, and then we can merge this. Thanks!

@Slava (Contributor, Author) commented Mar 20, 2020

@Nevon done, thank you for the quick turnaround!

@Nevon merged commit 9951162 into tulios:master on Mar 20, 2020
@Slava deleted the network-bufferring branch on September 2, 2020 20:46