This repository was archived by the owner on Nov 4, 2021. It is now read-only.

createBuffer is bad - Why BQ isn't the problem #487

@yakkomajuri

As the Extensibility team disbanded to conquer the world, it was my turn to be assigned PostHog/bigquery-plugin#10.

Having spent a good amount of time pulling my beard hairs out, I believe I now have the answer. Here's what's happening and a mandatory story of how I got there.

Disclaimer: While I'm reasonably sure of quite a bit of this, some of it is still "hunches", so it may not be 100% technically correct.

Context

@neilkakkar suggested I pick this up during the last planning session, noting that this was an error that only happens on local/self-hosted but not Cloud.

This was super weird, because it was reported on local Postgres instances, but also on Netdata's ClickHouse instance. I kept thinking about all the things that could be different between these environments and testing them out, and went through a lot of the same steps others here have taken. Could it be logs? Could it be job queues?

Then I started looking into the buffer.

BigQuery ain't the problem

When I first read the issue, I immediately thought this wasn't a BigQuery thing. Why? I had experienced the same issues developing the RudderStack plugin, and the odd retries even featured in a video I made for them :D

It seemed that retry errors were only thrown on the first run, but everything would always complete as expected the second time.

What was different here? In both plugins, the first flush would come from the buffer, whereas all subsequent ones would come from jobs. Bingo, buffer.

The Evil Buffer

To find out if the buffer was the problem, I set up a buffer flush and a job to call the same async function that made a request to an endpoint that'd always wait 5s before responding.
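For reference, here's a minimal sketch of that test setup. The endpoint URL and names are hypothetical, and it assumes the plugin scaffold's `createBuffer` (from `@posthog/plugin-contrib`) and jobs API:

```js
import { createBuffer } from '@posthog/plugin-contrib'

// Hypothetical endpoint that always waits ~5s before responding
const SLOW_ENDPOINT = 'https://example.com/sleep?ms=5000'

async function timedRequest (label) {
    const start = Date.now()
    await fetch(SLOW_ENDPOINT) // fetch is provided to plugins globally
    console.log(`${label} took ${Date.now() - start}ms`)
}

export function setupPlugin ({ global }) {
    global.buffer = createBuffer({
        limit: 1,
        timeoutSeconds: 1,
        // Called from the buffer flush: the promise ends up voided inside the piscina task
        onFlush: async () => await timedRequest('buffer flush'),
    })
}

export const jobs = {
    // Called as a job: piscina awaits this properly
    timedRequestJob: async () => await timedRequest('job'),
}

export async function onEvent (event, { global, jobs }) {
    global.buffer.add(event)
    await jobs.timedRequestJob({}).runNow()
}
```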

The job always executed in around the same amount of time, but the buffer would vary a lot:

[Screenshot: job vs buffer execution times]

I went through thousands of runs, and the job was always consistent. The buffer would vary from 5s to 29s and then came the timeouts.

So the buffer was the problem. But finding out why was where my beard hairs started to come out.

I started to notice connection reset, socket hang up, and other sorts of errors. I knew the request would take only 5s, so there were 25s to spare. This made no sense, so my hunch was that this was a problem lower down.

Here came a period of hours and hours looking through node and piscina docs, changing config options, inspecting all the possible lines of code one by one, etc. We can skip ahead from here.

Side note: I think we can improve performance by setting keepAlive to true for plugin fetch requests by default. This had a significant performance benefit, and most plugins will make requests to the same hosts over and over anyway.
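For illustration, a sketch of what that could look like (not the actual plugin-server change; the names here are mine), using keep-alive agents with node-fetch's `agent` option:

```js
import fetch from 'node-fetch'
import http from 'http'
import https from 'https'

// Reuse sockets across requests to the same host instead of opening
// a new TCP (and TLS) connection for every fetch.
const httpAgent = new http.Agent({ keepAlive: true })
const httpsAgent = new https.Agent({ keepAlive: true })

export const fetchWithKeepAlive = (url, options = {}) =>
    fetch(url, {
        agent: (parsedUrl) => (parsedUrl.protocol === 'http:' ? httpAgent : httpsAgent),
        ...options,
    })
```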

0x, our savior

Finally, I decided to run the plugin server with the flamegraph tool 0x (https://github.com/davidmarkclements/0x).

Here's what that looked like:

[Screenshot: 0x flamegraph showing atomicsWaitLoop (2021-06-25)]

Hmmm, what the hell is atomicsWaitLoop?

I went to the piscina repo and there was the answer:

```ts
function atomicsWaitLoop (port : MessagePort, sharedBuffer : Int32Array) {
  if (!useAtomics) return;

  // This function is entered either after receiving the startup message, or
  // when we are done with a task. In those situations, the *only* thing we
  // expect to happen next is a 'message' on `port`.
  // That call would come with the overhead of a C++ → JS boundary crossing,
  // including async tracking. So, instead, if there is no task currently
  // running, we wait for a signal from the parent thread using Atomics.wait(),
  // and read the message from the port instead of generating an event,
  // in order to avoid that overhead.
  // The one catch is that this stops asynchronous operations that are still
  // running from proceeding. Generally, tasks should not spawn asynchronous
  // operations without waiting for them to finish, though.

  ...
}
```

"Generally, tasks should not spawn asynchronous operations without waiting for them to finish, though."

Hey! That's exactly what we do.

This by itself didn't fully answer the Cloud vs local/self-hosted question, but with a bit more digging and testing, the answer came through:

This does happen on Cloud (10k+ events on Sentry) but it happens waaay less because there are always tasks to process.

What happens is that when a piscina worker is looking for a task and doesn't find one, it keeps blocking that thread's event loop until a task comes through. If one does, the worker may have time to run through the other phases of the event loop and handle the voided steps of the buffer flush before the timeout (e.g. the fetch request may have completed, but its callback hasn't been processed yet). But if no tasks come, it will wait, and wait, and wait, and when it finally comes back, asyncGuard does its thing aaaand we're sending a request to be retried (even though it may have already finished, as was pointed out with BQ).

Testing this locally confirmed this:

[Screenshot: local test results confirming the behaviour (2021-06-25)]

The fewer events you send, the longer atomicsWaitLoop runs for. This isn't much of a problem on Cloud because there are always events (and all the other tasks) to process.
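To make the mechanism concrete, here's a tiny standalone illustration (not piscina itself) of an async operation being starved by Atomics.wait; the 5s/20s numbers are just for the example:

```js
import { Worker, isMainThread, workerData } from 'worker_threads'
import { fileURLToPath } from 'url'

if (isMainThread) {
    const shared = new Int32Array(new SharedArrayBuffer(4))
    new Worker(fileURLToPath(import.meta.url), { workerData: shared })
    // Simulate "no new tasks for a while": only notify the worker after 20s.
    setTimeout(() => Atomics.notify(shared, 0), 20000)
} else {
    const start = Date.now()
    // The "voided" async operation: it would be ready after ~5s...
    setTimeout(() => console.log(`callback ran after ${Date.now() - start}ms`), 5000)
    // ...but the thread blocks here waiting for the next task, so the
    // callback only runs once Atomics.wait returns (~20s later).
    Atomics.wait(workerData, 0, 0)
}
```

Run with no "tasks" arriving, the callback that was ready after 5s only fires after ~20s, which is exactly the gap between "request completed" and "request processed" seen in the buffer timings above.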

Solution

The solution is to listen to the piscina people and not void promises in tasks.

While this means adding another task each time, we're better off flushing everything via a job.
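As a rough sketch of that direction (names and options here are illustrative, not the final implementation): instead of doing the slow export inside `onFlush`, where the promise ends up voided inside the piscina task, `onFlush` only enqueues a job, and the job, which piscina awaits properly, does the actual work:

```js
import { createBuffer } from '@posthog/plugin-contrib'

export function setupPlugin ({ global, jobs }) {
    global.buffer = createBuffer({
        limit: 100,
        timeoutSeconds: 10,
        onFlush: async (batch) => {
            // Hand the batch off to a job instead of awaiting the slow export here
            await jobs.exportBatch({ batch }).runNow()
        },
    })
}

export const jobs = {
    exportBatch: async ({ batch }) => {
        await insertIntoBigQuery(batch) // hypothetical export function
    },
}
```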
