Conversation
mariusandra commented on May 27, 2021
- Closes "worth adding a bq_timestamp?" #8
- Needs more testing: it works for me, but is somehow very unstable:
This is problematic, as it also leads to duplicates in BigQuery (implying the timeout perhaps doesn't happen inside the BigQuery calls themselves, and the retries thus insert things into BigQuery more than once). Here's an example:
(example screenshot of duplicate rows omitted)
So, the problem here is entirely with BigQuery - it's the specific upload call that times out. I'm unsure what to do about this, except make minor tweaks to the options to ensure things return quickly. That's what I've done with this PR: disabled retrying by the BigQuery library (so only our retry queue retries) and made the dedupe option explicit.

FWIW, with these tweaks there are no problems on cloud (it's been running for a few hours now, and not a single batch took more than 2 seconds), while on local, batches still take up to 60 seconds. Shrug. It's not a question of how many events are sent, nor are we exceeding any quotas.

Finally, there are intermittent problems on upgrade: since the schema is changing, the change requires time to trickle down on BigQuery, which means in some cases users would have to disable/enable the plugin again to continue ingesting.
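For illustration, here's a minimal sketch of the kind of client setup described above. The exact options are an assumption on my part (see the diff for the real change); `autoRetry`/`maxRetries` are client-level options of @google-cloud/bigquery.

```typescript
import { BigQuery } from '@google-cloud/bigquery'

// Minimal sketch, not this PR's exact code: construct the client so that the SDK
// itself does not retry failed requests, leaving retries to the plugin-server's
// own retry queue. Using autoRetry/maxRetries here is an assumption.
export function createBigQueryClient(projectId: string, credentials: any): BigQuery {
    return new BigQuery({
        projectId,
        credentials, // parsed service-account JSON from the plugin config
        autoRetry: false,
        maxRetries: 0,
    })
}
```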
cc @mariusandra for review please :) Another thing I'm doing now (in a separate companion PR - PostHog/plugin-server#465) is making it easier to debug timeout errors like these in the plugin server, by reporting exactly what timed out.
A few minor points but otherwise looks fine - didn't run it though
index.ts
}
}

if (global.deduplicateEvents) {
Given we've disabled BQ retrying, is this used for anything?
We've disabled retrying via the bigquery library, not via the plugin-server, so yes, this does work as long as the retry doesn't happen more than a few minutes later.
I meant: why do we need to dedupe?
Oh, because exceeding timeouts via the async guard doesn't stop the request from succeeding (most times, depending on the case), so we're making two requests with the same data to BigQuery, which leads to duplicates.
This helps resolve some of them. (https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataconsistency)
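To make the idea concrete, here's a small sketch of what best-effort dedup looks like. It assumes each PostHog event carries a stable `uuid` (an assumption; any deterministic per-event key would do): if a retried request reuses the same insertId, BigQuery's streaming-insert de-duplication (the link above) can drop the duplicate, provided the retry lands within its short dedup window.

```typescript
// Row shape used by BigQuery's streaming insert API.
interface RawRow {
    insertId?: string
    json: Record<string, any>
}

// Build a row with a deterministic insertId so a retried insert of the same
// event is recognized as a duplicate by BigQuery (best effort only).
function eventToRow(event: { uuid?: string; [key: string]: any }, deduplicate: boolean): RawRow {
    const row: RawRow = { json: { ...event } }
    if (deduplicate && event.uuid) {
        row.insertId = event.uuid // same event => same insertId on every retry attempt
    }
    return row
}
```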
I see now, makes sense!
LGTM 😂
plugin.json
@@ -49,9 +49,10 @@
{
    "key": "deduplicateEvents",
    "name": "Attempt deduplicating events",
Given we have Yes and No options, it might make sense to word this as a question?
You mean something like: "Attempt deduplicating events?"
Yeah, for example
Regarding "deduplicateEvents", is this something we need to expose to the user? I'd assume my export to just work (TM), not that I need to know low level details like this. Reading the question, as a plugin user, I'm still not sure what do I need to choose here:
What quota limit? Can't we just choose the best default for the user?
So, I wanted to expose this because the amount of data we can push into BigQuery depends on this parameter. We choose a default that works for most users, and the ones that have A LOT of data can switch this toggle. Perhaps I could change the name to something like "high event volume?", but that feels pretty misleading.
How does the amount of data we can push depend on this parameter? What is A LOT? 😀
That said, I still think the inclusion of this option as it is makes it feel as if the entire export is flimsy. ("What do you mean I now have to monitor my export logs for quota errors? Where do I see them? What should I choose?" etc.) Can't we make some safe assumptions that will work for everyone? Or perhaps just relax/clarify the guarantees we provide (exported "at least once") and push the burden forward? Anything that makes users think is bad usability.
Given that, I like the idea of defaulting to no deduplication on our side and passing it all onto the user. On the flip side, while this makes the user think less at the beginning, it may easily lead to them seeing duplicates - and thinking the export isn't working properly. On the whole, the number of users who might see duplicates is a small fraction of the total, so, yep, what you say makes sense - let's not make a minority's issues a global concern. I'll add a link to the deduplication docs in the README, but remove it as an option from the plugin. Thanks for pushing back on this :)
Thanks! Yet if I read the line in the README, the first questions are: Hmm... duplicate events? What's that? Is this a problem? How often do they happen? Should I set up alerts in BigQuery to monitor for this? Can I trust this export? Can we add something to ease the mind of users? Something like: "There's a very rare case that could lead to duplicates: when, due to a network error, we see the export as failed, yet it actually reached BigQuery. We will then try the export again, which may lead to duplicates in BQ." And again, can we actually quantify how big of a problem this may be? The way it's set now, I find it hard to trust the plugin. :)
I think it's hard to quantify how big of a problem this may be. The BigQuery SDK also faces the same issue, so I'm not sure how to alleviate concerns here.
The new line in the readme is already a lot more reassuring. Just one more nit for the code, but otherwise looks good.
I'll still give it a quick test as well to make sure I'm not seeing the issues I saw earlier.
},
}

export const onEvent: BigQueryPlugin['onEvent'] = (event, { global }) => {
    if (!global.exportEventsToIgnore.has(event.event)) {
-       global.exportEventsBuffer.add(event)
+       global.exportEventsBuffer.add(event, JSON.stringify(event).length)
💯
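For context on what the second argument to add() does, here's a rough sketch, assuming the createBuffer helper from @posthog/plugin-contrib (limit / timeoutSeconds / onFlush) and a hypothetical insert helper; the limit value is illustrative, not the plugin's real setting. The buffer flushes when the accumulated "points" exceed the limit, so passing JSON.stringify(event).length turns the limit into an approximate byte budget per insert rather than a fixed event count.

```typescript
import { createBuffer } from '@posthog/plugin-contrib'

// Hypothetical helper standing in for the real insert path (e.g. a wrapper
// around global.bigQueryTable.insert(...)); named for illustration only.
async function insertIntoBigQuery(events: any[]): Promise<void> {
    // ... send the batch to BigQuery ...
}

const exportEventsBuffer = createBuffer({
    limit: 1024 * 1024, // ~1 MB of serialized events per insert (illustrative value)
    timeoutSeconds: 10, // flush at least this often, even if the byte budget isn't hit
    onFlush: async (batch) => {
        await insertIntoBigQuery(batch)
    },
})
```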
ip,
site_url,
timestamp: timestamp ? global.bigQueryClient.timestamp(timestamp) : null,
const object: {json: Record<string, any>, insertId?: string} = {
Can we simplify this object if not using insertId?
I could remove the insertId from the type, but other than that, nope, can't do, because the SDK by default inserts a random insertId unless I'm using the raw version, which requires the object to look like this ^
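A minimal sketch of what the raw insert path looks like (dataset/table names are illustrative, not the plugin's): with the `raw: true` insert option, rows are passed through in the streaming-API `{ json, insertId? }` format, so we either set insertId ourselves (for dedup) or omit it entirely, instead of the SDK attaching a random one.

```typescript
import { BigQuery } from '@google-cloud/bigquery'

// Insert rows already shaped as { json, insertId? }, bypassing the SDK's own
// row wrapping (and its auto-generated random insertId).
async function insertRows(rows: { json: Record<string, any>; insertId?: string }[]): Promise<void> {
    const table = new BigQuery().dataset('posthog').table('events') // illustrative names
    await table.insert(rows, { raw: true })
}
```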
It probably wouldn't fix all issues on your local: #9 (comment) - but just make them less frequent.
I don't know why, but this plugin is still really unstable for me locally: "Adding event" is printed in … There's one big elephant in the room here: the …
Hmm, where does it "take more than 10s to flush"? (Having a hard time understanding from the image; is it 12:07 -> 12:10?) I thought there might be something wonky with the … Regarding never finishing 👀 - was it because of a …? What I did find out when I raised the timeout limit (to sanity-check everything) was that it's the BigQuery client call that takes > 30s at times on local (didn't dive deeper into why it's a local-specific problem).
Hmm, I'll look into this specific part, see if I run into the same thing. But since this issue is unrelated to this specific PR, can we still merge this one? (i.e., if you run the existing plugin, you'll face these same issues on local dev, too)
I'll test it one more time using clickhouse as the database. Until now I just tested with postgres. Perhaps this will avoid all the problems with logging (logs go through kafka, not straight to postgres) and everything will just work... 🤔
Switching to EE mode (no logs in postgres), those errors disappeared, but it's still a really flaky experience locally :'( A few things I just don't understand.
If this system is acting weird and irrationally on my Intel MacBook, it'll probably also act weird and irrationally for someone else, possibly in production. Now it just segfaulted for me as well (screenshots above). Thus I suggest finding out what is actually happening and fixing it as priorities number 0, 1 and 2 for the next sprint... and for this one. :) However, if it makes anything better, I'm experiencing similar problems with the latest … At the end of the day, the retrying logic is also working, and eventually things are retried and sent... unless it crashes totally, which has also happened. I just don't know :D
Okay, wow, did not expect to see a segfault here. So, to summarize: (1) isn't an issue anymore. (2) is maybe something to do with switching to EE? Definitely need to investigate this. And (3), BigQuery being flaky locally, is still a problem, and possibly one every self-hosted version would face? (Need to try a self-deployment and test this out too.) Agree on investigating BigQuery slowness (3) and (2), both. (But I still think neither blocks this PR 😅)
Yeah, I think it's fine to merge this... but please test, test, test and keep an eye on it :)