feat: compress replay data #1436
Conversation
Size Change: +11.2 kB (+0.4%) Total Size: 2.8 MB
function gzipToString(data: unknown): string {
    return strFromU8(gzipSync(strToU8(JSON.stringify(data))), true)
}
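For context, a minimal sketch of the inverse transformation, assuming these helpers come from fflate (which exports gzipSync, gunzipSync, strToU8 and strFromU8); the `true` flag maps the bytes to a latin1 string so the gzip output survives the byte-to-string round trip. `stringToObject` is hypothetical and not part of this PR:

import { gunzipSync, strToU8, strFromU8 } from 'fflate'

// latin1 string -> gzip bytes -> decompressed bytes -> JSON string -> object
function stringToObject(compressed: string): unknown {
    return JSON.parse(strFromU8(gunzipSync(strToU8(compressed, true))))
}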
Have you considered streaming, at least for the full snapshots? Doing the Object → String → Array → gzipped Array → String transformation synchronously seems memory-intensive
(Though I imagine it would need a bigger refactor to the network code)
Streaming in the browser? 🤔 certainly not a thing I've ever heard of?
It's new-ish (at least for sending, reading the response as a stream has been available for a while): https://developer.chrome.com/docs/capabilities/web-apis/fetch-streaming-requests#using_with_writable_streams
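For illustration, a rough sketch of what a streaming upload could look like with CompressionStream and a ReadableStream request body (hypothetical function name and endpoint; this is not what the PR does, needs a recent Chromium-based browser over HTTP/2, and requires the duplex option). The JSON string is still built up front here, but the compression and the upload are streamed:

// Hypothetical: gzip the snapshot on the fly instead of building the whole
// compressed string in memory before sending it.
async function sendSnapshotStreaming(url: string, snapshot: unknown): Promise<Response> {
    const body = new Blob([JSON.stringify(snapshot)])
        .stream()
        .pipeThrough(new CompressionStream('gzip'))

    const init: RequestInit & { duplex: 'half' } = {
        method: 'POST',
        headers: { 'Content-Type': 'application/json', 'Content-Encoding': 'gzip' },
        body,
        // required for requests with a ReadableStream body; not yet in all TS lib typings
        duplex: 'half',
    }
    return fetch(url, init)
}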
we'd not... that's interesting.
certainly we've talked about having a fast path for full snapshots at least, since if all we capture are full snapshots we can play something back
Couldn't help but comment on an interesting PR...
...event,
cv: '2024-10',
data: {
    ...event.data,
Not sure what gains this really gives.
Why not pull out any counters you need from the data and then just compress the entire object? In theory it would never need to be decompressed server-side: it could be written straight to the S3 file, for example, and only decompressed at read time.
Or at least it would be a simple single decompression of the whole payload rather than multiple ones?
so... we know that when compressed the data fits into Kafka, but when uncompressed it doesn't always (and we reject it unnecessarily)
when I looked at pulling the metadata out (and it's fair, I don't have this written down anywhere but in my head)
I at least had to change the SDK, django capture, blobby, and maybe playback, depending on how it was implemented
and we'd still have multiple compress/decompress steps since we're doing that at network boundaries anyway
this way I can test the impact with a relatively small change
> what gains this really gives.
it should mean that the data that is currently not making it into Kafka from rust or django capture does make it in
without having to change anything else
so I can probe whether that's (the last|a) piece of the puzzle for playback issues
#BiasForImpact :p
We want to compress replay payloads since they're sometimes massive and very compressible.
We already compress the entire payload while it travels over the network and while it sits in storage, but at some points in processing it is not compressed; leaving some of the data compressed throughout makes life easier on our infra.
But blob ingestion needs to read some metadata from some events, so we don't want to just compress everything.
So, posthog-js now supports partially compressing selected payloads.
This only compresses when the feature is enabled, so we can test it on ourselves first and see what impact it has.
pairs with PostHog/posthog#25183
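As a rough illustration of the shape of the idea (not the actual posthog-js change; the event shape and the choice of `node` as the heavy field are assumptions for the example): compress only the bulky part of an event's data, keep the metadata blob ingestion needs as plain JSON, and mark the event with a compression version so downstream readers know what to expect.

// Illustrative only: `gzipToString` is the helper from this PR; field names are hypothetical.
function partiallyCompress(event: { type: number; data: Record<string, unknown> }) {
    return {
        ...event,
        cv: '2024-10', // compression version marker, as in the diff above
        data: {
            ...event.data, // metadata stays readable for blob ingestion
            node: gzipToString(event.data.node), // only the heavy payload is gzipped
        },
    }
}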