feat(replays): initial replays clickhouse migration #2681
@@ -0,0 +1,96 @@
from typing import Sequence

from snuba.clickhouse.columns import (
    UUID,
    Array,
    Column,
    DateTime,
    IPv4,
    IPv6,
    Nested,
    String,
    UInt,
)
from snuba.clusters.storage_sets import StorageSetKey
from snuba.migrations import migration, operations, table_engines
from snuba.migrations.columns import MigrationModifiers as Modifiers

raw_columns: Sequence[Column[Modifiers]] = [
    Column("replay_id", UUID()),
    Column("sequence_id", UInt(16)),
Review thread (sequence_id):
Reviewer: Is there any sort of relation between …
    Column("trace_ids", Array(UUID())),
Review thread (trace_ids):
Reviewer: What are these trace_ids? Are they supposed to be a pointer to some other piece of data in one of our systems?
Author: getsentry/sentry-replay#38 (comment). Replays can have N trace_ids, and each update may have N of them; trace_id will be the link between them to start (we won't be doing any joins on them).
Reviewer: Did you ever get a sense of how many trace IDs could be in this field?
Author: On a per-row basis it likely won't be more than 10.
(A hypothetical lookup sketch appears after the column list below.)
    Column("title", String()),
    ### columns used by other sentry events
    Column("project_id", UInt(64)),
    # time columns
    Column("timestamp", DateTime()),
    # release/environment info
    Column("platform", String(Modifiers(low_cardinality=True))),
    Column("environment", String(Modifiers(nullable=True, low_cardinality=True))),
    Column("release", String(Modifiers(nullable=True))),
    Column("dist", String(Modifiers(nullable=True))),
    Column("ip_address_v4", IPv4(Modifiers(nullable=True))),
    Column("ip_address_v6", IPv6(Modifiers(nullable=True))),
    # user columns
    Column("user", String()),
    Column("user_hash", UInt(64)),
    Column("user_id", String(Modifiers(nullable=True))),
    Column("user_name", String(Modifiers(nullable=True))),
    Column("user_email", String(Modifiers(nullable=True))),
    # sdk info
    Column("sdk_name", String()),
    Column("sdk_version", String()),
    Column("tags", Nested([("key", String()), ("value", String())])),
Review thread (tags):
Reviewer: For performance reasons, you might want to add a bloom filter index on tags, as we do on some of our other datasets.
Author: Good call 👍🏼, will look at adding those.
Reviewer: Are you actually going to search for replays by tag key/value?
Author: Likely.
    # internal data
    Column("retention_days", UInt(16)),
    Column("partition", UInt(16)),
    Column("offset", UInt(64)),
]
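For context on the trace_ids thread above: a minimal, hypothetical sketch of the rudimentary lookup described later in this review (take the trace IDs from a single page and check whether any replay references them). The table and column names mirror this migration, but the query itself is illustrative and not part of this PR's diff.

# Hypothetical "does any replay reference these traces?" lookup.
# hasAny() is a ClickHouse array function; the %(...)s placeholders follow the
# clickhouse-driver parameter style and are illustrative.
REPLAYS_FOR_TRACES_SQL = """
SELECT replay_id, trace_ids
FROM replays_local
WHERE project_id = %(project_id)s
  AND timestamp >= %(start)s
  AND timestamp < %(end)s
  AND hasAny(trace_ids, %(trace_ids)s)
"""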

class Migration(migration.ClickhouseNodeMigration):
    blocking = False

    def forwards_local(self) -> Sequence[operations.SqlOperation]:
        return [
            operations.CreateTable(
                storage_set=StorageSetKey.REPLAYS,
                table_name="replays_local",
                columns=raw_columns,
                engine=table_engines.ReplacingMergeTree(
Review thread (ReplacingMergeTree):
Reviewer: Are you going to use the replacing feature for something (aside from removing duplicates, which is a good idea anyway)?
Author: No, just removing duplicates.
                    storage_set=StorageSetKey.REPLAYS,
                    order_by="(project_id, toStartOfDay(timestamp), cityHash64(replay_id), sequence_id)",
Review thread (order_by):
Reviewer: Just to confirm, items with the same replay_id can still span multiple days, right?
Author: Yes, and I'm glad you brought this up. The intention is that replays which span multiple days will only show up when the initial event is within the queried time range.
Reviewer: Probably an edge case, but if we do receive replays on different days with the same replay_id and sequence_id, they will not get merged together and we'd need a strategy to deduplicate them when querying.
Reviewer: @JoshFerge could you please provide some insight into the most common query pattern you expect? The ORDER BY key has to be defined based on the expected query pattern. You cannot change it once done without rebuilding the table entirely, and getting it wrong will make your query performance miserable.
Author: https://www.notion.so/sentry/Addendum-Replay-Queries-fcfd8e68679e443e87649014cf10ae62
Reviewer: Re: the document you linked: "As a replays user, I want to see all replays where an error occurred"; "As a performance user, I want to see if this trace has a replay associated with it" …
Author: Not a mistake. For now, we will do a very rudimentary query where we take the trace IDs from a single page and look them up to determine if there is an error associated, or do a search on the errors table looking for the replays tag. The issue is that since errors can be sampled/dropped (and replays too in the future), tagging each other's events with the IDs is problematic because it's not guaranteed that the tagged ID will exist. We'll likely need a separate table, generated in event post-processing, that can accurately associate ingested events with replays; this will come in a future iteration.
Author: We'll also be adding replay_id to other events, so for example this search can use transactions tagged with a replay ID. (There is still the sampling problem, but we're not going to worry about that in the first iteration.)
(A hypothetical query-side deduplication sketch appears after the forwards_local definition below.)

Review thread (bloom filter index, continued):
Author: For now I will not add the bloom filter index; I'll add a TODO. Something I can follow up on.
Reviewer: Please do not wait on this. Not doing it means a full table scan each time, and the effort to add the index is minimal. ClickHouse tables get large very quickly.
Author: Went ahead and added the index 👍🏼
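For reference, a rough sketch of what adding such an index could look like in a Snuba migration. It assumes an operations.AddIndex operation with roughly this shape; the index name, expression, and granularity are illustrative and not necessarily what this PR ultimately shipped.

# Hypothetical follow-up operation (assumes operations.AddIndex with this
# signature); it would be appended to the list returned by forwards_local.
# A bloom filter index over the nested tag keys lets ClickHouse skip granules
# that cannot contain the searched tag.
add_tags_index = operations.AddIndex(
    storage_set=StorageSetKey.REPLAYS,
    table_name="replays_local",
    index_name="bf_tags_key",      # illustrative name
    index_expression="tags.key",   # nested tag keys
    index_type="bloom_filter",
    granularity=1,
)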
                    partition_by="(retention_days, toMonday(timestamp))",
                    settings={"index_granularity": "8192"},
                    ttl="timestamp + toIntervalDay(retention_days)",
                ),
            ),
        ]
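As noted in the order_by thread above, ReplacingMergeTree only collapses rows that share a sorting key and land in the same part at merge time, so duplicates can still be visible at read time. A minimal, hypothetical sketch of query-side deduplication using ClickHouse's LIMIT BY clause (illustrative only, not part of this PR):

# Keep a single row per (replay_id, sequence_id) at query time. The %(...)s
# placeholders follow the clickhouse-driver parameter style and are illustrative.
DEDUPED_REPLAY_EVENTS_SQL = """
SELECT *
FROM replays_local
WHERE project_id = %(project_id)s
  AND timestamp >= %(start)s
  AND timestamp < %(end)s
ORDER BY replay_id, sequence_id, timestamp
LIMIT 1 BY replay_id, sequence_id
"""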

    def backwards_local(self) -> Sequence[operations.SqlOperation]:
        return [
            operations.DropTable(
                storage_set=StorageSetKey.REPLAYS,
                table_name="replays_local",
            ),
        ]

    def forwards_dist(self) -> Sequence[operations.SqlOperation]:
        return [
            operations.CreateTable(
                storage_set=StorageSetKey.REPLAYS,
                table_name="replays_dist",
                columns=raw_columns,
                engine=table_engines.Distributed(
                    local_table_name="replays_local",
                    sharding_key="project_id",
Review thread (sharding_key):
Reviewer: What is the reason for sharding the data by project_id?
Author: Good point, I think I just chose this arbitrarily. I'll shard by …
                ),
            ),
        ]
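The reply above is cut off, so it is unclear which sharding key was ultimately chosen. Purely as an illustration of an alternative to project_id: sharding on a hash of replay_id (mirroring the cityHash64(replay_id) expression already used in the ORDER BY) would keep every segment of a given replay on one shard and spread large projects across the cluster. The snippet below is hypothetical and reuses the Distributed engine signature from this file.

# Hypothetical alternative sharding key (not necessarily what this PR adopted):
# hashing the replay_id keeps all rows for one replay on a single shard.
engine = table_engines.Distributed(
    local_table_name="replays_local",
    sharding_key="cityHash64(replay_id)",
)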

    def backwards_dist(self) -> Sequence[operations.SqlOperation]:
        return [
            operations.DropTable(
                storage_set=StorageSetKey.REPLAYS, table_name="replays_dist"
            ),
        ]
Review thread (sequence_id):
Reviewer: Who generates the sequence_id? On the SDK? Sentry? Snuba? What's the max number allowed here?
Author: The SDK will generate the sequence ID. The max will be between ~100 and ~1000 (we will be capping the maximum replay length time-wise and from there pick a sane max sequence_id).
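For scale, the UInt(16) column used for sequence_id can hold values up to 2^16 - 1 = 65535, so a cap in the hundreds-to-low-thousands range fits with ample headroom.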