feat(replays): initial replays clickhouse migration #2681

Merged: 7 commits, merged Jun 13, 2022
1 change: 1 addition & 0 deletions snuba/clusters/storage_sets.py
@@ -31,6 +31,7 @@ class StorageSetKey(Enum):
ERRORS_V2 = "errors_v2"
ERRORS_V2_RO = "errors_v2_ro"
PROFILES = "profiles"
REPLAYS = "replays"


# Storage sets enabled only when development features are enabled.
13 changes: 13 additions & 0 deletions snuba/migrations/groups.py
@@ -17,6 +17,7 @@ class MigrationGroup(Enum):
SESSIONS = "sessions"
QUERYLOG = "querylog"
PROFILES = "profiles"
REPLAYS = "replays"


# Migration groups are mandatory by default, unless they are on this list
@@ -25,6 +26,7 @@ class MigrationGroup(Enum):
MigrationGroup.SESSIONS,
MigrationGroup.QUERYLOG,
MigrationGroup.PROFILES,
MigrationGroup.REPLAYS,
}


@@ -156,6 +158,16 @@ def get_migrations(self) -> Sequence[str]:
]


class ReplaysLoader(DirectoryLoader):
def __init__(self) -> None:
super().__init__("snuba.migrations.snuba_migrations.replays")

def get_migrations(self) -> Sequence[str]:
return [
"0001_replays",
]


class MetricsLoader(DirectoryLoader):
def __init__(self) -> None:
super().__init__("snuba.migrations.snuba_migrations.metrics")
@@ -233,6 +245,7 @@ def get_migrations(self) -> Sequence[str]:
MigrationGroup.SESSIONS: SessionsLoader(),
MigrationGroup.QUERYLOG: QuerylogLoader(),
MigrationGroup.PROFILES: ProfilesLoader(),
MigrationGroup.REPLAYS: ReplaysLoader(),
}


96 changes: 96 additions & 0 deletions snuba/migrations/snuba_migrations/replays/0001_replays.py
@@ -0,0 +1,96 @@
from typing import Sequence

from snuba.clickhouse.columns import (
UUID,
Array,
Column,
DateTime,
IPv4,
IPv6,
Nested,
String,
UInt,
)
from snuba.clusters.storage_sets import StorageSetKey
from snuba.migrations import migration, operations, table_engines
from snuba.migrations.columns import MigrationModifiers as Modifiers

raw_columns: Sequence[Column[Modifiers]] = [
Column("replay_id", UUID()),
Column("sequence_id", UInt(16)),
Member: Who generates the sequence_id? On the SDK? Sentry? Snuba? What's the max number allowed here?

Member Author: The SDK will generate the sequence_id. The max will be somewhere between ~100 and ~1000 (we will be capping the maximum replay length time-wise and from there find a sane max sequence_id).

Member: Is there any sort of relation between the sequence_id and replay_id fields?

Member Author: sequence_id will be a monotonically increasing counter, so within each replay_id, sequence_id is unique.
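For context, a purely illustrative sketch of the read path this counter enables; the query below is not part of this PR and assumes direct SQL access to the local table:

# Illustrative only: fetch one replay's segments in playback order using the
# per-replay monotonically increasing sequence_id. Table and column names come
# from this migration; the query itself is hypothetical.
FETCH_REPLAY_SEGMENTS = """
SELECT replay_id, sequence_id, timestamp, trace_ids, title
FROM replays_local
WHERE project_id = %(project_id)s
  AND replay_id = %(replay_id)s
ORDER BY sequence_id ASC
"""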

Column("trace_ids", Array(UUID())),
Member: What are these trace_ids? Are they supposed to be a pointer to some other piece of data in one of our systems?

Member Author: See getsentry/sentry-replay#38 (comment). Replays can have N trace_ids, and each update may carry N of them; the trace_id will be the link between them to start (we won't be doing any joins on them).

Member: Did you ever get a sense of how many trace IDs could be in this field?

Member Author: On a per-row basis it likely won't be more than 10.
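As a rough illustration of the lookup pattern discussed further down in this review (finding replays for a given trace), a hypothetical query; without a data-skipping index this scans the whole column:

# Illustrative only: find replays that reference a given trace id. The has()
# lookup over Array(UUID) is a full scan unless a bloom filter index is added
# (see the index discussion later in this thread).
FIND_REPLAYS_FOR_TRACE = """
SELECT DISTINCT replay_id
FROM replays_local
WHERE project_id = %(project_id)s
  AND has(trace_ids, toUUID(%(trace_id)s))
"""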

Column("title", String()),
### columns used by other sentry events
Column("project_id", UInt(64)),
# time columns
Column("timestamp", DateTime()),
# release/environment info
Column("platform", String(Modifiers(low_cardinality=True))),
Column("environment", String(Modifiers(nullable=True, low_cardinality=True))),
Column("release", String(Modifiers(nullable=True))),
Column("dist", String(Modifiers(nullable=True))),
Column("ip_address_v4", IPv4(Modifiers(nullable=True))),
Column("ip_address_v6", IPv6(Modifiers(nullable=True))),
# user columns
Column("user", String()),
Column("user_hash", UInt(64)),
Column("user_id", String(Modifiers(nullable=True))),
Column("user_name", String(Modifiers(nullable=True))),
Column("user_email", String(Modifiers(nullable=True))),
# sdk info
Column("sdk_name", String()),
Column("sdk_version", String()),
Column("tags", Nested([("key", String()), ("value", String())])),
Member: For performance reasons, you might want to add a bloom filter index on tags, as we do on some of our other datasets.

Member Author: Good call 👍🏼 will look at adding those.

Contributor: Are you actually going to search for replays by tag key/value?

Member Author: Likely.
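For reference, the kind of tag search such an index would speed up; a hypothetical query, noting that ClickHouse flattens the Nested column into parallel tags.key / tags.value arrays:

# Illustrative only: filter replays by a tag key/value pair using the
# flattened Nested arrays. A bloom filter index over the tag keys (or a hashed
# key/value map) would let ClickHouse skip granules for this predicate.
FIND_REPLAYS_BY_TAG = """
SELECT replay_id
FROM replays_local
WHERE project_id = %(project_id)s
  AND tags.value[indexOf(tags.key, %(tag_key)s)] = %(tag_value)s
"""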

# internal data
Column("retention_days", UInt(16)),
Column("partition", UInt(16)),
Column("offset", UInt(64)),
]


class Migration(migration.ClickhouseNodeMigration):
blocking = False

def forwards_local(self) -> Sequence[operations.SqlOperation]:
return [
operations.CreateTable(
storage_set=StorageSetKey.REPLAYS,
table_name="replays_local",
columns=raw_columns,
engine=table_engines.ReplacingMergeTree(
Contributor: Are you going to use the replacing feature for anything (aside from removing duplicates, which is a good idea anyway)?

Member Author: No, just removing duplicates.

storage_set=StorageSetKey.REPLAYS,
order_by="(project_id, toStartOfDay(timestamp), cityHash64(replay_id), sequence_id)",
Member: Just to confirm, items with the same replay_id can still span multiple days, right?

Member Author: Yes, and I'm glad you brought this up. The intention is that replays which span multiple days will only show up when the initial event is within the queried time range.

Member: Probably an edge case, but if we do receive replays on different days with the same replay_id and sequence_id, they will not get merged together and we'd need a strategy to deduplicate them when querying.
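One possible query-time strategy for that edge case, sketched here only as an illustration (nothing in this migration prescribes it):

# Illustrative only: collapse duplicate (replay_id, sequence_id) rows at read
# time. ReplacingMergeTree only replaces rows that share the full ORDER BY key
# within the same partition, so duplicates arriving on different days survive.
DEDUPLICATED_SEGMENTS = """
SELECT replay_id, sequence_id, any(timestamp) AS timestamp
FROM replays_local
WHERE project_id = %(project_id)s
GROUP BY replay_id, sequence_id
"""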

Contributor: @JoshFerge Could you please give some insight into the most common query patterns you expect? What will you filter by most of the time? What will you aggregate, if anything? The ORDER BY key has to be defined based on the expected query pattern: you cannot change it later without rebuilding the table entirely, and getting it wrong will make your query performance miserable. The expected query pattern also determines which data-skipping indexes (if any) should be added; we cannot add indexes to every column, as the type of index depends on the query you want to make faster.

Member Author: See https://www.notion.so/sentry/Addendum-Replay-Queries-fcfd8e68679e443e87649014cf10ae62. We will be happy to rebuild the table entirely while we are testing over the next several months, so we view all data as temporary and will make that clear to any customers testing. We will also be building several use cases on top of this initial one that may require us to rebuild the tables in any case.

Contributor: Regarding the document you linked:

"As a replays user, I want to see all replays where an error occurred": the table designed here does not seem to have a reference to an issue or an error. Is that correct or a mistake?

"As a performance user, I want to see if this trace has a replay associated with it": if you want to search the replays table with WHERE has(trace_ids, 'asdasdasdasd'), please add a bloom filter index on that column (which may require creating a materialized version with hashes of that column); otherwise your search will be miserable. But it would be better to add a replay id to the transaction in some way so you do not have to scan the whole replays table to associate replays with traces.

Member Author: Not a mistake. For now we will do a very rudimentary query where we take the trace ids from a single page and look them up to determine whether there is an associated error, or search the errors table for the replays tag. The issue is that since errors can be sampled or dropped (and replays too, in the future), tagging each other's events with the ids is problematic because it's not guaranteed that the tagged id will exist. We'll likely need a separate table, generated during event post-processing, that can accurately associate ingested events with replays; that will come in a future iteration. As for adding a replay id on the transaction: we'll also be adding replay_id to other events, so for example this search can use transactions tagged with a replay id (there is still the sampling problem, but we're not going to worry about that in the first iteration).

Member Author: For now I will not add the bloom filter index; I'll add a TODO and follow up on it.

Contributor (@fpacifici, Jun 10, 2022): Please do not wait on this. Not doing it means a full table scan every time, and the effort to add the index is minimal. ClickHouse tables get large very quickly.

Member Author: Went ahead and added the index 👍🏼
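A sketch of what that follow-up could look like, assuming Snuba's operations.AddIndex has roughly this shape; the index name and expression here are hypothetical and not necessarily what the later commit added:

# Hypothetical operation, not the actual follow-up commit: a bloom filter
# data-skipping index over hashed trace ids so has(trace_ids, ...) lookups
# can skip granules instead of scanning the whole table.
operations.AddIndex(
    storage_set=StorageSetKey.REPLAYS,
    table_name="replays_local",
    index_name="bf_trace_ids_hashed",
    index_expression="arrayMap(t -> cityHash64(toString(t)), trace_ids)",
    index_type="bloom_filter()",
    granularity=1,
),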

partition_by="(retention_days, toMonday(timestamp))",
settings={"index_granularity": "8192"},
ttl="timestamp + toIntervalDay(retention_days)",
),
),
]

def backwards_local(self) -> Sequence[operations.SqlOperation]:
return [
operations.DropTable(
storage_set=StorageSetKey.REPLAYS,
table_name="replays_local",
),
]

def forwards_dist(self) -> Sequence[operations.SqlOperation]:
return [
operations.CreateTable(
storage_set=StorageSetKey.REPLAYS,
table_name="replays_dist",
columns=raw_columns,
engine=table_engines.Distributed(
local_table_name="replays_local",
sharding_key="project_id",
Member: What is the reason for sharding the data by project_id versus sharding randomly? One disadvantage I can see with sharding by project_id is that if there is a big project which uses replays a lot, the shards could become imbalanced.

Member Author: Good point, I think I just chose this arbitrarily. I'll shard by replay_id instead.
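A sketch of that revision, hedged: the hash expression below is an assumption rather than the exact follow-up change, and the Distributed engine's sharding key must be an integer expression, hence hashing the UUID:

# Hypothetical revision, not the actual follow-up commit: shard by a hash of
# replay_id so a single large project cannot concentrate all of its rows on
# one shard.
engine=table_engines.Distributed(
    local_table_name="replays_local",
    sharding_key="cityHash64(toString(replay_id))",
),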

),
),
]

def backwards_dist(self) -> Sequence[operations.SqlOperation]:
return [
operations.DropTable(
storage_set=StorageSetKey.REPLAYS, table_name="replays_dist"
),
]
Empty file.
3 changes: 2 additions & 1 deletion snuba/settings/__init__.py
@@ -51,6 +51,7 @@
"errors_v2",
"errors_v2_ro",
"profiles",
"replays",
},
"single_node": True,
},
@@ -167,7 +168,7 @@
COLUMN_SPLIT_MAX_RESULTS = 5000

# Migrations in skipped groups will not be run
SKIPPED_MIGRATION_GROUPS: Set[str] = {"querylog", "profiles"}
SKIPPED_MIGRATION_GROUPS: Set[str] = {"querylog", "profiles", "replays"}
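Because the group is in this skip list, a deployment that wants the replays table has to opt in by overriding the setting; a hypothetical settings override, not part of this PR:

# Hypothetical override in a deployment-specific settings module: run the
# replays migrations by leaving "replays" out of the skip list.
SKIPPED_MIGRATION_GROUPS: Set[str] = {"querylog", "profiles"}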

MAX_RESOLUTION_FOR_JITTER = 60

1 change: 1 addition & 0 deletions snuba/settings/settings_distributed.py
@@ -24,6 +24,7 @@
"errors_v2",
"errors_v2_ro",
"profiles",
"replays",
},
"single_node": False,
"cluster_name": "cluster_one_sh",