senpai agent poc (ai hackweek) #45

Draft
sl0thentr0py wants to merge 1 commit into main from neel/senpai

Conversation

sl0thentr0py commented Feb 17, 2026

Best used as follows for full output:

claude --agent sentry-skills:senpai
explain relay's role in sentry's infrastructure

or like this, which delegates to the subagent but adds an unnecessary extra summarizing step:

ask senpai to explain relay's role in sentry's infrastructure

The outputs below are single-question agent invocations.


sl0thentr0py commented Feb 17, 2026

❯ senpai can you explain what relay is

What is Sentry Relay?

Quick Answer

Relay is Sentry's event forwarding and ingestion service - it acts as a high-performance proxy that sits between your application (where the Sentry SDK runs) and Sentry's backend. It handles the first stage of processing events, including validation, filtering, rate limiting, and normalization before forwarding data to Kafka for further processing.

How It Works

What Problem Does Relay Solve?

Think of Relay as a smart gateway or bouncer for Sentry events. When millions of events come in from SDKs around the world, you need something that can:

  • Quickly validate that events are legitimate and properly formatted
  • Apply rate limits to prevent overwhelming the system
  • Filter out unwanted events early (before expensive processing)
  • Perform initial data normalization

Relay handles all of this at the edge, offloading work from the core Sentry backend.

Two Operating Modes

Relay can run in two different modes depending on your needs:

1. Proxy Mode (Simple Forwarding)

  • Acts as a lightweight proxy between SDKs and Sentry
  • Validates incoming events and forwards them upstream
  • Useful for customers who want to run Relay in their own infrastructure for data privacy/compliance
  • Minimal processing overhead

2. Processing Mode (Full Ingestion)

  • This is what Sentry uses internally for SaaS and self-hosted installations
  • Performs comprehensive event processing, normalization, and filtering
  • Produces events directly into Kafka for the rest of the pipeline
  • Requires connection to Redis (for project configs) and Kafka (for event output)

The Ingestion Pipeline Flow

Here's how an event flows through Relay in processing mode (the typical setup):

1. SDK sends event → Relay (via nginx)
2. Relay validates DSN and Project ID (checks Redis for project config)
3. Relay processes the envelope (parses, normalizes, applies rate limits)
4. Relay produces processed events to Kafka topics
5. Kafka → Sentry ingest consumers pick up events for further processing

From the data flow diagram I found, the complete path looks like:

Internet → nginx → Relay → Kafka → ingest-consumers → more processing
                     ↓
                   Redis (for project configs)
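
From the application's point of view this whole pipeline is invisible. A minimal sketch of step 1 (SDK sends event to Relay), assuming the sentry-python SDK; the DSN is a placeholder:

```python
# Minimal sketch of "SDK sends event"; everything downstream of Relay is invisible here.
import sentry_sdk

sentry_sdk.init(
    dsn="https://examplePublicKey@o0.ingest.sentry.io/0",  # placeholder DSN
    traces_sample_rate=1.0,
)

try:
    1 / 0
except ZeroDivisionError:
    # Serialized into an envelope and POSTed to Relay, which then validates,
    # rate-limits, normalizes, and produces to Kafka.
    sentry_sdk.capture_exception()
```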

Key Responsibilities

Relay handles several critical functions:

  • Envelope Processing: SDKs send data in "envelopes" containing events, attachments, sessions, etc. Relay parses these and routes different data types appropriately
  • Project Config Management: Relay caches project configurations from Redis to know how to handle events for each project
  • Rate Limiting: Enforces quotas and rate limits at the edge
  • Data Scrubbing: Removes sensitive data (PII) based on project settings
  • Event Normalization: Ensures event data conforms to expected schema
  • Smart Filtering: Can drop events based on various rules before expensive processing

Technical Details

  • Written in Rust: Built for high performance and memory safety
  • Stateless: Relay instances can be horizontally scaled easily
  • Config-driven: Uses YAML configuration files and fetches project-specific configs from Redis
  • Kafka Integration: In processing mode, writes to Kafka topics for downstream consumers

Why It Matters

Relay is critical to Sentry's architecture because:

  1. Performance: Handles millions of events per second with minimal latency
  2. Scalability: Stateless design allows easy horizontal scaling
  3. Cost Efficiency: Filters and rate-limits early, reducing unnecessary processing
  4. Flexibility: Customers can run their own Relay instances for data residency requirements
  5. Reliability: Acts as a buffer between SDKs and the backend

Key Concepts

  • Envelope: The container format SDKs use to send data to Relay - can contain multiple items (events, attachments, sessions)
  • Project Config: Configuration data Relay fetches from Redis that tells it how to handle events for each project (rate limits, filters, PII rules)
  • Data Category: Classification of data types (errors, transactions, attachments) used for rate limiting and routing
  • DSN (Data Source Name): The URL/key that SDKs use to send data - Relay validates these

References

  1. Relay GitHub Repository - Main codebase
  2. Relay Developer Documentation - Code and development docs
  3. Relay Product Documentation - User-facing documentation
  4. Self-hosted Data Flow - Shows how Relay fits into the overall architecture
  5. Ingestion Documentation - Engineering docs on ingestion pipeline

Next Steps

To deepen your understanding of Relay:

  1. Explore the codebase: Check out the /tmp/relay repository structure to see how it's organized
  2. Run Relay locally: Try setting up a local development instance following the README
  3. Understand envelopes: Look into how SDKs format data and how Relay parses it
  4. Learn about project configs: Explore what configuration options Relay pulls from Redis
  5. Follow event flow: Trace a single event through the entire pipeline from SDK → Relay → Kafka → processing

Research sources: getsentry/relay repository (README.md), getsentry/sentry-docs/develop-docs (data-flow.mdx, ingestion/relay/index.mdx, self-hosted/index.mdx)

Generated by senpai agent (Agent ID: a1c1738)


sl0thentr0py commented Feb 17, 2026

 ❯  ask senpai to explain envelope formats to me

Sentry Envelope Formats

Quick Answer

Envelopes are Sentry's universal wire format for sending data from SDKs to the Sentry backend. Think of an envelope like a postal package: it has a shipping label (envelope headers) and can contain one or more items inside (events, attachments, sessions, replays, profiles, logs, and more). This format replaced older single-purpose HTTP endpoints and lets SDKs batch multiple, heterogeneous pieces of data into a single HTTP request.


Why Envelopes Exist

Before envelopes, SDKs had to use separate endpoints and separate HTTP requests for different data types -- one call for an error event, another for an attachment, yet another for session data. This was inefficient, especially when you wanted to send related data together (like a crash event plus a minidump file plus a screenshot attachment).

Envelopes solve several problems at once:

  • Batching: You can combine an error event, its attachments, and a client report all in one HTTP request.
  • Heterogeneous payloads: Binary attachments and JSON event payloads can coexist in a single submission.
  • Hop-to-hop communication: Envelopes work not just between SDK and server, but also between Relays in a chain, and between different SDKs (e.g., a Native SDK embedded in a mobile SDK).
  • Offline storage: The format is suitable for writing to disk when network connectivity is unavailable, then transmitting later.

The Structure of an Envelope

An envelope follows a simple line-delimited format (not unlike HTTP multipart form data, but simpler). Here is the grammar:

Envelope = EnvelopeHeaders "\n" { Item } [ "\n" ]
Item     = ItemHeaders "\n" Payload
Payload  = { arbitrary bytes }

In plain English, an envelope is:

  1. One line of JSON: the envelope headers
  2. Zero or more items, each consisting of:
    • One line of JSON: the item headers
    • The item payload (whose length is either declared in a length header or is implicitly "everything up to the next newline")
  3. An optional trailing newline

Key Rules

  • Newlines are always UNIX-style \n (ASCII 10). A \r before \n is treated as part of the payload, not as a line ending.
  • All headers are single-line, compact JSON objects.
  • Unknown header attributes must be preserved (not stripped) by any component that processes the envelope. This is important for forward compatibility.
  • An empty envelope (headers only, no items) is valid but meaningless and can be discarded.

Concrete Example

Here is a real envelope with two items -- an attachment and an error event:

{"event_id":"9ec79c33ec9942ab8353589fcb2e04dc","dsn":"https://examplePublicKey@o0.ingest.sentry.io/0"}\n
{"type":"attachment","length":10,"content_type":"text/plain","filename":"hello.txt"}\n
\xef\xbb\xbfHello\r\n\n
{"type":"event","length":41,"content_type":"application/json"}\n
{"message":"hello world","level":"error"}\n

Reading this line by line:

  1. Envelope header: Contains the event_id and dsn for authentication.
  2. Item 1 header: Declares an attachment of 10 bytes.
  3. Item 1 payload: The raw attachment bytes (exactly 10 bytes, including a BOM and Windows newline).
  4. Item 2 header: Declares an event of 41 bytes.
  5. Item 2 payload: A JSON error event.

Envelope Headers (the "shipping label")

The top-level envelope headers apply to the entire envelope and all its items. The most important ones are:

| Header | Required? | Description |
|---|---|---|
| event_id | Depends on items | UUID identifying the event/transaction. Required when the envelope contains an event, transaction, or feedback. |
| dsn | Recommended | The full DSN string. Allows the envelope to be "self-authenticating" -- it carries everything needed to route and authenticate itself. |
| sent_at | Recommended | RFC 3339 UTC timestamp of when the SDK sent the envelope. Used by Relay for clock drift correction. Should be set as close to transmission time as possible, and must appear only once. |
| sdk | Recommended | SDK name and version information. Carried at the envelope level so it applies even to items like minidumps or sessions that lack their own SDK metadata. |

Item Headers

Every item within an envelope has its own header line. Two headers are universal:

| Header | Required? | Description |
|---|---|---|
| type | Required | The kind of data in this item (e.g., "event", "transaction", "attachment", "session", "replay_event", "profile", "log", etc.) |
| length | Recommended | Payload size in bytes. If omitted, the payload extends to the next newline. For payloads containing newlines (like binary data), length must be specified. |

Additional item-specific headers depend on the item type (for example, attachments require filename).


All Item Types

This is where it gets interesting. Here is the full catalog of item types that an envelope can carry:

Core Event Types

| Item Type | Description |
|---|---|
| event | An error or default event (JSON). At most one per envelope. Mutually exclusive with transaction. |
| transaction | A performance transaction (JSON). At most one per envelope. Mutually exclusive with event. |

Telemetry Data

| Item Type | Description |
|---|---|
| session | A single session init or update for Release Health. Can appear multiple times (up to 100 per envelope). |
| sessions | Pre-aggregated session count buckets (up to 100 buckets per item). Can appear multiple times. |
| span | A standalone span (Version 2 spans protocol). |
| log | A container of log entries (new structured logging). Contains an items array. Must include item_count and content_type headers. |
| otel_log | OpenTelemetry-formatted log entries. |
| profile | A profiling payload, associated with a transaction. Must be in the same envelope as its transaction. |
| profile_chunk | A V2 profile sample (continuous profiling). Can exist independently. |
| check_in | A Cron Monitor check-in payload. At most one per envelope. |

Attachments

| Item Type | Description |
|---|---|
| attachment | A raw binary or text attachment. Can appear multiple times. Supports special attachment_type values for minidumps, Apple crash reports, Unreal Engine context, and view hierarchies. Also supports a trace attachment variant via a special content type. |

Replay

| Item Type | Description |
|---|---|
| replay_event | Replay metadata (JSON). At most once per envelope. Must be paired with a replay recording. |
| replay_recording | The actual replay DOM recording data (JSON or gzipped JSON). Must be paired with a replay event. |

User Feedback

| Item Type | Description |
|---|---|
| feedback | The current user feedback format. Contains an event with a feedback context. At most once per envelope. |
| user_report | Deprecated. The old user report format, associating comments with an existing event. |

SDK Diagnostics

| Item Type | Description |
|---|---|
| client_report | SDK-side telemetry about what happened to events before they reached Sentry (e.g., how many events were rate-limited or dropped client-side). Can appear multiple times. |

Metrics

| Item Type | Description |
|---|---|
| trace_metric | A container of trace-scoped metric entries. Follows the same container pattern as log items with item_count and content_type headers. |

Reserved (Internal Use)

These types are reserved and must not be written by SDK implementations: security, unreal_report, form_data.


How Envelopes Flow Through the Ingestion Pipeline

Here is how an envelope travels from an SDK to storage:

  SDK                    Relay                        Sentry (Django)
  ===                    =====                        ===============

  1. SDK constructs   -->  2. Relay receives        -->  5. Sentry consumers
     envelope with          envelope via POST             read from Kafka
     headers + items        /api/<project_id>/            topics and process
                            envelope/                     items into storage

                         3. Relay processes:
                            - Authenticates (DSN
                              from header, query
                              param, or envelope
                              header)
                            - Applies rate limits
                            - Filters (inbound
                              filters, sampling)
                            - Validates/normalizes
                              payloads
                            - Applies clock drift
                              correction using
                              sent_at
                            - Extracts metrics
                              from transactions

                         4. Relay forwards
                            accepted items to
                            Kafka (or upstream
                            Relay in a chain)

Step-by-step

  1. SDK creates the envelope: The SDK serializes one or more items into the envelope format. For example, when a crash occurs, it might bundle an error event, a minidump attachment, and a client report together.

  2. HTTP transmission: The SDK POSTs the envelope to POST /api/<project_id>/envelope/. The content type is application/x-sentry-envelope (though text/plain is also accepted to avoid CORS preflight requests in browsers). The entire envelope body can be gzip-compressed via standard HTTP content encoding.

  3. Relay processing: Relay is the first service to receive the envelope. It is written in Rust for high performance. Relay:

    • Authenticates the request using the DSN (from HTTP auth headers, query parameters, or the envelope's own dsn header).
    • Validates that items conform to size limits.
    • Applies rate limiting per project, per data category.
    • Filters events based on inbound filter rules and dynamic sampling decisions.
    • Normalizes event payloads (fills in defaults, trims fields, etc.).
    • Corrects clock drift using the sent_at timestamp.
    • Extracts metrics from transaction and span data.
    • Gracefully skips unknown item types (preserving them for forward compatibility).
  4. Kafka: Relay produces messages onto various Kafka topics based on item type. Events go to one topic, attachments to another, sessions to another, and so on.

  5. Sentry consumers: Django-based consumer services read from Kafka and handle final processing -- symbolication, grouping, storage in PostgreSQL/Snuba/ClickHouse, etc.

Relay Chaining

Relays can be chained. An organization might run a local Relay (in "proxy" or "static" mode) that forwards envelopes to Sentry's hosted Relay. Each hop in the chain preserves the envelope format, including unknown item types and headers it does not understand.


Size Limits

Relay enforces these limits on envelopes (values subject to change, defined in Relay's config):

| Scope | Limit |
|---|---|
| Entire envelope (decompressed) | 200 MiB |
| Event / transaction / span / log / metric items | 1 MiB each |
| Individual metric within an envelope | 2 KiB |
| Monitor check-in item | 100 KiB |
| Client report item | 4 KiB |
| Profile item | 50 MiB |
| Replay item (compressed) | 10 MiB |
| Replay item (decompressed) | 100 MiB |
| Sessions per envelope | 100 |
| Pre-aggregated session buckets per sessions item | 100 |

The length Header: Explicit vs. Implicit Payloads

This is a subtlety worth understanding:

  • Explicit length (recommended): The item header includes "length": 42, meaning the next 42 bytes are the payload, regardless of whether they contain newlines. After those bytes, the parser expects a \n or EOF.
  • Implicit length (length omitted): The payload extends until the next \n or EOF. This only works for payloads guaranteed not to contain newline characters. It is used for small, simple payloads like session updates, where omitting the length can improve compression ratios when many small items are batched together.

The general advice for SDK implementors is: always include length unless you have a specific reason not to (like optimizing compression for many tiny session items).
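
To make the explicit-length rule concrete, here is a small, hypothetical Python sketch that serializes an item with a declared length. It follows the grammar above but is illustration only, not Sentry's SDK code:

```python
import json

def serialize_item(item_type: str, payload: bytes, content_type: str) -> bytes:
    """Serialize one envelope item with an explicit length header.

    Hypothetical helper for illustration; real SDKs ship their own envelope code.
    """
    headers = {"type": item_type, "length": len(payload), "content_type": content_type}
    return json.dumps(headers, separators=(",", ":")).encode() + b"\n" + payload + b"\n"

event = json.dumps({"message": "hello world", "level": "error"}).encode()
envelope = (
    json.dumps({"event_id": "9ec79c33ec9942ab8353589fcb2e04dc"}).encode() + b"\n"
    + serialize_item("event", event, "application/json")
)
```

Because the length is computed from the payload bytes, the payload may safely contain newlines or binary data.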


Authentication Methods

Envelopes support three ways to authenticate:

  1. HTTP X-Sentry-Auth header: The traditional method. The DSN public key is embedded in the header value.
  2. Query parameter: The DSN key is passed as a query parameter on the URL.
  3. Envelope header dsn field: The full DSN is embedded in the envelope's own header JSON. This makes the envelope fully self-contained ("self-authenticating"). This method requires Relay v21.6.0+.

If multiple authentication methods are provided, they must all agree or the request is rejected.
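
As an illustration of option 3 (the self-authenticating envelope), here is a hedged sketch that POSTs an envelope whose header carries the dsn. The host, project ID, and DSN are placeholders taken from the example envelope above:

```python
import requests  # assumes the requests library is available

ENVELOPE = (
    b'{"event_id":"9ec79c33ec9942ab8353589fcb2e04dc",'
    b'"dsn":"https://examplePublicKey@o0.ingest.sentry.io/0"}\n'
    b'{"type":"event","length":41,"content_type":"application/json"}\n'
    b'{"message":"hello world","level":"error"}\n'
)

# Placeholder host and project id; the dsn inside the envelope header authenticates it.
resp = requests.post(
    "https://o0.ingest.sentry.io/api/0/envelope/",
    data=ENVELOPE,
    headers={"Content-Type": "application/x-sentry-envelope"},
)
print(resp.status_code)
```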


Key Concepts Summary

  • Envelope: A container format with a header line followed by zero or more items. Think of it as a "package" that can hold multiple types of Sentry data.
  • Item: A single piece of data within an envelope (event, attachment, session, profile, etc.), consisting of its own header line and payload.
  • Self-authenticating: An envelope with a dsn header carries everything needed to route itself, making it suitable for offline storage and deferred sending.
  • Forward compatibility: All components must preserve unknown headers and unknown item types. This lets Sentry evolve the format without breaking existing infrastructure.
  • Clock drift correction: The sent_at header enables Relay to correct for differences between the SDK's clock and the server's clock.

References

  1. Envelopes specification: https://develop.sentry.dev/sdk/data-model/envelopes/
  2. Envelope Items specification: https://develop.sentry.dev/sdk/data-model/envelope-items/
  3. Relay implementation: getsentry/relay repository, specifically relay-server/src/envelope/mod.rs
  4. Relay config (size limits): getsentry/relay/relay-config/src/config.rs
  5. Ingestion overview: https://develop.sentry.dev/ingestion/

Next Steps

  • Dynamic Sampling Context (DSC): Envelopes can carry a trace header with sampling decisions that propagate across services.
  • Explore Relay's Rust code: Clone getsentry/relay and look at relay-server/src/envelope/mod.rs to see how envelopes are parsed and validated in practice.
  • Study client reports: Understanding the client_report item type helps you see how SDKs report their own event drop/rate-limit statistics back to Sentry.
  • Look at SDK implementations: Pick any SDK (e.g., getsentry/sentry-python or getsentry/sentry-javascript) and search for "envelope" to see how they construct and serialize envelopes.


sl0thentr0py commented Feb 17, 2026

 ❯ what are the different kinds of relays in sentry's infrastructure and how many of them do we have

Sentry Relay Types Overview

Sentry's Relay infrastructure has three conceptual tiers of Relays, plus two configuration modes that determine how each Relay behaves.


The Three Tiers of Relay

Think of Sentry's ingestion pipeline as a series of layers that an event passes through on its way from an SDK to Kafka (and ultimately to storage). Here's the flow:

  SDK (your app)
       │
       ▼
┌─────────────────┐
│ External Relay   │  ← Optional, customer-operated
│ (Customer Relay) │
└───────┬─────────┘
        │
        ▼
┌─────────────────┐
│   PoP Relay      │  ← Sentry-operated, at the edge
│ (Point of        │
│  Presence)       │
└───────┬─────────┘
        │
        ▼
┌─────────────────┐
│ Processing Relay │  ← Sentry-operated, in the core region
│ (Internal)       │
└───────┬─────────┘
        │
        ▼
     Kafka → Snuba → Storage

1. External Relay (Customer-Operated)

  • Who runs it: Customers (self-hosted or on-prem users)
  • Purpose: Gives customers a local ingestion point. Events are filtered, rate-limited, and normalized before leaving the customer's network, reducing bandwidth and providing data scrubbing at the edge.
  • Key trait: Sentry has no control over the version customers run. This is why forward compatibility is so important in the Relay codebase — newer SDKs might send data to older External Relays. The developer docs explicitly note: "we practically have no deprecation policy for old versions."
  • Config: Runs in managed or proxy mode. It receives a restricted/limited project configuration (via LimitedProjectConfig) — Sentry intentionally withholds sensitive internal config from untrusted Relays.
  • How many: Variable — as many as customers choose to deploy. We don't control this number.

2. PoP Relay (Point of Presence)

  • Who runs it: Sentry (us!)
  • Purpose: These are Sentry's edge nodes distributed geographically. They sit close to customers to minimize latency for the initial SDK → Relay request. The key insight: PoP Relays respond to the SDK asynchronously — they accept the event quickly and then forward it to the Processing Relay in the background. This means SDKs don't have to wait for the event to travel all the way to our core infrastructure.
  • What they do: Rate limiting, basic normalization, filtering, metrics extraction, dynamic sampling. They do not do full event processing (no symbolication, no grouping, etc.).
  • Config: These are "internal" Relays (the config flag relay.internal: true gives them access to the full ProjectConfig, not the limited one). They run in managed mode but do not have processing.enabled = true.
  • How many: Based on the deployment configs, PoP Relays are deployed across multiple regions including US, DE (Germany), S4S (Sentry for Sentry), and other regions. The exact number of instances scales with traffic. The deployment pipeline (relay-pop) deploys to these regions with canary and primary stages.

3. Processing Relay

  • Who runs it: Sentry (us!)
  • Purpose: This is the final Relay in the chain before events hit Kafka. It does the "heavy lifting" — full event processing including symbolication coordination, grouping, PII scrubbing, metric extraction, and ultimately producing events to Kafka topics.
  • Key trait: This is the only Relay that has processing.enabled = true in its config. This flag enables features like writing directly to Kafka, accessing Redis for project config caching, and performing full normalization.
  • Config: Runs as an internal, managed Relay with the processing feature flag compiled in and enabled. It receives the full, trusted project configuration.
  • How many: Deployed in the core regions (US, DE, S4S, single-tenant regions). The deployment pipeline is relay-processing. The exact replica count depends on traffic.

The Two Configuration Modes (relay.mode)

Orthogonal to the tier, every Relay runs in one of two modes (defined in relay-config/src/config.rs):

| Mode | Behavior |
|---|---|
| managed (default) | Fetches project configurations from its upstream (Sentry or another Relay). Requires authentication. This is what all Sentry-operated Relays and most customer Relays use. |
| proxy | Forwards all events with minimal processing. Does NOT fetch project configs. Rate limiting still applies. Useful when you just want a forwarding proxy without needing project-specific behavior. |

There was previously a static mode (project configs read from local files), but it was deprecated and removed in Relay v25.9.0.


Other Important Distinctions

Internal vs. External (Trust Level)

This is configured via relay.internal: true/false in the Relay config:

  • Internal Relays (PoP + Processing): Receive the full ProjectConfig with all fields, including internal/sensitive settings.
  • External Relays (customer): Receive a limited LimitedProjectConfig — Sentry's endpoint has an early return that stops populating the config for untrusted Relays.

Default vs. Canary (Instance Type)

From the RelayInstance enum in the code:

  • Default: Normal production instance
  • Canary: Used for gradual rollouts and experiments before hitting the full fleet

Summary Table

| Tier | Operator | Internal? | Processing? | Mode | Gets Full Config? |
|---|---|---|---|---|---|
| External Relay | Customer | No | No | managed or proxy | No (limited) |
| PoP Relay | Sentry | Yes | No | managed | Yes (full) |
| Processing Relay | Sentry | Yes | Yes | managed | Yes (full) |

Key Concepts

  • processing.enabled: The boolean flag that distinguishes Processing Relays from all others. Only Processing Relays write to Kafka.
  • relay.internal: Controls trust level and what project config data the Relay receives.
  • Forward compatibility: Critical because External Relays can be running any version. Enums need catch-all variants, unknown fields must be ignored.
  • Chaining: Relays can chain — an External Relay forwards to a PoP Relay, which forwards to a Processing Relay. Each layer adds more processing.

References

  1. Relay Modes documentation — Official docs on managed vs. proxy modes
  2. Project Config Versioning — Internal vs. External Relay, naming conventions
  3. RelayMode enum in source — The actual Rust enum definition
  4. Relay Best Practices — Engineering guidelines for working on Relay
  5. GoCD deployment configs — processing.jsonnet and pops.jsonnet show the two deployment pipelines

Next Steps

  • Explore the Relay repository to see the processing pipeline in relay-server/src/services/processor.rs
  • Read about how project configs flow from Sentry → Relay
  • Look into how dynamic sampling works differently at each Relay tier

Research sources: getsentry/sentry-docs (develop-docs/ingestion/relay/), getsentry/relay (relay-config/src/config.rs, gocd/templates/), Relay CHANGELOG.md


sl0thentr0py commented Feb 17, 2026

❯ explain the role of kafka in sentry's infra

The Role of Kafka in Sentry's Infrastructure

Quick Answer

Kafka is the central nervous system of Sentry's event pipeline. It acts as the distributed message bus that decouples event ingestion from event processing and storage, allowing Sentry to handle massive volumes of error events, transactions, profiles, and more without losing data even under heavy load.

How Kafka Fits Into Sentry's Architecture

Here's the simplified flow:

Your App -> SDK -> Relay -> KAFKA -> Sentry Consumers -> ClickHouse/Postgres
                                  -> Snuba Consumers  -> ClickHouse
                                  -> Post-Process      -> Alerts, etc.

Kafka sits right in the middle -- after ingestion, before processing. Think of it like a conveyor belt system in a factory: Relay puts items on the belt, and various workers (consumers) pick items off to do their specialized jobs.

Why Kafka? The Three Big Reasons

1. Decoupling Producers from Consumers

Without Kafka, Relay (the ingestion service) would need to directly call Sentry's processing code. If processing slowed down, ingestion would back up, and you'd start dropping events. With Kafka in between:

  • Relay just writes messages to Kafka topics and moves on
  • Consumers read at their own pace
  • If a consumer falls behind, messages queue up safely in Kafka rather than being lost

2. Handling Traffic Spikes (Backpressure)

Sentry processes billions of events. During traffic spikes (imagine a major outage hitting thousands of customers simultaneously), Kafka absorbs the burst. Consumers can catch up gradually without data loss.

3. Fan-Out to Multiple Consumers

A single event often needs to be processed by multiple systems. Kafka lets different consumer groups independently read from the same topic. For example, after an event is saved, both Snuba (for search/analytics) and the post-process pipeline (for alerting) need to see it.

Kafka Topics in Sentry

Kafka organizes messages into topics -- think of them as named channels. Sentry has many topics, each serving a specific purpose. Here are the major categories:

Ingestion Topics

These receive raw data from Relay:

| Topic | What flows through it |
|---|---|
| ingest-events | Error events from SDKs |
| ingest-transactions | Performance transaction data |
| ingest-spans | Individual span data |
| ingest-attachments | File attachments |
| ingest-replay-events | Session replay events |
| ingest-feedback-events | User feedback |
| ingest-monitors | Cron monitor check-ins |
| profiles | Profiling data |
| ingest-metrics / ingest-performance-metrics | Metrics data |

Processed/Internal Topics

After initial processing, events flow to these:

| Topic | Purpose |
|---|---|
| events | Processed error events, consumed by Snuba |
| transactions | Processed transactions, consumed by Snuba |
| generic-events | Generic event stream (issue platform) |
| snuba-metrics / snuba-generic-metrics | Metrics for ClickHouse storage |

Subscription & Alerting Topics

| Topic | Purpose |
|---|---|
| events-subscription-results | Alert query results for errors |
| transactions-subscription-results | Alert query results for transactions |

Taskworker Topics (Newer!)

| Topic | Purpose |
|---|---|
| taskworker, taskworker-ingest, taskworker-email, etc. | Kafka-based task execution (replacing some Celery usage) |

Dead Letter Queues (DLQs)

Almost every topic has a corresponding -dlq topic. When a message fails processing, it goes to the DLQ instead of being lost. This is critical for reliability -- you can investigate and replay failed messages later.

The Event Pipeline in Detail

Here's how an error event flows through Kafka:

1. SDK sends event
       |
2. Relay receives, validates, rate-limits
       |
3. Relay publishes to -----------------> Kafka topic: "ingest-events"
                                              |
4. Sentry ingest consumer reads <-------------+
       |
5. Preprocesses (symbolication, etc.)
       |
6. Saves event to nodestore (Postgres)
       |
7. Publishes to -----------------------> Kafka topic: "events"
                                              |
                            +-----------------+------------------+
                            |                                    |
8a. Snuba consumer              8b. Post-process forwarder
    writes to ClickHouse             triggers alerts, integrations

Notice Kafka appears twice -- once for ingestion, once for processed events. This two-stage design means ingestion and post-processing can scale independently.

The Consumer Framework: Arroyo

Sentry uses a custom streaming library called Arroyo to build Kafka consumers. It provides:

  • A ProcessingStrategyFactory pattern for defining how messages are processed
  • Batching, parallel processing, and backpressure handling
  • Commit management (tracking which messages have been processed)

Consumers are defined in src/sentry/consumers/__init__.py and run via:

sentry run consumer <consumer-name>
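
The real consumers are built on Arroyo, but the underlying pattern is an ordinary Kafka consumer group. A hedged, generic sketch of what "reading a topic at your own pace" looks like, using confluent-kafka rather than Sentry's Arroyo framework (broker address, group ID, and processing logic are placeholders):

```python
from confluent_kafka import Consumer  # generic sketch, not Sentry's Arroyo-based consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",   # placeholder broker
    "group.id": "example-ingest-consumer",   # hypothetical consumer group
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["ingest-events"])

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    # In Sentry, an Arroyo ProcessingStrategy would normalize, save, and
    # re-publish the event here; this sketch just inspects the raw payload.
    print(len(msg.value()))
```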

Key Concepts

  • Topic: A named channel in Kafka (e.g., ingest-events). Messages in a topic are ordered within partitions.
  • Consumer: A service that reads from a topic and does work (e.g., snuba-consumer writes events to ClickHouse).
  • Producer: A service that writes to a topic (e.g., Relay produces to ingest-events).
  • Dead Letter Queue (DLQ): A fallback topic for messages that fail processing.
  • Consumer Group: Multiple instances of a consumer share the workload of a topic -- each partition is read by only one instance.
  • Arroyo: Sentry's Python library for building robust Kafka consumers with batching and backpressure.

References

  1. Kafka consumers guide -- How to create new topics and consumers
  2. Application Architecture Overview -- High-level system diagram
  3. Self-hosted Data Flow -- Detailed data flow with Kafka's role
  4. Arroyo library -- The streaming consumer framework
  5. kafka_definition.py -- All topic definitions in code
  6. consumers/__init__.py -- All consumer definitions

Next Steps

  • Hands-on: Run sentry run consumer --help in your dev environment to see all available consumers
  • Explore topics: Look at kafka_definition.py to see the full list of ~80+ topics
  • Learn Arroyo: Check out the Arroyo getting started guide if you'll be building consumers
  • Related topics: Snuba (the query engine that reads from Kafka into ClickHouse) and Relay (the ingestion service that produces to Kafka)

Research sources: getsentry/sentry-docs/develop-docs (kafka.mdx, data-flow.mdx, overview.mdx, feedback-architecture.mdx), getsentry/sentry/src/sentry/conf/types/kafka_definition.py


sl0thentr0py commented Feb 17, 2026

❯ how do sentry and opentelemetry work together

Quick Answer

Sentry and OpenTelemetry (OTel) work together at multiple levels: Sentry's newer SDKs are actually built on top of OpenTelemetry internally, Sentry can ingest native OTLP data directly, and you can link Sentry-instrumented services with OTel-instrumented services for end-to-end distributed tracing. The relationship has evolved from "separate systems you can bridge" to "deeply integrated."

How It Works

1. Sentry SDKs Are Built on OpenTelemetry (Under the Hood)

This is the most important thing to understand. The modern Sentry JavaScript SDK (v8+) uses OpenTelemetry under the hood. From the docs:

"The Sentry SDK uses OpenTelemetry under the hood. This means that any OpenTelemetry instrumentation that emits spans will automatically be picked up by Sentry without any further configuration."

This means:

  • If a user adds any OTel-compatible instrumentation library, Sentry picks up those spans automatically
  • Context isolation and trace propagation use OTel's machinery
  • Users can optionally use OTel APIs directly alongside the Sentry SDK

Other SDKs (Python, Java, etc.) offer OTel integration as well, though the depth of integration varies by platform.

2. The Historical Context (Why This Matters)

When Sentry first built performance monitoring, OTel was still in its early stages. Sentry created its own model with a key concept OTel doesn't have: transactions (a grouping of spans representing a unit of work like an HTTP request). Over time, Sentry has been converging toward the OTel model.

The integration works through two main OTel extension points:

┌─────────────────────────────────────────────┐
│           OpenTelemetry SDK                  │
│                                              │
│  ┌──────────────────┐  ┌─────────────────┐  │
│  │  SpanProcessor   │  │   Propagator    │  │
│  │  (SentrySpan     │  │  (SentryTrace   │  │
│  │   Processor)     │  │   Propagator)   │  │
│  └────────┬─────────┘  └───────┬─────────┘  │
│           │                    │             │
└───────────┼────────────────────┼─────────────┘
            │                    │
            ▼                    ▼
   ┌─────────────────┐  ┌──────────────────┐
   │  Transforms OTel │  │ Propagates       │
   │  spans → Sentry  │  │ sentry-trace +   │
   │  spans/txns      │  │ baggage headers  │
   └─────────────────┘  └──────────────────┘
  • SentrySpanProcessor: Converts OTel spans into Sentry's data model (spans + transactions)
  • SentryPropagator: Ensures sentry-trace and baggage headers are injected/extracted alongside OTel's W3C traceparent headers, enabling distributed tracing and dynamic sampling
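
In sentry-python this wiring looks roughly like the following. This is a sketch based on the documented OpenTelemetry integration; module paths and options may differ by SDK version, and the DSN is a placeholder:

```python
import sentry_sdk
from opentelemetry import trace
from opentelemetry.propagate import set_global_textmap
from opentelemetry.sdk.trace import TracerProvider
from sentry_sdk.integrations.opentelemetry import SentryPropagator, SentrySpanProcessor

sentry_sdk.init(
    dsn="https://examplePublicKey@o0.ingest.sentry.io/0",  # placeholder DSN
    instrumenter="otel",            # hand span creation over to OpenTelemetry
    traces_sample_rate=1.0,
)

provider = TracerProvider()
provider.add_span_processor(SentrySpanProcessor())   # OTel spans -> Sentry spans/transactions
trace.set_tracer_provider(provider)
set_global_textmap(SentryPropagator())                # injects sentry-trace + baggage headers
```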

3. Three Ways to Get OTel Data Into Sentry

There are three primary integration patterns:

A) Use a Sentry SDK (OTel built-in)

The simplest path. The Sentry SDK handles everything — it uses OTel internally and sends data to Sentry in Sentry's format.

Your App → Sentry SDK (OTel under the hood) → Sentry

B) Direct OTLP Export (No Sentry SDK)

Sentry has native OTLP endpoints that accept standard OpenTelemetry data. You can point any OTel SDK's OTLP exporter directly at Sentry — no Sentry SDK needed at all.

Your App → OTel SDK → OTLP Exporter → Sentry OTLP Endpoint

Use this when:

  • You're already fully invested in OTel and don't want another SDK
  • You want the simplest setup with no Sentry-specific code
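
A hedged sketch of this pattern with the standard OTel Python SDK; the endpoint and auth header below are placeholders, the real values come from your Sentry project's OTLP settings:

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

exporter = OTLPSpanExporter(
    endpoint="https://oXXXXX.ingest.sentry.io/...",                       # placeholder OTLP traces endpoint
    headers={"x-sentry-auth": "sentry sentry_key=examplePublicKey"},      # placeholder auth header
)
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

with trace.get_tracer(__name__).start_as_current_span("example-span"):
    pass  # spans reach Sentry with no Sentry SDK installed
```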

C) Forwarding via OTel Collector / Vector / Fluent Bit

For infrastructure-level telemetry, you can forward data through a pipeline:

Infrastructure Sources → OTel Collector / Vector / Fluent Bit → Sentry

Use this when:

  • You need to collect logs from infrastructure (CloudWatch, Nginx, Kafka, syslog)
  • You want to transform or route telemetry before it reaches Sentry
  • You're aggregating data from multiple sources

4. Linking Sentry + OTel Across Services (Distributed Tracing)

A common real-world scenario: you have a Sentry SDK on the frontend and an OTel-instrumented backend. To get end-to-end traces:

┌──────────────────┐         ┌──────────────────────┐
│  Frontend         │  W3C   │  Backend              │
│  (Sentry SDK)     │──────→ │  (OTel instrumented)  │
│                   │traceparent│                     │
│  propagate        │ header │  picks up trace       │
│  TraceparentOn    │        │  context              │
└──────────────────┘         └──────────────────────┘

The Sentry SDK option propagateTraceparent sends the standard W3C traceparent header with outgoing requests, which OTel-instrumented backends pick up automatically. This links everything into a single distributed trace visible in Sentry's trace view.

5. Key Requirements for Full OTel Support in an SDK

For an SDK to be "fully OTel compatible," it must:

  1. Link errors to traces via trace context
  2. Propagate both sentry-trace and baggage headers (via the SentryPropagator)
  3. Match Span/Trace IDs between Sentry and OTel
  4. Filter out spans representing requests to Sentry itself (avoid recursion)

Key Concepts

  • OTLP (OpenTelemetry Protocol): The wire protocol for sending telemetry data. Sentry now accepts this natively.
  • SpanProcessor: An OTel extension point that lets Sentry transform OTel spans into Sentry's data model.
  • Propagator: An OTel extension point for injecting/extracting trace context headers across service boundaries.
  • sentry-trace header: Sentry's custom trace propagation header (used alongside W3C traceparent).
  • traceparent header: The W3C standard trace context header that both OTel and Sentry understand.
  • Dynamic Sampling Context (DSC): Carried in the baggage header, allows Sentry to make consistent sampling decisions across a distributed trace.

References

  1. Sentry OTel SDK Dev Docs — Technical spec for how SDKs implement OTel support (SpanProcessor, Propagator, etc.)
  2. Sentry with OTel (User Docs) — Guide for linking Sentry SDK traces with OTel-instrumented services
  3. Direct OTLP Export — Sending OTel traces/logs directly to Sentry's OTLP endpoints
  4. OTLP Forwarding — Forwarding telemetry via OTel Collector, Vector, or Fluent Bit
  5. JS SDK OpenTelemetry Docs — How the JavaScript SDK uses OTel under the hood

Next Steps

  • If you're working on SDKs: Read the OTel SDK spec to understand the SpanProcessor and Propagator implementation details
  • If you're working on ingestion (Relay): Look at how Relay handles OTLP data — search for OTLP-related code in getsentry/relay
  • If you're curious about the data model: Check out the performance monitoring research doc for historical context on how Sentry's model differs from (and is converging with) OTel's model
  • Hands-on: Try setting up a Node.js app with the Sentry SDK and adding a third-party OTel instrumentation library — you'll see the spans appear in Sentry automatically

Research sources: getsentry/sentry-docs — specifically develop-docs/sdk/telemetry/traces/opentelemetry.mdx, docs/concepts/otlp/sentry-with-otel.mdx, docs/concepts/otlp/direct/index.mdx, docs/concepts/otlp/forwarding/index.mdx, and docs/platforms/javascript/common/opentelemetry/index.mdx

sl0thentr0py changed the title from "senpai poc (ai hackweek)" to "senpai agent poc (ai hackweek)" on Feb 17, 2026
@sl0thentr0py

Test Prompt

what is dynamic sampling and how does it work

Senpai Response

Quick Answer

Dynamic Sampling is Sentry's system for intelligently deciding which events to keep and which to drop after they arrive at the ingestion pipeline. Since Sentry extracts metrics from 100% of incoming events (giving you accurate counts, durations, and web vitals), Dynamic Sampling's job is to retain a representative subset of full event payloads — biased toward the most valuable data — so you can still dig into traces, spans, and custom tags without storing everything.


How It Works — The Big Picture

Where It Sits in the Pipeline

Dynamic Sampling lives in Relay (our Rust-based ingestion service), and it's step 4 of 5 in the event lifecycle:

SDK → Relay
        │
        ├─ 1. Inbound Data Filters  (e.g., deny old browsers, blocked releases)
        ├─ 2. Quota Enforcement      (billing — all events past here count toward quota)
        ├─ 3. Metrics Extraction      (extract counts, durations, web vitals from 100% of events)
        ├─ 4. ⭐ Dynamic Sampling    (decide: keep or drop the full event payload?)
        └─ 5. Rate Limiting           (safety valve — rarely triggered)
              │
              ▼
           Stored Events (indexed, searchable)

Key insight: Metrics are extracted before sampling, so your dashboards, alerts, and performance graphs see all your data. Dynamic Sampling only affects whether you can see the full event details (traces, spans, tags) in tools like Trace Explorer or Discover.

The Two-Layer Data Model

| Layer | Source | Affected by Sampling? | Used For |
|---|---|---|---|
| Metrics | Extracted from 100% of events | No | SPM/TPM, web vitals, alerts, dashboards |
| Stored Events | Kept/dropped by Dynamic Sampling | Yes | Trace Explorer, Discover, span details, custom tags |

Core Concepts

1. Fidelity (Target Sample Rate)

Fidelity = the overall target sample rate for an organization. For example, a fidelity of 20% means Sentry aims to store ~20% of incoming events.

There are two modes:

  • Automatic Mode (default): Sentry manages sample rates per project, boosting low-volume projects so they aren't drowned out. Set at the org level via sentry:target_sample_rate.
  • Manual Mode: You set a static sample rate per project. Sentry won't adjust them.

2. Rules and Biases

Dynamic Sampling works through a rule-based system. Rules live in the project configuration that Relay fetches from Sentry. Each rule has:

  • A condition (which events it matches)
  • A sampling value (either a sampleRate or a factor that multiplies with other rules)
  • An optional time range and decaying function

Rules are evaluated top-to-bottom. factor rules accumulate multipliers; sampleRate rules produce a final decision.
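
A tiny, hypothetical Python sketch of that evaluation order (not Relay's actual Rust implementation): factor rules multiply into an accumulator until a sampleRate rule makes the final decision.

```python
# Hypothetical illustration of rule evaluation order; rule shape is made up for clarity.
def effective_sample_rate(rules, event):
    factor = 1.0
    for rule in rules:                       # rules are evaluated top-to-bottom
        if not rule["condition"](event):
            continue
        if rule["kind"] == "factor":
            factor *= rule["value"]          # accumulate, keep matching
        elif rule["kind"] == "sampleRate":
            return min(1.0, factor * rule["value"])  # final decision, stop matching
    return None                              # no match: Relay keeps the event

rules = [
    {"kind": "factor", "value": 5.0, "condition": lambda e: e.get("release") == "2.0.0"},
    {"kind": "sampleRate", "value": 0.1, "condition": lambda e: True},
]
print(effective_sample_rate(rules, {"release": "2.0.0"}))  # 0.5
```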

3. Trace vs. Transaction Sampling

Type Scope Guarantees
Trace Sampling Entire trace (all related events) All-or-nothing: keep the whole trace or drop it
Transaction Sampling Individual transaction No trace completeness guarantee

Trace sampling works by seeding the random number generator with the trace ID, so every event in the same trace gets the same keep/drop decision. The Dynamic Sampling Context (DSC) propagated by SDKs makes this possible.
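
Conceptually, that deterministic decision looks like the following hedged sketch; Relay's actual hashing and random-number scheme differ:

```python
# Hedged illustration of trace-consistent sampling; not Relay's actual algorithm.
import hashlib

def keep_trace(trace_id: str, sample_rate: float) -> bool:
    # Hash the trace ID to a number in [0, 1); every event in the same trace
    # computes the same value, so the keep/drop decision is consistent.
    digest = hashlib.sha256(trace_id.encode()).digest()
    fraction = int.from_bytes(digest[:8], "big") / 2**64
    return fraction < sample_rate

tid = "771a43a4192642f0b136d5159a501700"
assert keep_trace(tid, 0.25) == keep_trace(tid, 0.25)  # same trace, same decision
```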


The Biases (Smart Prioritization)

Within the target sample rate, Sentry applies biases to retain more valuable data:

| Bias | What It Does | Configurable? |
|---|---|---|
| Prioritize New Releases | Boosts sample rate for new releases, decaying as adoption increases | ✅ Yes |
| Prioritize Dev Environments | 100% sample rate for *dev*, *local*, *test*, etc. | ✅ Yes |
| Prioritize Low-Volume Projects | Boosts small projects so they're not overshadowed (Automatic Mode only) | No (automatic) |
| Prioritize Low-Volume Transactions | Rebalances so rare transactions get more samples | No (automatic) |
| Deprioritize Health Checks | Reduces rate for /healthz, /heartbeat, etc. | ✅ Yes |

Architecture: How Rules Get to Relay

┌─────────────────────────────────────────────────────────┐
│                        Sentry                           │
│                                                         │
│  ┌──────────┐    ┌───────────┐    ┌──────────────────┐  │
│  │  Celery   │───▶│   Redis   │◀───│  Rule Generator  │  │
│  │  Tasks    │    │  (shared  │    │  (project config │  │
│  │ (periodic │    │   state)  │    │   recomputation) │  │
│  │  cron)    │    └───────────┘    └────────┬─────────┘  │
│  └──────────┘                               │            │
│   Computes:                                 │            │
│   - Low-volume project rates                ▼            │
│   - Low-volume txn rates        ┌──────────────────┐    │
│   - Org recalibration           │ Project Config   │    │
│                                 │ (cached in Redis)│    │
│                                 └────────┬─────────┘    │
└──────────────────────────────────────────│──────────────┘
                                           │ pull (periodic)
                                           ▼
                                    ┌─────────────┐
                                    │    Relay     │
                                    │              │
                                    │ Evaluates    │
                                    │ rules per    │
                                    │ event, makes │
                                    │ keep/drop    │
                                    │ decision     │
                                    └─────────────┘
  1. Celery tasks run periodically, querying Snuba for traffic volumes and computing per-project and per-transaction sample rates. Results go to Redis.
  2. Rule generation in Sentry reads from Redis and DB, assembles sampling rules, and caches the project configuration.
  3. Relay pulls project configs periodically and evaluates rules for every incoming event.
  4. If no rule matches, Relay keeps the event (better to oversample than lose important data).

Key Concepts Glossary

  • DSC (Dynamic Sampling Context): Metadata propagated by SDKs across a trace so all events in a trace can be sampled consistently.
  • Fidelity: The target overall sample rate for an organization.
  • Factor rule: A multiplier that adjusts sample rate without making a final decision (matching continues).
  • SampleRate rule: Sets the base rate and terminates matching (final decision).
  • Recalibration: Periodic adjustment to ensure the actual sample rate stays close to the target sample rate.

References

  1. Developer Docs (primary source): https://develop.sentry.dev/application-architecture/dynamic-sampling/
  2. Fidelity & Biases: https://develop.sentry.dev/application-architecture/dynamic-sampling/fidelity-and-biases/
  3. Architecture: https://develop.sentry.dev/application-architecture/dynamic-sampling/architecture/
  4. Code — Rule generation: src/sentry/dynamic_sampling/rules/base.py in getsentry/sentry
  5. Code — Biases: src/sentry/dynamic_sampling/rules/biases/ directory in getsentry/sentry
  6. Code — Celery tasks: src/sentry/dynamic_sampling/tasks/ directory in getsentry/sentry
  7. Relay sampling implementation: relay_sampling docs

Next Steps

  • Explore the biases code at src/sentry/dynamic_sampling/rules/biases/ to see how each bias generates rules
  • Read about the DSC at https://develop.sentry.dev/sdk/telemetry/traces/dynamic-sampling-context/ to understand how SDKs propagate trace context
  • Look at the Celery tasks in src/sentry/dynamic_sampling/tasks/ to understand how sample rates are periodically recalculated
  • Try the UI: Go to Project Settings → Performance to see which biases are enabled and how they can be toggled

Research sources: getsentry/sentry-docs develop-docs (dynamic-sampling/index.mdx, the-big-picture.mdx, architecture.mdx, fidelity-and-biases.mdx), getsentry/sentry code search for dynamic_sampling

@sl0thentr0py

Prompt: where is the EAP schema located

Quick Answer

The EAP (Events Analytics Platform) schema lives in the getsentry/snuba repository. There are two key places to look:

Where the Schema is Defined

1. Storage Configuration (YAML) — the "current truth"

snuba/datasets/configuration/events_analytics_platform/storages/eap_items.yaml

This is the main storage definition that Snuba uses at runtime. It declares the full column schema, query processors, stream loader config, and deletion settings. There are also related storage configs for downsampled tables:

  • eap_items.yaml — main table
  • eap_items_downsample_8.yaml
  • eap_items_downsample_64.yaml
  • eap_items_downsample_512.yaml
  • eap_item_co_occurring_attrs.yaml

All live under:

snuba/datasets/configuration/events_analytics_platform/storages/

2. Migrations (Python) — how the ClickHouse tables are actually created

snuba/snuba_migrations/events_analytics_platform/0024_items.py

This is the initial migration that creates the eap_items_1_local and eap_items_1_dist ClickHouse tables. It defines the columns, engine (ReplacingMergeTree), partition key, sort order, and indices. Subsequent migrations (0026 through 0053) add columns, indices, downsampled views, etc.

3. Entity Configuration

snuba/datasets/configuration/events_analytics_platform/entities/eap_items.yaml

This defines the logical entity that maps to the storage(s) — it's what the query layer uses to route queries.

Schema Structure at a Glance

The eap_items table has this structure:

| Column | Type | Purpose |
|---|---|---|
| organization_id | UInt64 | Org scoping |
| project_id | UInt64 | Project scoping |
| item_type | UInt8 | Type of item (span, log, etc.) |
| timestamp | DateTime | When the event occurred |
| trace_id | UUID | Distributed trace ID |
| item_id | UInt128 | Unique item identifier |
| sampling_weight | UInt64 | For extrapolation |
| sampling_factor | Float64 | Sampling correction factor |
| retention_days | UInt16 | TTL control |
| attributes_bool | Map(String, Bool) | Boolean attributes |
| attributes_int | Map(String, Int64) | Integer attributes |
| attributes_string_0..39 | Map(String, String) | String attributes (40 buckets!) |
| attributes_float_0..39 | Map(String, Float64) | Float attributes (40 buckets!) |

The 40 buckets for string/float attributes are a sharding strategy — attributes are hashed into buckets so ClickHouse doesn't have to scan one massive Map column for every query.
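
A hypothetical sketch of the bucketing idea; the real hash function and bucket assignment live in Snuba/Relay and may differ:

```python
# Hypothetical illustration of attribute bucketing; not the production hash function.
import zlib

NUM_BUCKETS = 40

def attribute_column(attr_name: str, kind: str = "string") -> str:
    # Hash the attribute name into one of 40 buckets so queries only touch
    # the Map column that can contain that attribute.
    bucket = zlib.crc32(attr_name.encode()) % NUM_BUCKETS
    return f"attributes_{kind}_{bucket}"

print(attribute_column("http.method"))                 # -> attributes_string_<0..39>
print(attribute_column("db.duration", kind="float"))   # -> attributes_float_<0..39>
```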

Key Concepts

  • EAP = Events Analytics Platform — Sentry's generalized storage for trace items (spans, logs, etc.) in ClickHouse
  • ReplacingMergeTree — The ClickHouse engine used, which deduplicates rows with the same sort key
  • Downsampled tables — Pre-aggregated views at 8x, 64x, and 512x reduction for faster queries on large datasets

References

  1. Storage config: snuba/datasets/configuration/events_analytics_platform/storages/eap_items.yaml
  2. Initial migration: snuba/snuba_migrations/events_analytics_platform/0024_items.py
  3. Rust processor: rust_snuba/src/processors/eap_items.rs (how Kafka messages are processed into rows)
  4. RPC resolvers: snuba/web/rpc/v1/resolvers/R_eap_items/ (how queries are served)

Research sources: getsentry/snuba repository — storage YAML configs, migration files, and directory structure
