feat(ingestion): adds kafka consumer breadcrumbs #31201
Conversation
PR Summary
This PR adds Kafka consumer breadcrumbs to track event flow through the ingestion pipeline, helping identify potential sources of event duplication (currently affecting 3-4% of events).
- Adds a `kafka_consumer_breadcrumbs` array to event properties containing topic, offset, partition, timestamp, and consumer_id for each processing step
- Preserves existing breadcrumbs when an event passes through multiple consumers in the pipeline
- Implements comprehensive test coverage verifying breadcrumb structure and content
- Enables tracing of events across all potential destinations (main pipeline, DLQ, historical, overflow)
- Maintains the breadcrumb chain when events are reprocessed, providing a complete audit trail
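For reference, the breadcrumb shape the summary describes might look like the following sketch (field names follow the PR description below, which uses `processed_at` as the timestamp field; the actual type in the diff may differ):

```ts
// Sketch only: one entry is appended per consumer that processes the event.
interface KafkaConsumerBreadcrumb {
    topic: string        // topic the message was consumed from
    offset: number       // offset of the message within the partition
    partition: number    // partition the message was read from
    processed_at: string // ISO-8601 timestamp set by the consumer
    consumer_id: string  // consumer group id of the processing consumer
}
```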
```ts
    ? eventProperties.kafka_consumer_breadcrumbs
    : []

// Store the breadcrumb in event properties
```
suggestion: We should store the breadcrumbs outside of the event properties, as we don't really want to expose them to our customers.
IMO, they should be stored in a separate column in ClickHouse, so they should be at the top level of the produced Kafka message. It will be easier to aggregate them later into a single array of breadcrumbs for querying.
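A rough sketch of what this suggestion would look like (all names here are illustrative, and the thread below ends up going with message headers instead):

```ts
// Breadcrumbs as a top-level field on the produced message, destined for a
// dedicated ClickHouse column, rather than nested inside event properties.
const producedMessage = {
    ...eventRow, // hypothetical: the existing event payload
    kafka_consumer_breadcrumbs: breadcrumbs, // top level, so it maps cleanly to its own column
}
await producer.produce({
    topic: clickhouseEventsTopic, // hypothetical topic name
    value: Buffer.from(JSON.stringify(producedMessage)),
})
```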
Cool, sounds good!
Force-pushed from `ba41180` to `aff712d`
```diff
-        kafkaMessages.map((message) =>
-            this.kafkaOverflowProducer!.produce({
+        incomingEvents.map(({ message, event }) => {
+            const { data: dataStr, ...rawEvent } = parseJSON(message.value!.toString())
```
question: Is there a way we could do it without adding another pair of JSON parse/stringify calls? JSON serialization in Node is quite expensive and we might see a visible bump in CPU usage if we do this.
I think I'd have to pass the `rawEvent` object from `parseKafkaBatch` to this point in the code. Then I should be able to avoid the `const { data: dataStr, ...rawEvent } = parseJSON(message.value!.toString())` line, but I'm pretty sure there's no way around making the two stringify calls, as I need to reserialize the modified data so I can send it.
I checked to see if I could just reconstruct `rawEvent` from the event passed into `emitToOverflow`, and unfortunately we are overwriting some of the fields that we need in the `normalizeEvent` step... (the `ip` field)
Similar stringify calls are made when we do the `emitEvent` calls in other cases where we've changed the event.
Added the breadcrumbs to the message headers in a subsequent commit. We don't have to deserialize/reserialize the entire message anymore to tack on these breadcrumbs.
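A minimal sketch of the header-based approach (the producer API and variable names here are assumptions, not the exact code from the commit):

```ts
// Attach the breadcrumb as a Kafka message header, so the message body can be
// forwarded byte-for-byte with no extra JSON parse/stringify of the payload.
const breadcrumb = {
    topic: message.topic,
    offset: message.offset,
    partition: message.partition,
    processed_at: new Date().toISOString(),
    consumer_id: consumerGroupId, // hypothetical: the consumer group id
}

await producer.produce({
    topic: overflowTopic, // hypothetical destination topic
    key: message.key,
    value: message.value, // original bytes, untouched
    headers: [{ kafka_consumer_breadcrumbs: Buffer.from(JSON.stringify(breadcrumb)) }],
})
```

Only the small breadcrumb object gets serialized, so the per-message cost stays constant regardless of payload size.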
Might want to get a 2nd pair of 👁️ from @pl but this LGTM 👍
```diff
@@ -97,6 +106,7 @@ describe('IngestionConsumer', () => {
     beforeEach(async () => {
         fixedTime = DateTime.fromObject({ year: 2025, month: 1, day: 1 }, { zone: 'UTC' })
         jest.spyOn(Date, 'now').mockReturnValue(fixedTime.toMillis())
+        jest.spyOn(Date.prototype, 'toISOString').mockReturnValue(fixedTime.toISO()!)
```
nit: Jest has mocks for everything time-related: https://jestjs.io/docs/timer-mocks
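For reference, the pattern the nit is pointing at would be something like this (a sketch, not the code the PR ended up with):

```ts
// Modern fake timers pin Date.now() and new Date() in one place,
// replacing the individual spyOn calls above.
beforeEach(() => {
    jest.useFakeTimers()
    jest.setSystemTime(new Date('2025-01-01T00:00:00Z'))
})

afterEach(() => {
    jest.useRealTimers()
})
```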
hmm, my tests were stalling when I tried using `useFakeTimers()`, not sure why... so I opted to use this `spyOn`.
Ok, let's leave it as it is then – there might be a `process.nextTick` call or something like that, but it looks like it's not worth the effort.
```ts
if (validatedBreadcrumbs.success) {
    existingBreadcrumbs.push(...validatedBreadcrumbs.data)
} else {
    console.log('yes')
```
whoops, just rm'ed
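The `success`/`data` pattern in the snippet above looks like Zod's `safeParse`; a minimal sketch of that validation path, with hypothetical schema and logger names:

```ts
import { z } from 'zod'

// Hypothetical schema matching the breadcrumb structure from the PR description.
const KafkaConsumerBreadcrumbSchema = z.object({
    topic: z.string(),
    offset: z.number(),
    partition: z.number(),
    processed_at: z.string(),
    consumer_id: z.string(),
})

const validatedBreadcrumbs = z.array(KafkaConsumerBreadcrumbSchema).safeParse(rawBreadcrumbs)

if (validatedBreadcrumbs.success) {
    existingBreadcrumbs.push(...validatedBreadcrumbs.data)
} else {
    // A malformed breadcrumb shouldn't fail ingestion: record it and move on.
    logger.warn('invalid kafka_consumer_breadcrumbs payload', validatedBreadcrumbs.error)
}
```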
Problem
We are potentially duplicating up to 3-4% of events being ingested into ClickHouse. To figure out where in our ingestion/processing pipeline these events are being duplicated, we'd like to add some breadcrumbs to help with the analysis.
The portion of our pipeline we need observability into is the following:
[client] -> [capture api] -> [events_plugin_ingestion topic] -> [ingestion-events consumer] -> [events_plugin_ingestion_* topic] -> [ingestion-events-* consumer]
(NOTE: `*` can be any of `dlq`, `historical`, or `overflow`. These are potential destinations other than the ClickHouse topic. We need to ensure we track duplicates across these as well.)
Client -> Capture API
A `now` timestamp is already being set on the event when it arrives at the server. If this timestamp differs between two events with an otherwise unique identifier, then the client sent the event multiple times.
Capture API -> ingestion-topic, or ingestion-consumer -> ingestion-topic
The `offset` in the breadcrumb should indicate whether or not a producer enqueued a single event to a Kafka topic multiple times. The Kafka topic is also written with the breadcrumb, so we know which topic was being produced to when the duplication happens.
ingestion-topic -> ingestion-consumer
The `processed_at` timestamp that we set in the breadcrumb will allow us to understand if our consumer dequeued the same event off the ingestion-topic twice. The Kafka topic and consumer id are also written with the breadcrumb, so we know which topic was being consumed from and the consumer group id.
Changes
Adds a `kafka_consumer_breadcrumbs` array to `event.properties` while processing events in our ingestion consumers. This array is a list of objects with the following structure: `{topic, offset, partition, processed_at, consumer_id}` (an example chain is sketched below).
How did you test this code?
Added unit tests and updated the snapshots
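For illustration, here's what the breadcrumb chain might look like after an event passes through the main consumer and is then routed to overflow (all values below are made up; the consumer ids are hypothetical group ids):

```ts
const exampleBreadcrumbs = [
    {
        topic: 'events_plugin_ingestion',
        offset: 1042,
        partition: 3,
        processed_at: '2025-01-01T00:00:00.000Z',
        consumer_id: 'events-ingestion-consumer',
    },
    {
        // second entry appended by the overflow consumer; the first is preserved
        topic: 'events_plugin_ingestion_overflow',
        offset: 87,
        partition: 0,
        processed_at: '2025-01-01T00:00:05.000Z',
        consumer_id: 'events-ingestion-overflow-consumer',
    },
]
```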