This is the source code that accompanies the solution: Deduplication of messages with Cloud PubSub and Cloud Dataflow. This sample code demonstrates three approaches for deduplication:
- PubSubIO: com.google.examples.dfdedup.DedupWithPubSubIO
- Distinct transform: com.google.examples.dfdedup.DedupWithDistinct
- Custom state-based deduplication: com.google.examples.dfdedup.DedupWithStateAndGC
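To make the third approach more concrete, here is a minimal plain-Java sketch of the general idea behind state-based deduplication with garbage collection: remember each message ID when it is first seen, drop repeats, and expire remembered IDs after a retention period so the state does not grow without bound. This is not the code in DedupWithStateAndGC and it does not use the Dataflow or Beam APIs. The class name `SeenIdCache`, its methods, and the TTL parameter are all hypothetical names chosen for illustration.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/** Hypothetical illustration only; not a class from this repository. */
class SeenIdCache {
    // How long an ID is remembered before it is garbage collected.
    private final long ttlMillis;
    // Message ID -> timestamp (millis) at which it was first seen.
    private final Map<String, Long> firstSeenAt = new HashMap<>();

    SeenIdCache(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    /**
     * Returns true if the ID has not been seen within the TTL and records it.
     * Returns false for a duplicate; a duplicate does not refresh the timestamp.
     */
    boolean firstSeen(String id, long nowMillis) {
        expire(nowMillis);
        if (firstSeenAt.containsKey(id)) {
            return false;
        }
        firstSeenAt.put(id, nowMillis);
        return true;
    }

    /** Garbage collection: forget IDs whose age has reached the TTL. */
    void expire(long nowMillis) {
        Iterator<Map.Entry<String, Long>> it = firstSeenAt.entrySet().iterator();
        while (it.hasNext()) {
            if (nowMillis - it.next().getValue() >= ttlMillis) {
                it.remove();
            }
        }
    }

    /** Number of IDs currently held in state. */
    int size() {
        return firstSeenAt.size();
    }
}
```

The trade-off this sketch exposes is the retention period: a longer TTL catches duplicates that arrive further apart but keeps more state, while a shorter TTL bounds state at the cost of letting late duplicates through.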
You can run the following end-to-end pipeline to explore deduplication behavior across all three approaches:
NOTE: If you're new to GCP, please see the quickstarts for Cloud PubSub, BigQuery, and Cloud Dataflow.
Use the schema files under bqschemas/ to create