This repository contains an end-to-end data platform "playground" simulating an e-commerce marketplace. It's designed for data engineers to quickly experiment with various data engineering use cases in a semi-realistic environment.
The architecture combines a local stack (Kafka, Debezium, MySQL) with cloud services. The primary goal is to provide a complete, replicable infrastructure setup for different cloud providers (starting with AWS).
Note: This is a learning playground, not a production-ready system. It is not intended for production use and lacks proper monitoring, comprehensive error handling, and other operational essentials.
- Infrastructure: Terraform
- Cloud: AWS (S3, Lake Formation, Glue, Athena, API Gateway, Firehose, Lambda, CloudWatch), GCP (forthcoming), Azure (forthcoming)
- Streaming: Debezium, Kafka
- Databases: MySQL (local)
- Transformation: DBT (local)
- Orchestration: Dagster (local)
- Core Language: Python (with `uv`)
- Local Stack: Docker & Docker Compose
The platform's data is generated by three distinct Python services that all pull from a shared configuration (src/mcdp/project_common/config/presets/base.yaml).
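The contents of `base.yaml` aren't reproduced in this README, but the preset-plus-override pattern (see `config/` below) can be sketched as a recursive dictionary merge. The keys in this example are illustrative only, not the actual configuration schema:

```python
from copy import deepcopy

def merge_config(base: dict, override: dict) -> dict:
    """Recursively merge an override dict onto a base preset.

    Nested dicts are merged key by key; any other override value
    replaces the base value outright. The base dict is not mutated.
    """
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged

# Illustrative keys only -- not the real base.yaml contents.
base = {"currency": "EUR", "orders": {"rate_per_min": 60, "channels": ["direct", "google"]}}
override = {"orders": {"rate_per_min": 5}}

print(merge_config(base, override))
```

An environment override only needs to name the keys it changes; everything else falls through to the shared preset, which is what keeps the three generators consistent with each other.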
- Generator: `mcdp-mysql-transactional`
- Description: This service first generates the schema (DDL) and seed data (customers, products) for a transactional MySQL database. It then simulates a live e-commerce workload by inserting new `orders`, `order_items`, and `order_status_events`.
- Key Feature: It includes an `OrderStatusAdvancer` that applies configurable status transitions (e.g., `created` -> `paid` -> `shipped`) as in-place updates, generating a realistic Change Data Capture (CDC) stream.
- Flow: MySQL -> Debezium (Kafka Connect) -> Kafka -> Kafka Sink Connector -> Cloud Storage (S3)
Click to see sample SQL output
```sql
insert into orders (order_id,customer_id,order_ts,status,currency,subtotal,tax,shipping_fee,total_amount,channel,campaign_id,created_at,updated_at)
values ('o_6202402934632','c_489','2025-11-03 11:34:30','created','EUR',36.36,8.36,3.49,48.21,'direct',NULL,'2025-11-03 11:34:30','2025-11-03 11:34:30');

insert into order_items (order_id,product_id,qty,unit_price,line_amount,updated_at_timestamp)
values ('o_6202402934632','p_1011',2,18.18,36.36,'2025-11-03 11:34:30');
```

- Generator: `mcdp-clickstream`
- Description: This service generates persona-driven browsing sessions as a stream of JSON events. It simulates a full user funnel, including `page_view`, `search`, `add_to_cart`, `checkout_start`, and `purchase`, as well as negative events like `payment_failed`.
- Key Feature: Events are enriched with realistic marketing context (channel, UTMs), device data, and A/B test assignments. It can emit events to stdout, a local file, or a live HTTP endpoint.
- Flow: Generator -> HTTP API Endpoint (Cloud) -> (Downstream processing)
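A funnel like this can be sketched as a sequence of conditional steps with drop-off. The step probabilities and field set below are illustrative, not the generator's actual parameters:

```python
import json
import random
import uuid

# Illustrative funnel: each step fires with some probability, and the
# session may drop off at any point -- a simplified take on the
# persona-driven sessions described above.
FUNNEL = [("page_view", 1.0), ("search", 0.7), ("add_to_cart", 0.5),
          ("checkout_start", 0.6), ("purchase", 0.8)]

def generate_session(rng: random.Random) -> list[dict]:
    """Emit the events of one browsing session as a list of dicts."""
    session_id = f"s_{uuid.UUID(int=rng.getrandbits(128)).hex[:8]}"
    events = []
    for event_type, p in FUNNEL:
        if rng.random() > p:
            break  # user drops out of the funnel here
        events.append({
            "schema_version": "1.1",
            "event_type": event_type,
            "session_id": session_id,
            # The real generator additionally enriches each event with
            # UTMs, device data, geo, and A/B test assignments.
            "channel": "direct",
        })
    return events

for event in generate_session(random.Random(7)):
    print(json.dumps(event))
```

Emitting each event as one JSON object per line (as above) matches the newline-delimited shape that an HTTP ingestion endpoint or file sink typically expects.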
Click to see sample JSON event
```json
{
  "schema_version": "1.1",
  "event_type": "page_view",
  "session_id": "s_aa31a874",
  "channel": "direct",
  "campaign_id": null,
  "page_url": "https://shop.example.com/category/electronics",
  "geo_country": "DK",
  "ab_test_assignments": {"homepage_hero": "variant_a", "checkout_flow": "one_page"},
  "event_properties": {
    "page_category": "Electronics",
    "campaign_landing": "https://shop.example.com/category/electronics",
    "utm_source": "direct",
    "utm_medium": "direct"
  },
  ...
}
```

- Generator: `mcdp-ad-spend`
- Description: This service produces daily marketing spend and performance reports in CSV format. It simulates metrics like cost, impressions, clicks, and revenue per campaign.
- Key Feature: The generator dynamically models seasonality, trends, weekend uplift, and campaign fatigue to produce realistic, non-static data. It can write files locally or upload them directly to S3, GCS, or Azure Blob Storage.
- Flow: Generator -> CSV Files -> Cloud Storage (S3)
Click to see sample CSV output
```csv
date,channel,campaign_id,cost,impressions,clicks,conversions,revenue,currency
2025-01-15,google,cmp_42,116.81,29301,287,5,260.24,EUR
2025-01-15,facebook,cmp_ret,140.98,27246,330,3,174.38,EUR
```
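The seasonality/trend/fatigue modelling can be sketched as a product of simple multiplicative factors. All coefficients below are invented for illustration; the real generator's parameters come from its configuration:

```python
import math
from datetime import date

def daily_cost(day: date, base_cost: float, campaign_age_days: int) -> float:
    """Toy daily-spend model combining the effects the generator simulates.

    Each factor is deliberately simple: a yearly sine wave for seasonality,
    a flat weekend multiplier, exponential decay for campaign fatigue,
    and a slow linear growth trend.
    """
    yearly = 1.0 + 0.2 * math.sin(2 * math.pi * day.timetuple().tm_yday / 365)
    weekend = 1.15 if day.weekday() >= 5 else 1.0
    fatigue = math.exp(-campaign_age_days / 180)
    trend = 1.0 + 0.001 * campaign_age_days
    return round(base_cost * yearly * weekend * fatigue * trend, 2)

print(daily_cost(date(2025, 1, 15), base_cost=120.0, campaign_age_days=30))  # a Wednesday
print(daily_cost(date(2025, 1, 18), base_cost=120.0, campaign_age_days=30))  # a Saturday
```

Multiplying independent factors like this keeps each effect tunable on its own, which is why the output drifts day to day instead of repeating a static pattern.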
```text
.
├─ assets/                  # Architecture diagrams and reference images
├─ config/                  # Presets and environment overrides
├─ dbt/                     # dbt projects (one per cloud provider)
├─ infra/
│  ├─ aws/                  # AWS implementation
│  ├─ gcp/                  # Forthcoming Google Cloud implementation
│  └─ azure/                # Forthcoming Azure implementation
├─ orchestration/dagster/   # Dagster workspace + assets
├─ src/mcdp/                # Python package (generators, CDC stack, CLIs)
├─ tests/                   # Unit and smoke tests
├─ Makefile                 # Convenience workflows backed by uv
└─ pyproject.toml           # Project metadata and dependencies
```
Before you begin, please ensure you have the following tools installed:
- Python 3.10+ and `uv`
- Docker & Docker Compose
- Terraform (1.0.0+)
Please follow the guide for your target cloud provider. (Note: Only AWS is currently implemented).
- ➡️ AWS Deployment Guide
- (Coming Soon) GCP Deployment Guide
- (Coming Soon) Azure Deployment Guide
Contributions are welcome! Open a pull request that explains the motivation, highlights key changes, and notes any follow-up work. Please also document any new infrastructure or pipeline behaviors in the appropriate README.
