Multi-cloud e-commerce data platform playground with synthetic generators, CDC pipelines, and Terraform-driven infra.

adavoudi/multi-cloud-data-platform

Multi-Cloud Data Platform

This repository contains an end-to-end data platform "playground" that simulates an e-commerce marketplace. It is designed to let data engineers experiment quickly with common data engineering use cases in a semi-realistic environment.

The architecture combines a local stack (Kafka, Debezium, MySQL) with cloud services. The primary goal is to provide a complete, replicable infrastructure setup for different cloud providers (starting with AWS).

Note: This is a learning playground, not a production-ready system. It lacks proper monitoring, comprehensive error handling, and other operational requirements.


🛠️ Tech Stack

  • Infrastructure: Terraform
  • Cloud: AWS (S3, Lake Formation, Glue, Athena, API Gateway, Firehose, Lambda, CloudWatch), GCP (forthcoming), Azure (forthcoming)
  • Streaming: Debezium, Kafka
  • Databases: MySQL (local)
  • Transformation: dbt (local)
  • Orchestration: Dagster (local)
  • Core Language: Python (with uv)
  • Local Stack: Docker & Docker Compose

🏗️ Architecture & Data Generators

The platform's data is generated by three distinct Python services that all pull from a shared configuration (src/mcdp/project_common/config/presets/base.yaml).

[Figure: Data Generators Architecture]

🛍️ Transactional DB (CDC)

  • Generator: mcdp-mysql-transactional
  • Description: This service first generates the schema (DDL) and seed data (customers, products) for a transactional MySQL database. It then simulates a live e-commerce workload by inserting new orders, order_items, and order_status_events.
  • Key Feature: It includes an OrderStatusAdvancer that applies configurable status transitions (e.g., created -> paid -> shipped) as in-place updates, generating a realistic Change Data Capture (CDC) stream.
  • Flow: MySQL -> Debezium (Kafka Connect) -> Kafka -> Kafka Sink Connector -> Cloud Storage (S3)
Sample SQL output:
insert into orders (order_id,customer_id,order_ts,status,currency,subtotal,tax,shipping_fee,total_amount,channel,campaign_id,created_at,updated_at)
values ('o_6202402934632','c_489','2025-11-03 11:34:30','created','EUR',36.36,8.36,3.49,48.21,'direct',NULL,'2025-11-03 11:34:30','2025-11-03 11:34:30');

insert into order_items (order_id,product_id,qty,unit_price,line_amount,updated_at_timestamp)
values ('o_6202402934632','p_1011',2,18.18,36.36,'2025-11-03 11:34:30');
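
The in-place status updates described under Key Feature can be illustrated with a minimal sketch. This is not the project's OrderStatusAdvancer: the transition table, the function name advance_status, and the exact UPDATE statement are assumptions made for illustration. Only the orders columns (order_id, status, updated_at) come from the sample output above.

```python
from datetime import datetime
from typing import Optional, Tuple

# Hypothetical transition table; the real transitions are configurable in the project.
NEXT_STATUS = {"created": "paid", "paid": "shipped"}


def advance_status(order_id: str, current: str, now: datetime) -> Optional[Tuple[str, tuple]]:
    """Build a parameterized UPDATE moving an order to its next status.

    Returns None when the current status has no successor in NEXT_STATUS.
    """
    next_status = NEXT_STATUS.get(current)
    if next_status is None:
        return None
    # Matching on the current status keeps the update idempotent if it is retried.
    sql = (
        "update orders set status = %s, updated_at = %s "
        "where order_id = %s and status = %s"
    )
    params = (next_status, now.strftime("%Y-%m-%d %H:%M:%S"), order_id, current)
    return sql, params


statement = advance_status("o_6202402934632", "created", datetime(2025, 11, 3, 11, 40, 0))
print(statement)
```

Because the row is updated in place rather than re-inserted, a CDC tool such as Debezium would see it as an update to an existing row, which is what makes the resulting stream resemble a real workload.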

🖱️ Clickstream (Real-time Events)

  • Generator: mcdp-clickstream
  • Description: This service generates persona-driven browsing sessions as a stream of JSON events. It simulates a full user funnel, including page_view, search, add_to_cart, checkout_start, and purchase, as well as negative events like payment_failed.
  • Key Feature: Events are enriched with realistic marketing context (channel, UTMs), device data, and A/B test assignments. It can emit events to stdout, a local file, or a live HTTP endpoint.
  • Flow: Generator -> HTTP API Endpoint (Cloud) -> (Downstream processing)
Sample JSON event:
{
  "schema_version": "1.1",
  "event_type": "page_view",
  "session_id": "s_aa31a874",
  "channel": "direct",
  "campaign_id": null,
  "page_url": "https://shop.example.com/category/electronics",
  "geo_country": "DK",
  "ab_test_assignments": {"homepage_hero": "variant_a", "checkout_flow": "one_page"},
  "event_properties": {
    "page_category": "Electronics",
    "campaign_landing": "https://shop.example.com/category/electronics",
    "utm_source": "direct",
    "utm_medium": "direct"
  },
  ...
}
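
As a rough illustration of the funnel simulation, the sketch below walks one session through the event types listed above and ends it with either purchase or payment_failed. The function name generate_session and the two probabilities are hypothetical; the real generator is persona-driven and attaches far more context (marketing, device, A/B assignments) to each event.

```python
import random

# Funnel steps taken from the event types named in the description.
FUNNEL = ["page_view", "search", "add_to_cart", "checkout_start"]


def generate_session(rng, continue_prob=0.7, payment_failure_prob=0.1):
    """Return the list of events for one simulated session.

    continue_prob and payment_failure_prob are illustrative knobs, not project settings.
    """
    session_id = f"s_{rng.getrandbits(32):08x}"
    events = []
    for step in FUNNEL:
        events.append(
            {"schema_version": "1.1", "event_type": step, "session_id": session_id}
        )
        # The user abandons the session after this step with probability 1 - continue_prob.
        if rng.random() >= continue_prob:
            return events
    outcome = "payment_failed" if rng.random() < payment_failure_prob else "purchase"
    events.append(
        {"schema_version": "1.1", "event_type": outcome, "session_id": session_id}
    )
    return events


for event in generate_session(random.Random(42)):
    print(event["event_type"])
```

Passing in a seeded random.Random makes a run reproducible, which is convenient when comparing downstream results between experiments.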

💸 Ad Spend (Batch)

  • Generator: mcdp-ad-spend
  • Description: This service produces daily marketing spend and performance reports in CSV format. It simulates metrics like cost, impressions, clicks, and revenue per campaign.
  • Key Feature: The generator dynamically models seasonality, trends, weekend uplift, and campaign fatigue to produce realistic, non-static data. It can write files locally or upload them directly to S3, GCS, or Azure Blob Storage.
  • Flow: Generator -> CSV Files -> Cloud Storage (S3)
Sample CSV output:
date,channel,campaign_id,cost,impressions,clicks,conversions,revenue,currency
2025-01-15,google,cmp_42,116.81,29301,287,5,260.24,EUR
2025-01-15,facebook,cmp_ret,140.98,27246,330,3,174.38,EUR
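
One plausible way to combine seasonality, trend, weekend uplift, and campaign fatigue is to multiply independent factors, as in the sketch below. The function expected_clicks, its parameters, and the specific formulas (a yearly sine wave, a linear trend, an exponential half-life decay) are assumptions chosen for illustration and are not taken from the project's generator.

```python
import math
from datetime import date


def expected_clicks(
    day,
    campaign_start,
    base=300.0,
    seasonality_amplitude=0.2,
    trend_per_day=0.001,
    weekend_uplift=0.15,
    fatigue_half_life_days=60.0,
):
    """Expected daily clicks for a campaign as a product of four factors."""
    age_days = (day - campaign_start).days
    # Yearly seasonality modelled as a sine wave over the day of the year.
    day_of_year = day.timetuple().tm_yday
    seasonality = 1.0 + seasonality_amplitude * math.sin(2 * math.pi * day_of_year / 365.25)
    # Slow linear growth over the campaign's lifetime.
    trend = 1.0 + trend_per_day * age_days
    # Saturday and Sunday have weekday() values 5 and 6.
    weekend = 1.0 + weekend_uplift if day.weekday() >= 5 else 1.0
    # Fatigue halves the response every fatigue_half_life_days.
    fatigue = 0.5 ** (age_days / fatigue_half_life_days)
    return base * seasonality * trend * weekend * fatigue


start = date(2025, 1, 15)
print(round(expected_clicks(date(2025, 1, 18), start), 2))
```

A real generator would typically add random noise on top of this expected value, so that consecutive days do not follow the curve exactly.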

🗺️ Repository Structure

.
├─ assets/                 # Architecture diagrams and reference images
├─ config/                 # Presets and environment overrides
├─ dbt/                    # dbt projects (one per cloud provider)
├─ infra/
│  ├─ aws/                 # AWS Implementation
│  ├─ gcp/                 # Forthcoming Google Cloud implementation
│  └─ azure/               # Forthcoming Azure implementation
├─ orchestration/dagster/  # Dagster workspace + assets
├─ src/mcdp/               # Python package (generators, CDC stack, CLIs)
├─ tests/                  # Unit and smoke tests
├─ Makefile                # Convenience workflows backed by uv
└─ pyproject.toml          # Project metadata and dependencies

🚀 Getting Started

Prerequisites

Before you begin, please ensure you have the following tools installed:

  • Python 3.10+ and uv
  • Docker & Docker Compose
  • Terraform (1.0.0+)
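
If it helps, the prerequisites can be checked with a few lines of Python before deploying. This is only a convenience sketch: the executable names uv, docker, and terraform are assumed to be how the tools appear on PATH, and neither the minimum Terraform version nor the Docker Compose plugin is verified.

```python
import shutil
import sys

# Assumed executable names; Docker Compose usually ships as a `docker compose`
# subcommand, so it cannot be detected by looking at PATH alone.
REQUIRED_TOOLS = ["uv", "docker", "terraform"]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools from `tools` that are not found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def python_ok(version_info=sys.version_info):
    """True when the interpreter is Python 3.10 or newer."""
    return tuple(version_info[:2]) >= (3, 10)


if __name__ == "__main__":
    print("Python 3.10+:", python_ok())
    print("Missing tools:", missing_tools() or "none")
```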

Deploy the Infrastructure

Follow the guide for your target cloud provider. Only AWS is currently implemented.

  • ➡️ AWS Deployment Guide
  • (Coming Soon) GCP Deployment Guide
  • (Coming Soon) Azure Deployment Guide

🤝 Contributing

Contributions are welcome! Open a pull request that explains the motivation, highlights key changes, and notes any follow-up work. Please also document any new infrastructure or pipeline behaviors in the appropriate README.
