This repository contains an end-to-end data platform "playground" simulating an e-commerce marketplace. It's designed for data engineers to quickly experiment with various data engineering use cases in a semi-realistic environment.
The architecture combines a local stack (Kafka, Debezium, MySQL) with cloud services. The primary goal is to provide a complete, replicable infrastructure setup for different cloud providers (starting with AWS).
Note: This is a learning playground, not a production-ready system. It is not intended for production use and lacks proper monitoring, comprehensive error handling, and other operational essentials.
- Infrastructure: Terraform
- Cloud: AWS (S3, Lake Formation, Glue, Athena, API Gateway, Firehose, Lambda, CloudWatch), GCP (forthcoming), Azure (forthcoming)
- Streaming: Debezium, Kafka
- Databases: MySQL (local)
- Transformation: DBT (local)
- Orchestration: Dagster (local)
- Core Language: Python (with `uv`)
- Local Stack: Docker & Docker Compose
The platform's data is generated by three distinct Python services that all pull from a shared configuration (src/mcdp/project_common/config/presets/base.yaml).
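The contents of `base.yaml` aren't reproduced in this README, but the preset-plus-override pattern (see `config/` below) can be sketched as a recursive dictionary merge. The keys in this example are illustrative only, not the actual configuration schema:

```python
from copy import deepcopy

def merge_config(base: dict, override: dict) -> dict:
    """Recursively merge an override dict onto a base preset.

    Nested dicts are merged key by key; any other override value
    replaces the base value outright. The base dict is not mutated.
    """
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged

# Illustrative keys only -- not the real base.yaml contents.
base = {"currency": "EUR", "orders": {"rate_per_min": 60, "channels": ["direct", "google"]}}
override = {"orders": {"rate_per_min": 5}}

print(merge_config(base, override))
```

An environment override only needs to name the keys it changes; everything else falls through to the shared preset, which is what keeps the three generators consistent with each other.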
- Generator: `mcdp-mysql-transactional`
- Description: This service first generates the schema (DDL) and seed data (customers, products) for a transactional MySQL database. It then simulates a live e-commerce workload by inserting new `orders`, `order_items`, and `order_status_events`.
- Key Feature: It includes an `OrderStatusAdvancer` that applies configurable status transitions (e.g., `created` -> `paid` -> `shipped`) as in-place updates, generating a realistic Change Data Capture (CDC) stream.
- Flow: MySQL -> Debezium (Kafka Connect) -> Kafka -> Kafka Sink Connector -> Cloud Storage (S3)
Click to see sample SQL output
```sql
insert into orders (order_id,customer_id,order_ts,status,currency,subtotal,tax,shipping_fee,total_amount,channel,campaign_id,created_at,updated_at)
values ('o_6202402934632','c_489','2025-11-03 11:34:30','created','EUR',36.36,8.36,3.49,48.21,'direct',NULL,'2025-11-03 11:34:30','2025-11-03 11:34:30');

insert into order_items (order_id,product_id,qty,unit_price,line_amount,updated_at_timestamp)
values ('o_6202402934632','p_1011',2,18.18,36.36,'2025-11-03 11:34:30');
```

- Generator: `mcdp-clickstream`
- Description: This service generates persona-driven browsing sessions as a stream of JSON events. It simulates a full user funnel, including `page_view`, `search`, `add_to_cart`, `checkout_start`, and `purchase`, as well as negative events like `payment_failed`.
- Key Feature: Events are enriched with realistic marketing context (channel, UTMs), device data, and A/B test assignments. It can emit events to stdout, a local file, or a live HTTP endpoint.
- Flow: Generator -> HTTP API Endpoint (Cloud) -> (Downstream processing)
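A funnel like this can be sketched as a sequence of conditional steps with drop-off. The step probabilities and field set below are illustrative, not the generator's actual parameters:

```python
import json
import random
import uuid

# Illustrative funnel: each step fires with some probability, and the
# session may drop off at any point -- a simplified take on the
# persona-driven sessions described above.
FUNNEL = [("page_view", 1.0), ("search", 0.7), ("add_to_cart", 0.5),
          ("checkout_start", 0.6), ("purchase", 0.8)]

def generate_session(rng: random.Random) -> list[dict]:
    """Emit the events of one browsing session as a list of dicts."""
    session_id = f"s_{uuid.UUID(int=rng.getrandbits(128)).hex[:8]}"
    events = []
    for event_type, p in FUNNEL:
        if rng.random() > p:
            break  # user drops out of the funnel here
        events.append({
            "schema_version": "1.1",
            "event_type": event_type,
            "session_id": session_id,
            # The real generator additionally enriches each event with
            # UTMs, device data, geo, and A/B test assignments.
            "channel": "direct",
        })
    return events

for event in generate_session(random.Random(7)):
    print(json.dumps(event))
```

Emitting each event as one JSON object per line (as above) matches the newline-delimited shape that an HTTP ingestion endpoint or file sink typically expects.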
Click to see sample JSON event
```json
{
  "schema_version": "1.1",
  "event_type": "page_view",
  "session_id": "s_aa31a874",
  "channel": "direct",
  "campaign_id": null,
  "page_url": "https://shop.example.com/category/electronics",
  "geo_country": "DK",
  "ab_test_assignments": {"homepage_hero": "variant_a", "checkout_flow": "one_page"},
  "event_properties": {
    "page_category": "Electronics",
    "campaign_landing": "https://shop.example.com/category/electronics",
    "utm_source": "direct",
    "utm_medium": "direct"
  },
  ...
}
```

- Generator: `mcdp-ad-spend`
- Description: This service produces daily marketing spend and performance reports in CSV format. It simulates metrics like cost, impressions, clicks, and revenue per campaign.
- Key Feature: The generator dynamically models seasonality, trends, weekend uplift, and campaign fatigue to produce realistic, non-static data. It can write files locally or upload them directly to S3, GCS, or Azure Blob Storage.
- Flow: Generator -> CSV Files -> Cloud Storage (S3)
Click to see sample CSV output
```csv
date,channel,campaign_id,cost,impressions,clicks,conversions,revenue,currency
2025-01-15,google,cmp_42,116.81,29301,287,5,260.24,EUR
2025-01-15,facebook,cmp_ret,140.98,27246,330,3,174.38,EUR
```
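The seasonality/trend/fatigue modelling can be sketched as a product of simple multiplicative factors. All coefficients below are invented for illustration; the real generator's parameters come from its configuration:

```python
import math
from datetime import date

def daily_cost(day: date, base_cost: float, campaign_age_days: int) -> float:
    """Toy daily-spend model combining the effects the generator simulates.

    Each factor is deliberately simple: a yearly sine wave for seasonality,
    a flat weekend multiplier, exponential decay for campaign fatigue,
    and a slow linear growth trend.
    """
    yearly = 1.0 + 0.2 * math.sin(2 * math.pi * day.timetuple().tm_yday / 365)
    weekend = 1.15 if day.weekday() >= 5 else 1.0
    fatigue = math.exp(-campaign_age_days / 180)
    trend = 1.0 + 0.001 * campaign_age_days
    return round(base_cost * yearly * weekend * fatigue * trend, 2)

print(daily_cost(date(2025, 1, 15), base_cost=120.0, campaign_age_days=30))  # a Wednesday
print(daily_cost(date(2025, 1, 18), base_cost=120.0, campaign_age_days=30))  # a Saturday
```

Multiplying independent factors like this keeps each effect tunable on its own, which is why the output drifts day to day instead of repeating a static pattern.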
```text
.
├─ assets/                  # Architecture diagrams and reference images
├─ config/                  # Presets and environment overrides
├─ dbt/                     # dbt projects (one per cloud provider)
├─ infra/
│  ├─ aws/                  # AWS implementation
│  ├─ gcp/                  # Forthcoming Google Cloud implementation
│  └─ azure/                # Forthcoming Azure implementation
├─ orchestration/dagster/   # Dagster workspace + assets
├─ src/mcdp/                # Python package (generators, CDC stack, CLIs)
├─ tests/                   # Unit and smoke tests
├─ Makefile                 # Convenience workflows backed by uv
└─ pyproject.toml           # Project metadata and dependencies
```
Before you begin, please ensure you have the following tools installed:
- Python 3.10+ and `uv`
- Docker & Docker Compose
- Terraform (1.0.0+)
Please follow the guide for your target cloud provider. (Note: Only AWS is currently implemented).
- ➡️ AWS Deployment Guide
- (Coming Soon) GCP Deployment Guide
- (Coming Soon) Azure Deployment Guide
Contributions are welcome! Open a pull request that explains the motivation, highlights key changes, and notes any follow-up work. Please also document any new infrastructure or pipeline behaviors in the appropriate README.
