Idempotent ETL pipeline preventing data reprocessing during large-scale migrations
Idempotence is the property that running the same pipeline multiple times with the same input data produces exactly the same result as running it once.
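As a minimal, self-contained illustration of the property (not code from this repository), an `INSERT OR REPLACE` keyed on the file name is idempotent, whereas a blind `INSERT` would add a duplicate row on every run:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (file_name TEXT PRIMARY KEY, status TEXT)")

def mark_loaded(file_name: str) -> None:
    # Idempotent: keyed on file_name, so re-running leaves one row instead of piling up duplicates.
    conn.execute(
        "INSERT OR REPLACE INTO files (file_name, status) VALUES (?, 'LOADED')",
        (file_name,),
    )
    conn.commit()

mark_loaded("file_1.csv")
mark_loaded("file_1.csv")  # second run: identical final state
assert conn.execute("SELECT COUNT(*) FROM files").fetchone()[0] == 1
```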
Traditional ETL pipelines lack state persistence. When failures occur during large-scale migrations (50+ files), the system cannot track which files were already processed, forcing a complete re-run.
Impact: duplicate records, wasted time (~45 min to recover), unnecessary costs.
┌────────────────────────────────────────────────────────┐
│  SFTP Source: 50 CSV Files                             │
└─────────────┬──────────────────────────────────────────┘
              │
              │  Execution 1
              ▼
┌──────────────────────┐
│  Processing...       │    ❌ NO METADATA
│  ✅ Files 1-29       │       (no memory of progress)
│  ❌ FAILURE at 30    │
└──────────────────────┘
              │
              │  Execution 2 (RE-RUN)
              ▼
┌──────────────────────┐    ❌ Cannot query progress
│  ⚠️  Reprocesses     │    ❌ No file status tracking
│      ALL 50 files!   │
│                      │    Result: reprocess everything
│  ❌ Duplicates       │
│  ❌ Wasted 45 min    │
└──────────────────────┘
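A hedged sketch of the stateless behaviour described above (illustrative only; `list_remote_files` and `upload_to_fabric` are stand-ins, not this repository's functions):

```python
from typing import Iterable

def list_remote_files() -> Iterable[str]:
    # Stand-in for an SFTP directory listing (the real pipeline uses Paramiko).
    return [f"file_{i}.csv" for i in range(1, 51)]

def upload_to_fabric(file_name: str) -> None:
    # Stand-in for the OneLake upload; fails mid-migration to mimic the scenario above.
    if file_name == "file_30.csv":
        raise RuntimeError("network error during upload")

def run_naive_pipeline() -> None:
    """Non-idempotent: nothing records which files already succeeded."""
    for file_name in list_remote_files():  # always iterates over all 50 files
        upload_to_fabric(file_name)        # a crash here throws away all earlier progress

# Every retry starts again at file_1.csv, re-uploading the 29 files that already succeeded.
```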
Idempotent pipeline with a SQLite-based FileTracker that maintains state across executions.
Key change: metadata tracking - store each file's processing state (file name, status, timestamp) in a SQLite database.
How it works (see the sketch after this list):
- Before processing: query metadata → get the list of completed files
- During processing: download + upload → mark status = `LOADED_TO_FABRIC` in metadata
- On retry: query metadata → skip completed → process only the remaining files
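A minimal sketch of that idea, assuming a single `files` table keyed on the file name (the table layout, column names, and method names here are illustrative; the real implementation lives in `src/utils/file_tracker.py`):

```python
import sqlite3
from datetime import datetime, timezone
from typing import Set

class FileTracker:
    """Persist per-file processing state so retries can skip completed files."""

    def __init__(self, db_path: str = "data/tracker.db") -> None:
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            """CREATE TABLE IF NOT EXISTS files (
                   file_name   TEXT PRIMARY KEY,
                   status      TEXT NOT NULL DEFAULT 'PENDING',
                   uploaded_at TEXT
               )"""
        )
        self.conn.commit()

    def completed_files(self) -> Set[str]:
        rows = self.conn.execute(
            "SELECT file_name FROM files WHERE status = 'LOADED_TO_FABRIC'"
        )
        return {row[0] for row in rows}

    def mark_loaded(self, file_name: str) -> None:
        # INSERT OR REPLACE keeps exactly one row per file, so marking twice is harmless.
        self.conn.execute(
            "INSERT OR REPLACE INTO files (file_name, status, uploaded_at) "
            "VALUES (?, 'LOADED_TO_FABRIC', ?)",
            (file_name, datetime.now(timezone.utc).isoformat()),
        )
        self.conn.commit()
```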
┌────────────────────────────────────────────────────────┐
│  SFTP Source: 50 CSV Files                             │
└─────────────┬──────────────────────────────────────────┘
              │
              │  Step 1: Query metadata
              ▼
┌──────────────────────┐     ┌───────────────────────────────┐
│  Query FileTracker   │ ◄── │  METADATA DATABASE (SQLite)   │
│  Get pending files   │     │  ┌─────────────────────────┐  │
└──────────┬───────────┘     │  │ file_name   | status    │  │
           │                 │  │ file_1.csv  | LOADED  ✅ │  │
           │                 │  │ file_2.csv  | LOADED  ✅ │  │
           │                 │  │ ...                     │  │
           │                 │  │ file_30.csv | LOADED  ✅ │  │
           │                 │  │ file_31.csv | PENDING ⏳ │  │
           │                 │  └─────────────────────────┘  │
           │                 └───────────────────────────────┘
           │  Step 2: Filter (skip completed)
           ▼
┌──────────────────────┐
│  Smart Filter        │     Queries metadata:
│  Already done: 30    │     "SELECT * WHERE status != 'LOADED'"
│  To process: 20      │     ONLY NEW FILES
└──────────┬───────────┘
           │
           │  Step 3: Process + update metadata
           ▼
┌──────────────────────┐     ┌───────────────────────────────┐
│  FOR EACH file:      │     │  UPDATE metadata:             │
│  1. Download         │ ──► │  SET status='LOADED'          │
│  2. Upload to Fabric │     │  SET uploaded_at=NOW()        │
│  3. Mark complete    │     │  WHERE file_name='file_31'    │
└──────────────────────┘     └───────────────────────────────┘
           │
           ▼
┌──────────────────────┐
│  ✅ Success          │     Result: metadata enables
│  ✅ No duplicates    │     idempotent retries
│  ✅ Recovery: 5 min  │
└──────────────────────┘
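Putting the three steps together: a sketch of the overall loop, reusing the illustrative `FileTracker` above and taking the SFTP/Fabric operations as injected callables (again an assumption for illustration, not the actual code in `run_pipeline.py`):

```python
from typing import Callable, Iterable, Optional

def run_pipeline(
    tracker: "FileTracker",                       # the illustrative tracker sketched earlier
    list_remote_files: Callable[[], Iterable[str]],
    download: Callable[[str], str],               # remote file name -> local staging path
    upload_to_fabric: Callable[[str], None],      # local path -> OneLake
    max_files: Optional[int] = None,              # mirrors the --max-files failure simulation
) -> None:
    done = tracker.completed_files()                              # Step 1: query metadata
    pending = [f for f in list_remote_files() if f not in done]   # Step 2: skip completed
    for i, file_name in enumerate(pending):                       # Step 3: process + update
        if max_files is not None and i >= max_files:
            break
        local_path = download(file_name)
        upload_to_fabric(local_path)
        tracker.mark_loaded(file_name)   # marked only after a successful upload
```

Because a file is only marked after its upload succeeds, a crash mid-loop leaves the metadata accurate, and the next run processes only what remains.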
| Metric | Traditional | FileTracker | Improvement |
|---|---|---|---|
| Recovery time | ~45 min | ~5 min | ⚡ 90% faster |
| Duplicates | High risk | Zero | 🛡️ Eliminated |
| Safe retries | ❌ No | ✅ Yes | Idempotent |
Full Problem Definition with Diagrams
Core Technologies:
- Python 3.8+: Orchestration
- Paramiko: SFTP client
- SQLite: State management
- Microsoft Fabric: Cloud data platform (OneLake REST API)
Architecture Pattern: Idempotent design with persistent state tracking
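As background on the Fabric piece: OneLake exposes an ADLS Gen2-compatible endpoint, so one common way to upload files is via the Azure SDK (`azure-identity` + `azure-storage-file-datalake`), roughly as below. This is an assumption for illustration; the project's `fabric_client.py` may call the REST endpoints directly instead.

```python
from azure.identity import ClientSecretCredential
from azure.storage.filedatalake import DataLakeServiceClient

def upload_csv_to_onelake(local_path: str, file_name: str,
                          tenant_id: str, client_id: str, client_secret: str,
                          workspace_id: str, lakehouse_id: str) -> None:
    """Upload one CSV into a lakehouse's Files area through OneLake's ADLS Gen2-compatible API."""
    credential = ClientSecretCredential(tenant_id, client_id, client_secret)
    service = DataLakeServiceClient(
        account_url="https://onelake.dfs.fabric.microsoft.com", credential=credential
    )
    workspace = service.get_file_system_client(workspace_id)
    file_client = workspace.get_file_client(f"{lakehouse_id}/Files/{file_name}")
    with open(local_path, "rb") as data:
        # overwrite=True makes the upload idempotent at the destination as well.
        file_client.upload_data(data, overwrite=True)
```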
fabric-data-migration/
├── src/
│   ├── ingestion.py                    # SFTP download logic
│   ├── fabric_client.py                # Microsoft Fabric upload (REST API)
│   └── utils/
│       └── file_tracker.py             # SQLite state management
├── scripts/
│   ├── simulate_partial_failure.py     # Automated demo
│   ├── demo_idempotency.py
│   └── manage_tracker.py
├── docs/
│   └── problem_definition.md           # Manual demo
├── data/
│   ├── staging/                        # Temporary: downloaded CSVs
│   └── tracker.db                      # Persistent: file processing state
├── run_pipeline.py                     # Main entry point
├── config.py                           # Environment configuration
└── .env.example                        # Configuration template
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements.txt

Create a `.env` file (use `.env.example` as a template):
# SFTP Configuration
SFTP_HOST=your-server.com
SFTP_USERNAME=your_user
SFTP_PASSWORD=your_password
SFTP_SERVER_PATH=data/raw
# Microsoft Fabric (get from Azure Portal)
FABRIC_WORKSPACE_ID=your-workspace-guid
FABRIC_LAKEHOUSE_ID=your-lakehouse-guid
AZURE_TENANT_ID=your-tenant-guid
AZURE_CLIENT_ID=your-client-guid
AZURE_CLIENT_SECRET=your-secret

# Full migration
python run_pipeline.py
# Test idempotency (simulate failure at file #31)
python run_pipeline.py --max-files 30
python run_pipeline.py                  # Processes only the remaining 20 files

Idempotency Demo & Testing
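The repository ships `scripts/demo_idempotency.py` and `scripts/simulate_partial_failure.py` for this. As a rough illustration of the kind of check involved (assuming the illustrative `files` table from the sketches above, not the scripts' actual code):

```python
import sqlite3

def assert_idempotent(db_path: str = "data/tracker.db") -> None:
    """Run after two consecutive executions: the tracker must hold exactly
    one row per file, i.e. the second run added no duplicates."""
    conn = sqlite3.connect(db_path)
    total = conn.execute("SELECT COUNT(*) FROM files").fetchone()[0]
    distinct = conn.execute("SELECT COUNT(DISTINCT file_name) FROM files").fetchone()[0]
    loaded = conn.execute(
        "SELECT COUNT(*) FROM files WHERE status = 'LOADED_TO_FABRIC'"
    ).fetchone()[0]
    assert total == distinct, "duplicate rows found - the pipeline is not idempotent"
    print(f"{loaded}/{total} files loaded, no duplicates")
```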
- ✅ Idempotent: Safe to re-run without duplicates
- ✅ Resumable: Continues from the last successful file
- ✅ Testable: `--max-files` flag for failure simulation
- ✅ Auditable: SQLite tracks all processing history (see the query sketch after this list)
- ✅ Production-ready: Error handling, logging, retry logic
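For the audit trail, any SQLite client can read `data/tracker.db` directly; for example (assuming the illustrative `files` table from the sketches above rather than the real schema):

```python
import sqlite3

# Print the processing history recorded by the tracker.
conn = sqlite3.connect("data/tracker.db")
for file_name, status, uploaded_at in conn.execute(
    "SELECT file_name, status, uploaded_at FROM files ORDER BY uploaded_at"
):
    print(f"{file_name:<15} {status:<18} {uploaded_at}")
```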
Real Microsoft Fabric integration validated with production screenshots:
Execution 1 → 30 files uploaded | Execution 2 → 50 files total (no duplicates)
Full Visual Evidence
Daniel Garcia Belman
Data Engineer | Python Developer | Big Data
- Email: danielgb331@outlook.com
- GitHub: @Daniel-jcVv
- LinkedIn: My LinkedIn Profile
This project is licensed under the MIT License - see the LICENSE file for details.
Ora et labora, ahora
Soli Deo gloria
My gratitude to the open-source community for generously sharing their knowledge.