🔄 Data Migration Pipeline to Microsoft Fabric

Python ETL Paramiko SQLite Microsoft Fabric Idempotent License: MIT

Idempotent ETL pipeline preventing data reprocessing during large-scale migrations

Idempotence is the property that guarantees running the same pipeline multiple times with the same input data produces exactly the same result as running it once.


🎯 Current Condition (The Problem)

Traditional ETL pipelines lack state persistence. When failures occur during large-scale migrations (50+ files), the system cannot track which files were already processed, forcing a complete re-run.

Impact: duplicate records, wasted time (~45 min recovery), and unnecessary costs.

Before: Traditional Pipeline (No Metadata)

┌─────────────────────────────────────────────────────┐
│  SFTP Source: 50 CSV Files                          │
└────────────┬────────────────────────────────────────┘
             │
             │ Execution 1
             ▼
    ┌────────────────────┐      ❌ NO METADATA
    │ Processing...      │         (No memory)
    │ ✓ File 1-29        │
    │ ❌ FAILURE at 30   │
    └────────────────────┘
             │
             │ Execution 2 (RE-RUN)
             ▼
    ┌────────────────────┐      ❌ Cannot query progress
    │ ⚠️  Reprocesses    │      ❌ No file status tracking
    │    ALL 50 files!   │
    │                    │      Result: Reprocess everything
    │ ❌ Duplicates      │
    │ ❌ Wasted 45min    │
    └────────────────────┘

✅ Solution (Proposed Condition)

Idempotent pipeline with SQLite-based FileTracker that maintains state across executions.

Key change: metadata tracking. The processing state of each file (file name, status, timestamp) is stored in a SQLite database.
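As a rough sketch, the metadata table could be created like this (the real schema lives in src/utils/file_tracker.py; the table and column names below are assumptions based on the diagrams in this README):

# Illustrative only: table/column names are assumptions, not the project's exact schema.
import sqlite3

conn = sqlite3.connect("data/tracker.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS processed_files (
        file_name   TEXT PRIMARY KEY,   -- e.g. 'file_31.csv'
        status      TEXT NOT NULL,      -- e.g. 'PENDING' or 'LOADED_TO_FABRIC'
        uploaded_at TIMESTAMP           -- set once the upload to Fabric succeeds
    )
""")
conn.commit()
conn.close()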

How it works:

  1. Before processing: Query metadata → Get list of completed files
  2. During processing: Download + Upload → Mark status = LOADED_TO_FABRIC in metadata
  3. On retry: Query metadata → Skip completed → Process only remaining files
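
A minimal sketch of that flow, reusing the hypothetical processed_files table from the schema sketch above (the download/upload calls are elided; run_pipeline.py implements the real logic):

# Illustrative only: assumes the hypothetical processed_files table shown earlier.
import sqlite3

conn = sqlite3.connect("data/tracker.db")
already_loaded = {row[0] for row in conn.execute(
    "SELECT file_name FROM processed_files WHERE status = 'LOADED_TO_FABRIC'"
)}                                                 # Step 1: query metadata

for file_name in ["file_30.csv", "file_31.csv"]:   # placeholder for the SFTP listing
    if file_name in already_loaded:                # Step 3 on a retry: skip completed files
        continue
    # Step 2 (elided): download the file from SFTP, then upload it to Fabric.
    conn.execute(
        "INSERT INTO processed_files (file_name, status, uploaded_at) "
        "VALUES (?, 'LOADED_TO_FABRIC', CURRENT_TIMESTAMP) "
        "ON CONFLICT(file_name) DO UPDATE SET status = excluded.status, "
        "uploaded_at = excluded.uploaded_at",
        (file_name,),
    )
    conn.commit()                                  # persist state only after success
conn.close()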

After: FileTracker Pipeline (With Metadata)

┌─────────────────────────────────────────────────────┐
│  SFTP Source: 50 CSV Files                          │
└────────────┬────────────────────────────────────────┘
             │
             │ Step 1: Query metadata
             ▼
    ┌────────────────────┐      ┌─────────────────────────────┐
    │ Query FileTracker  │ ←──→ │ ✅ METADATA DATABASE (SQLite)│
    │ Get pending files  │      │ ┌─────────────────────────┐ │
    └────────┬───────────┘      │ │ file_name | status      │ │
             │                  │ │ file_1.csv| LOADED ✅   │ │
             │                  │ │ file_2.csv| LOADED ✅   │ │
             │                  │ │ ...                     │ │
             │                  │ │ file_30.csv| LOADED ✅  │ │
             │                  │ │ file_31.csv| PENDING ⏳ │ │
             │                  │ └─────────────────────────┘ │
             │                  └─────────────────────────────┘
             │ Step 2: Filter (skip completed)
             ▼
    ┌────────────────────┐
    │ Smart Filter       │      ✅ Queries metadata:
    │ Already done: 30   │         "SELECT * WHERE status != 'LOADED'"
    │ To process: 20     │      ✅ ONLY NEW FILES
    └────────┬───────────┘
             │
             │ Step 3: Process + Update metadata
             ▼
    ┌────────────────────┐      ┌─────────────────────────────┐
    │ FOR EACH file:     │      │ UPDATE metadata:            │
    │ 1. Download        │  ──→ │ SET status='LOADED'         │
    │ 2. Upload to Fabric│      │ SET uploaded_at=NOW()       │
    │ 3. Mark complete   │      │ WHERE file_name='file_31'   │
    └────────────────────┘      └─────────────────────────────┘
             │
             ▼
    ┌────────────────────┐
    │ ✅ Success         │      Result: Metadata enables
    │ No duplicates      │              idempotent retries
    │ Recovery: 5min     │
    └────────────────────┘

3. Reason for Change

Metric          Traditional   FileTracker   Improvement
Recovery time   ~45 min       ~5 min        ⚡ 90% faster
Duplicates      High risk     Zero          🛡️ Eliminated
Safe retries    ❌ No         ✅ Yes        Idempotent

📊 Full Problem Definition with Diagrams


πŸ—οΈ Technical Stack

Core Technologies:

  • Python 3.8+: Orchestration
  • Paramiko: SFTP client
  • SQLite: State management
  • Microsoft Fabric: Cloud data platform (OneLake REST API)

Architecture Pattern: Idempotent design with persistent state tracking
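
For reference, a minimal Paramiko download loop might look like the sketch below (hostnames, credentials, and paths are placeholders; src/ingestion.py may structure this differently):

# Minimal SFTP download sketch using Paramiko; not the project's exact code.
import paramiko

ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect("your-server.com", username="your_user", password="your_password")

sftp = ssh.open_sftp()
for name in sftp.listdir("data/raw"):                     # remote SFTP_SERVER_PATH
    sftp.get(f"data/raw/{name}", f"data/staging/{name}")  # download into local staging
sftp.close()
ssh.close()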


πŸ“ Project Structure

fabric-data-migration/
├── src/
│   ├── ingestion.py          # SFTP download logic
│   ├── fabric_client.py      # Microsoft Fabric upload (REST API)
│   └── utils/
│       └── file_tracker.py   # SQLite state management
├── scripts/
│   ├── simulate_partial_failure.py  # Automated demo
│   ├── demo_idempotency.py          # Manual demo
│   └── manage_tracker.py
├── docs/
│   └── problem_definition.md
├── data/
│   ├── staging/              # Temporary: downloaded CSVs
│   └── tracker.db            # Persistent: file processing state
├── run_pipeline.py           # Main entry point
├── config.py                 # Environment configuration
└── .env.example              # Configuration template

🚀 Quick Start

1. Install Dependencies

python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt

2. Configure Environment

Create a .env file (use .env.example as a template):

# SFTP Configuration
SFTP_HOST=your-server.com
SFTP_USERNAME=your_user
SFTP_PASSWORD=your_password
SFTP_SERVER_PATH=data/raw

# Microsoft Fabric (get from Azure Portal)
FABRIC_WORKSPACE_ID=your-workspace-guid
FABRIC_LAKEHOUSE_ID=your-lakehouse-guid
AZURE_TENANT_ID=your-tenant-guid
AZURE_CLIENT_ID=your-client-guid
AZURE_CLIENT_SECRET=your-secret
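
config.py presumably loads these values at startup; a minimal sketch, assuming python-dotenv (the project may read the environment differently):

# Sketch of environment loading; not necessarily how config.py does it.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env into os.environ

SFTP_HOST = os.environ["SFTP_HOST"]
SFTP_USERNAME = os.environ["SFTP_USERNAME"]
SFTP_PASSWORD = os.environ["SFTP_PASSWORD"]
FABRIC_WORKSPACE_ID = os.environ["FABRIC_WORKSPACE_ID"]
FABRIC_LAKEHOUSE_ID = os.environ["FABRIC_LAKEHOUSE_ID"]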

3. Run Pipeline

# Full migration
python run_pipeline.py

# Test idempotency (simulate failure at file #31)
python run_pipeline.py --max-files 30
python run_pipeline.py  # Processes only remaining 20 files
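
The --max-files flag simply caps how many pending files a single run processes, which is what makes the failure simulation above possible. A hedged sketch of how such a flag could be wired up (run_pipeline.py may implement it differently):

# Illustrative CLI wiring for the failure simulation; not the project's exact code.
import argparse

parser = argparse.ArgumentParser(description="SFTP -> Microsoft Fabric migration")
parser.add_argument("--max-files", type=int, default=None,
                    help="stop after processing N files (simulates a partial failure)")
args = parser.parse_args()

pending_files = ["file_31.csv", "file_32.csv"]  # placeholder for files not yet LOADED
if args.max_files is not None:
    pending_files = pending_files[:args.max_files]
print(f"Processing {len(pending_files)} file(s) this run")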

📘 Idempotency Demo & Testing


🔑 Key Features

  • ✅ Idempotent: Safe to re-run without duplicates
  • ✅ Resumable: Continues from the last successful file
  • ✅ Testable: --max-files flag for failure simulation
  • ✅ Auditable: SQLite tracks all processing history (see the query sketch below)
  • ✅ Production-ready: Error handling, logging, retry logic
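
Because all state lives in data/tracker.db, the processing history can be inspected with plain SQL. A small sketch (table and column names are assumptions taken from the diagrams above; manage_tracker.py presumably offers something similar):

# Sketch: inspect processing history in the tracker database.
import sqlite3

conn = sqlite3.connect("data/tracker.db")
rows = conn.execute(
    "SELECT file_name, status, uploaded_at FROM processed_files ORDER BY uploaded_at"
).fetchall()
for file_name, status, uploaded_at in rows:
    print(f"{file_name:<20} {status:<18} {uploaded_at}")
conn.close()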

📸 Visual Evidence

Real Microsoft Fabric integration validated with production screenshots:

Execution 1 → 30 files uploaded | Execution 2 → 50 files total (no duplicates)

Idempotency validation

📊 Full Visual Evidence


👤 Author

Daniel Garcia Belman
Data Engineer | Python Developer | Big Data


📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


πŸ™ Acknowledgments

Ora et labora, ahora

Soli Deo gloria

My gratitude to the open-source community for generously sharing their knowledge.
