Idempotent ETL pipeline preventing data reprocessing during large-scale migrations
Idempotence is the property that running the same pipeline multiple times with the same input data produces exactly the same result as running it once.
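As a minimal, self-contained illustration of the property (not code from this repository), an `INSERT OR REPLACE` keyed on the file name is idempotent, whereas a blind `INSERT` would add a duplicate row on every run:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (file_name TEXT PRIMARY KEY, status TEXT)")

def mark_loaded(file_name: str) -> None:
    # Idempotent: keyed on file_name, so re-running leaves one row instead of piling up duplicates.
    conn.execute(
        "INSERT OR REPLACE INTO files (file_name, status) VALUES (?, 'LOADED')",
        (file_name,),
    )
    conn.commit()

mark_loaded("file_1.csv")
mark_loaded("file_1.csv")  # second run: identical final state
assert conn.execute("SELECT COUNT(*) FROM files").fetchone()[0] == 1
```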
Traditional ETL pipelines lack state persistence. When failures occur during large-scale migrations (50+ files), the system cannot track which files were already processed, forcing a complete re-run.
Impact: duplicate records, wasted time (~45 min to recover), unnecessary costs.
┌────────────────────────────────────────────────────────┐
│  SFTP Source: 50 CSV Files                             │
└─────────────┬──────────────────────────────────────────┘
              │
              │  Execution 1
              ▼
┌──────────────────────┐
│  Processing...       │    ❌ NO METADATA
│  ✅ Files 1-29       │       (no memory of progress)
│  ❌ FAILURE at 30    │
└──────────────────────┘
              │
              │  Execution 2 (RE-RUN)
              ▼
┌──────────────────────┐    ❌ Cannot query progress
│  ⚠️  Reprocesses     │    ❌ No file status tracking
│      ALL 50 files!   │
│                      │    Result: reprocess everything
│  ❌ Duplicates       │
│  ❌ Wasted 45 min    │
└──────────────────────┘
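A hedged sketch of the stateless behaviour described above (illustrative only; `list_remote_files` and `upload_to_fabric` are stand-ins, not this repository's functions):

```python
from typing import Iterable

def list_remote_files() -> Iterable[str]:
    # Stand-in for an SFTP directory listing (the real pipeline uses Paramiko).
    return [f"file_{i}.csv" for i in range(1, 51)]

def upload_to_fabric(file_name: str) -> None:
    # Stand-in for the OneLake upload; fails mid-migration to mimic the scenario above.
    if file_name == "file_30.csv":
        raise RuntimeError("network error during upload")

def run_naive_pipeline() -> None:
    """Non-idempotent: nothing records which files already succeeded."""
    for file_name in list_remote_files():  # always iterates over all 50 files
        upload_to_fabric(file_name)        # a crash here throws away all earlier progress

# Every retry starts again at file_1.csv, re-uploading the 29 files that already succeeded.
```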
Idempotent pipeline with a SQLite-based FileTracker that maintains state across executions.
Key change: metadata tracking - store each file's processing state (file name, status, timestamp) in a SQLite database.
How it works (see the sketch after this list):
- Before processing: query metadata → get the list of completed files
- During processing: download + upload → mark status = `LOADED_TO_FABRIC` in metadata
- On retry: query metadata → skip completed → process only the remaining files
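A minimal sketch of that idea, assuming a single `files` table keyed on the file name (the table layout, column names, and method names here are illustrative; the real implementation lives in `src/utils/file_tracker.py`):

```python
import sqlite3
from datetime import datetime, timezone
from typing import Set

class FileTracker:
    """Persist per-file processing state so retries can skip completed files."""

    def __init__(self, db_path: str = "data/tracker.db") -> None:
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            """CREATE TABLE IF NOT EXISTS files (
                   file_name   TEXT PRIMARY KEY,
                   status      TEXT NOT NULL DEFAULT 'PENDING',
                   uploaded_at TEXT
               )"""
        )
        self.conn.commit()

    def completed_files(self) -> Set[str]:
        rows = self.conn.execute(
            "SELECT file_name FROM files WHERE status = 'LOADED_TO_FABRIC'"
        )
        return {row[0] for row in rows}

    def mark_loaded(self, file_name: str) -> None:
        # INSERT OR REPLACE keeps exactly one row per file, so marking twice is harmless.
        self.conn.execute(
            "INSERT OR REPLACE INTO files (file_name, status, uploaded_at) "
            "VALUES (?, 'LOADED_TO_FABRIC', ?)",
            (file_name, datetime.now(timezone.utc).isoformat()),
        )
        self.conn.commit()
```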
┌────────────────────────────────────────────────────────┐
│  SFTP Source: 50 CSV Files                             │
└─────────────┬──────────────────────────────────────────┘
              │
              │  Step 1: Query metadata
              ▼
┌──────────────────────┐     ┌───────────────────────────────┐
│  Query FileTracker   │ ◄── │  METADATA DATABASE (SQLite)   │
│  Get pending files   │     │  ┌─────────────────────────┐  │
└──────────┬───────────┘     │  │ file_name   | status    │  │
           │                 │  │ file_1.csv  | LOADED  ✅ │  │
           │                 │  │ file_2.csv  | LOADED  ✅ │  │
           │                 │  │ ...                     │  │
           │                 │  │ file_30.csv | LOADED  ✅ │  │
           │                 │  │ file_31.csv | PENDING ⏳ │  │
           │                 │  └─────────────────────────┘  │
           │                 └───────────────────────────────┘
           │  Step 2: Filter (skip completed)
           ▼
┌──────────────────────┐
│  Smart Filter        │     Queries metadata:
│  Already done: 30    │     "SELECT * WHERE status != 'LOADED'"
│  To process: 20      │     ONLY NEW FILES
└──────────┬───────────┘
           │
           │  Step 3: Process + update metadata
           ▼
┌──────────────────────┐     ┌───────────────────────────────┐
│  FOR EACH file:      │     │  UPDATE metadata:             │
│  1. Download         │ ──► │  SET status='LOADED'          │
│  2. Upload to Fabric │     │  SET uploaded_at=NOW()        │
│  3. Mark complete    │     │  WHERE file_name='file_31'    │
└──────────────────────┘     └───────────────────────────────┘
           │
           ▼
┌──────────────────────┐
│  ✅ Success          │     Result: metadata enables
│  ✅ No duplicates    │     idempotent retries
│  ✅ Recovery: 5 min  │
└──────────────────────┘
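Putting the three steps together: a sketch of the overall loop, reusing the illustrative `FileTracker` above and taking the SFTP/Fabric operations as injected callables (again an assumption for illustration, not the actual code in `run_pipeline.py`):

```python
from typing import Callable, Iterable, Optional

def run_pipeline(
    tracker: "FileTracker",                       # the illustrative tracker sketched earlier
    list_remote_files: Callable[[], Iterable[str]],
    download: Callable[[str], str],               # remote file name -> local staging path
    upload_to_fabric: Callable[[str], None],      # local path -> OneLake
    max_files: Optional[int] = None,              # mirrors the --max-files failure simulation
) -> None:
    done = tracker.completed_files()                              # Step 1: query metadata
    pending = [f for f in list_remote_files() if f not in done]   # Step 2: skip completed
    for i, file_name in enumerate(pending):                       # Step 3: process + update
        if max_files is not None and i >= max_files:
            break
        local_path = download(file_name)
        upload_to_fabric(local_path)
        tracker.mark_loaded(file_name)   # marked only after a successful upload
```

Because a file is only marked after its upload succeeds, a crash mid-loop leaves the metadata accurate, and the next run processes only what remains.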
| Metric | Traditional | FileTracker | Improvement |
|---|---|---|---|
| Recovery time | ~45 min | ~5 min | ⚡ 90% faster |
| Duplicates | High risk | Zero | 🛡️ Eliminated |
| Safe retries | ❌ No | ✅ Yes | Idempotent |
Full Problem Definition with Diagrams
Core Technologies:
- Python 3.8+: Orchestration
- Paramiko: SFTP client
- SQLite: State management
- Microsoft Fabric: Cloud data platform (OneLake REST API)
Architecture Pattern: Idempotent design with persistent state tracking
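As background on the Fabric piece: OneLake exposes an ADLS Gen2-compatible endpoint, so one common way to upload files is via the Azure SDK (`azure-identity` + `azure-storage-file-datalake`), roughly as below. This is an assumption for illustration; the project's `fabric_client.py` may call the REST endpoints directly instead.

```python
from azure.identity import ClientSecretCredential
from azure.storage.filedatalake import DataLakeServiceClient

def upload_csv_to_onelake(local_path: str, file_name: str,
                          tenant_id: str, client_id: str, client_secret: str,
                          workspace_id: str, lakehouse_id: str) -> None:
    """Upload one CSV into a lakehouse's Files area through OneLake's ADLS Gen2-compatible API."""
    credential = ClientSecretCredential(tenant_id, client_id, client_secret)
    service = DataLakeServiceClient(
        account_url="https://onelake.dfs.fabric.microsoft.com", credential=credential
    )
    workspace = service.get_file_system_client(workspace_id)
    file_client = workspace.get_file_client(f"{lakehouse_id}/Files/{file_name}")
    with open(local_path, "rb") as data:
        # overwrite=True makes the upload idempotent at the destination as well.
        file_client.upload_data(data, overwrite=True)
```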
fabric-data-migration/
├── src/
│   ├── ingestion.py                    # SFTP download logic
│   ├── fabric_client.py                # Microsoft Fabric upload (REST API)
│   └── utils/
│       └── file_tracker.py             # SQLite state management
├── scripts/
│   ├── simulate_partial_failure.py     # Automated demo
│   ├── demo_idempotency.py
│   └── manage_tracker.py
├── docs/
│   └── problem_definition.md           # Manual demo
├── data/
│   ├── staging/                        # Temporary: downloaded CSVs
│   └── tracker.db                      # Persistent: file processing state
├── run_pipeline.py                     # Main entry point
├── config.py                           # Environment configuration
└── .env.example                        # Configuration template
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements.txt

Create a `.env` file (use `.env.example` as a template):
# SFTP Configuration
SFTP_HOST=your-server.com
SFTP_USERNAME=your_user
SFTP_PASSWORD=your_password
SFTP_SERVER_PATH=data/raw
# Microsoft Fabric (get from Azure Portal)
FABRIC_WORKSPACE_ID=your-workspace-guid
FABRIC_LAKEHOUSE_ID=your-lakehouse-guid
AZURE_TENANT_ID=your-tenant-guid
AZURE_CLIENT_ID=your-client-guid
AZURE_CLIENT_SECRET=your-secret

# Full migration
python run_pipeline.py
# Test idempotency (simulate failure at file #31)
python run_pipeline.py --max-files 30
python run_pipeline.py                  # Processes only the remaining 20 files

Idempotency Demo & Testing
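The repository ships `scripts/demo_idempotency.py` and `scripts/simulate_partial_failure.py` for this. As a rough illustration of the kind of check involved (assuming the illustrative `files` table from the sketches above, not the scripts' actual code):

```python
import sqlite3

def assert_idempotent(db_path: str = "data/tracker.db") -> None:
    """Run after two consecutive executions: the tracker must hold exactly
    one row per file, i.e. the second run added no duplicates."""
    conn = sqlite3.connect(db_path)
    total = conn.execute("SELECT COUNT(*) FROM files").fetchone()[0]
    distinct = conn.execute("SELECT COUNT(DISTINCT file_name) FROM files").fetchone()[0]
    loaded = conn.execute(
        "SELECT COUNT(*) FROM files WHERE status = 'LOADED_TO_FABRIC'"
    ).fetchone()[0]
    assert total == distinct, "duplicate rows found - the pipeline is not idempotent"
    print(f"{loaded}/{total} files loaded, no duplicates")
```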
- ✅ Idempotent: Safe to re-run without duplicates
- ✅ Resumable: Continues from the last successful file
- ✅ Testable: `--max-files` flag for failure simulation
- ✅ Auditable: SQLite tracks all processing history (see the query sketch after this list)
- ✅ Production-ready: Error handling, logging, retry logic
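For the audit trail, any SQLite client can read `data/tracker.db` directly; for example (assuming the illustrative `files` table from the sketches above rather than the real schema):

```python
import sqlite3

# Print the processing history recorded by the tracker.
conn = sqlite3.connect("data/tracker.db")
for file_name, status, uploaded_at in conn.execute(
    "SELECT file_name, status, uploaded_at FROM files ORDER BY uploaded_at"
):
    print(f"{file_name:<15} {status:<18} {uploaded_at}")
```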
Real Microsoft Fabric integration validated with production screenshots:
Execution 1 → 30 files uploaded | Execution 2 → 50 files total (no duplicates)
Full Visual Evidence
Daniel Garcia Belman
Data Engineer | Python Developer | Big Data
- Email: danielgb331@outlook.com
- GitHub: @Daniel-jcVv
- LinkedIn: My LinkedIn Profile
This project is licensed under the MIT License - see the LICENSE file for details.
Ora et labora, ahora
Soli Deo gloria
My gratitude to the open-source community for generously sharing their knowledge.