Enterprise-Grade Python Data Lineage Tracking
- Automatic, column-level lineage tracking for all pandas DataFrames
- Enterprise performance: memory-optimized, scalable, and production-ready
- Stunning visualizations: interactive dashboards, HTML, PNG, SVG, and more
- Plug-and-play connectors: MySQL, PostgreSQL, SQLite, and custom sources
- Security & compliance: RBAC, AES-256 encryption, audit trails
- Real-time collaboration: WebSocket server/client for team workflows
- ML/AI pipeline tracking: Full auditability for machine learning steps
- Cloud-native deployment: Docker, Kubernetes, Helm, Terraform
- Quick Start
- Installation
- Core Features
- Usage Guide
- Database Connectors
- Visualization & Reporting
- Performance Monitoring
- Security & Compliance
- ML/AI Pipeline Tracking
- Enterprise Deployment
- Use Cases
- Documentation
- Contributing
- License
Quick Start
pip install datalineagepy
from datalineagepy import LineageTracker, LineageDataFrame
import pandas as pd
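# Build a small example DataFrame
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
tracker = LineageTracker(name="demo")
# Wrap the DataFrame so operations on it are recorded by the tracker
ldf = LineageDataFrame(df, name="my_df", tracker=tracker)
# Each operation returns a new LineageDataFrame
ldf2 = ldf.filter(ldf._df['a'] > 1)
ldf3 = ldf2.assign(c=ldf2._df['a'] + ldf2._df['b'])
tracker.visualize()  # Interactive HTML dashboard
tracker.export_lineage("lineage.json")  # Save the lineage to JSON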
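To make column-level lineage concrete, here is a minimal plain-Python sketch of the bookkeeping involved when `c` is derived from `a` and `b`, as in the quick start above. It is not DataLineagePy's implementation: `ColumnLineage`, `record`, and `upstream` are hypothetical names used only for illustration.

```python
from collections import defaultdict


class ColumnLineage:
    """Toy column-level lineage store (hypothetical; not the DataLineagePy API)."""

    def __init__(self):
        # Maps each derived column to the set of columns it was computed from.
        self.parents = defaultdict(set)

    def record(self, output, inputs):
        """Register that `output` was derived from the columns in `inputs`."""
        self.parents[output].update(inputs)

    def upstream(self, column):
        """Return every column that `column` depends on, directly or indirectly."""
        seen = set()
        stack = [column]
        while stack:
            current = stack.pop()
            for parent in self.parents.get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


lineage = ColumnLineage()
lineage.record("c", ["a", "b"])  # c = a + b
lineage.record("d", ["c"])       # d is derived from c
print(sorted(lineage.upstream("d")))  # ['a', 'b', 'c']
```

Walking the same mapping in the opposite direction would give downstream impact analysis.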
Installation
- PyPI:
pip install datalineagepy
- With visualization:
pip install datalineagepy[viz]
- All features:
pip install datalineagepy[all]
- Conda (coming soon):
conda install -c conda-forge datalineagepy
- Docker:
docker pull datalineagepy/datalineagepy:latest
See Installation Guide for advanced and enterprise setup.
Core Features
- Automatic lineage tracking for pandas DataFrames
- Data validation: completeness, uniqueness, range, custom rules
- Profiling & analytics: quality scoring, missing data, correlations
- Visualization: HTML, PNG, SVG, interactive dashboards
- Performance monitoring: execution time, memory, alerts
- Security: RBAC, AES-256 encryption, audit trail
- Custom connectors: SDK for any data source
- Versioning: save, diff, rollback lineage graphs
- Collaboration: real-time editing/viewing
- ML/AI pipeline tracking: AutoMLTracker for full auditability
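The versioning feature (save, diff, rollback) can be pictured as comparing two snapshots of a lineage graph. The sketch below assumes a graph is simply a collection of `(source, target)` edges. `diff_lineage` is a hypothetical helper, not a function from the library.

```python
def diff_lineage(old_edges, new_edges):
    """Compare two lineage graphs given as collections of (source, target) edges."""
    old, new = set(old_edges), set(new_edges)
    return {
        "added": sorted(new - old),
        "removed": sorted(old - new),
    }


# Two versions of the same pipeline: v2 inserts an "enriched" step.
v1 = [("input", "filtered"), ("filtered", "report")]
v2 = [("input", "filtered"), ("filtered", "enriched"), ("enriched", "report")]

changes = diff_lineage(v1, v2)
print(changes["added"])    # [('enriched', 'report'), ('filtered', 'enriched')]
print(changes["removed"])  # [('filtered', 'report')]
```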
Usage Guide
from datalineagepy import LineageTracker, LineageDataFrame
import pandas as pd
tracker = LineageTracker(name="my_pipeline")
df = pd.DataFrame({'x': [1,2,3], 'y': [4,5,6]})
ldf = LineageDataFrame(df, name="input", tracker=tracker)
ldf2 = ldf.assign(z=ldf._df['x'] + ldf._df['y'])
print(tracker.export_graph())
# Data validation
from datalineagepy.core.validation import DataValidator
validator = DataValidator(tracker)
rules = {'completeness': {'threshold': 0.9}, 'uniqueness': {'columns': ['x']}}
results = validator.validate_dataframe(ldf, rules)
print(results)
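As a rough illustration of what rules like the completeness threshold and uniqueness check above might measure, here is a dependency-free sketch. How `DataValidator` actually computes these scores is not shown in this document, so the definitions below (share of non-missing cells, no repeated values) are assumptions, and `completeness` and `is_unique` are hypothetical helpers.

```python
def completeness(rows, columns):
    """Fraction of non-missing cells across the given columns."""
    total = len(rows) * len(columns)
    if total == 0:
        return 1.0
    present = sum(1 for row in rows for col in columns if row.get(col) is not None)
    return present / total


def is_unique(rows, column):
    """True when no value in `column` appears more than once."""
    values = [row.get(column) for row in rows]
    return len(values) == len(set(values))


rows = [{"x": 1, "y": 4}, {"x": 2, "y": None}, {"x": 3, "y": 6}]
score = completeness(rows, ["x", "y"])
print(score >= 0.9)          # False: only 5 of 6 cells are present
print(is_unique(rows, "x"))  # True
```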
# Profiling and analytics
from datalineagepy.core.analytics import DataProfiler
profiler = DataProfiler(tracker)
profile = profiler.profile_dataset(ldf, include_correlations=True)
print(profile)
# Visualization
from datalineagepy.visualization.graph_visualizer import GraphVisualizer
visualizer = GraphVisualizer(tracker)
visualizer.generate_html("lineage.html")
visualizer.generate_png("lineage.png")
# Performance monitoring
from datalineagepy.core.performance import PerformanceMonitor
monitor = PerformanceMonitor(tracker)
monitor.start_monitoring()
_ = ldf._df.sum()
monitor.stop_monitoring()
print(monitor.get_performance_summary())
# Role-based access control
from datalineagepy.security.rbac import RBACManager
rbac = RBACManager()
rbac.add_role('admin', ['read', 'write'])
rbac.add_user('alice', ['admin'])
print(rbac.check_access('alice', 'write'))
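# Encryption
import os
from datalineagepy.security.encryption.data_encryption import EncryptionManager
# Demo key only: in production, load the key from a secret store instead of hard-coding it
os.environ['MASTER_ENCRYPTION_KEY'] = 'supersecretkey1234567890123456'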
enc_mgr = EncryptionManager()
secret = 'Sensitive Data'
encrypted = enc_mgr.encrypt_sensitive_data(secret)
decrypted = enc_mgr.decrypt_sensitive_data(encrypted)
print(decrypted)
# Database connector (MySQL)
from datalineagepy.connectors.database.mysql_connector import MySQLConnector
from datalineagepy.core import LineageTracker
db_config = {'host': 'localhost', 'user': 'root', 'password': 'password', 'database': 'test_db'}
tracker = LineageTracker()
conn = MySQLConnector(**db_config, lineage_tracker=tracker)
conn.execute_query('SELECT * FROM test_table')
conn.close()
# ML/AI pipeline tracking
from datalineagepy import AutoMLTracker
tracker = AutoMLTracker(name='ml_pipeline')
tracker.log_step('fit', model='LogisticRegression', params={'solver': 'lbfgs'})
tracker.log_step('predict', model='LogisticRegression')
print(tracker.export_ai_ready_format())
Visualization & Reporting
- Interactive HTML dashboards:
tracker.visualize()
- Export formats: JSON, DOT, PNG, SVG, Excel, CSV
- Custom visualizations: Use GraphVisualizer for advanced needs
Database Connectors
- MySQL, PostgreSQL, SQLite: Full lineage tracking for every query
- Custom connectors: Build your own with the SDK
- See Database Connectors Guide
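The connector SDK's base classes are not described here, so the following is only a sketch of the general idea behind a query-tracking connector: wrap a database connection and record every statement that passes through it. It uses the standard-library `sqlite3` module. `RecordingSQLiteConnector` and `query_log` are hypothetical names, not part of DataLineagePy.

```python
import sqlite3
from datetime import datetime, timezone


class RecordingSQLiteConnector:
    """Hypothetical connector that logs each query it runs (not the SDK's base class)."""

    def __init__(self, path=":memory:"):
        self.connection = sqlite3.connect(path)
        self.query_log = []  # one dict per executed statement

    def execute_query(self, sql, params=()):
        rows = self.connection.execute(sql, params).fetchall()
        # Keep enough metadata to reconstruct what was run and when.
        self.query_log.append({
            "sql": sql,
            "row_count": len(rows),
            "executed_at": datetime.now(timezone.utc).isoformat(),
        })
        return rows

    def close(self):
        self.connection.close()


conn = RecordingSQLiteConnector()
conn.execute_query("CREATE TABLE test_table (id INTEGER, name TEXT)")
conn.execute_query("INSERT INTO test_table VALUES (?, ?)", (1, "alice"))
rows = conn.execute_query("SELECT * FROM test_table")
print(rows)                 # [(1, 'alice')]
print(len(conn.query_log))  # 3
conn.close()
```

A real connector would forward these records to a lineage tracker rather than keep them in a local list.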
Performance Monitoring
- Track execution time, memory, and operation stats
- Alerting: Slack, Email, custom hooks
- Production monitoring: Integrate with Prometheus, Grafana, etc.
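For readers who want to see what tracking execution time and memory involves, the sketch below measures both with only the standard library (`time.perf_counter` and `tracemalloc`). It is independent of `PerformanceMonitor`, whose internals are not documented here. `measure` is a hypothetical helper.

```python
import time
import tracemalloc
from contextlib import contextmanager


@contextmanager
def measure(stats, label):
    """Record wall-clock time and peak traced memory for the enclosed block."""
    tracemalloc.start()
    started = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - started
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        stats[label] = {"seconds": elapsed, "peak_bytes": peak}


stats = {}
with measure(stats, "sum_squares"):
    total = sum(i * i for i in range(100_000))

print(stats["sum_squares"])
```

The resulting dictionary could then be handed to an alerting hook or metrics exporter.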
Security & Compliance
- RBAC: Role-based access control for users and actions
- AES-256 encryption: At-rest and in-transit data protection
- Audit trail: Full operation history for compliance
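The role-based model can be summarized in a few lines: roles grant permissions, users hold roles, and an action is allowed if any of the user's roles grants it. The sketch below mirrors the method names used in the usage guide (`add_role`, `add_user`, `check_access`), but it is an illustrative stand-in. `SimpleRBAC` is hypothetical and is not the library's `RBACManager`.

```python
class SimpleRBAC:
    """Toy role-based access control (hypothetical; not datalineagepy.security.rbac)."""

    def __init__(self):
        self.role_permissions = {}  # role name -> set of permissions
        self.user_roles = {}        # user name -> set of role names

    def add_role(self, role, permissions):
        self.role_permissions[role] = set(permissions)

    def add_user(self, user, roles):
        self.user_roles[user] = set(roles)

    def check_access(self, user, permission):
        """Allow the action if any of the user's roles grants the permission."""
        return any(
            permission in self.role_permissions.get(role, set())
            for role in self.user_roles.get(user, set())
        )


rbac = SimpleRBAC()
rbac.add_role("admin", ["read", "write"])
rbac.add_role("viewer", ["read"])
rbac.add_user("alice", ["admin"])
rbac.add_user("bob", ["viewer"])
print(rbac.check_access("alice", "write"))  # True
print(rbac.check_access("bob", "write"))    # False
```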
ML/AI Pipeline Tracking
- AutoMLTracker: Log, audit, and export every ML pipeline step
- Explainability: Export pipeline steps for downstream analysis
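One way to picture logging and exporting pipeline steps is an ordered list of step records serialized to JSON. The sketch below assumes exactly that and nothing more. `StepLog` is a hypothetical class, and the export format of `AutoMLTracker` may differ.

```python
import json


class StepLog:
    """Hypothetical pipeline log; AutoMLTracker's internals may differ."""

    def __init__(self, name):
        self.name = name
        self.steps = []

    def log_step(self, action, **details):
        # Number steps in the order they were logged so the sequence can be replayed.
        self.steps.append({"index": len(self.steps), "action": action, **details})

    def export(self):
        """Serialize the whole pipeline to a JSON string."""
        return json.dumps({"pipeline": self.name, "steps": self.steps}, indent=2)


log = StepLog("ml_pipeline")
log.log_step("fit", model="LogisticRegression", params={"solver": "lbfgs"})
log.log_step("predict", model="LogisticRegression")
print(log.export())
```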
Enterprise Deployment
- Docker, Kubernetes, Helm, Terraform: Cloud-native ready
- Production scripts: See deploy/ for examples
Use Cases
- Data science: Reproducibility, experiment tracking, Jupyter integration
- Enterprise ETL: Production pipelines, data quality, compliance
- Data governance: Impact analysis, documentation, audit trails
- ML/AI: Pipeline explainability, model audit, feature tracking
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
MIT License. See LICENSE for details.
DataLineagePy 3.0: The new standard for Python data lineage
Beautiful. Powerful. Effortless.