86% faster data lineage tracking for pandas DataFrames with zero infrastructure. Real-time monitoring, ML anomaly detection, and enterprise compliance features.

Arbaznazir/DataLineagePy

🚀 DataLineagePy 3.0

Enterprise-Grade Python Data Lineage Tracking

Python 3.8+ | License: MIT | Production Ready | Performance Score | Enterprise Grade


Beautiful, Powerful, and Effortless Data Lineage for Python

Track, visualize, and govern your data pipelines with zero friction.


🌟 Why DataLineagePy?

  • Automatic, column-level lineage tracking for all pandas DataFrames
  • Enterprise performance: memory-optimized, scalable, and production-ready
  • Stunning visualizations: interactive dashboards, HTML, PNG, SVG, and more
  • Plug-and-play connectors: MySQL, PostgreSQL, SQLite, and custom sources
  • Security & compliance: RBAC, AES-256 encryption, audit trails
  • Real-time collaboration: WebSocket server/client for team workflows
  • ML/AI pipeline tracking: Full auditability for machine learning steps
  • Cloud-native deployment: Docker, Kubernetes, Helm, Terraform

🚀 Quick Start

pip install datalineagepy

from datalineagepy import LineageTracker, LineageDataFrame
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
tracker = LineageTracker(name="demo")

# Wrap the DataFrame so that operations on it are recorded by the tracker
ldf = LineageDataFrame(df, name="my_df", tracker=tracker)
ldf2 = ldf.filter(ldf._df['a'] > 1)                   # keep rows where a > 1
ldf3 = ldf2.assign(c=ldf2._df['a'] + ldf2._df['b'])   # derive column c from a and b

tracker.visualize()  # Interactive HTML dashboard
tracker.export_lineage("lineage.json")

💾 Installation

  • PyPI: pip install datalineagepy
  • With visualization: pip install datalineagepy[viz]
  • All features: pip install datalineagepy[all]
  • Conda: conda install -c conda-forge datalineagepy (coming soon)
  • Docker: docker pull datalineagepy/datalineagepy:latest

See Installation Guide for advanced and enterprise setup.


📚 Core Features

  • Automatic lineage tracking for pandas DataFrames
  • Data validation: completeness, uniqueness, range, custom rules
  • Profiling & analytics: quality scoring, missing data, correlations
  • Visualization: HTML, PNG, SVG, interactive dashboards
  • Performance monitoring: execution time, memory, alerts
  • Security: RBAC, AES-256 encryption, audit trail
  • Custom connectors: SDK for any data source
  • Versioning: save, diff, rollback lineage graphs
  • Collaboration: real-time editing/viewing
  • ML/AI pipeline tracking: AutoMLTracker for full auditability
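
The versioning bullet above (save, diff, rollback) can be pictured with a toy example. This is only a sketch of what diffing two lineage graphs might mean, assuming a graph is stored as a list of (source, target) edges; `diff_edges` and the edge names are hypothetical and are not part of DataLineagePy's API.

```python
def diff_edges(old_edges, new_edges):
    """Compare two lineage graphs given as (source, target) edge lists."""
    old, new = set(old_edges), set(new_edges)
    return {"added": sorted(new - old), "removed": sorted(old - new)}


# Two hypothetical versions of the same pipeline's lineage
v1 = [("input.x", "input.z"), ("input.y", "input.z")]
v2 = [("input.x", "input.z"), ("input.z", "report.total")]

changes = diff_edges(v1, v2)
print(changes)
```

Storing edges as a set makes the diff two set differences, which is enough to show which dependencies appeared or disappeared between versions.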

🔧 Usage Guide

1. Lineage Tracking

from datalineagepy import LineageTracker, LineageDataFrame
import pandas as pd

tracker = LineageTracker(name="my_pipeline")
df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})
ldf = LineageDataFrame(df, name="input", tracker=tracker)
ldf2 = ldf.assign(z=ldf._df['x'] + ldf._df['y'])  # z is derived from x and y
print(tracker.export_graph())
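
To make "column-level lineage" concrete, here is a toy sketch that is independent of DataLineagePy. It assumes lineage can be modeled as a mapping from each derived column to the columns it was computed from; the `ToyLineage` class and the column names are made up for illustration and say nothing about how the library stores its graph.

```python
from collections import defaultdict


class ToyLineage:
    """Records which source columns each derived column was computed from."""

    def __init__(self):
        self.parents = defaultdict(set)  # derived column -> direct source columns

    def record(self, derived, sources):
        self.parents[derived].update(sources)

    def upstream(self, column):
        """Return every column that `column` depends on, directly or indirectly."""
        seen, stack = set(), [column]
        while stack:
            for parent in self.parents.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


lineage = ToyLineage()
lineage.record("input.z", ["input.x", "input.y"])  # z = x + y
lineage.record("report.total", ["input.z"])        # total is derived from z

upstream_of_total = lineage.upstream("report.total")
print(sorted(upstream_of_total))
```

Walking the parent links answers the provenance question "which inputs fed this column?", which is the kind of query a lineage graph exists to serve.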

2. Data Validation

from datalineagepy.core.validation import DataValidator

# Reuses `tracker` and `ldf` from step 1
validator = DataValidator(tracker)
rules = {'completeness': {'threshold': 0.9}, 'uniqueness': {'columns': ['x']}}
results = validator.validate_dataframe(ldf, rules)
print(results)
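
For readers unfamiliar with these rule names, the sketch below shows one plausible reading of a completeness score and a uniqueness check on plain Python rows. It is an assumption about what such rules usually compute, not a description of `DataValidator`; the helper names and the report layout are hypothetical.

```python
def completeness(rows, columns):
    """Fraction of non-missing cells across the given columns."""
    total = len(rows) * len(columns)
    if total == 0:
        return 1.0
    present = sum(1 for row in rows for col in columns if row.get(col) is not None)
    return present / total


def is_unique(rows, column):
    """True when no value in `column` appears more than once."""
    values = [row.get(column) for row in rows]
    return len(values) == len(set(values))


rows = [{"x": 1, "y": 4}, {"x": 2, "y": None}, {"x": 3, "y": 6}]
score = completeness(rows, ["x", "y"])  # 5 of the 6 cells are present

report = {
    "completeness": {"score": score, "passed": score >= 0.9},
    "uniqueness": {"columns": ["x"], "passed": is_unique(rows, "x")},
}
print(report)
```

With one missing cell out of six, the score falls below the 0.9 threshold, so the completeness rule fails while the uniqueness rule on `x` passes.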

3. Profiling & Analytics

from datalineagepy.core.analytics import DataProfiler

# Reuses `tracker` and `ldf` from step 1
profiler = DataProfiler(tracker)
profile = profiler.profile_dataset(ldf, include_correlations=True)
print(profile)

4. Visualization & Reporting

from datalineagepy.visualization.graph_visualizer import GraphVisualizer

# Reuses `tracker` from step 1
visualizer = GraphVisualizer(tracker)
visualizer.generate_html("lineage.html")
visualizer.generate_png("lineage.png")

5. Performance Monitoring

from datalineagepy.core.performance import PerformanceMonitor

# Reuses `tracker` and `ldf` from step 1
monitor = PerformanceMonitor(tracker)
monitor.start_monitoring()
_ = ldf._df.sum()  # workload being measured
monitor.stop_monitoring()
print(monitor.get_performance_summary())
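
The start/stop pattern above can be approximated with the standard library alone. The following is a minimal sketch, assuming that "execution time" means wall-clock time and "memory" means peak traced allocations; `measure` is a hypothetical helper and does not reflect how `PerformanceMonitor` is implemented.

```python
import time
import tracemalloc
from contextlib import contextmanager


@contextmanager
def measure(label, results):
    """Record wall-clock time and peak traced memory for the enclosed block."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()  # (current, peak) in bytes
        tracemalloc.stop()
        results[label] = {"seconds": elapsed, "peak_bytes": peak}


results = {}
with measure("sum_squares", results):
    total = sum(i * i for i in range(100_000))

print(results["sum_squares"])
```

A context manager keeps the start and stop calls paired even when the measured block raises, which is easy to get wrong with explicit start/stop calls.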

6. Security & Compliance

from datalineagepy.security.rbac import RBACManager
rbac = RBACManager()
rbac.add_role('admin', ['read', 'write'])
rbac.add_user('alice', ['admin'])
print(rbac.check_access('alice', 'write'))
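
Conceptually, a role-based check resolves a user to roles and roles to permissions. The toy class below mirrors the method names used in the snippet above purely for readability; it is an illustrative sketch, and the real `RBACManager` may behave differently.

```python
class ToyRBAC:
    """Minimal role-based access control: users -> roles -> permissions."""

    def __init__(self):
        self.role_permissions = {}  # role name -> set of permissions
        self.user_roles = {}        # user name -> set of role names

    def add_role(self, role, permissions):
        self.role_permissions[role] = set(permissions)

    def add_user(self, user, roles):
        self.user_roles[user] = set(roles)

    def check_access(self, user, permission):
        """True if any of the user's roles grants the permission."""
        return any(
            permission in self.role_permissions.get(role, set())
            for role in self.user_roles.get(user, set())
        )


toy_rbac = ToyRBAC()
toy_rbac.add_role("admin", ["read", "write"])
toy_rbac.add_role("viewer", ["read"])
toy_rbac.add_user("alice", ["admin"])
toy_rbac.add_user("bob", ["viewer"])

print(toy_rbac.check_access("alice", "write"))
print(toy_rbac.check_access("bob", "write"))
```

Unknown users and unknown roles fall through to "no access", so the check is deny-by-default.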

from datalineagepy.security.encryption.data_encryption import EncryptionManager
import os

# Demo key only: in real deployments, set MASTER_ENCRYPTION_KEY outside the code
os.environ['MASTER_ENCRYPTION_KEY'] = 'supersecretkey1234567890123456'
enc_mgr = EncryptionManager()
secret = 'Sensitive Data'
encrypted = enc_mgr.encrypt_sensitive_data(secret)
decrypted = enc_mgr.decrypt_sensitive_data(encrypted)
print(decrypted)
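
The hard-coded key above is convenient for a demo but should not reach a repository. Below is a minimal sketch of failing fast when the key is missing, assuming only that the key is read from the `MASTER_ENCRYPTION_KEY` environment variable as in the snippet above. The key length and format that `EncryptionManager` expects are not stated here, and `load_master_key` is a hypothetical helper.

```python
import os
import secrets


def load_master_key(env_var="MASTER_ENCRYPTION_KEY"):
    """Read the key from the environment and fail loudly if it is missing."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"{env_var} is not set")
    return key


# Local experiments only: generate a random key if none is configured.
# In production the variable would come from a secret store or deployment config.
os.environ.setdefault("MASTER_ENCRYPTION_KEY", secrets.token_urlsafe(32))

key = load_master_key()
print(len(key) > 0)
```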

7. Database Connectors

from datalineagepy.connectors.database.mysql_connector import MySQLConnector
from datalineagepy.core import LineageTracker
db_config = {'host': 'localhost', 'user': 'root', 'password': 'password', 'database': 'test_db'}
tracker = LineageTracker()
conn = MySQLConnector(**db_config, lineage_tracker=tracker)
conn.execute_query('SELECT * FROM test_table')
conn.close()
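
To illustrate what "lineage tracking for every query" could involve, the sketch below wraps Python's built-in `sqlite3` module and logs each statement it runs. It is a toy under stated assumptions: `TrackedConnection` and its log format are hypothetical and unrelated to the library's actual connectors.

```python
import sqlite3


class TrackedConnection:
    """Wraps a sqlite3 connection and logs every statement it executes."""

    def __init__(self, database=":memory:"):
        self.conn = sqlite3.connect(database)
        self.query_log = []  # one entry per executed statement

    def execute_query(self, sql, params=()):
        cursor = self.conn.execute(sql, params)
        rows = cursor.fetchall()  # empty list for statements that return no rows
        self.query_log.append({"sql": sql, "row_count": len(rows)})
        return rows

    def close(self):
        self.conn.close()


db = TrackedConnection()
db.execute_query("CREATE TABLE test_table (id INTEGER, name TEXT)")
db.execute_query("INSERT INTO test_table VALUES (?, ?)", (1, "alice"))
rows = db.execute_query("SELECT * FROM test_table")
query_log = db.query_log
db.close()

print(rows)
print(len(query_log))
```

Routing every statement through one method gives a single place to record what was read or written, which is the hook a lineage-aware connector needs.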

8. ML/AI Pipeline Tracking

from datalineagepy import AutoMLTracker
tracker = AutoMLTracker(name='ml_pipeline')
tracker.log_step('fit', model='LogisticRegression', params={'solver': 'lbfgs'})
tracker.log_step('predict', model='LogisticRegression')
print(tracker.export_ai_ready_format())
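
The idea of logging pipeline steps for later audit can be sketched as an append-only list serialized to JSON. The output format of `export_ai_ready_format()` is not documented in this README, so the structure below is an assumption chosen for illustration, and the module-level `log_step` function is hypothetical.

```python
import json

steps = []  # append-only record of pipeline steps, in execution order


def log_step(name, **details):
    """Append one step with its position in the pipeline and any extra details."""
    steps.append({"index": len(steps), "step": name, **details})


log_step("fit", model="LogisticRegression", params={"solver": "lbfgs"})
log_step("predict", model="LogisticRegression")

exported = json.dumps(steps, indent=2)
print(exported)
```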

📊 Visualization & Reporting

  • Interactive HTML dashboards: tracker.visualize()
  • Export formats: JSON, DOT, PNG, SVG, Excel, CSV
  • Custom visualizations: Use GraphVisualizer for advanced needs

🗄️ Database Connectors

  • MySQL, PostgreSQL, SQLite: Full lineage tracking for every query
  • Custom connectors: Build your own with the SDK
  • See Database Connectors Guide

⚡ Performance Monitoring

  • Track execution time, memory, and operation stats
  • Alerting: Slack, Email, custom hooks
  • Production monitoring: Integrate with Prometheus, Grafana, etc.

🔒 Security & Compliance

  • RBAC: Role-based access control for users and actions
  • AES-256 encryption: At-rest and in-transit data protection
  • Audit trail: Full operation history for compliance

🤖 ML/AI Pipeline Tracking

  • AutoMLTracker: Log, audit, and export every ML pipeline step
  • Explainability: Export pipeline steps for downstream analysis

☁️ Enterprise Deployment

  • Docker, Kubernetes, Helm, Terraform: Cloud-native ready
  • Production scripts: See deploy/ for examples

💡 Use Cases

  • Data science: Reproducibility, experiment tracking, Jupyter integration
  • Enterprise ETL: Production pipelines, data quality, compliance
  • Data governance: Impact analysis, documentation, audit trails
  • ML/AI: Pipeline explainability, model audit, feature tracking
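
Impact analysis, mentioned under data governance above, asks which downstream assets are affected when something upstream changes. A minimal sketch, assuming lineage is available as (source, target) edges; the edge names and the `downstream` helper are hypothetical and not part of DataLineagePy.

```python
from collections import defaultdict, deque


def downstream(edges, start):
    """Breadth-first search for every node reachable from `start`."""
    children = defaultdict(list)
    for source, target in edges:
        children[source].append(target)

    seen, queue = set(), deque([start])
    while queue:
        for child in children.get(queue.popleft(), ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


edges = [
    ("raw.orders", "clean.orders"),
    ("clean.orders", "report.revenue"),
    ("clean.orders", "ml.features"),
    ("raw.users", "ml.features"),
]

impacted = downstream(edges, "raw.orders")
print(sorted(impacted))
```

Here a change to `raw.orders` reaches `clean.orders`, `report.revenue`, and `ml.features`, while `raw.users` is unaffected.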

🀝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.


📄 License

MIT License. See LICENSE for details.


DataLineagePy 3.0: The new standard for Python data lineage
Beautiful. Powerful. Effortless.
