# Lineage Lens: Intelligent Data Lineage Analysis Platform

Lineage Lens is an AI-powered platform that automatically extracts and explains data lineage from SQL scripts, providing interactive visualizations and natural language explanations to help data teams understand complex data flows.
## Prerequisites

- Python 3.8+
- Claude API key from Anthropic
## Quick Start

```bash
cd Lineage_Lens/code
./run_app.sh
```
## Manual Setup

1. Clone and navigate to the project:

   ```bash
   cd Lineage_Lens/code
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

   Note: Requirements use flexible version ranges (`>=`) to avoid dependency conflicts and ensure compatibility with your existing Python environment.

3. Set up environment:

   ```bash
   # Copy environment template
   cp config/env_example.txt .env
   # Edit .env and add your Claude API key
   echo "ANTHROPIC_API_KEY=your_api_key_here" >> .env
   ```

4. Run the application:

   ```bash
   streamlit run streamlit_app.py
   ```

5. Open your browser to:

   http://localhost:8501
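With the environment configured, the application can read the key at startup. A minimal stdlib sketch, assuming the variable name from the setup above (the `load_api_key` helper is hypothetical; the actual app may use python-dotenv or its own config module):

```python
import os

def load_api_key() -> str:
    """Read the Claude API key from the environment (populated from .env)."""
    key = os.environ.get("ANTHROPIC_API_KEY", "")
    if not key:
        raise RuntimeError("ANTHROPIC_API_KEY is not set; add it to .env")
    return key

# Simulate a configured environment for demonstration.
os.environ["ANTHROPIC_API_KEY"] = "sk-demo"
print(load_api_key())  # → sk-demo
```

Failing fast with a clear message when the key is missing avoids a confusing downstream API error.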
## Project Structure

```
Lineage_Lens/
├── code/
│   ├── src/
│   │   ├── lineage_lens/
│   │   │   ├── models/       # Pydantic data models
│   │   │   ├── parsers/      # SQL parsing logic
│   │   │   ├── visualizers/  # Graph visualization
│   │   │   ├── llm/          # Claude API integration
│   │   │   └── utils/        # Configuration & utilities
│   │   └── tests/            # Unit tests
│   ├── data/                 # Sample SQL files for testing
│   ├── config/               # Configuration files
│   ├── streamlit_app.py      # Main Streamlit application
│   ├── requirements.txt      # Python dependencies
│   └── README.md             # This file
├── demo/                     # Demo materials
├── presentation/             # Presentation files
└── prompt_screenshots/       # Screenshots for demo
```
## Features

- 📊 SQL Lineage Parsing: Automatically extract table and column dependencies from SQL scripts
- 🔗 Interactive Visualization: View data flow as interactive graphs with multiple layout options
- 🤖 AI-Powered Explanations: Ask natural language questions about your data lineage
- 📈 Business Impact Analysis: Understand downstream effects of data changes
- Production-Ready Code: Built with Pydantic models, proper error handling, and pylint compliance
- Modern UI: Clean Streamlit interface with responsive design
- Multiple Layout Algorithms: Spring, hierarchical, circular, and force-directed graph layouts
- Comprehensive Test Data: Realistic e-commerce, marketing, and financial pipeline examples
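To illustrate what lineage parsing involves, here is a deliberately naive sketch that maps a created table to the tables it reads. This is not the platform's parser, which must also handle CTEs, subqueries, and quoted identifiers:

```python
import re

def extract_dependencies(sql: str) -> dict:
    """Naive lineage extraction: find the target table and its source tables."""
    target_match = re.search(r"(?:CREATE\s+TABLE|INSERT\s+INTO)\s+(\w+)", sql, re.I)
    target = target_match.group(1) if target_match else None
    sources = re.findall(r"(?:FROM|JOIN)\s+(\w+)", sql, re.I)
    return {"target": target, "sources": sorted(set(sources))}

sql = """
CREATE TABLE customer_summary AS
SELECT c.id, COUNT(o.id) AS order_count
FROM customers c
JOIN orders o ON o.customer_id = c.id
GROUP BY c.id;
"""
print(extract_dependencies(sql))
# → {'target': 'customer_summary', 'sources': ['customers', 'orders']}
```

Each statement yields one target-to-sources mapping; combining the mappings across a script produces the lineage graph.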
## Test Data

The platform includes three comprehensive test datasets:

### E-commerce Pipeline

- Raw sources: customers, products, orders, order_items
- Staging layer: data cleaning and validation
- Marts: customer_summary, product_performance, monthly_sales_summary
- Reports: executive_dashboard, customer_retention_cohorts

### Marketing Pipeline

- Raw sources: campaigns, ad_clicks, conversions
- Analysis: attribution modeling, user journey analysis
- Reports: campaign_performance, channel_effectiveness

### Financial Pipeline

- Raw sources: transactions, accounts, budget_allocations
- Processing: monthly_financials, budget_vs_actual
- Reports: profit_loss_statement, executive_financial_summary
## Example Questions

**Basic Questions:**

- "Where does customer_summary get its data from?"
- "What tables depend on raw_orders?"
- "How does data flow from source to executive dashboard?"

**Business Impact:**

- "If raw_customers has quality issues, what reports are affected?"
- "What's the lineage behind profit margin calculations?"
- "Show me all dependencies for marketing ROI analysis"

**Technical Analysis:**

- "Which tables have no upstream dependencies?"
- "What's the maximum depth of the lineage graph?"
- "Which intermediate tables are critical junction points?"
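Questions like "What tables depend on raw_orders?" reduce to graph traversal over the extracted lineage. A sketch with hypothetical edges loosely based on the e-commerce example:

```python
from collections import deque

# Hypothetical edges: source table -> tables that read from it (downstream).
DOWNSTREAM = {
    "raw_orders": ["stg_orders"],
    "stg_orders": ["customer_summary", "monthly_sales_summary"],
    "customer_summary": ["executive_dashboard"],
    "monthly_sales_summary": ["executive_dashboard"],
}

def affected_tables(table: str) -> list:
    """Answer 'what depends on X?' with a breadth-first walk downstream."""
    seen, queue = set(), deque([table])
    while queue:
        for nxt in DOWNSTREAM.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return sorted(seen)

print(affected_tables("raw_orders"))
# → ['customer_summary', 'executive_dashboard', 'monthly_sales_summary', 'stg_orders']
```

The same walk in the upstream direction answers "where does this table get its data from?".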
## Demo Walkthrough

1. Upload `sample_ecommerce_pipeline.sql`
2. Analyze the lineage graph visualization
3. Ask: "Explain how customer lifetime value is calculated"
4. Switch to hierarchical layout to see data flow layers
5. Upload additional marketing pipeline for cross-analysis
6. Ask: "What are the main data sources across all pipelines?"
## Development

```bash
# Run linting
pylint src/lineage_lens/

# Run tests
pytest src/tests/

# Format code
black src/
```

## Architecture Highlights

- Pydantic Models: Type-safe data validation and serialization
- Modular Design: Separate concerns for parsing, visualization, and AI
- Error Handling: Graceful degradation with user-friendly error messages
- Configuration Management: Environment-based settings with sensible defaults
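As a rough illustration of the model layer, here is a stdlib dataclass stand-in (the actual project uses Pydantic; these class and field names are hypothetical):

```python
from dataclasses import dataclass

# Stdlib stand-in for the project's Pydantic models (hypothetical shape).
@dataclass(frozen=True)
class TableNode:
    name: str
    table_type: str  # "source" | "intermediate" | "target"

    def __post_init__(self):
        # Validate on construction, as a Pydantic model would.
        if self.table_type not in {"source", "intermediate", "target"}:
            raise ValueError(f"unknown table_type: {self.table_type}")

@dataclass
class LineageEdge:
    source: TableNode
    target: TableNode
    sql: str = ""  # the statement that created the dependency

node = TableNode("raw_orders", "source")
print(node.name, node.table_type)  # → raw_orders source
```

Validating at the model boundary keeps bad parser output from propagating into visualization or LLM prompts.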
**Business Impact:**

- Reduces data discovery time from hours to minutes
- Enables non-technical users to understand complex data flows
- Improves data governance and impact analysis capabilities

**Technical Excellence:**

- Production-ready code with proper testing and documentation
- Modern Python practices with type hints and validation
- Scalable architecture ready for enterprise deployment
## Configuration

Key settings in `src/lineage_lens/utils/config.py`:

```bash
# API Configuration
ANTHROPIC_API_KEY=your_key_here

# App Configuration
APP_TITLE="Lineage Lens"
DEBUG_MODE=false

# Visualization
DEFAULT_LAYOUT="spring"
MAX_NODES_DISPLAY=100
```

## Output

The platform generates:
- Interactive Graph Visualizations: Nodes colored by table type (source/intermediate/target)
- Natural Language Explanations: Business-friendly descriptions of data flows
- Summary Statistics: Table counts, connection metrics, complexity scores
- Detailed Lineage Tables: Complete dependency information with SQL queries
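The summary statistics can be derived directly from the lineage edge list. A small sketch with made-up edges:

```python
# Hypothetical lineage edges: (upstream_table, downstream_table).
edges = [
    ("customers", "customer_summary"),
    ("orders", "customer_summary"),
    ("customer_summary", "executive_dashboard"),
]
tables = {t for edge in edges for t in edge}
sources = tables - {dst for _, dst in edges}   # no upstream dependencies
targets = tables - {src for src, _ in edges}   # nothing reads from them
print(len(tables), len(edges), sorted(sources), sorted(targets))
# → 4 3 ['customers', 'orders'] ['executive_dashboard']
```

The same sets drive the node coloring: sources, targets, and everything in between as intermediates.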
## Production Roadmap

For production deployment:
- Add database connectivity for live metadata extraction
- Implement caching for large lineage graphs
- Add user authentication and workspace management
- Integrate with data catalogs (dbt, Apache Atlas)
- Add real-time lineage tracking and alerts
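For the caching item, a first step could be memoizing parsed results per file, sketched here with `functools.lru_cache` (the `parse_lineage` function and its return shape are hypothetical):

```python
from functools import lru_cache

@lru_cache(maxsize=128)
def parse_lineage(sql_path: str) -> tuple:
    """Hypothetical cached parse; returns (target, sources) for a SQL file."""
    # A real implementation would read and analyze the file here.
    return ("customer_summary", ("customers", "orders"))

parse_lineage("data/sample_ecommerce_pipeline.sql")
parse_lineage("data/sample_ecommerce_pipeline.sql")  # second call hits the cache
print(parse_lineage.cache_info().hits)  # → 1
```

An in-process LRU is only a starting point; a shared cache keyed by file hash would be needed once multiple users or workers are involved.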
## Who Benefits

- **Data Engineers:** Faster debugging and impact analysis
- **Business Analysts:** Self-service data discovery without SQL knowledge
- **Data Governance:** Automated documentation and compliance tracking
- **Leadership:** Clear visibility into data dependencies and risks
---

*Built with ❤️ by the Mindstream Makers for Hack-a-Prompt. Transforming complex data lineage into actionable insights.*