Flinkenstein: Natural Language to Flink SQL

Transform natural language queries into production-ready Apache Flink SQL jobs with AI-powered validation, deployment, and continuous learning that gets smarter with every successful job.

🌟 What is Flinkenstein?

Flinkenstein is an intelligent streaming data platform that combines:

  • Natural Language Processing: Convert business questions to Flink SQL
  • AI-Powered Validation: Automatic SQL syntax and schema validation
  • Click-to-Deploy: Once validated, push SQL directly to Flink and preview with one click
  • Streaming Analytics: Real-time data processing with Apache Flink
  • Multi-Domain Data: Retail, Financial Services, and Fraud Detection scenarios
  • RAG-Enhanced AI: Context-aware SQL generation using knowledge base
  • Continuous Learning: System gets smarter with every successful Flink job

🧠 RAG Knowledge Base

Flinkenstein includes a comprehensive RAG (Retrieval-Augmented Generation) knowledge base stored in Qdrant that provides:

  • Flink SQL Patterns: Common streaming analytics patterns and examples
  • Schema Information: Topic structures and data formats for all domains
  • Best Practices: Performance tips, windowing strategies, and join patterns
  • Error Solutions: Common Flink errors and their resolutions
  • Domain Context: Business logic and relationships for retail, fraud, and crossover scenarios

The knowledge base is automatically populated when you run the demo command, ensuring your AI has the context needed to generate accurate, production-ready Flink SQL.
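
For a concrete picture of that retrieval step, here is a minimal sketch in Node.js using the official @qdrant/js-client-rest client; the embedding step and payload shape are illustrative, not lifted from this repo:

const { QdrantClient } = require("@qdrant/js-client-rest");

const qdrant = new QdrantClient({ url: "http://localhost:6333" });

async function retrieveContext(queryEmbedding) {
  // Find the Flink SQL patterns closest to the user's natural language query
  const hits = await qdrant.search("flinkenstein-knowledge", {
    vector: queryEmbedding, // embedding of the user's question
    limit: 5,
  });
  // Each payload is folded into the LLM prompt as grounding context
  return hits.map((hit) => hit.payload);
}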

📚 Knowledge Base Components

  1. Comprehensive Flink SQL Knowledge (flinkenstein-knowledge)

    • 23 advanced patterns covering windowing, joins, CTEs, UDFs
    • Complex query patterns for time-filtering, multi-topic joins, aggregations
    • Top 10 Flink error prevention patterns
    • Advanced use case patterns (anomaly detection, pattern recognition)
    • Demo flow patterns (Phase 1-3 progression)
  2. Topic Metadata (flinkenstein-topic-metadata)

    • 22 metadata items covering all topics in your system
    • Source topic details: fields, descriptions, use cases, schemas
    • Sink topic information: output patterns, notes
    • Topic categorization: retail, fraud, crossover
    • Schema availability and status
  3. Cluster-Specific Learning (flinkenstein-cluster-*)

    • Dynamically created collections for each session
    • Successful Flink SQL statements with full context (example payload below)
    • Job validation results and sink topic data
    • Environment-specific patterns that improve over time
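
As an illustration only, a learned entry in a cluster collection might carry a payload along these lines (a hypothetical shape; the real fields are defined by the backend's learning code):

// Hypothetical payload for one entry in a flinkenstein-cluster-* collection
const learnedEntry = {
  sql: "INSERT INTO fraud_alerts SELECT ...",  // the successful statement
  sourceTopics: ["transactions"],              // topics the job read from
  sinkTopic: "fraud_alerts",                   // topic the job wrote to
  jobStatus: "RUNNING",                        // Flink state at validation
  sinkHasData: true,                           // sink held real records
  validatedAt: "2025-01-01T00:00:00Z",         // when success was confirmed
};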

🧠 Continuous Learning System

Flinkenstein gets smarter with every successful Flink job!

🎯 How It Works

The system continuously learns from your environment through success reinforcement:

  1. Deploy Flink SQL through the natural language interface
  2. Automatic Validation monitors job status and data flow
  3. Success Learning captures successful SQL (see the sketch after this list) when:
    • ✅ Flink job is running successfully
    • ✅ Sink topic contains actual data
    • ✅ No errors or failures detected
  4. Knowledge Accumulation saves successful patterns to cluster-specific collections
  5. Persistent Learning preserves all learning across restarts
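
A minimal sketch of the gate behind step 3, in Node.js: the /jobs/:jobid endpoint and its "state" field are part of the standard Flink REST API, while FLINK_REST_URL and countRecords() are hypothetical stand-ins for this repo's own configuration and sink check:

// Only learn from jobs that are RUNNING and have produced real data.
async function shouldLearnFrom(jobId, sinkTopic) {
  const res = await fetch(`${process.env.FLINK_REST_URL}/jobs/${jobId}`);
  const { state } = await res.json();
  if (state !== "RUNNING") return false;          // job must be healthy
  const records = await countRecords(sinkTopic);  // hypothetical sink check
  return records > 0;                             // sink must hold real data
}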

🚀 Per-Cluster Intelligence

Each environment you point Flinkenstein at becomes increasingly intelligent:

  • First Run: generic Flink SQL patterns (23) + topic metadata (22) + dynamic topic discovery
  • After Successes: Generic patterns + environment-specific successful SQL
  • Over Time: System learns your exact topics, schemas, and data patterns
  • Portability: Take your learned intelligence to new clusters

💡 Learning Benefits

  • Higher Success Rate: Each successful job improves future SQL generation
  • Environment Awareness: System learns your specific Kafka topics and schemas
  • Pattern Recognition: Identifies what works in your data environment
  • Continuous Improvement: Learns automatically, with no manual intervention needed

🔄 Learning Lifecycle

Startup → Load Generic Patterns (23) + Topic Metadata (22) + Restore Cluster Learning
   ↓
Runtime → Monitor Flink Jobs + Validate Success + Learn from Success
   ↓
Teardown → Backup All Collections (Generic + Topic + Cluster Learning)
   ↓
Next Startup → Restore All Learning + Continue Improving

💾 Learning Data Persistence

All learning data is automatically preserved across demo sessions:

  • Automatic Backup: teardown-with-backup command backs up all Qdrant collections
  • Learning Restoration: restore-learning command restores previous learning
  • Continuous Improvement: Each session builds upon previous learning
  • No Data Loss: Learning accumulates over time, improving success rates

🏗️ Architecture

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Frontend      │    │    Backend      │    │   Flink Cluster │
│   (React)       │◄──►│   (Node.js)     │◄──►│   + Kafka       │
│   Port: 5173    │    │   Port: 3001    │    │   + Schema Reg  │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                       │                       │
         │                       │                       │
         ▼                       ▼                       ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Monaco Editor │    │   LLM Service   │    │   Qdrant        │
│   SQL Editor    │    │   (Gemini/      │    │   Knowledge     │
│   + Validation  │    │    Claude/GPT)  │    │   Base          │
└─────────────────┘    └─────────────────┘    └─────────────────┘
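
To make the Backend → Flink hop concrete, here is a minimal sketch of submitting a statement through the Flink SQL Gateway REST API (v1), using the FLINK_SQL_GATEWAY_URL from the configuration section below; error handling and result polling are omitted:

const GATEWAY = process.env.FLINK_SQL_GATEWAY_URL || "http://localhost:8083";

async function submitStatement(statement) {
  // Open a gateway session
  const { sessionHandle } = await fetch(`${GATEWAY}/v1/sessions`, {
    method: "POST",
  }).then((r) => r.json());

  // Submit the SQL statement inside that session
  const { operationHandle } = await fetch(
    `${GATEWAY}/v1/sessions/${sessionHandle}/statements`,
    {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ statement }),
    }
  ).then((r) => r.json());

  return operationHandle; // poll this handle for status and results
}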

🚀 Quick Start with Flox

Prerequisites

  • Flox installed and configured (for package management)
  • Docker Desktop running
  • Git

1. Clone and Setup

git clone https://github.com/8BitTacoSupreme/flinkenstein
cd flinkenstein

2. Start Everything with Two Commands

# Activate Flox environment and start everything
flox activate
demo

That's it! The demo command will automatically:

  • Set up the environment (directories, env files, scripts)
  • Start all infrastructure (Kafka, Flink, Qdrant, Schema Registry)
  • Populate comprehensive Flink SQL knowledge base (23 patterns)
  • Populate topic metadata (22 items)
  • Start the backend service
  • Start the frontend service
  • Generate demo data for all domains
  • Verify everything is working

🧠 Knowledge Base: The system now loads both generic Flink SQL patterns AND topic-specific metadata, giving your LLM comprehensive context for better SQL generation.

💡 Pro tip: Run help to see all available commands!

3. Manual Alternative (if needed)

# Setup environment first
setup

# Start infrastructure manually
start

# Start backend manually  
dev-backend

# Start frontend manually
dev-frontend

# Run demo manually
demo

4. Access Applications

Once the demo is running, the services are available at the ports shown in the architecture diagram:

  • Frontend: http://localhost:5173
  • Backend API: http://localhost:3001
  • Qdrant: http://localhost:6333

🎯 Demo Scenarios

Retail Domain

  • Customer lifecycle analysis
  • Order status tracking
  • Inventory management
  • Return pattern analysis
  • Customer feedback insights

Fraud Detection (FSI)

  • Transaction risk scoring
  • Behavioral pattern analysis
  • KYC compliance monitoring
  • Device fingerprinting
  • Risk alert generation

Crossover (Retail + Fraud)

  • Payment fraud detection
  • Account takeover detection
  • Suspicious order patterns
  • Return fraud analysis

🧠 Learning-Enhanced Experience

As you use Flinkenstein, the system becomes increasingly intelligent:

  • First Query: System uses generic patterns + discovers your topics
  • Successful Jobs: System learns what works in your environment
  • Repeated Patterns: System suggests optimizations based on learned success
  • Environment Mastery: System becomes an expert on your specific data landscape

🔧 Configuration

Environment Variables

Create .env files in each service directory:

Backend (backend/.env):

PORT=3001
NODE_ENV=development
FLINK_SQL_GATEWAY_URL=http://localhost:8083
KAFKA_INTERNAL_BOOTSTRAP_SERVERS=broker:9092
KAFKA_BOOTSTRAP_SERVERS=localhost:19092
SCHEMA_REGISTRY_URL=http://localhost:8081
RUNNING_IN_DOCKER=false

# LLM API Keys (optional - can be set per request)
OPENAI_API_KEY=
GOOGLE_API_KEY=
ANTHROPIC_API_KEY=

LLM Configuration

The system supports multiple AI providers:

  • Google Gemini: Fast, reliable SQL generation
  • Anthropic Claude: High-quality, context-aware responses
  • OpenAI GPT: Versatile, well-tested performance

API keys can be configured per request in the frontend.
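
As an illustration of the per-request pattern, a frontend call might pass the provider and key alongside the query; the endpoint path and body shape here are hypothetical:

// Hypothetical request shape; the real endpoint lives in backend/src/routes/
const response = await fetch("http://localhost:3001/api/generate-sql", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    query: "Show me orders returned within 7 days of purchase",
    provider: "gemini",       // "gemini" | "claude" | "gpt"
    apiKey: userSuppliedKey,  // used for this request only
  }),
});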

📊 Data Generation

The system includes three data generators:

  1. Retail Data (scripts/generate-retail-data.js)

    • Customers, products, orders, returns, feedback, inventory
    • Referential integrity and business logic
  2. Fraud Data (scripts/generate-fraud-data.js)

    • Transactions, risk alerts, compliance events, KYC
    • FSI-specific fraud patterns
  3. Crossover Data (scripts/generate-crossover-data.js)

    • Retail-specific fraud detection
    • Combines retail and fraud domains
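
All three generators ultimately write records to Kafka topics. Here is a minimal sketch of the producing side, assuming the kafkajs client and the KAFKA_BOOTSTRAP_SERVERS address from the configuration above (the client ID, topic name, and record shape are illustrative):

const { Kafka } = require("kafkajs");

const kafka = new Kafka({
  clientId: "flinkenstein-generator", // illustrative client ID
  brokers: ["localhost:19092"],       // matches KAFKA_BOOTSTRAP_SERVERS
});
const producer = kafka.producer();

async function emitOrder(order) {
  await producer.connect();
  await producer.send({
    topic: "retail_orders", // illustrative topic name
    messages: [{ key: order.orderId, value: JSON.stringify(order) }],
  });
  await producer.disconnect();
}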

🛠️ Development

Project Structure

flinkenstein/
├── backend/           # Node.js API server
├── frontend/          # React application
├── infrastructure/    # Docker Compose setup
├── scripts/           # Data generation & utilities
└── flox.nix          # Flox environment definition

Key Commands

# Infrastructure
./scripts/flox-scripts.sh start        # Start Docker services
./scripts/flox-scripts.sh stop         # Stop Docker services
./scripts/flox-scripts.sh restart      # Restart Docker services

# Development
./scripts/flox-scripts.sh dev-backend  # Start backend in dev mode
./scripts/flox-scripts.sh dev-frontend # Start frontend in dev mode

# Demo
./scripts/flox-scripts.sh demo         # Run full demo setup
./scripts/flox-scripts.sh teardown     # Clean up demo environment

# Learning System
./scripts/startup-with-learning.js     # Start with learning system
./scripts/teardown-with-learning-backup.sh  # Teardown with learning backup
./scripts/restore-learning-data.sh     # Restore learning data from backups

# Health checks
./scripts/flox-scripts.sh health       # Check service health

Adding New Features

  1. New Data Domains: Add generators in scripts/
  2. SQL Patterns: Extend knowledge base in scripts/populate-knowledge-base.js
  3. UI Components: Add React components in frontend/src/components/
  4. API Endpoints: Extend Express routes in backend/src/routes/ (see the sketch below)
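
For item 4, a new endpoint follows the standard Express router pattern; a minimal skeleton (the file name and route path are illustrative):

// backend/src/routes/example.js (illustrative skeleton)
const express = require("express");
const router = express.Router();

// GET /api/example: replace with your feature's logic
router.get("/example", (req, res) => {
  res.json({ ok: true });
});

module.exports = router;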

🔍 Troubleshooting

Common Issues

Services not starting:

./scripts/flox-scripts.sh health
docker-compose logs <service-name>

Data generation errors:

cd scripts
./teardown-domain-separated.sh
./demo-domain-separated.sh

SQL validation failures:

  • Check topic names match domain prefixes
  • Verify schema registry is healthy
  • Review LLM API key configuration

Flink job failures:

  • Check Kafka topic existence
  • Verify schema compatibility
  • Review connector properties

Learning system issues:

  • Verify Qdrant is running: curl http://localhost:6333/collections
  • Check knowledge base collections: Look for flinkenstein-knowledge, flinkenstein-topic-metadata, and flinkenstein-cluster-* collections
  • Verify knowledge base population: curl http://localhost:6333/collections/flinkenstein-knowledge | jq '.result.points_count'
  • Check topic metadata: curl http://localhost:6333/collections/flinkenstein-topic-metadata | jq '.result.points_count'
  • Review learning logs in backend console for job monitoring messages
  • Ensure Flink UI is accessible for job status checks
  • Use kb-restore command to repopulate knowledge bases if needed

Logs

  • Backend: scripts/logs/log-backend.txt
  • Frontend: scripts/logs/log-frontend.txt
  • Data Generators: scripts/logs/log-*.txt
  • Docker: docker-compose logs <service>

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Test thoroughly
  5. Submit a pull request

📄 License

[Your License Here]

🆘 Support

For issues and questions:

  1. Check the troubleshooting section
  2. Review logs and error messages
  3. Open an issue with detailed information
  4. Include environment details and reproduction steps

Happy Streaming! 🚀
