Flinkenstein: Natural Language to Flink SQL

Transform natural language queries into production-ready Apache Flink SQL jobs with AI-powered validation, deployment, and continuous learning that gets smarter with every successful job.

🌟 What is Flinkenstein?

Flinkenstein is an intelligent streaming data platform that combines:

  • Natural Language Processing: Convert business questions to Flink SQL
  • AI-Powered Validation: Automatic SQL syntax and schema validation
  • Click-to-Deploy: Once validated, push SQL directly to Flink and preview with one click
  • Streaming Analytics: Real-time data processing with Apache Flink
  • Multi-Domain Data: Retail, Financial Services, and Fraud Detection scenarios
  • RAG-Enhanced AI: Context-aware SQL generation using knowledge base
  • Continuous Learning: System gets smarter with every successful Flink job

🧠 RAG Knowledge Base

Flinkenstein includes a comprehensive RAG (Retrieval-Augmented Generation) knowledge base stored in Qdrant that provides:

  • Flink SQL Patterns: Common streaming analytics patterns and examples
  • Schema Information: Topic structures and data formats for all domains
  • Best Practices: Performance tips, windowing strategies, and join patterns
  • Error Solutions: Common Flink errors and their resolutions
  • Domain Context: Business logic and relationships for retail, fraud, and crossover scenarios

The knowledge base is automatically populated when you run the demo command, ensuring your AI has the context needed to generate accurate, production-ready Flink SQL.
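
For a concrete picture of that retrieval step, here is a minimal sketch in Node.js using the official @qdrant/js-client-rest client; the embedding step and payload shape are illustrative, not lifted from this repo:

const { QdrantClient } = require("@qdrant/js-client-rest");

const qdrant = new QdrantClient({ url: "http://localhost:6333" });

async function retrieveContext(queryEmbedding) {
  // Find the Flink SQL patterns closest to the user's natural language query
  const hits = await qdrant.search("flinkenstein-knowledge", {
    vector: queryEmbedding, // embedding of the user's question
    limit: 5,
  });
  // Each payload is folded into the LLM prompt as grounding context
  return hits.map((hit) => hit.payload);
}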

📚 Knowledge Base Components

  1. Comprehensive Flink SQL Knowledge (flinkenstein-knowledge)

    • 23 advanced patterns covering windowing, joins, CTEs, UDFs
    • Complex query patterns for time-filtering, multi-topic joins, aggregations
    • Top 10 Flink error prevention patterns
    • Advanced use case patterns (anomaly detection, pattern recognition)
    • Demo flow patterns (Phase 1-3 progression)
  2. Topic Metadata (flinkenstein-topic-metadata)

    • 22 metadata items covering all topics in your system
    • Source topic details: fields, descriptions, use cases, schemas
    • Sink topic information: output patterns, notes
    • Topic categorization: retail, fraud, crossover
    • Schema availability and status
  3. Cluster-Specific Learning (flinkenstein-cluster-*)

    • Dynamically created collections for each session
    • Successful Flink SQL statements with full context (example payload below)
    • Job validation results and sink topic data
    • Environment-specific patterns that improve over time
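
As an illustration only, a learned entry in a cluster collection might carry a payload along these lines (a hypothetical shape; the real fields are defined by the backend's learning code):

// Hypothetical payload for one entry in a flinkenstein-cluster-* collection
const learnedEntry = {
  sql: "INSERT INTO fraud_alerts SELECT ...",  // the successful statement
  sourceTopics: ["transactions"],              // topics the job read from
  sinkTopic: "fraud_alerts",                   // topic the job wrote to
  jobStatus: "RUNNING",                        // Flink state at validation
  sinkHasData: true,                           // sink held real records
  validatedAt: "2025-01-01T00:00:00Z",         // when success was confirmed
};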

🧠 Continuous Learning System

Flinkenstein gets smarter with every successful Flink job!

🎯 How It Works

The system continuously learns from your environment through success reinforcement:

  1. Deploy Flink SQL through the natural language interface
  2. Automatic Validation monitors job status and data flow
  3. Success Learning captures successful SQL (see the sketch after this list) when:
    • ✅ Flink job is running successfully
    • ✅ Sink topic contains actual data
    • ✅ No errors or failures detected
  4. Knowledge Accumulation saves successful patterns to cluster-specific collections
  5. Persistent Learning preserves all learning across restarts
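
A minimal sketch of the gate behind step 3, in Node.js: the /jobs/:jobid endpoint and its "state" field are part of the standard Flink REST API, while FLINK_REST_URL and countRecords() are hypothetical stand-ins for this repo's own configuration and sink check:

// Only learn from jobs that are RUNNING and have produced real data.
async function shouldLearnFrom(jobId, sinkTopic) {
  const res = await fetch(`${process.env.FLINK_REST_URL}/jobs/${jobId}`);
  const { state } = await res.json();
  if (state !== "RUNNING") return false;          // job must be healthy
  const records = await countRecords(sinkTopic);  // hypothetical sink check
  return records > 0;                             // sink must hold real data
}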

🚀 Per-Cluster Intelligence

Each environment you point Flinkenstein at becomes increasingly intelligent:

  • First Run: generic Flink SQL patterns (23) + topic metadata (22) + dynamic topic discovery
  • After Successes: Generic patterns + environment-specific successful SQL
  • Over Time: System learns your exact topics, schemas, and data patterns
  • Portability: Take your learned intelligence to new clusters

💡 Learning Benefits

  • Higher Success Rate: Each successful job improves future SQL generation
  • Environment Awareness: System learns your specific Kafka topics and schemas
  • Pattern Recognition: Identifies what works in your data environment
  • Continuous Improvement: Learns automatically, with no manual intervention needed

🔄 Learning Lifecycle

Startup → Load Generic Patterns (23) + Topic Metadata (22) + Restore Cluster Learning
   ↓
Runtime → Monitor Flink Jobs + Validate Success + Learn from Success
   ↓
Teardown → Backup All Collections (Generic + Topic + Cluster Learning)
   ↓
Next Startup → Restore All Learning + Continue Improving

💾 Learning Data Persistence

All learning data is automatically preserved across demo sessions:

  • Automatic Backup: teardown-with-backup command backs up all Qdrant collections
  • Learning Restoration: restore-learning command restores previous learning
  • Continuous Improvement: Each session builds upon previous learning
  • No Data Loss: Learning accumulates over time, improving success rates

🏗️ Architecture

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Frontend      │    │    Backend      │    │   Flink Cluster │
│   (React)       │◄──►│   (Node.js)     │◄──►│   + Kafka       │
│   Port: 5173    │    │   Port: 3001    │    │   + Schema Reg  │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                       │                       │
         │                       │                       │
         ▼                       ▼                       ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Monaco Editor │    │   LLM Service   │    │   Qdrant        │
│   SQL Editor    │    │   (Gemini/      │    │   Knowledge     │
│   + Validation  │    │    Claude/GPT)  │    │   Base          │
└─────────────────┘    └─────────────────┘    └─────────────────┘
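
To make the Backend → Flink hop concrete, here is a minimal sketch of submitting a statement through the Flink SQL Gateway REST API (v1), using the FLINK_SQL_GATEWAY_URL from the configuration section below; error handling and result polling are omitted:

const GATEWAY = process.env.FLINK_SQL_GATEWAY_URL || "http://localhost:8083";

async function submitStatement(statement) {
  // Open a gateway session
  const { sessionHandle } = await fetch(`${GATEWAY}/v1/sessions`, {
    method: "POST",
  }).then((r) => r.json());

  // Submit the SQL statement inside that session
  const { operationHandle } = await fetch(
    `${GATEWAY}/v1/sessions/${sessionHandle}/statements`,
    {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ statement }),
    }
  ).then((r) => r.json());

  return operationHandle; // poll this handle for status and results
}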

🚀 Quick Start with Flox

Prerequisites

  • Flox installed and configured (for package management)
  • Docker Desktop running
  • Git

1. Clone and Setup

git clone https://github.com/8BitTacoSupreme/flinkenstein
cd flinkenstein

2. Start Everything with Two Commands

# Activate Flox environment and start everything
flox activate
demo

That's it! The demo command will automatically:

  • Set up the environment (directories, env files, scripts)
  • Start all infrastructure (Kafka, Flink, Qdrant, Schema Registry)
  • Populate comprehensive Flink SQL knowledge base (23 patterns)
  • Populate topic metadata (22 items)
  • Start the backend service
  • Start the frontend service
  • Generate demo data for all domains
  • Verify everything is working

🧠 Knowledge Base: The system now loads both generic Flink SQL patterns AND topic-specific metadata, giving your LLM comprehensive context for better SQL generation.

💡 Pro tip: Run help to see all available commands!

3. Manual Alternative (if needed)

# Setup environment first
setup

# Start infrastructure manually
start

# Start backend manually  
dev-backend

# Start frontend manually
dev-frontend

# Run demo manually
demo

4. Access Applications

Once the demo is running, the services are available at the ports shown in the architecture diagram:

  • Frontend: http://localhost:5173
  • Backend API: http://localhost:3001
  • Qdrant: http://localhost:6333

🎯 Demo Scenarios

Retail Domain

  • Customer lifecycle analysis
  • Order status tracking
  • Inventory management
  • Return pattern analysis
  • Customer feedback insights

Fraud Detection (FSI)

  • Transaction risk scoring
  • Behavioral pattern analysis
  • KYC compliance monitoring
  • Device fingerprinting
  • Risk alert generation

Crossover (Retail + Fraud)

  • Payment fraud detection
  • Account takeover detection
  • Suspicious order patterns
  • Return fraud analysis

🧠 Learning-Enhanced Experience

As you use Flinkenstein, the system becomes increasingly intelligent:

  • First Query: System uses generic patterns + discovers your topics
  • Successful Jobs: System learns what works in your environment
  • Repeated Patterns: System suggests optimizations based on learned success
  • Environment Mastery: System becomes an expert on your specific data landscape

🔧 Configuration

Environment Variables

Create .env files in each service directory:

Backend (backend/.env):

PORT=3001
NODE_ENV=development
FLINK_SQL_GATEWAY_URL=http://localhost:8083
KAFKA_INTERNAL_BOOTSTRAP_SERVERS=broker:9092
KAFKA_BOOTSTRAP_SERVERS=localhost:19092
SCHEMA_REGISTRY_URL=http://localhost:8081
RUNNING_IN_DOCKER=false

# LLM API Keys (optional - can be set per request)
OPENAI_API_KEY=
GOOGLE_API_KEY=
ANTHROPIC_API_KEY=

LLM Configuration

The system supports multiple AI providers:

  • Google Gemini: Fast, reliable SQL generation
  • Anthropic Claude: High-quality, context-aware responses
  • OpenAI GPT: Versatile, well-tested performance

API keys can be configured per request in the frontend.
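
As an illustration of the per-request pattern, a frontend call might pass the provider and key alongside the query; the endpoint path and body shape here are hypothetical:

// Hypothetical request shape; the real endpoint lives in backend/src/routes/
const response = await fetch("http://localhost:3001/api/generate-sql", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    query: "Show me orders returned within 7 days of purchase",
    provider: "gemini",       // "gemini" | "claude" | "gpt"
    apiKey: userSuppliedKey,  // used for this request only
  }),
});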

📊 Data Generation

The system includes three data generators:

  1. Retail Data (scripts/generate-retail-data.js)

    • Customers, products, orders, returns, feedback, inventory
    • Referential integrity and business logic
  2. Fraud Data (scripts/generate-fraud-data.js)

    • Transactions, risk alerts, compliance events, KYC
    • FSI-specific fraud patterns
  3. Crossover Data (scripts/generate-crossover-data.js)

    • Retail-specific fraud detection
    • Combines retail and fraud domains
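
All three generators ultimately write records to Kafka topics. Here is a minimal sketch of the producing side, assuming the kafkajs client and the KAFKA_BOOTSTRAP_SERVERS address from the configuration above (the client ID, topic name, and record shape are illustrative):

const { Kafka } = require("kafkajs");

const kafka = new Kafka({
  clientId: "flinkenstein-generator", // illustrative client ID
  brokers: ["localhost:19092"],       // matches KAFKA_BOOTSTRAP_SERVERS
});
const producer = kafka.producer();

async function emitOrder(order) {
  await producer.connect();
  await producer.send({
    topic: "retail_orders", // illustrative topic name
    messages: [{ key: order.orderId, value: JSON.stringify(order) }],
  });
  await producer.disconnect();
}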

🛠️ Development

Project Structure

flinkenstein/
├── backend/           # Node.js API server
├── frontend/          # React application
├── infrastructure/    # Docker Compose setup
├── scripts/           # Data generation & utilities
└── flox.nix          # Flox environment definition

Key Commands

# Infrastructure
./scripts/flox-scripts.sh start        # Start Docker services
./scripts/flox-scripts.sh stop         # Stop Docker services
./scripts/flox-scripts.sh restart      # Restart Docker services

# Development
./scripts/flox-scripts.sh dev-backend  # Start backend in dev mode
./scripts/flox-scripts.sh dev-frontend # Start frontend in dev mode

# Demo
./scripts/flox-scripts.sh demo         # Run full demo setup
./scripts/flox-scripts.sh teardown     # Clean up demo environment

# Learning System
./scripts/startup-with-learning.js     # Start with learning system
./scripts/teardown-with-learning-backup.sh  # Teardown with learning backup
./scripts/restore-learning-data.sh     # Restore learning data from backups

# Health checks
./scripts/flox-scripts.sh health       # Check service health

Adding New Features

  1. New Data Domains: Add generators in scripts/
  2. SQL Patterns: Extend knowledge base in scripts/populate-knowledge-base.js
  3. UI Components: Add React components in frontend/src/components/
  4. API Endpoints: Extend Express routes in backend/src/routes/ (see the sketch below)
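
For item 4, a new endpoint follows the standard Express router pattern; a minimal skeleton (the file name and route path are illustrative):

// backend/src/routes/example.js (illustrative skeleton)
const express = require("express");
const router = express.Router();

// GET /api/example: replace with your feature's logic
router.get("/example", (req, res) => {
  res.json({ ok: true });
});

module.exports = router;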

🔍 Troubleshooting

Common Issues

Services not starting:

./scripts/flox-scripts.sh health
docker-compose logs <service-name>

Data generation errors:

cd scripts
./teardown-domain-separated.sh
./demo-domain-separated.sh

SQL validation failures:

  • Check topic names match domain prefixes
  • Verify schema registry is healthy
  • Review LLM API key configuration

Flink job failures:

  • Check Kafka topic existence
  • Verify schema compatibility
  • Review connector properties

Learning system issues:

  • Verify Qdrant is running: curl http://localhost:6333/collections
  • Check knowledge base collections: Look for flinkenstein-knowledge, flinkenstein-topic-metadata, and flinkenstein-cluster-* collections
  • Verify knowledge base population: curl http://localhost:6333/collections/flinkenstein-knowledge | jq '.result.points_count'
  • Check topic metadata: curl http://localhost:6333/collections/flinkenstein-topic-metadata | jq '.result.points_count'
  • Review learning logs in backend console for job monitoring messages
  • Ensure Flink UI is accessible for job status checks
  • Use kb-restore command to repopulate knowledge bases if needed

Logs

  • Backend: scripts/logs/log-backend.txt
  • Frontend: scripts/logs/log-frontend.txt
  • Data Generators: scripts/logs/log-*.txt
  • Docker: docker-compose logs <service>

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Test thoroughly
  5. Submit a pull request

📄 License

[Your License Here]

🆘 Support

For issues and questions:

  1. Check the troubleshooting section
  2. Review logs and error messages
  3. Open an issue with detailed information
  4. Include environment details and reproduction steps

Happy Streaming! 🚀
