Enterprise Data Lakehouse Solution

AI Capabilities (NEW):
- 🤖 Local LLM server with Ollama (Llama 3.1, Mistral, Phi)
- 🧠 Vector database with Milvus for semantic search
- 📚 RAG (Retrieval Augmented Generation) system
- 💬 Interactive Chat UI for querying data with natural language
- 📄 Document upload support (PDF, Word, Excel, CSV, JSON, TXT, Markdown)
- 📦 Automatic S3/MinIO storage for all uploaded documents
- 🔄 Automatic data ingestion from PostgreSQL/Dremio to vector DB
- 🔒 100% on-premise - no cloud dependencies, complete data privacy
Created by: Mustapha Fonsau | GitHub
Supported by Talentys | LinkedIn - Data Engineering & Analytics Excellence
📖 Main documentation in English. Translations available in 17 additional languages below.
🇬🇧 English (You are here) | 🇫🇷 Français | 🇪🇸 Español | 🇵🇹 Português | 🇨🇳 中文 | 🇯🇵 日本語 | 🇷🇺 Русский | 🇸🇦 العربية | 🇩🇪 Deutsch | 🇰🇷 한국어 | 🇮🇳 हिन्दी | 🇮🇩 Indonesia | 🇹🇷 Türkçe | 🇻🇳 Tiếng Việt | 🇮🇹 Italiano | 🇳🇱 Nederlands | 🇵🇱 Polski | 🇸🇪 Svenska
AI-Ready professional data platform combining Airbyte, Dremio, dbt, Apache Superset, and Local LLM (Ollama) for enterprise-grade data integration, transformation, quality assurance, business intelligence, and AI-powered insights. Built with multilingual support for global teams.
graph TB
A[Data Sources] --> B[Airbyte ETL]
B --> C[Dremio Lakehouse]
C --> D[dbt Transformations]
D --> E[Apache Superset]
E --> F[Business Insights]
C --> G[Vector DB<br/>Milvus]
D --> G
G --> H[RAG System]
I[Local LLM<br/>Ollama] --> H
H --> J[AI Chat UI]
J --> K[AI-Powered<br/>Insights]
style B fill:#615EFF,color:#fff,stroke:#333,stroke-width:2px
style C fill:#f5f5f5,stroke:#333,stroke-width:2px
style D fill:#e8e8e8,stroke:#333,stroke-width:2px
style E fill:#d8d8d8,stroke:#333,stroke-width:2px
style G fill:#FF6B6B,color:#fff,stroke:#333,stroke-width:2px
style I fill:#4ECDC4,color:#fff,stroke:#333,stroke-width:2px
style H fill:#95E1D3,stroke:#333,stroke-width:2px
style J fill:#AA96DA,color:#fff,stroke:#333,stroke-width:2px
Data Platform:
- Data integration with Airbyte 1.8.0 (300+ connectors)
- Data lakehouse architecture with Dremio 26.0
- Automated transformations with dbt 1.10+
- Business intelligence with Apache Superset 3.0
- Comprehensive data quality testing (21 automated tests)
- Real-time synchronization via Arrow Flight
- Multilingual documentation (18 languages)
AI Capabilities (NEW):
- 🤖 Local LLM server with Ollama (Llama 3.1, Mistral, Phi)
- 🧠 Vector database with Milvus for semantic search
- 📚 RAG (Retrieval Augmented Generation) system
- 💬 Interactive Chat UI for querying data with natural language
- 📄 Document upload support (PDF, Word, Excel, CSV, JSON, TXT, Markdown)
- 🔄 Automatic data ingestion from PostgreSQL/Dremio to vector DB
- 🔒 100% on-premise - no cloud dependencies, complete data privacy
- Docker 20.10+ and Docker Compose 2.0+
- Python 3.11 or higher
- Minimum 8 GB RAM (16 GB recommended for AI services)
- 30 GB available disk space (includes LLM models)
- Optional: NVIDIA GPU for faster LLM inference
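Before deploying, you can sanity-check these requirements from the command line. The snippet below is a hypothetical convenience script (not part of the repository) that reports tool versions and free disk space:

```python
# prereq_check.py: hypothetical helper, not shipped with the repository
import shutil
import subprocess

def first_line(cmd):
    """Return the first line of a command's output, or a notice if the tool is missing."""
    try:
        out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
        return out.splitlines()[0] if out else "(no output)"
    except (OSError, subprocess.CalledProcessError):
        return "NOT FOUND: " + " ".join(cmd)

print("Docker:         ", first_line(["docker", "--version"]))
print("Docker Compose: ", first_line(["docker", "compose", "version"]))
print("Python:         ", first_line(["python", "--version"]))

# 30 GB of free disk space is the documented minimum (LLM models included)
free_gb = shutil.disk_usage(".").free / 1e9
status = "OK" if free_gb >= 30 else "below the documented 30 GB minimum"
print(f"Free disk space: {free_gb:.1f} GB ({status})")
```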
Use the orchestrate_platform.py script for automatic setup:
# Full deployment (Data Platform + AI Services)
python orchestrate_platform.py
# Windows PowerShell
$env:PYTHONIOENCODING="utf-8"
python -u orchestrate_platform.py
# Skip AI services if not needed
python orchestrate_platform.py --skip-ai
# Skip infrastructure (if already running)
python orchestrate_platform.py --skip-infrastructure
What it does:
- ✅ Validates prerequisites
- ✅ Starts all Docker services
- ✅ Deploys AI services (Ollama LLM, Milvus Vector DB, RAG API)
- ✅ Configures Airbyte, Dremio, dbt
- ✅ Runs data transformations
- ✅ Creates Superset dashboards
- ✅ Provides deployment summary with service URLs
# Clone repository
git clone https://github.com/Monsau/data-platform-iso-opensource.git
cd data-platform-iso-opensource
# Install dependencies
pip install -r requirements.txt
# Start infrastructure (Data Platform + AI Services)
docker-compose -f docker-compose.yml -f docker-compose-airbyte-stable.yml -f docker-compose-ai.yml up -d
# Or just data platform (no AI)
docker-compose -f docker-compose.yml -f docker-compose-airbyte-stable.yml up -d
# Or use make commands
make up
# Verify installation
make status
# Run quality tests
make dbt-test
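Once the containers are up, a quick smoke test can confirm that the web UIs listed in the access tables below actually respond. This is a minimal sketch, assuming the default ports and using httpx (the same library as the Python API example later in this README):

```python
# check_services.py: minimal smoke test against the default service URLs (hypothetical helper)
import httpx

SERVICES = {
    "Airbyte":       "http://localhost:8000",
    "Dremio":        "http://localhost:9047",
    "Superset":      "http://localhost:8088",
    "MinIO Console": "http://localhost:9001",
    "AI Chat UI":    "http://localhost:8501",
    "RAG API docs":  "http://localhost:8002/docs",
    "Ollama LLM":    "http://localhost:11434",
}

for name, url in SERVICES.items():
    try:
        code = httpx.get(url, timeout=5.0, follow_redirects=True).status_code
        print(f"{name:14s} {url:32s} HTTP {code}")
    except httpx.HTTPError as exc:
        print(f"{name:14s} {url:32s} UNREACHABLE ({type(exc).__name__})")
```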
Data Platform:
Service | URL | Credentials |
---|---|---|
Airbyte | http://localhost:8000 | airbyte / password |
Dremio | http://localhost:9047 | admin / admin123 |
Superset | http://localhost:8088 | admin / admin |
MinIO Console | http://localhost:9001 | minioadmin / minioadmin123 |
PostgreSQL | localhost:5432 | postgres / postgres123 |
AI Services (NEW):
Service | URL | Description |
---|---|---|
AI Chat UI | http://localhost:8501 | Chat with your data using natural language |
RAG API | http://localhost:8002 | REST API for AI queries |
RAG API Docs | http://localhost:8002/docs | Interactive API documentation |
Ollama LLM | http://localhost:11434 | Local LLM server |
Milvus Vector DB | localhost:19530 | Vector database for embeddings |
Embedding Service | http://localhost:8001 | Text-to-vector conversion |
Component | Version | Port | Description |
---|---|---|---|
Airbyte | 1.8.0 | 8000, 8001 | Data integration platform (300+ connectors) |
Dremio | 26.0 | 9047, 32010 | Data lakehouse platform |
dbt | 1.10+ | - | Data transformation tool |
Superset | 3.0.0 | 8088 | Business intelligence platform |
PostgreSQL | 15 | 5432 | Transactional database |
MinIO | Latest | 9000, 9001 | S3-compatible object storage |
Elasticsearch | 7.17.0 | 9200 | Search and analytics engine |
MySQL | 8.0 | 3307 | OpenMetadata database |
Component | Version | Port | Description |
---|---|---|---|
Ollama | Latest | 11434 | Local LLM server (Llama 3.1 - 8B parameters) |
Milvus | 2.3.3 | 19530 | Vector database for semantic search |
RAG API | 1.0 | 8002 | RAG orchestration & query API (FastAPI) |
Embedding Service | 1.0 | 8001 | Text-to-vector conversion (all-MiniLM-L6-v2) |
AI Chat UI | 1.0 | 8501 | Natural language query interface (Streamlit) |
Data Ingestion | 1.0 | - | Scheduled data loading service |
This project provides complete documentation in 18 languages, covering 5.2B+ people (70% of global population):
Language | Documentation | Data Generation | Native Speakers |
---|---|---|---|
🇬🇧 English | README.md | --language en | 1.5B |
🇫🇷 Français | docs/i18n/fr/ | --language fr | 280M |
🇪🇸 Español | docs/i18n/es/ | --language es | 559M |
🇵🇹 Português | docs/i18n/pt/ | --language pt | 264M |
🇸🇦 العربية | docs/i18n/ar/ | --language ar | 422M |
🇨🇳 中文 | docs/i18n/cn/ | --language cn | 1.3B |
🇯🇵 日本語 | docs/i18n/jp/ | --language jp | 125M |
🇷🇺 Русский | docs/i18n/ru/ | --language ru | 258M |
🇩🇪 Deutsch | docs/i18n/de/ | --language de | 134M |
🇰🇷 한국어 | docs/i18n/ko/ | --language ko | 81M |
🇮🇳 हिन्दी | docs/i18n/hi/ | --language hi | 602M |
🇮🇩 Indonesia | docs/i18n/id/ | --language id | 199M |
🇹🇷 Türkçe | docs/i18n/tr/ | --language tr | 88M |
🇻🇳 Tiếng Việt | docs/i18n/vi/ | --language vi | 85M |
🇮🇹 Italiano | docs/i18n/it/ | --language it | 85M |
🇳🇱 Nederlands | docs/i18n/nl/ | --language nl | 25M |
🇵🇱 Polski | docs/i18n/pl/ | --language pl | 45M |
🇸🇪 Svenska | docs/i18n/se/ | --language se | 13M |
# Generate French customer data (CSV format)
python config/i18n/data_generator.py --language fr --records 1000 --format csv
# Generate Spanish product data (JSON format)
python config/i18n/data_generator.py --language es --records 500 --format json
# Generate Chinese user data (Parquet format)
python config/i18n/data_generator.py --language cn --records 2000 --format parquet
Configuration: config/i18n/config.json
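To generate several locales in one run, the same documented flags can be scripted. A small hypothetical wrapper might look like this:

```python
# generate_samples.py: hypothetical wrapper around the documented CLI flags
import subprocess

# (language code, record count, output format), matching the examples above
JOBS = [("fr", 1000, "csv"), ("es", 500, "json"), ("cn", 2000, "parquet")]

for lang, records, fmt in JOBS:
    subprocess.run(
        ["python", "config/i18n/data_generator.py",
         "--language", lang, "--records", str(records), "--format", fmt],
        check=True,  # stop at the first failed generation
    )
```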
The platform includes a complete AI/LLM stack for natural language data querying and insights.
- Deploy Platform (includes AI services):
python orchestrate_platform.py
- Access AI Chat Interface:
  - Open http://localhost:8501
  - Use the sidebar to ingest data from your PostgreSQL or Dremio tables
- Ingest Your Data (via sidebar):
  Option 1: Upload Documents (NEW!)
  - Click "Choose files to upload"
  - Select PDF, Word, Excel, CSV, or other files
  - Add optional tags/source
  - Click "🚀 Upload & Ingest Documents"

  Option 2: From Database
  - Table: customers
  - Text column: description
  - Metadata: customer_id,name,segment
  - Click "Ingest PostgreSQL"

  A hedged example of driving the same ingestion through the RAG API follows this list.
- Ask Questions (examples):
  - "What are the key trends in our sales data?"
  - "Show me customer segments with highest revenue"
  - "Are there any data quality issues in the orders table?"
  - "Generate a SQL query to find recent high-value customers"
  - "Explain the ETL pipeline for product data"
User Question → Chat UI → RAG API → Query Embedding
↓
Vector Search (Milvus)
↓
Retrieve Context Documents
↓
Build Prompt with Context
↓
Local LLM (Ollama/Llama 3.1)
↓
AI-Generated Answer + Sources
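For intuition, here is a minimal, self-contained sketch of that flow using the same building blocks (sentence-transformers for the 384-dim embeddings, pymilvus for vector search, Ollama's REST API for generation). It is not the platform's actual code: the Milvus collection and field names are assumptions, and in practice the RAG API on port 8002 wraps all of these steps for you.

```python
# rag_sketch.py: conceptual walk-through of the RAG flow above, not the platform's implementation.
# Collection and field names ("documents", "embedding", "text") are assumptions.
import httpx
from pymilvus import Collection, connections
from sentence_transformers import SentenceTransformer

question = "What are our top products?"

# 1. Query embedding (all-MiniLM-L6-v2 produces 384-dimensional vectors)
embedder = SentenceTransformer("all-MiniLM-L6-v2")
query_vec = embedder.encode(question).tolist()

# 2. Vector search in Milvus to retrieve context documents
connections.connect(host="localhost", port="19530")
collection = Collection("documents")   # assumed collection name
collection.load()
hits = collection.search(
    data=[query_vec],
    anns_field="embedding",            # assumed vector field name
    param={"metric_type": "L2", "params": {"nprobe": 10}},
    limit=5,
    output_fields=["text"],            # assumed payload field
)
context = "\n".join(hit.entity.get("text") for hit in hits[0])

# 3. Build a prompt with the retrieved context, then 4. ask the local LLM via Ollama's REST API
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
answer = httpx.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.1", "prompt": prompt, "stream": False},
    timeout=120.0,
).json()["response"]
print(answer)
```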
Service | URL | Purpose |
---|---|---|
AI Chat UI | http://localhost:8501 | Interactive Q&A interface |
RAG API | http://localhost:8002 | REST API for AI queries |
RAG API Docs | http://localhost:8002/docs | Interactive API documentation |
Ollama LLM | http://localhost:11434 | Local LLM server (Llama 3.1) |
Milvus Vector DB | localhost:19530 | Semantic search database |
Embedding Service | http://localhost:8001 | Text-to-vector conversion |
Python Example:
import httpx
# Ask a question
response = httpx.post(
    "http://localhost:8002/query",
    json={
        "question": "What are our top products?",
        "top_k": 5,
        "model": "llama3.1",
    },
    timeout=120.0,  # local LLM generation can exceed httpx's 5-second default
)
result = response.json()
print(f"Answer: {result['answer']}")
print(f"Sources: {len(result['sources'])} documents")
cURL Example:
curl -X POST http://localhost:8002/query \
-H "Content-Type: application/json" \
-d '{
"question": "What trends do you see in customer data?",
"top_k": 5,
"model": "llama3.1",
"temperature": 0.7
}'
# Mistral (faster, good for coding)
docker exec ollama ollama pull mistral
# Phi3 (lightweight, quick responses)
docker exec ollama ollama pull phi3
# CodeLlama (code generation)
docker exec ollama ollama pull codellama
# List available models
docker exec ollama ollama list
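Ollama also exposes a plain REST endpoint on port 11434, so you can confirm a newly pulled model responds before selecting it in the Chat UI or via the RAG API's model parameter. A quick sketch:

```python
import httpx

reply = httpx.post(
    "http://localhost:11434/api/generate",
    json={"model": "mistral", "prompt": "Reply with one short sentence.", "stream": False},
    timeout=120.0,  # the first call may also trigger model loading
)
reply.raise_for_status()
print(reply.json()["response"])
```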
- ✅ 100% Local: No cloud APIs, no data leaves your infrastructure
- ✅ Private: All processing done on-premise
- ✅ No API Costs: No OpenAI/Anthropic bills
- ✅ Semantic Search: Vector database (Milvus) with 384-dim embeddings
- ✅ RAG System: Retrieval Augmented Generation for context-aware answers
- ✅ Multiple Models: Llama 3.1, Mistral, Phi3, CodeLlama
- ✅ Auto-Ingestion: Scheduled data updates from PostgreSQL/Dremio
- ✅ Source Attribution: See which documents the answer came from
For detailed AI services documentation, see:
- AI Services Guide - Complete guide with architecture, configuration, troubleshooting
- Quick Start Guide - Fast AI setup with examples
- Platform Status - All services including AI
Designed for Data Engineers, Data Analysts, Developers, and DevOps.
# Infrastructure Management
make up # Start all services
make down # Stop all services
make restart # Restart services
make status # Check service status
make logs # View service logs
# Data Transformation (dbt)
make dbt-run # Run transformations
make dbt-test # Run quality tests
make dbt-docs # Generate documentation
make dbt-clean # Clean artifacts
# Data Synchronization
make sync # Manual sync Dremio to PostgreSQL
make sync-auto # Auto sync every 5 minutes
# Testing & Quality
make test # Run all tests
make lint # Code quality checks
make format # Format code
# Deployment
make deploy # Complete deployment
make deploy-quick # Quick deployment
Services: 9/9 operational (includes Airbyte)
dbt Tests: 21/21 passing
Dashboards: 3 active
Languages: 18 supported (5.2B+ people coverage)
Documentation: Complete in 18 languages
Status: Production Ready - v1.0
data-platform-iso-opensource/
├── README.md # This file
├── AUTHORS.md # Project creators and contributors
├── CHANGELOG.md # Version history
├── CONTRIBUTING.md # Contribution guidelines
├── CODE_OF_CONDUCT.md # Community guidelines
├── SECURITY.md # Security policies
├── LICENSE # MIT License
│
├── docs/ # Documentation
│ ├── i18n/ # Multilingual docs (18 languages)
│ │ ├── fr/, es/, pt/, cn/, jp/, ru/, ar/
│ │ ├── de/, ko/, hi/, id/, tr/, vi/
│ │ └── it/, nl/, pl/, se/
│ └── diagrams/ # Mermaid diagrams (248+)
│
├── config/ # Configuration
│ └── i18n/ # Internationalization
│ ├── config.json
│ └── data_generator.py
│
├── dbt/ # Data transformations
│ ├── models/ # SQL models
│ ├── tests/ # Quality tests
│ └── dbt_project.yml
│
├── reports/ # Documentation reports
│ ├── phase1/ # Integration reports
│ ├── phase2/ # Data cleaning reports
│ ├── phase3/ # Quality testing reports
│ ├── superset/ # Dashboard guides
│ └── integration/ # Integration guides
│
├── scripts/ # Automation scripts
│ ├── orchestrate_platform.py
│ ├── sync_dremio_realtime.py
│ └── populate_superset.py
│
└── docker-compose.yml # Infrastructure definition
Our vision for the future of Talentys Data Platform with monthly releases:
Focus: OpenMetadata Integration Phase 1
- 🔍 OpenMetadata: Complete metadata catalog, data lineage, data quality
- 📝 Auto-documentation: LLM-generated dataset descriptions, PII detection
Focus: OpenMetadata Phase 2 & Enhanced Chat UI
- 💬 Enhanced Chat UI: Persistent history, export capabilities, bookmarks, themes
- 🏷️ OpenMetadata: Smart tagging, column-level metadata
Focus: Security & Authentication
- 🔐 OAuth2/SSO, RBAC, API security (Jan)
- 📊 Real-time analytics dashboard with alerting (Feb)
- 🎨 UI/UX improvements, user management (Mar)
Focus: Advanced AI & ML
- 🤖 MLOps with MLflow, advanced RAG (Apr)
- 🧠 Multi-model LLM support, prompt engineering (May)
- 📊 Predictive analytics, automated insights (Jun)
Focus: Cloud Native & Kubernetes
- ☁️ Helm charts, Kubernetes operators (Jul)
- 🌐 Multi-cloud support (AWS, Azure, GCP), hybrid cloud (Aug)
- 🔄 GitOps with ArgoCD, OpenTelemetry observability (Sep)
Focus: Enterprise Features
- 🏢 Multi-tenancy, white-labeling (Oct)
- 💼 Enterprise governance, audit logging, data masking (Nov)
- 📱 Mobile app (iOS/Android), complete API (Dec)
Focus: Next-Generation Platform
- 🚀 AI-first platform with natural language to SQL
- 🌊 Real-time streaming with Kafka/Flink
- 🌍 Data Mesh architecture, global scale
📄 Full roadmap (18 languages): English | Français | Español | All languages
We welcome contributions from the community. Please see:
- Add language configuration to config/i18n/config.json
- Create documentation directory: docs/i18n/[language-code]/
- Translate README and guides
- Update main README language table
- Submit pull request
This project is licensed under the MIT License. See LICENSE file for details.
Supported by Talentys | LinkedIn - Data Engineering and Analytics Excellence
Built with enterprise-grade open-source technologies:
Data Platform:
- Airbyte - Data integration platform (300+ connectors)
- Dremio - Data lakehouse platform
- dbt - Data transformation tool
- Apache Superset - Business intelligence platform
- Apache Arrow - Columnar data format
- PostgreSQL - Relational database
- MinIO - Object storage
- Elasticsearch - Search and analytics
AI Services:
- Ollama - Local LLM server
- Llama 3.1 - Meta's open-source LLM (8B parameters)
- Milvus - Vector database for semantic search
- sentence-transformers - Text embedding models
- FastAPI - Modern web framework for APIs
- Streamlit - App framework for ML/AI projects
Author: Mustapha Fonsau
- 🏢 Organization: Talentys | LinkedIn
- 💼 LinkedIn: linkedin.com/in/mustapha-fonsau
- 🐙 GitHub: github.com/Monsau
- 📧 Email: mfonsau@talentys.eu
For technical assistance:
- 📚 Documentation: docs/i18n/
- 🐛 Issue Tracker: GitHub Issues
- 💬 Discussions: GitHub Discussions
Version 1.0.0 | 2025-10-16 | Production Ready
Made with ❤️ by Mustapha Fonsau | Supported by Talentys | LinkedIn