
🚀 Data Platform - ISO Opensource

Enterprise Data Lakehouse Solution with AI Capabilities


Created by: Mustapha Fonsau | GitHub

Supported by Talentys Data
Supported by Talentys | LinkedIn - Data Engineering & Analytics Excellence

📖 Main documentation in English. Translations available in 17 additional languages below.


🌍 Available Languages

🇬🇧 English (You are here) | 🇫🇷 Français | 🇪🇸 Español | 🇵🇹 Português | 🇨🇳 中文 | 🇯🇵 日本語 | 🇷🇺 Русский | 🇸🇦 العربية | 🇩🇪 Deutsch | 🇰🇷 한국어 | 🇮🇳 हिन्दी | 🇮🇩 Indonesia | 🇹🇷 Türkçe | 🇻🇳 Tiếng Việt | 🇮🇹 Italiano | 🇳🇱 Nederlands | 🇵🇱 Polski | 🇸🇪 Svenska


Overview

AI-Ready professional data platform combining Airbyte, Dremio, dbt, Apache Superset, and Local LLM (Ollama) for enterprise-grade data integration, transformation, quality assurance, business intelligence, and AI-powered insights. Built with multilingual support for global teams.

graph TB
    A[Data Sources] --> B[Airbyte ETL]
    B --> C[Dremio Lakehouse]
    C --> D[dbt Transformations]
    D --> E[Apache Superset]
    E --> F[Business Insights]
    
    C --> G[Vector DB<br/>Milvus]
    D --> G
    G --> H[RAG System]
    I[Local LLM<br/>Ollama] --> H
    H --> J[AI Chat UI]
    J --> K[AI-Powered<br/>Insights]
    
    style B fill:#615EFF,color:#fff,stroke:#333,stroke-width:2px
    style C fill:#f5f5f5,stroke:#333,stroke-width:2px
    style D fill:#e8e8e8,stroke:#333,stroke-width:2px
    style E fill:#d8d8d8,stroke:#333,stroke-width:2px
    style G fill:#FF6B6B,color:#fff,stroke:#333,stroke-width:2px
    style I fill:#4ECDC4,color:#fff,stroke:#333,stroke-width:2px
    style H fill:#95E1D3,stroke:#333,stroke-width:2px
    style J fill:#AA96DA,color:#fff,stroke:#333,stroke-width:2px

Key Features

Data Platform:

  • Data integration with Airbyte 1.8.0 (300+ connectors)
  • Data lakehouse architecture with Dremio 26.0
  • Automated transformations with dbt 1.10+
  • Business intelligence with Apache Superset 3.0
  • Comprehensive data quality testing (21 automated tests)
  • Real-time synchronization via Arrow Flight
  • Multilingual documentation (18 languages)

AI Capabilities (NEW):

  • 🤖 Local LLM server with Ollama (Llama 3.1, Mistral, Phi)
  • 🧠 Vector database with Milvus for semantic search
  • 📚 RAG (Retrieval Augmented Generation) system
  • 💬 Interactive Chat UI for querying data with natural language
  • 📄 Document upload support (PDF, Word, Excel, CSV, JSON, TXT, Markdown)
  • 📦 Automatic S3/MinIO storage for all uploaded documents
  • 🔄 Automatic data ingestion from PostgreSQL/Dremio to vector DB
  • 🔒 100% on-premise - no cloud dependencies, complete data privacy

Quick Start

Prerequisites

  • Docker 20.10+ and Docker Compose 2.0+
  • Python 3.11 or higher
  • Minimum 8 GB RAM (16 GB recommended for AI services)
  • 30 GB available disk space (includes LLM models)
  • Optional: NVIDIA GPU for faster LLM inference

One-Command Deployment

Use the orchestrate_platform.py script for automatic setup:

# Full deployment (Data Platform + AI Services)
python orchestrate_platform.py

# Windows PowerShell
$env:PYTHONIOENCODING="utf-8"
python -u orchestrate_platform.py

# Skip AI services if not needed
python orchestrate_platform.py --skip-ai

# Skip infrastructure (if already running)
python orchestrate_platform.py --skip-infrastructure

What it does:

  • ✅ Validates prerequisites
  • ✅ Starts all Docker services
  • ✅ Deploys AI services (Ollama LLM, Milvus Vector DB, RAG API)
  • ✅ Configures Airbyte, Dremio, dbt
  • ✅ Runs data transformations
  • ✅ Creates Superset dashboards
  • ✅ Provides deployment summary with service URLs
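Once the script finishes, the reported service URLs can be sanity-checked from Python. A minimal sketch (the URLs come from the Access Services tables in this README; the `check` helper is illustrative and not part of the platform):

```python
import urllib.error
import urllib.request

# Service URLs as listed in the "Access Services" section
SERVICES = {
    "Airbyte": "http://localhost:8000",
    "Dremio": "http://localhost:9047",
    "Superset": "http://localhost:8088",
    "RAG API": "http://localhost:8002",
    "Ollama": "http://localhost:11434",
}

def check(url: str, timeout: float = 3.0) -> bool:
    """Return True if the service answers with any HTTP response at all."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return True
    except urllib.error.HTTPError:
        return True   # got an HTTP status (e.g. 401/404) -> service is up
    except (urllib.error.URLError, OSError):
        return False  # connection refused / timeout -> service is down

if __name__ == "__main__":
    for name, url in SERVICES.items():
        print(f"{name:10s} {'UP' if check(url) else 'DOWN'}")
```

Any HTTP status counts as "up" because login-protected UIs (Superset, Dremio) may answer with 401 or a redirect rather than 200.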

Manual Installation

# Clone repository
git clone https://github.com/Monsau/data-platform-iso-opensource.git
cd data-platform-iso-opensource

# Install dependencies
pip install -r requirements.txt

# Start infrastructure (Data Platform + AI Services)
docker-compose -f docker-compose.yml -f docker-compose-airbyte-stable.yml -f docker-compose-ai.yml up -d

# Or just data platform (no AI)
docker-compose -f docker-compose.yml -f docker-compose-airbyte-stable.yml up -d

# Or use make commands
make up

# Verify installation
make status

# Run quality tests
make dbt-test
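For the non-HTTP services (PostgreSQL, MinIO, Milvus) a raw TCP probe is enough to confirm the containers came up. A small sketch using only the standard library (ports taken from the component tables below; adjust them if you changed the compose files):

```python
import socket

# TCP ports from the "System Components" tables in this README
PORTS = {
    "PostgreSQL": 5432,
    "MinIO": 9000,
    "Milvus": 19530,
    "Ollama": 11434,
}

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """True if something is accepting TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    for name, port in PORTS.items():
        state = "listening" if port_open("localhost", port) else "not reachable"
        print(f"{name:12s} :{port:<6d} {state}")
```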

Access Services

Data Platform:

| Service | URL | Credentials |
|---|---|---|
| Airbyte | http://localhost:8000 | airbyte / password |
| Dremio | http://localhost:9047 | admin / admin123 |
| Superset | http://localhost:8088 | admin / admin |
| MinIO Console | http://localhost:9001 | minioadmin / minioadmin123 |
| PostgreSQL | localhost:5432 | postgres / postgres123 |

AI Services (NEW):

| Service | URL | Description |
|---|---|---|
| AI Chat UI | http://localhost:8501 | Chat with your data using natural language |
| RAG API | http://localhost:8002 | REST API for AI queries |
| RAG API Docs | http://localhost:8002/docs | Interactive API documentation |
| Ollama LLM | http://localhost:11434 | Local LLM server |
| Milvus Vector DB | localhost:19530 | Vector database for embeddings |
| Embedding Service | http://localhost:8001 | Text-to-vector conversion |

Architecture

System Components

Data Platform

| Component | Version | Port | Description |
|---|---|---|---|
| Airbyte | 1.8.0 | 8000, 8001 | Data integration platform (300+ connectors) |
| Dremio | 26.0 | 9047, 32010 | Data lakehouse platform |
| dbt | 1.10+ | - | Data transformation tool |
| Superset | 3.0.0 | 8088 | Business intelligence platform |
| PostgreSQL | 15 | 5432 | Transactional database |
| MinIO | Latest | 9000, 9001 | S3-compatible object storage |
| Elasticsearch | 7.17.0 | 9200 | Search and analytics engine |
| MySQL | 8.0 | 3307 | OpenMetadata database |

AI Services (NEW)

| Component | Version | Port | Description |
|---|---|---|---|
| Ollama | Latest | 11434 | Local LLM server (Llama 3.1 - 8B parameters) |
| Milvus | 2.3.3 | 19530 | Vector database for semantic search |
| RAG API | 1.0 | 8002 | RAG orchestration & query API (FastAPI) |
| Embedding Service | 1.0 | 8001 | Text-to-vector conversion (all-MiniLM-L6-v2) |
| AI Chat UI | 1.0 | 8501 | Natural language query interface (Streamlit) |
| Data Ingestion | 1.0 | - | Scheduled data loading service |

Architecture Diagrams


Multilingual Support

This project provides complete documentation in 18 languages, covering 5.2B+ people (70% of global population):

| Language | Documentation | Data Generation | Native Speakers |
|---|---|---|---|
| 🇬🇧 English | README.md | --language en | 1.5B |
| 🇫🇷 Français | docs/i18n/fr/ | --language fr | 280M |
| 🇪🇸 Español | docs/i18n/es/ | --language es | 559M |
| 🇵🇹 Português | docs/i18n/pt/ | --language pt | 264M |
| 🇸🇦 العربية | docs/i18n/ar/ | --language ar | 422M |
| 🇨🇳 中文 | docs/i18n/cn/ | --language cn | 1.3B |
| 🇯🇵 日本語 | docs/i18n/jp/ | --language jp | 125M |
| 🇷🇺 Русский | docs/i18n/ru/ | --language ru | 258M |
| 🇩🇪 Deutsch | docs/i18n/de/ | --language de | 134M |
| 🇰🇷 한국어 | docs/i18n/ko/ | --language ko | 81M |
| 🇮🇳 हिन्दी | docs/i18n/hi/ | --language hi | 602M |
| 🇮🇩 Indonesia | docs/i18n/id/ | --language id | 199M |
| 🇹🇷 Türkçe | docs/i18n/tr/ | --language tr | 88M |
| 🇻🇳 Tiếng Việt | docs/i18n/vi/ | --language vi | 85M |
| 🇮🇹 Italiano | docs/i18n/it/ | --language it | 85M |
| 🇳🇱 Nederlands | docs/i18n/nl/ | --language nl | 25M |
| 🇵🇱 Polski | docs/i18n/pl/ | --language pl | 45M |
| 🇸🇪 Svenska | docs/i18n/se/ | --language se | 13M |

Generate Multilingual Test Data

# Generate French customer data (CSV format)
python config/i18n/data_generator.py --language fr --records 1000 --format csv

# Generate Spanish product data (JSON format)
python config/i18n/data_generator.py --language es --records 500 --format json

# Generate Chinese user data (Parquet format)
python config/i18n/data_generator.py --language cn --records 2000 --format parquet

Configuration: config/i18n/config.json


🤖 AI-Powered Data Insights

The platform includes a complete AI/LLM stack for natural language data querying and insights.

Quick Start with AI

  1. Deploy Platform (includes AI services):

    python orchestrate_platform.py
  2. Access AI Chat Interface: open http://localhost:8501 in your browser.

  3. Ingest Your Data (via sidebar):

    Option 1: Upload Documents (NEW!)
    - Click "Choose files to upload"
    - Select PDF, Word, Excel, CSV, or other files
    - Add optional tags/source
    - Click "🚀 Upload & Ingest Documents"
    
    Option 2: From Database
    Table: customers
    Text column: description
    Metadata: customer_id,name,segment
    → Click "Ingest PostgreSQL"
    
  4. Ask Questions (examples):

    • "What are the key trends in our sales data?"
    • "Show me customer segments with highest revenue"
    • "Are there any data quality issues in the orders table?"
    • "Generate a SQL query to find recent high-value customers"
    • "Explain the ETL pipeline for product data"

AI Architecture

User Question → Chat UI → RAG API → Query Embedding
                                  ↓
                          Vector Search (Milvus)
                                  ↓
                          Retrieve Context Documents
                                  ↓
                          Build Prompt with Context
                                  ↓
                          Local LLM (Ollama/Llama 3.1)
                                  ↓
                          AI-Generated Answer + Sources
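The flow above can be sketched end-to-end with toy stand-ins. Only the pipeline shape matches the platform: the hash-based `embed` replaces the real embedding service (all-MiniLM-L6-v2, 384 dimensions) and the brute-force `search` replaces Milvus, which does the same cosine ranking at scale with ANN indexes:

```python
import math

def embed(text: str, dim: int = 8) -> list[float]:
    """Toy embedding: deterministic bag-of-words hashing, L2-normalized."""
    vec = [0.0] * dim
    for tok in text.lower().split():
        vec[hash(tok.strip("?.,!")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def search(query_vec: list[float], store, top_k: int = 3) -> list[str]:
    """Brute-force cosine search over (doc, vector) pairs."""
    scored = [(sum(q * d for q, d in zip(query_vec, vec)), doc)
              for doc, vec in store]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]

def build_prompt(question: str, context_docs: list[str]) -> str:
    """Assemble the context-augmented prompt handed to the LLM (Ollama)."""
    context = "\n".join(f"- {d}" for d in context_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# Pipeline: question -> embed -> vector search -> prompt -> LLM
docs = ["Q3 sales grew 12% in EMEA",
        "Churn is highest in the SMB segment",
        "Top product by revenue is the analytics add-on"]
store = [(d, embed(d)) for d in docs]
question = "Which segment has the highest churn?"
prompt = build_prompt(question, search(embed(question), store, top_k=2))
```

The final `prompt` is what gets sent to Ollama; the model's completion plus the retrieved documents form the "AI-Generated Answer + Sources" step.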

AI Services Available

| Service | URL | Purpose |
|---|---|---|
| AI Chat UI | http://localhost:8501 | Interactive Q&A interface |
| RAG API | http://localhost:8002 | REST API for AI queries |
| RAG API Docs | http://localhost:8002/docs | Interactive API documentation |
| Ollama LLM | http://localhost:11434 | Local LLM server (Llama 3.1) |
| Milvus Vector DB | localhost:19530 | Semantic search database |
| Embedding Service | http://localhost:8001 | Text-to-vector conversion |

Programmatic Access

Python Example:

import httpx

# Ask a question
response = httpx.post(
    "http://localhost:8002/query",
    json={
        "question": "What are our top products?",
        "top_k": 5,
        "model": "llama3.1"
    }
)

result = response.json()
print(f"Answer: {result['answer']}")
print(f"Sources: {len(result['sources'])} documents")

cURL Example:

curl -X POST http://localhost:8002/query \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What trends do you see in customer data?",
    "top_k": 5,
    "model": "llama3.1",
    "temperature": 0.7
  }'

Download Additional LLM Models

# Mistral (faster, good for coding)
docker exec ollama ollama pull mistral

# Phi3 (lightweight, quick responses)
docker exec ollama ollama pull phi3

# CodeLlama (code generation)
docker exec ollama ollama pull codellama

# List available models
docker exec ollama ollama list
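The same listing is available programmatically: Ollama exposes a REST API on port 11434, and the `/api/tags` route used below follows Ollama's published API, though you should verify the response shape against your installed version:

```python
import json
import urllib.request

def parse_tags(payload: dict) -> list[str]:
    """Extract model names from an Ollama /api/tags response."""
    return [m["name"] for m in payload.get("models", [])]

def list_ollama_models(base_url: str = "http://localhost:11434") -> list[str]:
    """Names of locally pulled models, via Ollama's REST API."""
    with urllib.request.urlopen(base_url + "/api/tags") as resp:
        return parse_tags(json.load(resp))

if __name__ == "__main__":
    try:
        print(list_ollama_models())
    except OSError:
        print("Ollama is not reachable on localhost:11434")
```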

AI Features

  • 100% Local: No cloud APIs, no data leaves your infrastructure
  • Private: All processing done on-premise
  • No API Costs: No OpenAI/Anthropic bills
  • Semantic Search: Vector database (Milvus) with 384-dim embeddings
  • RAG System: Retrieval Augmented Generation for context-aware answers
  • Multiple Models: Llama 3.1, Mistral, Phi3, CodeLlama
  • Auto-Ingestion: Scheduled data updates from PostgreSQL/Dremio
  • Source Attribution: See which documents the answer came from

Comprehensive Guide

For detailed AI services documentation, see:


Documentation

For Different Roles

Data Engineers

Data Analysts

Developers

DevOps


Common Commands

# Infrastructure Management
make up              # Start all services
make down            # Stop all services
make restart         # Restart services
make status          # Check service status
make logs            # View service logs

# Data Transformation (dbt)
make dbt-run         # Run transformations
make dbt-test        # Run quality tests
make dbt-docs        # Generate documentation
make dbt-clean       # Clean artifacts

# Data Synchronization
make sync            # Manual sync Dremio to PostgreSQL
make sync-auto       # Auto sync every 5 minutes

# Testing & Quality
make test            # Run all tests
make lint            # Code quality checks
make format          # Format code

# Deployment
make deploy          # Complete deployment
make deploy-quick    # Quick deployment

Project Status

Services: 9/9 operational (includes Airbyte)
dbt Tests: 21/21 passing
Dashboards: 3 active
Languages: 18 supported (5.2B+ people coverage)
Documentation: Complete in 18 languages
Status: Production Ready - v1.0

Project Structure

data-platform-iso-opensource/
├── README.md                       # This file
├── AUTHORS.md                      # Project creators and contributors
├── CHANGELOG.md                    # Version history
├── CONTRIBUTING.md                 # Contribution guidelines
├── CODE_OF_CONDUCT.md              # Community guidelines
├── SECURITY.md                     # Security policies
├── LICENSE                         # MIT License
│
├── docs/                           # Documentation
│   ├── i18n/                       # Multilingual docs (18 languages)
│   │   ├── fr/, es/, pt/, cn/, jp/, ru/, ar/
│   │   ├── de/, ko/, hi/, id/, tr/, vi/
│   │   └── it/, nl/, pl/, se/
│   └── diagrams/                   # Mermaid diagrams (248+)
│
├── config/                         # Configuration
│   └── i18n/                       # Internationalization
│       ├── config.json
│       └── data_generator.py
│
├── dbt/                            # Data transformations
│   ├── models/                     # SQL models
│   ├── tests/                      # Quality tests
│   └── dbt_project.yml
│
├── reports/                        # Documentation reports
│   ├── phase1/                     # Integration reports
│   ├── phase2/                     # Data cleaning reports
│   ├── phase3/                     # Quality testing reports
│   ├── superset/                   # Dashboard guides
│   └── integration/                # Integration guides
│
├── scripts/                        # Automation scripts
│   ├── orchestrate_platform.py
│   ├── sync_dremio_realtime.py
│   └── populate_superset.py
│
└── docker-compose.yml              # Infrastructure definition

🗺️ Roadmap

Our vision for the future of the Talentys Data Platform, delivered through monthly releases:

📦 v1.2.0 - November 2025 (Next Release)

Focus: OpenMetadata Integration Phase 1

  • 🔍 OpenMetadata: Complete metadata catalog, data lineage, data quality
  • 📝 Auto-documentation: LLM-generated dataset descriptions, PII detection

📦 v1.2.1 - December 2025

Focus: OpenMetadata Phase 2 & Enhanced Chat UI

  • 💬 Enhanced Chat UI: Persistent history, export capabilities, bookmarks, themes
  • OpenMetadata: Smart tagging, column-level metadata

📦 v1.3.x - January-March 2026

Focus: Security & Authentication

  • 🔐 OAuth2/SSO, RBAC, API security (Jan)
  • 📊 Real-time analytics dashboard with alerting (Feb)
  • 🎨 UI/UX improvements, user management (Mar)

📦 v1.4.x - April-June 2026

Focus: Advanced AI & ML

  • 🤖 MLOps with MLflow, advanced RAG (Apr)
  • 🧠 Multi-model LLM support, prompt engineering (May)
  • 📊 Predictive analytics, automated insights (Jun)

📦 v1.5.x - July-September 2026

Focus: Cloud Native & Kubernetes

  • ☁️ Helm charts, Kubernetes operators (Jul)
  • 🌐 Multi-cloud support (AWS, Azure, GCP), hybrid cloud (Aug)
  • 🔄 GitOps with ArgoCD, OpenTelemetry observability (Sep)

📦 v1.6.x - October-December 2026

Focus: Enterprise Features

  • 🏢 Multi-tenancy, white-labeling (Oct)
  • 💼 Enterprise governance, audit logging, data masking (Nov)
  • 📱 Mobile app (iOS/Android), complete API (Dec)

📦 v2.0.0 - 2027

Focus: Next-Generation Platform

  • 🚀 AI-first platform with natural language to SQL
  • 🌊 Real-time streaming with Kafka/Flink
  • 🌍 Data Mesh architecture, global scale

📄 Full roadmap (18 languages): English | Français | Español | All languages


Contributing

We welcome contributions from the community. Please see:

Adding a New Language

  1. Add language configuration to config/i18n/config.json
  2. Create documentation directory: docs/i18n/[language-code]/
  3. Translate README and guides
  4. Update main README language table
  5. Submit pull request
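Step 1 can be scripted. The exact schema of config/i18n/config.json is not shown in this README, so the `languages` key and entry fields below are assumptions; mirror the existing entries in your copy of the file:

```python
import json
from pathlib import Path

CONFIG = Path("config/i18n/config.json")

def add_language(config: dict, code: str, name: str, locale: str) -> dict:
    """Insert a new language entry; field names are illustrative guesses,
    not the confirmed schema of config/i18n/config.json."""
    entry = {"name": name, "locale": locale, "docs_dir": f"docs/i18n/{code}/"}
    config.setdefault("languages", {})[code] = entry
    return config

if __name__ == "__main__" and CONFIG.exists():
    cfg = json.loads(CONFIG.read_text(encoding="utf-8"))
    add_language(cfg, "el", "Ελληνικά", "el_GR")
    CONFIG.write_text(json.dumps(cfg, ensure_ascii=False, indent=2),
                      encoding="utf-8")
```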

License

This project is licensed under the MIT License. See LICENSE file for details.


Acknowledgments

Supported by Talentys | LinkedIn - Data Engineering and Analytics Excellence

Built with enterprise-grade open-source technologies:

Data Platform:

AI Services:


📧 Contact

Author: Mustapha Fonsau

Support

For technical assistance:


Version 1.0.0 | 2025-10-16 | Production Ready

Made with ❤️ by Mustapha Fonsau | Supported by Talentys | LinkedIn