DataHub is a modular metadata and DDL generation library designed to standardize table definitions across various data platforms (Hudi, etc.).
⚠️ NOTICE: This repository is currently CLOSED SOURCE and under active development. It is not ready for external use or distribution.
| Module | Status | Description |
|---|---|---|
| `core` | 🟢 Implemented | Foundational interfaces (Table, Schema, Field) and reusable metadata definitions. |
| `hudi` | 🟡 In Progress | Hudi-specific implementations, properties management, and DDL generation logic. |
| `api` | 🔴 Planned | REST API definitions for metadata management. |
| `schema` | 🔴 Planned | Schema registry integrations and converters. |
| `orchestration` | 🔴 Planned | Airflow/Dagster integration patterns. |
| `observability` | 🔴 Planned | Data quality and lineage tracking. |
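
To give a rough feel for how the `core` abstractions and DDL generation could fit together, here is a minimal sketch. It is not the library's actual API: the `DdlSketch` class, the record shapes, the `nullable` flag, and the `createTableDdl` method are all hypothetical, records stand in for the real interfaces, and the output is generic SQL rather than Hudi-specific DDL.

```java
import java.util.List;
import java.util.stream.Collectors;

public class DdlSketch {
    // Hypothetical stand-ins for the core module's Field, Schema, and Table
    // abstractions. The real interfaces may look quite different.
    public record Field(String name, String type, boolean nullable) {}
    public record Schema(List<Field> fields) {}
    public record Table(String database, String name, Schema schema) {}

    // Renders a generic CREATE TABLE statement from a table definition.
    // Platform-specific properties (for example Hudi table options) are omitted.
    public static String createTableDdl(Table table) {
        String columns = table.schema().fields().stream()
            .map(f -> "  " + f.name() + " " + f.type() + (f.nullable() ? "" : " NOT NULL"))
            .collect(Collectors.joining(",\n"));
        return "CREATE TABLE " + table.database() + "." + table.name()
            + " (\n" + columns + "\n)";
    }

    public static void main(String[] args) {
        // Illustrative table definition; names and types are made up.
        Table orders = new Table("sales", "orders", new Schema(List.of(
            new Field("order_id", "BIGINT", false),
            new Field("note", "STRING", true))));
        System.out.println(createTableDdl(orders));
    }
}
```

Keeping the table definition as plain immutable data and the rendering in a separate function is one way a platform module such as `hudi` could supply its own generator over the same definitions.
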
- JDK 17+
- Gradle (wrapper provided)
- Docker & Docker Compose (for local environment)
To build the project and run tests:

    ./gradlew build

This project includes a Docker Compose setup to run a local data platform consisting of:
- Trino: Distributed SQL query engine.
- MinIO: S3-compatible object storage.
- Hive Metastore: Metadata service for Trino/Hudi.
- Postgres: Backend for Hive Metastore.
To start the environment:

    docker-compose up -d

- Trino UI: http://localhost:8080 (User: `admin`)
- MinIO Console: http://localhost:9001 (User: `minio`, Pass: `minio123`)

This project enforces code style using Checkstyle. Violations will cause the build to fail. To run checks explicitly:

    ./gradlew check