Centralized metadata framework for defining, deploying, and observing data assets across the entire data lifecycle.

ryankwagner/DataHub

DataHub

DataHub is a modular metadata and DDL generation library that standardizes table definitions across data platforms, starting with Hudi.

⚠️ NOTICE: This repository is currently CLOSED SOURCE and under active development. It is not ready for external use or distribution.

Feature Status

| Module        | Status         | Description                                                                          |
|---------------|----------------|--------------------------------------------------------------------------------------|
| core          | 🟢 Implemented | Foundational interfaces (Table, Schema, Field) and reusable metadata definitions.    |
| hudi          | 🟡 In Progress | Hudi-specific implementations, properties management, and DDL generation logic.      |
| api           | 🔴 Planned     | REST API definitions for metadata management.                                        |
| schema        | 🔴 Planned     | Schema registry integrations and converters.                                         |
| orchestration | 🔴 Planned     | Airflow/Dagster integration patterns.                                                |
| observability | 🔴 Planned     | Data quality and lineage tracking.                                                   |
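
To give a feel for how the core abstractions and DDL generation could fit together, here is a minimal sketch. It is not the library's actual API: the method names on `Table`, `Schema`, and `Field`, the `Simple*` records, and `DdlSketch.createTable` are all hypothetical, and the output is generic SQL rather than Hudi-specific DDL.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical shapes for illustration only; the real core interfaces may differ.
interface Field {
    String name();

    String type(); // SQL type name, e.g. "BIGINT"

    boolean nullable();
}

interface Schema {
    List<Field> fields();
}

interface Table {
    String name();

    Schema schema();
}

// Simple immutable implementations (records require JDK 16+).
record SimpleField(String name, String type, boolean nullable) implements Field {}

record SimpleSchema(List<Field> fields) implements Schema {}

record SimpleTable(String name, Schema schema) implements Table {}

final class DdlSketch {
    private DdlSketch() {}

    // Renders a generic CREATE TABLE statement from the table metadata.
    static String createTable(Table table) {
        String columns = table.schema().fields().stream()
            .map(f -> "  " + f.name() + " " + f.type() + (f.nullable() ? "" : " NOT NULL"))
            .collect(Collectors.joining(",\n"));
        return "CREATE TABLE " + table.name() + " (\n" + columns + "\n)";
    }

    public static void main(String[] args) {
        Table orders = new SimpleTable("orders", new SimpleSchema(List.of(
            new SimpleField("id", "BIGINT", false),
            new SimpleField("note", "VARCHAR", true))));
        System.out.println(createTable(orders));
    }
}
```

Keeping the metadata types free of any platform detail, as in this sketch, is one way a platform module such as hudi could add its own properties and DDL rendering on top of a shared definition.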

Prerequisites

  • JDK 17+
  • Gradle (wrapper provided)
  • Docker & Docker Compose (for local environment)

Local Development

Building the Project

To build the project and run tests:

./gradlew build

Local Trino Environment

This project includes a Docker Compose setup to run a local data platform consisting of:

  • Trino: Distributed SQL query engine.
  • MinIO: S3-compatible object storage.
  • Hive Metastore: Metadata service for Trino/Hudi.
  • Postgres: Backend for Hive Metastore.

To start the environment:

docker-compose up -d
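
Once the containers are up, it can be useful to confirm that Trino has finished starting before running queries. The sketch below polls Trino's `/v1/info` endpoint once and checks the `starting` flag. It assumes the Compose file publishes Trino's HTTP port on `localhost:8080`, which this README does not confirm; the class name `TrinoReadyCheck` is hypothetical, and the JSON check is a plain string match rather than a real parser.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

final class TrinoReadyCheck {
    // Assumption: Trino is reachable on localhost:8080; adjust to match the Compose file.
    static final String DEFAULT_BASE_URL = "http://localhost:8080";

    private TrinoReadyCheck() {}

    // Builds the info endpoint URI, tolerating a trailing slash on the base URL.
    static URI infoUri(String baseUrl) {
        String trimmed = baseUrl.endsWith("/")
            ? baseUrl.substring(0, baseUrl.length() - 1)
            : baseUrl;
        return URI.create(trimmed + "/v1/info");
    }

    // The info response reports "starting": false once the server is ready.
    static boolean isReady(String infoJson) {
        return infoJson.replaceAll("\\s", "").contains("\"starting\":false");
    }

    public static void main(String[] args) throws Exception {
        String baseUrl = args.length > 0 ? args[0] : DEFAULT_BASE_URL;
        HttpClient client = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(5))
            .build();
        HttpRequest request = HttpRequest.newBuilder(infoUri(baseUrl))
            .timeout(Duration.ofSeconds(5))
            .GET()
            .build();
        HttpResponse<String> response =
            client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(isReady(response.body())
            ? "Trino is ready"
            : "Trino is still starting");
    }
}
```

If Trino is mapped to a different host or port, pass the base URL as the first argument.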

Code Style

This project enforces code style using Checkstyle. Violations will cause the build to fail. To run the verification tasks, including Checkstyle, explicitly:

./gradlew check
