This project developed and implemented a Data Lake architecture to support an AI model that generates image captions. The architecture is designed to efficiently ingest, process, and centrally store large volumes of image and text data.

Narius2030/DataLake-Solution-IMCP

Overall Architecture

image

Detailed Architecture

image

Storage Structure in Data Lake:

image

Overall Data Pipeline

image

Practical Data Pipeline

At the Bronze layer:

  • The Bronze layer is split into 3 DAGs that collect data from the sources
  • Each DAG collects raw data from Parquet files and user files (including images and metadata) and loads it from the source into the MongoDB and MinIO aggregate stores
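
As a rough illustration of the Bronze step, the sketch below shows how a single ingestion task might split each raw record into a metadata document and an image object. It is not the repository's code: plain dictionaries stand in for MongoDB and MinIO, and the function names, record fields (`image_bytes`, `caption`), and the `bronze/<source>/` key layout are all hypothetical.

```python
import hashlib
from typing import Dict, Iterable, Tuple


def split_record(record: dict, source: str) -> Tuple[dict, str, bytes]:
    """Split one raw record into a metadata document and an image object.

    The field names and key layout are assumptions made for this sketch.
    """
    image_bytes = record["image_bytes"]
    # A content hash gives a stable id, so re-running the task is idempotent.
    digest = hashlib.sha256(image_bytes).hexdigest()
    object_key = f"bronze/{source}/{digest}"
    metadata = {
        "_id": digest,
        "source": source,
        "caption": record.get("caption"),  # kept raw at the Bronze layer
        "object_key": object_key,
    }
    return metadata, object_key, image_bytes


def ingest(
    records: Iterable[dict],
    source: str,
    metadata_store: Dict[str, dict],  # stand-in for a MongoDB collection
    object_store: Dict[str, bytes],   # stand-in for a MinIO bucket
) -> int:
    """Load raw records into both stores and return how many were written."""
    written = 0
    for record in records:
        metadata, key, blob = split_record(record, source)
        if metadata["_id"] in metadata_store:
            continue  # already ingested, skip the duplicate
        object_store[key] = blob
        metadata_store[metadata["_id"]] = metadata
        written += 1
    return written
```

In a real deployment each of the 3 DAGs would presumably wrap a step like `ingest` for its own source and write through the actual MongoDB and MinIO clients instead of dictionaries.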

image

image

image

At the Silver and Gold layers:

  • The Silver layer refines the raw metadata from Bronze, producing the refined metadata for the Catalog layer of the Data Lake
  • The Gold layer extracts image features from the sources and saves them in MinIO
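
The two layers above can be sketched in a few lines, under stated assumptions. The source does not say which cleaning rules or which feature model the project uses, so the whitespace and lowercase normalization, the `gold/features/` key prefix, and the `extractor` callable below are placeholders chosen for illustration, with dictionaries again standing in for MongoDB and MinIO.

```python
import re
from typing import Callable, Dict, Iterable, List


def refine_metadata(raw_docs: Iterable[dict]) -> List[dict]:
    """Silver step: build catalog entries from raw Bronze metadata.

    Assumed rules: drop documents without a caption, collapse whitespace,
    and lowercase the caption text.
    """
    refined = []
    for doc in raw_docs:
        caption = doc.get("caption")
        if not caption or not caption.strip():
            continue  # unusable for caption training, so leave it out
        clean = re.sub(r"\s+", " ", caption).strip().lower()
        refined.append(
            {"_id": doc["_id"], "object_key": doc["object_key"], "caption": clean}
        )
    return refined


def extract_features(
    catalog: Iterable[dict],
    object_store: Dict[str, bytes],            # stand-in for a MinIO bucket
    extractor: Callable[[bytes], List[float]],  # placeholder for a real model
) -> Dict[str, List[float]]:
    """Gold step: compute one feature vector per catalogued image."""
    feature_store = {}
    for entry in catalog:
        blob = object_store[entry["object_key"]]
        feature_store[f"gold/features/{entry['_id']}"] = extractor(blob)
    return feature_store
```

Passing the extractor in as a callable keeps the sketch independent of any particular vision model; a real pipeline would plug in its image encoder there and write the resulting vectors to MinIO.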

image
