Genomic BigData Warehousing with Apache Spark and LakeHouse Architecture
Updated Jan 19, 2023 - Jupyter Notebook
ragsearch is a Python library designed for building a Retrieval-Augmented Generation (RAG) application that enables natural language querying over both structured and unstructured data. This tool leverages embedding models and a vector database (FAISS or ChromaDB) to provide an efficient and scalable search engine.
Builds a local Spark standalone cluster on Docker with MinIO integration
Quick look into the Iceberg tables that underpin an Iceberg data lake
Quick look into the Delta tables that underpin Delta Lake
Apache Iceberg vs Delta Lake — same Medallion pipeline built on two different table formats
This project implements my master’s thesis on building a scalable, ACID-compliant data lakehouse architecture for IoT and industrial workloads in an AWS-native environment.
Quick look into the Hudi tables that underpin a Hudi data lake
Open Table Format is a category of open standards for organizing and managing data in data lakehouses.