Text-to-speech data collection with Kafka, Airflow, Spark, and S3
Table of Contents
In this project, we design and build a robust, large-scale, fault-tolerant, highly available Kafka cluster that can be used to post a sentence and receive an audio file. On top of it, we produce a deployable tool that moves text and audio files into and out of a data lake, applies transformations in a distributed manner, and loads the results into a warehouse in a format suitable for training a speech-to-text model.
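To make the post/receive exchange concrete, below is a minimal sketch using the kafka-python client. The broker address, the topic names (text-requests, audio-responses), and the message schema are illustrative assumptions, not part of the project's actual configuration.

```python
# Sketch of the Kafka exchange: post a sentence, then receive the
# reference to the audio file recorded for it. Assumes kafka-python
# is installed and a broker is reachable (assumption: localhost:9092).
import json

from kafka import KafkaConsumer, KafkaProducer

BOOTSTRAP_SERVERS = ["localhost:9092"]  # assumed local test broker

# Post a sentence that a speaker should record (hypothetical topic name).
producer = KafkaProducer(
    bootstrap_servers=BOOTSTRAP_SERVERS,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("text-requests", {"id": "sent-001", "text": "ሰላም ለዓለም"})
producer.flush()

# Receive the audio file reference produced for that sentence
# (hypothetical topic name and payload schema).
consumer = KafkaConsumer(
    "audio-responses",
    bootstrap_servers=BOOTSTRAP_SERVERS,
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)
for message in consumer:
    record = message.value  # e.g. {"id": "sent-001", "s3_key": "audio/sent-001.wav"}
    print(record)
    break
```

In a full deployment, the consumed S3 key would point into the data lake, from which Spark jobs (scheduled by Airflow) pick up the text-audio pairs for transformation and loading into the warehouse.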
There are a number of large text corpora we will use, but for the purpose of testing the backend, you can use the recently released Amharic News Text Classification Dataset (with baseline performance).
Alternative data: ready-made Amharic data collected from different sources is available here.
License: MIT