A speech-to-text data collection pipeline using Apache Kafka, Apache Spark, Airflow, and an S3 bucket
This week, 10 Academy is your client. Recognizing the value of large datasets for speech-to-text systems, seeing the opportunity in the many text corpora available for both languages, and understanding that complex data engineering skills are valuable to your profile for employers, this week's task is simple: design and build a robust, large-scale, fault-tolerant, highly available Kafka cluster that can be used to post a sentence and receive an audio file.
By the end of this project, you should produce a tool that can be deployed to post and receive text and audio files from and into a data lake, apply transformations in a distributed manner, and load the results into a warehouse in a format suitable for training a speech-to-text model.
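As a rough illustration of the post-a-sentence / receive-an-audio-file flow, here is a minimal sketch using the kafka-python client. The broker address and the topic names (`text-corpus`, `audio-files`) are placeholders for this example, not names defined by the project.

```python
import json
from kafka import KafkaProducer, KafkaConsumer

BROKER = "localhost:9092"  # placeholder broker address

# Publish a sentence for speakers to read.
producer = KafkaProducer(
    bootstrap_servers=BROKER,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("text-corpus", {"sentence_id": 1, "text": "an example sentence"})
producer.flush()

# Listen for the audio recordings that come back.
consumer = KafkaConsumer(
    "audio-files",
    bootstrap_servers=BROKER,
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:
    # Each message is expected to reference an audio file stored in the data lake.
    print(message.value)
```

In the full pipeline, the consumed messages would point at audio objects in the S3 data lake, which Spark then transforms before the warehouse load.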
The repository contains a number of files, including Python scripts, Jupyter notebooks, PDFs, and text files. Their structure, with a brief explanation of each, is listed below.
The purpose of this week's challenge is to build a data engineering pipeline that allows recording millions of Amharic and Swahili speakers reading digital texts on in-app and web platforms.
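To make the orchestration side concrete, the sketch below shows a minimal Airflow DAG that could tie the stages together. The DAG id, task names, and callables are hypothetical placeholders, not code from this repository.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callables standing in for the real pipeline stages.
def publish_sentences():
    """Post a batch of corpus sentences to the Kafka text topic."""

def transform_audio():
    """Run the distributed (Spark) transformation over new recordings."""

def load_to_warehouse():
    """Load the transformed audio/text pairs into the warehouse."""

with DAG(
    dag_id="speech_data_collection",  # hypothetical DAG id
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    publish = PythonOperator(task_id="publish_sentences", python_callable=publish_sentences)
    transform = PythonOperator(task_id="transform_audio", python_callable=transform_audio)
    load = PythonOperator(task_id="load_to_warehouse", python_callable=load_to_warehouse)

    # Run the stages in order: publish, then transform, then load.
    publish >> transform >> load
```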
There are a number of large text corpora we will use, but for the purpose of testing the backend development, you can use the recently released Amharic news text classification dataset (which ships with baseline performance figures):
IsraelAbebe/An-Amharic-News-Text-classification-Dataset
Alternatively, ready-made Amharic data collected from different sources is available here.
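For a quick look at the news dataset while testing the backend, something like the following works; the CSV filename and path are assumptions about where you have downloaded the dataset locally, not paths defined by this repo.

```python
import pandas as pd

# Hypothetical local path: adjust to wherever you saved the dataset's CSV.
df = pd.read_csv("data/amharic_news.csv")

print(df.shape)   # number of articles and columns
print(df.head())  # inspect the first few rows before publishing them to Kafka
```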
Both the front end and the back end can be run in Docker containers.
1. Clone the repo:
   `git clone https://github.com/GrpHu/speech-to-text-data-collection`
2. Change into the repo directory:
   `cd speech-to-text-data-collection`
3. Start the Docker containers:
   `docker-compose up -d`
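Once the containers are up, a quick way to check that the broker is reachable is a one-off producer like the sketch below. It assumes the compose file exposes Kafka on `localhost:9092`; the topic name is a throwaway placeholder.

```python
from kafka import KafkaProducer

# Assumes docker-compose exposes the Kafka broker on localhost:9092.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("smoke-test", b"hello from the pipeline")
producer.flush()
print("Message sent: the Kafka cluster is reachable.")
```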
- [EDA.ipynb]: a Jupyter notebook for exploratory data analysis
- the folder containing unit tests for the components in the scripts
- the folder containing log files (created automatically once logging starts, if it doesn't already exist)