Simple implementation of a distributed Speech to Text system utilizing whisper.cpp with client/server architecture


Distributed Batch Speech-to-Text (STT) System

This project provides a distributed batch audio transcription system using whisper.cpp for fast, local speech-to-text. It consists of a client (audio processor/uploader) and a server (task distributor/collector). The system is designed for Linux (Fedora recommended), but can be adapted for Mac and Windows.

How it works

The server has access to a directory of MP3 files and exposes three endpoints. When queried by a client, the first endpoint finds an MP3 file that has no accompanying VTT file. The transcription language is read from the accompanying JSON file, under sql_params/language (a two-letter ISO 639-1 code). The client transcodes the MP3 to WAV and runs inference via whisper.cpp. The client never sees the original filename; it receives only a unique ID, by which the server tracks the job in a .txt file (a simple database). When the client finishes processing, or encounters an error, it sends a POST request to one of the other two endpoints with the job ID, and the server removes the job from the database. The client also appends every completed job to a CSV file with some basic data that can be used to generate per-client statistics.
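The server-side task selection described above can be sketched roughly as follows. This is an illustration, not the actual server code; the helper name and the "en" fallback are assumptions, while the .vtt check and the sql_params/language lookup come from the description above.

```python
import json
from pathlib import Path

def find_pending_task(audio_dir):
    """Return (mp3_path, language) for the first MP3 lacking a .vtt file,
    or None if everything is transcribed. Hypothetical helper — the real
    server's selection logic may differ."""
    for mp3 in sorted(Path(audio_dir).glob("*.mp3")):
        if mp3.with_suffix(".vtt").exists():
            continue  # already transcribed
        language = "en"  # assumed fallback when no JSON is present
        meta = mp3.with_suffix(".json")
        if meta.exists():
            params = json.loads(meta.read_text())
            language = params.get("sql_params", {}).get("language", language)
        return mp3, language
    return None
```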


1. Client Setup

The client downloads audio tasks from the server, transcribes them using whisper.cpp, and uploads the results. It is intended to run on Linux (Fedora), but can be used on Mac and Windows with some manual steps.

Requirements

  • Python 3.8+
  • ffmpeg (must be installed and available in your system PATH)
  • requests Python package (pip install requests)
  • whisper.cpp binary (see below)

Configuration

Copy .env.example to .env and edit the values for your server and credentials:

cp client/.env.example client/.env
# Edit client/.env as needed

The client supports cron-style scheduling. The environment variable CRON can be set to a cron expression denoting a start time. The variable PROCESSING_HOURS is a value in hours denoting how long the client will keep transcribing after each start.
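A minimal .env might look like the following. Only CRON, PROCESSING_HOURS, AUTH_ENABLED, USERNAME, and PASSWORD are named in this README; the remaining variable names are illustrative, so check client/.env.example for the authoritative list.

```shell
# Illustrative values — see client/.env.example for the real variable names
SERVER_URL=http://your-server:8000   # assumed name for the server address
AUTH_ENABLED=true
USERNAME=client1
PASSWORD=changeme
CRON="0 22 * * *"    # start transcribing at 22:00 every day
PROCESSING_HOURS=8   # keep transcribing for 8 hours after each start
```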


A. Linux (Fedora recommended)

Recommended: Use Fedora via Distrobox (if not on Fedora)

If you are not running Fedora, you can use Distrobox to create a Fedora container for a clean build environment.

Automatic Setup (Fedora)

Run the provided script to install dependencies and build whisper.cpp:

cd client/whisper
chmod +x build-linux_fedora.sh
./build-linux_fedora.sh

Follow the prompts to select your backend (default: CPU). The script will build whisper.cpp and download the required model.

The resulting binary will be at:

client/whisper/whisper.cpp/build/bin/whisper-cli

Manual Setup

If you prefer, follow the whisper.cpp instructions to build manually.


B. Mac

  1. Build whisper.cpp from source (see the whisper.cpp repository for more detail):
# Apple Silicon
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
cmake -B build
cmake --build build --config Release

You may need the Xcode command-line tools and CMake:

brew install cmake
  2. Place the binary at client/whisper/whisper.cpp/build/bin/whisper-cli (create the folder if needed).
  3. Ensure ffmpeg is installed (e.g., via Homebrew: brew install ffmpeg).

C. Windows

  1. Download the latest whisper.cpp binary for Windows from whisper.cpp releases.
  2. Place the binary as client/whisper/whisper.cpp/build/bin/whisper-cli (create the folder if needed).
  3. Ensure ffmpeg is installed and available in your system PATH.

Running the Client

After building or downloading the binary and setting up your .env, run:

python client/client.py

The client will poll the server for tasks, process them, and upload results. If authentication is enabled, set AUTH_ENABLED=true and provide USERNAME and PASSWORD in your .env.


D. Docker / Podman (Recommended for Linux)

For a containerized client setup that works anywhere, use Docker or Podman. This provides isolation and reproducible builds across different systems.

Quick Start:

  1. Prepare your environment:

    cd client
    cp .env.example .env
    # Edit .env with your server configuration
  2. Build the Docker image:

    chmod +x build.sh run.sh
    ./build.sh
    • Select your container runtime (docker or podman)
    • Select your backend (cpu, vulkan, cuda, or openvino)
    • For Intel GPUs 13th gen or older with Vulkan, select 'yes' when prompted
  3. Run the container:

    ./run.sh
    • Select your container runtime (docker or podman)
    • Select the backend to match your build
    • The container will start and mount your .env file

What Gets Mounted:

  • processed_uploaded/ – Successfully uploaded VTT files
  • processed_not_uploaded/ – Processed but not yet uploaded files
  • not_processed_failed_report/ – Failed processing reports
  • processed.csv – Processing log (auto-created with headers if missing)
  • .env – Your environment configuration
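Since processed.csv is the client's processing log, basic statistics can be pulled from it with a few lines of Python. The column names below (id, duration_s, status) are assumptions for illustration; check the header row the client actually writes.

```python
import csv
from pathlib import Path

def summarize(csv_path):
    """Aggregate basic statistics from the client's processing log.
    Column names ('id', 'duration_s', 'status') are hypothetical —
    adjust them to match the header the client writes."""
    rows = list(csv.DictReader(Path(csv_path).open()))
    uploaded = [r for r in rows if r.get("status") == "uploaded"]
    total_s = sum(float(r["duration_s"]) for r in uploaded)
    return {"jobs": len(rows), "uploaded": len(uploaded), "audio_seconds": total_s}
```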

Viewing Logs:

# Docker
docker logs -f distributed-batch-stt-client-<backend>

# Podman
podman logs -f distributed-batch-stt-client-<backend>

Stopping/Starting:

# Stop
docker stop distributed-batch-stt-client-<backend>
podman stop distributed-batch-stt-client-<backend>

# Start
docker start distributed-batch-stt-client-<backend>
podman start distributed-batch-stt-client-<backend>

# Remove
docker rm -f distributed-batch-stt-client-<backend>
podman rm -f distributed-batch-stt-client-<backend>

Backend Selection:

  • cpu – Works everywhere, uses OpenBLAS acceleration (slowest)
  • vulkan – AMD/Intel/NVIDIA GPUs, requires /dev/dri device (fastest for compatible GPUs)
  • cuda – NVIDIA GPUs only, requires nvidia-docker or proper GPU passthrough (very fast)
  • openvino – Intel CPUs/GPUs, requires /dev/dri device for GPU acceleration

See DOCKER_README.md for detailed Docker/Podman documentation.


2. Server Setup

The server is a FastAPI app that distributes audio tasks and collects results. It is designed to run in a Docker container for easy deployment and persistence.

Build and Run with Docker

  1. Build the Docker image:

    cd server
    docker build -t whisper-server .
    # or with Podman:
    # podman build -t whisper-server .
  2. Run the container, mapping volumes for persistence:

    docker run -d \
      -p 8000:8000 \
      -v /path/to/logs:/app/logs:Z \
      -v /path/to/inprogress.txt:/app/inprogress.txt:Z \
      -v /mnt/data/video:/mnt/data/video:Z \
      whisper-server
    • Replace /mnt/data/video with the actual path to your MP3 files.
    • Map logs and inprogress.txt to host locations to avoid data loss when rebuilding the container. Podman example:
    podman run -d -p 8000:8000 --replace --restart=always --name=whisper-server -v ./inprogress.txt:/app/inprogress.txt:Z -v ./processed.csv:/app/processed.csv:Z -v /home/shared/video:/mnt/data/video:Z -v ./logs:/app/logs:Z whisper-server
  3. Environment Variables:

    • Copy .env.example to .env and edit as needed:
      cp server/.env.example server/.env
      # Edit server/.env
    • Set AUDIO_DIR to the directory containing your MP3 files.
  4. Access the API:

    • The server listens on port 8000 by default.
    • Endpoints:
      • GET /task – Get a new audio task
      • POST /result – Submit a completed transcription
      • POST /error – Report a failed task
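One client iteration against these three endpoints can be sketched as below. This is a sketch only: the payload field names ('id', 'audio', 'language'), the 204 "no work" signal, and the injected run_whisper callable are all assumptions, not the actual API contract — see client/client.py for the real implementation.

```python
def process_one(base_url, http, run_whisper):
    """One iteration of the client loop (illustrative sketch).

    http        -- an object with requests-style .get/.post methods
    run_whisper -- callable(audio, language) -> VTT text
    Field names and status codes are assumptions; check the server code.
    """
    resp = http.get(f"{base_url}/task")
    if resp.status_code == 204:  # assumed "no pending tasks" signal
        return None
    task = resp.json()
    try:
        vtt = run_whisper(task["audio"], task["language"])
        # Report success: server drops the job ID from its .txt database
        http.post(f"{base_url}/result", data={"id": task["id"], "vtt": vtt})
    except Exception:
        # Report failure so the server can release the job
        http.post(f"{base_url}/error", data={"id": task["id"]})
        raise
    return task["id"]
```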

Authentication

Note: The server does not implement authentication itself. It is recommended to run the server behind a reverse proxy (e.g., Caddy, Nginx) that enforces basic auth or other authentication at the proxy level. Without this, the server will accept arbitrary uploads from anyone who can reach it.


3. Additional Notes

  • ffmpeg must be installed and available in your system PATH for the client to convert audio files.
  • The client and server communicate using HTTP. Ensure network connectivity between them.
  • For GPU acceleration, build whisper.cpp with the appropriate backend (CUDA, Vulkan, OpenVINO). See the build script for options.
  • For troubleshooting, check the logs in the mapped logs directory on the server.
