Simple implementation of a distributed Speech to Text system utilizing whisper.cpp with client/server architecture


Distributed Batch Speech-to-Text (STT) System

This project provides a distributed batch audio transcription system using whisper.cpp for fast, local speech-to-text. It consists of a client (audio processor/uploader) and a server (task distributor/collector). The system is designed for Linux (Fedora recommended), but can be adapted for Mac and Windows.

How it works

The server has access to a directory of MP3 files and exposes three endpoints. When queried by a client, the first endpoint finds an MP3 file that has no accompanying VTT file. The transcription language is read from the accompanying JSON file, under sql_params/language (a two-letter ISO 639-1 code). The client transcodes the MP3 to WAV and runs inference via whisper.cpp. The client never sees the original filename; it receives only a unique ID, by which the server tracks the job in a .txt file (a simple database). When the client finishes processing, or encounters an error, it sends a POST request to one of the other two endpoints with the job ID, and the server removes the job from the database. The client also appends every completed job to a CSV file with some basic data that can be used to generate per-client statistics.
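The server-side task selection described above can be sketched roughly as follows. This is an illustration, not the actual server code; the helper name and the "en" fallback are assumptions, while the .vtt check and the sql_params/language lookup come from the description above.

```python
import json
from pathlib import Path

def find_pending_task(audio_dir):
    """Return (mp3_path, language) for the first MP3 lacking a .vtt file,
    or None if everything is transcribed. Hypothetical helper — the real
    server's selection logic may differ."""
    for mp3 in sorted(Path(audio_dir).glob("*.mp3")):
        if mp3.with_suffix(".vtt").exists():
            continue  # already transcribed
        language = "en"  # assumed fallback when no JSON is present
        meta = mp3.with_suffix(".json")
        if meta.exists():
            params = json.loads(meta.read_text())
            language = params.get("sql_params", {}).get("language", language)
        return mp3, language
    return None
```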


1. Client Setup

The client downloads audio tasks from the server, transcribes them using whisper.cpp, and uploads the results. It is intended to run on Linux (Fedora), but can be used on Mac and Windows with some manual steps.

Requirements

  • Python 3.8+
  • ffmpeg (must be installed and available in your system PATH)
  • requests Python package (pip install requests)
  • whisper.cpp binary (see below)

Configuration

Copy .env.example to .env and edit the values for your server and credentials:

cp client/.env.example client/.env
# Edit client/.env as needed

The client supports cron-style scheduling. The environment variable CRON can be set to a cron expression denoting a start time. The variable PROCESSING_HOURS is a value in hours denoting how long the client will keep transcribing after each start.
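A minimal .env might look like the following. Only CRON, PROCESSING_HOURS, AUTH_ENABLED, USERNAME, and PASSWORD are named in this README; the remaining variable names are illustrative, so check client/.env.example for the authoritative list.

```shell
# Illustrative values — see client/.env.example for the real variable names
SERVER_URL=http://your-server:8000   # assumed name for the server address
AUTH_ENABLED=true
USERNAME=client1
PASSWORD=changeme
CRON="0 22 * * *"    # start transcribing at 22:00 every day
PROCESSING_HOURS=8   # keep transcribing for 8 hours after each start
```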


A. Linux (Fedora recommended)

Recommended: Use Fedora via Distrobox (if not on Fedora)

If you are not running Fedora, you can use Distrobox to create a Fedora container for a clean build environment.

Automatic Setup (Fedora)

Run the provided script to install dependencies and build whisper.cpp:

cd client/whisper
chmod +x build-linux_fedora.sh
./build-linux_fedora.sh

Follow the prompts to select your backend (default: CPU). The script will build whisper.cpp and download the required model.

The resulting binary will be at:

client/whisper/whisper.cpp/build/bin/whisper-cli

Manual Setup

If you prefer, follow the whisper.cpp instructions to build manually.


B. Mac

  1. Build whisper.cpp from source (see the whisper.cpp repository for more detail):
# Apple Silicon
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
cmake -B build
cmake --build build --config Release

You may need the Xcode command-line tools and CMake:

brew install cmake
  2. Place the binary at client/whisper/whisper.cpp/build/bin/whisper-cli (create the folder if needed).
  3. Ensure ffmpeg is installed (e.g., via Homebrew: brew install ffmpeg).

C. Windows

  1. Download the latest whisper.cpp binary for Windows from whisper.cpp releases.
  2. Place the binary as client/whisper/whisper.cpp/build/bin/whisper-cli (create the folder if needed).
  3. Ensure ffmpeg is installed and available in your system PATH.

Running the Client

After building or downloading the binary and setting up your .env, run:

python client/client.py

The client will poll the server for tasks, process them, and upload results. If authentication is enabled, set AUTH_ENABLED=true and provide USERNAME and PASSWORD in your .env.


D. Docker / Podman (Recommended for Linux)

For a containerized client setup that works anywhere, use Docker or Podman. This provides isolation and reproducible builds across different systems.

Quick Start:

  1. Prepare your environment:

    cd client
    cp .env.example .env
    # Edit .env with your server configuration
  2. Build the Docker image:

    chmod +x build.sh run.sh
    ./build.sh
    • Select your container runtime (docker or podman)
    • Select your backend (cpu, vulkan, cuda, or openvino)
    • For Intel GPUs 13th gen or older with Vulkan, select 'yes' when prompted
  3. Run the container:

    ./run.sh
    • Select your container runtime (docker or podman)
    • Select the backend to match your build
    • The container will start and mount your .env file

What Gets Mounted:

  • processed_uploaded/ – Successfully uploaded VTT files
  • processed_not_uploaded/ – Processed but not yet uploaded files
  • not_processed_failed_report/ – Failed processing reports
  • processed.csv – Processing log (auto-created with headers if missing)
  • .env – Your environment configuration
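Since processed.csv is the client's processing log, basic statistics can be pulled from it with a few lines of Python. The column names below (id, duration_s, status) are assumptions for illustration; check the header row the client actually writes.

```python
import csv
from pathlib import Path

def summarize(csv_path):
    """Aggregate basic statistics from the client's processing log.
    Column names ('id', 'duration_s', 'status') are hypothetical —
    adjust them to match the header the client writes."""
    rows = list(csv.DictReader(Path(csv_path).open()))
    uploaded = [r for r in rows if r.get("status") == "uploaded"]
    total_s = sum(float(r["duration_s"]) for r in uploaded)
    return {"jobs": len(rows), "uploaded": len(uploaded), "audio_seconds": total_s}
```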

Viewing Logs:

# Docker
docker logs -f distributed-batch-stt-client-<backend>

# Podman
podman logs -f distributed-batch-stt-client-<backend>

Stopping/Starting:

# Stop
docker stop distributed-batch-stt-client-<backend>
podman stop distributed-batch-stt-client-<backend>

# Start
docker start distributed-batch-stt-client-<backend>
podman start distributed-batch-stt-client-<backend>

# Remove
docker rm -f distributed-batch-stt-client-<backend>
podman rm -f distributed-batch-stt-client-<backend>

Backend Selection:

  • cpu – Works everywhere, uses OpenBLAS acceleration (slowest)
  • vulkan – AMD/Intel/NVIDIA GPUs, requires /dev/dri device (fastest for compatible GPUs)
  • cuda – NVIDIA GPUs only, requires nvidia-docker or proper GPU passthrough (very fast)
  • openvino – Intel CPUs/GPUs, requires /dev/dri device for GPU acceleration

See DOCKER_README.md for detailed Docker/Podman documentation.


2. Server Setup

The server is a FastAPI app that distributes audio tasks and collects results. It is designed to run in a Docker container for easy deployment and persistence.

Build and Run with Docker

  1. Build the Docker image:

    cd server
    docker build -t whisper-server .
    # or with Podman:
    # podman build -t whisper-server .
  2. Run the container, mapping volumes for persistence:

    docker run -d \
      -p 8000:8000 \
      -v /path/to/logs:/app/logs:Z \
      -v /path/to/inprogress.txt:/app/inprogress.txt:Z \
      -v /mnt/data/video:/mnt/data/video:Z \
      whisper-server
    • Replace /mnt/data/video with the actual path to your MP3 files.
    • Map logs and inprogress.txt to host locations to avoid data loss when rebuilding the container. Podman example:
    podman run -d -p 8000:8000 --replace --restart=always --name=whisper-server -v ./inprogress.txt:/app/inprogress.txt:Z -v ./processed.csv:/app/processed.csv:Z -v /home/shared/video:/mnt/data/video:Z -v ./logs:/app/logs:Z whisper-server
  3. Environment Variables:

    • Copy .env.example to .env and edit as needed:
      cp server/.env.example server/.env
      # Edit server/.env
    • Set AUDIO_DIR to the directory containing your MP3 files.
  4. Access the API:

    • The server listens on port 8000 by default.
    • Endpoints:
      • GET /task – Get a new audio task
      • POST /result – Submit a completed transcription
      • POST /error – Report a failed task
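One client iteration against these three endpoints can be sketched as below. This is a sketch only: the payload field names ('id', 'audio', 'language'), the 204 "no work" signal, and the injected run_whisper callable are all assumptions, not the actual API contract — see client/client.py for the real implementation.

```python
def process_one(base_url, http, run_whisper):
    """One iteration of the client loop (illustrative sketch).

    http        -- an object with requests-style .get/.post methods
    run_whisper -- callable(audio, language) -> VTT text
    Field names and status codes are assumptions; check the server code.
    """
    resp = http.get(f"{base_url}/task")
    if resp.status_code == 204:  # assumed "no pending tasks" signal
        return None
    task = resp.json()
    try:
        vtt = run_whisper(task["audio"], task["language"])
        # Report success: server drops the job ID from its .txt database
        http.post(f"{base_url}/result", data={"id": task["id"], "vtt": vtt})
    except Exception:
        # Report failure so the server can release the job
        http.post(f"{base_url}/error", data={"id": task["id"]})
        raise
    return task["id"]
```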

Authentication

Note: The server does not implement authentication itself. It is recommended to run the server behind a reverse proxy (e.g., Caddy, Nginx) that enforces basic auth or other authentication at the proxy level. Without this, the server will accept arbitrary uploads from anyone who can reach it.


3. Additional Notes

  • ffmpeg must be installed and available in your system PATH for the client to convert audio files.
  • The client and server communicate using HTTP. Ensure network connectivity between them.
  • For GPU acceleration, build whisper.cpp with the appropriate backend (CUDA, Vulkan, OpenVINO). See the build script for options.
  • For troubleshooting, check the logs in the mapped logs directory on the server.
