Creates access derivatives from audio and video preservation files, using ffmpeg-python for transcoding. OpenAI's Whisper is used as the speech-to-text tool; transcripts are stored as WebVTT files.
The tool is meant to be used during the digitization/transfer process for legacy media.
- set up a virtual environment
- `pip install -r requirements.txt`
- install FFmpeg
- run `tools/SetupWhisperModels.py` to pre-download Whisper models (optional)
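As an illustration, a pre-download step can be done along these lines. This is a hedged sketch of what `tools/SetupWhisperModels.py` might do, not its actual contents; it assumes the `openai-whisper` package is installed when run for real.

```python
# Hedged sketch (not the repo's actual script) of pre-caching Whisper models.
MODELS = ("tiny", "base", "small", "medium", "large-v2")

def predownload(models=("tiny", "base")) -> list:
    """Download the named Whisper models into the local cache, if possible."""
    try:
        import whisper  # openai-whisper; if absent, nothing to download
    except ImportError:
        return []
    cached = []
    for name in models:
        whisper.load_model(name)  # downloads to ~/.cache/whisper on first use
        cached.append(name)
    return cached
```

Pre-downloading avoids a long pause on the first transcription run.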
Master files should be packaged together in a single directory for each 'object.' For example, a two-sided cassette should have 2 WAV files in a single folder. The tool will place all derivatives in a subdirectory for each object.
This tool is narrowly scoped: it processes only preservation files in the WAVE and MOV formats, and produces access copies only in the MP3, MP4 (H.264), and VTT formats.
Example Structure:
```
root_folder/
├── object_1/
│   ├── audio_file_1.wav
│   ├── audio_file_2.wav
│   └── derivatives/
│       ├── audio_file_1.mp3
│       ├── audio_file_1_caption_eng.vtt
│       ├── audio_file_2.mp3
│       └── audio_file_2_caption_eng.vtt
├── object_2/
│   ├── audio_file_1.wav
│   └── derivatives/
│       ├── audio_file_1.mp3
│       └── audio_file_1_caption_eng.vtt
└── object_3/
    ├── audio_file_1.wav
    └── derivatives/
        ├── audio_file_1.mp3
        └── audio_file_1_caption_eng.vtt
```
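The layout above can be sketched in code. The following is a hypothetical illustration of how masters could be mapped to derivative paths; the function name and internals are assumptions, not the tool's actual implementation.

```python
from pathlib import Path

# Hypothetical sketch: map each in-scope master file to an access-copy path
# inside its object's derivatives/ subdirectory.
TARGETS = {".wav": ".mp3", ".mov": ".mp4"}  # in-scope masters -> access formats

def plan_derivatives(root: str) -> dict:
    """Return {master_path: derivative_path} for every object under root."""
    plan = {}
    for obj in sorted(Path(root).iterdir()):
        if not obj.is_dir():
            continue
        for master in sorted(obj.iterdir()):
            ext = master.suffix.lower()
            if ext in TARGETS:
                plan[master] = obj / "derivatives" / (master.stem + TARGETS[ext])
    return plan
```

Files with out-of-scope extensions are simply skipped, matching the tool's narrow WAVE/MOV scope.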
Whisper and FFmpeg are computationally intensive. The recommended Whisper models, such as large-v2, perform best on GPUs. You can run this tool on a personal device, but select the base or tiny Whisper model and expect longer processing times.
This tool has only been tested using NVIDIA CUDA with PyTorch. If you want GPU acceleration for Whisper, you will need to set up CUDA or DirectML with PyTorch or TensorFlow. GPU acceleration for FFmpeg is possible, but the `convertAV.py` helper would need to be updated.
See also FFmpeg's guidance on hardware acceleration.
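A script can fall back to the CPU gracefully when PyTorch is missing or no GPU is present. This is a small illustrative sketch, assuming the standard `torch.cuda.is_available()` check; the function name is not from the tool itself.

```python
# Hedged sketch: choose a compute device for Whisper, degrading to CPU when
# PyTorch is not installed or CUDA is unavailable.
def pick_device() -> str:
    try:
        import torch  # only present if PyTorch is installed
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"
```

The returned string can be passed to `whisper.load_model(name, device=...)`.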
Specifications for the outputs produced by FFmpeg may be altered in the `convertAV.py` tool. See AMIA's FFmpeg cookbook for suggestions and implementation examples.
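For example, an MP3 access copy could be produced by a command along these lines. This is a hypothetical sketch of the kind of invocation `convertAV.py` might build; the actual codec settings live in that file and can be changed there.

```python
# Hedged sketch (not the tool's actual code) of assembling an ffmpeg command
# line for an MP3 access derivative.
def mp3_command(src: str, dst: str, bitrate: str = "320k") -> list:
    return [
        "ffmpeg",
        "-i", src,              # preservation master (e.g. a WAV file)
        "-c:a", "libmp3lame",   # encode audio with the LAME MP3 encoder
        "-b:a", bitrate,        # bitrate for the access copy
        dst,
    ]
```

The list form can be handed to `subprocess.run()` without shell quoting concerns.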