# StemProver: The Stem Improver

## Status Update

This repository contains the ongoing development of StemProver. While the core architecture and goals are defined here, much of the training code and the latest model versions are still under active development locally. This documentation reflects the current architectural direction.
## Overview

StemProver is a novel system for reducing artifacts in audio stems created by source separation tools such as Spleeter or Demucs. When these tools isolate an instrument or vocal track, they often leave behind unwanted sounds: muffled bleed from other instruments, robotic phasing, or high-frequency distortion.
StemProver cleans these stems using a state-of-the-art, two-stage deep learning pipeline. It reframes the audio artifact problem as a pair of specialized tasks: first as an image restoration problem, and second as a waveform synthesis problem. This approach allows us to leverage powerful, pre-trained models from both the image and audio domains to achieve high-fidelity results.
## The Two-Stage Architecture

Instead of attempting to manipulate complex audio data (both magnitude and phase) in a single, monolithic model, StemProver separates the problem into two distinct stages, allowing each model to excel at its specialized task.
### Stage 1: Spectrogram Restoration (img2img)

This stage focuses solely on cleaning the magnitude spectrogram, a visual representation of the audio's frequency content. The artifact-laden stem spectrogram is treated like a blurry or damaged photograph.
- **Input:** A magnitude spectrogram generated from a stem that contains separation artifacts.
- **Model:** A ControlNet-LoRA architecture built upon a large-scale, pre-trained image diffusion model (e.g., Stable Diffusion).
- **Process:** The model performs an image-to-image translation. It uses the artifacted spectrogram as a structural guide (the control) to in-paint missing frequency data and remove the visual signatures of common artifacts.
- **Output:** A clean, high-fidelity magnitude spectrogram that looks as if it came from a perfectly isolated source.
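Stage 1's input can be produced with a short-time Fourier transform. A minimal sketch of that preprocessing step, written with plain NumPy for self-containedness (the project itself uses librosa; the function name and parameters here are illustrative):

```python
import numpy as np

def magnitude_spectrogram(audio, n_fft=1024, hop=256):
    """Hann-windowed STFT magnitude; returns (n_fft // 2 + 1, n_frames).

    Illustrative stand-in for librosa.stft followed by np.abs.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack([
        audio[i * hop : i * hop + n_fft] * window
        for i in range(n_frames)
    ])
    return np.abs(np.fft.rfft(frames, axis=1)).T

# Example: one second of a 440 Hz tone at 22.05 kHz
sr = 22050
t = np.arange(sr) / sr
spec = magnitude_spectrogram(np.sin(2 * np.pi * 440 * t))
```

The resulting 2D array of non-negative magnitudes is what gets treated as an image in Stage 1; the phase of the STFT is deliberately discarded, since Stage 2 regenerates it.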
### Stage 2: Waveform Synthesis

This stage takes the clean visual "blueprint" from Stage 1 and converts it back into pristine audio. It solves the critical problem of phase by generating a new, coherent phase for the restored magnitude data.
- **Input:** The restored magnitude spectrogram from Stage 1.
- **Model:** A GAN-based neural vocoder, such as HiFi-GAN, designed specifically for this task.
- **Process:** The vocoder, trained on large amounts of clean audio, has learned the phase relationships that make audio sound natural. It takes the frequency information from the spectrogram and synthesizes a complete, realistic waveform with a statistically plausible phase.
- **Output:** A high-quality, artifact-free WAV file.
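The phase problem Stage 2 addresses can be seen in a few lines: inverting a magnitude spectrum alone, with the phase discarded, does not recover the waveform. A minimal single-frame NumPy illustration (this is the motivation for the vocoder, not the vocoder itself):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1024)        # one frame of "audio"

X = np.fft.rfft(x)                   # complex spectrum: magnitude and phase
mag, phase = np.abs(X), np.angle(X)

# Keeping both magnitude and phase reconstructs the frame exactly.
exact = np.fft.irfft(mag * np.exp(1j * phase))

# Dropping phase (zero-phase inversion) yields a very different signal.
zero_phase = np.fft.irfft(mag)
```

Because the magnitude alone underdetermines the waveform, Stage 2 must supply a plausible phase; the vocoder does this far more convincingly than classical iterative methods such as Griffin-Lim.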
## Technical Advantages

- **Separation of Concerns:** The pipeline is robust because each model is a specialist. The diffusion model handles the 2D spatial problem of image restoration, while the vocoder handles the 1D temporal problem of waveform generation.
- **Leverages Pre-trained Power:** The approach harnesses the enormous research investment behind large-scale image models for the difficult task of artifact removal.
- **Solves the Phase Problem Elegantly:** The most significant challenge in audio synthesis is phase. Instead of the notoriously difficult and often unstable task of preserving or "unwrapping" the original phase, we generate a new, perceptually natural phase using a model built for exactly that purpose.
- **Modular and Upgradable:** Each stage is independent. A better vocoder or a new ControlNet architecture can be swapped into the pipeline at any time without redesigning the entire system.
## Implementation and Training

**Dataset Generation:** Training requires a large dataset of paired audio stems: `("clean_stem.wav", "artifact_stem.wav")`. The project includes tools for generating synthetic audio datasets with precise control over waveforms, allowing the creation of exact ground-truth pairs. For real-world training, artifacted stems are generated by running clean source material through various open-source separation models.
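The synthetic pairing idea can be sketched as follows. This is an illustrative toy, not the project's dataset tooling: the "artifact" is simulated bleed plus noise rather than the output of a real separation model, and `make_pair` is a hypothetical name:

```python
import numpy as np

def make_pair(sr=22050, dur=1.0, seed=0):
    """Create a synthetic (clean, artifacted) stem pair.

    The degradation mimics two common separation artifacts:
    leakage (bleed) from another source and broadband noise.
    """
    rng = np.random.default_rng(seed)
    t = np.arange(int(sr * dur)) / sr
    clean = 0.5 * np.sin(2 * np.pi * 220 * t)    # target stem
    bleed = 0.1 * np.sin(2 * np.pi * 330 * t)    # leakage from another source
    hiss = 0.02 * rng.standard_normal(t.shape)   # separation noise
    return clean, clean + bleed + hiss

clean, artifacted = make_pair()
```

Because the clean signal is generated rather than recorded, the ground truth is exact by construction, which is the main appeal of synthetic pairs.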
**Stage 1 Training:** The ControlNet-LoRA is trained on pairs of spectrogram images generated from the audio dataset. The model learns to map each artifacted spectrogram to its clean counterpart.
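Before a magnitude spectrogram can serve as a training image for the diffusion model, it has to be mapped into an image-like value range. A sketch of one plausible preprocessing step, assuming peak normalization and an illustrative −80 dB floor (neither is a value fixed by the project):

```python
import numpy as np

def spectrogram_to_image(mag, floor_db=-80.0):
    """Map a magnitude spectrogram onto [0, 1] for image-domain training.

    Normalize to the peak, compress to decibels, clip at an assumed
    floor, then rescale so the floor maps to 0 and the peak to 1.
    """
    mag = mag / max(float(mag.max()), 1e-8)
    db = 20.0 * np.log10(np.maximum(mag, 1e-8))
    db = np.clip(db, floor_db, 0.0)
    return (db - floor_db) / -floor_db

img = spectrogram_to_image(
    np.abs(np.random.default_rng(0).standard_normal((513, 83)))
)
```

Log compression matters here: raw magnitudes span several orders of magnitude, so a linear mapping would leave most of the harmonic detail near zero brightness.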
**Stage 2 Training:** A publicly available, pre-trained HiFi-GAN model can be used as-is. For optimal performance, the vocoder can be fine-tuned on the clean half of the audio dataset to specialize it for the target sound sources (vocals, drums, etc.).
**Tech Stack:** PyTorch for modeling, the Diffusers library for the ControlNet stage, and a pre-existing HiFi-GAN implementation for the synthesis stage. Librosa handles all audio pre-processing and spectrogram generation.
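The modular pipeline described above can be sketched as two functions sharing the real models' interfaces. Both bodies are placeholders: Stage 1 passes the magnitude through unchanged instead of running the ControlNet-LoRA, and Stage 2 uses zero-phase inversion with overlap-add instead of HiFi-GAN, purely to exercise the data flow end to end:

```python
import numpy as np

def restore_spectrogram(artifact_spec):
    """Stage 1 placeholder: the real system runs a ControlNet-LoRA
    diffusion model; here the magnitude is returned unchanged."""
    return artifact_spec

def synthesize_waveform(spec, hop=256):
    """Stage 2 placeholder: the real system runs a HiFi-GAN vocoder;
    here each frame is inverted with zero phase and overlap-added,
    which shows the interface but not the audio quality."""
    frames = np.fft.irfft(spec.T, axis=1)            # (n_frames, n_fft)
    n_fft = frames.shape[1]
    out = np.zeros(hop * (len(frames) - 1) + n_fft)
    for i, frame in enumerate(frames):
        out[i * hop : i * hop + n_fft] += frame
    return out

spec = np.abs(np.random.default_rng(0).standard_normal((513, 40)))
audio = synthesize_waveform(restore_spectrogram(spec))
```

Because each stage is a standalone function of a spectrogram, either body can later be replaced by the real model without touching the other, which is exactly the modularity argument made above.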