All In Vault is a comprehensive platform for downloading, transcribing, and analyzing the All In podcast episodes. The system retrieves metadata from YouTube, downloads audio files, performs transcription using Deepgram's Nova-3 model, and provides tools for analyzing the content.
- YouTube Integration: Retrieves podcast episodes and metadata from YouTube
- Audio Processing: Downloads high-quality audio for transcription
- Advanced Transcription: Uses Deepgram's Nova-3 model with speaker diarization
- Metadata Management: Stores and organizes all podcast metadata
- Pipeline Automation: Complete workflow from retrieval to transcription
- Episode Analysis: Distinguishes between full episodes and shorts
- Speaker Identification: Uses heuristics and optional LLM integration to identify speakers
AllInVault features a flexible, stage-based pipeline architecture that provides granular control over the podcast processing workflow:
- Modular Stages: Each processing step is encapsulated in a separate stage
- Flexible Execution: Run the entire pipeline or specific stages
- Episode Targeting: Process all episodes or target specific episodes by ID
- Stage Dependencies: Automatic handling of stage dependencies
- Configurable Parameters: Each stage accepts specific configuration options
- Consistent Interface: Unified command-line interface for all operations
The pipeline consists of these sequential stages:
- Fetch Metadata: Retrieve episode information from YouTube API
- Analyze Episodes: Identify full episodes vs shorts based on duration
- Download Audio: Download audio files for episodes
- Transcribe Audio: Generate transcriptions using Deepgram
- Identify Speakers: Map speakers in transcripts to actual names
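The stage ordering and dependency handling described above can be sketched as a small resolver. This is an illustrative sketch only, not the project's actual implementation; the stage names match the CLI, but the function and table names are assumptions:

```python
# Illustrative sketch of stage-dependency resolution; the real logic
# lives in the pipeline modules and may differ.
DEPENDENCIES = {
    "fetch_metadata": [],
    "analyze_episodes": ["fetch_metadata"],
    "download_audio": ["analyze_episodes"],
    "transcribe_audio": ["download_audio"],
    "identify_speakers": ["transcribe_audio"],
}

def resolve_stages(requested, skip_dependencies=False):
    """Return the requested stages plus any prerequisites, in run order."""
    if skip_dependencies:
        return [s for s in DEPENDENCIES if s in requested]
    needed = set()

    def visit(stage):
        for dep in DEPENDENCIES[stage]:
            visit(dep)
        needed.add(stage)

    for stage in requested:
        visit(stage)
    # DEPENDENCIES is declared in run order, so filtering it yields the plan.
    return [s for s in DEPENDENCIES if s in needed]
```

For example, requesting only `transcribe_audio` without `--skip-dependencies` pulls in the three upstream stages automatically.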
See architecture.md for detailed documentation of the system design.
- Clone this repository:

  ```bash
  git clone https://github.com/yourusername/allinvault.git
  cd allinvault
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Configure environment variables:

  ```bash
  cp .env.example .env
  # Edit .env with your API keys
  ```
- YouTube API Key: Required for fetching episode metadata
- Deepgram API Key: Required for audio transcription
- OpenAI API Key: Optional, used for LLM-based speaker identification
- DeepSeek API Key: Optional alternative for LLM-based speaker identification
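Conceptually, the configuration layer treats the first two keys as required and the LLM keys as optional. A minimal sketch of that validation, assuming hypothetical function and key names (the real `src/utils/config.py` may differ):

```python
import os

def load_config(env=os.environ):
    """Read API keys from the environment; raise if a required key is absent."""
    required = ["YOUTUBE_API_KEY", "DEEPGRAM_API_KEY"]
    missing = [k for k in required if not env.get(k)]
    if missing:
        raise RuntimeError(f"Missing required keys: {', '.join(missing)}")
    return {
        "youtube_api_key": env["YOUTUBE_API_KEY"],
        "deepgram_api_key": env["DEEPGRAM_API_KEY"],
        "openai_api_key": env.get("OPENAI_API_KEY"),      # optional
        "deepseek_api_key": env.get("DEEPSEEK_API_KEY"),  # optional
    }
```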
AllInVault provides a single, unified command-line interface for all operations through the pipeline.py script:
```bash
python pipeline.py [command] [options]
```

Available commands:

- `pipeline` (default): Execute the pipeline or specific stages
- `display`: Display a transcript
- `verify`: Verify transcript metadata and display statistics
The pipeline command (default) processes podcast episodes through the flexible stage-based architecture:
```bash
python pipeline.py [pipeline] [options]
```

```bash
# Process the latest 5 episodes through the complete pipeline
python pipeline.py

# Explicitly use the pipeline command (same as above)
python pipeline.py pipeline

# Process a specific number of episodes
python pipeline.py pipeline --num-episodes 10

# Process specific episodes by video ID
python pipeline.py pipeline --episodes "Wr12BFko-Xo,8UzQ5uf_vik"

# Run only specific stages (comma-separated)
python pipeline.py pipeline --stages fetch_metadata,download_audio

# Run from a specific stage to the end
python pipeline.py pipeline --start-stage download_audio

# Run a range of stages
python pipeline.py pipeline --start-stage download_audio --end-stage transcribe_audio

# Skip automatic dependency resolution
python pipeline.py pipeline --stages transcribe_audio --skip-dependencies
```

Available stages:

- `fetch_metadata`: Retrieve episode information from YouTube API
- `analyze_episodes`: Identify full episodes vs shorts based on duration
- `download_audio`: Download audio files for episodes
- `transcribe_audio`: Generate transcriptions using Deepgram
- `identify_speakers`: Map speakers in transcripts to actual names
```bash
# Limit the number of episodes to fetch
python pipeline.py pipeline --stages fetch_metadata --limit 20
```

Flags:

- `--limit`: Maximum number of episodes to fetch metadata for
```bash
# Set custom duration threshold for full episodes
python pipeline.py pipeline --stages analyze_episodes --min-duration 300
```

Flags:

- `--min-duration`: Minimum duration in seconds for an episode to be considered full (default: 180)
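The full-episode check boils down to parsing the ISO 8601 duration YouTube returns and comparing it against the threshold. A sketch under assumed function names (the project's actual parser may differ):

```python
import re

def parse_iso8601_duration(value: str) -> int:
    """Convert a YouTube ISO 8601 duration (e.g. 'PT1H23M45S') to seconds."""
    match = re.fullmatch(r"PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?", value)
    if not match:
        raise ValueError(f"Unrecognized duration: {value!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds

def is_full_episode(duration: str, min_duration: int = 180) -> bool:
    """Apply the --min-duration threshold (default 180 seconds)."""
    return parse_iso8601_duration(duration) >= min_duration
```

So a 2m10s short is filtered out, while anything at or above the threshold is kept as a full episode.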
```bash
# Download in different format and quality
python pipeline.py pipeline --stages download_audio --audio-format m4a --audio-quality 256

# Download all episodes, not just full episodes
python pipeline.py pipeline --stages download_audio --all-episodes

# Specify custom audio directory
python pipeline.py pipeline --stages download_audio --audio-dir "/path/to/audio/files"
```

Flags:

- `--audio-format`: Audio format to download (e.g., mp3, m4a)
- `--audio-quality`: Audio quality in kbps (e.g., 192, 256)
- `--all-episodes`: Include all episodes for audio download, not just full episodes
- `--audio-dir`: Directory for storing downloaded audio files
```bash
# Customize transcription settings
python pipeline.py pipeline --stages transcribe_audio --model nova-3 --no-diarize --detect-language

# Specify custom directories
python pipeline.py pipeline --stages transcribe_audio --audio-dir "/path/to/audio" --transcripts-dir "/path/to/transcripts"
```

Flags:

- `--model`: Deepgram model to use for transcription (default: nova-3)
- `--no-diarize`: Disable speaker diarization during transcription
- `--no-smart-format`: Disable smart formatting in transcripts
- `--detect-language`: Enable language detection during transcription
- `--audio-dir`: Directory containing audio files to transcribe
- `--transcripts-dir`: Directory for storing transcriptions
```bash
# Configure speaker identification
python pipeline.py pipeline --stages identify_speakers --llm-provider openai --force-reidentify

# Disable LLM for speaker identification, use heuristics only
python pipeline.py pipeline --stages identify_speakers --no-llm
```

Flags:

- `--no-llm`: Disable LLM for speaker identification and use heuristics only
- `--llm-provider`: LLM provider to use for speaker identification (options: openai, deepseek)
- `--force-reidentify`: Force re-identification of speakers even if already identified
- `--transcripts-dir`: Directory containing transcripts to process
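When `--no-llm` is passed, identification falls back to heuristics over the diarized transcript. The sketch below is hypothetical, assuming a self-introduction heuristic and a hard-coded host list; the project's actual heuristics are not shown here:

```python
# Hypothetical heuristic fallback: scan each diarized speaker's text for
# self-introductions mentioning a known host. Illustrative only.
KNOWN_HOSTS = ["Chamath", "Jason", "Sacks", "Friedberg"]

def identify_speakers_heuristically(utterances):
    """Map diarization tags (e.g. 'Speaker 0') to likely host names.

    `utterances` is a list of (tag, text) pairs in transcript order.
    """
    mapping = {}
    for tag, text in utterances:
        if tag in mapping:
            continue  # first confident match wins
        lowered = text.lower()
        for name in KNOWN_HOSTS:
            if f"i'm {name.lower()}" in lowered or f"this is {name.lower()}" in lowered:
                mapping[tag] = name
                break
    return mapping
```

An LLM provider can then fill in whatever tags the heuristics leave unmapped.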
The display command shows transcript content with various formatting options:
```bash
python pipeline.py display [options]
```

```bash
# Display transcript in text format
python pipeline.py display --episode VIDEO_ID

# Display transcript in JSON format
python pipeline.py display --episode VIDEO_ID --format json

# Hide speaker information
python pipeline.py display --episode VIDEO_ID --no-speakers

# Show timestamps
python pipeline.py display --episode VIDEO_ID --show-timestamps
```

Flags:

- `--episode`: Video ID of the episode to display (required)
- `--format`: Display format (options: text, json)
- `--no-speakers`: Hide speaker information
- `--show-timestamps`: Show timestamps
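The effect of the `--no-speakers` and `--show-timestamps` toggles on text output can be sketched as follows. The segment shape and exact layout are assumptions for illustration, not the tool's actual output format:

```python
def format_transcript(segments, show_speakers=True, show_timestamps=False):
    """Render (start_seconds, speaker, text) segments as display-style text."""
    lines = []
    for start, speaker, text in segments:
        parts = []
        if show_timestamps:
            minutes, seconds = divmod(int(start), 60)
            parts.append(f"[{minutes:02d}:{seconds:02d}]")
        if show_speakers:
            parts.append(f"{speaker}:")
        parts.append(text)
        lines.append(" ".join(parts))
    return "\n".join(lines)
```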
The verify command checks transcript metadata and displays statistics:
```bash
python pipeline.py verify [options]
```

```bash
# Verify all transcripts and display statistics
python pipeline.py verify

# Show only high-level statistics
python pipeline.py verify --stats-only

# Verify without updating metadata files
python pipeline.py verify --no-update
```

Flags:

- `--stats-only`: Only display statistics without details of missing items
- `--no-update`: Don't update episode metadata files when inconsistencies are found
For a complete list of all available options:

```bash
# Show main help
python pipeline.py --help

# Show help for a specific command
python pipeline.py pipeline --help
python pipeline.py display --help
python pipeline.py verify --help
```
- **YouTube Service** (`src/services/youtube_service.py`)
  - Fetches podcast episode metadata from YouTube API
  - Handles API pagination for retrieving multiple episodes
  - Converts YouTube API responses to internal data models
- **Downloader Service** (`src/services/downloader_service.py`)
  - Downloads audio files using yt-dlp
  - Manages audio format and quality settings
  - Handles file naming and storage conventions
- **Episode Analyzer Service** (`src/services/episode_analyzer.py`)
  - Categorizes episodes as full episodes or shorts based on duration
  - Parses ISO 8601 durations from YouTube metadata
  - Updates episode metadata with duration information
- **Transcription Service** (`src/services/transcription_service.py`)
  - Handles audio transcription using Deepgram API
  - Manages speaker diarization and transcript formatting
  - Processes transcripts and updates episode metadata
- **Batch Transcriber Service** (`src/services/batch_transcriber.py`)
  - Coordinates transcription of multiple episodes
  - Handles parallel processing of transcription tasks
  - Manages transcript storage and organization
- **Podcast Pipeline Service** (`src/services/podcast_pipeline.py`)
  - Orchestrates the entire workflow from download to transcription
  - Coordinates all other services in sequence
  - Provides unified interface for the complete process
- **Speaker Identification Service** (`src/services/speaker_identification_service.py`)
  - Identifies speakers in transcripts using LLM integration
  - Maps anonymous speaker tags to actual speaker names
  - Optionally works with multiple LLM providers for improved accuracy
- **LLM Service** (`src/services/llm_service.py`)
  - Provides LLM integration for enhanced speaker identification
  - Supports multiple LLM providers (OpenAI, DeepSeek)
  - Extracts speaker information from transcript context
- **Episode Repository** (`src/repositories/episode_repository.py`)
  - Manages storage and retrieval of episode metadata
  - Provides CRUD operations for episode data
  - Ensures data consistency across pipeline stages
- **Podcast Episode Model** (`src/models/podcast_episode.py`)
  - Data model representing a podcast episode
  - Contains fields for both YouTube metadata and transcript information
  - Implements serialization/deserialization logic
- **Configuration Utilities** (`src/utils/config.py`)
  - Loads environment variables and configuration settings
  - Provides application-wide configuration access
  - Manages default values and path resolution
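The Podcast Episode Model listed above, with its serialization logic, could look roughly like this. The field names are assumptions for illustration; the real `src/models/podcast_episode.py` likely carries many more fields:

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class PodcastEpisode:
    """Illustrative episode model mixing YouTube metadata and transcript info."""
    video_id: str
    title: str
    duration_seconds: int = 0
    is_full_episode: bool = False
    transcript_path: Optional[str] = None

    def to_dict(self) -> dict:
        """Serialize for JSON storage under data/json/."""
        return asdict(self)

    @classmethod
    def from_dict(cls, data: dict) -> "PodcastEpisode":
        """Rebuild an episode from its stored dictionary form."""
        return cls(**data)
```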
Each CLI module provides a user-friendly interface to interact with the corresponding service:
- **Process Podcast CLI** (`src/cli/process_podcast_cmd.py`)
  - Runs the complete pipeline from metadata fetch to transcription
  - Command: `python process_podcast.py`
- **Download Podcast CLI** (`src/cli/download_podcast_cmd.py`)
  - Handles downloading episode metadata and audio files
  - Command: `python download_podcast.py`
- **Analyze Episodes CLI** (`src/cli/analyze_episodes_cmd.py`)
  - Analyzes episodes to identify full episodes vs shorts
  - Command: `python analyze_episodes.py`
- **Transcribe Audio CLI** (`src/cli/transcribe_audio_cmd.py`)
  - Manages audio transcription with various options
  - Command: `python transcribe_audio.py`
- **Transcribe Full Episodes CLI** (`src/cli/transcribe_full_episodes_cmd.py`)
  - Batch transcribes all full episodes
  - Command: `python transcribe_full_episodes.py`
- **Display Transcript CLI** (`src/cli/display_transcript_cmd.py`)
  - Displays transcripts in various formats
  - Command: `python display_transcript.py`
- **Verify Transcripts CLI** (`src/cli/verify_transcripts.py`)
  - Verifies transcript metadata and integrity
  - Command: `python verify_transcripts.py`
- **Identify Speakers CLI** (`src/cli/identify_speakers_cmd.py`)
  - Identifies speakers in transcripts
  - Command: `python identify_speakers.py`
```
allinvault/
├── data/                # Data storage
│   ├── audio/           # Downloaded audio files
│   ├── json/            # Metadata storage
│   └── transcripts/     # Transcript storage
├── src/                 # Source code
│   ├── cli/             # Command-line interfaces
│   ├── models/          # Data models
│   ├── repositories/    # Data access layer
│   ├── services/        # Business logic
│   └── utils/           # Utilities
├── *.py                 # Entry point scripts
├── .env                 # Environment variables (create from .env.example)
├── .env.example         # Example environment variables
├── requirements.txt     # Python dependencies
└── architecture.md      # Detailed architecture documentation
```
The project follows a modular, service-oriented architecture adhering to SOLID principles. See architecture.md for detailed documentation.
MIT
Contributions are welcome! Please feel free to submit a Pull Request.