One tool. One purpose. Rock solid.
Transform PowerPoint presentations into LLM-optimized markdown. Built for technical trainers who need dead-simple reliability.
Does one thing and does it very, very well.
# Clone and setup
git clone https://github.com/timothywarner-org/pptx-shredder.git
cd pptx-shredder
pip install -r requirements.txt
# Drop your PPTX files in input/ folder, then:
python shred.py
# ✅ Done! Your markdown is in output/
# Global access (works from any directory)
claude mcp add pptx-shredder npx -y @timothywarner/pptx-shredder-mcp
# Now use in Claude Desktop from any project:
# "Use shred_pptx to process my presentation.pptx"
NEW: Intelligent Extraction with DeepSeek LLM
flowchart TD
    A[Drop PPTX in input/] --> B[Run python shred.py]
    B --> C[PowerPoint Object Model]
    C --> D[Structured Content Extract]
    D --> E[DeepSeek LLM Analysis]
    E --> F[Learning Objectives Detection]
    E --> G[Module Boundary Recognition]
    E --> H[Activity Type Classification]
    E --> I[Time Estimation]
    F --> J[Intelligent Chunking]
    G --> J
    H --> J
    I --> J
    J --> K[Rich YAML Metadata]
    K --> L[High-Quality Markdown]
    L --> M[Save to output/]
    N[Claude Desktop<br/>Any Directory] --> O[npx @timothywarner/<br/>pptx-shredder-mcp]
    O --> P[MCP Server]
    P --> B
    style A fill:#e1f5fe
    style M fill:#e8f5e8
    style N fill:#fff3e0
    style O fill:#e8f0ff
    style E fill:#ff9800
    style J fill:#f3e5f5
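The "PowerPoint Object Model" and "Structured Content Extract" steps in the diagram can be pictured with a short sketch. It assumes the third-party `python-pptx` package is what reads the deck, which this README does not confirm; `slide_to_record` and `extract_records` are hypothetical names, not functions from this repository.

```python
from types import SimpleNamespace


def slide_to_record(index, slide):
    """Collect visible text and speaker notes from one slide-like object."""
    texts = []
    for shape in slide.shapes:
        # Only shapes with a text frame carry text (pictures, for example, do not).
        if getattr(shape, "has_text_frame", False):
            text = shape.text_frame.text.strip()
            if text:
                texts.append(text)
    notes = ""
    if getattr(slide, "has_notes_slide", False):
        frame = slide.notes_slide.notes_text_frame
        if frame is not None:  # a notes slide can exist without a text placeholder
            notes = frame.text.strip()
    return {"slide": index, "text": texts, "notes": notes}


def extract_records(path):
    """Open a .pptx file with python-pptx (assumed) and return one record per slide."""
    from pptx import Presentation  # third-party: pip install python-pptx

    prs = Presentation(path)
    return [slide_to_record(i, s) for i, s in enumerate(prs.slides, start=1)]


# Stand-in objects so the helper can be tried without a real deck.
demo_slide = SimpleNamespace(
    shapes=[
        SimpleNamespace(
            has_text_frame=True,
            text_frame=SimpleNamespace(text=" Azure Storage Fundamentals "),
        ),
        SimpleNamespace(has_text_frame=False),
    ],
    has_notes_slide=True,
    notes_slide=SimpleNamespace(
        notes_text_frame=SimpleNamespace(text="Allow 8 minutes")
    ),
)
demo_record = slide_to_record(1, demo_slide)
```

Records shaped like this would be the input to the analysis and chunking stages; the real extractor may well carry more fields (layout, tables, images).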
pptx-shredder/                      Main project directory
├── shred.py                        ← Entry point (run this!)
├── requirements.txt                ← Python dependencies
├── package.json                    ← npm package for global MCP access
├── config.yaml                     ← Settings (optional)
├── mcp_server.py                   ← MCP server (Python)
├── .mcp.json                       ← MCP configuration (local + global)
│
├── bin/                            Global npm package entry
│   └── mcp-server.js               ← Node.js wrapper for global access
│
├── src/                            Core application logic
│   ├── extractor.py                ← Legacy PPTX extraction (regex-based)
│   ├── intelligent_extractor.py    ← NEW: AI-powered extraction
│   ├── formatter.py                ← Legacy markdown formatting
│   ├── intelligent_formatter.py    ← NEW: AI-optimized formatting
│   ├── shred.py                    ← CLI interface with Rich UI
│   └── utils.py                    ← Helpers & token counting
│
├── input/                          Drop your PPTX files here
│   └── README.md                   ← Usage instructions
│
├── output/                         Generated markdown appears here
│   └── README.md                   ← What gets created
│
├── tests/                          64 comprehensive tests
│   ├── test_extractor.py           ← Content extraction tests
│   ├── test_formatter.py           ← Markdown generation tests
│   └── test_integration.py         ← End-to-end workflow tests
│
├── .devcontainer/                  VS Code dev environment
├── .github/                        CI/CD & automation
│   ├── workflows/                  ← GitHub Actions
│   └── dependabot.yml              ← Dependency updates
│
└── docs/                           Documentation
    └── PRD.md                      ← Product requirements
Single Purpose: Convert PPTX → LLM-ready markdown
Rock Solid: 64 tests, 95%+ coverage, enterprise CI/CD
Dead Simple: Drop files, run command, collect results
- Pattern Recognition: Detects modules, labs, exercises, learning objectives
- Context Preservation: Maintains instructional flow and narrative
- LLM Optimization: Token-counted chunks (1500-2000) with smart overlap
- Code Detection: Identifies and formats code in 15+ languages
- Rich Metadata: YAML frontmatter with semantic context
- Pedagogical Awareness: Categorizes instructor notes by intent (timing, emphasis, tips, warnings)
- Difficulty Assessment: Automatic difficulty level detection (beginner/intermediate/advanced)
- Time Estimation: Activity-based duration calculation with multipliers
- Prerequisites Detection: Extracts required knowledge from content and notes
- Learning Analytics: Cognitive load, interaction level, and learning mode analysis
- Compliance Ready: Detects regulatory markers (GDPR, HIPAA, SOX, ISO, NIST, PCI)
- Assessment Extraction: Identifies quiz questions and knowledge checks
- Visual Context: Describes images, tables, charts, and layout semantics
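As one illustration of the list above, regulatory marker detection could be as simple as a keyword scan. The marker names come from the feature list; the matching rule (case-sensitive, whole-word) and the function name `find_compliance_markers` are assumptions made for this sketch, not the project's actual logic.

```python
import re

# Marker names are taken from the feature list; the matching rule is an assumption.
MARKERS = ("GDPR", "HIPAA", "SOX", "ISO", "NIST", "PCI")
_PATTERN = re.compile(r"\b(" + "|".join(MARKERS) + r")\b")


def find_compliance_markers(text):
    """Return the markers mentioned in text, in first-seen order, without repeats."""
    seen = []
    for match in _PATTERN.finditer(text):
        marker = match.group(1)
        if marker not in seen:
            seen.append(marker)
    return seen


slide_text = "Store customer data per GDPR and PCI-DSS. GDPR applies. Isolate hosts."
markers = find_compliance_markers(slide_text)
```

Matching is case-sensitive on purpose so that ordinary words such as "Isolate" are not reported as ISO.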
# Production mode - scan input/ folder
python shred.py
# Process specific files
python shred.py presentation.pptx course.pptx
# Preview mode (no files created)
python shred.py --dry-run
# Show help
python shred.py --help
# Custom chunking strategy
python shred.py --strategy sequential --chunk-size 2000
# Verbose output with detailed logging
python shred.py --verbose
# Custom output directory
python shred.py --output-dir ./my-markdown
# Force overwrite existing files
python shred.py --force
- `instructional` (default): Smart chunking that preserves learning modules
- `sequential`: Simple slide-by-slide processing
- `single`: One file per presentation
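For readers who want to see how the flags and strategy names above fit together, here is a rough `argparse` sketch. It mirrors only the options documented in this README; the defaults shown, and the use of `argparse` itself, are assumptions, since the real CLI may be wired differently.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented flags (hypothetical wiring)."""
    parser = argparse.ArgumentParser(prog="shred.py")
    parser.add_argument("files", nargs="*", help="PPTX files; empty means scan input/")
    parser.add_argument(
        "--strategy",
        choices=["instructional", "sequential", "single"],
        default="instructional",
    )
    parser.add_argument("--chunk-size", type=int, default=1500)  # assumed default
    parser.add_argument("--output-dir", default="output")  # assumed default
    parser.add_argument("--dry-run", action="store_true")
    parser.add_argument("--verbose", action="store_true")
    parser.add_argument("--force", action="store_true")
    return parser


args = build_parser().parse_args(
    ["--strategy", "sequential", "--chunk-size", "2000", "deck.pptx"]
)
```

Restricting `--strategy` with `choices` means a typo in the strategy name fails at parse time instead of partway through a run.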
PPTX Shredder intelligently:
- Extracts Everything: Text, speaker notes, slide structure, code blocks
- Recognizes Patterns: Modules, labs, exercises, learning objectives
- Optimizes for LLMs: Token-counted chunks (1500-2000 tokens) with overlap
- Preserves Context: Instructional narrative and relationships
- Rich Metadata: YAML frontmatter with learning context
- Code Detection: Identifies and formats code in 15+ languages
- Beautiful UI: Progress bars, tables, and colored output
- Detects module boundaries and learning objectives
- Preserves lab instructions and exercise context
- Maintains teaching flow and narrative structure
- Groups related content intelligently
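The idea of token-counted chunks with overlap can be sketched in a few lines. This is a plain sliding window, not the instructional chunking the tool performs; whitespace splitting stands in for a real tokenizer, and `chunk_tokens` is a hypothetical name.

```python
def chunk_tokens(tokens, chunk_size=1500, overlap=200):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


# Whitespace splitting is a crude stand-in for a real tokenizer.
words = "one two three four five six seven eight nine ten".split()
chunks = chunk_tokens(words, chunk_size=4, overlap=1)
```

With `chunk_size=4` and `overlap=1`, each chunk starts with the last word of the previous one, which is the same mechanism that lets a 1500-token chunk carry some context from its predecessor.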
---
module_id: 01-azure-storage-fundamentals
module_title: Azure Storage Fundamentals
slide_range: [1, 8]
chunk_index: 1
total_chunks: 3
learning_objectives:
  - Configure blob storage with appropriate security settings
  - Implement lifecycle management policies for cost optimization
  - Apply compliance requirements for enterprise data governance
prerequisites:
  - Basic understanding of cloud computing concepts
  - Familiarity with Azure portal navigation
concepts: ["Azure", "Storage", "Security", "Compliance", "GDPR"]
difficulty_level: intermediate
estimated_duration: 25 minutes
learning_context:
  primary_learning_mode: experiential
  cognitive_load: medium
  interaction_level: high
activity_type: hands-on-lab
compliance_markers: ["GDPR", "SECURITY"]
instructor_guidance_categories: ["timing", "emphasis", "examples", "tips", "warnings"]
---
# Azure Storage Fundamentals
*This is part 1 of 3 in the Azure Storage Fundamentals module series.*
**Compliance Notice:** This content relates to GDPR, SECURITY requirements.
## Prerequisites
Before starting this module, you should have:
- Basic understanding of cloud computing concepts
- Familiarity with Azure portal navigation
## Learning Objectives
By the end of this module, you will be able to:
- Configure blob storage with appropriate security settings
- Implement lifecycle management policies for cost optimization
- Apply compliance requirements for enterprise data governance
## Content
### Storage Account Configuration
**Objective**: Create and configure a storage account with enterprise security
#### Lab Code:
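```powershell
# Create storage account with security features.
# Location and SkuName are required by New-AzStorageAccount; the values
# below are placeholders, not taken from the original slide.
$storageAccount = New-AzStorageAccount `
    -ResourceGroupName "rg-storage-lab" `
    -Name "stentsec$((Get-Random))" `
    -Location "eastus" `
    -SkuName "Standard_LRS" `
    -AllowBlobPublicAccess $false `
    -EnableHttpsTrafficOnly $true `
    -MinimumTlsVersion "TLS1_2"
```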
Q: What is the minimum TLS version required for enterprise security compliance?
Timing: Allow 8 minutes for storage account creation
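Downstream scripts often need the metadata and the body of a generated chunk separately. The sketch below splits on the `---` delimiters shown in the sample and reads only flat `key: value` pairs; it uses the standard library alone, and the helper names are hypothetical. For the nested fields, a real YAML parser (for example PyYAML's `yaml.safe_load`) would be the better choice.

```python
def split_frontmatter(markdown_text):
    """Return (frontmatter_lines, body) for text that starts with a '---' block."""
    lines = markdown_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return [], markdown_text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return lines[1:i], "\n".join(lines[i + 1:])
    return [], markdown_text  # no closing delimiter: treat everything as body


def scalar_fields(frontmatter_lines):
    """Read only flat 'key: value' pairs; lists and nested maps are skipped."""
    fields = {}
    for line in frontmatter_lines:
        if line.startswith((" ", "-")) or ":" not in line:
            continue
        key, _, value = line.partition(":")
        if value.strip():
            fields[key.strip()] = value.strip()
    return fields


sample = (
    "---\n"
    "module_title: Azure Storage Fundamentals\n"
    "chunk_index: 1\n"
    "total_chunks: 3\n"
    "---\n"
    "# Azure Storage Fundamentals\n"
)
meta_lines, body = split_frontmatter(sample)
meta = scalar_fields(meta_lines)
```

Values come back as strings here, so `chunk_index` is `"1"` and would need an `int()` conversion before any arithmetic.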
## Status: Production Ready
| Aspect | Status | Details |
|--------|--------|---------|
| **Core Function** | ✅ Complete | PPTX → Markdown conversion working perfectly |
| **Testing** | ✅ 64 tests, 95%+ coverage | Unit, integration, cross-platform tests |
| **CI/CD** | ✅ Enterprise grade | GitHub Actions, Dependabot, auto-review |
| **UI** | ✅ Rich console | Progress bars, tables, colored output |
| **Security** | ✅ Local only | Zero network calls, NDA-friendly |
| **Global Access** | ✅ npm package | Works from any directory via npx |
| **Content Quality** | ✅ Automated linting | Markdown formatting and URL validation |
| **Platform** | ✅ Cross-platform | Windows, macOS, Linux support |
| **DevOps** | ✅ Full automation | Dev containers, automated dependencies |
## Rock Solid Philosophy
**Single Responsibility**: We do ONE thing - convert PPTX to LLM-ready markdown
**Zero Surprises**: Predictable, reliable behavior every time
**Maximum Clarity**: Simple workflow, clear output, obvious structure
**Bulletproof**: Comprehensive testing prevents regressions
**Privacy First**: All processing local, no external dependencies
## Try It Now
### Quick Demo
```bash
# Run the interactive demo
python demo.py
# Or try with sample presentations
cp samples/*.pptx input/
python shred.py
```
# Process a technical training deck
python shred.py "Azure Fundamentals Course.pptx"
# Output includes:
# - Module detection and grouping
# - Lab instructions preserved
# - Code blocks properly formatted
# - Learning objectives extracted
# - Smart chunking for LLM context windows
# Run all tests with verbose output
PYTHONPATH=src python -m pytest tests/ -v
# Run with coverage report
PYTHONPATH=src python -m pytest tests/ --cov=src --cov-report=html
# Run specific test category
PYTHONPATH=src python -m pytest tests/test_extractor.py -v
# Quick test run
make test
# Format code
black src/ tests/
# Type checking
mypy src/
# Lint code
ruff check src/
# Run all checks
make check
# Install dev dependencies
pip install -r requirements-dev.txt
# Run in watch mode
make watch
# Build and test
make all
- Replaced regex-based extraction with PowerPoint object model + DeepSeek LLM
- Learning objectives detection now uses semantic understanding instead of pattern matching
- Module boundary recognition identifies instructional structure automatically
- Activity type classification (lecture, demo, lab, assessment, etc.)
- Time estimation based on content complexity and activity type
- Prerequisites extraction from both content and speaker notes
- Uses DeepSeek API for instructional design inference
- Structured content extraction via PPTX object model
- Intelligent chunking based on pedagogical flow
- Rich YAML frontmatter with 20+ metadata fields
- Fixed malformed docstrings and syntax errors
- Enhanced error handling and robust slide processing
- Proper import resolution for modular architecture
- Cross-platform compatibility maintained
- Batch DeepSeek API calls - Currently 1 call per slide (slow for large presentations)
- Implement caching - Cache LLM responses for similar slide patterns
- Parallel processing - Process multiple slides concurrently
- Fallback modes - Graceful degradation when API unavailable
- Multi-language support - Detect and handle non-English content
- Custom LLM providers - Support OpenAI, Anthropic, local models
- Export formats - Add JSON, HTML, and SCORM output options
- Template system - Customizable markdown templates for different use cases
- Batch directory processing - Process entire folder hierarchies
- Git integration - Track changes across presentation versions
- Compliance tracking - Enhanced detection of regulatory markers
- Quality metrics - Automated assessment of content quality
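The caching item in the list above could start from something like the following. It is an exact-match cache keyed by a hash of the slide content, which is simpler than caching "similar slide patterns"; `SlideAnalysisCache` and `fake_analyze` are hypothetical, and the real API call is replaced by a stand-in so the sketch runs offline.

```python
import hashlib
import json


class SlideAnalysisCache:
    """In-memory cache keyed by a hash of the slide's text and notes."""

    def __init__(self, analyze):
        self._analyze = analyze  # hypothetical callable that performs the API request
        self._store = {}
        self.misses = 0

    @staticmethod
    def key_for(slide):
        # sort_keys makes the hash independent of dict insertion order.
        payload = json.dumps(slide, sort_keys=True, ensure_ascii=False)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, slide):
        key = self.key_for(slide)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._analyze(slide)
        return self._store[key]


def fake_analyze(slide):
    # Stand-in for the real LLM call so the sketch runs offline.
    return {"activity_type": "lab" if "lab" in slide["text"].lower() else "lecture"}


cache = SlideAnalysisCache(fake_analyze)
first = cache.get({"text": "Lab: create a storage account", "notes": ""})
second = cache.get({"notes": "", "text": "Lab: create a storage account"})
```

Decks that repeat boilerplate slides (agenda, break, Q&A) would hit the cache on every repeat, which reduces the one-call-per-slide cost without changing the results.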
# Check markdown formatting and URLs
./scripts/local-content-check.sh
# Markdown linting only
./scripts/local-content-check.sh markdown-only
# URL validation only
./scripts/local-content-check.sh urls-only
- Convert course materials for AI-assisted delivery
- Create searchable knowledge bases from presentations
- Generate practice questions and assessments
- Repurpose existing content for new formats
- Extract learning objectives and outcomes
- Analyze course structure and flow
- Build AI training datasets from presentations
- Create documentation from training materials
- Generate summaries and abstracts
- Process technical presentations for RAG systems
- Extract code examples and documentation
- Build knowledge bases for AI assistants
graph TB
    subgraph "Single Purpose Design"
        A[Input PPTX Files] --> B[Extractor]
        B --> C[Formatter]
        C --> D[Output Markdown]
    end
    subgraph "Core Components"
        B --> B1[Extract Text]
        B --> B2[Extract Notes]
        B --> B3[Detect Patterns]
        C --> C1[Smart Chunking]
        C --> C2[Add Metadata]
        C --> C3[Generate Files]
    end
    subgraph "Integrations"
        E[Claude Desktop] --> F[MCP Server]
        F --> B
        G[CLI Interface] --> B
    end
    style A fill:#e1f5fe
    style D fill:#e8f5e8
    style B fill:#fff3e0
    style C fill:#f3e5f5
Default settings in `config.yaml`:
extraction:
  extract_text: true
  extract_notes: true
  extract_images: false  # Coming soon
formatting:
  default_chunk_size: 1500
  chunk_overlap: 200
  include_metadata: true
output:
  overwrite_existing: false
  create_summary: true
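Because `config.yaml` is optional, a partial file presumably only overrides the keys it sets. The sketch below shows one way that merge could work on already-parsed dictionaries; the default values are copied from the settings above, while `merge_config` and the merge behavior itself are assumptions for illustration.

```python
import copy

# Values copied from the default settings shown above.
DEFAULTS = {
    "extraction": {"extract_text": True, "extract_notes": True, "extract_images": False},
    "formatting": {"default_chunk_size": 1500, "chunk_overlap": 200, "include_metadata": True},
    "output": {"overwrite_existing": False, "create_summary": True},
}


def merge_config(defaults, overrides):
    """Return defaults with overrides applied section by section."""
    merged = copy.deepcopy(defaults)  # never mutate the shared defaults
    for section, values in overrides.items():
        if isinstance(values, dict) and isinstance(merged.get(section), dict):
            merged[section].update(values)
        else:
            merged[section] = values
    return merged


# A user file that only changes the chunk size.
config = merge_config(DEFAULTS, {"formatting": {"default_chunk_size": 2000}})
```

Here only `default_chunk_size` changes; `chunk_overlap` and every other setting keep their defaults.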
- Core PPTX text extraction
- Instructional design patterns
- LLM-optimized chunking
- Rich console interface
- Comprehensive testing
- CI/CD pipeline
- Image extraction and description
- Table preservation
- Multi-language support
- Web interface
- API endpoint
- CLAUDE.md - AI assistant context
- docs/PRD.md - Product requirements
- GitHub Wiki - Extended docs
Contributions welcome! This project uses:
- Automated PR review assignment
- GitHub Copilot code review
- Comprehensive test requirements
- Pre-commit hooks for quality
See CONTRIBUTING.md for guidelines.
MIT License - see LICENSE
PPTX Shredder does ONE thing and does it very, very well.
✅ Zero Configuration - Works out of the box
✅ Zero Surprises - Predictable, reliable results
✅ Zero Network - Completely local processing
✅ Maximum Clarity - Simple workflow, clear output
Built by technical trainers, for technical trainers.
Questions? Found a bug? Open an issue