Transform PowerPoint presentations into LLM-optimized markdown while preserving instructional design narrative. Built for technical trainers.

timothywarner-org/pptx-shredder

PPTX Shredder 🎯

One tool. One purpose. Rock solid.

Transform PowerPoint presentations into LLM-optimized markdown. Built for technical trainers who need dead-simple reliability.

Does one thing and does it very, very well ✨

⚡ Quick Start

Step 1: Get Running (30 seconds)

# Clone and setup
git clone https://github.com/timothywarner-org/pptx-shredder.git
cd pptx-shredder
pip install -r requirements.txt

# Drop your PPTX files in input/ folder, then:
python shred.py

# ✅ Done! Your markdown is in output/

Step 2: Use from Claude Desktop (optional)

# Global access (works from any directory)
claude mcp add pptx-shredder npx -y @timothywarner/pptx-shredder-mcp

# Now use in Claude Desktop from any project:
# "Use shred_pptx to process my presentation.pptx"

🧠 How It Works (Now with AI!)

NEW: Intelligent Extraction with DeepSeek LLM 🤖

flowchart TD
    A[📁 Drop PPTX in input/] --> B[🚀 Run python shred.py]
    B --> C[🔍 PowerPoint Object Model]
    C --> D[📝 Structured Content Extract]
    D --> E[🧠 DeepSeek LLM Analysis]
    E --> F[🎯 Learning Objectives Detection]
    E --> G[📚 Module Boundary Recognition]
    E --> H[🏷️ Activity Type Classification]
    E --> I[⏱️ Time Estimation]
    F --> J[🧩 Intelligent Chunking]
    G --> J
    H --> J
    I --> J
    J --> K[📋 Rich YAML Metadata]
    K --> L[📄 High-Quality Markdown]
    L --> M[📂 Save to output/]

    N[🖥️ Claude Desktop<br/>Any Directory] --> O[📦 npx @timothywarner/<br/>pptx-shredder-mcp]
    O --> P[🔌 MCP Server]
    P --> B
    
    style A fill:#e1f5fe
    style M fill:#e8f5e8
    style N fill:#fff3e0
    style O fill:#e8f0ff
    style E fill:#ff9800
    style J fill:#f3e5f5

📁 Project Structure

pptx-shredder/                 🏠 Main project directory
├── 🚀 shred.py               ← Entry point (run this!)
├── 📋 requirements.txt       ← Python dependencies
├── 📦 package.json           ← npm package for global MCP access
├── ⚙️ config.yaml           ← Settings (optional)
├── 🔌 mcp_server.py         ← MCP server (Python)
├── 📄 .mcp.json             ← MCP configuration (local + global)
│
├── 📂 bin/                   🌍 Global npm package entry
│   └── mcp-server.js        ← Node.js wrapper for global access
│
├── 📂 src/                   🧠 Core application logic
│   ├── 🔍 extractor.py      ← Legacy PPTX extraction (regex-based)
│   ├── 🤖 intelligent_extractor.py ← NEW: AI-powered extraction
│   ├── ✨ formatter.py      ← Legacy markdown formatting
│   ├── 🧠 intelligent_formatter.py ← NEW: AI-optimized formatting
│   ├── 🎛️ shred.py          ← CLI interface with Rich UI
│   └── 🛠️ utils.py          ← Helpers & token counting
│
├── 📂 input/                 📥 Drop your PPTX files here
│   └── 📖 README.md         ← Usage instructions
│
├── 📂 output/                📤 Generated markdown appears here
│   └── 📖 README.md         ← What gets created
│
├── 🧪 tests/                 🔬 64 comprehensive tests
│   ├── test_extractor.py    ← Content extraction tests
│   ├── test_formatter.py    ← Markdown generation tests
│   └── test_integration.py  ← End-to-end workflow tests
│
├── 🐳 .devcontainer/        📦 VS Code dev environment
├── 🤖 .github/              ⚙️ CI/CD & automation
│   ├── workflows/           ← GitHub Actions
│   └── dependabot.yml      ← Dependency updates
│
└── 📚 docs/                  📖 Documentation
    └── PRD.md               ← Product requirements

🎯 What It Does (The Magic)

Single Purpose: Convert PPTX → LLM-ready markdown
Rock Solid: 64 tests, 95%+ coverage, enterprise CI/CD
Dead Simple: Drop files, run command, collect results

Core Intelligence

  • 🧠 Pattern Recognition: Detects modules, labs, exercises, learning objectives
  • 📚 Context Preservation: Maintains instructional flow and narrative
  • 🤖 LLM Optimization: Token-counted chunks (1500-2000 tokens) with smart overlap
  • 💻 Code Detection: Identifies and formats code in 15+ languages
  • 📋 Rich Metadata: YAML frontmatter with semantic context
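
A rough sketch of what token-counted chunking with overlap can look like. This is not the project's implementation: `count_tokens` and `chunk_slides` are hypothetical helpers, whitespace-separated words stand in for real tokens, and the overlap is measured in whole slides rather than tokens.

```python
from typing import List


def count_tokens(text: str) -> int:
    """Approximate token count (assumption: one whitespace-separated word = one token)."""
    return len(text.split())


def total_tokens(slides: List[str]) -> int:
    return sum(count_tokens(s) for s in slides)


def chunk_slides(slides: List[str], max_tokens: int = 1500, overlap_slides: int = 1) -> List[List[str]]:
    """Group slide texts into chunks of roughly max_tokens.

    The last `overlap_slides` slides of a finished chunk are repeated at the
    start of the next chunk, as long as they leave room for the new slide.
    A single slide larger than max_tokens still gets a chunk of its own.
    """
    chunks: List[List[str]] = []
    current: List[str] = []
    for slide in slides:
        if current and total_tokens(current + [slide]) > max_tokens:
            chunks.append(current)
            tail = current[-overlap_slides:] if overlap_slides > 0 else []
            # Keep the overlap only if overlap + new slide still fits.
            current = tail if total_tokens(tail + [slide]) <= max_tokens else []
        current.append(slide)
    if current:
        chunks.append(current)
    return chunks


demo = chunk_slides(["a b c", "d e f", "g h i"], max_tokens=6, overlap_slides=1)
```

In the demo, the second slide appears in both chunks, which is the overlap that keeps context across chunk boundaries.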

Enterprise Training Intelligence

  • 🎓 Pedagogical Awareness: Categorizes instructor notes by intent (timing, emphasis, tips, warnings)
  • 📊 Difficulty Assessment: Automatic difficulty level detection (beginner/intermediate/advanced)
  • ⏱️ Time Estimation: Activity-based duration calculation with multipliers
  • 🔍 Prerequisites Detection: Extracts required knowledge from content and notes
  • 📈 Learning Analytics: Cognitive load, interaction level, and learning mode analysis
  • 🛡️ Compliance Ready: Detects regulatory markers (GDPR, HIPAA, SOX, ISO, NIST, PCI)
  • 🎯 Assessment Extraction: Identifies quiz questions and knowledge checks
  • 🖼️ Visual Context: Describes images, tables, charts, and layout semantics
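
To make the compliance-marker idea concrete, here is a minimal sketch. The marker names come from the list above; the matching rule (case-sensitive whole words) and the helper name `detect_compliance_markers` are assumptions, not the project's actual logic.

```python
import re
from typing import List

# Marker names taken from the feature list; everything else is illustrative.
COMPLIANCE_MARKERS = ("GDPR", "HIPAA", "SOX", "ISO", "NIST", "PCI")


def detect_compliance_markers(text: str) -> List[str]:
    """Return the markers that occur in `text` as whole, uppercase words."""
    found = []
    for marker in COMPLIANCE_MARKERS:
        # \b keeps "ISO" from matching inside words such as "ISOLATED".
        if re.search(rf"\b{marker}\b", text):
            found.append(marker)
    return found


markers = detect_compliance_markers("Store PHI per HIPAA and GDPR rules; see ISO 27001.")
```

Matching is case-sensitive on purpose, so ordinary words like "isolated" or "sox" in slide text do not trigger a marker.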

📖 Usage Guide

Basic Commands

# Production mode - scan input/ folder
python shred.py

# Process specific files  
python shred.py presentation.pptx course.pptx

# Preview mode (no files created)
python shred.py --dry-run

# Show help
python shred.py --help

Advanced Options

# Custom chunking strategy
python shred.py --strategy sequential --chunk-size 2000

# Verbose output with detailed logging
python shred.py --verbose

# Custom output directory
python shred.py --output-dir ./my-markdown

# Force overwrite existing files
python shred.py --force
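
For readers who want to see how these flags fit together, the sketch below models them with `argparse`. It is an assumption-laden illustration: the real CLI may be built with a different library, and the defaults shown (`instructional`, `1500`, `output`) are inferred from other parts of this README rather than read from the code.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags."""
    parser = argparse.ArgumentParser(
        prog="shred.py",
        description="Convert PPTX files to LLM-ready markdown.",
    )
    parser.add_argument("files", nargs="*", help="PPTX files to process; if omitted, scan input/")
    parser.add_argument("--dry-run", action="store_true", help="preview without creating files")
    parser.add_argument(
        "--strategy",
        choices=["instructional", "sequential", "single"],
        default="instructional",
    )
    parser.add_argument("--chunk-size", type=int, default=1500, help="target chunk size in tokens")
    parser.add_argument("--output-dir", default="output", help="where markdown is written")
    parser.add_argument("--verbose", action="store_true", help="detailed logging")
    parser.add_argument("--force", action="store_true", help="overwrite existing files")
    return parser


args = build_parser().parse_args(["deck.pptx", "--strategy", "sequential", "--chunk-size", "2000"])
```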

Processing Strategies

  • instructional (default): Smart chunking that preserves learning modules
  • sequential: Simple slide-by-slide processing
  • single: One file per presentation
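
One plausible reading of the three strategies, sketched as a grouping function. The slide shape (`title` and `module` keys) and the function name `group_slides` are hypothetical, and the interpretation of `sequential` as one group per slide is an assumption.

```python
from typing import Dict, List

Slide = Dict[str, str]  # assumed shape: {"title": ..., "module": ...}


def group_slides(slides: List[Slide], strategy: str = "instructional") -> List[List[Slide]]:
    """Split a deck into groups; each group would become one markdown chunk."""
    if strategy == "sequential":
        # Slide by slide, with no awareness of modules.
        return [[slide] for slide in slides]
    if strategy == "single":
        # The whole presentation as one group (one output file).
        return [list(slides)]
    if strategy == "instructional":
        # Keep consecutive slides of the same module together.
        groups: List[List[Slide]] = []
        for slide in slides:
            if groups and groups[-1][0]["module"] == slide["module"]:
                groups[-1].append(slide)
            else:
                groups.append([slide])
        return groups
    raise ValueError(f"unknown strategy: {strategy}")


deck = [
    {"title": "Intro", "module": "m1"},
    {"title": "Lab 1", "module": "m1"},
    {"title": "Next steps", "module": "m2"},
]
group_sizes = [len(g) for g in group_slides(deck)]
```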

🎯 What It Does

Core Features

PPTX Shredder intelligently:

  • Extracts Everything: Text, speaker notes, slide structure, code blocks
  • Recognizes Patterns: Modules, labs, exercises, learning objectives
  • Optimizes for LLMs: Token-counted chunks (1500-2000 tokens) with overlap
  • Preserves Context: Instructional narrative and relationships
  • Rich Metadata: YAML frontmatter with learning context
  • Code Detection: Identifies and formats code in 15+ languages
  • Beautiful UI: Progress bars, tables, and colored output

Instructional Design Awareness

  • Detects module boundaries and learning objectives
  • Preserves lab instructions and exercise context
  • Maintains teaching flow and narrative structure
  • Groups related content intelligently
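
As an illustration of pattern-based detection, the sketch below classifies a slide title with a few regular expressions. The patterns, labels, and the name `classify_title` are made up for this example; the project's own rules (and the newer LLM-based path) will differ.

```python
import re

# Order matters: the first matching pattern wins.
TITLE_PATTERNS = [
    ("learning_objectives", re.compile(r"\b(learning )?objectives?\b", re.IGNORECASE)),
    ("lab", re.compile(r"\b(hands-on )?lab\b", re.IGNORECASE)),
    ("exercise", re.compile(r"\bexercise\b", re.IGNORECASE)),
    ("module", re.compile(r"^module\s+\d+", re.IGNORECASE)),
]


def classify_title(title: str) -> str:
    """Return a coarse label for a slide title, or "content" if nothing matches."""
    for label, pattern in TITLE_PATTERNS:
        if pattern.search(title):
            return label
    return "content"


labels = [
    classify_title(t)
    for t in ("Module 3: Azure Storage", "Lab: Create a storage account", "Why storage matters")
]
```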

📄 Output Example

---
module_id: 01-azure-storage-fundamentals
module_title: Azure Storage Fundamentals
slide_range: [1, 8]
chunk_index: 1
total_chunks: 3
learning_objectives:
  - Configure blob storage with appropriate security settings
  - Implement lifecycle management policies for cost optimization
  - Apply compliance requirements for enterprise data governance
prerequisites:
  - Basic understanding of cloud computing concepts
  - Familiarity with Azure portal navigation
concepts: ["Azure", "Storage", "Security", "Compliance", "GDPR"]
difficulty_level: intermediate
estimated_duration: 25 minutes
learning_context:
  primary_learning_mode: experiential
  cognitive_load: medium
  interaction_level: high
activity_type: hands-on-lab
compliance_markers: ["GDPR", "SECURITY"]
instructor_guidance_categories: ["timing", "emphasis", "examples", "tips", "warnings"]
---

# Azure Storage Fundamentals

*This is part 1 of 3 in the Azure Storage Fundamentals module series.*

**🔒 Compliance Notice:** This content relates to GDPR, SECURITY requirements.

## 📋 Prerequisites
Before starting this module, you should have:
- Basic understanding of cloud computing concepts
- Familiarity with Azure portal navigation

## 🎯 Learning Objectives
By the end of this module, you will be able to:
- Configure blob storage with appropriate security settings
- Implement lifecycle management policies for cost optimization
- Apply compliance requirements for enterprise data governance

## 📚 Content

### 🧪 Storage Account Configuration
**Objective**: Create and configure a storage account with enterprise security

#### 💻 Lab Code:
```powershell
# Create storage account with security features
# (-Location and -SkuName are required by New-AzStorageAccount; the values here are placeholders)
$storageAccount = New-AzStorageAccount `
  -ResourceGroupName "rg-storage-lab" `
  -Name "stentsec$((Get-Random))" `
  -Location "eastus" `
  -SkuName "Standard_LRS" `
  -AllowBlobPublicAccess $false `
  -EnableHttpsTrafficOnly $true `
  -MinimumTlsVersion "TLS1_2"
```

🧠 Knowledge Check:

Q: What is the minimum TLS version required for enterprise security compliance?

👨‍🏫 Instructor Guidance:

- ⏱️ Timing: Allow 8 minutes for storage account creation
- ⚠️ Emphasis: Critical to stress importance of disabling public blob access
- 💡 Examples: Show real-world scenario where public access led to data breach
- 🔧 Tips: Use naming conventions that include environment and purpose


## 🔧 Status: Production Ready

| Aspect | Status | Details |
|--------|--------|---------|
| **🎯 Core Function** | ✅ Complete | PPTX → Markdown conversion working perfectly |
| **🧪 Testing** | ✅ 64 tests, 95%+ coverage | Unit, integration, cross-platform tests |
| **🚀 CI/CD** | ✅ Enterprise grade | GitHub Actions, Dependabot, auto-review |
| **📊 UI** | ✅ Rich console | Progress bars, tables, colored output |
| **🔒 Security** | ✅ Local parsing | PPTX parsing runs locally; the v0.2.0 intelligent extraction calls the DeepSeek API |
| **🌍 Global Access** | ✅ npm package | Works from any directory via npx |
| **📝 Content Quality** | ✅ Automated linting | Markdown formatting and URL validation |
| **⚡ Platform** | ✅ Cross-platform | Windows, macOS, Linux support |
| **🐳 DevOps** | ✅ Full automation | Dev containers, automated dependencies |

## 🎯 Rock Solid Philosophy

**Single Responsibility**: We do ONE thing - convert PPTX to LLM-ready markdown  
**Zero Surprises**: Predictable, reliable behavior every time  
**Maximum Clarity**: Simple workflow, clear output, obvious structure  
**Bulletproof**: Comprehensive testing guards against regressions  
**Privacy First**: PPTX parsing runs locally; only the v0.2.0 intelligent extraction calls an external service (the DeepSeek API)

## 🎬 Try It Now

### Quick Demo
```bash
# Run the interactive demo
python demo.py

# Or try with sample presentations
cp samples/*.pptx input/
python shred.py
```

Real-World Example

# Process a technical training deck
python shred.py "Azure Fundamentals Course.pptx"

# Output includes:
# - Module detection and grouping
# - Lab instructions preserved
# - Code blocks properly formatted
# - Learning objectives extracted
# - Smart chunking for LLM context windows

🧪 Development

Testing

# Run all tests with verbose output
PYTHONPATH=src python -m pytest tests/ -v

# Run with coverage report
PYTHONPATH=src python -m pytest tests/ --cov=src --cov-report=html

# Run specific test category
PYTHONPATH=src python -m pytest tests/test_extractor.py -v

# Quick test run
make test

Code Quality

# Format code
black src/ tests/

# Type checking
mypy src/

# Lint code
ruff check src/

# Run all checks
make check

Development Workflow

# Install dev dependencies
pip install -r requirements-dev.txt

# Run in watch mode
make watch

# Build and test
make all

🎯 Recent Improvements (v0.2.0)

🤖 Intelligent Extraction System

  • Replaced regex-based extraction with PowerPoint object model + DeepSeek LLM
  • Learning objectives detection now uses semantic understanding instead of pattern matching
  • Module boundary recognition identifies instructional structure automatically
  • Activity type classification (lecture, demo, lab, assessment, etc.)
  • Time estimation based on content complexity and activity type
  • Prerequisites extraction from both content and speaker notes
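
The time-estimation item can be pictured as a base duration per slide scaled by an activity multiplier. Every number in this sketch is an illustrative assumption, as is the helper name `estimate_minutes`; the activity types are the ones listed above.

```python
# Illustrative values only; the project's real multipliers are not documented here.
BASE_MINUTES_PER_SLIDE = 2.0
ACTIVITY_MULTIPLIERS = {
    "lecture": 1.0,
    "demo": 1.5,
    "lab": 3.0,
    "assessment": 2.0,
}


def estimate_minutes(slide_count: int, activity_type: str) -> int:
    """Estimate duration as slides x base minutes x activity multiplier."""
    # Unknown activity types fall back to a neutral multiplier of 1.0.
    multiplier = ACTIVITY_MULTIPLIERS.get(activity_type, 1.0)
    return round(slide_count * BASE_MINUTES_PER_SLIDE * multiplier)


lab_minutes = estimate_minutes(4, "lab")
```

Under these assumed values, a four-slide lab is budgeted three times as long as a four-slide lecture.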

🧠 AI-Powered Analysis

  • Uses DeepSeek API for instructional design inference
  • Structured content extraction via PPTX object model
  • Intelligent chunking based on pedagogical flow
  • Rich YAML frontmatter with 20+ metadata fields
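
If you consume the generated files in your own scripts, you will usually want to separate the frontmatter from the body. The sketch below does that with the standard library only; `split_frontmatter` is a hypothetical helper that reads just the top-level `key: value` lines and returns values as raw strings. A real consumer would more likely hand the frontmatter to a YAML parser.

```python
from typing import Dict, Tuple


def split_frontmatter(markdown: str) -> Tuple[Dict[str, str], str]:
    """Split a chunk into (top-level scalar fields, body).

    Lists and nested maps in the frontmatter are skipped on purpose;
    this is a deliberately naive reader, not a YAML parser.
    """
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, markdown
    try:
        end = lines.index("---", 1)  # closing delimiter
    except ValueError:
        return {}, markdown
    fields: Dict[str, str] = {}
    for line in lines[1:end]:
        # Only unindented "key: value" lines with a non-empty value.
        if line and not line[0].isspace() and ":" in line:
            key, _, value = line.partition(":")
            if value.strip():
                fields[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:]).lstrip("\n")
    return fields, body


sample = (
    "---\n"
    "module_id: 01-azure-storage-fundamentals\n"
    "chunk_index: 1\n"
    "learning_objectives:\n"
    "  - Configure blob storage\n"
    "---\n"
    "\n"
    "# Azure Storage Fundamentals\n"
)
fields, body = split_frontmatter(sample)
```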

📊 Quality Improvements

  • Fixed malformed docstrings and syntax errors
  • Enhanced error handling and robust slide processing
  • Proper import resolution for modular architecture
  • Cross-platform compatibility maintained

📋 Outstanding TODOs

Performance Optimization

  • Batch DeepSeek API calls - Currently 1 call per slide (slow for large presentations)
  • Implement caching - Cache LLM responses for similar slide patterns
  • Parallel processing - Process multiple slides concurrently
  • Fallback modes - Graceful degradation when API unavailable
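
One possible shape for the caching and fallback items, sketched with the standard library. Nothing here exists in the project yet: `make_cached_analyzer` is a hypothetical wrapper, the cache is exact-match only (the TODO mentions similar slide patterns, which would need fuzzier keys), and the two analysis functions are stand-ins for the real API call and a local rule-based path.

```python
import hashlib
from typing import Callable, Dict


def make_cached_analyzer(
    analyze: Callable[[str], dict],
    fallback: Callable[[str], dict],
) -> Callable[[str], dict]:
    """Wrap a per-slide analysis call with an exact-match cache and a fallback."""
    cache: Dict[str, dict] = {}

    def cached(slide_text: str) -> dict:
        key = hashlib.sha256(slide_text.encode("utf-8")).hexdigest()
        if key in cache:
            return cache[key]
        try:
            result = analyze(slide_text)
        except Exception:
            # Remote call failed: degrade to the local path and do not cache,
            # so a later retry can still reach the API.
            return fallback(slide_text)
        cache[key] = result
        return result

    return cached


# Stand-ins for the real calls (hypothetical).
calls = []


def fake_llm(text: str) -> dict:
    calls.append(text)
    return {"activity_type": "lecture"}


def broken_llm(text: str) -> dict:
    raise ConnectionError("API unavailable")


def local_rules(text: str) -> dict:
    return {"activity_type": "unknown"}


analyzer = make_cached_analyzer(fake_llm, local_rules)
analyzer("Agenda")
analyzer("Agenda")  # second call is served from the cache

offline = make_cached_analyzer(broken_llm, local_rules)
```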

Feature Enhancements

  • Multi-language support - Detect and handle non-English content
  • Custom LLM providers - Support OpenAI, Anthropic, local models
  • Export formats - Add JSON, HTML, and SCORM output options
  • Template system - Customizable markdown templates for different use cases

Enterprise Features

  • Batch directory processing - Process entire folder hierarchies
  • Git integration - Track changes across presentation versions
  • Compliance tracking - Enhanced detection of regulatory markers
  • Quality metrics - Automated assessment of content quality

Content Quality

# Check markdown formatting and URLs
./scripts/local-content-check.sh

# Markdown linting only
./scripts/local-content-check.sh markdown-only

# URL validation only
./scripts/local-content-check.sh urls-only

👥 Perfect For

Technical Trainers

  • Convert course materials for AI-assisted delivery
  • Create searchable knowledge bases from presentations
  • Generate practice questions and assessments

Instructional Designers

  • Repurpose existing content for new formats
  • Extract learning objectives and outcomes
  • Analyze course structure and flow

Content Teams

  • Build AI training datasets from presentations
  • Create documentation from training materials
  • Generate summaries and abstracts

Developers

  • Process technical presentations for RAG systems
  • Extract code examples and documentation
  • Build knowledge bases for AI assistants

🏗️ Simple Architecture

graph TB
    subgraph "🎯 Single Purpose Design"
        A[📁 Input PPTX Files] --> B[🔍 Extractor]
        B --> C[✨ Formatter]
        C --> D[📄 Output Markdown]
    end

    subgraph "🧠 Core Components"
        B --> B1[Extract Text]
        B --> B2[Extract Notes]
        B --> B3[Detect Patterns]

        C --> C1[Smart Chunking]
        C --> C2[Add Metadata]
        C --> C3[Generate Files]
    end

    subgraph "🔌 Integrations"
        E[🖥️ Claude Desktop] --> F[MCP Server]
        F --> B

        G[🎛️ CLI Interface] --> B
    end
    
    style A fill:#e1f5fe
    style D fill:#e8f5e8
    style B fill:#fff3e0
    style C fill:#f3e5f5

🔧 Configuration

Default settings in config.yaml:

extraction:
  extract_text: true
  extract_notes: true
  extract_images: false  # Coming soon
  
formatting:
  default_chunk_size: 1500
  chunk_overlap: 200
  include_metadata: true
  
output:
  overwrite_existing: false
  create_summary: true
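
To show how optional settings can layer over these defaults, here is a small sketch. The default values are copied from the config above; `merge_config` is a hypothetical helper, and reading the YAML file itself is left out to keep the example dependency-free.

```python
from copy import deepcopy
from typing import Any, Dict

# Defaults mirrored from the config.yaml shown above.
DEFAULTS: Dict[str, Any] = {
    "extraction": {"extract_text": True, "extract_notes": True, "extract_images": False},
    "formatting": {"default_chunk_size": 1500, "chunk_overlap": 200, "include_metadata": True},
    "output": {"overwrite_existing": False, "create_summary": True},
}


def merge_config(defaults: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `defaults` with `overrides` applied, section by section."""
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested sections instead of replacing them wholesale.
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged


config = merge_config(DEFAULTS, {"formatting": {"default_chunk_size": 2000}})
```

Only the overridden key changes; the rest of the `formatting` section keeps its defaults, and `DEFAULTS` itself is left untouched.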

🚀 Roadmap

  • Core PPTX text extraction
  • Instructional design patterns
  • LLM-optimized chunking
  • Rich console interface
  • Comprehensive testing
  • CI/CD pipeline
  • Image extraction and description
  • Table preservation
  • Multi-language support
  • Web interface
  • API endpoint

📚 Documentation

🤝 Contributing

Contributions welcome! This project uses:

  • Automated PR review assignment
  • GitHub Copilot code review
  • Comprehensive test requirements
  • Pre-commit hooks for quality

See CONTRIBUTING.md for guidelines.

📄 License

MIT License - see LICENSE


🎯 The Bottom Line

PPTX Shredder does ONE thing and does it very, very well.

✅ Zero Configuration - Works out of the box
✅ Zero Surprises - Predictable, reliable results
✅ Local Parsing - PPTX files are parsed locally; only intelligent extraction calls the DeepSeek API
✅ Maximum Clarity - Simple workflow, clear output

Built by technical trainers, for technical trainers. πŸŽ“

📧 Questions? 🐛 Found a bug? Open an issue
