Word to Outline (WTO)

Convert Word documents to Outline knowledge base documents with full formatting preservation, image extraction, and metadata handling.

🎯 Project Status: Phase 2 Complete! ✅

Word to Outline now provides complete end-to-end conversion from Word documents to Outline, with an interactive upload workflow.

What's Working Now:

✅ Word Document Extraction: Full content extraction using python-docx and mammoth
✅ Format Preservation: Bold, italic, headers, lists, tables
✅ Image Support: Complete image extraction and upload to Outline as attachments
✅ Original Document Preservation: Source Word document automatically attached for reference
✅ Metadata Extraction: Title, author, word count, creation dates
✅ Multiple Output Formats: HTML, Markdown, and structured JSON
✅ Batch Processing: Process entire directories of Word documents
✅ CLI Interface: Simple command-line operations
✅ Interactive Upload: Collection selection and conflict resolution with detailed comparison
✅ API Integration: Complete Outline API integration, built on the proven Confluence to Outline (CTO) architecture
✅ Attachment Handling: Images uploaded as proper Outline attachments
✅ Enhanced Conflict Resolution: 4-option workflow (overwrite/details/skip/cancel)
✅ Force Mode: Overwrite existing documents when needed
✅ Batch Upload: Upload multiple documents efficiently
✅ Reset Functionality: Clean slate command to remove extracted files
✅ Smart Defaults: Extract defaults to ./input, upload defaults to interactive mode
✅ Environment Configuration: Robust .env file support for API credentials

🚀 Quick Start

Installation

Clone the repository, then set up a virtual environment:

cd WordToOutline
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -r requirements.txt

Configure API access (required for upload):

cp .env.example .env
# Edit .env with your Outline API credentials

Basic Usage

Extract Word documents (defaults to ./input directory):

# Extract from default input directory
python main.py --extract

# Or specify a specific file/directory
python main.py --extract path/to/document.docx
python main.py --extract path/to/documents/

List extracted documents:
```
python main.py --list-extracted
```

Upload to Outline (defaults to interactive mode):

# Interactive upload (default - select document and collection)
python main.py --upload

# Upload specific document by ID
python main.py --upload document_id

# Batch upload all documents
python main.py --upload batch

Reset to clean state:

# Remove all extracted files and start fresh
python main.py --reset

📁 How It Works

Input → Processing → Upload

Word Documents          JSON Holding Files        Outline Documents
    (.docx)         →      (extracted/)         →      (Live in Outline)
      ↓                        ↓                         ↓
[document.docx]    →    [uuid.json]           →    [Outline Document]
      ↓                        ↓                         ↓
   Content              • Document metadata             • Full document created
   Images          →    • HTML content           →      • Collection assignment
   Metadata             • Markdown content              • Image attachments
   Formatting           • Extracted images              • Original document attached
                                                        • Format preservation

Extraction Process

Document Analysis: Uses python-docx to extract metadata and structure
Content Conversion: Uses mammoth for superior HTML conversion
Format Processing: Converts HTML to clean Markdown
Image Extraction: Extracts and saves embedded images
JSON Storage: Creates structured holding files for future upload
Upload Workflow: Creates stub documents, uploads all attachments (images + original), then updates with final content
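
The ordering in the Upload Workflow step (stub first, attachments next, final content last) can be sketched as follows. This is a minimal illustration rather than the project's code: `FakeOutlineClient` and its three methods are hypothetical stand-ins for the real API layer, the link rewriting is an assumption, and the holding-file fields follow the JSON layout shown under Extracted Content Format. The stub presumably exists so that each attachment has a document to belong to before the final text references it.

```python
from dataclasses import dataclass, field


@dataclass
class FakeOutlineClient:
    """Hypothetical in-memory stand-in for an Outline API client."""
    calls: list = field(default_factory=list)
    final_text: str = ""

    def create_document(self, title, collection_id, text=""):
        self.calls.append("create")
        return "doc-1"

    def upload_attachment(self, document_id, file_path):
        self.calls.append("attach")
        return f"/attachments/{file_path.split('/')[-1]}"

    def update_document(self, document_id, text):
        self.calls.append("update")
        self.final_text = text


def upload(client, holding, collection_id, original_path):
    # 1. Create a stub so that attachments have a document to belong to.
    doc_id = client.create_document(holding["metadata"]["title"], collection_id)

    # 2. Upload every image, then the source .docx, collecting the new URLs.
    text = holding["content_markdown"]
    for image in holding["images"]:
        url = client.upload_attachment(doc_id, image["file_path"])
        # Point the image reference at the uploaded attachment (assumed scheme).
        text = text.replace(image["extracted_filename"], url)
    original_url = client.upload_attachment(doc_id, original_path)

    # 3. Replace the stub with the final content.
    text += f"\n\n[Original document]({original_url})"
    client.update_document(doc_id, text)
    return doc_id
```

Passing the client in as an argument keeps the sequencing testable without network access.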

📂 Project Structure

WordToOutline/
├── input/                  # Place Word documents here
├── extracted/              # Generated JSON holding files
├── images/                 # Extracted images (organized by document ID)
├── libs/
│   ├── word_extractor.py   # Core Word document processing
│   ├── config.py           # Configuration management
│   ├── logger.py           # Logging and progress tracking
│   └── __pycache__/
├── main.py                 # CLI interface
├── requirements.txt        # Python dependencies
├── PLAN.md                 # Detailed implementation plan
└── README.md              # This file

🔧 Configuration

API Configuration (Required for Upload)

Get Outline API credentials:
- Log into your Outline instance
- Go to Settings > API Tokens
- Create a new token with appropriate permissions

Configure using .env file (recommended):

cp .env.example .env
# Edit .env with your credentials:
# OUTLINE_API_TOKEN=your_outline_api_token
# OUTLINE_API_URL=https://your-outline-instance.com

Or set environment variables directly:

export OUTLINE_API_TOKEN="your_outline_api_token"
export OUTLINE_API_URL="https://your-outline-instance.com"
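
For readers curious how the two settings might be resolved at runtime, here is a minimal sketch using only the standard library. The function names are hypothetical, and the precedence (real environment variables override `.env` values) is an assumption, not documented behavior of this tool.

```python
import os


def parse_env_file(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_credentials(env_text="", environ=None):
    """Return (token, url); environment variables win over .env values."""
    environ = os.environ if environ is None else environ
    merged = {**parse_env_file(env_text), **environ}
    token = merged.get("OUTLINE_API_TOKEN")
    url = merged.get("OUTLINE_API_URL")
    if not token or not url:
        raise RuntimeError("Set OUTLINE_API_TOKEN and OUTLINE_API_URL before uploading")
    # Drop a trailing slash so endpoint paths can be appended safely.
    return token, url.rstrip("/")
```

Failing early when either value is missing gives a clearer error than a rejected API call later in the upload.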

Command Line Options

# Main operations (choose one)
--extract [PATH]            # Extract from file/directory (default: ./input)
--list-extracted            # List all extracted documents  
--upload [DOCUMENT_ID]      # Upload to Outline (default: interactive)
--reset                     # Reset local directories to clean state

# Extraction options
--no-images                 # Skip image extraction
--no-formatting             # Skip formatting preservation

# Directory configuration
--input-dir DIR             # Input directory (default: input)
--extracted-dir DIR         # Output directory (default: extracted)
--images-dir DIR            # Images directory (default: images)

# API configuration (alternative to .env file)
--api-key TOKEN             # Outline API token
--api-url URL               # Outline API URL

# Logging
--log-level LEVEL           # DEBUG, INFO, WARNING, ERROR (default: INFO)
--log-file FILE             # Log to file instead of console
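
The optional values in `--extract [PATH]` and `--upload [DOCUMENT_ID]` map naturally onto `argparse` with `nargs="?"`. The sketch below covers only a subset of the options and is an illustration, not the contents of `main.py`; in particular, the `"interactive"` sentinel value is an assumption.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="main.py")
    # Exactly one main operation per run.
    ops = parser.add_mutually_exclusive_group(required=True)
    # nargs="?" makes the value optional; const applies when the flag is bare.
    ops.add_argument("--extract", nargs="?", const="./input", metavar="PATH")
    ops.add_argument("--list-extracted", action="store_true")
    ops.add_argument("--upload", nargs="?", const="interactive",
                     metavar="DOCUMENT_ID")
    ops.add_argument("--reset", action="store_true")
    parser.add_argument("--no-images", action="store_true")
    parser.add_argument("--log-level", default="INFO",
                        choices=["DEBUG", "INFO", "WARNING", "ERROR"])
    return parser
```

With this layout, `--extract` alone yields `./input`, `--extract docs/a.docx` yields the given path, and combining two main operations is rejected by the parser.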

🚀 Performance Tips & Rate Limiting

For Large Document Uploads:

The system includes automatic rate limiting protection with exponential backoff
Standard 1-second delays between image uploads help prevent rate limiting
Batch delays of 3 seconds every 10 uploads provide additional protection
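
The pacing described above can be pictured with a short sketch. The function and parameter names are hypothetical, and whether the batch pause replaces or adds to the regular delay is an assumption (here it replaces it); the default values mirror the 1-second and 3-second figures in the list.

```python
import time


def paced_uploads(items, upload_one, sleep=time.sleep,
                  delay=1.0, batch_size=10, batch_delay=3.0):
    """Upload a list of items one by one, pausing between uploads."""
    results = []
    for index, item in enumerate(items, start=1):
        results.append(upload_one(item))
        if index == len(items):
            break  # nothing left to wait for
        # Longer pause after every full batch, short pause otherwise.
        sleep(batch_delay if index % batch_size == 0 else delay)
    return results
```

Because `sleep` is injectable, the pacing can be checked in tests without actually waiting.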

Optional Performance Optimization: If you have a private Outline instance and want faster uploads, you can reduce the client-side throttling by lowering the delay values in libs/api_upload_manager.py:

Set regular delay to 0.1 seconds for faster uploads
Set batch delay to 1.0 seconds for minimal throttling
⚠️ Warning: Only recommended for private instances to avoid overwhelming public servers

Upload Strategies:

Interactive Mode: Best for selective document uploads with control
Batch Mode: Efficient for uploading many documents at once
Individual Uploads: Use specific document IDs for targeted uploads

⚠️ Known Issues

1. Rate Limiting During Large Uploads

Issue: Large documents with many images may experience significant delays due to API rate limiting.

Impact: Upload times can be extended with exponential backoff delays (5s, 10s, 20s, 40s between retries).
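
The 5s, 10s, 20s, 40s schedule is a doubling backoff with a 5-second base. A minimal sketch, assuming a hypothetical `RateLimited` exception that the upload code would raise when the server signals a rate limit (for example with HTTP 429):

```python
import time


class RateLimited(Exception):
    """Hypothetical error raised when the server reports a rate limit."""


def with_backoff(call, sleep=time.sleep, base=5.0, retries=4):
    """Retry call() on RateLimited, waiting base, 2*base, 4*base, ... seconds."""
    for attempt in range(retries + 1):
        try:
            return call()
        except RateLimited:
            if attempt == retries:
                raise  # out of retries, surface the error
            sleep(base * 2 ** attempt)
```

In the worst case this waits 5 + 10 + 20 + 40 = 75 seconds for a single upload before giving up, which is why image-heavy documents can take minutes.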

Administrator Solution:

Temporarily disable rate limiting on your Outline instance during bulk upload sessions
This can reduce upload times from minutes to seconds for image-heavy documents
Remember to re-enable rate limiting after bulk operations complete

User Workaround:

Upload documents during low-traffic periods
Consider breaking very large documents into smaller sections

2. Modified Images in Word Documents

Issue: Images that have been edited or modified within Microsoft Word (such as adding highlights, annotations, or effects) may not display exactly as they appeared in Word.

Behavior:

The original unmodified image will be extracted and uploaded
Word's modifications (highlights, annotations, effects) will appear as separate overlay images
This results in multiple image attachments in Outline instead of a single modified image

Workaround:

For critical visual fidelity, edit images in external image editing software before inserting into Word
Alternatively, take screenshots of modified images in Word and replace them manually in Outline

Interactive Features

Enhanced Upload Workflow:

📋 Collection Selection: Browse and select target collections
🔍 Document Comparison: Detailed metadata comparison for conflicts
⚡ Conflict Resolution: 4 options - overwrite, view details, skip, or cancel
📦 Batch Processing: Upload multiple documents with progress tracking
🛡️ Safe Operations: Confirmation prompts for destructive actions
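
The four-option conflict prompt can be modeled as a small loop in which "view details" prints a comparison and asks again. This is a sketch under assumptions: the single-letter keys, the prompt text, and the function name are hypothetical, not the tool's actual interface.

```python
def resolve_conflict(existing, incoming, ask=input, show=print):
    """Return 'overwrite', 'skip', or 'cancel'; 'details' re-prompts."""
    choices = {"o": "overwrite", "d": "details", "s": "skip", "c": "cancel"}
    while True:
        answer = ask("[o]verwrite / [d]etails / [s]kip / [c]ancel: ")
        action = choices.get(answer.strip().lower()[:1])
        if action is None:
            show("Please answer o, d, s, or c.")
        elif action == "details":
            # Show both versions field by field, then ask again.
            for key in sorted(set(existing) | set(incoming)):
                show(f"{key}: {existing.get(key)!r} -> {incoming.get(key)!r}")
        else:
            return action
```

Only the three terminal choices leave the loop, so an unrecognized answer never triggers an overwrite.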

📝 Extracted Content Format

Each extracted document creates a JSON file with:

{
  "document_id": "uuid-string",
  "metadata": {
    "filename": "document.docx",
    "title": "Document Title", 
    "author": "Author Name",
    "word_count": 150,
    "paragraph_count": 12,
    "created": "2024-01-01T12:00:00",
    "modified": "2024-01-02T12:00:00"
  },
  "content_html": "<h1>Title</h1><p>Content...</p>",
  "content_markdown": "# Title\n\nContent...",
  "images": [
    {
      "image_id": "uuid",
      "original_filename": "image.png", 
      "extracted_filename": "uuid.png",
      "file_path": "images/doc-uuid/uuid.png",
      "content_type": "image/png",
      "size_bytes": 12345,
      "width": 800,
      "height": 600
    }
  ],
  "extraction_timestamp": "2024-01-01T12:00:00"
}
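
A holding file in this shape can be read back with the standard `json` module. The helper below is a hypothetical sketch (not part of the project) that checks the top-level keys listed above and returns a short summary, falling back to the filename when the title is empty.

```python
import json

REQUIRED_KEYS = {"document_id", "metadata", "content_html",
                 "content_markdown", "images", "extraction_timestamp"}


def summarize_holding(raw_json):
    """Validate a holding file's top-level keys and return a short summary."""
    data = json.loads(raw_json)
    missing = REQUIRED_KEYS - set(data)
    if missing:
        raise ValueError(f"holding file is missing keys: {sorted(missing)}")
    metadata = data["metadata"]
    return {
        "id": data["document_id"],
        "title": metadata.get("title") or metadata["filename"],
        "images": len(data["images"]),
        "image_bytes": sum(img.get("size_bytes", 0) for img in data["images"]),
    }
```

A check like this before upload catches truncated or hand-edited holding files early.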

🔮 Recent Updates & Bug Fixes

Latest Enhancements ✨

Original Document Attachment: Source Word documents now automatically attached to Outline pages
Complete Document Preservation: Users get converted content AND access to original source
Smart Defaults: --extract now defaults to ./input directory
Interactive Default: --upload now defaults to interactive mode
Environment Configuration: Fixed .env file parsing for seamless API setup
Enhanced Documentation: Complete feature coverage and usage examples
Improved User Experience: Streamlined workflows with sensible defaults

Real-World Testing Fixes 🐛

Image Upload Fix: Resolved 'None' filename errors causing 400 Bad Request failures
Rate Limiting: Added retry logic with exponential backoff for API rate limits
Interactive Batch Upload: New workflow for processing documents one-at-a-time with c/s/e options
Title Override: Added document title customization step in interactive upload
Clean Collection Selection: Simplified display showing only collection titles with NEW option

Phase 3: Advanced Features (Future)

Word Template Processing: Handle complex document templates
Collaboration Features: Multi-user document processing
Advanced Formatting: Enhanced support for complex layouts
Integration Options: Web interface, desktop app
Workflow Automation: Watch folders, scheduled processing

🏗️ Built With

python-docx: Word document parsing and metadata extraction
mammoth: Superior HTML conversion from Word documents
Pillow: Image processing and optimization
requests: HTTP client for Outline API integration

🤝 Development

Architecture Notes

This project leverages the proven architecture from the Confluence to Outline (CTO) project, reusing:

✅ Configuration management system
✅ Logging and progress tracking
✅ Error handling patterns
✅ API integration components (60-70% code reuse)

Testing

# Create test document (if needed)
python create_test_doc.py

# Extract from default input directory
python main.py --extract

# View extracted documents
python main.py --list-extracted

# Test interactive upload (requires API setup)
python main.py --upload

# Reset and start fresh
python main.py --reset

Common Usage Patterns

# Quick workflow: extract and upload
python main.py --extract                    # Extract all from ./input
python main.py --upload                     # Interactive upload

# Batch processing workflow  
python main.py --extract path/to/documents/ # Extract directory
python main.py --upload batch               # Upload all documents

# Clean slate workflow
python main.py --reset                      # Start fresh
python main.py --extract                    # Extract again

📄 License

[Your License Here]

🙋‍♂️ Support

For questions about:

Phase 1 (Complete): Word document extraction and processing
Phase 2 (Complete): Outline API integration and upload functionality

Word to Outline - Making knowledge transfer from Word documents to Outline simple and reliable.
