rioncm/WordToOutline

Word to Outline (WTO)

Convert Word documents to Outline knowledge base documents with full formatting preservation, image extraction, and metadata handling.

🎯 Project Status: Phase 2 Complete! ✅

Word to Outline now provides complete end-to-end conversion from Word documents to Outline, with an interactive upload workflow.

What's Working Now:

  • ✅ Word Document Extraction: Full content extraction using python-docx and mammoth
  • ✅ Format Preservation: Bold, italic, headers, lists, tables
  • ✅ Image Support: Complete image extraction and upload to Outline as attachments
  • ✅ Original Document Preservation: Source Word document automatically attached for reference
  • ✅ Metadata Extraction: Title, author, word count, creation dates
  • ✅ Multiple Output Formats: HTML, Markdown, and structured JSON
  • ✅ Batch Processing: Process entire directories of Word documents
  • ✅ CLI Interface: Simple command-line operations
  • ✅ Interactive Upload: Collection selection and conflict resolution with detailed comparison
  • ✅ API Integration: Complete Outline API integration built on the proven Confluence to Outline (CTO) architecture
  • ✅ Attachment Handling: Images uploaded as proper Outline attachments
  • ✅ Enhanced Conflict Resolution: 4-option workflow (overwrite/details/skip/cancel)
  • ✅ Force Mode: Overwrite existing documents when needed
  • ✅ Batch Upload: Upload multiple documents efficiently
  • ✅ Reset Functionality: Clean slate command to remove extracted files
  • ✅ Smart Defaults: Extract defaults to ./input, upload defaults to interactive mode
  • ✅ Environment Configuration: Robust .env file support for API credentials

🚀 Quick Start

Installation

  1. Clone and setup:

    # After cloning the repository, enter it and set up a virtual environment
    cd WordToOutline
    python -m venv .venv
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
    pip install -r requirements.txt
  2. Configure API access (required for upload):

    cp .env.example .env
    # Edit .env with your Outline API credentials

Basic Usage

  1. Extract Word documents (defaults to ./input directory):

    # Extract from default input directory
    python main.py --extract
    
    # Or specify a specific file/directory
    python main.py --extract path/to/document.docx
    python main.py --extract path/to/documents/
  2. List extracted documents:

    python main.py --list-extracted
  3. Upload to Outline (defaults to interactive mode):

    # Interactive upload (default - select document and collection)
    python main.py --upload
    
    # Upload specific document by ID
    python main.py --upload document_id
    
    # Batch upload all documents
    python main.py --upload batch
  4. Reset to clean state:

    # Remove all extracted files and start fresh
    python main.py --reset

How It Works

Input → Processing → Upload

Word Documents          JSON Holding Files        Outline Documents
    (.docx)         →      (extracted/)         →      (Live in Outline)
      ↓                        ↓                         ↓
[document.docx]    →    [uuid.json]           →    [Outline Document]
      ↓                        ↓                         ↓
   Content              • Document metadata             • Full document created
   Images          →    • HTML content           →      • Collection assignment
   Metadata             • Markdown content              • Image attachments
   Formatting           • Extracted images              • Original document attached
                                                        • Format preservation

Extraction Process

  1. Document Analysis: Uses python-docx to extract metadata and structure
  2. Content Conversion: Uses mammoth to convert the document body to clean HTML
  3. Format Processing: Converts HTML to clean Markdown
  4. Image Extraction: Extracts and saves embedded images
  5. JSON Storage: Creates structured holding files for future upload
  6. Upload Workflow: Creates stub documents, uploads all attachments (images + original), then updates with final content
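
Step 4 can be illustrated without third-party packages, because a .docx file is a ZIP archive that stores embedded images under word/media/. The sketch below is not the project's implementation (which relies on python-docx and mammoth); the function name `extract_images` and the flat output layout are assumptions made for illustration.

```python
import zipfile
from pathlib import Path


def extract_images(docx_path, out_dir):
    """Copy every file under word/media/ in a .docx archive into out_dir.

    Hypothetical helper for illustration; returns the list of written paths.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            # Embedded images are stored as word/media/<file> inside the archive.
            if name.startswith("word/media/") and not name.endswith("/"):
                target = out / Path(name).name
                target.write_bytes(archive.read(name))
                written.append(target)
    return written
```

Calling `extract_images("input/document.docx", "images/some-document-id")` would place the raw image files in that directory; the directory name here is only an example.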

📂 Project Structure

WordToOutline/
├── input/                  # Place Word documents here
├── extracted/              # Generated JSON holding files
├── images/                 # Extracted images (organized by document ID)
├── libs/
│   ├── word_extractor.py   # Core Word document processing
│   ├── config.py           # Configuration management
│   ├── logger.py           # Logging and progress tracking
│   └── __pycache__/
├── main.py                 # CLI interface
├── requirements.txt        # Python dependencies
├── PLAN.md                 # Detailed implementation plan
└── README.md               # This file

🔧 Configuration

API Configuration (Required for Upload)

  1. Get Outline API credentials:

    • Log into your Outline instance
    • Go to Settings > API Tokens
    • Create a new token with appropriate permissions
  2. Configure using .env file (recommended):

    cp .env.example .env
    # Edit .env with your credentials:
    # OUTLINE_API_TOKEN=your_outline_api_token
    # OUTLINE_API_URL=https://your-outline-instance.com
  3. Or set environment variables directly:

    export OUTLINE_API_TOKEN="your_outline_api_token"
    export OUTLINE_API_URL="https://your-outline-instance.com"
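
As a rough illustration of how the two settings might be resolved, the sketch below parses .env-style text and merges it with the process environment. The function names are hypothetical, and the precedence (real environment variables override .env values) is an assumption; this README does not document the project's actual behavior.

```python
import os


def parse_env(text):
    """Parse KEY=VALUE lines, ignoring blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_settings(dotenv_text, environ=None):
    """Merge .env values with environment variables (assumed to take precedence)."""
    environ = os.environ if environ is None else environ
    merged = {**parse_env(dotenv_text), **environ}
    required = ("OUTLINE_API_TOKEN", "OUTLINE_API_URL")
    missing = [key for key in required if not merged.get(key)]
    if missing:
        raise ValueError("missing settings: " + ", ".join(missing))
    return {
        "token": merged["OUTLINE_API_TOKEN"],
        # Drop a trailing slash so request paths can be appended consistently.
        "url": merged["OUTLINE_API_URL"].rstrip("/"),
    }
```

Failing early on a missing token or URL keeps the error close to the configuration step instead of surfacing later as a failed upload.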

Command Line Options

# Main operations (choose one)
--extract [PATH]            # Extract from file/directory (default: ./input)
--list-extracted            # List all extracted documents  
--upload [DOCUMENT_ID]      # Upload to Outline (default: interactive)
--reset                     # Reset local directories to clean state

# Extraction options
--no-images                 # Skip image extraction
--no-formatting             # Skip formatting preservation

# Directory configuration
--input-dir DIR             # Input directory (default: input)
--extracted-dir DIR         # Output directory (default: extracted)
--images-dir DIR            # Images directory (default: images)

# API configuration (alternative to .env file)
--api-key TOKEN             # Outline API token
--api-url URL               # Outline API URL

# Logging
--log-level LEVEL           # DEBUG, INFO, WARNING, ERROR (default: INFO)
--log-file FILE             # Log to file instead of console
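
The optional-value behavior of --extract and --upload maps naturally onto argparse's `nargs="?"` with a `const`. The following is a sketch of how the options above could be declared, not a copy of main.py; `build_parser` and the `"interactive"` sentinel value are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="main.py")
    ops = parser.add_mutually_exclusive_group(required=True)
    # nargs="?" makes the value optional; const is used when the flag appears bare.
    ops.add_argument("--extract", nargs="?", const="input", metavar="PATH")
    ops.add_argument("--list-extracted", action="store_true")
    ops.add_argument("--upload", nargs="?", const="interactive", metavar="DOCUMENT_ID")
    ops.add_argument("--reset", action="store_true")
    parser.add_argument("--no-images", action="store_true")
    parser.add_argument("--no-formatting", action="store_true")
    parser.add_argument("--input-dir", default="input")
    parser.add_argument("--extracted-dir", default="extracted")
    parser.add_argument("--images-dir", default="images")
    parser.add_argument("--api-key")
    parser.add_argument("--api-url")
    parser.add_argument("--log-level", default="INFO",
                        choices=["DEBUG", "INFO", "WARNING", "ERROR"])
    parser.add_argument("--log-file")
    return parser
```

With this layout, `--upload` alone yields the sentinel, `--upload batch` yields `"batch"`, and combining two main operations is rejected by the mutually exclusive group.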

🚀 Performance Tips & Rate Limiting

For Large Document Uploads:

  • The system includes automatic rate limiting protection with exponential backoff
  • Standard 1-second delays between image uploads help prevent rate limiting
  • Batch delays of 3 seconds every 10 uploads provide additional protection
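
The pacing described above can be sketched as a small loop. This is an illustration rather than the code in libs/api_upload_manager.py: `pace_uploads` is a hypothetical name, and it assumes the 3-second batch pause replaces (rather than adds to) the regular 1-second pause.

```python
import time


def pace_uploads(items, upload, delay=1.0, batch_size=10, batch_delay=3.0,
                 sleep=time.sleep):
    """Call upload(item) for each item in a sequence, pausing between calls.

    Sleeps batch_delay seconds after every batch_size-th upload and delay
    seconds after the others; no pause follows the final item.
    """
    results = []
    total = len(items)
    for index, item in enumerate(items, start=1):
        results.append(upload(item))
        if index == total:
            break
        sleep(batch_delay if index % batch_size == 0 else delay)
    return results
```

Passing `sleep` as a parameter makes the schedule easy to inspect in tests without actually waiting.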

Optional Performance Optimization: If you have a private Outline instance and want faster uploads, you can temporarily shorten the client-side delays by modifying the delay values in libs/api_upload_manager.py:

  • Set regular delay to 0.1 seconds for faster uploads
  • Set batch delay to 1.0 seconds for minimal throttling
  • ⚠️ Warning: Only recommended for private instances to avoid overwhelming public servers

Upload Strategies:

  • Interactive Mode: Best for selective document uploads with control
  • Batch Mode: Efficient for uploading many documents at once
  • Individual Uploads: Use specific document IDs for targeted uploads

⚠️ Known Issues

1. Rate Limiting During Large Uploads

Issue: Large documents with many images may experience significant delays due to API rate limiting.

Impact: Upload times can be extended with exponential backoff delays (5s, 10s, 20s, 40s between retries).

Administrator Solution:

  • Temporarily disable rate limiting on your Outline instance during bulk upload sessions
  • This can reduce upload times from minutes to seconds for image-heavy documents
  • Remember to re-enable rate limiting after bulk operations complete

User Workaround:

  • Upload documents during low-traffic periods
  • Consider breaking very large documents into smaller sections
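
The retry schedule mentioned under Impact (5s, 10s, 20s, 40s) corresponds to a base delay of 5 seconds that doubles on each retry. A minimal sketch follows, assuming the caller's request function raises a `RateLimited` exception when the server responds with HTTP 429; both that exception and `with_backoff` are hypothetical names, not the project's API.

```python
import time


class RateLimited(Exception):
    """Raised by the caller's request function on a rate-limit response (assumed)."""


def with_backoff(request, base_delay=5.0, max_retries=4, sleep=time.sleep):
    """Run request(), retrying on RateLimited with delays that double each time."""
    for attempt in range(max_retries + 1):
        try:
            return request()
        except RateLimited:
            if attempt == max_retries:
                raise
            # attempt 0 waits 5s, then 10s, 20s, 40s with the default settings
            sleep(base_delay * 2 ** attempt)
```

With the defaults, a request that keeps hitting the limit waits 75 seconds in total before the error is re-raised, which is why image-heavy documents can take minutes.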

2. Modified Images in Word Documents

Issue: Images that have been edited or modified within Microsoft Word (such as adding highlights, annotations, or effects) may not display exactly as they appeared in Word.

Behavior:

  • The original unmodified image will be extracted and uploaded
  • Word's modifications (highlights, annotations, effects) will appear as separate overlay images
  • This results in multiple image attachments in Outline instead of a single modified image

Workaround:

  • For critical visual fidelity, edit images in external image editing software before inserting into Word
  • Alternatively, take screenshots of modified images in Word and replace them manually in Outline

Interactive Features

Enhanced Upload Workflow:

  • 📋 Collection Selection: Browse and select target collections
  • 🔍 Document Comparison: Detailed metadata comparison for conflicts
  • ⚡ Conflict Resolution: 4 options - overwrite, view details, skip, or cancel
  • 📦 Batch Processing: Upload multiple documents with progress tracking
  • 🛡️ Safe Operations: Confirmation prompts for destructive actions
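
The four-option conflict prompt can be modeled as a loop in which "view details" shows the comparison and asks again, while the other three choices end the prompt. The sketch below is an assumption about the shape of that flow; the key letters, `Action` enum, and `resolve_conflict` function are illustrative and not taken from the project.

```python
from enum import Enum


class Action(Enum):
    OVERWRITE = "overwrite"
    DETAILS = "details"
    SKIP = "skip"
    CANCEL = "cancel"


_KEYS = {"o": Action.OVERWRITE, "d": Action.DETAILS,
         "s": Action.SKIP, "c": Action.CANCEL}


def resolve_conflict(title, describe, ask=input, show=print):
    """Prompt until the user picks a final action for a conflicting document.

    Choosing details prints describe() and asks again, so only OVERWRITE,
    SKIP, or CANCEL is ever returned. Unrecognized input re-prompts.
    """
    prompt = f"'{title}' already exists. [o]verwrite / [d]etails / [s]kip / [c]ancel: "
    while True:
        answer = ask(prompt)
        action = _KEYS.get(answer.strip().lower()[:1])
        if action is Action.DETAILS:
            show(describe())
        elif action is not None:
            return action
```

Injecting `ask` and `show` keeps the decision logic testable without a terminal.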

Extracted Content Format

Each extracted document creates a JSON file with:

{
  "document_id": "uuid-string",
  "metadata": {
    "filename": "document.docx",
    "title": "Document Title", 
    "author": "Author Name",
    "word_count": 150,
    "paragraph_count": 12,
    "created": "2024-01-01T12:00:00",
    "modified": "2024-01-02T12:00:00"
  },
  "content_html": "<h1>Title</h1><p>Content...</p>",
  "content_markdown": "# Title\n\nContent...",
  "images": [
    {
      "image_id": "uuid",
      "original_filename": "image.png", 
      "extracted_filename": "uuid.png",
      "file_path": "images/doc-uuid/uuid.png",
      "content_type": "image/png",
      "size_bytes": 12345,
      "width": 800,
      "height": 600
    }
  ],
  "extraction_timestamp": "2024-01-01T12:00:00"
}
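
A holding file in this shape can be read back into typed objects before upload. The sketch below uses only field names that appear in the JSON above; the dataclasses and `load_holding_file` are hypothetical and cover just a subset of the fields.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class ExtractedImage:
    image_id: str
    file_path: str
    content_type: str
    size_bytes: int


@dataclass
class ExtractedDocument:
    document_id: str
    title: str
    content_markdown: str
    images: List[ExtractedImage]


def load_holding_file(text):
    """Parse holding-file JSON into typed objects (KeyError if a field is missing)."""
    data = json.loads(text)
    images = [
        ExtractedImage(
            image_id=item["image_id"],
            file_path=item["file_path"],
            content_type=item["content_type"],
            size_bytes=item["size_bytes"],
        )
        for item in data.get("images", [])
    ]
    return ExtractedDocument(
        document_id=data["document_id"],
        title=data["metadata"]["title"],
        content_markdown=data["content_markdown"],
        images=images,
    )
```

Reading the file contents with `Path(...).read_text(encoding="utf-8")` and passing the string in keeps the parser independent of where the holding files are stored.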

🔮 Recent Updates & Bug Fixes

Latest Enhancements ✨

  • Original Document Attachment: Source Word documents now automatically attached to Outline pages
  • Complete Document Preservation: Users get converted content AND access to original source
  • Smart Defaults: --extract now defaults to ./input directory
  • Interactive Default: --upload now defaults to interactive mode
  • Environment Configuration: Fixed .env file parsing for seamless API setup
  • Enhanced Documentation: Complete feature coverage and usage examples
  • Improved User Experience: Streamlined workflows with sensible defaults

Real-World Testing Fixes 🐛

  • Image Upload Fix: Resolved 'None' filename errors causing 400 Bad Request failures
  • Rate Limiting: Added retry logic with exponential backoff for API rate limits
  • Interactive Batch Upload: New workflow for processing documents one-at-a-time with c/s/e options
  • Title Override: Added document title customization step in interactive upload
  • Clean Collection Selection: Simplified display showing only collection titles with NEW option

Phase 3: Advanced Features (Future)

  • Word Template Processing: Handle complex document templates
  • Collaboration Features: Multi-user document processing
  • Advanced Formatting: Enhanced support for complex layouts
  • Integration Options: Web interface, desktop app
  • Workflow Automation: Watch folders, scheduled processing

🏗️ Built With

  • python-docx: Word document parsing and metadata extraction
  • mammoth: Word-to-HTML conversion
  • Pillow: Image processing and optimization
  • requests: HTTP client for Outline API integration

🤝 Development

Architecture Notes

This project leverages the proven architecture from the Confluence to Outline (CTO) project, reusing:

  • ✅ Configuration management system
  • ✅ Logging and progress tracking
  • ✅ Error handling patterns
  • ✅ API integration components (60-70% code reuse)

Testing

# Create test document (if needed)
python create_test_doc.py

# Extract from default input directory
python main.py --extract

# View extracted documents
python main.py --list-extracted

# Test interactive upload (requires API setup)
python main.py --upload

# Reset and start fresh
python main.py --reset

Common Usage Patterns

# Quick workflow: extract and upload
python main.py --extract                    # Extract all from ./input
python main.py --upload                     # Interactive upload

# Batch processing workflow  
python main.py --extract path/to/documents/ # Extract directory
python main.py --upload batch               # Upload all documents

# Clean slate workflow
python main.py --reset                      # Start fresh
python main.py --extract                    # Extract again

📄 License

[Your License Here]

🙋‍♂️ Support

For questions about:

  • Phase 1 (Complete): Word document extraction and processing
  • Phase 2 (Complete): Outline API integration and upload functionality

Word to Outline - Making knowledge transfer from Word documents to Outline simple and reliable.
