Convert Word documents to Outline knowledge base documents with full formatting preservation, image extraction, and metadata handling.
Word to Outline now provides complete end-to-end conversion from Word documents to Outline, with an interactive upload workflow.
- Word Document Extraction: Full content extraction using python-docx and mammoth
- Format Preservation: Bold, italic, headers, lists, tables
- Image Support: Complete image extraction and upload to Outline as attachments
- Original Document Preservation: Source Word document automatically attached for reference
- Metadata Extraction: Title, author, word count, creation dates
- Multiple Output Formats: HTML, Markdown, and structured JSON
- Batch Processing: Process entire directories of Word documents
- CLI Interface: Simple command-line operations
- Interactive Upload: Collection selection and conflict resolution with detailed comparison
- API Integration: Complete Outline API integration built on the proven Confluence to Outline (CTO) architecture
- Attachment Handling: Images uploaded as proper Outline attachments
- Enhanced Conflict Resolution: 4-option workflow (overwrite/details/skip/cancel)
- Force Mode: Overwrite existing documents when needed
- Batch Upload: Upload multiple documents efficiently
- Reset Functionality: Clean slate command to remove extracted files
- Smart Defaults: Extract defaults to ./input, upload defaults to interactive mode
- Environment Configuration: Robust .env file support for API credentials
- Clone and set up:
    cd WordToOutline
    python -m venv .venv
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
    pip install -r requirements.txt
- Configure API access (required for upload):
    cp .env.example .env
    # Edit .env with your Outline API credentials
- Extract Word documents (defaults to ./input directory):
    # Extract from default input directory
    python main.py --extract
    # Or specify a specific file/directory
    python main.py --extract path/to/document.docx
    python main.py --extract path/to/documents/
- List extracted documents:
    python main.py --list-extracted
- Upload to Outline (defaults to interactive mode):
    # Interactive upload (default - select document and collection)
    python main.py --upload
    # Upload specific document by ID
    python main.py --upload document_id
    # Batch upload all documents
    python main.py --upload batch
- Reset to clean state:
    # Remove all extracted files and start fresh
    python main.py --reset
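The optional values in the steps above (--extract falling back to ./input, --upload falling back to interactive mode) map naturally onto argparse's optional-value form. The following is a minimal sketch under that assumption; `build_parser` and the "interactive" sentinel are illustrative names and are not taken from the project's main.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented main operations."""
    parser = argparse.ArgumentParser(prog="main.py")
    # The four main operations are documented as "choose one".
    ops = parser.add_mutually_exclusive_group(required=True)
    # nargs="?" makes the value optional; const is used when the flag
    # appears without a value.
    ops.add_argument("--extract", nargs="?", const="./input", metavar="PATH")
    ops.add_argument("--list-extracted", action="store_true")
    ops.add_argument("--upload", nargs="?", const="interactive",
                     metavar="DOCUMENT_ID")
    ops.add_argument("--reset", action="store_true")
    return parser


args = build_parser().parse_args(["--extract"])
print(args.extract)  # ./input
```

With this shape, `--upload batch` and `--upload document_id` arrive as plain strings, so the caller can branch on "interactive", "batch", or a document ID.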
Word Documents       JSON Holding Files        Outline Documents
   (.docx)        →     (extracted/)        →   (Live in Outline)
      ↓                      ↓                         ↓
[document.docx]   →     [uuid.json]         →  [Outline Document]
      ↓                      ↓                         ↓
  Content            • Document metadata       • Full document created
  Images          →  • HTML content         →  • Collection assignment
  Metadata           • Markdown content        • Image attachments
  Formatting         • Extracted images        • Original document attached
                                               • Format preservation
- Document Analysis: Uses python-docx to extract metadata and structure
- Content Conversion: Uses mammoth for superior HTML conversion
- Format Processing: Converts HTML to clean Markdown
- Image Extraction: Extracts and saves embedded images
- JSON Storage: Creates structured holding files for future upload
- Upload Workflow: Creates stub documents, uploads all attachments (images + original), then updates with final content
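The ordering in the last step (stub first, attachments second, final content last) can be sketched as follows. This is an illustration only: `FakeOutlineClient` and its methods are hypothetical stand-ins rather than the project's upload manager or the Outline API, the `attachment://` URLs are placeholders, and the sketch assumes the Markdown refers to images by their local `file_path`.

```python
from dataclasses import dataclass, field


@dataclass
class FakeOutlineClient:
    """In-memory stand-in for an API client; every method is hypothetical."""
    calls: list = field(default_factory=list)

    def create_document(self, title, collection_id):
        self.calls.append(("create", title))
        return "doc-1"

    def upload_attachment(self, document_id, path):
        self.calls.append(("attach", path))
        return f"attachment://{path}"  # placeholder URL, not a real scheme

    def update_document(self, document_id, text):
        self.calls.append(("update", document_id, text))


def upload(client, holding, collection_id, original_path):
    # 1. Create a stub so the attachments have a document to belong to.
    doc_id = client.create_document(holding["metadata"]["title"], collection_id)
    # 2. Upload every image, then the original Word file.
    markdown = holding["content_markdown"]
    for image in holding["images"]:
        url = client.upload_attachment(doc_id, image["file_path"])
        # Assumption: point the Markdown at the uploaded copy.
        markdown = markdown.replace(image["file_path"], url)
    client.upload_attachment(doc_id, original_path)
    # 3. Replace the stub body with the final content.
    client.update_document(doc_id, markdown)
    return doc_id
```

Creating the stub first means each attachment can be tied to a document ID before the final body, which references those attachments, is written.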
WordToOutline/
├── input/                  # Place Word documents here
├── extracted/              # Generated JSON holding files
├── images/                 # Extracted images (organized by document ID)
├── libs/
│   ├── word_extractor.py   # Core Word document processing
│   ├── config.py           # Configuration management
│   ├── logger.py           # Logging and progress tracking
│   └── __pycache__/
├── main.py                 # CLI interface
├── requirements.txt        # Python dependencies
├── PLAN.md                 # Detailed implementation plan
└── README.md               # This file
- Get Outline API credentials:
  - Log into your Outline instance
  - Go to Settings > API Tokens
  - Create a new token with appropriate permissions
- Configure using .env file (recommended):
    cp .env.example .env
    # Edit .env with your credentials:
    # OUTLINE_API_TOKEN=your_outline_api_token
    # OUTLINE_API_URL=https://your-outline-instance.com
- Or set environment variables directly:
    export OUTLINE_API_TOKEN="your_outline_api_token"
    export OUTLINE_API_URL="https://your-outline-instance.com"
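To illustrate how the two configuration routes might be combined, here is a minimal standard-library sketch. The function names are hypothetical, and the precedence (real environment variables override the .env file) is an assumption rather than documented behaviour of libs/config.py.

```python
import os
from pathlib import Path


def parse_env_file(path):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    file = Path(path)
    if not file.is_file():
        return values
    for raw in file.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding quotes so KEY="value" and KEY=value agree.
        values[key.strip()] = value.strip().strip("\"'")
    return values


def load_outline_config(env_path=".env", environ=None):
    """Return (token, url). Assumption: the environment wins over the file."""
    environ = os.environ if environ is None else environ
    file_values = parse_env_file(env_path)
    token = environ.get("OUTLINE_API_TOKEN") or file_values.get("OUTLINE_API_TOKEN")
    url = environ.get("OUTLINE_API_URL") or file_values.get("OUTLINE_API_URL")
    if not token or not url:
        raise RuntimeError("Set OUTLINE_API_TOKEN and OUTLINE_API_URL")
    return token, url.rstrip("/")
```

Failing early when either value is missing keeps a misconfigured upload from getting as far as the first API call.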
# Main operations (choose one)
--extract [PATH] # Extract from file/directory (default: ./input)
--list-extracted # List all extracted documents
--upload [DOCUMENT_ID] # Upload to Outline (default: interactive)
--reset # Reset local directories to clean state
# Extraction options
--no-images # Skip image extraction
--no-formatting # Skip formatting preservation
# Directory configuration
--input-dir DIR # Input directory (default: input)
--extracted-dir DIR # Output directory (default: extracted)
--images-dir DIR # Images directory (default: images)
# API configuration (alternative to .env file)
--api-key TOKEN # Outline API token
--api-url URL # Outline API URL
# Logging
--log-level LEVEL # DEBUG, INFO, WARNING, ERROR (default: INFO)
--log-file FILE # Log to file instead of console
For Large Document Uploads:
- The system includes automatic rate limiting protection with exponential backoff
- Standard 1-second delays between image uploads help prevent rate limiting
- Batch delays of 3 seconds every 10 uploads provide additional protection
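The pacing described above can be sketched as a small helper. This is an assumption-laden illustration, not the code in libs/api_upload_manager.py: the names are hypothetical, and it assumes the 3-second batch delay is added on top of the regular 1-second delay after every tenth upload.

```python
import time


def pause_after_upload(count, regular_delay=1.0, batch_delay=3.0, batch_size=10):
    """Seconds to wait after the count-th upload (1-based)."""
    pause = regular_delay
    if count % batch_size == 0:
        # Assumption: the batch delay is additional to the regular delay.
        pause += batch_delay
    return pause


def upload_all(items, upload_one, sleep=time.sleep):
    """Upload each item in order, pausing between uploads."""
    for count, item in enumerate(items, start=1):
        upload_one(item)
        if count < len(items):  # no need to wait after the last upload
            sleep(pause_after_upload(count))
```

Passing `sleep` as a parameter keeps the pacing logic testable without actually waiting.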
Optional Performance Optimization:
If you have a private Outline instance and want faster uploads, you can shorten the built-in delays by modifying the delay values in libs/api_upload_manager.py:
- Set regular delay to 0.1 seconds for faster uploads
- Set batch delay to 1.0 seconds for minimal throttling
Warning: Only recommended for private instances, to avoid overwhelming public servers
Upload Strategies:
- Interactive Mode: Best for selective document uploads with control
- Batch Mode: Efficient for uploading many documents at once
- Individual Uploads: Use specific document IDs for targeted uploads
Issue: Large documents with many images may experience significant delays due to API rate limiting.
Impact: Upload times can be extended with exponential backoff delays (5s, 10s, 20s, 40s between retries).
Administrator Solution:
- Temporarily disable rate limiting on your Outline instance during bulk upload sessions
- This can reduce upload times from minutes to seconds for image-heavy documents
- Remember to re-enable rate limiting after bulk operations complete
User Workaround:
- Upload documents during low-traffic periods
- Consider breaking very large documents into smaller sections
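The retry schedule mentioned under Impact (5s, 10s, 20s, 40s) is a doubling sequence. Here is a minimal sketch of such a retry loop; `RateLimited`, `backoff_delays` and `with_backoff` are hypothetical names, and the sketch assumes the caller's upload function raises `RateLimited` when the server signals a rate limit (for example, an HTTP 429 response).

```python
import time


class RateLimited(Exception):
    """Raised by the caller's upload function when the server rate-limits."""


def backoff_delays(base=5.0, retries=4):
    """Delays that double each time: 5, 10, 20, 40 with the defaults."""
    return [base * 2 ** attempt for attempt in range(retries)]


def with_backoff(action, sleep=time.sleep):
    """Run action(), waiting and retrying while it raises RateLimited."""
    for delay in backoff_delays():
        try:
            return action()
        except RateLimited:
            sleep(delay)
    # Final attempt after the last wait; RateLimited propagates if it fails.
    return action()
```

In the worst case this schedule waits 75 seconds in total for a single upload, which is why image-heavy documents can take minutes.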
Issue: Images that have been edited or modified within Microsoft Word (such as adding highlights, annotations, or effects) may not display exactly as they appeared in Word.
Behavior:
- The original unmodified image will be extracted and uploaded
- Word's modifications (highlights, annotations, effects) will appear as separate overlay images
- This results in multiple image attachments in Outline instead of a single modified image
Workaround:
- For critical visual fidelity, edit images in external image editing software before inserting into Word
- Alternatively, take screenshots of modified images in Word and replace them manually in Outline
Enhanced Upload Workflow:
- Collection Selection: Browse and select target collections
- Document Comparison: Detailed metadata comparison for conflicts
- Conflict Resolution: 4 options - overwrite, view details, skip, or cancel
- Batch Processing: Upload multiple documents with progress tracking
- Safe Operations: Confirmation prompts for destructive actions
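The four-option conflict prompt can be pictured as a loop in which "view details" returns to the question and the other three choices end it. This is a sketch under assumptions: the single-letter keys and the names `Choice` and `resolve_conflict` are illustrative and may differ from the tool's actual prompt.

```python
from enum import Enum


class Choice(Enum):
    OVERWRITE = "o"
    DETAILS = "d"
    SKIP = "s"
    CANCEL = "c"


def resolve_conflict(ask, show_details):
    """Prompt until the user picks a final action; 'details' loops back."""
    while True:
        answer = ask("[o]verwrite / [d]etails / [s]kip / [c]ancel: ")
        try:
            choice = Choice(answer.strip().lower())
        except ValueError:
            continue  # unrecognised input: ask again
        if choice is Choice.DETAILS:
            show_details()  # e.g. print the metadata comparison
            continue
        return choice
```

In interactive use, `ask` would be the built-in `input`; accepting it as a parameter lets the loop be exercised with scripted answers.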
Each extracted document creates a JSON file with:
{
  "document_id": "uuid-string",
  "metadata": {
    "filename": "document.docx",
    "title": "Document Title",
    "author": "Author Name",
    "word_count": 150,
    "paragraph_count": 12,
    "created": "2024-01-01T12:00:00",
    "modified": "2024-01-02T12:00:00"
  },
  "content_html": "<h1>Title</h1><p>Content...</p>",
  "content_markdown": "# Title\n\nContent...",
  "images": [
    {
      "image_id": "uuid",
      "original_filename": "image.png",
      "extracted_filename": "uuid.png",
      "file_path": "images/doc-uuid/uuid.png",
      "content_type": "image/png",
      "size_bytes": 12345,
      "width": 800,
      "height": 600
    }
  ],
  "extraction_timestamp": "2024-01-01T12:00:00"
}
- Original Document Attachment: Source Word documents now automatically attached to Outline pages
- Complete Document Preservation: Users get converted content AND access to original source
- Smart Defaults: --extract now defaults to ./input directory
- Interactive Default: --upload now defaults to interactive mode
- Environment Configuration: Fixed .env file parsing for seamless API setup
- Enhanced Documentation: Complete feature coverage and usage examples
- Improved User Experience: Streamlined workflows with sensible defaults
- Image Upload Fix: Resolved 'None' filename errors causing 400 Bad Request failures
- Rate Limiting: Added retry logic with exponential backoff for API rate limits
- Interactive Batch Upload: New workflow for processing documents one-at-a-time with c/s/e options
- Title Override: Added document title customization step in interactive upload
- Clean Collection Selection: Simplified display showing only collection titles with NEW option
- Word Template Processing: Handle complex document templates
- Collaboration Features: Multi-user document processing
- Advanced Formatting: Enhanced support for complex layouts
- Integration Options: Web interface, desktop app
- Workflow Automation: Watch folders, scheduled processing
- python-docx: Word document parsing and metadata extraction
- mammoth: Superior HTML conversion from Word documents
- Pillow: Image processing and optimization
- requests: HTTP client for Outline API integration
This project leverages the proven architecture from the Confluence to Outline (CTO) project, reusing:
- Configuration management system
- Logging and progress tracking
- Error handling patterns
- API integration components (60-70% code reuse)
# Create test document (if needed)
python create_test_doc.py
# Extract from default input directory
python main.py --extract
# View extracted documents
python main.py --list-extracted
# Test interactive upload (requires API setup)
python main.py --upload
# Reset and start fresh
python main.py --reset
# Quick workflow: extract and upload
python main.py --extract # Extract all from ./input
python main.py --upload # Interactive upload
# Batch processing workflow
python main.py --extract path/to/documents/ # Extract directory
python main.py --upload batch # Upload all documents
# Clean slate workflow
python main.py --reset # Start fresh
python main.py --extract # Extract again
[Your License Here]
For questions about:
- Phase 1 (Current): Word document extraction and processing
- Phase 2 (Planned): Outline API integration and upload functionality
Word to Outline - Making knowledge transfer from Word documents to Outline simple and reliable.