A Python application that recursively searches directories for duplicate and visually similar images using metadata analysis, file content hashing, and perceptual hashing.
- **Multi-stage Duplicate Detection**:
  - Primary: timestamp metadata matching (EXIF `DateTimeOriginal`, `CreateDate`)
  - Secondary: file content hash matching (SHA-256)
  - Tertiary: perceptual similarity using multiple hash algorithms (pHash, aHash, dHash, wHash)
- **Mode-based Operation**:
  - Detect mode: scans for duplicates, stores results in a persistent database, and marks files for removal
  - Remove mode: reads the database and removes images marked for deletion
  - Protect mode: marks specific images as protected from deletion
- **Intelligent File Selection**: automatically chooses which file to keep (see the sketch after this feature list) based on:
  - Highest resolution (width × height)
  - Largest file size (if resolutions are identical)
  - Simplest filename (shortest length, then lexicographical order)
- **Comprehensive Reporting**: generates reports in multiple formats:
  - Human-readable text
  - CSV for spreadsheet analysis
  - JSON for programmatic processing
- **Safe Operation**:
  - Dry-run mode for testing without file deletion
  - User confirmation required for file removal
  - Image protection system to prevent accidental deletion
  - Persistent database tracks all operations
- **Performance Optimized**:
  - Parallel processing for I/O-bound operations
  - SQLite database for efficient metadata storage and querying
  - Configurable worker thread limits
  - Skips system directories (e.g., `@eaDir` on Synology NAS)
- **Docker Support**: fully containerized, with volume mounting for safe operation
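
The file selection rule above can be pictured as a sort key. Below is a minimal illustrative sketch of that ordering, not OmniDupe's actual code; the `(path, width, height, size)` tuple shape is an assumption made for the example:

```python
import os

def keeper_sort_key(candidate):
    """Order duplicates so the preferred keeper sorts first (illustrative only)."""
    path, width, height, size = candidate
    name = os.path.basename(path)
    return (
        -(width * height),  # 1. prefer the highest resolution
        -size,              # 2. then the largest file size
        len(name),          # 3. then the shortest filename
        name,               # 4. then lexicographical order as a tie-break
    )

# keeper = min(candidates, key=keeper_sort_key) picks the file to keep;
# everything else in the group becomes a removal candidate.
```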
Clone the repository:

```bash
git clone <repository-url>
cd omnidupe
```

Install dependencies:

```bash
pip install -r requirements.txt
```

Or build the Docker image:

```bash
docker build -t omnidupe .
```
OmniDupe operates in three distinct modes:
Scan for duplicates and store results in the database:

```bash
python main.py detect --input-dir /path/to/images --output-dir /path/to/results
```
Advanced detection with custom settings:

```bash
python main.py detect \
  --input-dir /path/to/images \
  --output-dir /path/to/results \
  --similarity-threshold 5 \
  --report-format csv \
  --max-workers 8 \
  --verbose
```
Remove images that were marked for deletion during detection:

```bash
python main.py remove --output-dir /path/to/results
```
With dry-run to preview what would be deleted:

```bash
python main.py remove --output-dir /path/to/results --dry-run --verbose
```
Mark specific images as protected from deletion:

```bash
python main.py protect --output-dir /path/to/results --file-path /path/to/image.jpg
```
A typical end-to-end workflow:

```bash
# Step 1: Detect duplicates and create the database
python main.py detect -i /photos -o /results --verbose

# Step 2: Protect important images (optional)
python main.py protect -o /results --file-path /photos/family/wedding.jpg
python main.py protect -o /results --file-path /photos/vacation/sunset.jpg

# Step 3: Preview what will be deleted
python main.py remove -o /results --dry-run

# Step 4: Actually remove the duplicates
python main.py remove -o /results
```
Detection mode:

```bash
docker run --rm \
  -v /path/to/images:/images:ro \
  -v /path/to/output:/data \
  omnidupe \
  detect --input-dir /images --output-dir /data --verbose
```
Removal mode:

```bash
docker run --rm \
  -v /path/to/images:/images \
  -v /path/to/output:/data \
  omnidupe \
  remove --output-dir /data --dry-run
```
Protect mode:

```bash
docker run --rm \
  -v /path/to/images:/images \
  -v /path/to/output:/data \
  omnidupe \
  protect --output-dir /data --file-path /images/important.jpg
```
When using Docker, ensure proper permissions on mounted volumes:
```bash
# Option 1: Run as the current user (recommended)
docker run --rm --user $(id -u):$(id -g) \
  -v /path/to/images:/images:ro \
  -v /path/to/output:/data \
  omnidupe detect --input-dir /images --output-dir /data

# Option 2: Use the Z flag on SELinux systems
docker run --rm \
  -v /path/to/images:/images:ro,Z \
  -v /path/to/output:/data:Z \
  omnidupe detect --input-dir /images --output-dir /data

# Option 3: Set directory permissions before mounting
sudo chown -R $(id -u):$(id -g) /path/to/output
docker run --rm \
  -v /path/to/images:/images:ro \
  -v /path/to/output:/data \
  omnidupe detect --input-dir /images --output-dir /data
```
Error: "No write permission for file"
- Use
--user $(id -u):$(id -g)
when running container - Ensure output directory is writable by container user
- On SELinux systems, add
:Z
to volume mounts
Error: "Failed to move file"
- Input directory must be writable for move operations
- Use read-only mount (
:ro
) if only detecting duplicates - For removal/move operations, mount input as writable
Move files instead of deleting:

```bash
# Move duplicates to a separate directory for review
docker run --rm --user $(id -u):$(id -g) \
  -v /path/to/images:/images \
  -v /path/to/output:/data \
  -v /path/to/moved:/moved \
  omnidupe remove --output-dir /data --move-to /moved
```
Detect mode options:

| Option | Description | Default |
|---|---|---|
| `--input-dir`, `-i` | Directory to scan for images | Required |
| `--output-dir`, `-o` | Directory to store reports and database | `./output` |
| `--similarity-threshold` | Hamming distance threshold for perceptual similarity | `5` |
| `--report-format` | Output report format (`text`, `csv`, `json`) | `text` |
| `--max-workers` | Maximum number of worker threads | `4` |
| `--verbose`, `-v` | Enable verbose logging | `False` |
Remove mode options:

| Option | Description | Default |
|---|---|---|
| `--output-dir`, `-o` | Directory containing the database | `./output` |
| `--dry-run` | Show what would be deleted without actually deleting | `False` |
| `--move-to` | Move files to this directory instead of deleting them | None |
| `--verbose`, `-v` | Enable verbose logging | `False` |
Protect mode options:

| Option | Description | Default |
|---|---|---|
| `--output-dir`, `-o` | Directory containing the database | `./output` |
| `--file-path` | Path to the image file to protect | Required |
| `--verbose`, `-v` | Enable verbose logging | `False` |
Supported image formats:

- JPEG (.jpg, .jpeg, .jfif, .pjpeg, .pjp)
- PNG (.png)
- GIF (.gif)
- TIFF (.tiff, .tif)
- BMP (.bmp)
- WebP (.webp)
- ICO (.ico)
**Primary: timestamp matching**

- Extracts EXIF timestamp metadata (DateTimeOriginal, CreateDate)
- Groups images with identical timestamps
- Most reliable for photos from digital cameras
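
As an illustration, here is a minimal sketch of reading such a timestamp with Pillow; the tag IDs are standard EXIF, but this is not necessarily how OmniDupe implements it:

```python
from PIL import Image

def exif_timestamp(path):
    """Return the EXIF DateTimeOriginal (or DateTime) string, if present."""
    with Image.open(path) as img:
        exif = img.getexif()
        if not exif:
            return None
        # DateTimeOriginal (tag 36867) lives in the Exif sub-IFD (pointer 0x8769);
        # DateTime (tag 306) lives in the base IFD.
        sub_ifd = exif.get_ifd(0x8769)
        return sub_ifd.get(36867) or exif.get(306)

# Images with identical timestamps would then be grouped as duplicate candidates.
```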
**Secondary: file content hashing**

- Calculates the SHA-256 hash of each file's content
- Identifies byte-for-byte identical files
- Catches exact copies, even under different filenames
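
Content hashing needs only the standard library; a sketch that reads in chunks so large files never have to fit in memory:

```python
import hashlib

def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```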
**Tertiary: perceptual similarity**

- Uses multiple perceptual hashing algorithms:
  - pHash (perceptual hash): robust to minor changes
  - aHash (average hash): fast, good for basic similarity
  - dHash (difference hash): sensitive to edges and gradients
  - wHash (wavelet hash): frequency-domain analysis
- Configurable similarity threshold (Hamming distance)
- Groups visually similar images (e.g., different resolutions, slight edits)
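
The `imagehash` package provides all four algorithms, and subtracting two hashes yields their Hamming distance. A minimal sketch; whether OmniDupe requires one or all algorithms to agree is an assumption here:

```python
import imagehash
from PIL import Image

def perceptual_hashes(path):
    """Compute the four perceptual hashes for one image."""
    with Image.open(path) as img:
        return {
            "phash": imagehash.phash(img),
            "ahash": imagehash.average_hash(img),
            "dhash": imagehash.dhash(img),
            "whash": imagehash.whash(img),
        }

def looks_similar(a, b, threshold=5):
    """Treat two images as similar if any algorithm agrees within the threshold.

    Subtracting imagehash values gives the Hamming distance between them;
    requiring 'any' (rather than 'all') algorithm to agree is an assumption.
    """
    return any(a[k] - b[k] <= threshold for k in a)
```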
The application generates several output files in the specified output directory:
- Report file: `duplicate_report_YYYYMMDD_HHMMSS.{txt|csv|json}`
- Database file: `omnidupe.db` (if `--persistent-db` is used)
- Removal script: `remove_duplicates.sh` (when duplicates are found)
Example text report:

```text
OmniDupe - Duplicate Image Detection Report
==================================================
Generated: 2025-06-28 10:30:45
Total duplicate groups found: 3

Summary:
  Total images that can be removed: 7
  Total disk space that can be saved: 45.2 MB

TIMESTAMP DUPLICATES (2 groups)
----------------------------------------
Group 1 (3 images, save 12.8 MB):
  KEEP: /photos/vacation/IMG_001.jpg
    Size: 8.5 MB, Resolution: 4032x3024
  REMOVE:
    - /photos/backup/IMG_001_copy.jpg
      Size: 8.5 MB, Resolution: 4032x3024
    - /photos/duplicates/vacation_001.jpg
      Size: 4.3 MB, Resolution: 2016x1512

PERCEPTUAL DUPLICATES (1 group)
----------------------------------------
Group 2 (5 images, save 32.4 MB):
  Similarity score: 3.20
  KEEP: /photos/family/portrait_hires.jpg
    Size: 15.2 MB, Resolution: 6000x4000
  REMOVE:
    - /photos/family/portrait_med.jpg
      Size: 8.1 MB, Resolution: 3000x2000
    - /photos/family/portrait_small.jpg
      Size: 2.3 MB, Resolution: 1500x1000

[...]
```
- Dry-run mode: Test operations without making changes
- User confirmation: Requires explicit confirmation before deleting files
- Backup scripts: Generates shell scripts for manual review and execution
- Read-only Docker volumes: Mount image directories as read-only for safety
- Comprehensive logging: Detailed logs of all operations
- Keeper verification: Ensures selected files to keep are still accessible
- ⚡ **Persistent Database** (highly recommended): use `--persistent-db` to cache metadata between runs:
  - First run: processes all images and saves metadata to `omnidupe.db`
  - Subsequent runs: only processes new/modified images (dramatically faster)
  - Ideal for large image collections (>1000 images) or regular scanning
  - Example: a first scan takes 30 minutes; subsequent scans take 2-3 minutes
- Parallel processing: configurable worker threads for I/O operations
- Database indexing: optimized database queries with proper indexing
- Memory efficiency: processes images in streams to minimize memory usage
- Incremental processing: database persistence allows resuming interrupted operations (see the sketch after this list)
- Batch operations: groups database operations for better performance
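
The caching idea boils down to comparing a file's current size and mtime against the stored values. A sketch under an assumed schema (`files(path, mtime, size)`; OmniDupe's actual schema may differ):

```python
import os
import sqlite3

def needs_processing(conn: sqlite3.Connection, path: str) -> bool:
    """True if a file is new or has changed since its metadata was cached."""
    st = os.stat(path)
    row = conn.execute(
        "SELECT mtime, size FROM files WHERE path = ?",  # assumed schema
        (path,),
    ).fetchone()
    # Unseen file, or cached mtime/size no longer match the file on disk
    return row is None or (row[0], row[1]) != (st.st_mtime, st.st_size)
```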
- Always use `--persistent-db` for collections with hundreds or thousands of images
- First run setup:

  ```bash
  # Initial scan: takes time, but builds the database
  python main.py detect --input-dir /large/photo/collection --output-dir ./results --persistent-db --verbose
  ```

- Subsequent runs (much faster):

  ```bash
  # Only processes new/changed images
  python main.py detect --input-dir /large/photo/collection --output-dir ./results --persistent-db
  ```

- Adjust workers to your system: `--max-workers 8` for powerful machines (a thread-pool sketch follows this list)
- Monitor progress with the `--verbose` flag to see processing status
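
Because hashing and metadata extraction are I/O-bound, a bounded thread pool is the natural pattern behind a `--max-workers` style option. A minimal sketch with a hypothetical `extract_metadata` callable, not OmniDupe's actual implementation:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_all(paths, extract_metadata, max_workers=4):
    """Apply extract_metadata to every path using a bounded thread pool."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(extract_metadata, p): p for p in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except OSError as exc:
                # A single unreadable or vanished file should not abort the scan
                print(f"Skipping {path}: {exc}")
    return results
```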
- Permission errors: ensure the application has read access to the image directories and write access to the output directory
- Memory issues: reduce `--max-workers` on systems with limited RAM
- Large directories: use `--persistent-db` for very large image collections
- False positives: lower `--similarity-threshold` (smaller values require closer matches)
- Volume mounting: ensure correct host path mapping
- Permissions: host directories must be accessible to the container user
- Read-only volumes: use the `:ro` suffix for image directories to prevent accidental changes
```text
omnidupe/
├── main.py                        # Main application entry point
├── src/
│   ├── __init__.py
│   ├── image_scanner.py           # Directory scanning and file detection
│   ├── metadata_extractor.py      # Image metadata and hash extraction
│   ├── duplicate_detector.py      # Multi-stage duplicate detection
│   ├── database.py                # SQLite database operations
│   ├── reporter.py                # Report generation (text, CSV, JSON)
│   └── file_manager.py            # Safe file removal operations
├── tests/                         # Comprehensive test suite
│   ├── conftest.py                # Test fixtures and configuration
│   ├── test_database.py           # Database operations tests
│   ├── test_file_manager.py       # File management tests
│   ├── test_image_scanner.py      # Image scanning tests
│   ├── test_duplicate_detector.py # Duplicate detection tests
│   ├── test_main.py               # Main application tests
│   └── test_integration.py        # End-to-end integration tests
├── requirements.txt               # Python dependencies
├── requirements-dev.txt           # Development dependencies
├── pytest.ini                     # Test configuration
├── run_tests.py                   # Test runner script
├── TESTING.md                     # Test documentation
├── Dockerfile                     # Container configuration
└── README.md                      # This file
```
OmniDupe includes a comprehensive test suite with 97 tests covering all functionality:
```bash
# Install development dependencies
pip install -r requirements-dev.txt

# Run all tests
python run_tests.py

# Run fast tests only
python run_tests.py fast

# Run with coverage report
python run_tests.py --coverage

# Run specific test categories
python run_tests.py unit
python run_tests.py integration

# Direct pytest usage
pytest                         # Run all tests
pytest tests/test_database.py  # Run specific test file
pytest -m "not slow"           # Skip slow tests
```
See TESTING.md for detailed test documentation.
```bash
# Test with sample images
python main.py detect --input-dir ./images --output-dir ./test_output --verbose --dry-run
```
- Follow Python PEP 8 style guidelines
- Add appropriate logging for new features
- Update documentation for API changes
- Test with various image formats and directory structures
[Add your license information here]
[Add support contact information here]