A high-performance Python tool that removes duplicate images from ZIP archives using perceptual hashing. It detects and removes similar images even when they have been resized, compressed, or slightly modified, while preserving the archive's directory structure and keeping memory usage low.
- Identifies similar images using the average hash (aHash) algorithm (see the sketch after this list)
- Processes ZIP archives while preserving directory structure
- Supports multiple image formats (PNG, JPG, JPEG, BMP, GIF, TIFF)
- Uses perceptual hashing with configurable similarity threshold
- Maintains original file hierarchy in output
- Provides detailed processing feedback
- Handles errors gracefully with automatic cleanup
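For reference, average hashing boils down to a few lines with Pillow. The sketch below illustrates the idea only; `average_hash` is an illustrative name, not necessarily the script's internal API:

```python
from PIL import Image

def average_hash(path, hash_size=8):
    """Return a perceptual fingerprint as a tuple of hash_size**2 bits."""
    # Shrink to a tiny grayscale grid so resizing/compression artifacts wash out.
    img = Image.open(path).convert("L").resize(
        (hash_size, hash_size), Image.Resampling.LANCZOS
    )
    pixels = list(img.getdata())
    avg = sum(pixels) / len(pixels)
    # Each pixel contributes one bit: 1 if brighter than the mean, else 0.
    return tuple(1 if p > avg else 0 for p in pixels)
```

Because the fingerprint is built from a heavily downscaled grayscale version of the image, two files that differ only in size, compression, or minor edits end up with identical or nearly identical bits.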
- Clone the repository:
git clone https://github.com/yourusername/image-perceptual-dedup.git
cd image-perceptual-dedup
- Create a virtual environment (recommended):
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
Basic usage:
python perceptual_dedup.py input.zip output_directory
Extended usage with custom size limits (here, a 2 GB limit on the ZIP and a 50 MB limit per image):
python perceptual_dedup.py input.zip output_directory --max-zip-size 2147483648 --max-image-size 52428800
- `input_zip`: Path to the input ZIP file
- `output_dir`: Directory where the output ZIP file will be created
- `--max-zip-size`: Maximum ZIP file size in bytes (default: 1GB)
- `--max-image-size`: Maximum individual image size in bytes (default: 50MB)
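Both size limits are raw byte counts. The values in the extended example above work out as follows (ordinary Python arithmetic, not part of the script):

```python
MAX_ZIP_SIZE = 2 * 1024**3     # 2 GB  -> 2147483648 bytes
MAX_IMAGE_SIZE = 50 * 1024**2  # 50 MB -> 52428800 bytes
```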
The script will:
- Extract images from the input ZIP
- Process each image to identify duplicates
- Create a new ZIP file containing only unique images
- Clean up temporary files automatically
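In outline, the flow might look like the sketch below. It assumes the `average_hash` helper from the earlier sketch; `dedup_zip` and `hamming` are illustrative names, and the extension filter and error handling are omitted for brevity:

```python
import os
import tempfile
import zipfile

def hamming(a, b):
    """Count the differing bits between two hash tuples."""
    return sum(x != y for x, y in zip(a, b))

def dedup_zip(input_zip, output_dir, threshold=5):
    """Rebuild input_zip under output_dir with perceptual duplicates dropped."""
    os.makedirs(output_dir, exist_ok=True)
    out_path = os.path.join(output_dir, "deduped.zip")
    kept_hashes = []  # fingerprints of images written so far
    with tempfile.TemporaryDirectory() as tmp:  # removed automatically on exit
        with zipfile.ZipFile(input_zip) as zin:
            zin.extractall(tmp)
        with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zout:
            for root, _, files in os.walk(tmp):
                for name in files:
                    path = os.path.join(root, name)
                    h = average_hash(path)
                    if any(hamming(h, k) <= threshold for k in kept_hashes):
                        continue  # a similar image was already kept
                    kept_hashes.append(h)
                    # relpath preserves the original hierarchy inside the new ZIP
                    zout.write(path, arcname=os.path.relpath(path, tmp))
    return out_path
```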
- Hash size: 8x8 (64-bit fingerprint)
- Hamming distance threshold: 5 (adjustable in code)
- Supported image formats: PNG, JPG, JPEG, BMP, GIF, TIFF
- Uses PIL/Pillow for image processing
- Temporary files are automatically cleaned up
- Memory efficient: processes images one at a time
- Performance optimizations:
  - Groups similar images by hash to reduce comparisons (see the sketch below)
  - Uses a tuple-based hash dictionary for efficient lookups
  - Encapsulated duplicate-detection logic for better maintainability
  - Optimized for handling large datasets with a minimal memory footprint
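As a rough illustration of the hash-grouping idea (hypothetical names; the script's internals may differ), a dictionary keyed by the hash tuple resolves exact duplicates in a single lookup, leaving the linear Hamming scan as the fallback for near misses:

```python
def hamming(a, b):
    """Count the differing bits between two equal-length hash tuples."""
    return sum(x != y for x, y in zip(a, b))

class DuplicateDetector:
    """Keeps one representative fingerprint per group of similar images."""

    def __init__(self, threshold=5):
        self.threshold = threshold
        self.seen = {}  # hash tuple -> path of the image kept for that group

    def is_duplicate(self, img_hash):
        if img_hash in self.seen:  # exact match: one dictionary lookup
            return True
        # Near match: fall back to a Hamming-distance scan over representatives.
        return any(hamming(img_hash, h) <= self.threshold for h in self.seen)

    def register(self, img_hash, path):
        self.seen[img_hash] = path
```

Tuples of bits are hashable, so they can serve directly as dictionary keys, which is what makes the exact-match path effectively constant time.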
You can modify these constants in the script:
HASH_SIZE = 8  # Hash grid size (8x8 = 64-bit fingerprint)
HASH_DIFF_THRESHOLD = 5  # Max Hamming distance at which two images count as duplicates; raise to deduplicate more aggressively
VALID_EXTENSIONS = {'.png', '.jpg', '.jpeg', '.bmp', '.gif', '.tiff'}  # Image formats to process
Planned features:
- Add command line options for hash size and similarity threshold
- Support for processing directories without requiring ZIP
- Parallel processing for faster execution
- Option to save list of duplicates
- Graphical user interface
- Progress bar for large archives
- Support for more image formats
- Option to preview duplicates before removal
- Python 3.x
- Pillow >= 10.0.0
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
The script processes images locally and doesn't transmit any data. All temporary files are automatically cleaned up after processing.