A high-performance Python tool that removes duplicate images from ZIP archives using perceptual hashing. It detects and removes similar images even when they have been resized, compressed, or slightly modified, while preserving the archive's directory structure and keeping memory usage low.
- Identifies similar images using the average hash (aHash) algorithm (see the sketch after this list)
- Processes ZIP archives while preserving directory structure
- Supports multiple image formats (PNG, JPG, JPEG, BMP, GIF, TIFF)
- Uses perceptual hashing with configurable similarity threshold
- Maintains original file hierarchy in output
- Provides detailed processing feedback
- Handles errors gracefully with automatic cleanup
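For reference, average hashing boils down to a few lines with Pillow. The sketch below illustrates the idea only; `average_hash` is an illustrative name, not necessarily the script's internal API:

```python
from PIL import Image

def average_hash(path, hash_size=8):
    """Return a perceptual fingerprint as a tuple of hash_size**2 bits."""
    # Shrink to a tiny grayscale grid so resizing/compression artifacts wash out.
    img = Image.open(path).convert("L").resize(
        (hash_size, hash_size), Image.Resampling.LANCZOS
    )
    pixels = list(img.getdata())
    avg = sum(pixels) / len(pixels)
    # Each pixel contributes one bit: 1 if brighter than the mean, else 0.
    return tuple(1 if p > avg else 0 for p in pixels)
```

Because the fingerprint is built from a heavily downscaled grayscale version of the image, two files that differ only in size, compression, or minor edits end up with identical or nearly identical bits.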
- Clone the repository:
git clone https://github.com/yourusername/image-perceptual-dedup.git
cd image-perceptual-dedup
- Create a virtual environment (recommended):
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
Basic usage:
python perceptual_dedup.py input.zip output_directory
Extended usage with custom size limits (here, a 2 GB limit on the ZIP and a 50 MB limit per image):
python perceptual_dedup.py input.zip output_directory --max-zip-size 2147483648 --max-image-size 52428800
- `input_zip`: Path to the input ZIP file
- `output_dir`: Directory where the output ZIP file will be created
- `--max-zip-size`: Maximum ZIP file size in bytes (default: 1GB)
- `--max-image-size`: Maximum individual image size in bytes (default: 50MB)
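Both size limits are raw byte counts. The values in the extended example above work out as follows (ordinary Python arithmetic, not part of the script):

```python
MAX_ZIP_SIZE = 2 * 1024**3     # 2 GB  -> 2147483648 bytes
MAX_IMAGE_SIZE = 50 * 1024**2  # 50 MB -> 52428800 bytes
```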
The script will:
- Extract images from the input ZIP
- Process each image to identify duplicates
- Create a new ZIP file containing only unique images
- Clean up temporary files automatically
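In outline, the flow might look like the sketch below. It assumes the `average_hash` helper from the earlier sketch; `dedup_zip` and `hamming` are illustrative names, and the extension filter and error handling are omitted for brevity:

```python
import os
import tempfile
import zipfile

def hamming(a, b):
    """Count the differing bits between two hash tuples."""
    return sum(x != y for x, y in zip(a, b))

def dedup_zip(input_zip, output_dir, threshold=5):
    """Rebuild input_zip under output_dir with perceptual duplicates dropped."""
    os.makedirs(output_dir, exist_ok=True)
    out_path = os.path.join(output_dir, "deduped.zip")
    kept_hashes = []  # fingerprints of images written so far
    with tempfile.TemporaryDirectory() as tmp:  # removed automatically on exit
        with zipfile.ZipFile(input_zip) as zin:
            zin.extractall(tmp)
        with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zout:
            for root, _, files in os.walk(tmp):
                for name in files:
                    path = os.path.join(root, name)
                    h = average_hash(path)
                    if any(hamming(h, k) <= threshold for k in kept_hashes):
                        continue  # a similar image was already kept
                    kept_hashes.append(h)
                    # relpath preserves the original hierarchy inside the new ZIP
                    zout.write(path, arcname=os.path.relpath(path, tmp))
    return out_path
```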
- Hash size: 8x8 (64-bit fingerprint)
- Hamming distance threshold: 5 (adjustable in code)
- Supported image formats: PNG, JPG, JPEG, BMP, GIF, TIFF
- Uses PIL/Pillow for image processing
- Temporary files are automatically cleaned up
- Memory efficient: processes images one at a time
- Performance optimizations:
  - Groups similar images by hash to reduce comparisons (see the sketch below)
  - Uses a tuple-based hash dictionary for efficient lookups
  - Encapsulated duplicate-detection logic for better maintainability
  - Optimized for handling large datasets with a minimal memory footprint
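As a rough illustration of the hash-grouping idea (hypothetical names; the script's internals may differ), a dictionary keyed by the hash tuple resolves exact duplicates in a single lookup, leaving the linear Hamming scan as the fallback for near misses:

```python
def hamming(a, b):
    """Count the differing bits between two equal-length hash tuples."""
    return sum(x != y for x, y in zip(a, b))

class DuplicateDetector:
    """Keeps one representative fingerprint per group of similar images."""

    def __init__(self, threshold=5):
        self.threshold = threshold
        self.seen = {}  # hash tuple -> path of the image kept for that group

    def is_duplicate(self, img_hash):
        if img_hash in self.seen:  # exact match: one dictionary lookup
            return True
        # Near match: fall back to a Hamming-distance scan over representatives.
        return any(hamming(img_hash, h) <= self.threshold for h in self.seen)

    def register(self, img_hash, path):
        self.seen[img_hash] = path
```

Tuples of bits are hashable, so they can serve directly as dictionary keys, which is what makes the exact-match path effectively constant time.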
You can modify these constants in the script:
HASH_SIZE = 8  # Hash grid size (8x8 = 64-bit fingerprint)
HASH_DIFF_THRESHOLD = 5  # Max Hamming distance at which two images count as duplicates; raise to deduplicate more aggressively
VALID_EXTENSIONS = {'.png', '.jpg', '.jpeg', '.bmp', '.gif', '.tiff'}  # Image formats to process
Planned features:
- Add command line options for hash size and similarity threshold
- Support for processing directories without requiring ZIP
- Parallel processing for faster execution
- Option to save list of duplicates
- Graphical user interface
- Progress bar for large archives
- Support for more image formats
- Option to preview duplicates before removal
- Python 3.x
- Pillow >= 10.0.0
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
The script processes images locally and doesn't transmit any data. All temporary files are automatically cleaned up after processing.