A Python script for deduplicating files across multiple directories with priority-based retention.
- Cross-directory deduplication: Scan and remove duplicate files across multiple target directories
- Priority-based retention: Keep files from higher priority directories while removing duplicates from lower priority ones
- Multiple hash methods: Uses MD5, SHA256, and perceptual hashing (phash) for accurate duplicate detection
- Partial hash calculation: Efficient scanning by calculating partial file hashes (first and last chunks)
- Configurable exclusions: Automatically excludes system and metadata files (XMP, AAE, DB, etc.)
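The partial hash idea from the feature list can be sketched as follows. This is a minimal illustration, not dedupy's actual code: the function name `partial_hash`, the 64 KiB chunk size, and the choice to mix the file size into the digest are all assumptions made for the example.

```python
import hashlib
import os


def partial_hash(path, chunk_size=64 * 1024, algorithm="sha256"):
    """Hash the file size plus the first and last chunk_size bytes of a file.

    Hypothetical helper: name, chunk size, and size-mixing are assumptions.
    """
    size = os.path.getsize(path)
    digest = hashlib.new(algorithm)
    # Mixing in the size keeps files of different lengths apart even when
    # their first and last chunks happen to match.
    digest.update(str(size).encode("ascii"))
    with open(path, "rb") as handle:
        # First chunk (the whole file if it is shorter than chunk_size).
        digest.update(handle.read(chunk_size))
        if size > chunk_size:
            # Last chunk; for files shorter than two chunks it overlaps the
            # first chunk, which is harmless.
            handle.seek(size - chunk_size)
            digest.update(handle.read(chunk_size))
    return digest.hexdigest()
```

Two files of equal size that differ only in the middle produce the same partial hash, so a match is best treated as a candidate duplicate. Confirming candidates with a full-content hash is one common way to rule out such collisions.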
# Clone the repository
git clone https://github.com/navilg/dedupy.git
cd dedupy
# Install dependencies
pip install -r requirements.txt
Required dependencies:
- Python 3.12+
- Pillow (PIL) for image processing
- imagehash for perceptual hashing
Scan directories for duplicate files
Create a file called directories.txt listing directories to scan, one per line. List them in priority order: the first directory has highest priority, the last has lowest.
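To make the ordering rule concrete, here is a rough sketch of how such a list could be turned into priorities. The function `load_directories`, the 1-based numbering, and the skipping of blank lines are assumptions for illustration; the tool's own parsing may differ.

```python
from pathlib import Path


def load_directories(list_file):
    """Return [(priority, directory)] in file order; priority 1 is highest.

    Hypothetical helper, not part of dedupy's documented interface.
    """
    entries = []
    with open(list_file, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # assumption: blank lines are ignored
            entries.append(Path(line))
    # The position in the file decides the priority.
    return list(enumerate(entries, start=1))
```

With a file that lists a main photo directory first and a backup directory second, the main directory gets priority 1 and its copies are the ones retained.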
python main.py directories.txt --action scan
This scans the directories and creates three files:
- deduplication_report.json: A detailed JSON report listing each group of duplicate files with its signature, the file to keep, the duplicates marked for deletion, and the total space that deletion would free.
- deduplication_summary.txt: The same report in human-readable form.
- duplicates_to_delete.txt: A list of the duplicate files marked for deletion.
Delete duplicate files
Review the three files generated by the scan before deleting anything.
python main.py --action delete
This generates deletion_report.json, a JSON report of the files that were deleted.
--phash: Enable perceptual hashing during the scan to compare images by visual similarity. This slows the scan down.
--verbose: Enable verbose output.
- File Scanning: Recursively scans all provided directories
- Hash Calculation: Computes partial MD5, SHA256, and perceptual hashes
- Signature Generation: Creates unique signatures for comparison
- Duplicate Detection: Groups files with identical signatures
- Priority Sorting: Sorts duplicates by directory priority
- Cleanup: Deletes duplicates from lower priority directories
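The scanning and grouping steps above can be sketched roughly as below. For brevity this example uses a full-content SHA256 as the signature rather than the partial and perceptual hashes the tool combines, and the names `EXCLUDED_EXTENSIONS`, `file_signature`, and `find_duplicate_groups` are hypothetical.

```python
import hashlib
import os
from collections import defaultdict

# Hypothetical exclusion list based on the metadata types named earlier.
EXCLUDED_EXTENSIONS = {".xmp", ".aae", ".db"}


def file_signature(path):
    """Full-content SHA256, read in blocks so large files are not loaded at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(block)
    return digest.hexdigest()


def find_duplicate_groups(directories):
    """Map signature -> paths, keeping only signatures seen more than once."""
    groups = defaultdict(list)
    for directory in directories:
        # os.walk descends into every subdirectory.
        for root, _dirs, names in os.walk(directory):
            for name in sorted(names):
                if os.path.splitext(name)[1].lower() in EXCLUDED_EXTENSIONS:
                    continue  # skip metadata and system files
                path = os.path.join(root, name)
                groups[file_signature(path)].append(path)
    return {sig: paths for sig, paths in groups.items() if len(paths) > 1}
```

Because the directories are walked in the order given, the paths in each group start with the copy from the earliest listed directory.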
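Directories are processed with priority-based retention. Priority numbers follow the order of directories.txt, so the first directory listed has the lowest number and the highest priority. Within each group of duplicates, the copy in the lowest-numbered directory is kept and the copies in higher-numbered directories are removed.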
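As an illustration of the retention rule, the sketch below splits one group of duplicates into the file to keep and the files to delete. The function `plan_retention` is hypothetical, and how dedupy breaks ties between copies inside the same directory is not stated here; this example simply keeps the input order for ties.

```python
from pathlib import Path


def plan_retention(duplicate_paths, directories):
    """Split one group of duplicates into (path_to_keep, paths_to_delete)."""
    roots = [Path(d) for d in directories]

    def priority(path):
        # Index of the first listed directory that contains the file;
        # a lower index means a higher priority.
        for index, root in enumerate(roots):
            if Path(path).is_relative_to(root):
                return index
        raise ValueError(f"{path} is not inside any listed directory")

    # sorted() is stable, so copies with equal priority keep their input order.
    ranked = sorted(duplicate_paths, key=priority)
    return ranked[0], ranked[1:]
```

Given the directories `/data/main` and `/data/backup` in that order, a duplicate found in both places is kept under `/data/main` and every copy under `/data/backup` lands in the deletion list.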
Standard hashes (MD5, SHA256) work on any file type, while perceptual hashing (enabled with --phash) compares images by similarity. System files and metadata files are excluded automatically.