
Commit 52e1bdf

Full refactoring to hash videos by image.

- Add no_duplicates page from #71
- Improve fuzzy search for the new logic of multiple hashes
- Add table layout of duplicates
- Extend documentation
- Fix pylint and pycodestyle findings; add Ubuntu 22.04 test scripts
- Adapt tests

1 parent e6cc347 commit 52e1bdf

22 files changed (+419, -172 lines)
Lines changed: 41 additions & 0 deletions
```yaml
# This workflow will install Python dependencies, run tests and lint with a variety of Python versions
# For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python

name: Python package

on:
  push:
    branches: [ "master" ]
  pull_request:
    branches: [ "master" ]

jobs:
  build:

    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        python-version: ["3.12"]

    steps:
      - uses: actions/checkout@v3
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v3
        with:
          python-version: ${{ matrix.python-version }}
      - name: Install dependencies
        run: |
          sudo apt-get install -y python3 python3-pip python3-setuptools gnupg curl file
          curl -fsSL https://pgp.mongodb.com/server-7.0.asc | sudo gpg -o /usr/share/keyrings/mongodb-server-7.0.gpg --dearmor
          echo "deb [ arch=amd64,arm64 signed-by=/usr/share/keyrings/mongodb-server-7.0.gpg ] https://repo.mongodb.org/apt/ubuntu jammy/mongodb-org/7.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-7.0.list
          sudo apt-get update && sudo apt-get install -y mongodb-org && sudo mkdir -p /data/db && mongod &
          pip install --upgrade setuptools
          pip install --only-binary=numpy,scipy -r requirements.txt
          pip install -r requirements-test.txt
      - name: pep8 styles
        run: |
          pycodestyle *.py hashers tests
      - name: Test with pytest
        run: |
          pytest tests/test.py -v
```
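The workflow launches `mongod` in the background and immediately continues with dependency installation and tests, so the server may still be starting when pytest first connects. A minimal readiness-check sketch in Python, assuming pymongo and the default localhost URI (the helper name is illustrative, not part of this commit):

```python
from pymongo import MongoClient

def wait_for_mongo(uri="mongodb://localhost:27017/", timeout_s=30):
    """Block until mongod answers a ping, or fail after timeout_s seconds."""
    client = MongoClient(uri, serverSelectionTimeoutMS=timeout_s * 1000)
    client.admin.command("ping")  # raises ServerSelectionTimeoutError on failure
    return client
```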

.pylintrc

Lines changed: 3 additions & 0 deletions
```ini
[MASTER]

extension-pkg-allow-list=av
```
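Allow-listing the `av` C extension suggests the video hasher decodes frames with PyAV. A rough sketch, assuming PyAV and the `imagehash` package, of what "extracting N frames and hashing them with the perceptual hash" can look like (the function and its sampling strategy are illustrative, not this repository's code):

```python
import av          # PyAV, the C extension allow-listed above
import imagehash   # perceptual (pHash) hashing of PIL images

def video_phashes(path, n_frames=5, step=30):
    """Sketch: pHash every `step`-th decoded frame, collecting up to n_frames."""
    hashes = []
    with av.open(path) as container:
        stream = container.streams.video[0]
        for i, frame in enumerate(container.decode(stream)):
            if i % step == 0:
                hashes.append(str(imagehash.phash(frame.to_image())))
            if len(hashes) >= n_frames:
                break
    return hashes
```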

.travis.yml

Lines changed: 0 additions & 12 deletions
This file was deleted.

README.md

Lines changed: 42 additions & 36 deletions
````diff
@@ -2,26 +2,26 @@

 ![](https://api.travis-ci.org/philipbl/duplicate-images.svg)

-This Python script finds duplicate images using a [perspective hash (pHash)](http://www.phash.org) to compare images. pHash ignores the image size and file size and instead creates a hash based on the pixels of the image. This allows you to find duplicate pictures that have been rotated, have changed metadata, and slightly edited.
+This Python script finds duplicate files:
+- any file, by exact match of its blake2b hash
+- images, using a [perceptual hash (pHash)](http://www.phash.org) to compare them. pHash ignores the image size and file size and instead creates a hash based on the pixels of the image. This allows you to find duplicate pictures that have been rotated, have changed metadata, or been slightly edited.
+- videos, by extracting N frames at fixed relative times and hashing them with the perceptual hash (see above)

-This script hashes images added to it, storing the hash into a database (MongoDB). To find duplicate images, hashes are compared. If the hash is the same between two images, then they are marked as duplicates. A web interface is provided to delete duplicate images easily. If you are feeling lucky, there is an option to automatically delete duplicate files.
+This script hashes files added to it, storing the hashes in a database (MongoDB). To find duplicate files, hashes are compared. If the hash is the same between two files, they are marked as duplicates. A web interface is provided to delete duplicate files easily. If you are feeling lucky, there is an option to automatically delete duplicate files.

-As a word of caution, pHash is not perfect. I have found that duplicate pictures sometimes have different hashes and similar (but not the same) pictures have the same hash. This script is a great starting point for cleaning your photo library of duplicate pictures, but make sure you look at the pictures before you delete them. You have been warned! I hold no responsibility for any family memories that might be lost because of this script.
-
-This script has only been tested with Python 3 and is still pretty rough around the edges. Use at your own risk.
+As a word of caution, pHash is not perfect. I have found that duplicate pictures sometimes have different hashes and similar (but not the same) pictures have the same hash. This script is a great starting point for cleaning your photo or video library of duplicate pictures, but make sure you look at the pictures before you delete them. You have been warned! I hold no responsibility for any family memories that might be lost because of this script.

 ## Requirements

-This script requires MongoDB, Python 3.4 or higher, and a few Python modules, as found in `requirements.txt`.
+This script requires MongoDB, Python 3.12 or higher, and a few Python modules, as found in `requirements.txt`.

 ## Quick Start

 I suggest you read the usage, but here are the steps to get started right away. These steps assume that MongoDB is already installed on the system.

-First, install this script. This can be done by either cloning the repository or [downloading the script](https://github.com/philipbl/duplicate-images/archive/master.zip).
+First, install this script. This can be done by either cloning the repository or [downloading the script](https://github.com/bolshevik/duplicate-images/archive/master.zip).
 ```bash
-git clone https://github.com/philipbl/duplicate-images.git
+git clone https://github.com/bolshevik/duplicate-images.git
 ```

 Next, download all required modules. This script has only been tested with Python 3. I would suggest that you make a virtual environment, setting Python 3 as the default python executable (`mkvirtualenv --python=/usr/local/bin/python3 <name>`)
````
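The first two hash types in the new intro are standard building blocks. A minimal sketch using Python's `hashlib` and the `imagehash` package (function names are illustrative, not the repository's code):

```python
import hashlib

import imagehash
from PIL import Image

def exact_hash(path: str) -> str:
    """blake2b over raw bytes: catches byte-identical duplicates of any file."""
    h = hashlib.blake2b()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()

def image_phash(path: str) -> str:
    """pHash over pixels: survives resizing, metadata changes, light edits."""
    return str(imagehash.phash(Image.open(path)))
```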
````diff
@@ -34,26 +34,6 @@ Last, run script:
 python duplicate_finder.py
 ```

-## On Ubuntu 18.04
-
-```bash
-# Install Mongo and pip
-sudo apt -y install mongodb-server python3-pip
-# Disable Mongo service autostart
-sudo systemctl disable mongodb.service
-# Stop Mongo service
-sudo service mongodb stop
-```
-
-Python 2 is the default version of Python, so we have to call `python3` explicitely:
-
-```bash
-# Install dependencies with Python 3
-pip3 install -r requirements.txt
-# “python duplicate_finder.py” will fail, so we have to use Python 3 for every call:
-python3 duplicate_finder.py …
-```
-
 ## Example

 ```bash
````
````diff
@@ -76,6 +56,7 @@ Usage:
     duplicate_finder.py remove <path> ... [--db=<db_path>]
     duplicate_finder.py clear [--db=<db_path>]
     duplicate_finder.py show [--db=<db_path>]
+    duplicate_finder.py cleanup [--db=<db_path>]
     duplicate_finder.py find [--print] [--delete] [--match-time] [--trash=<trash_path>] [--db=<db_path>] [--threshold=<num>]
     duplicate_finder.py -h | --help

````
````diff
@@ -84,13 +65,13 @@ Options:

     --db=<db_path>             The location of the database or a MongoDB URI. (default: ./db)

-    --parallel=<num_processes> The number of parallel processes to run to hash the image
-                               files (default: number of CPUs).
+    --parallel=<num_processes> The number of parallel processes to run to hash the files
+                               (default: number of CPUs).

 find:
-    --threshold=<num>      Image matching threshold. Number of different bits in Hamming distance. False positives are possible.
+    --threshold=<num>      Hash matching threshold. Number of different bits in Hamming distance. False positives are possible.
     --print                Only print duplicate files rather than displaying HTML file
-    --delete               Move all found duplicate pictures to the trash. This option takes priority over --print.
+    --delete               Move all found duplicate files to the trash. This option takes priority over --print.
     --match-time           Adds the extra constraint that duplicate images must have the
                            same capture times in order to be considered.
     --trash=<trash_path>   Where files will be put when they are deleted (default: ./Trash)
````
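`--threshold` counts differing bits (Hamming distance) between two hashes. A minimal sketch of that comparison, assuming hashes are stored as equal-length hex strings (the function names are illustrative, not from this commit):

```python
def hamming_distance(hash_a: str, hash_b: str) -> int:
    """Number of differing bits between two equal-length hex hash strings."""
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")

def is_fuzzy_match(hash_a: str, hash_b: str, threshold: int = 0) -> bool:
    # threshold=0 demands identical hashes; larger values accept fuzzier matches
    return hamming_distance(hash_a, hash_b) <= threshold
```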
````diff
@@ -101,14 +82,14 @@ Options:
 python duplicate_finder.py add /path/to/images
 ```

-When a path is added, image files are recursively searched for. In particular, `JPEG`, `PNG`, `GIF`, and `TIFF` images are searched for. Any image files found will be hashed. Adding a path uses 8 processes (by default) to hash images in parallel so the CPU usage is very high.
+When a path is added, files are searched for recursively. A binary content hash is applied to every file; image files such as `JPEG`, `PNG`, `GIF`, and `TIFF` additionally get the perceptual hash, and video files get the video hash. Adding a path uses 8 processes (by default) to hash files in parallel, so the CPU usage is very high.

 ### Remove
 ```bash
 python duplicate_finder.py remove /path/to/images
 ```

-A path can be removed from the database. Any image inside that path will be removed from the database.
+A path can be removed from the database. Any file inside that path will be removed from the database.

 ### Clear
 ```bash
````
````diff
@@ -117,6 +98,13 @@ python duplicate_finder.py clear
 ```

 Removes all hashes from the database.

+### Cleanup
+```bash
+python duplicate_finder.py cleanup
+```
+
+Removes entries for files that have disappeared from disk.
+
 ### Show
 ```bash
 python duplicate_finder.py show
````
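A sketch of what such a cleanup pass could look like with pymongo; the database, collection, and field names are assumptions for illustration, not the script's actual schema:

```python
import os

from pymongo import MongoClient

def cleanup(collection):
    """Drop entries whose file no longer exists on disk."""
    for doc in collection.find({}, {"file_name": 1}):  # "file_name" is an assumed field
        if not os.path.exists(doc["file_name"]):
            collection.delete_one({"_id": doc["_id"]})

# "duplicates" / "files" are hypothetical names for illustration only.
cleanup(MongoClient("mongodb://localhost:27017/")["duplicates"]["files"])
```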
````diff
@@ -129,7 +117,25 @@ Prints the contents of the database.
 duplicate_finder.py find [--print] [--delete] [--match-time] [--trash=<trash_path>] [--threshold=<num>]
 ```

-Finds duplicate pictures that have been hashed. This will find images that have the same hash stored in the database. There are a few options associated with `find`. By default, when this command is run, a webpage is displayed showing duplicate pictures and a server is started that allows for the pictures to be deleted (images are not actually deleted, but moved to a trash folder -- I really don't want you to make a mistake). The first option, **`--print`**, prints all duplicate pictures and does not display a webpage or start the server. **`--delete`** automatically moves all duplicate images found to the trash. Be careful with this one. **`--match-time`** adds the extra constraint that images must have the same EXIF time stamp to be considered duplicate pictures. Last, `--trash=<trash_path>` lets you select a path to where you want files to be put when they are deleted. The default trash location is `./Trash`.
+Finds duplicate files that have been hashed. This will find files that have the same hash stored in the database. There are a few options associated with `find`. By default, when this command is run, a webpage is displayed showing duplicate files and a server is started that allows the files to be deleted (files are not actually deleted, but moved to a trash folder -- I really don't want you to make a mistake). The first option, **`--print`**, prints all duplicate files and does not display a webpage or start the server. **`--delete`** automatically moves all duplicate files found to the trash. Be careful with this one. **`--match-time`** adds the extra constraint that images must have the same EXIF time stamp to be considered duplicates. `--trash=<trash_path>` lets you select the path where files will be put when they are deleted; the default trash location is `./Trash`. Last, `--threshold=<num>` sets the number of bits of Hamming distance allowed when fuzzy-matching hashes.
+
+# Testing
+
+## Ubuntu 22.04
+```bash
+sudo apt-get install python3 python3-pip python3-setuptools gnupg curl file
+curl -fsSL https://pgp.mongodb.com/server-7.0.asc | \
+  sudo gpg -o /usr/share/keyrings/mongodb-server-7.0.gpg \
+  --dearmor
+echo "deb [ arch=amd64,arm64 signed-by=/usr/share/keyrings/mongodb-server-7.0.gpg ] https://repo.mongodb.org/apt/ubuntu jammy/mongodb-org/7.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-7.0.list
+sudo apt-get update
+sudo apt-get install -y mongodb-org
+sudo mkdir /data/db
+sudo mongod
+
+pip install --only-binary=numpy,scipy -r requirements.txt
+pip install -r requirements-test.txt
+```

 ## Disclaimer

````
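`find` boils down to grouping database entries that share a hash; with the commit's multiple-hashes-per-file logic, two files match when any of their hashes coincide (or fall within the threshold). A sketch of the exact-match grouping as a MongoDB aggregation; the collection and field names are assumptions, not the script's actual schema:

```python
from pymongo import MongoClient

# "duplicates", "files", "hashes", and "file_name" are illustrative names.
collection = MongoClient("mongodb://localhost:27017/")["duplicates"]["files"]

# Unwind each file's list of hashes, then group files sharing any hash value.
pipeline = [
    {"$unwind": "$hashes"},
    {"$group": {"_id": "$hashes",
                "files": {"$addToSet": "$file_name"},
                "count": {"$sum": 1}}},
    {"$match": {"count": {"$gt": 1}}},
]
for group in collection.aggregate(pipeline):
    print(group["_id"], group["files"])
```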