Commit 41e0253

Added fuzzy image matching with pybktree.
1 parent 7fc2fef commit 41e0253

4 files changed: +60 -6 lines changed

README.md

Lines changed: 3 additions & 2 deletions
@@ -76,7 +76,7 @@ Usage:
 duplicate_finder.py remove <path> ... [--db=<db_path>]
 duplicate_finder.py clear [--db=<db_path>]
 duplicate_finder.py show [--db=<db_path>]
-duplicate_finder.py find [--print] [--delete] [--match-time] [--trash=<trash_path>] [--db=<db_path>]
+duplicate_finder.py find [--print] [--delete] [--match-time] [--trash=<trash_path>] [--db=<db_path>] [--threshold=<num>]
 duplicate_finder.py -h | --help

 Options:
@@ -88,6 +88,7 @@ Options:
 files (default: number of CPUs).

 find:
+--threshold=<num> Image matching threshold. Number of different bits in Hamming distance. False positives are possible.
 --print Only print duplicate files rather than displaying HTML file
 --delete Move all found duplicate pictures to the trash. This option takes priority over --print.
 --match-time Adds the extra constraint that duplicate images must have the
@@ -125,7 +126,7 @@ Prints the contents database.

 ### Find
 ```bash
-duplicate_finder.py find [--print] [--delete] [--match-time] [--trash=<trash_path>]
+duplicate_finder.py find [--print] [--delete] [--match-time] [--trash=<trash_path>] [--threshold=<num>]
 ```

 Finds duplicate pictures that have been hashed. This will find images that have the same hash stored in the database. There are a few options associated with `find`. By default, when this command is run, a webpage is displayed showing duplicate pictures and a server is started that allows for the pictures to be deleted (images are not actually deleted, but moved to a trash folder -- I really don't want you to make a mistake). The first option, **`--print`**, prints all duplicate pictures and does not display a webpage or start the server. **`--delete`** automatically moves all duplicate images found to the trash. Be careful with this one. **`--match-time`** adds the extra constraint that images must have the same EXIF time stamp to be considered duplicate pictures. Last, `--trash=<trash_path>` lets you select a path to where you want files to be put when they are deleted. The default trash location is `./Trash`.
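
The `--threshold` option is described as the number of differing bits in the Hamming distance between two image hashes. As a minimal sketch of what that number means, assuming hashes are stored as hex strings (as the `int(document['hash'], 16)` call in this commit suggests), the distance can be computed by XOR-ing the two integers and counting the set bits. The helper name and the sample hashes are hypothetical and are not taken from the repository.

```python
def hamming_distance(hash_a: str, hash_b: str) -> int:
    """Count the bits that differ between two hex-encoded hashes."""
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


# Two made-up 64-bit hashes that differ only in the lowest bit.
a = "ffd8c0a0e0f0f8fc"
b = "ffd8c0a0e0f0f8fd"

distance = hamming_distance(a, b)
print(distance)                # 1
print(distance <= 1)           # True: these would match with --threshold=1
print(hamming_distance(a, a))  # 0: identical hashes
```

A larger threshold accepts more differing bits, which is why the option text warns that false positives are possible.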

duplicate_finder.py

Lines changed: 52 additions & 3 deletions
@@ -7,7 +7,7 @@
 duplicate_finder.py remove <path> ... [--db=<db_path>]
 duplicate_finder.py clear [--db=<db_path>]
 duplicate_finder.py show [--db=<db_path>]
-duplicate_finder.py find [--print] [--delete] [--match-time] [--trash=<trash_path>] [--db=<db_path>]
+duplicate_finder.py find [--print] [--delete] [--match-time] [--trash=<trash_path>] [--db=<db_path>] [--threshold=<num>]
 duplicate_finder.py -h | --help

 Options:
@@ -19,6 +19,7 @@
 files (default: number of CPUs).

 find:
+--threshold=<num> Image matching threshold. Number of different bits in Hamming distance. False positives are possible.
 --print Only print duplicate files rather than displaying HTML file
 --delete Move all found duplicate pictures to the trash. This option takes priority over --print.
 --match-time Adds the extra constraint that duplicate images must have the
@@ -45,7 +46,7 @@
 from PIL import Image, ExifTags
 import pymongo
 from termcolor import cprint
-
+import pybktree

 @contextmanager
 def connect_to_db(db_conn_string='./db'):
@@ -244,6 +245,51 @@ def find(db, match_time=False):

     return list(dups)

+def find_threshold(db, threshold=1):
+    dups = []
+    # Build a tree
+    cursor = db.find()
+    tree = pybktree.BKTree(pybktree.hamming_distance)
+
+    cprint('Finding fuzzy duplicates, it might take a while...')
+    cnt = 0
+    for document in db.find():
+        int_hash = int(document['hash'], 16)
+        tree.add(int_hash)
+        cnt = cnt + 1
+
+    deduplicated = set()
+
+    scanned = 0
+    for document in db.find():
+        cprint("\r%d%%" % (scanned * 100 / (cnt - 1)), end='')
+        scanned = scanned + 1
+        if document['hash'] in deduplicated:
+            continue
+        deduplicated.add(document['hash'])
+        hash_len = len(document['hash'])
+        int_hash = int(document['hash'], 16)
+        similar = tree.find(int_hash, threshold)
+        similar = list(set(similar))
+        if len(similar) > 1:
+            similars = []
+            for (distance, item_hash) in similar:
+                #if distance > 0:
+                item_hash = format(item_hash, '0' + str(hash_len) + 'x')
+                deduplicated.add(item_hash)
+                for item in db.find({'hash': item_hash}):
+                    item['file_name'] = item['_id']
+                    similars.append(item)
+            if len(similars) > 0:
+                dups.append(
+                    {
+                        '_id': document['hash'],
+                        'total': len(similars),
+                        'items': similars
+                    }
+                )
+
+    return dups

 def delete_duplicates(duplicates, db):
     results = [delete_picture(x['file_name'], db)
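
To illustrate the lookup that `find_threshold` delegates to `pybktree`, here is a minimal pure-Python BK-tree sketch. It is not pybktree's code: the `BKTree` class below is a hypothetical stand-in that only mirrors the `add` and `find` calls used in the hunk, and the 16-bit hashes are made up. The sketch also includes a hypothetical `percent_done` helper, because the committed progress expression `scanned * 100 / (cnt - 1)` divides by zero when the database holds exactly one document.

```python
class BKTree:
    """Minimal BK-tree over non-negative integers (illustrative only)."""

    def __init__(self, distance):
        self.distance = distance
        self.root = None  # node = (item, {distance_to_child: child_node})

    def add(self, item):
        if self.root is None:
            self.root = (item, {})
            return
        node = self.root
        while True:
            parent, children = node
            d = self.distance(item, parent)
            if d not in children:
                children[d] = (item, {})
                return
            node = children[d]

    def find(self, item, threshold):
        """Return sorted (distance, item) pairs within `threshold` of `item`."""
        if self.root is None:
            return []
        found, stack = [], [self.root]
        while stack:
            candidate, children = stack.pop()
            d = self.distance(item, candidate)
            if d <= threshold:
                found.append((d, candidate))
            # Triangle inequality: only subtrees keyed in [d - t, d + t] can match.
            for child_d, child in children.items():
                if d - threshold <= child_d <= d + threshold:
                    stack.append(child)
        return sorted(found)


def hamming(x, y):
    """Number of differing bits between two integers."""
    return bin(x ^ y).count("1")


def percent_done(scanned, total):
    """Progress percentage that tolerates zero or one document."""
    return 100 if total <= 1 else scanned * 100 // (total - 1)


hashes = ["00ff", "00fe", "ff00", "00ff"]  # hypothetical hex hashes, one repeated
tree = BKTree(hamming)
for h in hashes:
    tree.add(int(h, 16))

matches = tree.find(int("00ff", 16), 1)
# The repeated hash is stored twice, so results are de-duplicated with set(),
# as the committed code does with list(set(similar)).
print(sorted(set(matches)))  # [(0, 255), (1, 254)]
```

Each result is a `(distance, hash)` pair, so a group with more than one distinct pair means at least one other hash lies within the threshold of the query.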
@@ -355,7 +401,10 @@ def get_capture_time(img):
         elif args['show']:
             show(db)
         elif args['find']:
-            dups = find(db, args['--match-time'])
+            if args['--threshold'] is not None:
+                dups = find_threshold(db, int(args['--threshold']))
+            else:
+                dups = find(db, args['--match-time'])

             if args['--delete']:
                 delete_duplicates(dups, db)
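
The dispatch above passes `args['--threshold']` straight to `int()`, so a non-numeric value raises `ValueError` and a negative value is accepted silently. A minimal sketch of stricter parsing follows, assuming the option arrives as a string or `None`. The `parse_threshold` helper and its `max_bits` default of 64 are hypothetical and not part of this commit.

```python
def parse_threshold(raw, max_bits=64):
    """Return None when the option is absent, else a validated bit count."""
    if raw is None:
        return None
    try:
        value = int(raw)
    except ValueError:
        raise SystemExit("--threshold must be an integer, got %r" % raw)
    if not 0 <= value <= max_bits:
        raise SystemExit("--threshold must be between 0 and %d" % max_bits)
    return value


print(parse_threshold(None))  # None: exact-match find() would be used
print(parse_threshold("3"))   # 3
```

Note that in the committed hunk `--match-time` is only passed to `find`, so it has no effect when `--threshold` is given.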

requirements.txt

Lines changed: 1 addition & 0 deletions
@@ -10,3 +10,4 @@ termcolor==1.1.0
 Werkzeug==0.14.1
 Flask-Cors==3.0.3
 dnspython>=1.15.0
+pybktree==1.1

template/index.html

Lines changed: 4 additions & 1 deletion
@@ -1,5 +1,6 @@
 <html>
 <head>
+  <meta charset="UTF-8">
   <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/css/bootstrap.min.css">
   <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/css/bootstrap-theme.min.css">
   <script src="https://code.jquery.com/jquery-2.1.4.min.js"></script>
@@ -9,7 +10,9 @@
 {% macro image(img, size) -%}
 <div class="col-xs-{{ size }}">
   <div class="thumbnail">
-    <img class="img-responsive" src="{{ img['file_name'] }}" alt="{{ img['file_name'] }}">
+    <a href="{{ img['file_name'] }}" target='_blank'>
+      <img class="img-responsive" src="{{ img['file_name'] }}" alt="{{ img['file_name'] }}">
+    </a>
     <div class="caption">
       <h5 class="name">{{ img['file_name'] }}</h5>
       <div class="file-size">{{ img['file_size'] | filesizeformat }}</div>
