# Duplicate Image Finder
This Python script finds duplicate images using a [perceptual hash (pHash)](http://www.phash.org) to compare images. pHash ignores the image size and file size and instead creates a hash based on the pixels of the image. This allows you to find duplicate pictures that have been rotated, have had their metadata changed, or have been edited.
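To make the idea concrete, here is a minimal sketch of computing and comparing perceptual hashes with the third-party `imagehash` library. This is just an illustration of the technique; it is an assumption, not a guarantee, that this script hashes images the same way.

```
# Illustration only: perceptual hashing with the `imagehash` library
# (pip install pillow imagehash); not necessarily what this script uses.
from PIL import Image
import imagehash

hash_a = imagehash.phash(Image.open("photo.jpg"))
hash_b = imagehash.phash(Image.open("photo_copy.png"))

# Equal hashes suggest the same visual content, even if the files
# differ in format, file size, or metadata.
print(hash_a == hash_b)

# The difference is a Hamming distance: small values mean the
# images are visually similar.
print(hash_a - hash_b)
```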
This script hashes images added to it, storing the hashes in a database. To find duplicate images, hashes are compared. If the hash is the same between two images, then they are marked as duplicates. A web interface is provided to delete duplicate images easily. If you are feeling lucky, there is an option to automatically delete duplicate files.
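The comparison step boils down to grouping files by hash: any hash shared by more than one file marks a duplicate group. A tiny sketch of that idea (an in-memory illustration, not this script's actual database code):

```
# Sketch of the comparison step: group file paths by hash and report
# any hash shared by more than one file. Illustration only; the real
# script stores hashes in a database instead of in memory.
from collections import defaultdict

def find_duplicate_groups(hashed_files):
    # hashed_files: iterable of (file_path, hash_string) pairs
    by_hash = defaultdict(list)
    for path, image_hash in hashed_files:
        by_hash[image_hash].append(path)
    return [paths for paths in by_hash.values() if len(paths) > 1]
```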
As a word of caution, pHash is not perfect. I have found that duplicate pictures sometimes have different hashes and similar (but not the same) pictures have the same hash. This script is a great starting point for cleaning your photo library of duplicate pictures, but make sure you look at the pictures before you delete them. You have been warned! I hold no responsibility for any family memories that might be lost because of this script.
This script has only been tested with Python 3 and is still pretty rough around the edges. Use at your own risk.
First, install this script. This can be done by either cloning the repository or downloading the script directly.
Next, install all required modules. This script has only been tested with Python 3. I would suggest that you make a virtual environment, setting Python 3 as the default Python executable (`mkvirtualenv --python=/usr/local/bin/python3 <name>`).
```
--db=<db_path>              The location of the database. (default: ./db)

--parallel=<num_processes>  The number of parallel processes to run to hash the image
                            files (default: 8).

find:
    --print                 Only print duplicate files rather than displaying HTML file
    --match-time            Adds the extra constraint that duplicate images must have the
                            same capture times in order to be considered.
    --trash=<trash_path>    Where files will be put when they are deleted (default: ./Trash)

dedup:
    --confirm               Confirm you realize this will delete duplicates automatically.
```
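The option listing above follows the docopt usage-string style. Purely as an illustration (whether this script actually uses `docopt` is an assumption), a command line like this could be wired up as follows:

```
# Hypothetical wiring, assembled from the commands and options shown
# above; an illustration, not this script's actual source code.
"""Usage:
    duplicate_finder.py add <path>... [--db=<db_path>] [--parallel=<num_processes>]
    duplicate_finder.py find [--db=<db_path>] [--print] [--match-time] [--trash=<trash_path>]
    duplicate_finder.py dedup --confirm [--db=<db_path>] [--match-time] [--trash=<trash_path>]
"""
from docopt import docopt  # pip install docopt

if __name__ == '__main__':
    args = docopt(__doc__)
    print(args)  # a dict such as {'add': True, '<path>': ['/photos'], ...}
```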
### Add
```
python duplicate_finder.py add /path/to/images
```
When a path is added, it is recursively searched for image files; in particular, `JPEG`, `PNG`, `GIF`, and `TIFF` images. Any image files found will be hashed. Adding a path uses 8 processes (by default) to hash images in parallel, so the CPU usage is very high.
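As a rough sketch of what that parallel hashing might look like (again assuming the `imagehash` library; this is an illustration, not necessarily how `duplicate_finder.py` implements it):

```
# Sketch of hashing files across worker processes; illustration only.
from multiprocessing import Pool

from PIL import Image
import imagehash

def hash_file(path):
    # Decoding and hashing an image is CPU-bound, which is why adding
    # a large directory drives CPU usage so high.
    return path, str(imagehash.phash(Image.open(path)))

def hash_all(paths, processes=8):
    # Spread the files across worker processes; results arrive in
    # completion order, which is fine since each carries its path.
    with Pool(processes=processes) as pool:
        return dict(pool.imap_unordered(hash_file, paths))
```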
### Find

Finds duplicate pictures that have been hashed. This will find images that have the same hash stored in the database. There are a few options associated with `find`. By default, when this command is run, a webpage is displayed showing duplicate pictures and a server is started that allows the pictures to be deleted (images are not actually deleted, but moved to a trash folder -- I really don't want you to make a mistake). The first option, `--print`, prints all duplicate pictures and does not display a webpage or start the server. `--match-time` adds the extra constraint that images must have the same EXIF time stamp to be considered duplicate pictures. Lastly, `--trash=<trash_path>` lets you select the path where files will be put when they are deleted. The trash path must already exist before an image is deleted.
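The "move to trash" behavior could look something like this hypothetical helper (the server's real code may differ):

```
# Hypothetical helper mirroring the behavior described above: a
# "deleted" image is moved to the trash folder, never removed outright.
import os
import shutil

def delete_image(path, trash='./Trash'):
    if not os.path.isdir(trash):
        # The trash path must already exist before an image is deleted.
        raise FileNotFoundError('Trash folder does not exist: ' + trash)
    shutil.move(path, trash)
```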
### Dedup

Similar to `find`, except that it deletes any duplicate picture it finds rather than bringing up a webpage. To make sure you really want to do this, you must provide the `--confirm` flag. See `find` for a description of the other options.
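In spirit, the automatic dedup pass keeps one file per duplicate group and trashes the rest. A hedged sketch, assuming the duplicate groups come from the hash comparison above and that "delete" means move to trash:

```
# Sketch of the automatic dedup idea: keep one image per duplicate
# group and move the rest to the trash. Illustration only.
import shutil

def dedup(duplicate_groups, confirm=False, trash='./Trash'):
    if not confirm:
        # Mirror the --confirm safety flag: refuse to act without it.
        raise SystemExit('Refusing to delete duplicates without confirmation.')
    for paths in duplicate_groups:
        keep, *extras = paths  # keep the first copy in each group
        for path in extras:
            shutil.move(path, trash)
```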