
Commit 52e1bdf

Full refactoring to hash videos by image.

- Add no_duplicates page from #71
- Improve fuzzy search for the new logic of multiple hashes
- Add table layout of duplicates
- Extend documentation
- Fix pylint and pycodestyle findings; add Ubuntu 22.04 test scripts
- Adapt tests

1 parent e6cc347 commit 52e1bdf

22 files changed (+419, -172 lines)
Lines changed: 41 additions & 0 deletions
```yaml
# This workflow will install Python dependencies, run tests and lint with a variety of Python versions
# For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python

name: Python package

on:
  push:
    branches: [ "master" ]
  pull_request:
    branches: [ "master" ]

jobs:
  build:

    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        python-version: ["3.12"]

    steps:
      - uses: actions/checkout@v3
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v3
        with:
          python-version: ${{ matrix.python-version }}
      - name: Install dependencies
        run: |
          sudo apt-get install -y python3 python3-pip python3-setuptools gnupg curl file
          curl -fsSL https://pgp.mongodb.com/server-7.0.asc | sudo gpg -o /usr/share/keyrings/mongodb-server-7.0.gpg --dearmor
          echo "deb [ arch=amd64,arm64 signed-by=/usr/share/keyrings/mongodb-server-7.0.gpg ] https://repo.mongodb.org/apt/ubuntu jammy/mongodb-org/7.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-7.0.list
          sudo apt-get update && sudo apt-get install -y mongodb-org && sudo mkdir -p /data/db && mongod &
          pip install --upgrade setuptools
          pip install --only-binary=numpy,scipy -r requirements.txt
          pip install -r requirements-test.txt
      - name: pep8 styles
        run: |
          pycodestyle *.py hashers tests
      - name: Test with pytest
        run: |
          pytest tests/test.py -v
```
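The workflow launches `mongod` in the background and immediately continues with dependency installation and tests, so the server may still be starting when pytest first connects. A minimal readiness-check sketch in Python, assuming pymongo and the default localhost URI (the helper name is illustrative, not part of this commit):

```python
from pymongo import MongoClient

def wait_for_mongo(uri="mongodb://localhost:27017/", timeout_s=30):
    """Block until mongod answers a ping, or fail after timeout_s seconds."""
    client = MongoClient(uri, serverSelectionTimeoutMS=timeout_s * 1000)
    client.admin.command("ping")  # raises ServerSelectionTimeoutError on failure
    return client
```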

.pylintrc

Lines changed: 3 additions & 0 deletions
```ini
[MASTER]

extension-pkg-allow-list=av
```
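Allow-listing the `av` C extension suggests the video hasher decodes frames with PyAV. A rough sketch, assuming PyAV and the `imagehash` package, of what "extracting N frames and hashing them with the perceptual hash" can look like (the function and its sampling strategy are illustrative, not this repository's code):

```python
import av          # PyAV, the C extension allow-listed above
import imagehash   # perceptual (pHash) hashing of PIL images

def video_phashes(path, n_frames=5, step=30):
    """Sketch: pHash every `step`-th decoded frame, collecting up to n_frames."""
    hashes = []
    with av.open(path) as container:
        stream = container.streams.video[0]
        for i, frame in enumerate(container.decode(stream)):
            if i % step == 0:
                hashes.append(str(imagehash.phash(frame.to_image())))
            if len(hashes) >= n_frames:
                break
    return hashes
```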

.travis.yml

Lines changed: 0 additions & 12 deletions
This file was deleted.

README.md

Lines changed: 42 additions & 36 deletions
````diff
@@ -2,26 +2,26 @@

 ![](https://api.travis-ci.org/philipbl/duplicate-images.svg)

-This Python script finds duplicate images using a [perspective hash (pHash)](http://www.phash.org) to compare images. pHash ignores the image size and file size and instead creates a hash based on the pixels of the image. This allows you to find duplicate pictures that have been rotated, have changed metadata, and slightly edited.
+This Python script finds duplicate files:
+- any file, by exact match of its blake2b hash
+- images, using a [perceptual hash (pHash)](http://www.phash.org) to compare them. pHash ignores the image size and file size and instead creates a hash based on the pixels of the image. This allows you to find duplicate pictures that have been rotated, have changed metadata, or been slightly edited.
+- videos, by extracting N frames at fixed relative times and hashing them with the perceptual hash (see above)

-This script hashes images added to it, storing the hash into a database (MongoDB). To find duplicate images, hashes are compared. If the hash is the same between two images, then they are marked as duplicates. A web interface is provided to delete duplicate images easily. If you are feeling lucky, there is an option to automatically delete duplicate files.
+This script hashes files added to it, storing the hashes in a database (MongoDB). To find duplicate files, hashes are compared. If the hash is the same between two files, they are marked as duplicates. A web interface is provided to delete duplicate files easily. If you are feeling lucky, there is an option to automatically delete duplicate files.

-As a word of caution, pHash is not perfect. I have found that duplicate pictures sometimes have different hashes and similar (but not the same) pictures have the same hash. This script is a great starting point for cleaning your photo library of duplicate pictures, but make sure you look at the pictures before you delete them. You have been warned! I hold no responsibility for any family memories that might be lost because of this script.
-
-This script has only been tested with Python 3 and is still pretty rough around the edges. Use at your own risk.
+As a word of caution, pHash is not perfect. I have found that duplicate pictures sometimes have different hashes and similar (but not the same) pictures have the same hash. This script is a great starting point for cleaning your photo or video library of duplicate pictures, but make sure you look at the pictures before you delete them. You have been warned! I hold no responsibility for any family memories that might be lost because of this script.

 ## Requirements

-This script requires MongoDB, Python 3.4 or higher, and a few Python modules, as found in `requirements.txt`.
+This script requires MongoDB, Python 3.12 or higher, and a few Python modules, as found in `requirements.txt`.

 ## Quick Start

 I suggest you read the usage, but here are the steps to get started right away. These steps assume that MongoDB is already installed on the system.

-First, install this script. This can be done by either cloning the repository or [downloading the script](https://github.com/philipbl/duplicate-images/archive/master.zip).
+First, install this script. This can be done by either cloning the repository or [downloading the script](https://github.com/bolshevik/duplicate-images/archive/master.zip).
 ```bash
-git clone https://github.com/philipbl/duplicate-images.git
+git clone https://github.com/bolshevik/duplicate-images.git
 ```

 Next, download all required modules. This script has only been tested with Python 3. I would suggest that you make a virtual environment, setting Python 3 as the default python executable (`mkvirtualenv --python=/usr/local/bin/python3 <name>`)
````
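The first two hash types in the new intro are standard building blocks. A minimal sketch using Python's `hashlib` and the `imagehash` package (function names are illustrative, not the repository's code):

```python
import hashlib

import imagehash
from PIL import Image

def exact_hash(path: str) -> str:
    """blake2b over raw bytes: catches byte-identical duplicates of any file."""
    h = hashlib.blake2b()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()

def image_phash(path: str) -> str:
    """pHash over pixels: survives resizing, metadata changes, light edits."""
    return str(imagehash.phash(Image.open(path)))
```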
````diff
@@ -34,26 +34,6 @@ Last, run script:
 python duplicate_finder.py
 ```

-## On Ubuntu 18.04
-
-```bash
-# Install Mongo and pip
-sudo apt -y install mongodb-server python3-pip
-# Disable Mongo service autostart
-sudo systemctl disable mongodb.service
-# Stop Mongo service
-sudo service mongodb stop
-```
-
-Python 2 is the default version of Python, so we have to call `python3` explicitely:
-
-```bash
-# Install dependencies with Python 3
-pip3 install -r requirements.txt
-# “python duplicate_finder.py” will fail, so we have to use Python 3 for every call:
-python3 duplicate_finder.py …
-```
-
 ## Example

 ```bash
````
````diff
@@ -76,6 +56,7 @@ Usage:
     duplicate_finder.py remove <path> ... [--db=<db_path>]
     duplicate_finder.py clear [--db=<db_path>]
     duplicate_finder.py show [--db=<db_path>]
+    duplicate_finder.py cleanup [--db=<db_path>]
     duplicate_finder.py find [--print] [--delete] [--match-time] [--trash=<trash_path>] [--db=<db_path>] [--threshold=<num>]
     duplicate_finder.py -h | --help

````
````diff
@@ -84,13 +65,13 @@ Options:

     --db=<db_path>             The location of the database or a MongoDB URI. (default: ./db)

-    --parallel=<num_processes> The number of parallel processes to run to hash the image
-                               files (default: number of CPUs).
+    --parallel=<num_processes> The number of parallel processes to run to hash the files
+                               (default: number of CPUs).

 find:
-    --threshold=<num>      Image matching threshold. Number of different bits in Hamming distance. False positives are possible.
+    --threshold=<num>      Hash matching threshold. Number of different bits in Hamming distance. False positives are possible.
     --print                Only print duplicate files rather than displaying HTML file
-    --delete               Move all found duplicate pictures to the trash. This option takes priority over --print.
+    --delete               Move all found duplicate files to the trash. This option takes priority over --print.
     --match-time           Adds the extra constraint that duplicate images must have the
                            same capture times in order to be considered.
     --trash=<trash_path>   Where files will be put when they are deleted (default: ./Trash)
````
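`--threshold` counts differing bits (Hamming distance) between two hashes. A minimal sketch of that comparison, assuming hashes are stored as equal-length hex strings (the function names are illustrative, not from this commit):

```python
def hamming_distance(hash_a: str, hash_b: str) -> int:
    """Number of differing bits between two equal-length hex hash strings."""
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")

def is_fuzzy_match(hash_a: str, hash_b: str, threshold: int = 0) -> bool:
    # threshold=0 demands identical hashes; larger values accept fuzzier matches
    return hamming_distance(hash_a, hash_b) <= threshold
```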
````diff
@@ -101,14 +82,14 @@ Options:
 python duplicate_finder.py add /path/to/images
 ```

-When a path is added, image files are recursively searched for. In particular, `JPEG`, `PNG`, `GIF`, and `TIFF` images are searched for. Any image files found will be hashed. Adding a path uses 8 processes (by default) to hash images in parallel so the CPU usage is very high.
+When a path is added, files are searched for recursively. A binary content hash is applied to every file; image files such as `JPEG`, `PNG`, `GIF`, and `TIFF` additionally get the perceptual hash, and video files get the video hash. Adding a path uses 8 processes (by default) to hash files in parallel, so the CPU usage is very high.

 ### Remove
 ```bash
 python duplicate_finder.py remove /path/to/images
 ```

-A path can be removed from the database. Any image inside that path will be removed from the database.
+A path can be removed from the database. Any file inside that path will be removed from the database.

 ### Clear
 ```bash
````
````diff
@@ -117,6 +98,13 @@ python duplicate_finder.py clear
 ```

 Removes all hashes from the database.

+### Cleanup
+```bash
+python duplicate_finder.py cleanup
+```
+
+Removes entries for files that have disappeared from disk.
+
 ### Show
 ```bash
 python duplicate_finder.py show
````
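A sketch of what such a cleanup pass could look like with pymongo; the database, collection, and field names are assumptions for illustration, not the script's actual schema:

```python
import os

from pymongo import MongoClient

def cleanup(collection):
    """Drop entries whose file no longer exists on disk."""
    for doc in collection.find({}, {"file_name": 1}):  # "file_name" is an assumed field
        if not os.path.exists(doc["file_name"]):
            collection.delete_one({"_id": doc["_id"]})

# "duplicates" / "files" are hypothetical names for illustration only.
cleanup(MongoClient("mongodb://localhost:27017/")["duplicates"]["files"])
```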
````diff
@@ -129,7 +117,25 @@ Prints the contents of the database.
 duplicate_finder.py find [--print] [--delete] [--match-time] [--trash=<trash_path>] [--threshold=<num>]
 ```

-Finds duplicate pictures that have been hashed. This will find images that have the same hash stored in the database. There are a few options associated with `find`. By default, when this command is run, a webpage is displayed showing duplicate pictures and a server is started that allows for the pictures to be deleted (images are not actually deleted, but moved to a trash folder -- I really don't want you to make a mistake). The first option, **`--print`**, prints all duplicate pictures and does not display a webpage or start the server. **`--delete`** automatically moves all duplicate images found to the trash. Be careful with this one. **`--match-time`** adds the extra constraint that images must have the same EXIF time stamp to be considered duplicate pictures. Last, `--trash=<trash_path>` lets you select a path to where you want files to be put when they are deleted. The default trash location is `./Trash`.
+Finds duplicate files that have been hashed. This will find files that have the same hash stored in the database. There are a few options associated with `find`. By default, when this command is run, a webpage is displayed showing duplicate files and a server is started that allows the files to be deleted (files are not actually deleted, but moved to a trash folder -- I really don't want you to make a mistake). The first option, **`--print`**, prints all duplicate files and does not display a webpage or start the server. **`--delete`** automatically moves all duplicate files found to the trash. Be careful with this one. **`--match-time`** adds the extra constraint that images must have the same EXIF time stamp to be considered duplicates. `--trash=<trash_path>` lets you select the path where files will be put when they are deleted; the default trash location is `./Trash`. Last, `--threshold=<num>` sets the number of bits of Hamming distance allowed when fuzzy-matching hashes.
+
+# Testing
+
+## Ubuntu 22.04
+```bash
+sudo apt-get install python3 python3-pip python3-setuptools gnupg curl file
+curl -fsSL https://pgp.mongodb.com/server-7.0.asc | \
+  sudo gpg -o /usr/share/keyrings/mongodb-server-7.0.gpg \
+  --dearmor
+echo "deb [ arch=amd64,arm64 signed-by=/usr/share/keyrings/mongodb-server-7.0.gpg ] https://repo.mongodb.org/apt/ubuntu jammy/mongodb-org/7.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-7.0.list
+sudo apt-get update
+sudo apt-get install -y mongodb-org
+sudo mkdir /data/db
+sudo mongod
+
+pip install --only-binary=numpy,scipy -r requirements.txt
+pip install -r requirements-test.txt
+```

 ## Disclaimer

````
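`find` boils down to grouping database entries that share a hash; with the commit's multiple-hashes-per-file logic, two files match when any of their hashes coincide (or fall within the threshold). A sketch of the exact-match grouping as a MongoDB aggregation; the collection and field names are assumptions, not the script's actual schema:

```python
from pymongo import MongoClient

# "duplicates", "files", "hashes", and "file_name" are illustrative names.
collection = MongoClient("mongodb://localhost:27017/")["duplicates"]["files"]

# Unwind each file's list of hashes, then group files sharing any hash value.
pipeline = [
    {"$unwind": "$hashes"},
    {"$group": {"_id": "$hashes",
                "files": {"$addToSet": "$file_name"},
                "count": {"$sum": 1}}},
    {"$match": {"count": {"$gt": 1}}},
]
for group in collection.aggregate(pipeline):
    print(group["_id"], group["files"])
```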