forked from abonander/img-dup


oneamitj/img-dup

 
 


img-dup

A tool for finding duplicate and similar images in directory trees, written in Rust

Thanks to Dr. Neal Krawetz for the outlines of the Average-Mean and DCT-Mean perceptual hash algorithms:
http://www.hackerfactor.com/blog/?/archives/432-Looks-Like-It.html (Accessed August 2014)
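
As a rough illustration of the Average-Mean idea from that outline, the sketch below sets one bit per pixel depending on whether the pixel is brighter than the mean. It is not the code in this repository: the name `mean_hash` is hypothetical, and it assumes the image has already been converted to grayscale and shrunk to the hash dimensions.

```rust
/// Average ("mean") hash sketch.
/// Assumption: `pixels` holds grayscale values of an image that has
/// already been downscaled to the hash dimensions (e.g. 8x8 = 64 values).
pub fn mean_hash(pixels: &[u8]) -> Vec<bool> {
    if pixels.is_empty() {
        return Vec::new();
    }
    // Mean brightness over all pixels.
    let sum: u64 = pixels.iter().map(|&p| p as u64).sum();
    let mean = sum as f64 / pixels.len() as f64;
    // One bit per pixel: true if the pixel is brighter than the mean.
    pixels.iter().map(|&p| p as f64 > mean).collect()
}
```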

Thanks to Emil Mikulic for the 2D Discrete Cosine Transform implementation in C, ported to Rust in src/dst.rs:
http://unix4lyfe.org/dct/ (Implementation: http://unix4lyfe.org/dct/listing2.c) (Accessed August 2014)

Unfortunately, the AAN algorithm, which provides O(n log n) performance, did not seem viable for arbitrary-length input vectors without massive code duplication. This shouldn't be much of a concern, as the program is largely I/O bound and the time spent hashing should be insignificant compared to the run time of the program as a whole.
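
For reference, a direct O(n^2) DCT-II that accepts input of any length can be sketched as follows. This is neither the code in this repository nor Emil Mikulic's listing: the names `dct_1d` and `dct_2d` are hypothetical, the output is left unnormalized, and the 2D version simply applies the 1D transform to every row and then to every column.

```rust
use std::f64::consts::PI;

/// Naive, unnormalized DCT-II of an arbitrary-length input: O(n^2) per vector.
pub fn dct_1d(input: &[f64]) -> Vec<f64> {
    let n = input.len();
    (0..n)
        .map(|k| {
            input
                .iter()
                .enumerate()
                .map(|(i, &x)| x * (PI * (i as f64 + 0.5) * k as f64 / n as f64).cos())
                .sum::<f64>()
        })
        .collect()
}

/// 2D DCT over row-major data: transform each row, then each column.
pub fn dct_2d(data: &[f64], width: usize, height: usize) -> Vec<f64> {
    assert_eq!(data.len(), width * height);
    if data.is_empty() {
        return Vec::new();
    }
    let mut out = vec![0.0; data.len()];
    // Pass 1: rows.
    for (src, dst) in data.chunks(width).zip(out.chunks_mut(width)) {
        dst.copy_from_slice(&dct_1d(src));
    }
    // Pass 2: columns, working on the row-transformed values.
    for x in 0..width {
        let column: Vec<f64> = (0..height).map(|y| out[y * width + x]).collect();
        for (y, value) in dct_1d(&column).into_iter().enumerate() {
            out[y * width + x] = value;
        }
    }
    out
}
```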

Usage

$ img-dup --help
Duplicate Image Finder

Options:
    -t --threads [1+]   How many threads the program should use to process
			images. Defaults to the number of cores reported by
			the OS.
    -d --dir [directory]
			The directory the program should search in. Default is
			the current working directory.
    -r --recurse        If present, the program will search subdirectories.
    -h --hash-size [1+] Helps the program decide the number of bits to use for
			the hash. A higher number means more detail, but
			greater memory usage. Default is 8
    -s --threshold [0.01 - 99.99]
			The amount in percentage that an image must be
			different from another to qualify as unique. Default
			is 3
    -f --fast           Use a faster, less accurate algorithm. Really only
			useful for finding duplicates. Using a low threshold
			and/or a larger hash is recommended.
    -e --ext [extension]
			Search for filenames with the given extension.
			Defaults are jpeg, jpg, and png.
    -o --outfile [file] Output to the given file. If omitted, will print to
			stdout. If not absolute, it will be relative to the
			search directory.
    --help              Display this help.
    -u --dup-only       Only output images with similars or duplicates.
    -l --limit [1+]     Only process the given number of images.

Given no arguments, img-dup will search the current working directory with a configuration that should be optimal for most use cases, as discovered via brief experimentation. It will output its results to stdout, which may not be a good idea for large galleries as it can easily overflow the terminal window buffer.

img-dup --outfile=results.txt writes the results of the search to results.txt in the search directory, which is the one given by --dir=[directory] or otherwise the current working directory. If the file already exists, it will be overwritten.
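
The path rule described above (relative output paths resolve against the search directory, absolute ones are used as given) happens to match the behavior of `Path::join` in the Rust standard library. A minimal sketch, where `resolve_outfile` is a hypothetical helper and not necessarily how img-dup does it:

```rust
use std::path::{Path, PathBuf};

/// Resolve the --outfile argument against the search directory.
/// `Path::join` returns `outfile` unchanged when it is absolute,
/// and appends it to `search_dir` otherwise.
pub fn resolve_outfile(search_dir: &Path, outfile: &Path) -> PathBuf {
    search_dir.join(outfile)
}
```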

img-dup can take quite a long time to process all the images it finds, depending on the average size and the number of images in a directory tree. It took about an hour to process ~2300 images (~2.3GB) on the following machine:

  • Core i7 3770k (stock clocks) (8 logical cores as reported by the OS)
  • 16 GB DDR3 RAM
  • Windows 7 64-bit
  • 1 TB HDD 7200RPM SATA3

You might see performance improvements using a higher number of threads than the default (the number of cores in your CPU as reported by your OS), since many of them will be blocked on I/O at any given point. An SSD or RAMDisk might further improve search speeds, as will a properly defragmented hard drive (if using NTFS).

However, in my experiments, all 8 cores were at 100% capacity most of the time, so the bottleneck might actually be in decoding the images and not loading them from disk.

A --threshold greater than 3 (percent difference) often produces misleading results, as the perceptual hash will match images that are "similar" in structure or composition but do not look similar to the human eye. Exact duplicates are always 0% different, and resizes and minor edits are usually within 2%.
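
One common way to express the difference between two perceptual hashes as a percentage is the share of bit positions in which they disagree (the Hamming distance divided by the hash length). The sketch below assumes that definition; `percent_difference` is a hypothetical name, and the exact comparison img-dup performs may differ.

```rust
/// Percentage of bit positions in which two equal-length hashes differ.
/// Returns None if the hashes are empty or have different lengths.
pub fn percent_difference(a: &[bool], b: &[bool]) -> Option<f64> {
    if a.is_empty() || a.len() != b.len() {
        return None;
    }
    // Hamming distance: count positions where the bits disagree.
    let differing = a.iter().zip(b).filter(|(x, y)| x != y).count();
    Some(differing as f64 * 100.0 / a.len() as f64)
}
```

Under this assumption, two 64-bit hashes that differ in 2 bits are 3.125% different, so small thresholds correspond to only a handful of differing bits.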

If detail is a concern, a larger threshold should be used with a larger --hash-size setting, though memory usage increases on the order of O([number of images] * hash-size^2). The actual image data isn't kept in memory after being hashed, so memory usage shouldn't be much of a concern. In the above test, img-dup kept below 500MB for the duration of the test.
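
To put the growth rate in numbers, the sketch below estimates the raw storage for the hashes alone. It assumes each hash is hash-size^2 bits, rounded up to whole bytes, and ignores file paths and any other per-image bookkeeping; `hash_memory_bytes` is a hypothetical helper.

```rust
/// Rough lower bound on memory used by the hashes themselves.
/// Assumption: one hash is hash_size * hash_size bits, stored in whole bytes.
pub fn hash_memory_bytes(images: u64, hash_size: u64) -> u64 {
    let bits_per_hash = hash_size * hash_size;
    let bytes_per_hash = (bits_per_hash + 7) / 8; // round up to whole bytes
    images * bytes_per_hash
}
```

With these assumptions, 2300 images at the default hash size of 8 need 18,400 bytes for the hashes, and 73,600 bytes at a hash size of 16.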

GIF files are currently not searched for by default due to an elusive bug in rust-image that may or may not have to do with animations. You can add --ext=gif to search for them. It shouldn't be a huge problem since an invalid image can only crash a single processing thread, though crashed threads currently cannot be restarted. The other threads will continue until they finish the remaining work or crash, at which point the program will print the available results and exit normally.
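
The crash behavior described above can be sketched with the standard library alone. This is an illustration in current Rust, not the program's actual implementation: `process_all` is a hypothetical name, the workers pull items from a shared queue, and a worker that panics simply stops while the others carry on.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

/// Run `work` over `items` on `threads` workers and collect whatever finishes.
/// A panic in `work` kills only that worker; its current item is lost.
pub fn process_all<T, R, F>(items: Vec<T>, threads: usize, work: F) -> Vec<R>
where
    T: Send + 'static,
    R: Send + 'static,
    F: Fn(T) -> R + Send + Sync + 'static,
{
    let queue = Arc::new(Mutex::new(items.into_iter()));
    let work = Arc::new(work);
    let (tx, rx) = mpsc::channel();

    let handles: Vec<_> = (0..threads.max(1))
        .map(|_| {
            let queue = Arc::clone(&queue);
            let work = Arc::clone(&work);
            let tx = tx.clone();
            thread::spawn(move || loop {
                // The lock is released at the end of this statement,
                // so a panic inside `work` cannot poison the queue.
                let next = queue.lock().unwrap().next();
                match next {
                    Some(item) => {
                        let _ = tx.send(work(item));
                    }
                    None => break,
                }
            })
        })
        .collect();

    // Drop the original sender so `rx` ends once every worker is done or dead.
    drop(tx);
    let results: Vec<R> = rx.iter().collect();
    for handle in handles {
        // `join` returns Err for a worker that panicked; ignore it and move on.
        let _ = handle.join();
    }
    results
}
```

Results arrive in completion order, not input order. A caller that wants the default thread count could pass the value from `std::thread::available_parallelism()`.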

Building

img-dup is built with Cargo. If you are at all proficient in Rust, you should already know how to use Cargo and have it installed. There are no special steps beyond cargo build.

Keep in mind that Rust programs need the libgcc_s_dw2-1.dll to run on Windows. It can be found in the bin/ folder of your Rust installation. This is only necessary on machines without Rust installed. For those you can just place a copy of the DLL in img-dup's directory.

License

Please see LICENSE.md in this repository.

This project uses PistonDevelopers/rust-image, which is MIT-licensed. According to Wikipedia, it is safe to use MIT-licensed software in this project.
