# Find duplicate files
This is a Perl script that locates duplicate files (i.e. files with identical contents but at different inodes) with O(n) time and O(1) space overhead. It relies on GNU sort, which takes O(n log n) time and O(n) space, to sort the entries; the script itself adds as little overhead as possible on top of that, which should make it among the most efficient scripts available on GitHub for this purpose.
File duplication is detected using String::CRC32 and Digest::MD5. You can use either one or both of them as the checksum.
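To see the overall approach (checksum, sort, group) in isolation, here is a rough equivalent built from standard GNU tools. This is only a sketch of the idea, not how the script works internally, and it is considerably less efficient:

```
# Sketch of the idea with standard GNU tools (not the script's implementation):
# hash every non-empty regular file, sort by checksum, and print the groups of
# files whose checksums repeat.
find images -type f ! -empty -exec md5sum {} + | sort | uniq -w32 --all-repeated=separate
```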
```
Usage: find-duplicate [OPTIONS] {DIR,FILE}*

Options:
  --checksum=TYPE  Specify the type (crc, md5, both) of checksum.  Default is crc.
  --digest         Don't find duplicates.  Print checksums of the input files instead.
  --pipe           Don't compute checksums.  Read checksums from STDIN instead.
  --parallel=N     Change the number of sorts run concurrently to N.
  --verbose        Print debug messages.
  --help           Print this message.
```
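For example, the following invocation (with a hypothetical directory `photos`) checks the files in `photos` using both CRC32 and MD5, runs four sorts concurrently, and prints debug messages along the way:

```
# Hypothetical example combining several options.
find-duplicate --checksum=both --parallel=4 --verbose photos
```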
Given at least one DIR/FILE, the script checks the files contained in DIR or listed in FILE (which is expected to be the output of something like `ls -1 DIR` or `find DIR`). Note that the script ignores sub-directories and empty files. To check files recursively, you have to feed the filenames to the script using `find`. For example, to find duplicate files in the directory `images` using MD5 as the checksum, you can write

```
find-duplicate --checksum=md5 images
```
which is equivalent to (see discussion below):

```
find-duplicate --checksum=md5 --digest images | find-duplicate --pipe
```
Note that the above usages don't check files in sub-directories. To check files recursively, use

```
find images | find-duplicate --checksum=md5
```
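Since `find` also prints directory names and the script has to skip them (along with empty files), you may prefer to filter the list up front. The variant below assumes GNU find's `-empty` test:

```
# Feed only non-empty regular files to the script; it would skip directories
# and empty files anyway, but filtering them out up front keeps the input clean.
find images -type f ! -empty | find-duplicate --checksum=md5
```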
In practice, computing checksums is usually the most time-consuming stage in the process of finding duplicate files. Hence, it is sometimes preferable to separate this stage from the rest of the process. An advantage of such two-stage processing is that it gives you an opportunity to compute the checksums in parallel. For example, suppose you want to compare files in the folders B1, ..., Bn. Instead of invoking `find-duplicate B1 ... Bn` directly, you can parallelize the computation of checksums as follows:
```
for i in $(seq 1 $n); do              # spawn $n processes
  find-duplicate --digest B$i > B$i.checksum &
done
wait
cat *.checksum | find-duplicate --pipe && rm *.checksum
```
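If you prefer not to manage the background jobs yourself, the same fan-out can be written with GNU xargs; this is just an alternative sketch, with `-P 4` limiting the number of concurrent hashing jobs to four:

```
# Alternative sketch: let xargs run the hashing jobs, four at a time.
seq 1 $n | xargs -P 4 -I{} sh -c 'find-duplicate --digest "B{}" > "B{}.checksum"'
cat *.checksum | find-duplicate --pipe && rm *.checksum
```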
For a slightly more convoluted example, consider the situation where you want to compare the files in folder A against those in folders B1, ..., Bn, but you don't want to compare the files in B1, ..., Bn with each other. In this case, computing the checksums first and then comparing the folders using the pre-computed checksums is far more efficient than directly comparing the folders in pairs.
```
find-duplicate --digest A > A.checksum
for i in $(seq 1 $n); do              # spawn $n processes
  ( find-duplicate --digest B$i > B$i.checksum
    ( flock 200                       # serialize access to STDOUT
      cat A.checksum B$i.checksum | find-duplicate --pipe
      rm B$i.checksum
    ) 200> /tmp/.lock
  ) &
done
```
Here we use `flock` to prevent the processes from writing to STDOUT simultaneously (without the lock, output lines from different comparisons could interleave).
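An alternative to locking is to give each comparison its own output file, in which case no `flock` is needed (the result file names, such as A-vs-B1.duplicates, are just placeholders):

```
# Variant without flock: each comparison writes to its own result file,
# so the background jobs never contend for STDOUT.
find-duplicate --digest A > A.checksum
for i in $(seq 1 $n); do
  ( find-duplicate --digest B$i > B$i.checksum
    cat A.checksum B$i.checksum | find-duplicate --pipe > A-vs-B$i.duplicates
    rm B$i.checksum
  ) &
done
wait
```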