# Find duplicate files
This is a Perl script that locates duplicate files (i.e. files with identical contents but at different inodes) with O(n) time and O(1) space overhead. It relies on GNU sort, which takes O(n log n) time and O(n) space, to sort the entries; the script itself adds as little overhead as possible on top of that, which should make it among the most efficient scripts available on GitHub for this purpose.
File duplication is detected using String::CRC32 and Digest::MD5. You can use either one or both of them as the checksum.
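To see the overall approach (checksum, sort, group) in isolation, here is a rough equivalent built from standard GNU tools. This is only a sketch of the idea, not how the script works internally, and it is considerably less efficient:

```
# Sketch of the idea with standard GNU tools (not the script's implementation):
# hash every non-empty regular file, sort by checksum, and print the groups of
# files whose checksums repeat.
find images -type f ! -empty -exec md5sum {} + | sort | uniq -w32 --all-repeated=separate
```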
```
Usage: find-duplicate [OPTIONS] {DIR,FILE}*

Options:
  --checksum=TYPE  Specify the type (crc, md5, both) of checksum.  Default is crc.
  --digest         Don't find duplicates.  Print checksums of the input files instead.
  --pipe           Don't compute checksums.  Read checksums from STDIN instead.
  --parallel=N     Change the number of sorts run concurrently to N.
  --verbose        Print debug messages.
  --help           Print this message.
```
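For example, the following invocation (with a hypothetical directory `photos`) checks the files in `photos` using both CRC32 and MD5, runs four sorts concurrently, and prints debug messages along the way:

```
# Hypothetical example combining several options.
find-duplicate --checksum=both --parallel=4 --verbose photos
```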
Given at least one DIR/FILE, the script checks the files contained in DIR or listed in FILE (which is expected to be the output of something like `ls -1 DIR` or `find DIR`). Note that the script ignores sub-directories and empty files. To check files recursively, you have to feed the filenames to the script using `find`. For example, to find duplicate files in the directory `images` using MD5 as the checksum, you can write

```
find-duplicate --checksum=md5 images
```
which is equivalent to (see discussion below):

```
find-duplicate --checksum=md5 --digest images | find-duplicate --pipe
```
Note that the above usages don't check files in sub-directories. To check files recursively, use

```
find images | find-duplicate --checksum=md5
```
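Since `find` also prints directory names and the script has to skip them (along with empty files), you may prefer to filter the list up front. The variant below assumes GNU find's `-empty` test:

```
# Feed only non-empty regular files to the script; it would skip directories
# and empty files anyway, but filtering them out up front keeps the input clean.
find images -type f ! -empty | find-duplicate --checksum=md5
```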
In practice, computing checksums is usually the most time-consuming stage in the process of finding duplicate files. Hence, it is sometimes preferable to separate this stage from the rest of the process. An advantage of such two-stage processing is that it gives you an opportunity to compute the checksums in parallel. For example, suppose you want to compare files in the folders B1, ..., Bn. Instead of invoking `find-duplicate B1 ... Bn` directly, you can parallelize the computation of checksums as follows:
```
for i in $(seq 1 $n); do              # spawn $n processes
  find-duplicate --digest B$i > B$i.checksum &
done
wait
cat *.checksum | find-duplicate --pipe && rm *.checksum
```
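If you prefer not to manage the background jobs yourself, the same fan-out can be written with GNU xargs; this is just an alternative sketch, with `-P 4` limiting the number of concurrent hashing jobs to four:

```
# Alternative sketch: let xargs run the hashing jobs, four at a time.
seq 1 $n | xargs -P 4 -I{} sh -c 'find-duplicate --digest "B{}" > "B{}.checksum"'
cat *.checksum | find-duplicate --pipe && rm *.checksum
```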
For a slightly more convoluted example, consider the situation where you want to compare the files in folder A against those in folders B1, ..., Bn, but you don't want to compare the files in B1, ..., Bn with each other. In this case, computing the checksums first and then comparing the folders using the pre-computed checksums is far more efficient than directly comparing the folders in pairs.
```
find-duplicate --digest A > A.checksum
for i in $(seq 1 $n); do              # spawn $n processes
  ( find-duplicate --digest B$i > B$i.checksum
    ( flock 200                       # serialize access to STDOUT
      cat A.checksum B$i.checksum | find-duplicate --pipe
      rm B$i.checksum
    ) 200> /tmp/.lock
  ) &
done
```
Here we use `flock` to prevent the processes from writing to STDOUT simultaneously (without the lock, output lines from different comparisons could interleave).
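An alternative to locking is to give each comparison its own output file, in which case no `flock` is needed (the result file names, such as A-vs-B1.duplicates, are just placeholders):

```
# Variant without flock: each comparison writes to its own result file,
# so the background jobs never contend for STDOUT.
find-duplicate --digest A > A.checksum
for i in $(seq 1 $n); do
  ( find-duplicate --digest B$i > B$i.checksum
    cat A.checksum B$i.checksum | find-duplicate --pipe > A-vs-B$i.duplicates
    rm B$i.checksum
  ) &
done
wait
```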