ShadowSeek is a CLI tool for near-duplicate detection in text files. Written in Rust, it offers fast execution and low memory overhead, with no dependency on an external runtime environment. It parses various text file formats with the textract, rtf-parser, and epub crates, then uses SimHash to quickly filter out highly dissimilar documents before applying the more computationally expensive MinHash algorithm to identify near-duplicates with high accuracy.
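To illustrate the SimHash first pass, here is a minimal sketch of a 64-bit fingerprint and a Hamming-distance check. It is not ShadowSeek's actual implementation: the function names, the whitespace tokenization, the use of the standard library's `DefaultHasher`, and the `max_distance` threshold are all assumptions made for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash one token to 64 bits. `DefaultHasher::new()` uses fixed keys, so the
/// result is stable within a build, but the algorithm may change between Rust
/// releases (assumption: fingerprints are not persisted across versions).
fn hash_token(token: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    token.hash(&mut hasher);
    hasher.finish()
}

/// Compute a 64-bit SimHash over whitespace-separated tokens (hypothetical
/// tokenization; a real tool might use shingles or weighted features).
fn simhash(text: &str) -> u64 {
    // One signed counter per bit position.
    let mut counts = [0i32; 64];
    for token in text.split_whitespace() {
        let h = hash_token(token);
        for (bit, count) in counts.iter_mut().enumerate() {
            if (h >> bit) & 1 == 1 {
                *count += 1;
            } else {
                *count -= 1;
            }
        }
    }
    // A bit is set in the fingerprint when its counter is positive.
    let mut fingerprint = 0u64;
    for (bit, &count) in counts.iter().enumerate() {
        if count > 0 {
            fingerprint |= 1u64 << bit;
        }
    }
    fingerprint
}

/// Number of bit positions in which two fingerprints differ.
fn hamming_distance(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}

/// First-pass filter: keep a pair only if the fingerprints are close.
/// `max_distance` is an illustrative parameter, not a documented option.
fn may_be_near_duplicates(a: &str, b: &str, max_distance: u32) -> bool {
    hamming_distance(simhash(a), simhash(b)) <= max_distance
}
```

Pairs that fail `may_be_near_duplicates` can be dropped without further work, so only the surviving pairs reach the more expensive comparison stage.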
Inspired by Dr. Paweł Mandera's near-duplicate detection tool Duometer, ShadowSeek aims to provide a more lightweight and efficient alternative. Because it is developed in Rust, users can run a precompiled binary without installing a Java runtime environment, which also reduces startup time and memory usage. Using SimHash as a first-pass filter eliminates dissimilar documents early, minimizing the number of comparisons performed in the more computationally expensive MinHash stage.
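For context on why the MinHash stage costs more per comparison, the following sketch builds a MinHash signature from word shingles and estimates Jaccard similarity from two signatures. It is only an illustration under stated assumptions: the function names, the shingle size `k`, the seeded use of `DefaultHasher` to stand in for independent hash functions, and the signature length are hypothetical and not taken from ShadowSeek or Duometer.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Split text into a set of word shingles (overlapping runs of `k` words).
/// Returns an empty set when `k` is 0 or the text has fewer than `k` words.
fn shingles(text: &str, k: usize) -> HashSet<String> {
    if k == 0 {
        return HashSet::new();
    }
    let words: Vec<&str> = text.split_whitespace().collect();
    words.windows(k).map(|w| w.join(" ")).collect()
}

/// Hash a shingle together with a seed; each seed stands in for one of the
/// independent hash functions that MinHash assumes (an approximation).
fn seeded_hash(seed: u64, shingle: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    seed.hash(&mut hasher);
    shingle.hash(&mut hasher);
    hasher.finish()
}

/// For each seed, keep the minimum hash over all shingles in the set.
/// An empty set yields `u64::MAX` in every slot.
fn minhash_signature(set: &HashSet<String>, num_hashes: usize) -> Vec<u64> {
    (0..num_hashes as u64)
        .map(|seed| {
            set.iter()
                .map(|s| seeded_hash(seed, s))
                .min()
                .unwrap_or(u64::MAX)
        })
        .collect()
}

/// Fraction of signature slots that agree, which estimates the Jaccard
/// similarity of the underlying shingle sets.
fn estimated_jaccard(a: &[u64], b: &[u64]) -> f64 {
    assert_eq!(a.len(), b.len(), "signatures must have equal length");
    if a.is_empty() {
        return 0.0;
    }
    let matches = a.iter().zip(b).filter(|(x, y)| x == y).count();
    matches as f64 / a.len() as f64
}
```

Each signature requires hashing every shingle once per seed, which is why it pays to discard clearly dissimilar pairs before this stage.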
(CURRENTLY UNDER DEVELOPMENT)