A CLI tool for near-duplicate detection in text files, written in Rust with no dependencies on runtime environments.

License

Apache-2.0, MIT licenses found

Licenses found

Apache-2.0
LICENSE-APACHE
MIT
LICENSE-MIT
Luis-Varona/shadowseek

shadowseek

License: MIT OR Apache-2.0

ShadowSeek is a CLI tool for near-duplicate detection in text files. Written in native Rust, it offers fast execution and low memory overhead, with no dependencies on external runtime environments. It parses various text file formats with the textract, rtf-parser, and epub crates, uses SimHash to quickly filter out highly dissimilar documents, and then applies a more sophisticated MinHash algorithm to identify near-duplicates with high accuracy.
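The SimHash first pass described above can be illustrated with a minimal sketch. This is not ShadowSeek's actual implementation: the function names (`hash_token`, `simhash`, `hamming_distance`, `passes_prefilter`), the whitespace tokenization, the 64-bit fingerprint width, and the use of the standard library's `DefaultHasher` are all assumptions made for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash one token to 64 bits (hypothetical helper; any 64-bit hash would do).
fn hash_token(token: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    token.hash(&mut hasher);
    hasher.finish()
}

/// Compute a 64-bit SimHash fingerprint over whitespace-separated tokens.
/// Each token votes +1 or -1 on every bit position; the sign of the tally
/// decides the corresponding fingerprint bit.
fn simhash(text: &str) -> u64 {
    let mut counts = [0i32; 64];
    for token in text.split_whitespace() {
        let h = hash_token(token);
        for (bit, count) in counts.iter_mut().enumerate() {
            if (h >> bit) & 1 == 1 {
                *count += 1;
            } else {
                *count -= 1;
            }
        }
    }
    let mut fingerprint = 0u64;
    for (bit, &count) in counts.iter().enumerate() {
        if count > 0 {
            fingerprint |= 1u64 << bit;
        }
    }
    fingerprint
}

/// Number of bit positions in which two fingerprints differ.
fn hamming_distance(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}

/// A pair survives the first pass only if its fingerprints are close.
/// `max_distance` is an assumed tuning parameter, not a ShadowSeek option.
fn passes_prefilter(a: &str, b: &str, max_distance: u32) -> bool {
    hamming_distance(simhash(a), simhash(b)) <= max_distance
}
```

Calling `passes_prefilter(doc_a, doc_b, 10)` would keep the pair for the later MinHash comparison only when the two fingerprints differ in at most 10 of 64 bits. The threshold of 10 is arbitrary here and would need tuning against real data.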

Inspired by Dr. Paweł Mandera's near-duplicate detection tool Duometer, ShadowSeek aims to provide a more lightweight and efficient alternative. Because it is developed in Rust, users can run a precompiled binary without installing a Java runtime environment; as an added bonus, this also reduces startup time and memory usage. Using SimHash as a first-pass filter eliminates dissimilar documents sooner, minimizing the number of comparisons performed in the more computationally expensive MinHash stage.
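To show why the MinHash stage costs more per pair than a fingerprint comparison, here is a rough sketch of MinHash similarity estimation. It is an illustration under stated assumptions, not the project's code: the names `shingles`, `seeded_hash`, `minhash_signature`, and `estimate_jaccard`, the word-level shingling, and the seeding scheme built on `DefaultHasher` are all hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Split text into overlapping k-word shingles (assumed word-level shingling).
fn shingles(text: &str, k: usize) -> HashSet<String> {
    let words: Vec<&str> = text.split_whitespace().collect();
    if k == 0 {
        // `windows(0)` would panic, so treat k = 0 as "no shingles".
        return HashSet::new();
    }
    words.windows(k).map(|w| w.join(" ")).collect()
}

/// Simulate a family of hash functions by mixing a seed into the hasher.
fn seeded_hash(item: &str, seed: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    seed.hash(&mut hasher);
    item.hash(&mut hasher);
    hasher.finish()
}

/// For each of `num_hashes` seeded hash functions, keep the minimum hash
/// value over the set. An empty set yields `u64::MAX` in every slot.
fn minhash_signature(set: &HashSet<String>, num_hashes: usize) -> Vec<u64> {
    (0..num_hashes as u64)
        .map(|seed| {
            set.iter()
                .map(|s| seeded_hash(s, seed))
                .min()
                .unwrap_or(u64::MAX)
        })
        .collect()
}

/// The fraction of matching signature slots estimates Jaccard similarity.
/// Returns 0.0 for empty or mismatched-length signatures.
fn estimate_jaccard(a: &[u64], b: &[u64]) -> f64 {
    if a.is_empty() || a.len() != b.len() {
        return 0.0;
    }
    let matches = a.iter().zip(b).filter(|(x, y)| x == y).count();
    matches as f64 / a.len() as f64
}
```

Building a signature hashes every shingle once per hash function, which is why reducing the number of candidate pairs before this stage saves work. More hash functions give a tighter similarity estimate at proportionally higher cost.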

(CURRENTLY UNDER DEVELOPMENT)
