A native Rust implementation of the CPMerge algorithm for approximate string matching. This crate is particularly useful for natural language processing tasks that require retrieving strings from very large corpora. It currently supports character-based and word-based n-gram feature generation, with plans to allow user-defined feature generation methods.
- ✅ Fast algorithm for string matching
- ✅ 100% exact retrieval
- ✅ Support for Unicode
- Support for building databases directly from text files
- Mecab-based tokenizer support
- ✅ Dice coefficient
- ✅ Jaccard coefficient
- ✅ Cosine coefficient
- ✅ Overlap coefficient
- ✅ Exact match
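To make the measures listed above concrete, here is a minimal, dependency-free sketch that computes them over sets of padded character n-grams. It does not use this crate's API: `char_ngrams` and the measure functions are hypothetical helpers written for illustration, and the crate's own extractors may differ (for example, in how repeated n-grams or padding are handled).

```rust
use std::collections::HashSet;

/// Character n-grams of `s`, padded with `n - 1` copies of `pad` on each side.
/// Illustrative helper only; works on Unicode scalar values, not graphemes.
fn char_ngrams(s: &str, n: usize, pad: char) -> HashSet<String> {
    assert!(n >= 1, "n must be at least 1");
    let padding: String = std::iter::repeat(pad).take(n - 1).collect();
    let padded: Vec<char> = format!("{padding}{s}{padding}").chars().collect();
    padded.windows(n).map(|w| w.iter().collect::<String>()).collect()
}

/// Number of features shared by both sets.
fn shared(x: &HashSet<String>, y: &HashSet<String>) -> f64 {
    x.intersection(y).count() as f64
}

// Note: all ratio measures below return NaN when both sets are empty.

/// Dice: 2|X ∩ Y| / (|X| + |Y|)
fn dice(x: &HashSet<String>, y: &HashSet<String>) -> f64 {
    2.0 * shared(x, y) / (x.len() + y.len()) as f64
}

/// Jaccard: |X ∩ Y| / |X ∪ Y|
fn jaccard(x: &HashSet<String>, y: &HashSet<String>) -> f64 {
    let c = shared(x, y);
    c / (x.len() as f64 + y.len() as f64 - c)
}

/// Cosine: |X ∩ Y| / sqrt(|X| * |Y|)
fn cosine(x: &HashSet<String>, y: &HashSet<String>) -> f64 {
    shared(x, y) / ((x.len() * y.len()) as f64).sqrt()
}

/// Overlap: |X ∩ Y| / min(|X|, |Y|)
fn overlap(x: &HashSet<String>, y: &HashSet<String>) -> f64 {
    shared(x, y) / x.len().min(y.len()) as f64
}

/// Exact match, taken here to mean identical feature sets (an assumption).
fn exact_match(x: &HashSet<String>, y: &HashSet<String>) -> f64 {
    if x == y { 1.0 } else { 0.0 }
}

fn main() {
    let x = char_ngrams("hell", 2, '$');
    let y = char_ngrams("hello", 2, '$');
    println!("dice    = {:.4}", dice(&x, &y));
    println!("jaccard = {:.4}", jaccard(&x, &y));
    println!("cosine  = {:.4}", cosine(&x, &y));
    println!("overlap = {:.4}", overlap(&x, &y));
    println!("exact   = {:.4}", exact_match(&x, &y));
}
```

With bigrams and `$` padding, "hell" yields 5 features and "hello" yields 6, of which 4 are shared, so the cosine score is 4 / sqrt(30), roughly 0.73.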
Add simstring_rust to your Cargo.toml:
[dependencies]
simstring_rust = "0.3.0" # change version accordingly
For the latest features, you can track the main branch by specifying the Git repository:
[dependencies]
simstring_rust = { git = "https://github.com/PyDataBlog/simstring_rs.git", branch = "main" }
Note: The main branch may include experimental features and breaking changes. Use with caution!
To revert to a stable release, make sure your Cargo.toml specifies a version number instead of the Git repository.
Here is a basic example of how to use simstring_rs in your Rust project:
use simstring_rust::database::HashDb;
use simstring_rust::extractors::CharacterNgrams;
use simstring_rust::measures::Cosine;
use simstring_rust::Searcher;

use std::sync::Arc;

fn main() {
    // 1. Setup the database
    let feature_extractor = Arc::new(CharacterNgrams::new(2, "$"));
    let mut db = HashDb::new(feature_extractor);

    // 2. Index some strings
    db.insert("hello".to_string());
    db.insert("help".to_string());
    db.insert("halo".to_string());
    db.insert("world".to_string());

    // 3. Search for strings
    let measure = Cosine;
    let searcher = Searcher::new(&db, measure);
    let query = "hell";
    let alpha = 0.5;

    if let Ok(results) = searcher.ranked_search(query, alpha) {
        println!("Found {} results for query '{}'", results.len(), query);
        for (item, score) in results {
            println!("- Match: '{}', Score: {:.4}", item, score);
        }
    }
}
Contributions are welcome! Please open an issue or submit a pull request on GitHub.
License
This project is licensed under the MIT License.
The benches/run_benches.py harness compares several language bindings (Rust, Python, Julia, Ruby, C++).
Running it requires git, autoconf, automake, libtool, make, python, uv, and a C++ compiler (g++) to build the C++ CLI.
The C++ sources are cloned into benches/.simstring_cpp/ and a local copy of the simstring binary is installed
under that directory. If you need to rebuild from scratch, remove benches/.simstring_cpp/ before re-running the benchmark suite.
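As a convenience, the prerequisite check and the clean-rebuild step described above could be scripted roughly as follows. This is a sketch, not part of the repository: the tool names are taken from the list above and assumed to match the executable names on your system (some platforms ship `python3` or `glibtool` instead), and it must be run from the repository root.

```shell
#!/usr/bin/env sh
# Report any prerequisite that is not on PATH (names assumed from the list above).
missing=0
for tool in git autoconf automake libtool make python uv g++; do
    if ! command -v "$tool" >/dev/null 2>&1; then
        echo "missing: $tool"
        missing=1
    fi
done

# Remove the cached C++ checkout and binary so the next benchmark run rebuilds them.
rm -rf benches/.simstring_cpp/
```

After this, re-running the benchmark suite should clone and build the C++ sources again.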
Inspired by the SimString project.