Commit 19b6068

WIP: Initial proposed search method for hashdb (#7)
* Initial proposed search method for hashdb
1 parent 7e4c5a9 commit 19b6068

18 files changed: +610 -154 lines changed

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -1,2 +1,3 @@
 /target
 Session.vim
+.vscode/

Cargo.lock

Lines changed: 1 addition & 1 deletion
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

Cargo.toml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 [package]
 name = "simstring_rust"
-version = "0.1.0"
+version = "0.1.1"
 description = "A native Rust implementation of the SimString algorithm"
 license = "MIT"
 repository = "https://github.com/PyDataBlog/simstring_rs"

README.md

Lines changed: 9 additions & 5 deletions
@@ -1,4 +1,9 @@
-# simstring_rs
+# simstring_rust
+
+[![Build Status](https://github.com/PyDataBlog/simstring_rs/actions/workflows/CI.yml/badge.svg)](https://github.com/PyDataBlog/simstring_rs/actions)
+[![Crates.io](https://img.shields.io/crates/v/simstring_rust.svg)](https://crates.io/crates/simstring_rust)
+[![Documentation](https://docs.rs/simstring_rust/badge.svg)](https://docs.rs/simstring_rust)
+[![Rust](https://img.shields.io/badge/rust-1.63.0%2B-blue.svg?maxAge=3600)](https://github.com/PyDataBlog/simstring_rs)
 
 A native Rust implementation of the CPMerge algorithm, designed for approximate string matching. This crate is particularly useful for natural language processing tasks that require the retrieval of strings/texts from very large corpora (big amounts of texts). Currently, this crate supports both character and word-based N-grams feature generation, with plans to allow custom user-defined feature generation methods.
 
@@ -9,7 +14,6 @@ A native Rust implementation of the CPMerge algorithm, designed for approximate
 - ✅ Support for Unicode
 - [ ] Support for building databases directly from text files
 - [ ] Mecab-based tokenizer support
-- [ ] Support for persistent databases like MongoDB
 
 ## Supported String Similarity Measures
 
@@ -21,18 +25,18 @@ A native Rust implementation of the CPMerge algorithm, designed for approximate
 
 ## Installation
 
-Add `simstring_rs` to your `Cargo.toml`:
+Add `simstring_rust` to your `Cargo.toml`:
 
 ```toml
 [dependencies]
-simstring_rs = "0.1.0"
+simstring_rust = "0.1.0" # change version accordingly
 ```
 
 For the latest features, you can add the master branch by specifying the Git repository:
 
 ```toml
 [dependencies]
-simstring_rs = { git = "https://github.com/PyDataBlog/simstring_rs.git", branch = "main" }
+simstring_rust = { git = "https://github.com/PyDataBlog/simstring_rs.git", branch = "main" }
 ```
 
 Note: Using the master branch may include experimental features and potential breakages. Use with caution!
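
The README mentions character and word-based N-gram feature generation without showing what those features look like. The following is a minimal standalone sketch, not the crate's code: `char_ngrams` and `word_ngrams` are hypothetical helpers, and padding each side with `n - 1` copies of the padder is an assumption borrowed from the usual SimString convention.

```rust
use std::collections::HashMap;

/// Hypothetical helper: character n-grams of `text` with occurrence counts.
/// Assumes the text is padded on both sides with `n - 1` copies of `padder`.
fn char_ngrams(text: &str, n: usize, padder: &str) -> HashMap<String, usize> {
    assert!(n >= 1, "n must be at least 1");
    let pad = padder.repeat(n - 1);
    // Work on `char`s rather than bytes so multi-byte Unicode text is split correctly.
    let padded: Vec<char> = format!("{}{}{}", pad, text, pad).chars().collect();
    let mut counts = HashMap::new();
    for window in padded.windows(n) {
        let gram: String = window.iter().collect();
        *counts.entry(gram).or_insert(0) += 1;
    }
    counts
}

/// Hypothetical helper: word n-grams of `text` (whitespace-separated), unpadded.
fn word_ngrams(text: &str, n: usize) -> HashMap<String, usize> {
    assert!(n >= 1, "n must be at least 1");
    let words: Vec<&str> = text.split_whitespace().collect();
    let mut counts = HashMap::new();
    for window in words.windows(n) {
        *counts.entry(window.join(" ")).or_insert(0) += 1;
    }
    counts
}
```

With `n = 2` and a single-space padder, `char_ngrams("hell", 2, " ")` yields the five bigrams `" h"`, `"he"`, `"el"`, `"ll"` and `"l "`, each with count 1. The crate's own extractors may treat padding and repeated n-grams differently.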

examples/basic_usage.rs

Lines changed: 14 additions & 24 deletions
@@ -1,39 +1,29 @@
-use simstring_rust::database::{HashDB, SimStringDB};
-use simstring_rust::extractors::{CharacterNGrams, FeatureExtractor};
+use simstring_rust::database::HashDB;
+use simstring_rust::extractors::CharacterNGrams;
 use simstring_rust::measures::Cosine;
 
 fn main() {
-    let _cs = Cosine::new();
-
     let feature_extractor = CharacterNGrams {
-        n: 3,
+        n: 2,
         padder: " ".to_string(),
     };
-
-    let mut db = HashDB::new(feature_extractor);
+    let measure = Cosine::new();
+    let mut db = HashDB::new(feature_extractor, measure);
 
     db.insert("hello".to_string());
     db.insert("help".to_string());
     db.insert("halo".to_string());
     db.insert("world".to_string());
 
-    let (total_collection, avg_size_ngrams, total_ngrams) = db.describe_collection();
-    println!(
-        "Database contains {} strings, average n-gram size {:.2}, total n-grams {}.",
-        total_collection, avg_size_ngrams, total_ngrams
-    );
-
-    //println!("Complete DB State: {:?}", db); # FIX: db needs a fmt.debug implementation
-
-    let query = "prepress";
-
-    let query_features = db.feature_extractor.extract(query);
-    let query_size = query_features.len();
-
-    println!("Query size: {}", query_size);
+    let threshold = 0.5;
+    let results = db.search("hell", threshold);
 
-    println!("Extracted features from query '{}':", query);
-    for (feature, count) in &query_features {
-        println!(" - Feature: '{}', Count: {}", feature, count);
+    if results.is_empty() {
+        println!("No results found with threshold {}", threshold);
+    } else {
+        println!("Results with threshold {}:", threshold);
+        for result in results {
+            println!("Match: '{}' (score: {})", result.value, result.score);
+        }
     }
 }
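
For readers who want to see what a thresholded cosine search over character bigrams computes, here is a brute-force sketch. It is not the `HashDB::search` implementation from this commit and does not use CPMerge: it scores every stored string against the query. The helpers `bigrams`, `cosine` and `search` are hypothetical, and treating features as a set (ignoring repeated bigrams) is a simplifying assumption.

```rust
use std::cmp::Ordering;
use std::collections::HashSet;

/// Hypothetical helper: character bigrams of `text`, padded with one space
/// on each side, collected as a set (repeated bigrams are ignored here).
fn bigrams(text: &str) -> HashSet<String> {
    let padded: Vec<char> = format!(" {} ", text).chars().collect();
    padded
        .windows(2)
        .map(|w| w.iter().collect::<String>())
        .collect()
}

/// Cosine similarity between two feature sets: |X ∩ Y| / sqrt(|X| * |Y|).
fn cosine(x: &HashSet<String>, y: &HashSet<String>) -> f64 {
    if x.is_empty() || y.is_empty() {
        return 0.0;
    }
    let overlap = x.intersection(y).count() as f64;
    overlap / ((x.len() * y.len()) as f64).sqrt()
}

/// Brute-force search: score every entry and keep those at or above `threshold`,
/// sorted by descending score. CPMerge exists to avoid this full scan.
fn search<'a>(db: &[&'a str], query: &str, threshold: f64) -> Vec<(&'a str, f64)> {
    let q = bigrams(query);
    let mut hits: Vec<(&'a str, f64)> = db
        .iter()
        .map(|s| (*s, cosine(&q, &bigrams(s))))
        .filter(|(_, score)| *score >= threshold)
        .collect();
    hits.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(Ordering::Equal));
    hits
}
```

Under these assumptions, `search(&["hello", "help", "halo", "world"], "hell", 0.5)` returns `hello` (4 shared bigrams out of 5 and 6, about 0.73) followed by `help` (3 shared out of 5 and 5, exactly 0.6). `halo` scores 0.2 and `world` scores 0, so both are filtered out. Scores from the crate itself may differ if it handles duplicate n-grams or padding another way.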
