
Commit fb00b8c

Update readme
1 parent dd3dccc commit fb00b8c

1 file changed: +102 -12 lines changed

README.md

@@ -52,13 +52,13 @@ By default functions in this module consider single character as the unit for ed
**[Levenshtein Distance & Similarity](https://en.wikipedia.org/wiki/Levenshtein_distance)**: edit with insertion, deletion, and substitution

```python
from pytextdist.edit_distance import levenshtein_distance, levenshtein_similarity

str_a = 'kitten'
str_b = 'sitting'
dist = levenshtein_distance(str_a,str_b)
simi = levenshtein_similarity(str_a,str_b)
print(f"Levenshtein Distance:{dist:.0f}\nLevenshtein Similarity:{simi:.2f}")

>> Levenshtein Distance:3
>> Levenshtein Similarity:0.57
```
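
The 0.57 shown above is consistent with normalizing the distance by the longer string's length (1 - 3/7 is about 0.57), though that normalization is an assumption rather than a statement about pytextdist's internals. For intuition, a minimal sketch of the Levenshtein dynamic program (illustrative only, not the library's code):

```python
# Minimal sketch of the Levenshtein dynamic program; illustrative only,
# not pytextdist's implementation. The similarity normalization below
# (1 - distance / longer length) is an assumption that matches the 0.57 above.
def lev_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

dist = lev_distance('kitten', 'sitting')
simi = 1 - dist / max(len('kitten'), len('sitting'))
print(dist, round(simi, 2))  # 3 0.57
```
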
@@ -67,45 +67,135 @@ print(f"Levenshtein Distance:{dist}\nLevenshtein Similarity:{simi}")
**[Longest Common Subsequence Distance & Similarity](https://en.wikipedia.org/wiki/Longest_common_subsequence_problem)**: edit with insertion and deletion

```python
from pytextdist.edit_distance import lcs_distance, lcs_similarity

str_a = 'kitten'
str_b = 'sitting'
dist = lcs_distance(str_a,str_b)
simi = lcs_similarity(str_a,str_b)
print(f"LCS Distance:{dist:.0f}\nLCS Similarity:{simi:.2f}")

>> LCS Distance:5
>> LCS Similarity:0.62
```

<a id='dam_dis'></a>
**[Damerau-Levenshtein Distance & Similarity](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance)**: edit with insertion, deletion, substitution, and transposition of two adjacent units

```python
from pytextdist.edit_distance import damerau_levenshtein_distance, damerau_levenshtein_similarity

str_a = 'kitten'
str_b = 'sitting'
dist = damerau_levenshtein_distance(str_a,str_b)
simi = damerau_levenshtein_similarity(str_a,str_b)
print(f"Damerau-Levenshtein Distance:{dist:.0f}\nDamerau-Levenshtein Similarity:{simi:.2f}")

>> Damerau-Levenshtein Distance:3
>> Damerau-Levenshtein Similarity:0.57
```

<a id='ham_dis'></a>
**[Hamming Distance & Similarity](https://en.wikipedia.org/wiki/Hamming_distance)**: edit with substitution; note that the Hamming metric only works for phrases of the same length

```python
from pytextdist.edit_distance import hamming_distance, hamming_similarity

str_a = 'kittens'
str_b = 'sitting'
dist = hamming_distance(str_a,str_b)
simi = hamming_similarity(str_a,str_b)
print(f"Hamming Distance:{dist:.0f}\nHamming Similarity:{simi:.2f}")

>> Hamming Distance:3
>> Hamming Similarity:0.57
```

<a id='jaro_dis'></a>
**[Jaro & Jaro-Winkler Similarity](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance)**: edit with transposition

```python
from pytextdist.edit_distance import jaro_similarity, jaro_winkler_similarity

str_a = 'sitten'
str_b = 'sitting'
simi_j = jaro_similarity(str_a,str_b)
simi_jw = jaro_winkler_similarity(str_a,str_b)
print(f"Jaro Similarity:{simi_j:.2f}\nJaro-Winkler Similarity:{simi_jw:.2f}")

>> Jaro Similarity:0.85
>> Jaro-Winkler Similarity:0.91
```
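
The Jaro-Winkler score boosts the Jaro score by the length of the shared prefix. As a rough check of the numbers above (a sketch only; the prefix cap of 4 and scaling factor of 0.1 are the conventional defaults and are assumed here, not read from pytextdist's source):

```python
# Back-of-the-envelope check of the Jaro and Jaro-Winkler values above.
# 'sitten' vs 'sitting' has 5 matching characters and no transpositions;
# the prefix cap (4) and scaling factor (0.1) are assumed defaults.
m, t = 5, 0
len_a, len_b = 6, 7
simi_j = (m/len_a + m/len_b + (m - t)/m) / 3        # ~0.85
prefix_len, p = 4, 0.1                              # shared prefix 'sitt'
simi_jw = simi_j + prefix_len * p * (1 - simi_j)    # ~0.91
print(f"{simi_j:.2f} {simi_jw:.2f}")
```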

<a id='vec'></a>
### Vector Similarity

By default, functions in this module use unigrams (single words) to build vectors and calculate similarity. You can change `n` to other numbers for n-gram (n contiguous words) vector similarity.

<a id='cos_sim'></a>
**[Cosine Similarity](https://en.wikipedia.org/wiki/Cosine_similarity)**

```python
from pytextdist.vector_similarity import cosine_similarity

phrase_a = 'For Paperwork Reduction Act Notice, see your tax return instructions. For Paperwork Reduction Act Notice, see your tax return instructions.'
phrase_b = 'For Disclosure, Privacy Act, and Paperwork Reduction Act Notice, see separate instructions. Form 1040'
simi_1 = cosine_similarity(phrase_a, phrase_b, n=1)
simi_2 = cosine_similarity(phrase_a, phrase_b, n=2)
print(f"Unigram Cosine Similarity:{simi_1:.2f}\nBigram Cosine Similarity:{simi_2:.2f}")

>> Unigram Cosine Similarity:0.65
>> Bigram Cosine Similarity:0.38
```
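
To make the `n` parameter concrete, here is a from-scratch sketch of n-gram cosine similarity over word counts. It is not pytextdist's implementation: the lowercased whitespace tokenization and the `ngram_counts`/`cosine` helpers are illustrative assumptions, so its values will not exactly match the preprocessed results above.

```python
# Illustrative n-gram cosine similarity; not pytextdist's implementation.
from collections import Counter
from math import sqrt

def ngram_counts(text, n):
    # Lowercased whitespace tokens -> multiset of n contiguous tokens.
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def cosine(a, b, n=1):
    va, vb = ngram_counts(a, n), ngram_counts(b, n)
    dot = sum(va[g] * vb[g] for g in va.keys() & vb.keys())
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

print(round(cosine('see your tax return instructions', 'see separate instructions', n=1), 2))  # ~0.52
print(round(cosine('see your tax return instructions', 'see separate instructions', n=2), 2))  # 0.0 (no shared bigrams)
```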

<a id='jac_sim'></a>
**[Jaccard Similarity](https://en.wikipedia.org/wiki/Jaccard_index)**

```python
from pytextdist.vector_similarity import jaccard_similarity

phrase_a = 'For Paperwork Reduction Act Notice, see your tax return instructions. For Paperwork Reduction Act Notice, see your tax return instructions.'
phrase_b = 'For Disclosure, Privacy Act, and Paperwork Reduction Act Notice, see separate instructions. Form 1040'
simi_1 = jaccard_similarity(phrase_a, phrase_b, n=1)
simi_2 = jaccard_similarity(phrase_a, phrase_b, n=2)
print(f"Unigram Jaccard Similarity:{simi_1:.2f}\nBigram Jaccard Similarity:{simi_2:.2f}")

>> Unigram Jaccard Similarity:0.47
>> Bigram Jaccard Similarity:0.22
```

<a id='sor_sim'></a>
**[Sorensen Dice Similarity](https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient)**

```python
from pytextdist.vector_similarity import sorensen_dice_similarity

phrase_a = 'For Paperwork Reduction Act Notice, see your tax return instructions. For Paperwork Reduction Act Notice, see your tax return instructions.'
phrase_b = 'For Disclosure, Privacy Act, and Paperwork Reduction Act Notice, see separate instructions. Form 1040'
simi_1 = sorensen_dice_similarity(phrase_a, phrase_b, n=1)
simi_2 = sorensen_dice_similarity(phrase_a, phrase_b, n=2)
print(f"Unigram Sorensen Dice Similarity:{simi_1:.2f}\nBigram Sorensen Dice Similarity:{simi_2:.2f}")

>> Unigram Sorensen Dice Similarity:0.64
>> Bigram Sorensen Dice Similarity:0.36
```

<a id='qgr_sim'></a>
**[Q-Grams Similarity](https://www.sciencedirect.com/science/article/pii/0304397592901434)**

```python
from pytextdist.vector_similarity import qgram_similarity

phrase_a = 'For Paperwork Reduction Act Notice, see your tax return instructions. For Paperwork Reduction Act Notice, see your tax return instructions.'
phrase_b = 'For Disclosure, Privacy Act, and Paperwork Reduction Act Notice, see separate instructions. Form 1040'
simi_1 = qgram_similarity(phrase_a, phrase_b, n=1)
simi_2 = qgram_similarity(phrase_a, phrase_b, n=2)
print(f"Unigram Q-Gram Similarity:{simi_1:.2f}\nBigram Q-Gram Similarity:{simi_2:.2f}")

>> Unigram Q-Gram Similarity:0.32
>> Bigram Q-Gram Similarity:0.15
```

<a id='preprocessing'></a>
## Customize Preprocessing