JamSpell

JamSpell is a spell checking library with the following features:

accurate - it considers the surrounding words (context) for better correction
fast - nearly 2K words per second
multi-language - it's written in C++ and available for many languages through SWIG bindings

Benchmarks

	Errors	Top 7 Errors	Fix Rate	Top 7 Fix Rate	Broken	Speed (words/second)
JamSpell	3.25%	1.27%	79.53%	84.10%	0.64%	1833
Norvig	7.62%	5.00%	46.58%	66.51%	0.69%	395
Hunspell	13.10%	10.33%	47.52%	68.56%	7.14%	163
Dummy	13.14%	13.14%	0.00%	0.00%	0.00%	-

The model was trained on 300K Wikipedia sentences + 300K news sentences (English). 95% of the data was used for training and 5% for evaluation. An error model was used to generate text with errors from the original text. The JamSpell corrector was compared with Norvig's corrector, Hunspell, and a dummy corrector (no corrections).
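
The error model itself is not described here. As a rough illustration only, the sketch below shows one common way to produce errored text: each word is corrupted with a fixed probability by a single random character edit. The names `corrupt_word`, `corrupt_sentence`, `ALPHABET` and the `error_rate` parameter are hypothetical and are not taken from JamSpell.

```python
import random

# Assumption: a lowercase English alphabet; not read from any JamSpell file.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def corrupt_word(word, rng):
    """Apply one random single-character edit to `word`."""
    if not word:
        return word
    op = rng.choice(["substitute", "delete", "insert", "transpose"])
    i = rng.randrange(len(word))
    if op == "substitute":
        return word[:i] + rng.choice(ALPHABET) + word[i + 1:]
    if op == "delete" and len(word) > 1:
        return word[:i] + word[i + 1:]
    if op == "transpose" and len(word) > 1:
        i = min(i, len(word) - 2)  # keep i + 1 inside the word
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    # Insert a letter (also the fallback for one-letter words).
    return word[:i] + rng.choice(ALPHABET) + word[i:]


def corrupt_sentence(words, error_rate, seed=0):
    """Corrupt each word independently with probability `error_rate`."""
    rng = random.Random(seed)  # seeded so a dataset can be regenerated
    return [corrupt_word(w, rng) if rng.random() < error_rate else w
            for w in words]
```

Keeping the original and corrupted word lists aligned position by position makes it possible to score a corrector word by word afterwards.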

We used the following metrics:

Errors - percentage of words that still contain errors after the spell checker has processed the text
Top 7 Errors - percentage of words whose correct form is missing from the top 7 candidates
Fix Rate - percentage of errored words fixed by the spell checker
Top 7 Fix Rate - percentage of errored words for which one of the top 7 candidates is correct
Broken - percentage of non-errored words broken by the spell checker
Speed - number of words processed per second
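
To make the definitions concrete, here is a minimal sketch of how Errors, Fix Rate and Broken could be computed from aligned word triples. It is one reading of the definitions above, not the code in evaluate/evaluate.py; the function name `score` and the triple layout are assumptions.

```python
def score(triples):
    """Compute metrics from (original, errored, corrected) word triples."""
    total = errored = fixed = clean = broken = wrong_after = 0
    for original, noisy, corrected in triples:
        total += 1
        if corrected != original:
            wrong_after += 1          # still wrong after correction
        if noisy != original:
            errored += 1              # the word was corrupted
            if corrected == original:
                fixed += 1
        else:
            clean += 1                # the word was left intact by the error model
            if corrected != original:
                broken += 1           # the corrector damaged a good word

    def pct(part, whole):
        return 100.0 * part / whole if whole else 0.0

    return {
        "errors": pct(wrong_after, total),   # share of all words
        "fix_rate": pct(fixed, errored),     # share of errored words
        "broken": pct(broken, clean),        # share of non-errored words
    }


example = [
    ("best", "begt", "best"),           # fixed
    ("checker", "cherken", "chicken"),  # not fixed
    ("the", "the", "the"),              # untouched
    ("am", "am", "an"),                 # broken
]
print(score(example))
# {'errors': 50.0, 'fix_rate': 50.0, 'broken': 50.0}
```

A corrector that changes nothing scores a Fix Rate and Broken of 0 under these formulas, and its Errors value equals the share of corrupted words, which matches the shape of the Dummy row.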

To make sure the model is not overfitted to Wikipedia + news, we also evaluated it on the text of "The Adventures of Sherlock Holmes":

	Errors	Top 7 Errors	Fix Rate	Top 7 Fix Rate	Broken	Speed (words per second)
JamSpell	3.56%	1.27%	72.03%	79.73%	0.50%	1764
Norvig	7.60%	5.30%	35.43%	56.06%	0.45%	647
Hunspell	9.36%	6.44%	39.61%	65.77%	2.95%	284
Dummy	11.16%	11.16%	0.00%	0.00%	0.00%	-

More details on reproducing these results are available in the "Train" section.

Usage

Python

Install swig3 (it is usually available from your distribution's package manager)
Install jamspell:

pip install jamspell

Download or train a language model
Use it:

import jamspell

corrector = jamspell.TSpellCorrector()
corrector.LoadLangModel('model_en.bin')

corrector.FixFragment('I am the begt spell cherken!')
# u'I am the best spell checker!'

corrector.GetCandidates(['i', 'am', 'the', 'begt', 'spell', 'cherken'], 3)
# (u'best', u'beat', u'belt', u'bet', u'bent', ... )

corrector.GetCandidates(['i', 'am', 'the', 'begt', 'spell', 'cherken'], 5)
# (u'checker', u'chicken', u'checked', u'wherein', u'coherent', ...)
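
In the calls above, the second argument of GetCandidates is the index of the word to correct within the token list (3 for 'begt', 5 for 'cherken'). Assuming that call shape, a small helper can collect candidates for every position. The name `candidates_for_all` and its `limit` parameter are hypothetical and not part of jamspell.

```python
def candidates_for_all(corrector, words, limit=3):
    """Return [(word, candidates[:limit]), ...] for every position in `words`.

    `corrector` is any object with GetCandidates(words, position),
    such as the jamspell.TSpellCorrector shown above.
    """
    result = []
    for position, word in enumerate(words):
        candidates = list(corrector.GetCandidates(words, position))
        result.append((word, candidates[:limit]))
    return result


# Usage sketch (requires jamspell and a model file):
# corrector = jamspell.TSpellCorrector()
# corrector.LoadLangModel('model_en.bin')
# candidates_for_all(corrector, ['i', 'am', 'the', 'begt', 'spell', 'cherken'])
```

Returning a list of pairs rather than a dict keeps repeated words in the sentence separate, since each occurrence has its own context.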

C++

Add jamspell and contrib dirs to your project
Use it:

#include <jamspell/spell_corrector.hpp>

int main(int argc, const char** argv) {

    NJamSpell::TSpellCorrector corrector;
    corrector.LoadLangModel("model.bin");

    corrector.FixFragment(L"I am the begt spell cherken!");
    // "I am the best spell checker!"

    corrector.GetCandidates({L"i", L"am", L"the", L"begt", L"spell", L"cherken"}, 3);
    // "best", "beat", "belt", "bet", "bent", ...

    corrector.GetCandidates({L"i", L"am", L"the", L"begt", L"spell", L"cherken"}, 5);
    // "checker", "chicken", "checked", "wherein", "coherent", ...
    return 0;
}

Other languages

You can generate extensions for other languages by following the SWIG tutorial. The SWIG interface file is jamspell.i. Pull requests with build scripts are welcome.

Train

To train a custom model:

Install cmake
Clone and build jamspell:

git clone https://github.com/bakwc/JamSpell.git
cd JamSpell
mkdir build
cd build
cmake ..
make

Prepare a UTF-8 text file with sentences to train on (e.g. sherlockholmes.txt) and another file with the language alphabet (e.g. alphabet_en.txt)
Train model:

./main/jamspell train ../test_data/alphabet_en.txt ../test_data/sherlockholmes.txt model_sherlock.bin

To evaluate the spell checker, you can use the evaluate/evaluate.py script:

python evaluate/evaluate.py -a alphabet_file.txt -jsp your_model.bin -mx 50000 your_test_data.txt

You can use evaluate/generate_dataset.py to generate your train/test data. It supports txt files, the Leipzig Corpora Collection format, and fb2 books.

Download models

Here are a few simple models. They were trained on 300K news + 300K Wikipedia sentences. We strongly recommend training your own model on at least a few million sentences to achieve better quality. See the Train section above.

en.tar.gz (35Mb)
fr.tar.gz (31Mb)
ru.tar.gz (38Mb)
