The automatic spelling correction component is based on *An Improved Error Model for Noisy Channel Spelling Correction* by Eric Brill and Robert C. Moore. It uses a statistics-based error model, a static dictionary and an ARPA language model to correct spelling errors. We provide everything you need to build a spelling correction module for the Russian and English languages, as well as some hints on how to collect appropriate datasets for other languages.
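Under the noisy channel approach, the component scores every dictionary candidate by combining the error model probability (how likely the observed string is a corruption of the candidate) with the language model probability of the candidate, and picks the highest-scoring one. A minimal sketch of this ranking, where `error_model_logp` and `lm_logp` are hypothetical stand-ins for the component's internal scoring functions:

```python
import math

def correct_word(observed, candidates, error_model_logp, lm_logp):
    """Pick the best correction for one observed token.

    `error_model_logp(observed, candidate)` and `lm_logp(candidate)` are
    hypothetical scoring functions; the real component derives them from
    the trained error model and the ARPA language model respectively.
    """
    best, best_score = observed, -math.inf
    for candidate in candidates:
        # Noisy channel score: log P(observed | candidate) + log P(candidate)
        score = error_model_logp(observed, candidate) + lm_logp(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best
```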
The component has the following parameters:

- `in` — list with one element: name of this component's input in chainer's shared memory
- `out` — list with one element: name for this component's output in chainer's shared memory
- `name` — always equals `"spelling_error_model"`
- `save_path` — path where the model will be saved after a training session
- `load_path` — path to the pretrained model
- `window` — window size for the error model, from `0` to `4`, defaults to `1`
- `lm_file` — path to the ARPA language model file; if omitted, all of the dictionary words will be treated as equally probable
- `dictionary` — description of a static dictionary model, an instance of (or inherited from) `deeppavlov.vocabs.static_dictionary.StaticDictionary`
    - `name` — `"static_dictionary"` for a custom dictionary, or one of two provided: `"russian_words_vocab"` to automatically download and use a list of Russian words from https://github.com/danakt/russian-words/, or `"wikitionary_100K_vocab"` to automatically download a list of the most common words (based on Project Gutenberg texts) from Wiktionary
    - `dictionary_name` — name of a directory where a dictionary will be built and loaded from, defaults to `"dictionary"` for `static_dictionary`
    - `raw_dictionary_path` — path to a file with a line-separated list of dictionary words, required for `static_dictionary` (see the example after this list)
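For example, a custom dictionary built from your own word list could be described like this (the directory name and file path are hypothetical):

```json
"dictionary": {
  "name": "static_dictionary",
  "dictionary_name": "my_dictionary",
  "raw_dictionary_path": "/data/my_words.txt"
}
```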
This module expects sentence strings with space-separated tokens in lowercase as its input, so it is advised to add appropriate preprocessing in chainer.
A working config could look like this:
```json
{
  "chainer": {
    "in": ["x"],
    "pipe": [
      {
        "name": "str_lower",
        "in": ["x"],
        "out": ["x_lower"]
      },
      {
        "name": "nltk_tokenizer",
        "in": ["x_lower"],
        "out": ["x_tokenized"]
      },
      {
        "in": ["x_tokenized"],
        "out": ["y_predicted"],
        "name": "spelling_error_model",
        "window": 1,
        "save_path": "error_model/error_model.tsv",
        "load_path": "error_model/error_model.tsv",
        "dictionary": {
          "name": "wikitionary_100K_vocab"
        },
        "lm_file": "/data/data/enwiki_no_punkt.arpa.binary"
      }
    ],
    "out": ["y_predicted"]
  }
}
```
This model expects a sentence string with space-separated tokens in lowercase as its input and returns the same string with corrected words. Here's an example script that reads input from stdin line by line and prints the resulting text to stdout:
```python
import json
import sys

from deeppavlov.core.commands.infer import build_model_from_config

CONFIG_PATH = 'configs/error_model/brillmoore_kartaslov_ru.json'

with open(CONFIG_PATH) as config_file:
    config = json.load(config_file)

model = build_model_from_config(config)

for line in sys.stdin:
    print(model([line])[0], flush=True)
```
If we save it as `example.py`, it can be used like so:

```
cat input.txt | python3 example.py > out.txt
```
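The `model` object built above can also be called directly on a list of strings. A quick sanity check with a made-up misspelled sentence (the actual correction depends on the dictionary and language model used):

```python
# assumes `model` was built with build_model_from_config as in the script above
sentence = 'teh qick brown fox'  # hypothetical misspelled input
print(model([sentence])[0])      # expected to print something close to 'the quick brown fox'
```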
For the training phase, the config file also needs to include these parameters:
- `dataset` — it should always be set like `"dataset": {"name": "typos_dataset"}`
    - `name` — always equals `typos_dataset`
    - `test_ratio` — ratio of test data to train data, from `0.` to `1.`, defaults to `0.`
- `dataset_reader`
    - `name` — `typos_custom_reader` for a custom dataset, or one of two provided: `typos_kartaslov_reader` to automatically download and process a misspellings dataset for the Russian language from https://github.com/dkulagin/kartaslov/tree/master/dataset/orfo_and_typos, or `typos_wikipedia_reader` to automatically download and process a list of common misspellings from English Wikipedia
    - `data_path` — required for `typos_custom_reader` as a path to a dataset file, where each line contains a misspelling and a correct spelling of a word separated by a tab symbol (see the example after this list)
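For instance, a dataset file for `typos_custom_reader` would contain tab-separated pairs like these (the examples are made up):

```
teh	the
recieve	receive
adress	address
```

and the reader would be configured to point at it (the path is hypothetical):

```json
"dataset_reader": {
  "name": "typos_custom_reader",
  "data_path": "/data/my_typos.tsv"
}
```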
The component's configuration also has to include a `fit_on` parameter — a list of two elements: the names of the component's input and true output in chainer's shared memory.
A working training config could look something like this:
```json
{
  "dataset_reader": {
    "name": "typos_wikipedia_reader"
  },
  "dataset": {
    "name": "typos_dataset",
    "test_ratio": 0.05
  },
  "chainer": {
    "in": ["x"],
    "in_y": ["y"],
    "pipe": [
      {
        "name": "str_lower",
        "id": "lower",
        "in": ["x"],
        "out": ["x_lower"]
      },
      {
        "name": "nltk_tokenizer",
        "id": "tokenizer",
        "in": ["x_lower"],
        "out": ["x_tokenized"]
      },
      {
        "ref": "lower",
        "in": ["y"],
        "out": ["y_lower"]
      },
      {
        "ref": "tokenizer",
        "in": ["y_lower"],
        "out": ["y_tokenized"]
      },
      {
        "fit_on": ["x_tokenized", "y_tokenized"],
        "in": ["x_tokenized"],
        "out": ["y_predicted"],
        "name": "spelling_error_model",
        "window": 1,
        "dictionary": {
          "name": "wikitionary_100K_vocab"
        },
        "save_path": "error_model/error_model.tsv",
        "load_path": "error_model/error_model.tsv"
      }
    ],
    "out": ["y_predicted"]
  },
  "train": {
    "validate_best": false,
    "test_best": true
  }
}
```
And a script to use this config:
```python
from deeppavlov.core.commands.train import train_model_from_config

MODEL_CONFIG_PATH = 'configs/error_model/brillmoore_wikitypos_en.json'
train_model_from_config(MODEL_CONFIG_PATH)
```
This model uses KenLM to process language models, so if you want to build your own, we suggest you consult its website. We also provide our own language models for English (5.5 GB) and Russian (3.1 GB).
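As a rough sketch, building a binary KenLM model from a plain-text corpus usually looks like this (the corpus and output file names are placeholders; `-o 5` sets the n-gram order):

```
lmplz -o 5 < corpus.txt > my_lm.arpa
build_binary my_lm.arpa my_lm.binary
```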
We compared this module with Yandex.Speller and GNU Aspell on the test set for the SpellRuEval competition on Automatic Spelling Correction for Russian:
| Correction method | Precision | Recall | F-measure |
|---|---|---|---|
| Yandex.Speller | 83.09 | 59.86 | 69.59 |
| Our model with the provided language model | 51.92 | 53.94 | 52.91 |
| Our model with no language model | 41.42 | 37.21 | 39.20 |
| GNU Aspell, always first candidate | 27.85 | 34.07 | 30.65 |
Ways to improve this component:

- locate bottlenecks in the code and rewrite them in Cython to improve performance
- use multiprocessing or multithreading to process batch elements in parallel
- find a way to add skipped spaces and remove superfluous ones
- find or learn a proper balance between error model and language model scores when ranking candidates
- implement *Discriminative Reranking for Spelling Correction* by Yang Zhang, Pilian He, Wei Xiang and Mu Li
- use a better dataset for collecting misspelling statistics
- add handcrafted features to use phonetic information