
Automatic spelling correction component

The automatic spelling correction component is based on "An Improved Error Model for Noisy Channel Spelling Correction" by Eric Brill and Robert C. Moore and uses a statistics-based error model, a static dictionary and an ARPA language model to correct spelling errors.
We provide everything you need to build a spelling correction module for the Russian and English languages, along with some hints on how to collect appropriate datasets for other languages.
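
At its core the component follows the noisy channel approach: for each input word it ranks dictionary candidates by combining the probability that the candidate was mistyped into the observed word (the error model) with the probability of the candidate itself (the language model). A minimal sketch of this ranking idea, where error_logp and lm_logp are hypothetical scoring callables standing in for the trained error model and the ARPA language model:

def rank_candidates(observed, candidates, error_logp, lm_logp):
    # Noisy-channel score: log P(observed | candidate) + log P(candidate).
    # error_logp and lm_logp are hypothetical stand-ins for the trained
    # error model and the ARPA language model.
    return max(candidates,
               key=lambda c: error_logp(observed, c) + lm_logp(c))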

Usage

Component config parameters:

  • in — list with one element: name of this component's input in the chainer's shared memory
  • out — list with one element: name for this component's output in the chainer's shared memory
  • name — always equals "spelling_error_model"
  • save_path — path where the model will be saved after a training session
  • load_path — path to the pretrained model
  • window — window size for the error model, from 0 to 4; defaults to 1
  • lm_file — path to the ARPA language model file. If omitted, all dictionary words are treated as equally probable
  • dictionary — description of a static dictionary model, an instance of (or inherited from) deeppavlov.vocabs.static_dictionary.StaticDictionary
    • name — "static_dictionary" for a custom dictionary (see the sketch after this list) or the name of one of the two provided dictionaries (the configs below use "wikitionary_100K_vocab")
    • dictionary_name — name of a directory where a dictionary will be built and loaded from; defaults to "dictionary" for static_dictionary
    • raw_dictionary_path — path to a file with a line-separated list of dictionary words; required for static_dictionary
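
For illustration, a custom dictionary description could look like this (a sketch; the dictionary name and the word-list path are hypothetical):

"dictionary": {
  "name": "static_dictionary",
  "dictionary_name": "my_dictionary",
  "raw_dictionary_path": "/path/to/words.txt"
}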

This module expects sentence strings with space-separated tokens in lowercase as its input, so it is advised to add appropriate preprocessing in the chainer.

A working config could look like this:

{
  "chainer":{
    "in": ["x"],
    "pipe": [
      {
        "name": "str_lower",
        "in": ["x"],
        "out": ["x_lower"]
      },
      {
        "name": "nltk_tokenizer",
        "in": ["x_lower"],
        "out": ["x_tokeninzed"]
      },
      {
        "in": ["x_tokeninzed"],
        "out": ["y_predicted"],
        "name": "spelling_error_model",
        "window": 1,
        "save_path": "error_model/error_model.tsv",
        "load_path": "error_model/error_model.tsv",
        "dictionary": {
          "name": "wikitionary_100K_vocab"
        },
        "lm_file": "/data/data/enwiki_no_punkt.arpa.binary"
      }
    ],
    "out": ["y_predicted"]
  }
}

Usage example

This model expects a sentence string with space-separated tokens in lowercase as its input and returns the same string with corrected words. Here is an example that reads input data from stdin line by line and writes the resulting text to stdout:

import json
import sys

from deeppavlov.core.commands.infer import build_model_from_config

CONFIG_PATH = 'configs/error_model/brillmoore_kartaslov_ru.json'

# Read the component configuration and build an inference pipeline from it
with open(CONFIG_PATH) as config_file:
    config = json.load(config_file)

model = build_model_from_config(config)

# The model maps a batch (list) of sentences to a batch of corrected sentences
for line in sys.stdin:
    print(model([line])[0], flush=True)

If we save it as example.py, it can be used like so:

cat input.txt | python3 example.py > out.txt
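
Since the model maps a list of input sentences to a list of corrected sentences, it can also be called on whole batches directly (the misspelled inputs below are made up for illustration):

# One corrected string is returned per input string
corrected_batch = model(['helo wrold', 'this sentense has a typo'])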

Training

Error model

For the training phase, the config file also needs to include these parameters:

  • dataset — should always be set like "dataset": {"name": "typos_dataset"}
    • name — always equals "typos_dataset"
    • test_ratio — share of the data that goes to the test set, from 0. to 1.; defaults to 0.
  • dataset_reader — a component that reads the raw typos data; the training config below uses "typos_wikipedia_reader"

The component's configuration also has to include a fit_on parameter — a list of two elements: the names of the component's input and true output in the chainer's shared memory.

A working training config could look something like:

{
  "dataset_reader": {
    "name": "typos_wikipedia_reader"
  },
  "dataset": {
    "name": "typos_dataset",
    "test_ratio": 0.05
  },
  "chainer":{
    "in": ["x"],
    "in_y": ["y"],
    "pipe": [
      {
        "name": "str_lower",
        "id": "lower",
        "in": ["x"],
        "out": ["x_lower"]
      },
      {
        "name": "nltk_tokenizer",
        "id": "tokenizer",
        "in": ["x_lower"],
        "out": ["x_tokenized"]
      },
      {
        "ref": "lower",
        "in": ["y"],
        "out": ["y_lower"]
      },
      {
        "ref": "tokenizer",
        "in": ["y"],
        "out": ["y_tokenized"]
      },
      {
        "fit_on": ["x_tokenized", "y_tokenized"],
        "in": ["x_tokenized"],
        "out": ["y_predicted"],
        "name": "spelling_error_model",
        "window": 1,
        "dictionary": {
          "name": "wikitionary_100K_vocab"
        },
        "save_path": "error_model/error_model.tsv",
        "load_path": "error_model/error_model.tsv"
      }
    ],
    "out": ["y_predicted"]
  },
  "train": {
    "validate_best": false,
    "test_best": true
  }
}

And a script to use this config:

from deeppavlov.core.commands.train import train_model_from_config

MODEL_CONFIG_PATH = 'configs/error_model/brillmoore_wikitypos_en.json'

# Fits the error model on the dataset and saves it to the configured save_path
train_model_from_config(MODEL_CONFIG_PATH)

Language model

This component uses KenLM to process language models, so if you want to build your own, we suggest consulting its website. We also provide our own language models for English (5.5 GB) and Russian (3.1 GB).
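
Once a binary ARPA model is built, it can be queried through the kenlm Python module. A minimal sketch, assuming the model file from the inference config above (kenlm's score() returns a log10 probability):

import kenlm

# Load the binary ARPA language model
lm = kenlm.Model('/data/data/enwiki_no_punkt.arpa.binary')

# Score a lowercased, space-separated sentence, including
# begin- and end-of-sentence markers
print(lm.score('this is a test', bos=True, eos=True))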

Comparison

We compared this module with Yandex.Speller and GNU Aspell on the test set for the SpellRuEval competition on Automatic Spelling Correction for Russian:

Correction method                           Precision  Recall  F-measure
Yandex.Speller                                  83.09   59.86      69.59
Our model with the provided language model      51.92   53.94      52.91
Our model with no language model                41.42   37.21      39.20
GNU Aspell, always first candidate              27.85   34.07      30.65

Ways to improve

  • locate bottlenecks in the code and rewrite them in Cython to improve performance
  • use multiprocessing or multithreading for batch elements
  • find a way to add skipped spaces and remove superfluous ones
  • find or learn a proper balance between the error model and language model scores when ranking candidates
  • implement "Discriminative Reranking for Spelling Correction" by Yang Zhang, Pilian He, Wei Xiang and Mu Li
  • use a better dataset for gathering misspelling statistics
  • add handcrafted features to use phonetic information