Tree spell checker #119

obromios · 2019-03-28T20:37:36Z

This PR develops a spell checker that exploits a dictionary with a tree structure. Example applications include rake tasks, e.g. rake db:migrate, and rspec tests, e.g. rspec spec/models/user_model_spec.rb.
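To make "a dictionary that has a tree structure" concrete, here is a minimal sketch that turns separator-delimited entries into a nested hash. The build_tree helper is hypothetical and only illustrates the idea; it is not necessarily how the PR represents the dictionary internally.

```ruby
# Hypothetical helper: build a nested hash from separator-delimited entries.
# Each key is one path element; leaves map to an empty hash.
def build_tree(entries, separator: "/")
  entries.each_with_object({}) do |entry, tree|
    # Walk down the tree, creating child nodes as needed.
    entry.split(separator).reduce(tree) { |node, part| node[part] ||= {} }
  end
end

p build_tree(%w[db:migrate db:seed test], separator: ":")
p build_tree(%w[spec/models/user_model_spec.rb spec/models/post_model_spec.rb])
```

With a structure like this, a checker can correct one path element at a time and only compare against the children of the element it has already matched.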

obromios · 2019-03-28T20:41:36Z

I have developed a prototype of the TreeSpellChecker to investigate the feasibility and accuracy of such an approach. The current version is rough but does allow a comparison with the standard DidYouMean::SpellChecker. This comparison is shown in the table below:

                                  Summary                                       
--------------------------------------------------------------------------------
 Method  |   First Time (%)    Mean Suggestions       Failures (%)              
--------------------------------------------------------------------------------
 Tree               91.6                1.1                 8.4                 
 Standard           98.8                2.2                 1.1                 
 Combined           99.3                1.1                 0.6

I have used three criteria for the comparison: the percentage of first suggestions that were correct, the mean number of suggestions that were offered, and the percentage of failures, i.e. cases in which no correct suggestion was returned.
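As a rough illustration of how the three criteria could be computed, here is a small sketch. The summarize helper and the shape of the trial data are assumptions made for this example, and it assumes a failure means the intended word is absent from the suggestions.

```ruby
# Hypothetical helper: each trial is [intended_word, suggestions], where
# suggestions is what a checker returned for the mistyped input.
def summarize(trials)
  n = trials.length.to_f
  first_time = trials.count { |word, suggestions| suggestions.first == word }
  # Assumption: a failure is a trial whose suggestions omit the intended word.
  failures = trials.count { |word, suggestions| !suggestions.include?(word) }
  total = trials.sum { |_, suggestions| suggestions.length }
  {
    first_time_pct: (100.0 * first_time / n).round(1),
    mean_suggestions: (total / n).round(1),
    failures_pct: (100.0 * failures / n).round(1)
  }
end

trials = [
  ["a", ["a"]],           # first suggestion correct
  ["b", ["c", "b"]],      # correct, but not first
  ["d", []],              # failure: nothing returned
  ["e", ["e", "f", "g"]]  # first suggestion correct
]
p summarize(trials)
```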

The code for the TreeSpellChecker can be found at lib/did_you_mean/tree_spell_checker.rb. I tested the Standard and Tree spell checkers using HumanTypo, which can be found at lib/did_you_mean/human_typo.rb. HumanTypo simulates a human typist by assuming that errors are Poisson distributed at a rate of one typo per 20 characters. A typo can be a deletion, an insertion or a transposition. The above table was produced by feeding 10,000 errors made by HumanTypo into the two spell checkers.
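For readers who want a feel for this kind of simulation, the following is a minimal sketch, not the code in human_typo.rb. The class name TypoSimulator, its parameters, and the way gaps are sampled are all assumptions for illustration. It relies on the fact that events in a Poisson process are separated by exponentially distributed gaps.

```ruby
# Hypothetical sketch of a typo simulator; the names and details here are
# assumptions, not the PR's HumanTypo implementation.
class TypoSimulator
  ACTIONS = %i[deletion insertion transposition].freeze
  LETTERS = ("a".."z").to_a.freeze

  # mean_gap: approximate number of characters between typos (20 in the text).
  def initialize(word, mean_gap: 20.0, rng: Random.new)
    @word = word
    @mean_gap = mean_gap
    @rng = rng
  end

  # Returns a copy of the word with simulated typos applied.
  def call
    chars = @word.chars
    i = next_gap - 1
    while i < chars.length
      i = apply(ACTIONS.sample(random: @rng), chars, i) + next_gap
    end
    chars.join
  end

  private

  # Sample an exponentially distributed gap by inverting the CDF.
  # The result is always at least 1.
  def next_gap
    (-@mean_gap * Math.log(1.0 - @rng.rand)).floor + 1
  end

  # Applies one typo at index i and returns the index to continue from.
  def apply(action, chars, i)
    case action
    when :deletion
      chars.delete_at(i)
      i
    when :insertion
      chars.insert(i, LETTERS.sample(random: @rng))
      i + 1
    else
      # Transposition: swap with the next character, or with the previous
      # one when i is the last index.
      j = i + 1 < chars.length ? i + 1 : i - 1
      chars[i], chars[j] = chars[j], chars[i] if j >= 0
      i + 1
    end
  end
end

puts TypoSimulator.new("spec/models/user_model_spec.rb", rng: Random.new(1)).call
```

Passing a seeded Random makes a run reproducible, which helps when comparing two checkers on the same set of simulated errors.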

It can be seen that the Tree spell checker normally offers just one suggestion, and that suggestion is correct about 92% of the time. The Standard spell checker is correct 99% of the time but on average offers twice as many suggestions. At present the failure rate of the Tree spell checker is high, at about 8%. I see evidence that much of this failure rate is due to bugs in my code. However, part of it is due to a design decision that I have made.

The decision is to assume that the input has valid tree syntax. Some errors will violate this assumption, e.g. an error that deletes a separator, providing an input such as spec/modelsuser_model_spec.rb. Although I think it is possible to provide for this type of error, at this stage I think it is simpler to use the Standard spell checker only when the Tree spell checker does not return a suggestion. The final line of the above table shows the performance of such a combination, which is over 99% accurate and returns only 1.1 suggestions on average.
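The fallback described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the PR's TreeSpellChecker: the helper names, the plain Levenshtein distance, and the distance thresholds (2 per element, 3 for the whole string) are all made up for the example.

```ruby
# Hypothetical sketch: correct a path one element at a time, and fall back
# to whole-string matching when the element-wise pass returns nothing.

# Plain Levenshtein edit distance between two strings.
def edit_distance(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Candidates within max_distance of input, closest first.
def closest(input, candidates, max_distance)
  candidates
    .map { |c| [edit_distance(input, c), c] }
    .select { |distance, _| distance <= max_distance }
    .sort
    .map(&:last)
end

# Assumes the input has valid tree syntax: same number of elements as the
# dictionary entries it could match.
def tree_correct(input, dictionary, separator: "/")
  parts = input.split(separator)
  paths = dictionary.map { |entry| entry.split(separator) }
                    .select { |path| path.length == parts.length }
  parts.each_with_index do |part, depth|
    options = paths.map { |path| path[depth] }.uniq
    best = closest(part, options, 2).first
    return [] if best.nil?
    paths = paths.select { |path| path[depth] == best }
  end
  paths.map { |path| path.join(separator) }
end

# Use the whole-string checker only when the tree pass finds nothing.
def combined_correct(input, dictionary, separator: "/")
  suggestions = tree_correct(input, dictionary, separator: separator)
  return suggestions unless suggestions.empty?
  closest(input, dictionary, 3)
end

dictionary = %w[
  spec/models/user_model_spec.rb
  spec/models/post_model_spec.rb
  spec/controllers/user_controller_spec.rb
]
p combined_correct("spec/modls/user_model_spec.rb", dictionary)
p combined_correct("spec/modelsuser_model_spec.rb", dictionary)
```

The second call has lost a separator, so the element-wise pass finds no entry with the same number of elements and returns nothing. The whole-string pass then picks up the input.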

So at this point, I would appreciate feedback on the following points:

Is HumanTypo a valid way to compare the spell checkers, or is there some other way that you use?
Do you agree with the three criteria that I am using? Are there other criteria I should be using?
Are the preliminary results encouraging or were you expecting something else?

obromios · 2019-03-30T20:45:51Z

Thank you. I will try some enhancements to the algorithm, improve test coverage and refactor the code. Then I will push it to here for a detailed review.

yuki24 · 2019-03-30T21:19:30Z

Thanks @obromios for all the work on compiling this PR! I haven't taken a deep dive into it yet but I think this looks very promising. I think the entire change is in good shape and there are only a few changes needed to merge this PR (maybe just removing .DS_Store and moving the TreeSpellHumanTypo to test).

I'm going to take a week off for my open source projects in the week of April 7th and I should be able to take a look at this around that timeframe. Thank you again for such a wonderful proposal and even implementing it!

obromios · 2019-03-31T23:13:28Z

I agree with the two changes you have suggested. I will also be adding more extensive tests and there are some bugs I need to track down. I will push a new version for you to review by April 7.

Add test for too tiny input to HumanTypo
Add rspec directory
Add test for spec directory
Eliminate correct word errors
Correct special cases
Choose only option if only one option
Add extra words to special_words
Measure execution speed
Surprised to see tree is more than 10 time faster than standard
Add comments to temp.md
Add augment: true option tree_spell_checker.new
Add tests for this and leafless word
Change base_name to find_leaves
Move HumanTypo
Refactor HumanTypo
Add substitution to HumanTypo
Add subtitution to HumanTypo
Tidy explore execution time
Move test files into tree_spell folder
move TreeSpellChecker into DidYouMean module
Remove .DS_Store
Test different lambdas for accuracy
Tidy up explore and did_you_mean.rb
Refactor check_element
Refactor parse
Refactor tree_spell_checker
Complete first cut of documentation
Update documentation
Remove pry
Tidy code

obromios · 2019-04-08T09:39:14Z

I have refactored the code, improved test coverage, done more extensive analysis of accuracy, and measured the execution time.

The results appear promising: for a typical rspec project, the tree spell checker augmented by the standard spell checker gives higher accuracy, far fewer suggestions, and a much faster execution time. It seems that the tree spell checker and the standard spell checker are synergistic.

More information on the algorithm and execution times can be found in doc/tree_spell_algorithm.md.

These performance tests are run with test/tree_spell/explore_test.rb. This is not a proper test file, in that it contains no assertions. It also takes over ten minutes to run, so I have disabled it by setting the constant TREE_SPELL_EXPLORE to false at the top of the file. To run the file, set TREE_SPELL_EXPLORE to true. I know this is unusual, so let me know if you want to handle this file in another way. Note that it is also possible to run quick assessments by using a smaller value of n_repeat in the various tests.

I was not sure how this project handles detailed API documentation, so I placed the API documentation for HumanTypo and TreeSpellChecker in the doc folder. Please let me know how you want me to handle this, i.e. move it into the actual source files or put it somewhere else.

obromios · 2019-05-06T00:46:34Z

I just realised that the documentation files were not included in my last push. The reason for this is that the /doc directory is in .gitignore. I am not sure why this is, but rather than change that, I have moved the three documentation files (file type *.md) to the test/tree_spell directory, pending a decision on the correct place to store them.

Here is a summary of the performance of the augmented tree_spell algorithm on a decent-sized rspec project, rspec-core:

                               Rspec Summary                                
--------------------------------------------------------------------------------
 Method  |   First Time (%)    Mean Suggestions       Failures (%)              
--------------------------------------------------------------------------------
 Tree                94.7                1.0                 5.3                 
 Standard            98.2                4.2                 1.1                 
 Augmented           99.7                1.2                 0.2

It can be seen that, relative to the Standard spell checker, the overall error rate is reduced by a factor of about 5, and the mean number of suggestions by a factor of about 3.5 (4.2 versus 1.2).

The execution speeds for the above were


Testing execution time of Standard
Average time (ms): 40.6

Testing execution time of Tree
Average time (ms): 2.7

Testing execution time of Augmented Tree
Average time (ms): 4.5

so the execution time is also reduced by a factor of about 9 (40.6 ms for Standard versus 4.5 ms for Augmented Tree).
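As a quick check of the improvement factors, the ratios can be computed directly from the figures reported above. This sketch assumes the "error rate" refers to the Failures (%) column; the hash keys are just labels chosen for the example.

```ruby
# Figures copied from the Rspec Summary table and the timing output above.
standard  = { failures_pct: 1.1, mean_suggestions: 4.2, time_ms: 40.6 }
augmented = { failures_pct: 0.2, mean_suggestions: 1.2, time_ms: 4.5 }

# Ratio of Standard to Augmented for each measure, rounded to one decimal.
ratios = standard.map { |key, value| [key, (value / augmented[key]).round(1)] }.to_h

ratios.each { |key, ratio| puts "#{key}: #{ratio}x" }
```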

Please let me know if you have any questions.

Chris Drane added 24 commits March 16, 2019 14:53

Move initial attempt across from sandbox

f757997

Add rest of protoptype structure

6302595

Comment out rspec tests

e120b55

Move across HumanTypo files

272b26f

Pass first minitests

fc90155

Pass initial tests

fdd5552

Align TreeSpellChecker with SpellChecker api

d7ce927

Fix issue with array initialisation

58863b2

Fixing no suggestions if word is correct

c0646bc

Fix that HumanTypo gives correct words

Working why easy cases are returning []

6909d07

Pass all tests and first_times

3721ce2

Refactor tests

5d75937

Working on human typo boundary cases

4d0ec2d

Add test_word_error

1c61bf2

First working version, ATP

5dc6573

Working on print out

b26ed34

Working on re-arranging files to have tree_spell suffix

c975034

Add combined to table

8a15d90

Add explore test for HumanTypo

4dd92c1

Add separator argument to TSP

6fa3ddc

Tidy up

492f2b3

Remove pry

be2a65c

Reconcile Gemfile

0b0fdae

Chris Drane added 2 commits March 31, 2019 09:26

Merge branch 'tree_spell_checker' of github.com:obromios/did_you_mean…

f2c0155

…_backup into tree_spell_checker

Refactor HumanTypo

4c4dba0

Chris Drane added 4 commits April 1, 2019 16:14

Add test for too tiny input to HumanTypo

4120065

Add rspec directory

a33c98b

Remove DS_Store

1773ed9

Chris Drane added 2 commits May 6, 2019 10:31

Merge from origin

b19a3bb

Move documentation files

362bc80

yuki24 merged commit 63b92df into ruby:master Jun 10, 2019
