Tree spell checker #119

Merged
merged 32 commits into from
Jun 10, 2019


obromios
Collaborator

This PR develops a spell checker that exploits a dictionary with a tree structure. Example applications include rake tasks, e.g. rake db:migrate, and rspec tests, e.g. rspec spec/models/user_model_spec.rb.
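To illustrate what a tree-structured dictionary means here, the following is a minimal sketch, not the code in this PR. The helper name `build_tree` and the nested-Hash representation are assumptions made for the example; the point is only that a separator splits each entry into elements, and each element needs to be compared against a small set of siblings rather than the whole dictionary.

```ruby
# Hypothetical helper (not part of did_you_mean): build a nested Hash
# from separator-delimited dictionary entries.
def build_tree(names, separator: '/')
  names.each_with_object({}) do |name, tree|
    node = tree
    # Walk down the tree, creating a child Hash for each new element.
    name.split(separator).each { |part| node = (node[part] ||= {}) }
  end
end

paths = [
  'spec/models/user_model_spec.rb',
  'spec/models/post_model_spec.rb',
  'spec/controllers/users_controller_spec.rb'
]
tree = build_tree(paths)

# Once 'spec' is matched, only its children are candidates for the next element.
tree['spec'].keys           # => ["models", "controllers"]
tree['spec']['models'].keys # => ["user_model_spec.rb", "post_model_spec.rb"]

# The same idea applies to rake tasks with ':' as the separator.
rake_tree = build_tree(%w[db:migrate db:rollback assets:precompile], separator: ':')
rake_tree.keys # => ["db", "assets"]
```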

Chris Drane added 24 commits March 16, 2019 14:53
Fix that HumanTypo gives correct words
Comment out rspec tests

Move across HumanTypo files

Pass first minitests

Pass initial tests

Align TreeSpellChecker with SpellChecker api

Fix issue with array initialisation

Working why easy cases are returning []

Pass all tests and first_times

Refactor tests

Working on human typo boundary cases

Add test_word_error

First working version, ATP

Working on print out

Working on re-arranging files to have tree_spell suffix

Add combined to table

Add explore test for HumanTypo

Add separator argument to TSP

Tidy up

Remove pry

Reconcile Gemfile
@obromios
Collaborator Author

I have developed a prototype of the TreeSpellChecker to investigate the feasibility and accuracy of such an approach. The current version is rough but does allow a comparison with the standard DidYouMean::SpellChecker. This comparison is shown in the table below:

                                  Summary                                       
--------------------------------------------------------------------------------
 Method  |   First Time (%)    Mean Suggestions       Failures (%)              
--------------------------------------------------------------------------------
 Tree               91.6                1.1                 8.4                 
 Standard           98.8                2.2                 1.1                 
 Combined           99.3                1.1                 0.6                 

I have used three criteria for the comparison: the percentage of first suggestions that were correct, the mean number of suggestions offered, and the percentage of failures, i.e. cases in which the correct suggestion was not returned.
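The three criteria could be computed along the following lines. This is a sketch under stated assumptions, not the code used to produce the table: `Trial` and `summarize` are hypothetical names, a failure is taken to mean that the intended word is absent from the suggestions, and empty results are counted when averaging the number of suggestions.

```ruby
# Hypothetical record: the intended word and what a checker suggested for its typo.
Trial = Struct.new(:expected, :suggestions)

# Summarise a list of trials using the three criteria described above.
def summarize(trials)
  n = trials.size.to_f
  first_time = trials.count { |t| t.suggestions.first == t.expected }
  # Assumption: a failure means the intended word was not suggested at all.
  failures = trials.count { |t| !t.suggestions.include?(t.expected) }
  total_suggestions = trials.sum { |t| t.suggestions.size }
  {
    first_time_pct: (100.0 * first_time / n).round(1),
    mean_suggestions: (total_suggestions / n).round(1),
    failures_pct: (100.0 * failures / n).round(1)
  }
end

trials = [
  Trial.new('db:migrate', ['db:migrate']),
  Trial.new('db:rollback', ['db:reset', 'db:rollback']),
  Trial.new('db:seed', []),
  Trial.new('db:setup', ['db:setup'])
]
summary = summarize(trials)
# => {:first_time_pct=>50.0, :mean_suggestions=>1.0, :failures_pct=>25.0}
```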

The code for the TreeSpellChecker can be found at lib/did_you_mean/tree_spell_checker.rb. I tested the Standard and Tree spell checkers using HumanTypo, which can be found at lib/did_you_mean/human_typo.rb. HumanTypo simulates a human typist by assuming that errors are Poisson distributed at a rate of one typo per 20 characters. A typo can be a deletion, an insertion or a transposition. The above table was produced by feeding 10,000 errors made by HumanTypo into the two spell checkers.
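For readers unfamiliar with the idea, here is a rough sketch of such a typo simulator. It is not the HumanTypo class from this PR: the class name `TypoSketch`, its parameters and its handling of edge cases are all assumptions. It relies on the fact that in a Poisson process the gaps between events are exponentially distributed, so the distance to the next typo is drawn from an exponential distribution with a mean of 20 characters.

```ruby
# Hypothetical sketch of a typo simulator; NOT the PR's HumanTypo class.
class TypoSketch
  ACTIONS = %i[deletion insertion transposition].freeze
  LETTERS = ('a'..'z').to_a.freeze

  # rate is the expected number of typos per character (assumed default: 1 in 20).
  def initialize(word, rate: 1.0 / 20, rng: Random.new)
    @chars = word.chars
    @rate = rate
    @rng = rng
  end

  # Returns a copy of the word with simulated typos applied.
  def call
    pos = next_gap
    while pos < @chars.size
      action = ACTIONS.sample(random: @rng)
      apply(action, pos)
      pos += 1 if action == :insertion # step past the inserted character
      pos += 1 + next_gap
    end
    @chars.join
  end

  private

  # Exponentially distributed gap (in characters) to the next typo.
  def next_gap
    (-Math.log(1.0 - @rng.rand) / @rate).floor
  end

  def apply(action, pos)
    case action
    when :deletion
      @chars.delete_at(pos)
    when :insertion
      @chars.insert(pos, LETTERS.sample(random: @rng))
    when :transposition
      return if pos + 1 >= @chars.size # nothing to swap with at the end
      @chars[pos], @chars[pos + 1] = @chars[pos + 1], @chars[pos]
    end
  end
end

# A seeded generator makes a run repeatable.
typo = TypoSketch.new('spec/models/user_model_spec.rb', rng: Random.new(42)).call
```

Passing the random number generator in explicitly keeps the sketch deterministic under a fixed seed, which is convenient when comparing two spell checkers on the same set of simulated errors.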

It can be seen that the Tree spell checker normally offers just one suggestion, and that suggestion is correct about 92% of the time. The Standard spell checker is correct 99% of the time but on average offers twice as many suggestions. At present the failure rate of the Tree spell checker is high, at about 8%. I see evidence that much of this failure rate is due to bugs in my code. However, part of it is due to a design decision that I have made.

The decision is to assume that the input has valid tree syntax. Some errors will violate this assumption, e.g. an error that deletes a separator, providing an input such as spec/modelsuser_model_spec.rb. Although I think it is possible to provide for this type of error, at this stage I think it is simpler to use the Standard spell checker only when the Tree spell checker does not return a suggestion. The final line of the above table shows the performance of such a combination, which is over 99% accurate and returns only 1.1 suggestions on average.
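The combination described above can be sketched as a small wrapper. This is an illustration under assumptions, not the implementation in this PR: `FallbackChecker` and `LambdaChecker` are hypothetical names, and the two toy checkers merely stand in for the Tree and Standard spell checkers. The only thing assumed about the real checkers is that each responds to `correct(input)` and returns an array of suggestions.

```ruby
# Hypothetical wrapper: ask the primary checker first and consult the
# fallback only when the primary returns no suggestion.
class FallbackChecker
  def initialize(primary:, fallback:)
    @primary = primary
    @fallback = fallback
  end

  def correct(input)
    suggestions = @primary.correct(input)
    suggestions.empty? ? @fallback.correct(input) : suggestions
  end
end

# Tiny adapter so a lambda can play the role of a checker in this example.
LambdaChecker = Struct.new(:fn) do
  def correct(input)
    fn.call(input)
  end
end

valid = 'spec/models/user_model_spec.rb'
# Toy stand-in for the tree checker: gives up when the separator count is wrong.
toy_tree = LambdaChecker.new(->(input) { input.count('/') == valid.count('/') ? [valid] : [] })
# Toy stand-in for the standard checker: always offers two candidates.
toy_standard = LambdaChecker.new(->(_input) { [valid, 'spec/models/post_model_spec.rb'] })

combined = FallbackChecker.new(primary: toy_tree, fallback: toy_standard)
combined.correct('spec/models/user_modle_spec.rb')     # => ["spec/models/user_model_spec.rb"]
combined.correct('spec/modelsuser_model_spec.rb').size # => 2
```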

So at this point, I would appreciate feedback on the following points:

  • Is HumanTypo a valid way to compare the spell checkers, or is there some other way that you use?
  • Do you agree with the three criteria that I am using? Are there other criteria I should be using?
  • Are the preliminary results encouraging or were you expecting something else?

@obromios
Collaborator Author

Thank you. I will try some enhancements to the algorithm, improve test coverage and refactor the code. Then I will push it to here for a detailed review.

@yuki24
Member

yuki24 commented Mar 30, 2019

Thanks @obromios for all the work on compiling this PR! I haven't taken a deep dive into it yet but I think this looks very promising. I think the entire change is in good shape and there are only a few changes needed to merge this PR (maybe just removing .DS_Store and moving the TreeSpellHumanTypo to test).

I'm going to take a week off for my open source projects in the week of April 7th and I should be able to take a look at this around that timeframe. Thank you again for such a wonderful proposal and even implementing it!

@obromios
Collaborator Author

I agree with the two changes you have suggested. I will also be adding more extensive tests and there are some bugs I need to track down. I will push a new version for you to review by April 7.

Chris Drane added 4 commits April 1, 2019 16:14
Add test for too tiny input to HumanTypo

Add rspec directory

Add test for spec directory

Eliminate correct word errors

Correct special cases

Choose only option if only one option

Add extra words to special_words

Measure execution speed

Surprised to see tree is more than 10 times faster
than standard

Add comments to temp.md

Add augment: true option tree_spell_checker.new

Add tests for this and leafless word

Change base_name to find_leaves

Move HumanTypo

Refactor HumanTypo

Add substitution to HumanTypo

Add subtitution to HumanTypo

Tidy explore execution time

Move test files into tree_spell folder

move TreeSpellChecker into DidYouMean module

Remove .DS_Store

Test different lambdas for accuracy

Tidy up explore and did_you_mean.rb

Refactor check_element

Refactor parse

Refactor tree_spell_checker

Complete first cut of documentation

Update documentation

Remove pry

Tidy code
@obromios
Collaborator Author

obromios commented Apr 8, 2019

I have re-factored the code, improved test coverage, done more extensive analysis of accuracy, and measured the execution time.

The results appear to be promising: for a typical rspec project, the tree spell checker augmented by the standard spell checker gives higher accuracy, far fewer suggestions and a much faster execution time. It seems that the tree spell checker and the standard spell checker are synergistic.

More information on the algorithm and execution times can be found in doc/tree_spell_algorithm.md.

These performance tests are run with test/tree_spell/explore_test.rb. This is not a proper test file, in that there are no assertions in it. It also takes over ten minutes to run, so I have disabled it by setting the constant TREE_SPELL_EXPLORE to false at the top of the file. To run the file, set TREE_SPELL_EXPLORE to true. I know this is unusual, so let me know if you want to handle this file in another way. Note that it is also possible to run quick assessments by using a smaller value of n_repeat in the various tests.

I was not sure how this project handles detailed API documentation, so I placed the API documentation for HumanTypo and TreeSpellChecker in the doc folder. Please let me know how you want me to do this, i.e. move it into the actual files or put it somewhere else.

@obromios
Collaborator Author

obromios commented May 6, 2019

I just realised that the documentation files were not included in my last push. The reason for this is that the /doc directory is in .gitignore. I am not sure why this is, but rather than change that, I have moved the three documentation files (file type *.md) to the test/tree_spell directory, pending a decision on the correct place to store them.

Here is a summary of the performance of the augmented tree_spell algorithm on a decent sized rspec project, rspec-core:

                               Rspec Summary                                
--------------------------------------------------------------------------------
 Method  |   First Time (%)    Mean Suggestions       Failures (%)              
--------------------------------------------------------------------------------
 Tree                94.7                1.0                 5.3                 
 Standard            98.2                4.2                 1.1                 
 Augmented           99.7                1.2                 0.2   

It can be seen that, compared with the Standard spell checker, the overall error rate is reduced by a factor of about 5, and the mean number of suggestions by a factor of almost 4.

The execution speeds for the above were:


Testing execution time of Standard
Average time (ms): 40.6

Testing execution time of Tree
Average time (ms): 2.7

Testing execution time of Augmented Tree
Average time (ms): 4.5

so the execution time is also reduced by a factor of almost 10.
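An average execution time like the ones reported above could be measured roughly as follows. This is a sketch using Ruby's standard Benchmark module, not the code in test/tree_spell/explore_test.rb; the helper name `average_ms` and the trivial workload are assumptions made for the example.

```ruby
require 'benchmark'

# Hypothetical helper: average wall-clock time of the given block over
# n_repeat runs, in milliseconds.
def average_ms(n_repeat)
  total_seconds = Benchmark.realtime { n_repeat.times { yield } }
  1000.0 * total_seconds / n_repeat
end

# Placeholder workload; in practice the block would call a checker's
# `correct` method on a simulated typo.
words = %w[db:migrate db:rollback db:seed]
ms = average_ms(1_000) { words.include?('db:migrte') }
puts format('Average time (ms): %.4f', ms)
```

Averaging over many repeats smooths out timer resolution and one-off costs, which matters when a single call takes only a few milliseconds.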

Please let me know if you have any questions.

@yuki24 yuki24 merged commit 63b92df into ruby:master Jun 10, 2019