@bact (Member) commented May 21, 2021

Remove or fix misspelled entries, e.g.:

  • โซเดียมไฮดรอกไซค์
  • เตอร์กเมนิสถาน
  • สารขันฑ์

Remove these words, as they do not appear anywhere else on the internet (the few search results all come from word lists similar to words_th.txt):

  • โวลล์แมชชีน
  • ไวอาร์ตีลเอส

This will fix #557
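
For illustration only, the removal step could be scripted roughly as below. This is a minimal sketch, not the method used in this PR: it assumes the word list is a UTF-8 text file with one word per line, and the helper names `remove_words` and `clean_word_list` are hypothetical.

```python
from pathlib import Path


def remove_words(lines, to_remove):
    """Return the entries of `lines` not listed in `to_remove`, order preserved."""
    banned = {w.strip() for w in to_remove}
    return [line for line in lines if line.strip() not in banned]


def clean_word_list(path, to_remove):
    """Rewrite a one-word-per-line UTF-8 file without the given words.

    Assumption: `path` points to a file laid out like words_th.txt
    (one entry per line). Returns the number of entries removed.
    """
    p = Path(path)
    lines = p.read_text(encoding="utf-8").splitlines()
    kept = remove_words(lines, to_remove)
    p.write_text("\n".join(kept) + "\n", encoding="utf-8")
    return len(lines) - len(kept)
```

Calling `clean_word_list("words_th.txt", ["โวลล์แมชชีน", "ไวอาร์ตีลเอส"])` would drop the two unattested entries and leave every other line untouched. Misspellings that need correcting rather than removing would still have to be edited by hand.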

@bact added the bug (bugs in the library) and corpus (corpus/dataset-related issues) labels May 21, 2021
@bact self-assigned this May 21, 2021
@bact added this to the 2.4 milestone May 21, 2021
@bact changed the title from "Fix dict" to "Fix misspellings in dictionary (words_th.txt)" May 21, 2021
@coveralls commented May 21, 2021

Coverage decreased (-0.03%) to 95.767% when pulling e611cd9 on bact:fix-dict into 68c3e81 on PyThaiNLP:dev.

@bact merged commit 3e4b585 into PyThaiNLP:dev May 21, 2021
Linked issue: Misspellings and errors in dictionary for word tokenization
