@bact bact commented Oct 20, 2019

  • Remove words with invalid character sequences, such as doubled tone marks or SARA E + SARA E
  • Remove some words in the pattern ไม่+adj (where the adj already exists in the dictionary)
  • Remove some words in the pattern อย่าง+adv (where the adv already exists in the dictionary)
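
For illustration only, the kind of cleanup listed above could be sketched roughly as follows. This is not the script used for this PR: `clean_words`, `INVALID_SEQ` and `PREFIXES` are hypothetical names, and the sketch has no part-of-speech information, so it drops every ไม่/อย่าง compound whose remainder is already in the word list, which is more aggressive than the "some words" removed here.

```python
import re

# Thai tone marks MAI EK, MAI THO, MAI TRI, MAI CHATTAWA (U+0E48..U+0E4B)
TONE_MARKS = "\u0e48\u0e49\u0e4a\u0e4b"
# SARA E (U+0E40); two in a row is usually a mistyped SARA AE (U+0E41)
SARA_E = "\u0e40"

# Matches two consecutive tone marks, or SARA E directly followed by SARA E
INVALID_SEQ = re.compile(f"[{TONE_MARKS}]{{2}}|{SARA_E}{{2}}")

# Prefixes whose compounds are redundant when the base word is already listed
PREFIXES = ("ไม่", "อย่าง")


def clean_words(words):
    """Return the set of words that survive both filters (hypothetical helper)."""
    vocab = set(words)
    kept = set()
    for word in vocab:
        # Filter 1: invalid character sequences
        if INVALID_SEQ.search(word):
            continue
        # Filter 2: prefix + a base word that is already in the list.
        # Assumption: no POS check, so any known base word counts.
        if any(word.startswith(p) and word[len(p):] in vocab for p in PREFIXES):
            continue
        kept.add(word)
    return kept
```

A real cleanup would likely review the candidates by hand rather than delete them automatically, since some ไม่/อย่าง compounds are lexicalised and worth keeping as single tokens.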

- remove some words in the pattern ไม่+adj
- remove some words in the pattern อย่าง+adv
@coveralls

Coverage Status

Coverage remained the same at 90.265% when pulling b0c7a86 on update-dict3 into 89c21d3 on dev.

@bact bact merged commit 5e7f032 into dev Oct 21, 2019
@bact bact added the corpus corpus/dataset-related issues label Oct 21, 2019
@bact bact added this to the 2.1 milestone Oct 21, 2019
@bact bact deleted the update-dict3 branch October 25, 2019 19:26
@bact bact changed the title from "Update dictionary" to "Remove error words in tokenization dictionary" May 16, 2021