Common contracted forms are missing from the English stop word list #22

@DavidNemeskey

While the list contains s and t (most likely because they can follow an apostrophe in forms such as dog's and can't), other common forms are missing, namely:

  • d as in she'd,
  • ll as in we'll,
  • m as in I'm,
  • o as in o'clock,
  • re as in you're,
  • ve as in they've,
  • y as in y'all.

Also missing are the parts of contractions that fall to the left of the apostrophe, e.g. ain (although don is there).

Of course, the lack of these forms could be justified by pointing out that if the tokenizer does not split at apostrophes, these forms will never occur in the tokenized text. However, that is a strong assumption, especially given that nltk's own Punkt tokenizer, for instance, does split at apostrophes. Also, some contractions are already handled (don't, can't, the possessive s), so it is inconsistent to leave out the rest.
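
To make the problem concrete, here is a minimal sketch in plain Python. It does not use nltk: `split_at_apostrophes` is a hypothetical stand-in for any tokenizer that splits at apostrophes, and `SAMPLE_STOP_WORDS` contains only the three entries mentioned above as present (s, t, don), not the full list. All names in it are illustrative.

```python
import re

# Stand-in for the stop word list: only the entries this issue says are present.
# The real list is much longer; this subset is an assumption for illustration.
SAMPLE_STOP_WORDS = {"s", "t", "don"}

# Fragments reported as missing in this issue.
MISSING_FRAGMENTS = {"d", "ll", "m", "o", "re", "ve", "y", "ain"}


def split_at_apostrophes(text):
    """Hypothetical tokenizer: lowercase the text and keep runs of letters,
    so every apostrophe (and any other punctuation) acts as a separator."""
    return re.findall(r"[a-z]+", text.lower())


def leaked_fragments(text, stop_words):
    """Return the contraction fragments that survive stop word filtering."""
    tokens = split_at_apostrophes(text)
    return sorted({t for t in tokens if t in MISSING_FRAGMENTS and t not in stop_words})


sentence = "She'd say we'll go, but I'm sure they've left and you're late."

# With the sample list, the fragments pass through the filter.
print(leaked_fragments(sentence, SAMPLE_STOP_WORDS))

# After adding the missing fragments, nothing leaks.
print(leaked_fragments(sentence, SAMPLE_STOP_WORDS | MISSING_FRAGMENTS))
```

With the sample list, d, ll, m, re and ve all end up in the filtered output as if they were content words, whereas the t from can't and the s from dog's are removed. Until the list is extended, taking the union with the missing fragments, as in the last call, is a possible workaround on the user's side.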
