While the list contains s and t (most likely because they can occur after an apostrophe as part of a contraction in e.g. dog's and can't), other common forms, i.e.
d as in she'd, ll as in we'll, m as in I'm, o as in o'clock, re as in you're, ve as in they've, y as in y'all
are missing.
Also missing are the parts of these contractions that fall to the left of the apostrophe, e.g. ain (but don is there).
Of course, the absence of these forms could be justified by pointing out that if the tokenizer does not split on apostrophes, these forms will never occur in the tokenized text. However, that is a strong assumption, especially given that nltk's own Punkt tokenizer, for instance, does split at apostrophes. Also, some contractions already seem to be handled (don't, can't, the possessive s), so it makes little sense to leave out the rest.
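To make the problem concrete, here is a minimal sketch of what happens when a tokenizer splits at apostrophes. Both the regex tokenizer and the small `STOPWORDS` set are assumptions made for illustration: the set only mimics a list that has s, t and don but lacks the other fragments, and it is not the actual nltk list or tokenizer.

```python
import re

# Hypothetical excerpt of a stopword list (an assumption, not the real list):
# it contains "s", "t" and "don", but none of the other contraction fragments.
STOPWORDS = {"i", "she", "we", "you", "they", "it", "all", "can", "don", "s", "t"}

def tokenize(text):
    # Split into runs of word characters or runs of punctuation,
    # so "she'd" becomes ["she", "'", "d"].
    return re.findall(r"\w+|[^\w\s]+", text.lower())

def leftover_fragments(text):
    # Alphabetic tokens that the stopword list fails to filter out.
    return [tok for tok in tokenize(text)
            if tok.isalpha() and tok not in STOPWORDS]

sample = "she'd we'll I'm you're they've y'all ain't don't can't it's"
print(leftover_fragments(sample))
```

With this setup, the fragments from don't, can't and it's are filtered, while d, ll, m, re, ve, y and ain pass through as if they were content words.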