Closed
Description
In the following sentence (from Twitter), `'email'` is being tokenized as `'em`, `ail`, and `'`. This is obviously incorrect. What can be done to stop this split?
- It's official (according to the AP) it's 'email' not 'e-mail' and 'website' not 'web-site'!
I have the following parameters set:
- `tokenize.language`: English
- `tokenize.whitespace`: false (because we want tokens like `it's` to separate into `it` and `'s`)
- `tokenize.keepeol`: false
- `tokenize.verbose`: false
- `tokenize.options`: `invertible=true,splitAssimilations=false,splitHyphenated=false,splitForwardSlash=true,untokenizable=allKeep,strictTreebank3=true,normalizeSpace=false,ellipses=original`
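
For anyone trying to reproduce this, here is a minimal sketch of the same settings collected into a `java.util.Properties` object, which is the form CoreNLP pipelines are usually configured with. The class name `TokenizerConfig` and the `annotators=tokenize` entry are assumptions added for illustration and are not part of the original report. The sketch uses only the standard library, so it shows the configuration and does not run the tokenizer.

```java
import java.util.Properties;
import java.util.TreeSet;

// Hypothetical helper: gathers the reported tokenizer settings in one place.
class TokenizerConfig {
    // The tokenize.options value exactly as listed in the report.
    static final String OPTIONS =
        "invertible=true,splitAssimilations=false,splitHyphenated=false,"
        + "splitForwardSlash=true,untokenizable=allKeep,strictTreebank3=true,"
        + "normalizeSpace=false,ellipses=original";

    static Properties build() {
        Properties props = new Properties();
        // Assumption: only the tokenizer is needed to reproduce the split.
        props.setProperty("annotators", "tokenize");
        props.setProperty("tokenize.language", "English");
        // false so that "it's" still separates into "it" and "'s".
        props.setProperty("tokenize.whitespace", "false");
        props.setProperty("tokenize.keepeol", "false");
        props.setProperty("tokenize.verbose", "false");
        props.setProperty("tokenize.options", OPTIONS);
        return props;
    }

    public static void main(String[] args) {
        Properties props = build();
        // Print the settings in a stable (sorted) order for easy comparison.
        for (String key : new TreeSet<>(props.stringPropertyNames())) {
            System.out.println(key + " = " + props.getProperty(key));
        }
        // With CoreNLP on the classpath, these properties would typically be
        // passed to the pipeline constructor: new StanfordCoreNLP(props)
    }
}
```

Feeding the tweet above through a pipeline built from these properties should be enough to check whether `'email'` still comes out as three tokens.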