'email' tokenizing as 'em, ail, and '

In the following sentence (from Twitter), `'email'` is being tokenized as `'em`, `ail`, and `'`. This is obviously incorrect. What can be done to stop this split?

* It's official (according to the AP) it's 'email' not 'e-mail' and 'website' not 'web-site'!

I have the following parameters set:

* `tokenize.language`: English
* `tokenize.whitespace`: false (because we want tokens like `it's` to separate into `it` and `'s`)
* `tokenize.keepeol`: false
* `tokenize.verbose`: false
* `tokenize.options`: invertible=true,splitAssimilations=false,splitHyphenated=false,splitForwardSlash=true,untokenizable=allKeep,strictTreebank3=true,normalizeSpace=false,ellipses=original
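
For anyone trying to reproduce this, the settings listed above could be collected into a `java.util.Properties` object, roughly as sketched here. This is only an illustration of the configuration, not the exact code from my application: the class name `TokenizerSettings` and the method `build` are hypothetical, and the sketch stops short of constructing a pipeline, so it runs with the standard library alone.

```java
import java.util.Properties;
import java.util.TreeSet;

public class TokenizerSettings {

    // Hypothetical helper: gathers the tokenizer settings from this report
    // into one Properties object. Keys and values are copied from the list above.
    static Properties build() {
        Properties props = new Properties();
        props.setProperty("tokenize.language", "English");
        // false so that tokens like "it's" can separate into "it" and "'s"
        props.setProperty("tokenize.whitespace", "false");
        props.setProperty("tokenize.keepeol", "false");
        props.setProperty("tokenize.verbose", "false");
        // tokenize.options is a single comma-separated string
        props.setProperty("tokenize.options", String.join(",",
                "invertible=true",
                "splitAssimilations=false",
                "splitHyphenated=false",
                "splitForwardSlash=true",
                "untokenizable=allKeep",
                "strictTreebank3=true",
                "normalizeSpace=false",
                "ellipses=original"));
        return props;
    }

    public static void main(String[] args) {
        Properties props = build();
        // Print the settings in sorted key order for easy comparison.
        for (String name : new TreeSet<>(props.stringPropertyNames())) {
            System.out.println(name + " = " + props.getProperty(name));
        }
    }
}
```

An object like this is what would normally be handed to the pipeline constructor. The test sentence above would then be run through the tokenizer to check how `'email'` is split.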

'email' tokenizing as 'em, ail, and ' #1316
