Compound words tokenization failure #8

Open
cathxiao opened this issue Sep 2, 2017 · 1 comment

cathxiao commented Sep 2, 2017

Expected Behavior

Compound words (e.g. pick-me-up, hand-me-down, know-it-all) should be tokenized as single tokens.

Actual Behavior

Hyphens are treated as separators, so each component of the compound is tokenized separately.
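
To make the difference concrete, here is a minimal sketch in Python using plain regular expressions. It is not this project's tokenizer: the function names `tokenize_split_hyphens` and `tokenize_keep_compounds` are hypothetical, and emitting each hyphen as its own token in the first function is an assumption made only for illustration.

```python
import re

# Hypothetical illustration only; this is NOT the project's tokenizer.

def tokenize_split_hyphens(text):
    """Treat hyphens as separators (the reported behavior).

    Assumption: each hyphen is emitted as its own punctuation token.
    """
    return re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", text)

def tokenize_keep_compounds(text):
    """Keep hyphen-joined words together as one token (the expected behavior)."""
    return re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*|[^\sA-Za-z0-9]", text)

sentence = "She needs a pick-me-up."
print(tokenize_split_hyphens(sentence))
print(tokenize_keep_compounds(sentence))
```

The only difference between the two patterns is the optional `(?:-[A-Za-z0-9]+)*` group, which lets a word absorb any hyphen-joined continuation.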

@cathxiao cathxiao added the bug label Sep 2, 2017
@cathxiao cathxiao self-assigned this Sep 2, 2017
jdchoi77 (Member) commented Sep 3, 2017

These should be split into separate tokens, because the same expressions can occur without the hyphens (e.g., pick me up) and both forms should be tokenized consistently.
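
The consistency argument can be sketched in a few lines of Python. This is an illustrative assumption, not the project's code: `split_on_hyphens` is a hypothetical helper that drops hyphens and whitespace alike and returns only the word tokens.

```python
import re

# Hypothetical helper for illustration; not part of the project.
def split_on_hyphens(text):
    # Assumption: hyphens act purely as separators and are discarded,
    # so only alphanumeric runs are returned as tokens.
    return re.findall(r"[A-Za-z0-9]+", text)

hyphenated = split_on_hyphens("pick-me-up")
spaced = split_on_hyphens("pick me up")

# Both spellings yield the same token sequence.
print(hyphenated == spaced)
```

Under this scheme the hyphenated and unhyphenated spellings map to the same tokens, which is the consistency the comment asks for. Keeping the compound as one token would instead give the two spellings different tokenizations.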
