Issue 6 #7

Open
wants to merge 4 commits into base: master

Conversation

joshweir

Fix #6

Created a new splittable, PRE_N_POST_ONLY, which holds characters that can be both prefixes and suffixes but are split off only at the beginning or end of a token. The one exception is that they are still split when other splittables sit between them and the token boundary.
Taking the single quote ' as a PRE_N_POST_ONLY splittable, the following are valid uses as a splittable:

  • 'test quotes'
  • 'test quotes'. <- suffixed by another splittable
  • ('test quotes'). <- prefixed and suffixed by another splittable

The following are not valid uses as a splittable, because the quote sits inside the token:

  • l'interrelation
  • l'imagerie
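
The rule above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the `PRE` and `POST` sets, the contents of `PRE_N_POST_ONLY`, and the function names `split_chunk` and `tokenize` are all assumptions made for the example.

```python
# Hypothetical character sets for illustration; the PR's actual lists may differ.
PRE = set("([{")
POST = set(")]}.,;:!?")
PRE_N_POST_ONLY = set("'\"")


def split_chunk(chunk):
    """Split one whitespace-delimited chunk into tokens."""
    leading, trailing = [], []
    start, end = 0, len(chunk)
    # Peel splittables off the front. A quote qualifies here because only
    # other splittables (or nothing) come before it.
    while start < end and chunk[start] in PRE | PRE_N_POST_ONLY:
        leading.append(chunk[start])
        start += 1
    # Peel splittables off the back in the same way.
    while end > start and chunk[end - 1] in POST | PRE_N_POST_ONLY:
        trailing.append(chunk[end - 1])
        end -= 1
    core = chunk[start:end]
    # A quote left inside the core (as in l'imagerie) is never split.
    return leading + ([core] if core else []) + trailing[::-1]


def tokenize(text):
    """Tokenize text chunk by chunk, splitting on whitespace first."""
    tokens = []
    for chunk in text.split():
        tokens.extend(split_chunk(chunk))
    return tokens
```

Under these assumptions, `tokenize("('test quotes').")` separates the parentheses, quotes and full stop from the two words, while `tokenize("l'imagerie")` returns the word as a single token.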

…eg. https://www.google.com, google.com, etc

fix hardcoding of tokenizer path in test_tokenize_urls test

refactor tokenizer

fix bug when url contains directories the entire url would not be a single token

refactor lib/tokenizer
# The first commit's message is:
recognize a complete url as a token, this includes various url forms eg. https://www.google.com, google.com, etc

# This is the 2nd commit message:

fix hardcoding of tokenizer path in test_tokenize_urls test

# This is the 3rd commit message:

refactor tokenizer

# This is the 4th commit message:

fix bug when url contains directories the entire url would not be a single token
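
For readers unfamiliar with the idea in these commits, here is a minimal sketch of keeping a whole URL, including any directories, as one token. It is not the implementation in this PR: the pattern `URL_RE` and the functions `is_url` and `tokenize_keeping_urls` are hypothetical, and the pattern is deliberately loose (for example, it would also accept a string such as `end.Start`, and a URL followed directly by a full stop is not recognized).

```python
import re

# Hypothetical pattern, not the PR's regex: optional scheme, dotted host
# ending in an alphabetic top-level label, then an optional path.
URL_RE = re.compile(
    r"^(?:https?://)?"       # optional scheme
    r"(?:[A-Za-z0-9-]+\.)+"  # one or more host labels, each followed by a dot
    r"[A-Za-z]{2,}"          # top-level label
    r"(?:/[^\s]*)?$"         # optional path, including directories
)


def is_url(token):
    """Return True if the whole token looks like a URL under URL_RE."""
    return URL_RE.match(token) is not None


def tokenize_keeping_urls(text):
    """Keep URL-like chunks whole; split everything else on punctuation."""
    tokens = []
    for chunk in text.split():
        if is_url(chunk):
            tokens.append(chunk)
        else:
            # Fallback: separate runs of word characters from punctuation marks.
            tokens.extend(re.findall(r"\w+|[^\w\s]", chunk))
    return tokens
```

Checking each whitespace-delimited chunk against the pattern before any punctuation splitting is what keeps the dots and slashes inside a URL from being treated as splittables.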
Development

Successfully merging this pull request may close these issues.

french words that contains single quote get broken down