Fix/broken numeric data format (#652) #723
Conversation
Nice and clean code. Thanks for the contribution. Just asking for some ideas. What about putting it like this?

word_tokenize(
    preprocessors: List[Callable[[List[str]], List[str]]] = [],
    postprocessors: List[Callable[[List[str]], List[str]]] = [
        join_broken_numeric_data_format
    ],
)

So the user can have some control over the behavior of the tokenizer.
I think we should use booleans for common use-cases and pre/postprocessors for customized ones.

def word_tokenize(
    text,
    custom_dict,
    engine,
    keep_whitespace=True,
    join_broken_numeric_format=True,  # here
    preprocessors: List[Callable[[str], str]] = [],
    postprocessors: List[Callable[[List[str]], List[str]]] = [],
):
    # do something

preprocs = [normalize_gamer_style_input, remove_spaces]  # transform str -> str
postprocs = [add_prefix_postfix]  # transform List[str] -> List[str]

word_tokenize(
    "L ด็ ก ๆ กิ u ข้ า ว sึ ยั J",
    join_broken_numeric_format=True,
    preprocessors=preprocs,
    postprocessors=postprocs,
)
# output: ["<sos>", "เด็ก", "ๆ", "กินข้าว", "รึ", "ยัง", "<eos>"]
That's neat.
Cool. Let me implement it.
I changed the names of the parameter and the postprocessor, just to make them shorter.
I also revised the code example and slightly edited the description.
Looks good, @wannaphong, I think we can merge.
Thanks!
Kudos, SonarCloud Quality Gate passed!
Awesome! Do you want to get Hacktoberfest?
Nope, I'm good. Thanks 👍
What does this change
Fix numeric data formats, such as times, comma-separated numbers, decimal numbers, and IP addresses, that are incorrectly split by word_tokenize.
What was wrong
For some tokenizers (especially neural-network-based ones), numbers that contain separator symbols (".", ",", ":") are tokenized into multiple parts, even though they should stay atomic.
Example:
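(The original example is not reproduced here; the snippet below is a hypothetical illustration of the issue.)

```python
from pythainlp.tokenize import word_tokenize

# Hypothetical illustration: a time such as "10:30" may be split at the colon.
print(word_tokenize("ประชุมเวลา 10:30 น."))
# e.g. [..., '10', ':', '30', ...]  instead of  [..., '10:30', ...]
```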
How this fixes it
Postprocess the tokenized tokens with the regular expression
r"(\d+[\.\,:])+\d+"
which matches one or more groups of digits followed by ".", "," or ":", ending with digits.
Fixes #652
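For context, here is a minimal sketch of what such a regex-based postprocessor could look like. The function name join_broken_num and the exact merging strategy are illustrative assumptions, not necessarily identical to this PR's implementation.

```python
import re
from typing import List

_DIGITS = re.compile(r"\d+")
_SEP = re.compile(r"[\.\,:]")

def join_broken_num(tokens: List[str]) -> List[str]:
    """Illustrative sketch: merge token runs like ['10', ':', '30'] back into '10:30',
    so the joined result matches the numeric pattern described above."""
    merged: List[str] = []
    i = 0
    while i < len(tokens):
        current = tokens[i]
        if _DIGITS.fullmatch(current):
            # Greedily absorb following "<separator><digits>" pairs, covering
            # times, decimals, comma-grouped numbers, and IP addresses.
            while (
                i + 2 < len(tokens)
                and _SEP.fullmatch(tokens[i + 1])
                and _DIGITS.fullmatch(tokens[i + 2])
            ):
                current += tokens[i + 1] + tokens[i + 2]
                i += 2
        merged.append(current)
        i += 1
    return merged

print(join_broken_num(["192", ".", "168", ".", "1", ".", "1"]))  # ['192.168.1.1']
print(join_broken_num(["10", ":", "30"]))                        # ['10:30']
```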
Your checklist for this pull request
🚨 Please review the guidelines for contributing to this repository.