Skip to content

Fix/broken numeric data format (#652) #723

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 15 commits into from
Oct 12, 2022

Conversation

noppayut
Copy link
Contributor

@noppayut noppayut commented Oct 11, 2022

What does this changes

Fix incorrectly word_tokenize-d numeric data formats such as time, comma-separated numbers, decimal number, and ip address.

What was wrong

For some tokenizers (esp. neural nets), numbers separated with symbols (. , :) are tokenized into multiple parts. They should be atomic.

Example:

"12:34pm" -> ["12", ":", "34pm"]

How this fixes it

Postprocess the tokenized tokens with regular expression.
r"(\d+[\.\,:])+\d+": digits followed by either ".,:" once, possibly repeated, and ended with digits.

Fixes #652

Your checklist for this pull request

🚨Please review the guidelines for contributing to this repository.

  • [O] Passed code styles and structures
  • [O] Passed code linting checks and unit test

@wannaphong wannaphong requested review from bact and wannaphong October 11, 2022 08:14
@coveralls
Copy link

Coverage Status

Coverage increased (+0.1%) to 93.803% when pulling 7255e2c on noppayut:fix/broken-numeric-data-format into 39b814a on PyThaiNLP:dev.

@coveralls
Copy link

coveralls commented Oct 11, 2022

Coverage Status

Coverage increased (+0.1%) to 93.791% when pulling c0e48e9 on noppayut:fix/broken-numeric-data-format into 39b814a on PyThaiNLP:dev.

@bact bact added the bug bugs in the library label Oct 11, 2022
@bact
Copy link
Member

bact commented Oct 11, 2022

Nice and clean code. Thanks for the contribution.

Just asking for some ideas. What about putting it like this?

word_tokenize(
preprocessors = Callable[[List[str]], List[str]] = []
postprocessors = Callable[[List[str]], List[str]] = [
        join_broken_numeric_data_format
    ]
)

So the user can have some control over the behavior of the tokenizer.

@noppayut
Copy link
Contributor Author

I think we should use booleans for common use-cases and pre/postprocessors for customized ones.

def word_tokenize(
    text, 
    custom_dict, 
    engine, 
    keep_whitespace=true, 
    join_broken_numeric_format=true,   # here
    preprocessors = Callable[[str], str] = []
    postprocessors = Callable[[List[str]], List[str]] = []
):
    # do something

preprocs = [normalize_gamer_style_input, remove_spaces]   # transform str -> str
postprocs = [add_prefix_postfix]   # transform List[str] -> List[str]
word_tokenize(
    "L ด็ ก ๆ กิ u ข้ า ว sึ ยั J", 
    join_broken_numeric_format=true, 
    preprocessors=preprocs,
    postprocessors=postprocs
)
# output: ["<sos>", "เด็ก", "ๆ", "กินข้าว", "รึ", "ยัง", "<eos>"]

@bact
Copy link
Member

bact commented Oct 11, 2022

That's neat.

@noppayut
Copy link
Contributor Author

Cool. Let me implement join_broken_numeric_format option for this PR, and send another one for the new word_tokenize API. ;)

Copy link
Member

@bact bact left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the names of the parameter and the postprocessor, just to make them shorter.

Also revise the code example and slightly edit the description.

Looks good, @wannaphong I think we can merge.

@bact bact self-assigned this Oct 11, 2022
@bact bact added this to the 3.2 milestone Oct 12, 2022
@noppayut
Copy link
Contributor Author

Thanks!

@sonarqubecloud
Copy link

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 2 Code Smells

No Coverage information No Coverage information
0.0% 0.0% Duplication

Copy link
Member

@wannaphong wannaphong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! Do you want to get hacktoberfest?

@noppayut
Copy link
Contributor Author

Nope I'm good. Thanks 👍

@wannaphong wannaphong merged commit 2606f85 into PyThaiNLP:dev Oct 12, 2022
@bact bact added Hacktoberfest for Hacktoberfest event hacktoberfest-accepted hacktoberfest accepted pull requests. labels Oct 12, 2022
@wannaphong wannaphong mentioned this pull request Apr 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug bugs in the library Hacktoberfest for Hacktoberfest event hacktoberfest-accepted hacktoberfest accepted pull requests.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Mistake in word tokenization for text containing digit related time and finance
4 participants