Fix/broken numeric data format (#652) #723

noppayut · 2022-10-11T08:02:36Z

What does this change

Fix numeric data formats that word_tokenize splits incorrectly, such as times, comma-separated numbers, decimal numbers, and IP addresses.

What was wrong

For some tokenizers (esp. neural nets), numbers separated with symbols (. , :) are tokenized into multiple parts. They should be atomic.

Example:

"12:34pm" -> ["12", ":", "34pm"]

How this fixes it

Post-process the tokens with a regular expression.
r"(\d+[\.\,:])+\d+": one or more groups of digits, each followed by a single separator (".", "," or ":"), ending with digits.
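
To illustrate the idea, here is a minimal sketch of such a postprocessor. It is not the code from this PR: the helper name rejoin_numeric_tokens and the span-walking approach are assumptions for illustration. Only the regular expression is taken from the description above.

```python
import re
from typing import List

# Pattern quoted in the PR description.
NUM_FORMAT = re.compile(r"(\d+[\.\,:])+\d+")


def rejoin_numeric_tokens(tokens: List[str]) -> List[str]:
    """Hypothetical helper: merge tokens that were split inside a numeric format."""
    text = "".join(tokens)
    # Character spans of the original text that should stay atomic.
    spans = [m.span() for m in NUM_FORMAT.finditer(text)]

    merged: List[str] = []
    pos = 0        # offset of the current token in `text`
    span_idx = 0   # index of the first span that has not ended yet
    for tok in tokens:
        # Skip spans that end at or before the current offset.
        while span_idx < len(spans) and spans[span_idx][1] <= pos:
            span_idx += 1
        # A token that starts strictly inside a span belongs to the previous token.
        inside = (
            span_idx < len(spans)
            and spans[span_idx][0] < pos < spans[span_idx][1]
        )
        if inside and merged:
            merged[-1] += tok
        else:
            merged.append(tok)
        pos += len(tok)
    return merged


print(rejoin_numeric_tokens(["12", ":", "34pm"]))  # ['12:34pm']
```

In this sketch a token that only partly overlaps a match (like "34pm") is merged as a whole, so the trailing "pm" stays attached to the time.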

Fixes #652

Your checklist for this pull request

🚨Please review the guidelines for contributing to this repository.

[O] Passed code styles and structures
[O] Passed code linting checks and unit test

coveralls · 2022-10-11T08:43:20Z

Coverage increased (+0.1%) to 93.803% when pulling 7255e2c on noppayut:fix/broken-numeric-data-format into 39b814a on PyThaiNLP:dev.

coveralls · 2022-10-11T08:43:20Z

Coverage increased (+0.1%) to 93.791% when pulling c0e48e9 on noppayut:fix/broken-numeric-data-format into 39b814a on PyThaiNLP:dev.

bact · 2022-10-11T09:52:38Z

Nice and clean code. Thanks for the contribution.

Just asking for some ideas. What about putting it like this?

word_tokenize(
    preprocessors: List[Callable[[List[str]], List[str]]] = [],
    postprocessors: List[Callable[[List[str]], List[str]]] = [
        join_broken_numeric_data_format
    ],
)

So the user can have some control over the behavior of the tokenizer.

noppayut · 2022-10-11T13:11:37Z

I think we should use booleans for common use-cases and pre/postprocessors for customized ones.

def word_tokenize(
    text: str,
    custom_dict,
    engine,
    keep_whitespace: bool = True,
    join_broken_numeric_format: bool = True,   # here
    preprocessors: List[Callable[[str], str]] = [],
    postprocessors: List[Callable[[List[str]], List[str]]] = [],
):
    # do something
    ...

preprocs = [normalize_gamer_style_input, remove_spaces]   # transform str -> str
postprocs = [add_prefix_postfix]   # transform List[str] -> List[str]
word_tokenize(
    "L ด็ ก ๆ กิ u ข้ า ว sึ ยั J",
    join_broken_numeric_format=True,
    preprocessors=preprocs,
    postprocessors=postprocs
)
# output: ["<sos>", "เด็ก", "ๆ", "กินข้าว", "รึ", "ยัง", "<eos>"]
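
The proposed flow (preprocessors on the raw string, then the tokenizer, then postprocessors on the token list) could be sketched roughly as follows. This is only an illustration under stated assumptions: tokenize_with_hooks, collapse_spaces and add_sos_eos are hypothetical names, and str.split stands in for a real tokenizer engine.

```python
import re
from typing import Callable, List, Sequence


def tokenize_with_hooks(
    text: str,
    tokenizer: Callable[[str], List[str]],
    preprocessors: Sequence[Callable[[str], str]] = (),
    postprocessors: Sequence[Callable[[List[str]], List[str]]] = (),
) -> List[str]:
    """Hypothetical wrapper: run str -> str hooks, tokenize, then run list -> list hooks."""
    for pre in preprocessors:
        text = pre(text)
    tokens = tokenizer(text)
    for post in postprocessors:
        tokens = post(tokens)
    return tokens


def collapse_spaces(text: str) -> str:
    # Hypothetical preprocessor: squeeze runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


def add_sos_eos(tokens: List[str]) -> List[str]:
    # Hypothetical postprocessor: wrap the tokens in sentence markers.
    return ["<sos>", *tokens, "<eos>"]


result = tokenize_with_hooks(
    "  hello   world ",
    tokenizer=str.split,  # stand-in for a real tokenizer engine
    preprocessors=[collapse_spaces],
    postprocessors=[add_sos_eos],
)
print(result)  # ['<sos>', 'hello', 'world', '<eos>']
```

The sketch uses empty tuples as defaults rather than `[]`, since a mutable default list is shared across calls in Python.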

bact · 2022-10-11T13:43:08Z

That's neat.

noppayut · 2022-10-11T13:51:16Z

Cool. Let me implement join_broken_numeric_format option for this PR, and send another one for the new word_tokenize API. ;)

bact

I changed the names of the parameter and the postprocessor, just to make them shorter.

I also revised the code example and slightly edited the description.

Looks good, @wannaphong I think we can merge.

noppayut · 2022-10-12T01:16:05Z

Thanks!

sonarqubecloud · 2022-10-12T02:30:17Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
2 Code Smells

No Coverage information
0.0% Duplication

wannaphong

Awesome! Do you want to get hacktoberfest?

noppayut · 2022-10-12T11:40:27Z

Nope I'm good. Thanks 👍

Noppayut Sriwatanasakdi and others added 3 commits October 11, 2022 16:41

Implement numeric data format fix as a postprocessor

d3ab7b0

Integrate numeric format postprocessor to word tokenizer

b9169a3

Add unittest for numeric data format fix

7255e2c

wannaphong requested review from bact and wannaphong October 11, 2022 08:14

bact added the bug bugs in the library label Oct 11, 2022

noppayut and others added 11 commits October 12, 2022 00:34

Allow user-defined postprocessors and rename functions

741273f

Add option to join broken numeric formats

2077fcf

Rename function, update docstring

f12834b

Use strip_whitespace as postprocessor

348f5b9

Update test cases to follow new word_tokenize api

375f3e9

Linting

63a2000

Another linting

5029e8c

Rename: fix_numeric_data_format -> rejoin_formatted_num

67004cc

Rename: join_broken_numeric_format -> join_broken_num

722bf47

Update test_tokenize.py

e88e21f

Adjust join_broken_num example

399858d

bact approved these changes Oct 11, 2022

bact self-assigned this Oct 11, 2022

bact added this to the 3.2 milestone Oct 12, 2022

Sort engine names alphabetically

c0e48e9

wannaphong approved these changes Oct 12, 2022

wannaphong merged commit 2606f85 into PyThaiNLP:dev Oct 12, 2022

bact added Hacktoberfest for Hacktoberfest event hacktoberfest-accepted hacktoberfest accepted pull requests. labels Oct 12, 2022

wannaphong mentioned this pull request Dec 7, 2022

PyThaiNLP 4.0 change log #714

Closed

wannaphong mentioned this pull request Apr 1, 2023

PyThaiNLP 4.0 beta 1 #786

Merged

wannaphong mentioned this pull request Apr 14, 2023

PyThaiNLP 4.0 Released! #789

Merged
