
bug: empty string ('') added (in some cases) when using word_tokenize with join_broken_num=True #911

@S2P2

Description

An empty string ('') is added after apply_postprocessors runs rejoin_formatted_num when join_broken_num=True (the default) in word_tokenize, for some inputs such as ',1.1 ' and '(1:5) '.

This causes an error in spaCy-PyThaiNLP (PyThaiNLP/spaCy-PyThaiNLP#4).

Expected results

from pythainlp.tokenize import word_tokenize

text = ",1.1 "

word_tokenize(text)  # join_broken_num=True is the default
# output: [',1.1', ' ']

Current results

from pythainlp.tokenize import word_tokenize

text = ",1.1 "

word_tokenize(text)  # join_broken_num=True is the default
# output: [',1.1', '', ' ']

Note that the empty string '' is added unexpectedly.

Steps to reproduce

Call word_tokenize with join_broken_num=True (the default) on text matching a pattern like ",1.1 " or "(1:5) ".
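The other reported pattern can be checked the same way (a sketch; per the report it also yields a spurious '' token):

from pythainlp.tokenize import word_tokenize

word_tokenize("(1:5) ")  # reported to contain an unexpected '' token as well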

PyThaiNLP version

5.0.2

Python version

3.10.12

Operating system and version

Google Colab


Possible solution

After investigating, I found that the problem lies in the rejoin_formatted_num function:

def rejoin_formatted_num(segments: List[str]) -> List[str]:

In the nested while loop below, there are cases where the condition pos < match.end() is false on the very first iteration of the inner loop. When that happens, the inner loop body never executes, connected_token remains an empty string, and that empty string is appended to tokens_joined (a worked trace follows the snippet).

while segment_idx < len(segments) and match:
    is_span_beginning = pos >= match.start()
    token = segments[segment_idx]
    if is_span_beginning:
        connected_token = ""
        while pos < match.end() and segment_idx < len(segments):  # <- can be false immediately
            connected_token += segments[segment_idx]
            pos += len(segments[segment_idx])
            segment_idx += 1

        tokens_joined.append(connected_token)
        match = next(matching_results, None)
    else:
        tokens_joined.append(token)
        segment_idx += 1
        pos += len(token)
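For illustration, here is a runnable sketch of how pos overshoots the match for the reported input ",1.1 ". The segment list and the separator pattern are assumptions standing in for the tokenizer's actual output and PyThaiNLP's _DIGITS_WITH_SEPARATOR:

import re

_digits_with_separator = re.compile(r"\d+(?:[,.:]\d+)+")  # assumed stand-in pattern
segments = [",1.1", " "]  # assumed tokenizer output for ",1.1 "

original = "".join(segments)  # ",1.1 "
match = next(_digits_with_separator.finditer(original))
print(match.start(), match.end())  # 1 4

# The first segment ",1.1" starts before the match (0 < 1), so the else
# branch emits it whole and pos jumps to 4. On the next iteration
# pos >= match.start() holds, but pos < match.end() (4 < 4) is already
# false, so the inner loop never runs and connected_token stays "".
pos = len(segments[0])
print(pos >= match.start(), pos < match.end())  # True False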

One possible solution is to add a guard, if connected_token != "":, before appending connected_token to tokens_joined.

The resulting full code of rejoin_formatted_num is provided below.

def rejoin_formatted_num(segments: List[str]) -> List[str]:
    """
    Rejoin well-known formatted numeric that are over-tokenized.
    The formatted numeric are numbers separated by ":", ",", or ".",
    such as time, decimal numbers, comma-added numbers, and IP addresses.

    :param List[str] segments: result from word tokenizer
    :return: a list of fixed tokens
    :rtype: List[str]

    :Example:
        tokens = ['ขณะ', 'นี้', 'เวลา', ' ', '12', ':', '00น', ' ', 'อัตรา',
                'แลกเปลี่ยน', ' ', '1', ',', '234', '.', '5', ' ', 'baht/zeny']
        rejoin_formatted_num(tokens)
        # output:
        # ['ขณะ', 'นี้', 'เวลา', ' ', '12:00น', ' ', 'อัตรา', 'แลกเปลี่ยน', ' ', '1,234.5', ' ', 'baht/zeny']

        tokens = ['IP', ' ', 'address', ' ', 'ของ', 'คุณ', 'คือ', ' ', '127', '.', '0', '.', '0', '.', '1', ' ', 'ครับ']
        rejoin_formatted_num(tokens)
        # output:
        # ['IP', ' ', 'address', ' ', 'ของ', 'คุณ', 'คือ', ' ', '127.0.0.1', ' ', 'ครับ']
    """
    original = "".join(segments)
    matching_results = _DIGITS_WITH_SEPARATOR.finditer(original)
    tokens_joined = []
    pos = 0
    segment_idx = 0

    match = next(matching_results, None)
    while segment_idx < len(segments) and match:
        is_span_beginning = pos >= match.start()
        token = segments[segment_idx]
        if is_span_beginning:
            connected_token = ""
            while pos < match.end() and segment_idx < len(segments):
                connected_token += segments[segment_idx]
                pos += len(segments[segment_idx])
                segment_idx += 1
            if connected_token != "":
                tokens_joined.append(connected_token)
            match = next(matching_results, None)
        else:
            tokens_joined.append(token)
            segment_idx += 1
            pos += len(token)
    tokens_joined += segments[segment_idx:]
    return tokens_joined
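As a quick sanity check, the expected result from the top of this report should hold once the guard is in place (a sketch, assuming the patched function is wired back into word_tokenize's postprocessing):

from pythainlp.tokenize import word_tokenize

assert word_tokenize(",1.1 ") == [",1.1", " "]  # no spurious '' token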

