
bug: empty string ('') added (in some cases) when using word_tokenize with join_broken_num=True #911

@S2P2

Description

An empty string ('') is added after apply_postprocessors runs rejoin_formatted_num when join_broken_num=True (the default) in word_tokenize, for some inputs such as ',1.1 ' and '(1:5) '.

This causes an error in spaCy-PyThaiNLP (PyThaiNLP/spaCy-PyThaiNLP#4).

Expected results

from pythainlp.tokenize import word_tokenize

text = ",1.1 "

word_tokenize(text)  # join_broken_num=True is the default
# output: [',1.1', ' ']

Current results

from pythainlp.tokenize import word_tokenize

text = ",1.1 "

word_tokenize(text)  # join_broken_num=True is the default
# output: [',1.1', '', ' ']

Note that the empty string '' is added unexpectedly.

Steps to reproduce

Call word_tokenize with join_broken_num=True (the default) on text matching a pattern like ",1.1 " or "(1:5) ".
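The other reported pattern can be checked the same way (a sketch; per the report it also yields a spurious '' token):

from pythainlp.tokenize import word_tokenize

word_tokenize("(1:5) ")  # reported to contain an unexpected '' token as well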

PyThaiNLP version

5.0.2

Python version

3.10.12

Operating system and version

Google Colab


Possible solution

After investigating, I found that the problem lies in the rejoin_formatted_num function:

def rejoin_formatted_num(segments: List[str]) -> List[str]:

In the nested while loop below, there are cases where the condition pos < match.end() is false on the very first iteration of the inner loop. When that happens, the inner loop body never executes, connected_token remains an empty string, and that empty string is appended to tokens_joined (a worked trace follows the snippet).

while segment_idx < len(segments) and match:
    is_span_beginning = pos >= match.start()
    token = segments[segment_idx]
    if is_span_beginning:
        connected_token = ""
        while pos < match.end() and segment_idx < len(segments):  # <- can be false immediately
            connected_token += segments[segment_idx]
            pos += len(segments[segment_idx])
            segment_idx += 1

        tokens_joined.append(connected_token)
        match = next(matching_results, None)
    else:
        tokens_joined.append(token)
        segment_idx += 1
        pos += len(token)
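For illustration, here is a runnable sketch of how pos overshoots the match for the reported input ",1.1 ". The segment list and the separator pattern are assumptions standing in for the tokenizer's actual output and PyThaiNLP's _DIGITS_WITH_SEPARATOR:

import re

_digits_with_separator = re.compile(r"\d+(?:[,.:]\d+)+")  # assumed stand-in pattern
segments = [",1.1", " "]  # assumed tokenizer output for ",1.1 "

original = "".join(segments)  # ",1.1 "
match = next(_digits_with_separator.finditer(original))
print(match.start(), match.end())  # 1 4

# The first segment ",1.1" starts before the match (0 < 1), so the else
# branch emits it whole and pos jumps to 4. On the next iteration
# pos >= match.start() holds, but pos < match.end() (4 < 4) is already
# false, so the inner loop never runs and connected_token stays "".
pos = len(segments[0])
print(pos >= match.start(), pos < match.end())  # True False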

One possible solution is to add a guard, if connected_token != "":, before appending connected_token to tokens_joined.

The resulting full code of rejoin_formatted_num is provided below.

def rejoin_formatted_num(segments: List[str]) -> List[str]:
    """
    Rejoin well-known formatted numeric that are over-tokenized.
    The formatted numeric are numbers separated by ":", ",", or ".",
    such as time, decimal numbers, comma-added numbers, and IP addresses.

    :param List[str] segments: result from word tokenizer
    :return: a list of fixed tokens
    :rtype: List[str]

    :Example:
        tokens = ['ขณะ', 'นี้', 'เวลา', ' ', '12', ':', '00น', ' ', 'อัตรา',
                'แลกเปลี่ยน', ' ', '1', ',', '234', '.', '5', ' ', 'baht/zeny']
        rejoin_formatted_num(tokens)
        # output:
        # ['ขณะ', 'นี้', 'เวลา', ' ', '12:00น', ' ', 'อัตรา', 'แลกเปลี่ยน', ' ', '1,234.5', ' ', 'baht/zeny']

        tokens = ['IP', ' ', 'address', ' ', 'ของ', 'คุณ', 'คือ', ' ', '127', '.', '0', '.', '0', '.', '1', ' ', 'ครับ']
        rejoin_formatted_num(tokens)
        # output:
        # ['IP', ' ', 'address', ' ', 'ของ', 'คุณ', 'คือ', ' ', '127.0.0.1', ' ', 'ครับ']
    """
    original = "".join(segments)
    matching_results = _DIGITS_WITH_SEPARATOR.finditer(original)
    tokens_joined = []
    pos = 0
    segment_idx = 0

    match = next(matching_results, None)
    while segment_idx < len(segments) and match:
        is_span_beginning = pos >= match.start()
        token = segments[segment_idx]
        if is_span_beginning:
            connected_token = ""
            while pos < match.end() and segment_idx < len(segments):
                connected_token += segments[segment_idx]
                pos += len(segments[segment_idx])
                segment_idx += 1
            if connected_token != "":
                tokens_joined.append(connected_token)
            match = next(matching_results, None)
        else:
            tokens_joined.append(token)
            segment_idx += 1
            pos += len(token)
    tokens_joined += segments[segment_idx:]
    return tokens_joined
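As a quick sanity check, the expected result from the top of this report should hold once the guard is in place (a sketch, assuming the patched function is wired back into word_tokenize's postprocessing):

from pythainlp.tokenize import word_tokenize

assert word_tokenize(",1.1 ") == [",1.1", " "]  # no spurious '' token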

