Description
An empty string ('') is added by apply_postprocessors via rejoin_formatted_num when join_broken_num=True (the default) in word_tokenize, for inputs such as ',1.1 ' or '(1:5) '.
This causes an error in spaCy-PyThaiNLP (PyThaiNLP/spaCy-PyThaiNLP#4).
Expected results
from pythainlp.tokenize import word_tokenize
text = ",1.1 "
word_tokenize(text)  # join_broken_num=True by default
# output: [',1.1', ' ']
Current results
from pythainlp.tokenize import word_tokenize
text = ",1.1 "
word_tokenize(text)  # join_broken_num=True by default
# output: [',1.1', '', ' ']
Note that the empty string '' is added unexpectedly.
Steps to reproduce
Call word_tokenize with join_broken_num=True (the default) on text containing a pattern like ",1.1 " or "(1:5) ".
PyThaiNLP version
5.0.2
Python version
3.10.12
Operating system and version
Google Colab
More info
No response
Possible solution
After investigating, I found that the problem lies in the rejoin_formatted_num function:
pythainlp/pythainlp/tokenize/_utils.py
Line 26 in a38fd5e
def rejoin_formatted_num(segments: List[str]) -> List[str]:
In the nested while loop below, there are cases where the condition pos < match.end() is false on the very first iteration. The loop body then never executes, so connected_token remains an empty string, which is appended to tokens_joined anyway.
while segment_idx < len(segments) and match:
    is_span_beginning = pos >= match.start()
    token = segments[segment_idx]
    if is_span_beginning:
        connected_token = ""
        while pos < match.end() and segment_idx < len(segments):  # here
            connected_token += segments[segment_idx]
            pos += len(segments[segment_idx])
            segment_idx += 1

        tokens_joined.append(connected_token)
        match = next(matching_results, None)
    else:
        tokens_joined.append(token)
        segment_idx += 1
        pos += len(token)
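The failure mode can be reproduced outside PyThaiNLP. The sketch below is a minimal, self-contained approximation: the pattern r"[,.:]\d+" is a hypothetical stand-in for the real _DIGITS_WITH_SEPARATOR (which is more elaborate), chosen because it produces two adjacent matches in ",1.1 ". Joining across the first match consumes the segment "1.1" whole, so by the time the second match is considered, pos is already past its end and the inner loop never runs:

```python
import re

# Hypothetical, simplified stand-in for pythainlp's _DIGITS_WITH_SEPARATOR.
_DIGITS_WITH_SEPARATOR = re.compile(r"[,.:]\d+")

def rejoin_buggy(segments):
    """The joining loop as currently written, without an empty-string guard."""
    original = "".join(segments)
    matching_results = _DIGITS_WITH_SEPARATOR.finditer(original)
    tokens_joined = []
    pos = 0
    segment_idx = 0
    match = next(matching_results, None)
    while segment_idx < len(segments) and match:
        is_span_beginning = pos >= match.start()
        token = segments[segment_idx]
        if is_span_beginning:
            connected_token = ""
            # If a later match was already consumed while joining an earlier
            # one, pos >= match.end() here and this loop body never runs.
            while pos < match.end() and segment_idx < len(segments):
                connected_token += segments[segment_idx]
                pos += len(segments[segment_idx])
                segment_idx += 1
            tokens_joined.append(connected_token)  # may append ""
            match = next(matching_results, None)
        else:
            tokens_joined.append(token)
            segment_idx += 1
            pos += len(token)
    tokens_joined += segments[segment_idx:]
    return tokens_joined

# For ",1.1 " tokenized as [",", "1.1", " "], the stand-in pattern matches
# ",1" (span 0-2) and ".1" (span 2-4). Joining across the first match already
# advances pos to 4, so the second match finds nothing left to join.
print(rejoin_buggy([",", "1.1", " "]))  # → [',1.1', '', ' ']
```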
One possible solution is to add a check, if connected_token != "":, before appending connected_token to tokens_joined.
The resulting full code of the rejoin_formatted_num function is provided below:
def rejoin_formatted_num(segments: List[str]) -> List[str]:
    """
    Rejoin well-known formatted numeric tokens that are over-tokenized.
    The formatted numerics are numbers separated by ":", ",", or ".",
    such as times, decimal numbers, comma-separated numbers, and IP addresses.

    :param List[str] segments: result from word tokenizer
    :return: a list of fixed tokens
    :rtype: List[str]

    :Example:
        tokens = ['ขณะ', 'นี้', 'เวลา', ' ', '12', ':', '00น', ' ', 'อัตรา',
                  'แลกเปลี่ยน', ' ', '1', ',', '234', '.', '5', ' ', 'baht/zeny']
        rejoin_formatted_num(tokens)
        # output:
        # ['ขณะ', 'นี้', 'เวลา', ' ', '12:00น', ' ', 'อัตรา', 'แลกเปลี่ยน', ' ', '1,234.5', ' ', 'baht/zeny']

        tokens = ['IP', ' ', 'address', ' ', 'ของ', 'คุณ', 'คือ', ' ', '127', '.', '0', '.', '0', '.', '1', ' ', 'ครับ']
        rejoin_formatted_num(tokens)
        # output:
        # ['IP', ' ', 'address', ' ', 'ของ', 'คุณ', 'คือ', ' ', '127.0.0.1', ' ', 'ครับ']
    """
    original = "".join(segments)
    matching_results = _DIGITS_WITH_SEPARATOR.finditer(original)
    tokens_joined = []
    pos = 0
    segment_idx = 0
    match = next(matching_results, None)
    while segment_idx < len(segments) and match:
        is_span_beginning = pos >= match.start()
        token = segments[segment_idx]
        if is_span_beginning:
            connected_token = ""
            while pos < match.end() and segment_idx < len(segments):
                connected_token += segments[segment_idx]
                pos += len(segments[segment_idx])
                segment_idx += 1

            if connected_token != "":
                tokens_joined.append(connected_token)
            match = next(matching_results, None)
        else:
            tokens_joined.append(token)
            segment_idx += 1
            pos += len(token)
    tokens_joined += segments[segment_idx:]
    return tokens_joined
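The guard can be sanity-checked in isolation. The snippet below is a self-contained copy of the patched loop using the same hypothetical stand-in pattern (r"[,.:]\d+" instead of the real _DIGITS_WITH_SEPARATOR) and hand-built segments, so it runs without PyThaiNLP:

```python
import re
from typing import List

# Hypothetical stand-in for pythainlp's _DIGITS_WITH_SEPARATOR pattern.
_DIGITS_WITH_SEPARATOR = re.compile(r"[,.:]\d+")

def rejoin_formatted_num(segments: List[str]) -> List[str]:
    """Patched joining loop: only append connected_token when it is non-empty."""
    original = "".join(segments)
    matching_results = _DIGITS_WITH_SEPARATOR.finditer(original)
    tokens_joined = []
    pos = 0
    segment_idx = 0
    match = next(matching_results, None)
    while segment_idx < len(segments) and match:
        is_span_beginning = pos >= match.start()
        token = segments[segment_idx]
        if is_span_beginning:
            connected_token = ""
            while pos < match.end() and segment_idx < len(segments):
                connected_token += segments[segment_idx]
                pos += len(segments[segment_idx])
                segment_idx += 1
            if connected_token != "":  # the proposed guard
                tokens_joined.append(connected_token)
            match = next(matching_results, None)
        else:
            tokens_joined.append(token)
            segment_idx += 1
            pos += len(token)
    tokens_joined += segments[segment_idx:]
    return tokens_joined

# The scenario from the bug report no longer yields an empty string.
print(rejoin_formatted_num([",", "1.1", " "]))  # → [',1.1', ' ']
```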
Files
No response