Skip to content

bug: sent_tokenize with engine="whitespace" should return [] not [[]] #954

@bact

Description

@bact

Description

sent_tokenize(str, engine="whitespace") returns a list of a list of string, instead of a list of string.

Expected results

sent_tokenize should return a list ([]), not a list of a list ([[]])

Current results

https://github.com/PyThaiNLP/pythainlp/actions/runs/11627805664/job/32381817268

======================================================================
FAIL: test_sent_tokenize (tests.test_tokenize.TokenizeTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/runner/work/pythainlp/pythainlp/tests/test_tokenize.py", line 222, in test_sent_tokenize
    self.assertEqual(
AssertionError: Lists differ: [['รักน้ำ', 'รักปลา', '']] != ['รักน้ำ', 'รักปลา', '']

First differing element 0:
['รักน้ำ', 'รักปลา', '']
'รักน้ำ'

Second list contains 2 additional elements.
First extra element 1:
'รักปลา'

- [['รักน้ำ', 'รักปลา', '']]
? -                        -

+ ['รักน้ำ', 'รักปลา', '']

Steps to reproduce

        self.assertEqual(
            sent_tokenize("รักน้ำ  รักปลา  ", engine="whitespace"),
            ["รักน้ำ", "รักปลา", ""],
        )

PyThaiNLP version

5.0.4

Python version

All

Operating system and version

All

More info

No response

Possible solution

No response

Files

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugbugs in the library

    Type

    No type

    Projects

    Status

    No status

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions