Conversation

@bact (Member) commented Oct 12, 2019

This PR proposes a newmm-safe tokenization engine: the newmm engine with an additional mechanism to avoid a possibly exponentially long wait on long text with many ambiguous breaking points. Details are below.

Fix newmm issue with long ambiguous text

The problem

newmm can take an extremely long time to tokenize long, space-less text with many ambiguous breaking points (see issue #241).

Proposed solution

  • Preprocess the input text before the actual tokenization (see the sketch after this list)
  • If the input text is longer than the limit (TEXT_LIMIT, plus the scan window on the left and right), break it into smaller chunks (text_parts), then tokenize each chunk
  • When breaking the text, the breaking point should fall at a valid word/syllable end:
    • if there is a space, use the right-most space as the breaking point
    • if there is no space, the current implementation finds candidate breaking points by feeding a smaller sample of the text in a window (from TEXT_LIMIT - TEXT_SCAN_LEFT to TEXT_LIMIT + TEXT_SCAN_RIGHT) back to newmm, then breaks after the longest token
  • This preprocessing can be enabled/disabled with the safe_mode parameter:
    • if safe_mode is True, do the preprocessing (slower)
    • if safe_mode is False (the default), skip it (may face a long wait on edge cases)
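
A minimal sketch of this preprocessing, assuming the chunk and window values described in this PR; the function and helper names here are illustrative, not the actual code:

    from typing import Callable, List

    TEXT_LIMIT = 100      # target chunk size (current value: 100)
    TEXT_SCAN_LEFT = 20   # scan window to the left of the limit
    TEXT_SCAN_RIGHT = 20  # scan window to the right of the limit

    def _find_cut(text: str, tokenize: Callable[[str], List[str]]) -> int:
        """Pick a cut position near TEXT_LIMIT that ends a valid token."""
        window_start = TEXT_LIMIT - TEXT_SCAN_LEFT
        window = text[window_start : TEXT_LIMIT + TEXT_SCAN_RIGHT]
        space = window.rfind(" ")
        if space != -1:
            # Prefer the right-most space inside the window.
            return window_start + space + 1
        # No space: tokenize only the window, then cut after its longest token.
        tokens = tokenize(window)
        longest = max(range(len(tokens)), key=lambda i: len(tokens[i]))
        return window_start + sum(len(t) for t in tokens[: longest + 1])

    def safe_segment(text: str, tokenize: Callable[[str], List[str]]) -> List[str]:
        """Break long text into chunks at valid boundaries, tokenize each chunk."""
        tokens: List[str] = []
        while len(text) > TEXT_LIMIT + TEXT_SCAN_RIGHT:
            cut = _find_cut(text, tokenize)
            tokens.extend(tokenize(text[:cut]))
            text = text[cut:]
        tokens.extend(tokenize(text))
        return tokens

The point of the chunking is that newmm never sees more than TEXT_LIMIT + TEXT_SCAN_RIGHT characters at a time, which bounds the size of the word graph and hence the wait.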

Effects of the proposed solution

  • Chunk size, window size, and the complexity of the strategy for finding a valid breaking point can all affect speed and accuracy
  • The current chunk size is 100 ± 20
    • too large a chunk size = still long text, will not solve the issue --> affects speed
    • too small a chunk size = more work to do --> affects speed, and maybe accuracy as well
    • too large a window size = more work to do --> affects speed
    • too small a window size = can break a single word into two words --> affects accuracy
  • Need to check the performance with the pythainlp.benchmarks

Alternative solution

  • The proposed preprocessing could alternatively be provided as a separate function that users call themselves when they want safe tokenization.
  • Another, possibly more elegant, alternative is to limit the number of nodes in the word graph to a certain number (directly inside _bfs_paths_graph()); a hedged sketch follows.
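
A sketch of that node cap, assuming a breadth-first path search over the word graph; the budget value and function shape are assumptions, and the actual _bfs_paths_graph() is untouched in this PR:

    from collections import deque
    from typing import Dict, Iterator, List

    MAX_GRAPH_NODES = 10_000  # assumed budget, not a value from this PR

    def bfs_paths_graph_capped(
        graph: Dict[int, List[int]], start: int, goal: int
    ) -> Iterator[List[int]]:
        """Yield start->goal paths, abandoning the search once the budget is spent."""
        expanded = 0
        queue = deque([(start, [start])])
        while queue and expanded < MAX_GRAPH_NODES:
            node, path = queue.popleft()
            expanded += 1
            for nxt in graph.get(node, []):
                if nxt == goal:
                    yield path + [nxt]
                else:
                    queue.append((nxt, path + [nxt]))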

bact added 2 commits October 12, 2019 11:36
- break text into smaller chunks
- tokenizes each chunk
@pep8speaks commented Oct 12, 2019

Hello @bact! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 356:80: E501 line too long (84 > 79 characters)

Line 114:80: E501 line too long (89 > 79 characters)

Line 192:80: E501 line too long (138 > 79 characters)
Line 193:80: E501 line too long (225 > 79 characters)
Line 194:80: E501 line too long (83 > 79 characters)
Line 195:80: E501 line too long (82 > 79 characters)

Comment last updated at 2019-11-13 19:19:08 UTC

@coveralls commented Oct 12, 2019

Coverage Status

Coverage increased (+0.2%) to 90.47% when pulling b2d72d1 on fix-newmm-longtext into fd5c44d on dev.



def segment(text: str) -> List[str]:
    if not text or not isinstance(text, str):
Contributor:

This condition appears many times in the code. I think we should consider refactoring it.

Member:

@heytitle isinstance(text, str) only ?

Contributor:

I mean this if-condition appears many times.
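
For illustration, the repeated guard could be pulled into a shared helper along these lines; the helper name is hypothetical:

    from typing import List

    def _is_valid_text(text) -> bool:
        """Shared guard for the empty-or-non-str check repeated across engines."""
        return bool(text) and isinstance(text, str)

    def segment(text: str) -> List[str]:
        if not _is_valid_text(text):
            return []
        ...  # actual tokenization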

@bact (Member, Author) commented Oct 15, 2019

Results from the code in this notebook
https://colab.research.google.com/drive/15BYn_XOouej4M3CDL6ltO1j5uAq99Whq

Length of test texts (9 texts in total)

  1. 16
  2. 99
  3. 99
  4. 309 - (natural) text from Wikipedia
  5. 405 - (natural) text from Wikipedia
  6. 408 - (artificial) complicated text crafted to replicate issue #241 ("pythainlp.word_tokenize: tokenization problem with long continuous sentences without spaces [newmm]")
  7. 576 - (artificial) complicated text crafted to replicate issue #241
  8. 1,848 - (artificial) complicated text crafted to replicate issue #241
  9. 8,531 - (natural) text from Wikipedia

From dev branch

  1. 10000 loops, best of 3: 31.9 µs per loop
  2. 1000 loops, best of 3: 236 µs per loop
  3. 1000 loops, best of 3: 265 µs per loop
  4. 1000 loops, best of 3: 777 µs per loop *
  5. 1000 loops, best of 3: 946 µs per loop *
  6. 100 loops, best of 3: 2.58 ms per loop
  7. (Could not finish)
  8. (Could not finish)
  9. 10 loops, best of 3: 26.3 ms per loop *

From fix-newmm-longtext branch

  1. 10000 loops, best of 3: 31.5 µs per loop
  2. 1000 loops, best of 3: 235 µs per loop
  3. 1000 loops, best of 3: 270 µs per loop
  4. 1000 loops, best of 3: 992 µs per loop
  5. 1000 loops, best of 3: 1.14 ms per loop
  6. 1000 loops, best of 3: 1.33 ms per loop *
  7. 10 loops, best of 3: 75.2 ms per loop *
  8. 1 loop, best of 3: 470 ms per loop *
  9. 10 loops, best of 3: 30.6 ms per loop

@wannaphong (Member) commented Oct 20, 2019

From PyThaiNLP 2.0.7 (uses marisa-trie)

  1. 10000 loops, best of 3: 26.9 µs per loop
  2. 1000 loops, best of 3: 232 µs per loop
  3. 1000 loops, best of 3: 248 µs per loop
  4. 1000 loops, best of 3: 813 µs per loop
  5. 1000 loops, best of 3: 1.04 ms per loop
  6. 100 loops, best of 3: 2.56 ms per loop
  7. (Could not finish)
  8. (Could not finish)
  9. 10 loops, best of 3: 90.2 ms per loop

Colab : Colab

@wannaphong added this to the 2.1 milestone Oct 20, 2019
@wannaphong (Member) commented Oct 20, 2019

  • check the performance with the pythainlp.benchmarks

@wannaphong (Member) commented:
Tokenisation Speed Benchmark

Notebook by @heytitle

@bact (Member, Author) commented Oct 20, 2019

Tokenisation Speed Benchmark

      method                      Char(200)      Char(500)      Char(1000)     Char(2000)
    0 newmm (dev)                 0.0007±0.0001  0.0017±0.0001  0.0029±0.0002  0.0059±0.0003
    1 newmm (fix-newmm-longtext)  0.0008±0.0001  0.0017±0.0001  0.0039±0.0009  0.0069±0.0002

@wannaphong (Member) commented Oct 20, 2019

check the performance with the pythainlp.benchmarks

dev

Colab : Google colab

============== Benchmark Result ==============
               metric      mean±std       min    max
        char_level:tp   25.49±27.33  1.000000  240.0
        char_level:tn  91.16±104.23  0.000000  921.0
        char_level:fp     1.42±3.43  0.000000   46.0
        char_level:fn     5.21±7.13  0.000000   64.0
 char_level:precision     0.93±0.18  0.021277    1.0
    char_level:recall     0.86±0.10  0.142857    1.0
        char_level:f1     0.87±0.15  0.041667    1.0
 word_level:precision     0.75±0.23  0.000000    1.0
    word_level:recall     0.67±0.23  0.000000    1.0
        word_level:f1     0.70±0.23  0.000000    1.0

fix-newmm-longtext

Colab : Google colab

============== Benchmark Result ==============
               metric      mean±std       min    max
        char_level:tp   25.73±27.76  1.000000  242.0
        char_level:tn  90.14±102.41  0.000000  915.0
        char_level:fp     2.43±4.65  0.000000   49.0
        char_level:fn     4.98±6.69  0.000000   60.0
 char_level:precision     0.91±0.18  0.021277    1.0
    char_level:recall     0.87±0.10  0.142857    1.0
        char_level:f1     0.87±0.15  0.041667    1.0
 word_level:precision     0.73±0.23  0.000000    1.0
    word_level:recall     0.67±0.23  0.000000    1.0
        word_level:f1     0.69±0.23  0.000000    1.0

Update new dict from dev branch
@bact added the "bug" label Oct 21, 2019
Update to changes from dev branch
@bact (Member, Author) commented Nov 8, 2019

Added the safe_mode parameter to enable/disable the preprocessing (usage sketch below):

  • If safe_mode is True, do the preprocessing (slower)
  • If safe_mode is False (the default), skip it (may face a long wait on edge cases)
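
A usage sketch, assuming the segment() signature this PR adds safe_mode to; the keyword position may differ in the merged code:

    from pythainlp.tokenize.newmm import segment

    long_text = "..."  # e.g. the space-less text from issue #241

    words_fast = segment(long_text)                  # default: no preprocessing
    words_safe = segment(long_text, safe_mode=True)  # chunked, bounded wait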

@wannaphong requested a review from p16i November 10, 2019 08:14
@wannaphong (Member) commented:

It looks good. If this pull request is ready, you can merge.

@p16i (Contributor) left a comment:

The code looks good to me. However, I find that it doesn't have any test for the new option.

@bact it would be good if we could have minimal tests for this new option. One possible test might be using newmm-safe to tokenize that problematic string.
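
Such a minimal test could look like this, assuming the new engine is exposed to word_tokenize under the name in this PR's title; the stand-in string below is hypothetical, not the exact text from issue #241:

    import unittest

    from pythainlp.tokenize import word_tokenize

    PROBLEM_TEXT = "นามแฝง" * 100  # long, space-less, ambiguous stand-in text

    class TestNewmmSafe(unittest.TestCase):
        def test_newmm_safe_terminates(self):
            # Must return in reasonable time and must not lose any characters.
            tokens = word_tokenize(PROBLEM_TEXT, engine="newmm-safe")
            self.assertEqual("".join(tokens), PROBLEM_TEXT)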

@bact changed the title from 'Fix newmm issue, take too long time with long text' to '"newmm-safe" option -- fix newmm issue, take too long time with some long text' Nov 12, 2019
@p16i self-requested a review November 12, 2019 20:11
@p16i (Contributor) left a comment:

LGTM.

@bact self-assigned this Nov 13, 2019
@bact merged commit 23f3856 into dev Nov 14, 2019
@bact deleted the fix-newmm-longtext branch November 14, 2019 10:11
@bact changed the title from '"newmm-safe" option -- fix newmm issue, take too long time with some long text' to '"newmm-safe" option -- fix newmm issue, take too long time for long text with lots of ambiguity breaking points' Nov 15, 2019