Conversation

@bact (Member) commented Oct 12, 2019

This PR proposes a newmm-safe tokenization engine: the newmm engine with an additional mechanism to avoid a possibly exponentially long wait on long text with many ambiguous breaking points. Details are below.

Fix newmm issue with long ambiguous text

The problem

newmm can take an extremely long time to tokenize long, space-less text with many ambiguous breaking points (see issue #241).

Proposed solution

  • Preprocess the input text before the actual tokenization (see the sketch after this list)
  • If the input text is longer than the limit (TEXT_LIMIT, plus the scan window on the left and right), break it into smaller chunks (text_parts), then tokenize each chunk
  • When breaking the text, the breaking point should fall at a valid word/syllable end:
    • if there is a space, use the right-most space as the breaking point
    • if there is no space, the current implementation finds candidate breaking points by feeding a smaller sample of the text in a window (from TEXT_LIMIT - TEXT_SCAN_LEFT to TEXT_LIMIT + TEXT_SCAN_RIGHT) back to newmm, then breaks after the longest token
  • This preprocessing can be enabled/disabled with the safe_mode parameter:
    • if safe_mode is True, do the preprocessing (slower)
    • if safe_mode is False (the default), skip it (may face a long wait on edge cases)
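
A minimal sketch of this preprocessing, assuming the chunk and window values described in this PR; the function and helper names here are illustrative, not the actual code:

    from typing import Callable, List

    TEXT_LIMIT = 100      # target chunk size (current value: 100)
    TEXT_SCAN_LEFT = 20   # scan window to the left of the limit
    TEXT_SCAN_RIGHT = 20  # scan window to the right of the limit

    def _find_cut(text: str, tokenize: Callable[[str], List[str]]) -> int:
        """Pick a cut position near TEXT_LIMIT that ends a valid token."""
        window_start = TEXT_LIMIT - TEXT_SCAN_LEFT
        window = text[window_start : TEXT_LIMIT + TEXT_SCAN_RIGHT]
        space = window.rfind(" ")
        if space != -1:
            # Prefer the right-most space inside the window.
            return window_start + space + 1
        # No space: tokenize only the window, then cut after its longest token.
        tokens = tokenize(window)
        longest = max(range(len(tokens)), key=lambda i: len(tokens[i]))
        return window_start + sum(len(t) for t in tokens[: longest + 1])

    def safe_segment(text: str, tokenize: Callable[[str], List[str]]) -> List[str]:
        """Break long text into chunks at valid boundaries, tokenize each chunk."""
        tokens: List[str] = []
        while len(text) > TEXT_LIMIT + TEXT_SCAN_RIGHT:
            cut = _find_cut(text, tokenize)
            tokens.extend(tokenize(text[:cut]))
            text = text[cut:]
        tokens.extend(tokenize(text))
        return tokens

The point of the chunking is that newmm never sees more than TEXT_LIMIT + TEXT_SCAN_RIGHT characters at a time, which bounds the size of the word graph and hence the wait.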

Effects of the proposed solution

  • Chunk size, window size, and the complexity of the strategy for finding a valid breaking point can all affect speed and accuracy
  • The current chunk size is 100 ± 20
    • too large a chunk size = still long text, will not solve the issue --> affects speed
    • too small a chunk size = more work to do --> affects speed, and maybe accuracy as well
    • too large a window size = more work to do --> affects speed
    • too small a window size = can break a single word into two words --> affects accuracy
  • Need to check the performance with the pythainlp.benchmarks

Alternative solution

  • The proposed preprocessing could alternatively be provided as a separate function that users call themselves when they want safe tokenization.
  • Another, possibly more elegant, alternative is to limit the number of nodes in the word graph to a certain number (directly inside _bfs_paths_graph()); a hedged sketch follows.
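
A sketch of that node cap, assuming a breadth-first path search over the word graph; the budget value and function shape are assumptions, and the actual _bfs_paths_graph() is untouched in this PR:

    from collections import deque
    from typing import Dict, Iterator, List

    MAX_GRAPH_NODES = 10_000  # assumed budget, not a value from this PR

    def bfs_paths_graph_capped(
        graph: Dict[int, List[int]], start: int, goal: int
    ) -> Iterator[List[int]]:
        """Yield start->goal paths, abandoning the search once the budget is spent."""
        expanded = 0
        queue = deque([(start, [start])])
        while queue and expanded < MAX_GRAPH_NODES:
            node, path = queue.popleft()
            expanded += 1
            for nxt in graph.get(node, []):
                if nxt == goal:
                    yield path + [nxt]
                else:
                    queue.append((nxt, path + [nxt]))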

bact added 2 commits October 12, 2019 11:36
- break text into smaller chunks
- tokenizes each chunk
@pep8speaks commented Oct 12, 2019

Hello @bact! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 356:80: E501 line too long (84 > 79 characters)

Line 114:80: E501 line too long (89 > 79 characters)

Line 192:80: E501 line too long (138 > 79 characters)
Line 193:80: E501 line too long (225 > 79 characters)
Line 194:80: E501 line too long (83 > 79 characters)
Line 195:80: E501 line too long (82 > 79 characters)

Comment last updated at 2019-11-13 19:19:08 UTC

@coveralls commented Oct 12, 2019

Coverage Status

Coverage increased (+0.2%) to 90.47% when pulling b2d72d1 on fix-newmm-longtext into fd5c44d on dev.



def segment(text: str) -> List[str]:
    if not text or not isinstance(text, str):
Contributor:

This condition appears many times in the code. I think we should consider refactoring it.

Member:

@heytitle isinstance(text, str) only ?

Contributor:

I mean this if-condition appears many times.
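
For illustration, the repeated guard could be pulled into a shared helper along these lines; the helper name is hypothetical:

    from typing import List

    def _is_valid_text(text) -> bool:
        """Shared guard for the empty-or-non-str check repeated across engines."""
        return bool(text) and isinstance(text, str)

    def segment(text: str) -> List[str]:
        if not _is_valid_text(text):
            return []
        ...  # actual tokenization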

@bact (Member, Author) commented Oct 15, 2019

Results from the code in this notebook
https://colab.research.google.com/drive/15BYn_XOouej4M3CDL6ltO1j5uAq99Whq

Length of test texts (9 texts in total)

  1. 16
  2. 99
  3. 99
  4. 309 - (natural) text from Wikipedia
  5. 405 - (natural) text from Wikipedia
  6. 408 - (artificial) complicated text crafted to replicate issue #241 ("pythainlp.word_tokenize: tokenization problem with long continuous sentences without spaces [newmm]")
  7. 576 - (artificial) complicated text crafted to replicate issue #241
  8. 1,848 - (artificial) complicated text crafted to replicate issue #241
  9. 8,531 - (natural) text from Wikipedia

From dev branch

  1. 10000 loops, best of 3: 31.9 µs per loop
  2. 1000 loops, best of 3: 236 µs per loop
  3. 1000 loops, best of 3: 265 µs per loop
  4. 1000 loops, best of 3: 777 µs per loop *
  5. 1000 loops, best of 3: 946 µs per loop *
  6. 100 loops, best of 3: 2.58 ms per loop
  7. (Could not finish)
  8. (Could not finish)
  9. 10 loops, best of 3: 26.3 ms per loop *

From fix-newmm-longtext branch

  1. 10000 loops, best of 3: 31.5 µs per loop
  2. 1000 loops, best of 3: 235 µs per loop
  3. 1000 loops, best of 3: 270 µs per loop
  4. 1000 loops, best of 3: 992 µs per loop
  5. 1000 loops, best of 3: 1.14 ms per loop
  6. 1000 loops, best of 3: 1.33 ms per loop *
  7. 10 loops, best of 3: 75.2 ms per loop *
  8. 1 loop, best of 3: 470 ms per loop *
  9. 10 loops, best of 3: 30.6 ms per loop

@wannaphong (Member) commented Oct 20, 2019

From PyThaiNLP 2.0.7 (uses marisa-trie)

  1. 10000 loops, best of 3: 26.9 µs per loop
  2. 1000 loops, best of 3: 232 µs per loop
  3. 1000 loops, best of 3: 248 µs per loop
  4. 1000 loops, best of 3: 813 µs per loop
  5. 1000 loops, best of 3: 1.04 ms per loop
  6. 100 loops, best of 3: 2.56 ms per loop
  7. (Could not finish)
  8. (Could not finish)
  9. 10 loops, best of 3: 90.2 ms per loop

Colab : Colab

@wannaphong added this to the 2.1 milestone Oct 20, 2019
@wannaphong (Member) commented Oct 20, 2019

  • check the performance with the pythainlp.benchmarks

@wannaphong (Member) commented:
Tokenisation Speed Benchmark

Notebook by @heytitle

@bact (Member, Author) commented Oct 20, 2019

Tokenisation Speed Benchmark

      method                      Char(200)      Char(500)      Char(1000)     Char(2000)
    0 newmm (dev)                 0.0007±0.0001  0.0017±0.0001  0.0029±0.0002  0.0059±0.0003
    1 newmm (fix-newmm-longtext)  0.0008±0.0001  0.0017±0.0001  0.0039±0.0009  0.0069±0.0002

@wannaphong (Member) commented Oct 20, 2019

check the performance with the pythainlp.benchmarks

dev

Colab : Google colab

============== Benchmark Result ==============
               metric      mean±std       min    max
        char_level:tp   25.49±27.33  1.000000  240.0
        char_level:tn  91.16±104.23  0.000000  921.0
        char_level:fp     1.42±3.43  0.000000   46.0
        char_level:fn     5.21±7.13  0.000000   64.0
 char_level:precision     0.93±0.18  0.021277    1.0
    char_level:recall     0.86±0.10  0.142857    1.0
        char_level:f1     0.87±0.15  0.041667    1.0
 word_level:precision     0.75±0.23  0.000000    1.0
    word_level:recall     0.67±0.23  0.000000    1.0
        word_level:f1     0.70±0.23  0.000000    1.0

fix-newmm-longtext

Colab : Google colab

============== Benchmark Result ==============
               metric      mean±std       min    max
        char_level:tp   25.73±27.76  1.000000  242.0
        char_level:tn  90.14±102.41  0.000000  915.0
        char_level:fp     2.43±4.65  0.000000   49.0
        char_level:fn     4.98±6.69  0.000000   60.0
 char_level:precision     0.91±0.18  0.021277    1.0
    char_level:recall     0.87±0.10  0.142857    1.0
        char_level:f1     0.87±0.15  0.041667    1.0
 word_level:precision     0.73±0.23  0.000000    1.0
    word_level:recall     0.67±0.23  0.000000    1.0
        word_level:f1     0.69±0.23  0.000000    1.0

Update new dict from dev branch
@bact added the "bug" label Oct 21, 2019
Update to changes from dev branch
@bact (Member, Author) commented Nov 8, 2019

Added the safe_mode parameter to enable/disable the preprocessing (usage sketch below):

  • If safe_mode is True, do the preprocessing (slower)
  • If safe_mode is False (the default), skip it (may face a long wait on edge cases)
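
A usage sketch, assuming the segment() signature this PR adds safe_mode to; the keyword position may differ in the merged code:

    from pythainlp.tokenize.newmm import segment

    long_text = "..."  # e.g. the space-less text from issue #241

    words_fast = segment(long_text)                  # default: no preprocessing
    words_safe = segment(long_text, safe_mode=True)  # chunked, bounded wait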

@wannaphong requested a review from p16i November 10, 2019 08:14
@wannaphong (Member) commented:

It looks good. If this pull request is ready, you can merge.

@p16i (Contributor) left a comment:

The code looks good to me. However, I find that it doesn't have any test for the new option.

@bact it would be good if we could have minimal tests for this new option. One possible test might be using newmm-safe to tokenize that problematic string.
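
Such a minimal test could look like this, assuming the new engine is exposed to word_tokenize under the name in this PR's title; the stand-in string below is hypothetical, not the exact text from issue #241:

    import unittest

    from pythainlp.tokenize import word_tokenize

    PROBLEM_TEXT = "นามแฝง" * 100  # long, space-less, ambiguous stand-in text

    class TestNewmmSafe(unittest.TestCase):
        def test_newmm_safe_terminates(self):
            # Must return in reasonable time and must not lose any characters.
            tokens = word_tokenize(PROBLEM_TEXT, engine="newmm-safe")
            self.assertEqual("".join(tokens), PROBLEM_TEXT)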

@bact changed the title from 'Fix newmm issue, take too long time with long text' to '"newmm-safe" option -- fix newmm issue, take too long time with some long text' Nov 12, 2019
@p16i self-requested a review November 12, 2019 20:11
@p16i (Contributor) left a comment:

LGTM.

@bact self-assigned this Nov 13, 2019
@bact merged commit 23f3856 into dev Nov 14, 2019
@bact deleted the fix-newmm-longtext branch November 14, 2019 10:11
@bact changed the title from '"newmm-safe" option -- fix newmm issue, take too long time with some long text' to '"newmm-safe" option -- fix newmm issue, take too long time for long text with lots of ambiguity breaking points' Nov 15, 2019