"newmm-safe" option -- fix newmm issue, take too long time for long text with lots of ambiguity breaking points #302
Conversation
- breaks text into smaller chunks
- tokenizes each chunk
Hello @bact! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
Comment last updated at 2019-11-13 19:19:08 UTC
```python
def segment(text: str) -> List[str]:
    if not text or not isinstance(text, str):
```
This condition appears many times in the code. I think we should consider refactoring it.
@heytitle Only the `isinstance(text, str)` check?
I mean this if-condition appears many times.
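One way to factor out the repeated guard discussed above is a small decorator, sketched below. The name `empty_if_invalid` and the placeholder `segment` body are illustrative assumptions, not PyThaiNLP's actual code.

```python
# Hypothetical refactor of the repeated input guard: a decorator that
# returns [] early for empty or non-str input, so each segment()-style
# function no longer repeats the same if-condition.
from functools import wraps
from typing import Callable, List


def empty_if_invalid(func: Callable[[str], List[str]]) -> Callable[[str], List[str]]:
    """Return [] early when the input is empty or not a str."""
    @wraps(func)
    def wrapper(text: str) -> List[str]:
        if not text or not isinstance(text, str):
            return []
        return func(text)
    return wrapper


@empty_if_invalid
def segment(text: str) -> List[str]:
    # Placeholder tokenization for illustration: split on spaces.
    return text.split(" ")
```

With this, `segment("")` and `segment(None)` both return `[]` without each function body repeating the check.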
Remove tnc.word_freq
Results from the code in this notebook:

[Chart: length of test texts (9 texts in total); results from the dev branch vs the fix-newmm-longtext branch]

From PyThaiNLP 2.0.7 (uses marisa-trie). Colab: Colab
Update dict from dev branch
Update from dev branch
Checked the performance with pythainlp.benchmarks. dev branch Colab: Google Colab. fix-newmm-longtext branch Colab: Google Colab.
Update new dict from dev branch
Update to changes from dev branch
Add
It works well. If this pull request is ready, you can merge.
p16i
left a comment
The code looks good to me. However, I find that it doesn't have any test for the new option.
@bact it would be good if we could have minimal tests for this new option. One possible test might be using newmm-safe to tokenize that problematic string.
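A minimal test along the lines suggested might look like this sketch. The engine name "newmm-safe" comes from this PR; the long repetitive string is a stand-in for the reported problematic text, and the import guard is only so the sketch degrades gracefully when pythainlp is not installed.

```python
# Hedged sketch of a minimal test for the newmm-safe engine: tokenize a
# long, ambiguity-heavy string and check that it finishes and that the
# tokens reconstruct the input (newmm keeps all characters).
try:
    from pythainlp.tokenize import word_tokenize
    HAS_PYTHAINLP = True
except ImportError:
    HAS_PYTHAINLP = False

# Stand-in for the reported problematic text: long, repetitive Thai.
LONG_TEXT = "ไฟไหม้น้ำท่วม" * 50


def run_test() -> str:
    if not HAS_PYTHAINLP:
        return "skipped"
    tokens = word_tokenize(LONG_TEXT, engine="newmm-safe")
    assert isinstance(tokens, list)
    assert "".join(tokens) == LONG_TEXT
    return "ok"
```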
p16i
left a comment
LGTM.
…hainlp into fix-newmm-longtext * 'fix-newmm-longtext' of https://github.com/PyThaiNLP/pythainlp: Fixed Travis CI : Update ThaiNER
This PR proposes a newmm-safe tokenization engine, which is a newmm engine with an additional mechanism to avoid a possibly exponentially long wait for long text with a lot of ambiguity in breaking points. Details are below.

Fix newmm issue with long ambiguous text
The problem
Proposed solution
- Looks for break positions near a length limit (TEXT_LIMIT, with a scan window to the left and right), breaks the text into smaller chunks (text_parts), then tokenizes each chunk
- The scan window (from TEXT_LIMIT - TEXT_SCAN_LEFT to TEXT_LIMIT + TEXT_SCAN_RIGHT) is sent to newmm again, and the break is chosen after the longest token
- Controlled by a safe_mode parameter:
  - If safe_mode is True, do the preprocess (will be slower)
  - If safe_mode is False (default), don't do the preprocess (may face a long wait for edge cases)

Effects from the proposed solution
- Checked with pythainlp.benchmarks

Alternative solution
- Change the path search in the tokenizer itself (_bfs_paths_graph())
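As an illustration of the chunking preprocess described in the proposed solution, here is a hedged sketch: split long text at space characters found in a scan window near a length limit, so each part can be tokenized independently. The constant names mirror the PR description (TEXT_LIMIT, TEXT_SCAN_LEFT, TEXT_SCAN_RIGHT), but the values and the implementation are illustrative only, not the PR's actual code.

```python
# Illustrative chunking preprocess: break long text into smaller parts
# at a space near TEXT_LIMIT, scanning a window to the left and right.
from typing import List

TEXT_LIMIT = 120       # target chunk length (illustrative value)
TEXT_SCAN_LEFT = 20    # how far before the limit to look for a space
TEXT_SCAN_RIGHT = 20   # how far after the limit to look for a space


def split_long_text(text: str) -> List[str]:
    parts = []
    while len(text) > TEXT_LIMIT + TEXT_SCAN_RIGHT:
        lo = TEXT_LIMIT - TEXT_SCAN_LEFT
        hi = TEXT_LIMIT + TEXT_SCAN_RIGHT
        window = text[lo:hi]
        cut = window.rfind(" ")
        if cut == -1:
            # No space in the window: fall back to cutting at the limit.
            cut_pos = TEXT_LIMIT
        else:
            # Break just after the space found in the window.
            cut_pos = lo + cut + 1
        parts.append(text[:cut_pos])
        text = text[cut_pos:]
    if text:
        parts.append(text)
    return parts
```

Each part stays within TEXT_LIMIT + TEXT_SCAN_RIGHT characters, and concatenating the parts reproduces the original text, so the tokenizer can process each chunk without the exponential blow-up on one very long ambiguous string.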