
Conversation

@p16i (Contributor) commented Sep 6, 2019

Related to #268

@coveralls commented Sep 6, 2019

Coverage decreased (-0.03%) to 90.051% when pulling c6a70be on fix-tokenization-benchmark-issue into bb2fc33 on dev.

@p16i requested review from lalital and wannaphong, September 7, 2019 08:57
@bact (Member) commented Sep 7, 2019

On Windows, there is a Unicode encoding problem at line 11 of tests/test_benchmarks.py.
We may need to specify the encoding when reading the file:

with open("./tests/data/sentences.yml", "r", encoding="utf-8") as stream:
    TEST_DATA = yaml.safe_load(stream)
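
Without an explicit encoding, open() on Windows typically falls back to the locale's ANSI code page (e.g. cp874 or cp1252) rather than UTF-8, so the Thai text in sentences.yml fails to decode.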

@bact (Member) left a review:

  • remove unused imports, sort imports
  • specify encoding (utf-8) when opening sentences.yml

@bact (Member) commented Sep 8, 2019

load_dict() in attacut\utils.py (line 92) may need to specify the encoding (utf-8) when opening the file.
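
For illustration, a minimal sketch of that kind of fix (the function body below is an assumption, not attacut's actual implementation; the point is only the explicit encoding argument):

def load_dict(file_path):
    # assumed sketch: force UTF-8 so the dictionary file also loads correctly on
    # Windows, where open() otherwise uses the locale code page
    with open(file_path, "r", encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]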

The code is already updated: PyThaiNLP/attacut@c866982
Waiting for attacut to publish the updated package on PyPI.

@p16i (Contributor, Author) commented Sep 8, 2019

@bact I've updated the attacut package with the load_dict patch. At the moment, it's a pre-release version.

https://pypi.org/project/attacut/1.0.2.dev0/
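
If needed before the final release, the pre-release can be installed by pinning the version, e.g. pip install attacut==1.0.2.dev0, or with pip install --pre attacut.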

@p16i (Contributor, Author) commented Sep 8, 2019

@bact I've renamed the benchmark module to word_tokenization.
Please note that I didn't change this part: https://github.com/PyThaiNLP/pythainlp/pull/269/files#diff-4a1ef3fd7db1693d63243fd8a5ecb972R214, because it's an attribute used on the visualisation web page.

@bact (Member) commented Sep 8, 2019

Thanks!

.. autofunction:: pythainlp.benchmarks.word_tokenisation.preprocessing
.. autofunction:: pythainlp.benchmarks.word_tokenization.compute_stats
.. autofunction:: pythainlp.benchmarks.word_tokenization.benchmark
.. autofunction:: pythainlp.benchmarks.word_tokenization.preprocessing
@p16i (Contributor, Author) replied:

oh, thanks!

@wannaphong added this to the 2.1 milestone, Sep 8, 2019
@wannaphong merged commit 21871bc into dev, Sep 8, 2019
@bact deleted the fix-tokenization-benchmark-issue branch, September 21, 2019 06:20