
Conversation

@p16i (Contributor) commented Sep 6, 2019

Related to #268

@coveralls commented Sep 6, 2019

Coverage decreased (-0.03%) to 90.051% when pulling c6a70be on fix-tokenization-benchmark-issue into bb2fc33 on dev.

@p16i requested review from lalital and wannaphong, September 7, 2019 08:57
@bact (Member) commented Sep 7, 2019

On Windows, there is a Unicode encoding problem at line 11 of tests/test_benchmarks.py.
We may need to specify the encoding when reading the file:

with open("./tests/data/sentences.yml", "r", encoding="utf-8") as stream:
    TEST_DATA = yaml.safe_load(stream)
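
Without an explicit encoding, open() on Windows typically falls back to the locale's ANSI code page (e.g. cp874 or cp1252) rather than UTF-8, so the Thai text in sentences.yml fails to decode.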

@bact (Member) left a review:

  • remove unused imports, sort imports
  • specify encoding (utf-8) when opening sentences.yml

@bact (Member) commented Sep 8, 2019

load_dict() in attacut\utils.py (line 92) may need to specify the encoding (utf-8) when opening the file.
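
For illustration, a minimal sketch of that kind of fix (the function body below is an assumption, not attacut's actual implementation; the point is only the explicit encoding argument):

def load_dict(file_path):
    # assumed sketch: force UTF-8 so the dictionary file also loads correctly on
    # Windows, where open() otherwise uses the locale code page
    with open(file_path, "r", encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]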

The code is already updated: PyThaiNLP/attacut@c866982
Waiting for attacut to publish the updated package on PyPI.

@p16i (Contributor, Author) commented Sep 8, 2019

@bact I've updated the attacut package with the load_dict patch. At the moment, it's a pre-release version.

https://pypi.org/project/attacut/1.0.2.dev0/
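
If needed before the final release, the pre-release can be installed by pinning the version, e.g. pip install attacut==1.0.2.dev0, or with pip install --pre attacut.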

@p16i (Contributor, Author) commented Sep 8, 2019

@bact I've renamed the benchmark module to word_tokenization.
Please note that I didn't change this part: https://github.com/PyThaiNLP/pythainlp/pull/269/files#diff-4a1ef3fd7db1693d63243fd8a5ecb972R214, because it's an attribute used on the visualisation web page.

@bact (Member) commented Sep 8, 2019

Thanks!

.. autofunction:: pythainlp.benchmarks.word_tokenisation.preprocessing
.. autofunction:: pythainlp.benchmarks.word_tokenization.compute_stats
.. autofunction:: pythainlp.benchmarks.word_tokenization.benchmark
.. autofunction:: pythainlp.benchmarks.word_tokenization.preprocessing
@p16i (Contributor, Author) replied:

oh, thanks!

@wannaphong added this to the 2.1 milestone, Sep 8, 2019
@wannaphong merged commit 21871bc into dev, Sep 8, 2019
@bact deleted the fix-tokenization-benchmark-issue branch, September 21, 2019 06:20