WikiCorpus Tokenization issue #1534
Comments
Agreed, we need more flexibility there! Related to #1489 (same thing for Thai). Can you open a PR with a fix @roopalgarg?
Sounds good. Will do.
menshikh-iv pushed a commit that referenced this issue on Sep 18, 2017
* code to better handle tokenization. Adding the ability to:
  1. define min and max token length
  2. define min number of tokens for valid articles
  3. call a custom function to handle tokenization with the configured parameters on the class instance
  4. control if lowercase is desired
* adding another test case: a test case to check the "lower" parameter with the custom tokenizer
* cleaning up code
* clean up code for formatting
* cleaning up indentation
* missing backtick
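For orientation, here is a minimal sketch of how the options described in that commit might be used. The parameter names (token_min_len, token_max_len, article_min_tokens, lower) and the dump path are assumptions based on the commit description, so check the WikiCorpus signature in your gensim version.

```python
from gensim.corpora.wikicorpus import WikiCorpus

# Sketch only: parameter names taken from the commit description above
# (token_min_len, token_max_len, article_min_tokens, lower); verify them
# against the WikiCorpus signature in your gensim version.
wiki = WikiCorpus(
    'enwiki-latest-pages-articles.xml.bz2',  # hypothetical dump path
    token_min_len=2,        # drop tokens shorter than 2 characters
    token_max_len=15,       # drop tokens longer than 15 characters
    article_min_tokens=50,  # skip articles with fewer than 50 tokens
    lower=True,             # lowercase all tokens
)
```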
Description
Currently the wikicorpus.py file houses the WikiCorpus class and the tokenize function used while processing a Wikipedia dump.
The tokenize function in wikicorpus.py imposes certain hard-coded filters that there does not seem to be any way to control from the outside.
This tokenize function in turn calls the tokenize function from utils.py, which uses a regex to tokenize text.
That regex does not work for languages like Japanese, which need a tokenizer such as MeCab because word boundaries are not explicitly marked.
This causes issues when trying to use the text for use cases where unaltered text is needed.
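A small illustration of the problem (not taken from the original report, and the exact output may vary by gensim version): because Japanese is written without spaces, the regex in gensim.utils.tokenize finds no word boundaries and returns the sentence as a single run of characters.

```python
# -*- coding: utf-8 -*-
from gensim import utils

# Japanese has no explicit word boundaries, so the regex-based tokenizer
# cannot split this sentence into words.
text = u"私は日本語を話します"  # "I speak Japanese"
print(list(utils.tokenize(text)))
# typically prints something like [u'私は日本語を話します'] -- one "token"
```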
Steps/Code/Corpus to Reproduce
Expected Results
It should be possible to control tokenization by passing a reference to a custom function as an optional parameter to the WikiCorpus object.
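For example, a hypothetical MeCab-backed tokenizer could be plugged in along these lines. This is a sketch, not code from the report: the tokenizer_func parameter and the (text, token_min_len, token_max_len, lower) signature are assumed from the change that was eventually merged, and the MeCab usage assumes the mecab-python bindings in wakati (space-separated) output mode.

```python
import MeCab  # assumes the mecab-python bindings and a MeCab dictionary are installed

from gensim.corpora.wikicorpus import WikiCorpus

tagger = MeCab.Tagger('-Owakati')  # wakati mode: output space-separated tokens

def mecab_tokenize(content, token_min_len, token_max_len, lower):
    # Hypothetical custom tokenizer matching the signature the merged change
    # appears to expect; on Python 2, MeCab may need UTF-8 encoded bytes.
    if lower:
        content = content.lower()
    tokens = tagger.parse(content).split()
    return [t for t in tokens if token_min_len <= len(t) <= token_max_len]

wiki = WikiCorpus(
    'jawiki-latest-pages-articles.xml.bz2',  # hypothetical dump path
    tokenizer_func=mecab_tokenize,
)
```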
Actual Results
Tokenization is broken for languages like Japanese, and there is no control over the hard-coded rules.
Versions
Linux-4.10.0-32-generic-x86_64-with-Ubuntu-16.04-xenial
('Python', '2.7.12 (default, Nov 19 2016, 06:48:10) \n[GCC 5.4.0 20160609]')
('NumPy', '1.13.1')
('SciPy', '0.19.1')
('gensim', '2.2.0')
('FAST_VERSION', 1)