WikiCorpus Tokenization issue #1534

Closed
roopalgarg opened this issue Aug 16, 2017 · 2 comments

Comments

@roopalgarg
Contributor

Description

Currently, wikicorpus.py houses the WikiCorpus class and the tokenize function used while processing a Wikipedia dump.

def tokenize(content):
    """
    Tokenize a piece of text from wikipedia. The input string `content` is assumed
    to be mark-up free (see `filter_wiki()`).

    Return list of tokens as utf8 bytestrings. Ignore words shorter than 2 or longer
    than 15 characters (not bytes!).
    """
    # TODO maybe ignore tokens with non-latin characters? (no chinese, arabic, russian etc.)
    return [
        utils.to_unicode(token) for token in utils.tokenize(content, lower=True, errors='ignore')
        if 2 <= len(token) <= 15 and not token.startswith('_')
    ]

The tokenize function in wikicorpus.py imposes hard-coded filters (token length, the leading-underscore check, lowercasing) that cannot be controlled from the outside.
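
These filters are hard-wired; the only way to change them from user code appears to be replacing the module-level function before building the corpus. A rough workaround sketch, assuming gensim 2.2 internals where process_article() looks up the module-level tokenize at call time (fork-based multiprocessing only, so not on Windows; the dump filename is a placeholder):

from gensim import utils
from gensim.corpora import WikiCorpus
import gensim.corpora.wikicorpus as wikicorpus

def my_tokenize(content):
    # same length/underscore filters as the original, but without forcing lowercase
    return [
        utils.to_unicode(token)
        for token in utils.tokenize(content, lower=False, errors='ignore')
        if 2 <= len(token) <= 15 and not token.startswith('_')
    ]

# monkey-patch the function that process_article() calls for each page
wikicorpus.tokenize = my_tokenize

wiki = WikiCorpus('enwiki-latest-pages-articles.xml.bz2', lemmatize=False, dictionary={})

This is a stopgap, not a supported API, which is exactly why a proper hook is needed.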

This tokenize function in turn calls utils.tokenize, which uses a regex to split the text.

PAT_ALPHABETIC = re.compile('(((?![\d])\w)+)', re.UNICODE)

def tokenize(text, lowercase=False, deacc=False, errors="strict", to_lower=False, lower=False):
    """
    Iteratively yield tokens as unicode strings, removing accent marks
    and optionally lowercasing the unicode string by assigning True
    to one of the parameters, lowercase, to_lower, or lower.

    Input text may be either unicode or utf8-encoded byte string.

    The tokens on output are maximal contiguous sequences of alphabetic
    characters (no digits!).

    >>> list(tokenize('Nic nemůže letět rychlostí vyšší, než 300 tisíc kilometrů za sekundu!', deacc = True))
    [u'Nic', u'nemuze', u'letet', u'rychlosti', u'vyssi', u'nez', u'tisic', u'kilometru', u'za', u'sekundu']

    """
    lowercase = lowercase or to_lower or lower
    text = to_unicode(text, errors=errors)
    if lowercase:
        text = text.lower()
    if deacc:
        text = deaccent(text)
    for match in PAT_ALPHABETIC.finditer(text):
        yield match.group()

This regex will not work for languages like Japanese, which need a tokenizer such as MeCab because word boundaries are not explicitly marked.
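
For illustration, a short sketch of the difference (it assumes Python 3 and the mecab-python3 binding with a default dictionary installed; none of this is part of gensim):

import re
import MeCab

PAT_ALPHABETIC = re.compile(r'(((?![\d])\w)+)', re.UNICODE)

text = 'すもももももももものうち'  # unsegmented Japanese, no spaces between words

# the regex has no notion of Japanese word boundaries, so the whole sentence
# comes out as a single 12-character "token" (which even passes the 2-15 length filter)
print([m.group() for m in PAT_ALPHABETIC.finditer(text)])
# ['すもももももももものうち']

# a morphological analyzer such as MeCab splits it into actual words
tagger = MeCab.Tagger('-Owakati')  # '-Owakati' outputs space-separated words
print(tagger.parse(text).split())
# ['すもも', 'も', 'もも', 'も', 'もも', 'の', 'うち']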

This makes the extracted text unusable for use cases where the unaltered text is needed.

Steps/Code/Corpus to Reproduce

import logging
from gensim.corpora import WikiCorpus

def get_text_from_wiki_bz(wiki_bz_path, wiki_text_path, log_status_every=10000):
    """
    convert the bz wiki dump to text where each line is one article
    :param wiki_bz_path: path to the wiki bz dump file
    :param wiki_text_path: path to the output file
    :param log_status_every: log the status after this count. set to None if no logs are required
    :return: 
    """
    wiki = WikiCorpus(wiki_bz_path, lemmatize=False, dictionary={})

    i = 0
    with open(wiki_text_path, 'w') as fp:
        for text in wiki.get_texts():
            fp.write(" ".join(text) + "\n")
            i += 1
            if log_status_every and i % log_status_every == 0:
                logging.info("Saved %i articles", i)
        logging.info("Finished: saved %i articles", i)

    return True
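
Called, for example, like this (the dump and output paths are placeholders):

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
get_text_from_wiki_bz('jawiki-latest-pages-articles.xml.bz2', 'jawiki_texts.txt')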

Expected Results

It should be possible to control tokenization by passing a reference to a custom function as an optional parameter to the WikiCorpus constructor.

Actual Results

Tokenization is broken for languages like Japanese, and there is no control over the hard-coded rules.

Versions

Linux-4.10.0-32-generic-x86_64-with-Ubuntu-16.04-xenial
('Python', '2.7.12 (default, Nov 19 2016, 06:48:10) \n[GCC 5.4.0 20160609]')
('NumPy', '1.13.1')
('SciPy', '0.19.1')
('gensim', '2.2.0')
('FAST_VERSION', 1)

@piskvorky
Owner

Agreed, we need more flexibility there! Related to #1489 (same thing for Thai).

Can you open a PR with a fix, @roopalgarg?

@roopalgarg
Contributor Author

Sounds good. Will do.

menshikh-iv pushed a commit that referenced this issue Sep 18, 2017
* code to better handle tokenization

Adds the ability to:
1. Define min and max token length
2. Define the min number of tokens for a valid article
3. Call a custom tokenization function, with the configured
parameters of the class instance
4. Control whether lowercasing is applied

* adding another test case

adding a test case to check the "lower" parameter with the custom tokenizer

* cleaning up code

* clean up code for formatting

* cleaning up indentation

* missing backtick
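
With the changes described in the commit above, the intended usage looks roughly like this (a sketch: the parameter names follow the commit description and later gensim releases, and the MeCab-based tokenizer is a hypothetical example, not part of gensim):

import MeCab
from gensim.corpora import WikiCorpus

tagger = MeCab.Tagger('-Owakati')  # word-splitting output mode

def mecab_tokenize(content, token_min_len, token_max_len, lower):
    # custom tokenizer matching the expected callback signature:
    # (text, min token length, max token length, lowercase flag) -> list of tokens
    if lower:
        content = content.lower()
    return [
        token for token in tagger.parse(content).split()
        if token_min_len <= len(token) <= token_max_len
    ]

wiki = WikiCorpus(
    'jawiki-latest-pages-articles.xml.bz2',  # placeholder path
    lemmatize=False,
    dictionary={},
    tokenizer_func=mecab_tokenize,  # custom tokenization hook
    token_min_len=1,                # Japanese words are often a single character
    token_max_len=15,
    lower=False,
)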