WikiCorpus Tokenization issue #1534

Closed
roopalgarg opened this issue Aug 16, 2017 · 2 comments

Comments

@roopalgarg
Contributor

Description

Currently, wikicorpus.py houses the WikiCorpus class and the tokenize function used while processing a Wikipedia dump.

def tokenize(content):
    """
    Tokenize a piece of text from wikipedia. The input string `content` is assumed
    to be mark-up free (see `filter_wiki()`).

    Return list of tokens as utf8 bytestrings. Ignore words shorter than 2 or longer
    than 15 characters (not bytes!).
    """
    # TODO maybe ignore tokens with non-latin characters? (no chinese, arabic, russian etc.)
    return [
        utils.to_unicode(token) for token in utils.tokenize(content, lower=True, errors='ignore')
        if 2 <= len(token) <= 15 and not token.startswith('_')
    ]

The tokenize function in wikicorpus.py imposes hard-coded filters (token length, the leading-underscore check, lowercasing) that cannot be controlled from the outside.
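
These filters are hard-wired; the only way to change them from user code appears to be replacing the module-level function before building the corpus. A rough workaround sketch, assuming gensim 2.2 internals where process_article() looks up the module-level tokenize at call time (fork-based multiprocessing only, so not on Windows; the dump filename is a placeholder):

from gensim import utils
from gensim.corpora import WikiCorpus
import gensim.corpora.wikicorpus as wikicorpus

def my_tokenize(content):
    # same length/underscore filters as the original, but without forcing lowercase
    return [
        utils.to_unicode(token)
        for token in utils.tokenize(content, lower=False, errors='ignore')
        if 2 <= len(token) <= 15 and not token.startswith('_')
    ]

# monkey-patch the function that process_article() calls for each page
wikicorpus.tokenize = my_tokenize

wiki = WikiCorpus('enwiki-latest-pages-articles.xml.bz2', lemmatize=False, dictionary={})

This is a stopgap, not a supported API, which is exactly why a proper hook is needed.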

This tokenize function in turn calls utils.tokenize, which uses a regex to split the text.

PAT_ALPHABETIC = re.compile('(((?![\d])\w)+)', re.UNICODE)

def tokenize(text, lowercase=False, deacc=False, errors="strict", to_lower=False, lower=False):
    """
    Iteratively yield tokens as unicode strings, removing accent marks
    and optionally lowercasing the unicode string by assigning True
    to one of the parameters, lowercase, to_lower, or lower.

    Input text may be either unicode or utf8-encoded byte string.

    The tokens on output are maximal contiguous sequences of alphabetic
    characters (no digits!).

    >>> list(tokenize('Nic nemůže letět rychlostí vyšší, než 300 tisíc kilometrů za sekundu!', deacc = True))
    [u'Nic', u'nemuze', u'letet', u'rychlosti', u'vyssi', u'nez', u'tisic', u'kilometru', u'za', u'sekundu']

    """
    lowercase = lowercase or to_lower or lower
    text = to_unicode(text, errors=errors)
    if lowercase:
        text = text.lower()
    if deacc:
        text = deaccent(text)
    for match in PAT_ALPHABETIC.finditer(text):
        yield match.group()

This regex will not work for languages like Japanese, which need a tokenizer such as MeCab because word boundaries are not explicitly marked.
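
For illustration, a short sketch of the difference (it assumes Python 3 and the mecab-python3 binding with a default dictionary installed; none of this is part of gensim):

import re
import MeCab

PAT_ALPHABETIC = re.compile(r'(((?![\d])\w)+)', re.UNICODE)

text = 'すもももももももものうち'  # unsegmented Japanese, no spaces between words

# the regex has no notion of Japanese word boundaries, so the whole sentence
# comes out as a single 12-character "token" (which even passes the 2-15 length filter)
print([m.group() for m in PAT_ALPHABETIC.finditer(text)])
# ['すもももももももものうち']

# a morphological analyzer such as MeCab splits it into actual words
tagger = MeCab.Tagger('-Owakati')  # '-Owakati' outputs space-separated words
print(tagger.parse(text).split())
# ['すもも', 'も', 'もも', 'も', 'もも', 'の', 'うち']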

This makes the extracted text unusable for use cases where the unaltered text is needed.

Steps/Code/Corpus to Reproduce

import logging
from gensim.corpora import WikiCorpus

def get_text_from_wiki_bz(wiki_bz_path, wiki_text_path, log_status_every=10000):
    """
    convert the bz wiki dump to text where each line is one article
    :param wiki_bz_path: path to the wiki bz dump file
    :param wiki_text_path: path to the output file
    :param log_status_every: log the status after this count. set to None if no logs are required
    :return: 
    """
    wiki = WikiCorpus(wiki_bz_path, lemmatize=False, dictionary={})

    i = 0
    with open(wiki_text_path, 'w') as fp:
        for text in wiki.get_texts():
            fp.write(" ".join(text) + "\n")
            i += 1
            if log_status_every and i % log_status_every == 0:
                logging.info("Saved %i articles", i)
        logging.info("Finished: saved %i articles", i)

    return True
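
Called, for example, like this (the dump and output paths are placeholders):

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
get_text_from_wiki_bz('jawiki-latest-pages-articles.xml.bz2', 'jawiki_texts.txt')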

Expected Results

It should be possible to control tokenization by passing a reference to a custom function as an optional parameter to the WikiCorpus constructor.

Actual Results

Tokenization is broken for languages like Japanese, and there is no control over the hard-coded rules.

Versions

Linux-4.10.0-32-generic-x86_64-with-Ubuntu-16.04-xenial
('Python', '2.7.12 (default, Nov 19 2016, 06:48:10) \n[GCC 5.4.0 20160609]')
('NumPy', '1.13.1')
('SciPy', '0.19.1')
('gensim', '2.2.0')
('FAST_VERSION', 1)

@piskvorky
Owner

Agreed, we need more flexibility there! Related to #1489 (same thing for Thai).

Can you open a PR with a fix, @roopalgarg?

@roopalgarg
Contributor Author

Sounds good. Will do.

menshikh-iv pushed a commit that referenced this issue Sep 18, 2017
* code to better handle tokenization

Adds the ability to:
1. Define min and max token length
2. Define the min number of tokens for a valid article
3. Call a custom tokenization function, with the configured
parameters of the class instance
4. Control whether lowercasing is applied

* adding another test case

adding a test case to check the "lower" parameter with the custom tokenizer

* cleaning up code

* clean up code for formatting

* cleaning up indentation

* missing backtick
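
With the changes described in the commit above, the intended usage looks roughly like this (a sketch: the parameter names follow the commit description and later gensim releases, and the MeCab-based tokenizer is a hypothetical example, not part of gensim):

import MeCab
from gensim.corpora import WikiCorpus

tagger = MeCab.Tagger('-Owakati')  # word-splitting output mode

def mecab_tokenize(content, token_min_len, token_max_len, lower):
    # custom tokenizer matching the expected callback signature:
    # (text, min token length, max token length, lowercase flag) -> list of tokens
    if lower:
        content = content.lower()
    return [
        token for token in tagger.parse(content).split()
        if token_min_len <= len(token) <= token_max_len
    ]

wiki = WikiCorpus(
    'jawiki-latest-pages-articles.xml.bz2',  # placeholder path
    lemmatize=False,
    dictionary={},
    tokenizer_func=mecab_tokenize,  # custom tokenization hook
    token_min_len=1,                # Japanese words are often a single character
    token_max_len=15,
    lower=False,
)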