LdaModel trains beyond size of corpus when using an iterable #2553

Closed
@gochristoph

Description

Problem description

When streaming documents (bags of words) to LdaModel via a custom iterable, LdaModel trains beyond the size of the corpus and logs output like 19-07-05 22:53:43 PROGRESS: pass 0, at document #178000/50000, where the number to the left of the / is higher than the number to the right of it.

Steps/code/corpus to reproduce

from gensim.models import LdaModel
import logging
logging.basicConfig(format='%(asctime)s  %(message)s',
                    datefmt='%y-%m-%d %H:%M:%S', level=logging.INFO)

class TestIterable:
    def __init__(self):
        self.bag_of_words = [(0,2), (3,1), (6,1), (100,2)]
        self.cursor = 0

    def __iter__(self):
        self.cursor = 0
        logging.info('TestIterable() __iter__ was called')
        return self

    def __next__(self):
        if self.cursor < 50000:
            self.cursor += 1
            return self.bag_of_words
        else:
            logging.info('TestIterable() returned StopIteration')
            raise StopIteration


corpus = TestIterable()
# uncommenting this part will make a list out of the corpus
# corpus = [document for document in corpus]

logging.info('performing lda training')
trained_model = LdaModel(corpus, num_topics=2)

Using TestIterable() results in LdaModel training indefinitely. Converting the TestIterable() corpus to a list leads to the expected result: training completes properly.

I have not written many iterables so far, so of course the problem could lie there. But as far as I can infer from the LdaModel documentation, all that is required is an iterable, and to the best of my knowledge, corpus = TestIterable() is a proper iterable, and an iterator as well.
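
To narrow down where the problem could be, here is a minimal sketch in plain Python, without gensim, that contrasts the pattern used by TestIterable (an object that is its own iterator and resets its cursor in `__iter__`) with an iterable whose `__iter__` returns a fresh generator. The names `ResettingIterator`, `RestartableCorpus` and `count_in_chunks` are hypothetical helpers for this illustration. The chunked loop uses `itertools.islice`, which calls `iter()` on its argument; whether LdaModel consumes the corpus this way internally is an assumption, not something established in this report.

```python
import itertools

DOC = [(0, 2), (3, 1), (6, 1), (100, 2)]


class ResettingIterator:
    """Same pattern as TestIterable: __iter__ resets the cursor and returns self."""

    def __init__(self, n_docs):
        self.n_docs = n_docs
        self.cursor = 0

    def __iter__(self):
        self.cursor = 0  # side effect: every iter() call rewinds the stream
        return self

    def __next__(self):
        if self.cursor < self.n_docs:
            self.cursor += 1
            return DOC
        raise StopIteration


class RestartableCorpus:
    """Iterable only: each __iter__ call returns a new, independent generator."""

    def __init__(self, n_docs):
        self.n_docs = n_docs

    def __iter__(self):
        for _ in range(self.n_docs):
            yield DOC


def count_in_chunks(corpus, chunksize, max_chunks):
    """Read the corpus in chunks via islice; stop after max_chunks as a safety cap."""
    it = iter(corpus)
    total = 0
    for _ in range(max_chunks):
        # islice() calls iter(it); for ResettingIterator this rewinds the cursor.
        chunk = list(itertools.islice(it, chunksize))
        if not chunk:
            break
        total += len(chunk)
    return total


# A corpus of 5 documents, read 2 at a time, capped at 10 chunks.
resetting_total = count_in_chunks(ResettingIterator(5), chunksize=2, max_chunks=10)
restartable_total = count_in_chunks(RestartableCorpus(5), chunksize=2, max_chunks=10)
print(resetting_total, restartable_total)
```

In this sketch the resetting variant never reports an empty chunk and only stops at the safety cap, while the generator-based variant stops after exactly 5 documents and can still be iterated again for a second pass. If the same thing happens inside LdaModel, moving the iteration state out of the corpus object and into a generator might be a workaround.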

Thanks a lot!

Versions

Linux-3.10.0-862.14.4.el7.x86_64-x86_64-with-centos-7.5.1804-Core
Python 3.6.4 (default, Apr 10 2018, 07:54:00)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-16)]
NumPy 1.14.2
SciPy 1.0.1
gensim 3.7.3
FAST_VERSION 0

Labels: Hacktoberfest, difficulty medium, help wanted