LdaModel trains beyond size of corpus when using an iterable #2553
Description
Problem description
When streaming documents/bags of words to LdaModel via a custom iterable, LdaModel will train beyond the size of the corpus, with output like

19-07-05 22:53:43 PROGRESS: pass 0, at document #178000/50000

where the number to the left of the / is higher than the number to the right of it.
Steps/code/corpus to reproduce
from gensim.models import LdaModel
import logging

logging.basicConfig(format='%(asctime)s %(message)s',
                    datefmt='%y-%m-%d %H:%M:%S', level=logging.INFO)

class TestIterable:
    """Yields the same bag-of-words 50,000 times per iteration."""

    def __init__(self):
        self.bag_of_words = [(0, 2), (3, 1), (6, 1), (100, 2)]
        self.cursor = 0

    def __iter__(self):
        self.cursor = 0
        logging.info('TestIterable() __iter__ was called')
        return self

    def __next__(self):
        if self.cursor < 50000:
            self.cursor += 1
            return self.bag_of_words
        else:
            logging.info('TestIterable() raised StopIteration')
            raise StopIteration

corpus = TestIterable()
# uncommenting this line will turn the corpus into a plain list
# corpus = [document for document in corpus]

logging.info('performing lda training')
trained_model = LdaModel(corpus, num_topics=2)
Using the TestIterable() results in LdaModel training indefinitely. Converting the TestIterable() corpus to a list leads to the expected result of a proper training run.
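As a sanity check, the iterable behaves as expected outside of gensim: iterating it by hand stops after exactly 50,000 documents, and it restarts cleanly because __iter__ resets the cursor. A minimal check (my own addition, not part of the failing run):

corpus = TestIterable()
# a full pass yields exactly 50,000 documents, then StopIteration
assert sum(1 for _ in corpus) == 50000
# a second pass restarts from zero, since __iter__ resets self.cursor
assert sum(1 for _ in corpus) == 50000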
I have not written many iterables so far, so of course there could be a problem on my end. But as far as I can infer from the LdaModel documentation, all that is required is an iterable -- and to the best of my knowledge, corpus = TestIterable() is a proper iterable, and an iterator as well.
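One difference that might matter: TestIterable is its own iterator (__iter__ returns self), so any two iterations share the same cursor. A corpus whose __iter__ hands out a fresh generator per call avoids that sharing; a minimal sketch of that variant (my guess at what LdaModel may expect, not something the docs state):

class RestartableCorpus:
    """Iterable but not an iterator: each iter() call gets a fresh generator."""

    def __init__(self, n_docs=50000):
        self.bag_of_words = [(0, 2), (3, 1), (6, 1), (100, 2)]
        self.n_docs = n_docs

    def __iter__(self):
        # a new independent generator per call; overlapping iterations
        # cannot reset or advance each other's position
        return (self.bag_of_words for _ in range(self.n_docs))

With this variant, if LdaModel calls iter() on the corpus again mid-pass (e.g. for a second pass, or to count documents), the first iteration is unaffected, whereas with TestIterable the shared self.cursor would be reset.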
Thanks a lot!
Versions
Linux-3.10.0-862.14.4.el7.x86_64-x86_64-with-centos-7.5.1804-Core
Python 3.6.4 (default, Apr 10 2018, 07:54:00)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-16)]
NumPy 1.14.2
SciPy 1.0.1
gensim 3.7.3
FAST_VERSION 0