Clean up data streaming in topic coherence #2941
Labels: bug (Issue described a bug), impact HIGH (Show-stopper for affected users), reach LOW (Affects only niche use-case users)
Problem description
The topic coherence functionality seems to require the entire input corpus to be in RAM. This is bad because:
a) Unnecessary and wasteful: The input is only accessed as
for doc in texts:
so streamed iterables would work fine.
b) Not in line with Gensim's core interfaces, which stream data and expect iterables, precisely to be memory-efficient.
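For illustration, here is a minimal sketch of the kind of streamed input meant above: a restartable iterable that reads one document per line from disk instead of holding a "list of lists" in RAM. The class name `StreamedTexts`, the one-document-per-line file layout, and the demo data are all assumptions made for this example and are not part of Gensim. Whether the topic coherence code actually accepts such an object is what this issue asks to investigate.

```python
import os
import tempfile


class StreamedTexts:
    """Hypothetical restartable iterable: yields one tokenized document per line."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Reopen the file on every pass so the corpus can be iterated repeatedly,
        # which a plain generator could not do.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                yield line.split()


# Tiny demo corpus written to a temporary file (illustrative data only).
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("human computer interface\ngraph minors survey\n")
    corpus_path = tmp.name

texts = StreamedTexts(corpus_path)

# The same access pattern as in the issue: only `for doc in texts:` is needed.
first_pass = [doc for doc in texts]
second_pass = [doc for doc in texts]

os.remove(corpus_path)
```

Only one document is in memory at a time during iteration, and the object can be looped over more than once, which matters for any code that makes multiple passes over its input.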
Steps/code/corpus to reproduce
See https://groups.google.com/g/gensim/c/tZ_qV5wsBDw for a user report.
OTOH, @gojomo reports another code path where it seems that iterables on input should work. So IDK. Investigate & clean up the docs to not require a "list of lists", at the very least. And if something doesn't support streaming of large data, it doesn't belong in Gensim.