Clean up data streaming in topic coherence #2941

Open
piskvorky opened this issue Sep 10, 2020 · 0 comments
Labels
bug (Issue described a bug) · impact HIGH (Show-stopper for affected users) · reach LOW (Affects only niche use-case users)

piskvorky (Owner) commented Sep 10, 2020

Problem description

The topic coherence functionality seems to require the entire input corpus to be in RAM. This is bad because:

a) Unnecessary and wasteful: the input is only ever accessed as `for doc in texts:`, so streamed iterables would work fine.
b) Not in line with Gensim's core interfaces, which stream data and expect iterables, precisely to be memory-efficient.
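
To make point (a) concrete, here is a minimal sketch of the kind of streamed input meant here. It assumes a plain-text file with one whitespace-tokenized document per line; `StreamedTexts` and `count_tokens` are hypothetical names for illustration, not part of Gensim.

```python
class StreamedTexts:
    """Restartable iterable: yields one tokenized document per line of a file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-open the file on every pass, so the corpus can be iterated
        # more than once without ever holding all documents in RAM.
        with open(self.path, encoding="utf8") as handle:
            for line in handle:
                yield line.split()


def count_tokens(texts):
    """Stand-in consumer that touches its input only via `for doc in texts:`."""
    total = 0
    for doc in texts:
        total += len(doc)
    return total
```

Any consumer written like `count_tokens` gives the same result for a "list of lists" and for `StreamedTexts(path)`. Whether the coherence code paths behave this way is exactly what needs checking.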

Steps/code/corpus to reproduce

See https://groups.google.com/g/gensim/c/tZ_qV5wsBDw for a user report.

OTOH, @gojomo reports another code path where it seems that iterables on input should work. So IDK. Investigate & clean up the docs to not require a "list of lists", at the very least. And if something doesn't support streaming of large data, it doesn't belong in Gensim.
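
One possible aid for the investigation, as a rough sketch only: wrap the input in an object that supports iteration and nothing else, then see which code paths fail. `IterOnly` and `accepts_stream` are hypothetical helpers, not Gensim API; the function under test is whatever coherence entry point is being checked.

```python
class IterOnly:
    """Wraps a document source so that only iteration is possible."""

    def __init__(self, make_docs):
        # `make_docs` is a zero-argument callable returning a fresh iterator.
        self._make_docs = make_docs
        self.passes = 0

    def __iter__(self):
        # Count passes, so repeated iteration over the input is visible too.
        self.passes += 1
        return iter(self._make_docs())


def accepts_stream(func, make_docs):
    """Return (ok, passes): whether `func` ran on an iteration-only input."""
    probe = IterOnly(make_docs)
    try:
        func(probe)
    except TypeError:
        # len(), indexing or slicing on the probe raise TypeError and end up here.
        return False, probe.passes
    return True, probe.passes
```

A result of `(False, ...)` points at a code path that needs a real list. A pass count above 1 means the input must be restartable rather than a one-shot generator. The check is crude: any unrelated `TypeError` inside `func` is also reported as a failure.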

piskvorky added the bug, impact HIGH, and reach LOW labels Sep 10, 2020