Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added random_seed parameter to make LsiModel reproducible #3194

Merged
merged 6 commits into from
Oct 24, 2021
Merged

Added random_seed parameter to make LsiModel reproducible #3194

merged 6 commits into from
Oct 24, 2021

Conversation

parashardhapola
Copy link
Contributor

The results from multiple runs of LsiModel on same input data produces different results.
This is primarily due to the random gaussian matrix creation step in stochastic_svd function.
Hence, it is hard to setup a reproducible pipeline without resorting to changing global randomness controlling methods.

What was done in this PR :

  • A local random state (numpy.random.RandomState) was created in the stochastic_svd function. This random state can be seeded by a user provided parameter (random_seed). The gaussian matrices are drawn using this random state.
  • The random_seed parameter was exposed in calls to stochastic_svd from LsiModel.add_documents and Projection.__init__
  • 'random_seed' parameter was added to 'init' of both LsiModel and Projection classes.
  • Calls to Projection from LsiModel methods (__init__ and add_documents) now expose the random_seed parameter
  • The default value of random_seed in Projection, LsiModel and stochastic_svd was set to None. This means the current behavior is persevered by default.

Overall, the above changes allows relaying a 'random_seed' from LsiModel class to downstream calls. The changes suggested in this PR are highly localized and are not expected to have any global impact.

from gensim.test.utils import common_dictionary, common_corpus
from gensim.models import LsiModel
import numpy as np

model1 = LsiModel(common_corpus, id2word=common_dictionary)
model2 = LsiModel(common_corpus, id2word=common_dictionary)

print (np.all(model1.get_topics() - model2.get_topics() < 1e-10))
# Prints 'False' for both the current version and this PR.

With random_seed parameter introduced in this PR:

model1 = LsiModel(common_corpus, id2word=common_dictionary, random_seed=1)
model2 = LsiModel(common_corpus, id2word=common_dictionary, random_seed=1)

print (np.all(model1.get_topics() - model2.get_topics() < 1e-10))
# Prints 'True'

Copy link
Owner

@piskvorky piskvorky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good feature and nicely crafted PR.

I left some code style nitpicks in the comments. I realize some of the existing code may not follow this code style (LSI is the oldest part of Gensim) but please try to follow / update the code style where you can. Thanks!

gensim/models/lsimodel.py Outdated Show resolved Hide resolved
@piskvorky
Copy link
Owner

Looks good to me. Let's wait for @mpenkov 's blessing & merge.

@parashardhapola
Copy link
Contributor Author

@mpenkov and @piskvorky : A gentle reminder

@piskvorky piskvorky added this to the 4.1.0 milestone Aug 8, 2021
@piskvorky
Copy link
Owner

Approved by me; but @mpenkov is busy, thanks for your patience.

@@ -411,7 +425,8 @@ def __init__(

self.docs_processed = 0
self.projection = Projection(
self.num_terms, self.num_topics, power_iters=self.power_iters, extra_dims=self.extra_samples, dtype=dtype
self.num_terms, self.num_topics, power_iters=self.power_iters,
extra_dims=self.extra_samples, dtype=dtype, random_seed=self.random_seed
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
extra_dims=self.extra_samples, dtype=dtype, random_seed=self.random_seed
extra_dims=self.extra_samples, dtype=dtype, random_seed=self.random_seed,

@piskvorky piskvorky changed the title added random_seed parameter to make LsiModel reproducible Added random_seed parameter to make LsiModel reproducible Aug 12, 2021
@piskvorky
Copy link
Owner

piskvorky commented Sep 2, 2021

@mpenkov was good merge for 4.1.0, seems we have missed this PR: https://github.com/RaRe-Technologies/gensim/milestone/5

Do we merge now or anything missing?

@mpenkov
Copy link
Collaborator

mpenkov commented Sep 28, 2021

Yeah, looks like we missed this PR. I'll merge it in now. Do we want to do 4.1.3 (or 4.2.0) or can that wait till later?

@mpenkov mpenkov self-assigned this Sep 28, 2021
@piskvorky
Copy link
Owner

piskvorky commented Sep 28, 2021

I think the version decision can wait – 4.1.3devX is fine for now. Thanks!

@mpenkov mpenkov merged commit 64fcedf into piskvorky:develop Oct 24, 2021
@mpenkov
Copy link
Collaborator

mpenkov commented Oct 24, 2021

Finally merged. Thank you @parashardhapola !!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants