Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WikiCorpus: Support random sample #308

Closed
adammenges opened this issue Mar 21, 2015 · 7 comments
Closed

WikiCorpus: Support random sample #308

adammenges opened this issue Mar 21, 2015 · 7 comments

Comments

@adammenges
Copy link

Related to #307.

random.sample(wiki, n) will get killed on a reasonably spec'd machine, regardless of the size of n, it'd be nice to get a decently sized random sample for creating models quickly.

@piskvorky
Copy link
Owner

I agree, sounds like useful functionality.

There actually exist algorithms that take provably random samples even from a stream (non-repeatable generator).

I actually thought Python's random.sample implements such an algo -- why do you think it wouldn't work? Did you try?

@vlejd
Copy link
Contributor

vlejd commented Jun 10, 2017

random.sample(population, n) is (very) silently expecting a list. It calls len(population) (not a problem) and population[i] (this is a problem) so it does not work on streams. To be honest, the implementation of random.sample is quiet bad. I will implement the algorithm that does work on streams.
There are two options.

  • Reservoir sampling: have an array of size n and for each element of stream decide, if it should be there or not.
  • Online sampling: if n is large, and we know size of the generator, we do not need to have all samples in memory, but for each element we can decide, if we want to choose it or not. This seems like a better solution.

@menshikh-iv
Copy link
Contributor

Resolved in #1408

@gojomo
Copy link
Collaborator

gojomo commented Jun 15, 2017

I see I'm late but a couple thoughts:

Finding the length can require an initial extra full-scan of the texts – allowing a knowledgeable caller to optionally pre-specify a length could skip this potentially costly operation.

Using the global random means any one sampling isn't easily reproducible from run-to-run – adding an optional random seed parameter would allow samples to be reproduced.

@vlejd
Copy link
Contributor

vlejd commented Jun 16, 2017

I believe that global random actually means better reproducibility! If you manually set the random.seed() in your script, it will set the seed for the library as well. I have tested it only on a toy example so maybe I am horribly wrong, but it should work.

Regarding the length. We should at least mention it in documentation. There are two options. Ask user, to manually set the self.length variable or to add a parameter first_x to the random sample function that would be interpreted as stream length (but not saved to self.length). Then the random sample would be only from first_x elements. However I am not sure what should happend if the first_x would be greater than the real length of stream.
I am not sure how strict gensim wants to be about this type of performance.

@gojomo
Copy link
Collaborator

gojomo commented Jun 16, 2017

If people try pre-seeding the global random to try to achieve reproducibility in this function, it will have side-effects on other consumers of random numbers, and potentially suffer interference from any global-random-consumers in other threads, or in the (interleaved) processing of each next()-offered text. Using a local random instance, explicitly seeded (if a seed provided) ensures no side-effects or interleaved interference.

I don't believe setting a length property on the object will be enough to have built-in len() use that value – that depends on the existence of a __len__() special-method. An optionally-passed in length-hint would definitely solve the performance problem, and you are right to notice that if it were less than the actual length, it'd be essentially a "from the first x" limiter, and if it were longer than the actual length, might result in fewer than n yielded results.

@vlejd
Copy link
Contributor

vlejd commented Jun 16, 2017

Ou I see. To be honest this will the first time I will use random.Radnom :) . I will add an optional paramer. If set, it will initialize it's own random stream with that seed, otherwise it will use global random.

Setting length should be enough, because len is overridden at TextCorpus to use self.length if defined or to compute it. However another optional parameter should not hurt. @menshikh-iv @piskvorky any ideas?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants