WikiCorpus: Support random sample #308
Comments
I agree, sounds like useful functionality. There actually exist algorithms that take provably random samples even from a stream (non-repeatable generator). I actually thought Python's |
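For reference, one well-known algorithm that samples uniformly from a stream of unknown length is reservoir sampling (Algorithm R). Below is a minimal sketch, not anything taken from gensim: `reservoir_sample` and its parameters are hypothetical names chosen for illustration.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], n: int, seed: Optional[int] = None) -> List[T]:
    """Return up to n items drawn uniformly at random from a one-pass stream.

    Hypothetical helper for illustration only (Algorithm R).
    """
    rng = random.Random(seed)  # private generator, global random is untouched
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < n:
            # Fill the reservoir with the first n items.
            reservoir.append(item)
        else:
            # Keep the new item with probability n / (i + 1),
            # replacing a uniformly chosen slot.
            j = rng.randint(0, i)  # inclusive on both ends
            if j < n:
                reservoir[j] = item
    return reservoir


# Works on a non-repeatable generator; only n items are held in memory.
sample = reservoir_sample((x * x for x in range(100_000)), n=10, seed=0)
```

This needs a single pass and no prior knowledge of the length, but it returns the sample only after the whole stream has been read, and the items do not come back in stream order.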
Resolved in #1408 |
I see I'm late but a couple thoughts: Finding the length can require an initial extra full-scan of the texts – allowing a knowledgeable caller to optionally pre-specify a length could skip this potentially costly operation. Using the global |
I believe that the global random actually means better reproducibility! If you manually call random.seed() in your script, it will set the seed for the library as well. I have tested it only on a toy example, so maybe I am horribly wrong, but it should work. Regarding the length: we should at least mention it in the documentation. There are two options: ask the user to manually set the self.length variable, or add a parameter first_x to the random sample function that would be interpreted as the stream length (but not saved to self.length). The random sample would then be drawn only from the first first_x elements. However, I am not sure what should happen if first_x is greater than the real length of the stream. |
If people try pre-seeding the global random to try to achieve reproducibility in this function, it will have side-effects on other consumers of random numbers, and potentially suffer interference from any global-random-consumers in other threads, or in the (interleaved) processing of each I don't believe setting a |
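To illustrate the interference point, here is a small sketch comparing draws from the global generator with draws from a dedicated `random.Random` instance. The function names `draw_global` and `draw_private` are hypothetical, and the extra `random.random()` call stands in for any other consumer of global random numbers.

```python
import random
from typing import List


def draw_global(seed: int, interfere: bool) -> List[int]:
    """Draw five numbers from the global generator after seeding it."""
    random.seed(seed)  # side effect: reseeds random numbers for everyone
    if interfere:
        random.random()  # some other consumer advances the shared state
    return [random.randint(0, 99) for _ in range(5)]


def draw_private(seed: int, interfere: bool) -> List[int]:
    """Draw five numbers from a dedicated generator with its own state."""
    rng = random.Random(seed)
    if interfere:
        random.random()  # touches only the global state, not rng
    return [rng.randint(0, 99) for _ in range(5)]


# The private generator gives the same draws with or without interference.
private_is_stable = draw_private(7, False) == draw_private(7, True)
# The global generator generally does not, because its state was advanced.
global_is_stable = draw_global(7, False) == draw_global(7, True)
```

The dedicated instance also leaves the caller's global seed untouched, which matters when the sampling code lives inside a library.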
Oh, I see. To be honest, this will be the first time I use random.Random :). I will add an optional parameter. If set, it will initialize its own random stream with that seed; otherwise it will use the global random. Setting length should be enough, because __len__ is overridden in TextCorpus to use self.length if defined, or to compute it otherwise. However, another optional parameter should not hurt. @menshikh-iv @piskvorky any ideas? |
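A rough sketch of what such a function could look like, combining the optional seed with a caller-supplied length. This is an assumption-laden illustration, not the code that was merged: `sample_stream` and its parameters are hypothetical names, and it uses selection sampling (Knuth's Algorithm S), which needs the length up front but yields items lazily and in stream order. Raising an error when `length` exceeds the real stream size is just one possible answer to the question raised earlier.

```python
import random
from typing import Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def sample_stream(
    stream: Iterable[T], n: int, length: int, seed: Optional[int] = None
) -> Iterator[T]:
    """Yield n items chosen uniformly from the first `length` items of stream.

    Hypothetical sketch: selection sampling with a known length.
    """
    if not 0 <= n <= length:
        raise ValueError("n must be between 0 and length")
    # Own generator when a seed is given, otherwise the global random module.
    rng = random if seed is None else random.Random(seed)
    to_pick = n
    for i, item in enumerate(stream):
        if i == length or to_pick == 0:
            break
        remaining = length - i
        # Pick this item with probability to_pick / remaining.
        if rng.randint(1, remaining) <= to_pick:
            to_pick -= 1
            yield item
    if to_pick != 0:
        # The stream ended before `length` items were seen.
        raise ValueError("length is larger than the actual number of items")


# Sample 5 items from the first 1000 of a generator, reproducibly.
picked = list(sample_stream((x for x in range(10_000)), n=5, length=1000, seed=42))
```

Passing `length` explicitly avoids the extra full scan needed to compute it, at the cost of trusting the caller to supply a value no larger than the real stream.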
Related to #307.
random.sample(wiki, n) will get killed on a reasonably spec'd machine, regardless of the size of n. It'd be nice to get a decently sized random sample for creating models quickly.