sampling
is available as a Maven artifact from
Clojars.
This library supports three flavors of random sampling: simple sampling, reservoir sampling, and stream sampling.
Simple sampling is the best choice if the data is small enough to comfortably keep in memory. If that's not true but the sample you want to take is small enough for memory, then reservoir sampling is a good choice. If neither the original population nor the sample can reside in memory, then take a look at stream sampling.
As we review each, feel free to follow along in the REPL:
user> (ns test
(:require (bigml.sampling [simple :as simple]
[reservoir :as reservoir]
[stream :as stream])
(bigml.sampling.test [stream :as stream-test])))
sample.simple
provides simple random sampling. With this technique
the original population is kept in memory but the resulting sample is
a lazy sequence.
By default, sampling is done without replacement. This is equivalent to a lazy Fisher-Yates shuffle.
test> (simple/sample (range 5))
(2 3 1 0 4)
Setting :replace
as true will sample with replacement. Since there
is no limit to the number of items that may be sampled with
replacement from a population, the result will be an infinite length
list. So make sure to take
however many samples you need.
test> (take 10 (simple/sample (range 5) :replace true))
(2 3 3 2 4 1 1 1 3 0)
Each call to simple/sample
will return a new sample order.
test> (simple/sample (range 5))
(0 2 3 1 4)
test> (simple/sample (range 5))
(3 1 4 2 0)
Setting the :seed
parameter allows the sample order to be recreated.
test> (simple/sample (range 5) :seed 7)
(2 1 3 4 0)
test> (simple/sample (range 5) :seed 7)
(2 1 3 4 0)
Any value that's hashable is valid as a seed:
test> (simple/sample (range 5) :seed :foo)
(1 0 3 2 4)
The underlying random number generator may also be selected with the
:generator
parameter. The two options are :lcg
(linear congruential generator)
and :twister
(Marsenne twister).
The default is :lcg
.
test> (simple/sample (range 5) :seed 7 :generator :lcg)
(2 1 3 4 0)
test> (simple/sample (range 5) :seed 7 :generator :twister)
(4 3 0 1 2)
A sample may be weighted using the :weigh
parameter. If the
parameter is supplied with a function that takes an item and produces
a non-negative weight, then the resulting sample will be weighted
accordingly.
test> (take 5 (simple/sample [:heads :tails]
:weigh {:heads 0.5 :tails 0.5}
:replace true))
(:tails :heads :heads :heads :tails)
The weights need not sum to 1.
test> (frequencies (take 100 (simple/sample [:heads :tails]
:weigh {:heads 2 :tails 1}
:replace true)))
{:heads 66, :tails 34}
sample.reservoir
provides functions for reservoir sampling. Reservoir sampling
keeps the sampled population in memory (the 'reservoir'). However,
the original population is streamed through the reservoir so it does
not need to reside in memory. This makes reservoirs useful when the
original population is too large to fit into memory or the overall
size of the population is unknown.
To create a sample reservoir, use reservoir/create
and give it the
number of samples you desire. The resulting reservoir acts as a
collection, so you can simply conj
values into the reservoir to
create a sample. For example:
test> (reduce conj (reservoir/create 3) (range 10))
(5 7 2)
Similarly, a collection can be fed into the reservoir with into
:
test> (into (reservoir/create 3) (range 10))
(7 0 8)
To see how the reservoir changes as items are added, we can use
reductions
:
test> (reductions conj (reservoir/create 3) (range 10))
(() (0) (0 1) (0 1 2) (0 3 2) (0 3 2) (5 3 2) (6 3 2) (6 3 2) (6 3 2) (6 9 2))
For convenience, reservoir/sample
accepts a collection and a
reservoir size and returns the final reservoir:
test> (reservoir/sample (range 10) 5)
(0 9 2 1 4)
Both reservoir/sample
and reservoir/create
support the :replace
,
:seed
, :generator
, and :weigh
parameters.
test> (reservoir/sample (range 10) 5 :replace true :seed 1 :weigh identity)
(7 2 4 5 1)
One caveat is that samples for reservoirs using :weigh
won't be in a
random order (with respect to item weights). So you may need to
shuffle the results if that's important for you.
Reservoirs may be merged with reservoir/merge
. The resulting sample
will be similar to a single reservoir over the entire population. For
example:
test> (reduce + (reservoir/sample (range 0 10000) 500))
2517627
test> (reduce + (reservoir/merge
(reservoir/sample (range 0 5000) 500)
(reservoir/sample (range 5000 8000) 500)
(reservoir/sample (range 8000 10000) 500)))
2527384
With reservoir/merge
, reservoirs may be built in parallel on subsets
of the population and combined afterwords, even if the subsets are of
varying size.
Lastly, there are two implementations of reservoir sampling available:
:insertion
and :efraimdis
. :efraimdis
is the default and
generally the better option. :insertion
does not support the
:weigh
parameter, however it can be faster when sampling from
small-ish populations or when sampling with replacement.
The implementation may be selected when calling either
reservoir/sample
or reservoir/create
by using the
:implementation
parameter:
test> (time (count (reservoir/sample (range 10000) 2000
:implementation :efraimdis
:replace true)))
"Elapsed time: 4197.798 msecs"
2000
test> (time (count (reservoir/sample (range 10000) 2000
:implementation :insertion
:replace true)))
"Elapsed time: 651.868 msecs"
2000
sample.stream
is useful when taking a large sample from a large
population. Neither the original population or the resulting sample
are kept in memory. There are a couple of caveats. First, unlike the
other sampling techniques, the resulting sample stream is not in
random order. It will be in the order of the original population. So
if you need a random ordering, you'll want to shuffle the sample. The
second caveat is that, unlike reservoir sampling, the size of the
population must be declared up-front.
To use stream sampling, call stream/sample
with the population, the
desired number of samples, and the size of the population. The result
is a lazy sequence of samples.
As an example, we take five samples from a population of ten values:
test> (stream/sample (range) 5 10)
(1 2 4 7 9)
As elsewhere, stream/sample
supports :replace
, :seed
, and
:generator
:
test> (stream/sample (range) 5 10 :replace true :seed 2)
(0 0 4 6 7)
If an item isn't selected as part of a sampling, it's called
out-of-bag. Setting the :out-of-bag
parameter to true will return
a sequence of the out-of-bag items instead of the sampled items. This
can be useful when paired with :seed
.
test> (stream/sample (range) 7 10 :seed 0)
(0 2 3 5 6 7 9)
test> (stream/sample (range) 7 10 :seed 0 :out-of-bag true)
(1 4 8)
It's computationally expensive to select the exact number of desired
samples when using stream/sample
with replacement. If you're okay
with the number of samples being approximately the desired number,
then you can set :rate
to true to decrease the computation cost.
When this is the case, the probability of selecting an item will be
calculated only once and then applied to each item in the population
independently. As an example:
test> (time (count (stream/sample (range 10000) 5000 10000
:replace true)))
"Elapsed time: 374.021 msecs"
5000
test> (time (count (stream/sample (range 10000) 5000 10000
:replace true :rate true)))
"Elapsed time: 33.923 msecs"
4954
:rate
is also useful if you want to sample the population at a
particular rate rather than collect a specific sample size.
To illustrate, when stream/sample
is given an infinite list of
values as the population, the default behavior is to take the
requested samples from the expected population. In this case, it
means taking exactly one sample from the first thousand values of the
population:
test> (stream/sample (range) 1 1000)
(229)
However, when :rate
is true the resulting sample is also infinite,
with each item sampled at a probability of 1/1000
:
test> (take 10 (stream/sample (range) 1 1000 :rate true))
(1149 1391 1562 3960 4359 4455 5141 5885 6310 7568 7828)
While stream sampling does not yet support sample weights, the
cond-sample
fn can be useful for fine tuned sampling.
cond-sample
accepts a collection followed by pairs of clauses and
sample definitions. A clause should be a function that accepts an
item and returns either true of false. After each clause should
follow a sample defition that describes the sampling technique to use
when the condition is true.
As an example, we'll use the well known iris dataset:
test> (first stream-test/iris-data)
[5.1 3.5 1.4 0.2 "Iris-setosa"]
There are 50 instances of each species:
test> (frequencies (map last stream-test/iris-data))
{"Iris-setosa" 50, "Iris-versicolor" 50, "Iris-virginica" 50}
Let's say we want to sample all of Iris-setosa
, half as many
Iris-versicolor
, and none of the Iris-virginica
. If you knew the
population for each class ahead of time, you could use cond-sample
like so:
test> (def new-sample
(stream/cond-sample stream-test/iris-data
#(= "Iris-setosa" (last %)) [50 50]
#(= "Iris-versicolor" (last %)) [25 50]
#(= "Iris-virginica" (last %)) [0 50]))
test> (frequencies (map last new-sample))
{"Iris-setosa" 50, "Iris-versicolor" 25}
If you did not know the class populations ahead of time, a similar
sample could be done using :rate
. Also, an item that doesn't
satisfy any condition will be left out of the final sample. So
Iris-virginica
does not need to have its own clause:
test> (def new-sample
(stream/cond-sample stream-test/iris-data
#(= "Iris-setosa" (last %)) [1 1 :rate true]
#(= "Iris-versicolor" (last %)) [1 2 :rate true]))
test> (frequencies (map last new-sample))
{"Iris-setosa" 50, "Iris-versicolor" 23}
The stream/multi-sample
fn can be used to generate multiple
samplings in one pass over the population. The fn takes the
population followed by sets of sampling parameters, one for each
desired sampling.
Each set of sample parameters should be composed of a consumer fn, the
sample size, the population size, and optionally the parameters
:replace
, :seed
, and :rate
.
multi-sample
will generate a unique sampling for every parameter
set. Whenever a value is sampled, it will be consumed by the
parameter set's consumer fn. A consumer fn should accept a single
parameter.
As an example, let's imagine we're running a retail store and want to distribute awards to the stream of customers entering the store. To do this we'll create two samplings from the customer stream: 1 out of 100 will win a gift certificate and 1 out of 500 will win a Hawaiian vacation.
test> (defn award-gift-certificate! [customer-id]
(println "Customer" customer-id "wins a gift certificate."))
test> (defn award-hawaiian-vacation! [customer-id]
(println "Customer" customer-id "wins a Hawaiian vacation."))
test> (def customer-ids (range 1000))
test> (stream/multi-sample customer-ids
[award-gift-certificate! 1 100 :rate true]
[award-hawaiian-vacation! 1 500 :rate true])
Customer 161 wins a Hawaiian vacation.
Customer 427 wins a gift certificate.
Customer 627 wins a gift certificate.
Customer 646 wins a gift certificate.
Customer 661 wins a gift certificate.
Customer 731 wins a gift certificate.
Customer 745 wins a gift certificate.
Customer 786 wins a gift certificate.
Customer 794 wins a gift certificate.
Customer 833 wins a Hawaiian vacation.
Customer 836 wins a gift certificate.
multi-reduce
is very similar to multi-sample
, except every set of
sample parameters defines a sampling along with a reduction function.
So each set of sample parameters should be composed of a reduce fn, an
initial reduce value, the sample size, the population size, and
optionally the :replace
, :seed
, and :rate
parameters.
multi-reduce
will return a seq of values, each value being the final
reduction for a sampling. A reducer fn should accept two parameters.
An example:
test> (stream/multi-reduce (range) [+ 0 20 30 :seed 3]
[+ 0 20 30 :seed 4])
(273 330)
Copyright (C) 2013-2018 BigML Inc.
Distributed under the Apache License, Version 2.0.