Commit

add tratified sampling
ad min committed Mar 23, 2016
1 parent c8ed6b0 commit cf67248
Showing 1 changed file with 46 additions and 16 deletions.
62 changes: 46 additions & 16 deletions 基本统计/tratified-sampling.md
@@ -1,18 +1,48 @@
# Stratified sampling

<p><code>\[
\newcommand{\R}{\mathbb{R}}
\newcommand{\E}{\mathbb{E}}
\newcommand{\x}{\mathbf{x}}
\newcommand{\y}{\mathbf{y}}
\newcommand{\wv}{\mathbf{w}}
\newcommand{\av}{\mathbf{\alpha}}
\newcommand{\bv}{\mathbf{b}}
\newcommand{\N}{\mathbb{N}}
\newcommand{\id}{\mathbf{I}}
\newcommand{\ind}{\mathbf{1}}
\newcommand{\0}{\mathbf{0}}
\newcommand{\unit}{\mathbf{e}}
\newcommand{\one}{\mathbf{1}}
\newcommand{\zero}{\mathbf{0}}
\]</code></p>
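&emsp;&emsp;Stratified sampling is a statistical method that first divides the units of a population into several subpopulations (strata) according to some characteristic, then draws a simple random sample from within each stratum, and combines those draws into a single sample.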

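&emsp;&emsp;Unlike the other statistics functions in `spark.mllib`, the stratified sampling methods `sampleByKey` and `sampleByKeyExact` operate on an `RDD` of `key-value` pairs. In stratified sampling, the `key` can be thought of as a label and the `value` as a specific attribute.
For example, the `key` can be man or woman, or a document `id`, and the corresponding `value` can be a list of ages or the list of words in the document. The `sampleByKey` method flips a coin to decide whether each observation is sampled,
so it needs only one pass over the data and provides an expected sample size rather than an exact one. `sampleByKeyExact` requires significantly more resources than the per-stratum simple random sampling used by `sampleByKey`, but it provides the exact sample size with `99.99%` confidence.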
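The coin-flip behaviour described above can be sketched in plain Scala, without Spark. This is only an illustration of the idea, not Spark's implementation: the object `CoinFlipSampling` and the function `sampleByKeySketch` are hypothetical names, and the sketch assumes every key in the data has an entry in `fractions`.

```scala
import scala.util.Random

object CoinFlipSampling {
  // One pass over the data: keep each (key, value) pair independently
  // with the probability configured for its key.
  // Assumption: every key that occurs in `data` is present in `fractions`.
  def sampleByKeySketch[K, V](
      data: Seq[(K, V)],
      fractions: Map[K, Double],
      seed: Long): Seq[(K, V)] = {
    val rng = new Random(seed)
    // nextDouble() is uniform on [0, 1), so fraction 1.0 keeps everything
    // and fraction 0.0 keeps nothing.
    data.filter { case (key, _) => rng.nextDouble() < fractions(key) }
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(("male", 23), ("female", 31), ("male", 45), ("female", 27), ("male", 38))
    val fractions = Map("male" -> 0.5, "female" -> 1.0)
    // All "female" pairs are kept; the number of "male" pairs depends on the seed.
    println(sampleByKeySketch(data, fractions, seed = 42L))
  }
}
```

Because each pair is decided independently, the size of each stratum in the result is only correct in expectation, which is why an exact variant needs extra work.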

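&emsp;&emsp;[sampleByKeyExact()](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions) lets users sample exactly `⌈f_k * n_k⌉` items for each key `k`,
where `f_k` is the desired fraction for key `k` and `n_k` is the number of key-value pairs with key `k`. A usage example follows: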

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.PairRDDFunctions
val sc: SparkContext = ...
val data = ... // an RDD[(K, V)] of any key value pairs
val fractions: Map[K, Double] = ... // specify the exact fraction desired from each key
// Get an approximate sample from each stratum
val approxSample = data.sampleByKey(withReplacement = false, fractions)
// Get an exact sample from each stratum
val exactSample = data.sampleByKeyExact(withReplacement = false, fractions)
```
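To make the exact-size guarantee concrete, here is a minimal plain-Scala sketch that draws exactly `⌈f_k * n_k⌉` values per key by shuffling each stratum. It is not how Spark achieves this on a distributed `RDD`; `ExactStratifiedSketch`, `exactSize`, and `sampleExact` are hypothetical names introduced for illustration, and the sketch assumes the whole dataset fits in memory and that the size is rounded up.

```scala
import scala.util.Random

object ExactStratifiedSketch {
  // Number of items to draw from a stratum of size n at fraction f,
  // assuming the size is rounded up.
  def exactSize(n: Int, f: Double): Int = math.ceil(n * f).toInt

  // Group the pairs by key, then take a fixed-size random subset of each group
  // (sampling without replacement).
  // Assumption: every key that occurs in `data` is present in `fractions`.
  def sampleExact[K, V](
      data: Seq[(K, V)],
      fractions: Map[K, Double],
      seed: Long): Map[K, Seq[V]] = {
    val rng = new Random(seed)
    data.groupBy(_._1).map { case (key, pairs) =>
      val values = pairs.map(_._2)
      key -> rng.shuffle(values).take(exactSize(values.size, fractions(key)))
    }
  }

  def main(args: Array[String]): Unit = {
    val data = Seq((1, 'a'), (1, 'b'), (1, 'c'), (1, 'd'), (2, 'e'), (2, 'f'))
    // Key 1: ceil(4 * 0.25) = 1 item; key 2: ceil(2 * 0.5) = 1 item.
    println(sampleExact(data, Map(1 -> 0.25, 2 -> 0.5), seed = 7L))
  }
}
```

Unlike the coin-flip approach, the per-key sample sizes here do not vary with the seed; only which items are chosen does.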

When `withReplacement` is `true`, sampling uses a `PoissonSampler`; when `withReplacement` is `false`, it uses a `BernoulliSampler`.

```scala
def sampleByKey(withReplacement: Boolean,
    fractions: Map[K, Double],
    seed: Long = Utils.random.nextLong): RDD[(K, V)] = self.withScope {
  val samplingFunc = if (withReplacement) {
    StratifiedSamplingUtils.getPoissonSamplingFunction(self, fractions, false, seed)
  } else {
    StratifiedSamplingUtils.getBernoulliSamplingFunction(self, fractions, false, seed)
  }
  self.mapPartitionsWithIndex(samplingFunc, preservesPartitioning = true)
}

def sampleByKeyExact(
    withReplacement: Boolean,
    fractions: Map[K, Double],
    seed: Long = Utils.random.nextLong): RDD[(K, V)] = self.withScope {
  val samplingFunc = if (withReplacement) {
    StratifiedSamplingUtils.getPoissonSamplingFunction(self, fractions, true, seed)
  } else {
    StratifiedSamplingUtils.getBernoulliSamplingFunction(self, fractions, true, seed)
  }
  self.mapPartitionsWithIndex(samplingFunc, preservesPartitioning = true)
}
```
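The with-replacement case can be illustrated by a plain-Scala sketch of Poisson sampling, in which each pair is emitted a Poisson-distributed number of times with mean equal to its key's fraction. This is an assumption-laden illustration rather than the code inside `StratifiedSamplingUtils`: `PoissonSamplingSketch`, `poisson`, and `sampleWithReplacement` are hypothetical names, and the Poisson draw uses Knuth's simple multiplication method, which is only practical for small means.

```scala
import scala.util.Random

object PoissonSamplingSketch {
  // Draw from Poisson(lambda) with Knuth's method: multiply uniform draws
  // until the running product falls to e^(-lambda) or below.
  def poisson(lambda: Double, rng: Random): Int = {
    val limit = math.exp(-lambda)
    var count = 0
    var product = rng.nextDouble()
    while (product > limit) {
      count += 1
      product *= rng.nextDouble()
    }
    count
  }

  // Emit each pair 0, 1, 2, ... times; on average fractions(key) times.
  // Assumption: every key that occurs in `data` is present in `fractions`.
  def sampleWithReplacement[K, V](
      data: Seq[(K, V)],
      fractions: Map[K, Double],
      seed: Long): Seq[(K, V)] = {
    val rng = new Random(seed)
    data.flatMap { case pair @ (key, _) =>
      Seq.fill(poisson(fractions(key), rng))(pair)
    }
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(("x", 1), ("x", 2), ("y", 3))
    // A fraction above 1.0 is meaningful here: items may repeat in the sample.
    println(sampleWithReplacement(data, Map("x" -> 1.5, "y" -> 0.5), seed = 3L))
  }
}
```

Without replacement, the same loop would instead keep each pair at most once, with probability equal to its key's fraction, which is the Bernoulli case.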
