|
1 | 1 | # Stratified Sampling
|
2 | 2 |
|
3 |
| -<p><code>\[ |
4 |
| -\newcommand{\R}{\mathbb{R}} |
5 |
| -\newcommand{\E}{\mathbb{E}} |
6 |
| -\newcommand{\x}{\mathbf{x}} |
7 |
| -\newcommand{\y}{\mathbf{y}} |
8 |
| -\newcommand{\wv}{\mathbf{w}} |
9 |
| -\newcommand{\av}{\mathbf{\alpha}} |
10 |
| -\newcommand{\bv}{\mathbf{b}} |
11 |
| -\newcommand{\N}{\mathbb{N}} |
12 |
| -\newcommand{\id}{\mathbf{I}} |
13 |
| -\newcommand{\ind}{\mathbf{1}} |
14 |
| -\newcommand{\0}{\mathbf{0}} |
15 |
| -\newcommand{\unit}{\mathbf{e}} |
16 |
| -\newcommand{\one}{\mathbf{1}} |
17 |
| -\newcommand{\zero}{\mathbf{0}} |
18 |
| -\]</code></p> |
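| 3 | +Stratified sampling is a statistical sampling method that first divides the units of a population into subpopulations (strata) according to some characteristic, then draws a simple random sample from each stratum and combines the draws into one sample. |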
| 4 | + |
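| 5 | +Unlike the other statistics functions, which reside in `spark.mllib`, the stratified sampling methods `sampleByKey` and `sampleByKeyExact` can be performed on an `RDD` of `key-value` pairs. In stratified sampling, the `key` can be thought of as a label and the |
| 6 | +`value` as a specific attribute. For example, the `key` can be man or woman, or a document `id`, and the corresponding `value` can be a list of ages or the list of words in the document. The `sampleByKey` method flips a coin to decide whether each observation is sampled, |
| 7 | +so it needs only one pass over the data and provides an expected sample size. `sampleByKeyExact` requires significantly more resources than the per-stratum simple random sampling used by `sampleByKey`, but it provides the exact sample size with `99.99%` confidence. |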
| 8 | + |
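| 9 | +[sampleByKeyExact()](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions) lets users sample exactly `ceil(f_k * n_k)` items for every key `k`, |
| 10 | +where `f_k` is the desired fraction for key `k` and `n_k` is the number of key-value pairs with key `k`. The example below calls both methods: |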
| 11 | + |
| 12 | +```scala |
| 13 | +import org.apache.spark.SparkContext |
| 14 | +import org.apache.spark.SparkContext._ |
| 15 | +import org.apache.spark.rdd.PairRDDFunctions |
| 16 | +val sc: SparkContext = ... |
| 17 | +val data = ... // an RDD[(K, V)] of any key value pairs |
| 18 | +val fractions: Map[K, Double] = ... // specify the exact fraction desired from each key |
| 19 | +// Get an approximate sample (sampleByKey) and an exact sample (sampleByKeyExact) from each stratum |
| 20 | +val approxSample = data.sampleByKey(withReplacement = false, fractions) |
| 21 | +val exactSample = data.sampleByKeyExact(withReplacement = false, fractions) |
| 22 | +``` |
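To make the difference between the two calls concrete, here is a minimal plain-Scala sketch that runs on an in-memory `Seq` instead of an `RDD`. It is not Spark's implementation: the object `StratifiedSketch` and both of its methods are hypothetical names introduced for illustration, and treating a key that is missing from `fractions` as fraction `0.0` is an assumption of this sketch.

```scala
import scala.util.Random

object StratifiedSketch {
  // Coin-flip sampling: keep each pair independently with probability fractions(key).
  // The per-key sample size is only approximately f_k * n_k.
  def approxSampleByKey[K, V](data: Seq[(K, V)], fractions: Map[K, Double], seed: Long): Seq[(K, V)] = {
    val rng = new Random(seed)
    data.filter { case (k, _) => rng.nextDouble() < fractions.getOrElse(k, 0.0) }
  }

  // Exact sampling without replacement: shuffle each stratum and take ceil(f_k * n_k) pairs.
  // Assumption of this sketch: the whole stratum fits in memory.
  def exactSampleByKey[K, V](data: Seq[(K, V)], fractions: Map[K, Double], seed: Long): Seq[(K, V)] = {
    val rng = new Random(seed)
    data.groupBy(_._1).toSeq.flatMap { case (k, pairs) =>
      val target = math.ceil(fractions.getOrElse(k, 0.0) * pairs.size).toInt
      rng.shuffle(pairs).take(target)
    }
  }
}
```

With 25 pairs under key `"a"` and 75 under key `"b"`, and fractions `0.5` and `0.1`, `exactSampleByKey` returns 13 and 8 pairs respectively on every run, while `approxSampleByKey` returns counts that vary around 12.5 and 7.5 depending on the seed.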
| 23 | + |
| 24 | +When `withReplacement` is `true`, a `PoissonSampler` is used; when `withReplacement` is `false`, a `BernoulliSampler` is used. |
| 25 | + |
| 26 | +```scala |
| 27 | +def sampleByKey(withReplacement: Boolean, |
| 28 | +    fractions: Map[K, Double], |
| 29 | +    seed: Long = Utils.random.nextLong): RDD[(K, V)] = self.withScope { |
| 30 | +  val samplingFunc = if (withReplacement) { |
| 31 | +    StratifiedSamplingUtils.getPoissonSamplingFunction(self, fractions, false, seed) |
| 32 | +  } else { |
| 33 | +    StratifiedSamplingUtils.getBernoulliSamplingFunction(self, fractions, false, seed) |
| 34 | +  } |
| 35 | +  self.mapPartitionsWithIndex(samplingFunc, preservesPartitioning = true) |
| 36 | +} |
| 37 | +def sampleByKeyExact( |
| 38 | +    withReplacement: Boolean, |
| 39 | +    fractions: Map[K, Double], |
| 40 | +    seed: Long = Utils.random.nextLong): RDD[(K, V)] = self.withScope { |
| 41 | +  val samplingFunc = if (withReplacement) { |
| 42 | +    StratifiedSamplingUtils.getPoissonSamplingFunction(self, fractions, true, seed) |
| 43 | +  } else { |
| 44 | +    StratifiedSamplingUtils.getBernoulliSamplingFunction(self, fractions, true, seed) |
| 45 | +  } |
| 46 | +  self.mapPartitionsWithIndex(samplingFunc, preservesPartitioning = true) |
| 47 | +} |
| 48 | +``` |
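The two sampler families can be illustrated with a small plain-Scala sketch. It assumes the usual meaning of the two names: a Bernoulli sampler runs one independent keep-or-drop trial per item, so an item appears at most once, while a Poisson sampler emits each item a Poisson-distributed number of times, so repeats are possible. The object `SamplerSketch` and its methods are hypothetical and are not the code behind `StratifiedSamplingUtils`.

```scala
import scala.util.Random

object SamplerSketch {
  // Without replacement: one Bernoulli trial per item, so each item appears 0 or 1 times.
  def bernoulliSample[T](items: Seq[T], fraction: Double, rng: Random): Seq[T] =
    items.filter(_ => rng.nextDouble() < fraction)

  // Draw from Poisson(mean) with Knuth's multiplication method (adequate for small means).
  def poissonDraw(mean: Double, rng: Random): Int = {
    val limit = math.exp(-mean)
    var count = 0
    var product = rng.nextDouble()
    while (product > limit) {
      count += 1
      product *= rng.nextDouble()
    }
    count
  }

  // With replacement: each item is emitted Poisson(fraction) times, so repeats are possible.
  def poissonSample[T](items: Seq[T], fraction: Double, rng: Random): Seq[T] =
    items.flatMap(item => Seq.fill(poissonDraw(fraction, rng))(item))
}
```

In both cases the expected number of emitted copies per item equals `fraction`, which is why either sampler yields about `f_k * n_k` items per stratum when applied key by key.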