Commit cf67248

author
ad min
committed
add tratified sampling
1 parent c8ed6b0 commit cf67248

1 file changed: +46 -16 lines changed

基本统计/tratified-sampling.md

@@ -1,18 +1,48 @@
 # Stratified Sampling

-<p><code>\[
-\newcommand{\R}{\mathbb{R}}
-\newcommand{\E}{\mathbb{E}}
-\newcommand{\x}{\mathbf{x}}
-\newcommand{\y}{\mathbf{y}}
-\newcommand{\wv}{\mathbf{w}}
-\newcommand{\av}{\mathbf{\alpha}}
-\newcommand{\bv}{\mathbf{b}}
-\newcommand{\N}{\mathbb{N}}
-\newcommand{\id}{\mathbf{I}}
-\newcommand{\ind}{\mathbf{1}}
-\newcommand{\0}{\mathbf{0}}
-\newcommand{\unit}{\mathbf{e}}
-\newcommand{\one}{\mathbf{1}}
-\newcommand{\zero}{\mathbf{0}}
-\]</code></p>
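+&emsp;&emsp;Stratified sampling is a statistical method that first divides the units of a population into several subpopulations (strata) according to some characteristic, then draws a simple random sample within each stratum, and combines these draws into a single sample.
+
+&emsp;&emsp;Unlike the other statistics functions in `spark.mllib`, the stratified sampling methods `sampleByKey` and `sampleByKeyExact` operate on `RDD`s of `key-value` pairs. In stratified sampling, the `key` can be thought of as a label and
+the `value` as a specific attribute. For example, the `key` can be man or woman, or a document `id`, and the corresponding `value` can be a list of ages or the list of words in the document. `sampleByKey` flips a coin to decide whether to sample each observation,
+so it needs only one pass over the data and yields a sample whose size is correct only in expectation. `sampleByKeyExact` needs significantly more resources than the per-stratum simple random sampling used by `sampleByKey`, but it delivers the exact sample size with `99.99%` confidence.
+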
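The coin-flip behaviour described above can be illustrated without a cluster. The following is a minimal plain-Scala sketch, not Spark's implementation: `sampleByKeyLocal` is a hypothetical helper that works on an in-memory `Seq`, and treating a key missing from `fractions` as fraction `0.0` is an assumption made here for simplicity.

```scala
import scala.util.Random

object CoinFlipSampleByKey {
  // Hypothetical helper (not part of Spark): keeps each (key, value) pair
  // independently with probability fractions(key).
  // Assumption: a key missing from `fractions` is never sampled.
  def sampleByKeyLocal[K, V](
      data: Seq[(K, V)],
      fractions: Map[K, Double],
      seed: Long): Seq[(K, V)] = {
    val rng = new Random(seed)
    // One pass over the data: one uniform draw ("coin flip") per record.
    data.filter { case (key, _) => rng.nextDouble() < fractions.getOrElse(key, 0.0) }
  }

  def main(args: Array[String]): Unit = {
    // 500 records per key.
    val data = Seq.tabulate(1000)(i => (if (i % 2 == 0) "man" else "woman", i))
    val fractions = Map("man" -> 0.1, "woman" -> 0.5)
    val sample = sampleByKeyLocal(data, fractions, seed = 42L)
    // The per-key sizes are only close to 0.1 * 500 and 0.5 * 500 in expectation.
    sample.groupBy(_._1).foreach { case (k, vs) => println(s"$k -> ${vs.size}") }
  }
}
```

Because every record is decided independently, the per-key sample sizes vary from seed to seed; only their expected values are fixed by `fractions`.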
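+&emsp;&emsp;[sampleByKeyExact()](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions) allows users to sample exactly `f_k * n_k` items for each key `k` (rounded up to a whole number),
+where `f_k` is the desired fraction for key `k` and `n_k` is the number of key-value pairs with key `k`. Below is a usage example:
+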
+```scala
+import org.apache.spark.SparkContext
+import org.apache.spark.SparkContext._
+import org.apache.spark.rdd.PairRDDFunctions
+val sc: SparkContext = ...
+val data = ... // an RDD[(K, V)] of any key-value pairs
+val fractions: Map[K, Double] = ... // specify the fraction desired from each key
+// Get an approximate sample and an exact sample from each stratum
+val approxSample = data.sampleByKey(withReplacement = false, fractions)
+val exactSample = data.sampleByKeyExact(withReplacement = false, fractions)
+```
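To make the `f_k * n_k` guarantee concrete, here is a small plain-Scala sketch of exact per-stratum sampling without replacement. It is not how Spark implements `sampleByKeyExact`; `sampleByKeyExactLocal` is a hypothetical helper that simply shuffles each stratum in memory and keeps the first `ceil(f_k * n_k)` records, and it assumes a key missing from `fractions` has fraction `0.0`.

```scala
import scala.util.Random

object ExactStratifiedSample {
  // Hypothetical helper (not Spark's algorithm): shuffles each stratum and
  // keeps exactly ceil(f_k * n_k) of its records, without replacement.
  // Assumption: a key missing from `fractions` contributes no records.
  def sampleByKeyExactLocal[K, V](
      data: Seq[(K, V)],
      fractions: Map[K, Double],
      seed: Long): Seq[(K, V)] = {
    val rng = new Random(seed)
    data.groupBy(_._1).toSeq.flatMap { case (key, stratum) =>
      // n_k is stratum.size, f_k is the requested fraction for this key.
      val target = math.ceil(fractions.getOrElse(key, 0.0) * stratum.size).toInt
      rng.shuffle(stratum).take(target)
    }
  }

  def main(args: Array[String]): Unit = {
    // 500 records per key.
    val data = Seq.tabulate(1000)(i => (if (i % 2 == 0) "man" else "woman", i))
    val sample = sampleByKeyExactLocal(data, Map("man" -> 0.25, "woman" -> 0.5), seed = 42L)
    val sizes = sample.groupBy(_._1).map { case (k, vs) => k -> vs.size }
    println(sizes("man"))   // 125 = ceil(0.25 * 500)
    println(sizes("woman")) // 250 = ceil(0.5 * 500)
  }
}
```

Unlike the coin-flip approach, the per-key sizes here do not depend on the seed; only the choice of which records are kept does.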
+
+When `withReplacement` is `true`, a `PoissonSampler` is used; when `withReplacement` is `false`, a `BernoulliSampler` is used.
+
+```scala
+def sampleByKey(withReplacement: Boolean,
+    fractions: Map[K, Double],
+    seed: Long = Utils.random.nextLong): RDD[(K, V)] = self.withScope {
+  val samplingFunc = if (withReplacement) {
+    StratifiedSamplingUtils.getPoissonSamplingFunction(self, fractions, false, seed)
+  } else {
+    StratifiedSamplingUtils.getBernoulliSamplingFunction(self, fractions, false, seed)
+  }
+  self.mapPartitionsWithIndex(samplingFunc, preservesPartitioning = true)
+}
+def sampleByKeyExact(
+    withReplacement: Boolean,
+    fractions: Map[K, Double],
+    seed: Long = Utils.random.nextLong): RDD[(K, V)] = self.withScope {
+  val samplingFunc = if (withReplacement) {
+    StratifiedSamplingUtils.getPoissonSamplingFunction(self, fractions, true, seed)
+  } else {
+    StratifiedSamplingUtils.getBernoulliSamplingFunction(self, fractions, true, seed)
+  }
+  self.mapPartitionsWithIndex(samplingFunc, preservesPartitioning = true)
+}
+```
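The difference between the two sampler families can be sketched in plain Scala. This is an illustration only and does not reproduce `StratifiedSamplingUtils`: the helper names below are hypothetical, and the Poisson draw uses Knuth's multiplication method, which is an assumption chosen for brevity and is only practical for small means.

```scala
import scala.util.Random

object ReplacementSamplers {
  // Knuth's method for drawing from a Poisson(mean) distribution.
  def poisson(mean: Double, rng: Random): Int = {
    val limit = math.exp(-mean)
    var count = 0
    var product = rng.nextDouble()
    while (product > limit) {
      count += 1
      product *= rng.nextDouble()
    }
    count
  }

  // Without replacement (Bernoulli): each record appears 0 or 1 times.
  def bernoulliSample[T](items: Seq[T], fraction: Double, rng: Random): Seq[T] =
    items.filter(_ => rng.nextDouble() < fraction)

  // With replacement (Poisson): each record appears Poisson(fraction) times,
  // so it may show up more than once and `fraction` may exceed 1.0.
  def poissonSample[T](items: Seq[T], fraction: Double, rng: Random): Seq[T] =
    items.flatMap(item => Seq.fill(poisson(fraction, rng))(item))

  def main(args: Array[String]): Unit = {
    val items = (1 to 1000).toSeq
    val rng = new Random(42L)
    println(bernoulliSample(items, 0.5, rng).size) // roughly 500, no duplicates
    println(poissonSample(items, 2.0, rng).size)   // roughly 2000, with duplicates
  }
}
```

A Bernoulli draw can never return more records than the input holds, which is why sampling with replacement needs a count-valued distribution such as Poisson.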
