
Commit aef6d07: add doc for random data generation

1 parent b99d94b commit aef6d07

File tree

2 files changed: +74 −2 lines changed

docs/mllib-guide.md

Lines changed: 1 addition & 1 deletion

@@ -9,7 +9,7 @@ filtering, dimensionality reduction, as well as underlying optimization primitives
  * [Data types](mllib-basics.html)
  * [Basic statistics](mllib-stats.html)
-   * data generators
+   * random data generation
    * stratified sampling
    * summary statistics
    * hypothesis testing

docs/mllib-stats.md

Lines changed: 73 additions & 1 deletion

@@ -25,7 +25,79 @@ displayTitle: <a href="mllib-guide.html">MLlib</a> - Statistics Functionality
\newcommand{\zero}{\mathbf{0}}
\]`

- ## Data Generators
+ ## Random data generation

Random data generation is useful for randomized algorithms, prototyping, and performance testing.
MLlib supports generating random RDDs with i.i.d. values drawn from a given distribution:
uniform, standard normal, or Poisson.

<div class="codetabs">
<div data-lang="scala" markdown="1">
[`RandomRDDs`](api/scala/index.html#org.apache.spark.mllib.random.RandomRDDs) provides factory
methods to generate random double RDDs or vector RDDs.
The following example generates a random double RDD whose values follow the standard normal
distribution `N(0, 1)`, and then maps it to `N(1, 4)`.

{% highlight scala %}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.random.RandomRDDs._

val sc: SparkContext = ...

// Generate a random double RDD that contains 1 million i.i.d. values drawn from the
// standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.
val u = normalRDD(sc, 1000000L, 10)
// Apply a transform to get a random double RDD following `N(1, 4)`.
val v = u.map(x => 1.0 + 2.0 * x)
{% endhighlight %}
</div>

<div data-lang="java" markdown="1">
[`RandomRDDs`](api/java/index.html#org.apache.spark.mllib.random.RandomRDDs) provides factory
methods to generate random double RDDs or vector RDDs.
The following example generates a random double RDD whose values follow the standard normal
distribution `N(0, 1)`, and then maps it to `N(1, 4)`.

{% highlight java %}
import org.apache.spark.api.java.JavaDoubleRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import static org.apache.spark.mllib.random.RandomRDDs.*;

JavaSparkContext jsc = ...

// Generate a random double RDD that contains 1 million i.i.d. values drawn from the
// standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.
JavaDoubleRDD u = normalJavaRDD(jsc, 1000000L, 10);
// Apply a transform to get a random double RDD following `N(1, 4)`.
JavaRDD<Double> v = u.map(
  new Function<Double, Double>() {
    public Double call(Double x) {
      return 1.0 + 2.0 * x;
    }
  });
{% endhighlight %}
</div>

<div data-lang="python" markdown="1">
[`RandomRDDs`](api/python/pyspark.mllib.random.RandomRDDs-class.html) provides factory
methods to generate random double RDDs or vector RDDs.
The following example generates a random double RDD whose values follow the standard normal
distribution `N(0, 1)`, and then maps it to `N(1, 4)`.

{% highlight python %}
from pyspark.mllib.random import RandomRDDs

sc = ... # SparkContext

# Generate a random double RDD that contains 1 million i.i.d. values drawn from the
# standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.
u = RandomRDDs.normalRDD(sc, 1000000L, 10)
# Apply a transform to get a random double RDD following `N(1, 4)`.
v = u.map(lambda x: 1.0 + 2.0 * x)
{% endhighlight %}
</div>

</div>
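
The `1.0 + 2.0 * x` transform in the examples above relies on the affine property of the normal distribution: if `X ~ N(0, 1)`, then `1 + 2X ~ N(1, 4)`, since `E[aX + b] = b + a E[X]` and `Var(aX + b) = a^2 Var(X)`. A minimal sketch checking this with plain Python (no Spark required; the sample size and seed here are illustrative, not part of the MLlib API):

```python
import random
import statistics

random.seed(42)

# Draw i.i.d. samples from the standard normal distribution N(0, 1).
u = [random.gauss(0.0, 1.0) for _ in range(100000)]

# Apply the same transform as the examples above: x -> 1 + 2x.
v = [1.0 + 2.0 * x for x in u]

# The transformed sample should have mean near 1 and standard
# deviation near 2 (i.e., variance near 4).
print(statistics.mean(v))
print(statistics.stdev(v))
```

The same reasoning is what makes the RDD version correct: `map` applies the transform element-wise, so each value of `v` is an i.i.d. draw from `N(1, 4)`.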

## Stratified Sampling
