BigDataBench-Spark is an integrated part of the open-source big data benchmark suite project BigDataBench, publicly available from: http://prof.ict.ac.cn/BigDataBench
This version is for Spark-1.3.x.
If you need a citation for BigDataBench-Spark, please cite the following
paper:
BigDataBench: a Big Data Benchmark Suite from Internet Services.
(http://prof.ict.ac.cn/BigDataBench/wp-content/uploads/2013/10/Wang_BigDataBench.pdf)
Lei Wang, Jianfeng Zhan, Chunjie Luo, Yuqing Zhu, Qiang Yang, Yongqiang He,
Wanling Gao, Zhen Jia, Yingjie Shi, Shujie Zhang, Cheng Zhen, Gang Lu, Kent
Zhan, Xiaona Li, and Bizhu Qiu. The 20th IEEE International Symposium On High
Performance Computer Architecture (HPCA-2014), February 15-19, 2014,
Orlando, Florida, USA.
How to use BigDataBench's Spark workloads?
Compile the source code or download a pre-built package (one can be found in the `pre-build' folder). For compiling, please refer to: how-to-compile.txt
Preparations:
Make sure Spark-1.3.x has been successfully installed.
Configure your bash environment (a sample is sketched below):
$SPARK_HOME points to the path where Spark is installed;
Add $SPARK_HOME/bin to the $PATH variable.
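for example (assuming a hypothetical install path of /opt/spark-1.3.1):
export SPARK_HOME=/opt/spark-1.3.1
export PATH=$SPARK_HOME/bin:$PATH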
The workloads include:
Sort, Grep, WordCount, NaiveBayesTrainer, NaiveBayesClassifier, ConnectedComponent, PageRank, KMeans,
and CF (Collaborative Filtering, ALS)
How to run:
Assume the path of the bigdatabench-spark_*-1.3.0.jar file is stored in $JAR_FILE.
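for example (the exact jar name depends on your build, so this path is hypothetical):
export JAR_FILE=/path/to/bigdatabench-spark_2.10-1.3.0.jar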
Sort
run:
spark-submit --class cn.ac.ict.bigdatabench.Sort $JAR_FILE <data_file> <save_file> [<slices>]
parameters:
<data_file>: the HDFS path of input data, for example: /test/data.txt
<save_file>: the HDFS path to save the result
[<slices>]: optional, the number of data slices (a multiple of the number of workers)
input data format:
ordinary text files
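sample invocation (the HDFS paths and slice count are illustrative):
spark-submit --class cn.ac.ict.bigdatabench.Sort $JAR_FILE /test/data.txt /test/sort_result 8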
Grep
run:
spark-submit --class cn.ac.ict.bigdatabench.Grep $JAR_FILE <data_file> <keyword> <save_file> [<slices>]
parameters:
<data_file>: the HDFS path of input data, for example: /test/data.txt
<keyword>: the keyword to filter the text
<save_file>: the HDFS path to save the result
[<slices>]: optional, the number of data slices (a multiple of the number of workers)
input data format:
ordinary text files
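sample invocation (the keyword, HDFS paths, and slice count are illustrative):
spark-submit --class cn.ac.ict.bigdatabench.Grep $JAR_FILE /test/data.txt dog /test/grep_result 8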
WordCount
run:
spark-submit --class cn.ac.ict.bigdatabench.WordCount $JAR_FILE <data_file> <save_file> [<slices>]
parameters:
<data_file>: the HDFS path of input data, for example: /test/data.txt
<save_file>: the HDFS path to save the result
[<slices>]: optional, the number of data slices (a multiple of the number of workers)
input data format:
ordinary text files
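sample invocation (the HDFS paths and slice count are illustrative):
spark-submit --class cn.ac.ict.bigdatabench.WordCount $JAR_FILE /test/data.txt /test/wordcount_result 8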
NaiveBayesTrainer
run:
spark-submit --class cn.ac.ict.bigdatabench.NaiveBayesTrainer $JAR_FILE <data_file> <save_file> [<slices>]
parameters:
<data_file>: the HDFS path of input data, for example: /test/data.txt
<save_file>: the HDFS path to save the result
[<slices>]: optional, the number of data slices (a multiple of the number of workers)
input data format:
classname text_content
for example: (class: dog/cat)
dog Dogs are awesome, cats too. I love my dog
cat Cats are more preferred by software developers. I never could stand cats. I have a dog
dog My dog's name is Willy. He likes to play with my wife's cat all day long. I love dogs
cat Cats are difficult animals, unlike dogs, really annoying, I hate them all
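sample invocation (the HDFS paths are illustrative; the trained model is saved to the second path):
spark-submit --class cn.ac.ict.bigdatabench.NaiveBayesTrainer $JAR_FILE /test/train_data.txt /test/bayes_model 8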
NaiveBayesClassifier
run:
spark-submit --class cn.ac.ict.bigdatabench.NaiveBayesClassifier $JAR_FILE <data_file> <model_file> <save_file> [<slices>]
parameters:
<data_file>: the HDFS path of input data, for example: /test/data.txt
<model_file>: the HDFS path of Bayes model data(generated with the training program), for example: /test/bayes_model
<save_file>: the HDFS path to save the classification result
[<slices>]: optional, the number of data slices (a multiple of the number of workers)
input data format:
text_content
for example:
Dogs are awesome, cats too. I love my dog
Cats are more preferred by software developers. I never could stand cats. I have a dog
My dog's name is Willy. He likes to play with my wife's cat all day long. I love dogs
Cats are difficult animals, unlike dogs, really annoying, I hate them all
output data format:
classname text_content
for example: (class: dog/cat)
dog Dogs are awesome, cats too. I love my dog
cat Cats are more preferred by software developers. I never could stand cats. I have a dog
dog My dog's name is Willy. He likes to play with my wife's cat all day long. I love dogs
cat Cats are difficult animals, unlike dogs, really annoying, I hate them all
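sample invocation (the HDFS paths are illustrative; /test/bayes_model is assumed to come from the trainer above):
spark-submit --class cn.ac.ict.bigdatabench.NaiveBayesClassifier $JAR_FILE /test/data.txt /test/bayes_model /test/classify_result 8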
ConnectedComponent
run:
spark-submit --class cn.ac.ict.bigdatabench.ConnectedComponent $JAR_FILE <data_file> [<slices>]
parameters:
<data_file>: the HDFS path of input data, for example: /test/data.txt
[<slices>]: optional, the number of data slices (a multiple of the number of workers)
input data format:
from_vertex to_vertex
for example:
1 2
1 3
2 5
4 6
6 7
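sample invocation (the HDFS path and slice count are illustrative):
spark-submit --class cn.ac.ict.bigdatabench.ConnectedComponent $JAR_FILE /test/data.txt 8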
PageRank
run:
spark-submit --class cn.ac.ict.bigdatabench.PageRank $JAR_FILE <file> <number_of_iterations> <save_path> [<slices>]
parameters:
<file>: the HDFS path of input data, for example: /test/data.txt
<number_of_iterations>: number of iterations to run the algorithm
<save_path>: path to save the result
[<slices>]: optional, the number of data slices (a multiple of the number of workers)
input data format:
page neighbour_page
for example:
a b
a c
b d
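sample invocation (the HDFS paths are illustrative; runs 10 iterations):
spark-submit --class cn.ac.ict.bigdatabench.PageRank $JAR_FILE /test/data.txt 10 /test/pagerank_result 8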
CF (Collaborative Filtering, ALS)
run:
spark-submit --class cn.ac.ict.bigdatabench.ALS $JAR_FILE <ratings_file> <rank> <iterations> [<splits>]
parameters:
<ratings_file>: path of input data file
<rank>: number of features to train the model
<iterations>: number of iterations to run the algorithm
[<splits>]: optional, level of parallelism to split computation into
input data:
userID,productID,rating
for example:
1,1,5
1,3,4
1,5,1
2,1,4
2,5,5
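sample invocation (the ratings path is illustrative; trains a rank-10 model for 10 iterations):
spark-submit --class cn.ac.ict.bigdatabench.ALS $JAR_FILE /test/ratings.txt 10 10 8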
KMeans
run:
spark-submit --class cn.ac.ict.bigdatabench.KMeans $JAR_FILE <input_file> <k> <max_iterations> [<splits>]
parameters:
<input_file>: the HDFS path of input data, for example: /test/data.txt
<k>: number of centers
<max_iterations>: number of iterations to run the algorithm
[<splits>]: optional, level of parallelism to split computation into
input data:
x11 x12 x13 ... x1n
x21 x22 x23 ... x2n
for example:
1.0 1.1 1.3 1.4
2.1 2.4 2.6 2.7
3.1 3.3 3.6 3.7
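sample invocation (the HDFS path is illustrative; clusters the points into 8 centers with at most 10 iterations):
spark-submit --class cn.ac.ict.bigdatabench.KMeans $JAR_FILE /test/data.txt 8 10 4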