BigDataBench-Spark is an integrated part of the open-source big data benchmark suite project BigDataBench, publicly available from: http://prof.ict.ac.cn/BigDataBench
This version is for Spark-1.3.x.
If you need a citation for BigDataBench-Spark, please cite the following
paper:
BigDataBench: a Big Data Benchmark Suite from Internet Services.
(http://prof.ict.ac.cn/BigDataBench/wp-content/uploads/2013/10/Wang_BigDataBench.pdf)
Lei Wang, Jianfeng Zhan, Chunjie Luo, Yuqing Zhu, Qiang Yang, Yongqiang He,
Wanling Gao, Zhen Jia, Yingjie Shi, Shujie Zhang, Cheng Zhen, Gang Lu, Kent
Zhan, Xiaona Li, and Bizhu Qiu. The 20th IEEE International Symposium On High
Performance Computer Architecture (HPCA-2014), February 15-19, 2014,
Orlando, Florida, USA.
How to use BigDataBench's Spark workloads?
Compile the source code or download a pre-built package (one can be found in the `pre-build' folder). For compiling, please refer to: how-to-compile.txt
Preparations:
Make sure Spark-1.3.x has been successfully installed.
Configure your bash environment (a sample is sketched below):
$SPARK_HOME points to the path where Spark is installed;
Add $SPARK_HOME/bin to the $PATH variable.
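for example (assuming a hypothetical install path of /opt/spark-1.3.1):
export SPARK_HOME=/opt/spark-1.3.1
export PATH=$SPARK_HOME/bin:$PATH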
The workloads include:
Sort, Grep, WordCount, NaiveBayesTrainer, NaiveBayesClassifier, ConnectedComponent, PageRank, KMeans,
and CF (Collaborative Filtering, ALS)
How to run:
Assume the path of the bigdatabench-spark_*-1.3.0.jar file is stored in $JAR_FILE.
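for example (the exact jar name depends on your build, so this path is hypothetical):
export JAR_FILE=/path/to/bigdatabench-spark_2.10-1.3.0.jar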
Sort
run:
spark-submit --class cn.ac.ict.bigdatabench.Sort $JAR_FILE <data_file> <save_file> [<slices>]
parameters:
<data_file>: the HDFS path of input data, for example: /test/data.txt
<save_file>: the HDFS path to save the result
[<slices>]: optional, the number of data slices (a multiple of the number of workers)
input data format:
ordinary text files
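sample invocation (the HDFS paths and slice count are illustrative):
spark-submit --class cn.ac.ict.bigdatabench.Sort $JAR_FILE /test/data.txt /test/sort_result 8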
Grep
run:
spark-submit --class cn.ac.ict.bigdatabench.Grep $JAR_FILE <data_file> <keyword> <save_file> [<slices>]
parameters:
<data_file>: the HDFS path of input data, for example: /test/data.txt
<keyword>: the keyword to filter the text
<save_file>: the HDFS path to save the result
[<slices>]: optional, the number of data slices (a multiple of the number of workers)
input data format:
ordinary text files
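sample invocation (the keyword, HDFS paths, and slice count are illustrative):
spark-submit --class cn.ac.ict.bigdatabench.Grep $JAR_FILE /test/data.txt dog /test/grep_result 8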
WordCount
run:
spark-submit --class cn.ac.ict.bigdatabench.WordCount $JAR_FILE <data_file> <save_file> [<slices>]
parameters:
<data_file>: the HDFS path of input data, for example: /test/data.txt
<save_file>: the HDFS path to save the result
[<slices>]: optional, the number of data slices (a multiple of the number of workers)
input data format:
ordinary text files
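sample invocation (the HDFS paths and slice count are illustrative):
spark-submit --class cn.ac.ict.bigdatabench.WordCount $JAR_FILE /test/data.txt /test/wordcount_result 8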
NaiveBayesTrainer
run:
spark-submit --class cn.ac.ict.bigdatabench.NaiveBayesTrainer $JAR_FILE <data_file> <save_file> [<slices>]
parameters:
<data_file>: the HDFS path of input data, for example: /test/data.txt
<save_file>: the HDFS path to save the result
[<slices>]: optional, the number of data slices (a multiple of the number of workers)
input data format:
classname text_content
for example: (class: dog/cat)
dog Dogs are awesome, cats too. I love my dog
cat Cats are more preferred by software developers. I never could stand cats. I have a dog
dog My dog's name is Willy. He likes to play with my wife's cat all day long. I love dogs
cat Cats are difficult animals, unlike dogs, really annoying, I hate them all
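sample invocation (the HDFS paths are illustrative; the trained model is saved to the second path):
spark-submit --class cn.ac.ict.bigdatabench.NaiveBayesTrainer $JAR_FILE /test/train_data.txt /test/bayes_model 8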
NaiveBayesClassifier
run:
spark-submit --class cn.ac.ict.bigdatabench.NaiveBayesClassifier $JAR_FILE <data_file> <model_file> <save_file> [<slices>]
parameters:
<data_file>: the HDFS path of input data, for example: /test/data.txt
<model_file>: the HDFS path of Bayes model data(generated with the training program), for example: /test/bayes_model
<save_file>: the HDFS path to save the classification result
[<slices>]: optional, the number of data slices (a multiple of the number of workers)
input data format:
text_content
for example:
Dogs are awesome, cats too. I love my dog
Cats are more preferred by software developers. I never could stand cats. I have a dog
My dog's name is Willy. He likes to play with my wife's cat all day long. I love dogs
Cats are difficult animals, unlike dogs, really annoying, I hate them all
output data format:
classname text_content
for example: (class: dog/cat)
dog Dogs are awesome, cats too. I love my dog
cat Cats are more preferred by software developers. I never could stand cats. I have a dog
dog My dog's name is Willy. He likes to play with my wife's cat all day long. I love dogs
cat Cats are difficult animals, unlike dogs, really annoying, I hate them all
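sample invocation (the HDFS paths are illustrative; /test/bayes_model is assumed to come from the trainer above):
spark-submit --class cn.ac.ict.bigdatabench.NaiveBayesClassifier $JAR_FILE /test/data.txt /test/bayes_model /test/classify_result 8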
ConnectedComponent
run:
spark-submit --class cn.ac.ict.bigdatabench.ConnectedComponent $JAR_FILE <data_file> [<slices>]
parameters:
<data_file>: the HDFS path of input data, for example: /test/data.txt
[<slices>]: optional, the number of data slices (a multiple of the number of workers)
input data format:
from_vertex to_vertex
for example:
1 2
1 3
2 5
4 6
6 7
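sample invocation (the HDFS path and slice count are illustrative):
spark-submit --class cn.ac.ict.bigdatabench.ConnectedComponent $JAR_FILE /test/data.txt 8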
PageRank
run:
spark-submit --class cn.ac.ict.bigdatabench.PageRank $JAR_FILE <file> <number_of_iterations> <save_path> [<slices>]
parameters:
<file>: the HDFS path of input data, for example: /test/data.txt
<number_of_iterations>: number of iterations to run the algorithm
<save_path>: path to save the result
[<slices>]: optional, the number of data slices (a multiple of the number of workers)
input data format:
page neighbour_page
for example:
a b
a c
b d
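sample invocation (the HDFS paths are illustrative; runs 10 iterations):
spark-submit --class cn.ac.ict.bigdatabench.PageRank $JAR_FILE /test/data.txt 10 /test/pagerank_result 8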
CF (Collaborative Filtering, ALS)
run:
spark-submit --class cn.ac.ict.bigdatabench.ALS $JAR_FILE <ratings_file> <rank> <iterations> [<splits>]
parameters:
<ratings_file>: path of input data file
<rank>: number of features to train the model
<iterations>: number of iterations to run the algorithm
[<splits>]: optional, level of parallelism to split computation into
input data:
userID,productID,rating
for example:
1,1,5
1,3,4
1,5,1
2,1,4
2,5,5
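sample invocation (the ratings path is illustrative; trains a rank-10 model for 10 iterations):
spark-submit --class cn.ac.ict.bigdatabench.ALS $JAR_FILE /test/ratings.txt 10 10 8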
KMeans
run:
spark-submit --class cn.ac.ict.bigdatabench.KMeans $JAR_FILE <input_file> <k> <max_iterations> [<splits>]
parameters:
<input_file>: the HDFS path of input data, for example: /test/data.txt
<k>: number of centers
<max_iterations>: number of iterations to run the algorithm
[<splits>]: optional, level of parallelism to split computation into
input data:
x11 x12 x13 ... x1n
x21 x22 x23 ... x2n
for example:
1.0 1.1 1.3 1.4
2.1 2.4 2.6 2.7
3.1 3.3 3.6 3.7
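sample invocation (the HDFS path is illustrative; clusters the points into 8 centers with at most 10 iterations):
spark-submit --class cn.ac.ict.bigdatabench.KMeans $JAR_FILE /test/data.txt 8 10 4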