This is a collection of Java 8 and Apache Spark examples and concepts, from basic to advanced. It explains the basic concepts introduced by Java 8 and how you can combine them with Apache Spark.
This tutorial and set of examples provides a way to understand Spark 2.0 in detail and also to get familiar with Java 8 and its new features such as lambda expressions, Streams, and functional-style programming.
Java 8 includes two major changes: lambda expressions and Streams. It is a major release of the Java platform, comprising a large upgrade to the Java programming model and a coordinated evolution of the JVM, the Java language, and the libraries, with features aimed at productivity, ease of use, improved polyglot programming, security, and performance.
1- Lambda expressions, a new language feature, have been introduced in this release. They enable you to treat functionality as a method argument, or code as data. Lambda expressions let you express instances of single-method interfaces (referred to as functional interfaces) more compactly.
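For example, here is a minimal sketch contrasting an anonymous inner class with the equivalent lambda (the list contents are arbitrary):

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

public class LambdaExample {
    public static void main(String[] args) {
        List<String> words = Arrays.asList("spark", "java", "lambda");

        // Before Java 8: an anonymous inner class implementing the functional interface
        words.forEach(new Consumer<String>() {
            @Override
            public void accept(String word) {
                System.out.println(word);
            }
        });

        // Java 8: the same behaviour expressed as a lambda
        words.forEach(word -> System.out.println(word));
    }
}
```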
2- Classes in the new java.util.stream package provide a Stream API to support functional-style operations on streams of elements. The Stream API is integrated into the Collections API, which enables bulk operations on collections, such as sequential or parallel map-reduce transformations, and this release also includes performance improvements for HashMaps with key collisions.
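A minimal sketch of a sequential and a parallel map-reduce pipeline over a collection (the numbers are arbitrary):

```java
import java.util.Arrays;
import java.util.List;

public class StreamExample {
    public static void main(String[] args) {
        List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5, 6);

        // Sequential map-reduce: square each number and sum the results
        int sumOfSquares = numbers.stream()
                .map(n -> n * n)
                .reduce(0, Integer::sum);

        // The same pipeline run in parallel simply by switching the stream source
        int parallelSum = numbers.parallelStream()
                .map(n -> n * n)
                .reduce(0, Integer::sum);

        System.out.println(sumOfSquares + " " + parallelSum);
    }
}
```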
Apache Spark™ is a fast, general-purpose engine for large-scale data processing and cluster computing. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
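A minimal sketch of what a Spark application looks like with the Java API (the class name, app name, and local master are illustrative, not part of this repo):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;

public class SparkHello {
    public static void main(String[] args) {
        // Local Spark context using all available cores; the app name is arbitrary
        SparkConf conf = new SparkConf().setAppName("SparkHello").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Distribute a small in-memory collection as an RDD and count its elements
        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
        System.out.println("count = " + numbers.count());

        sc.close();
    }
}
```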
A good way to use these examples is to first clone the repo and then start your own Spark Java 8 project on top of it.
Java 8 can be downloaded here. After the installation, make sure that the version you are using is Java 8; you can check that by running:
java -version
In order to set up Spark locally on your machine, download the Spark distribution from here and then follow these steps:
> tar zxvf spark-xxx.tgz
> cd spark-xxx
> build/mvn -DskipTests clean package
After the compilation, and before running your first example, you should add the SPARK_LOCAL_IP variable to your profile:
> export SPARK_LOCAL_IP=127.0.0.1
To make sure that Spark is installed properly on your machine, you can run the first example that ships with Spark:
> ./bin/run-example SparkPi
Some of the datasets we will use in this learning tutorial are:
- A tweets archive from @ypriverol, used in the word-count examples.
- Datasets from the KDD Cup 1999; the results of this competition can be found here.
The reference book for these and other Spark related topics is:
- Learning Spark by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia.
The following examples can be examined individually, although there is a more or less linear 'story' when they are followed in sequence. Using different datasets, they solve a related set of tasks.
Here is a list of the most basic examples in Spark-Java8, along with definitions of the most basic concepts in Spark.
1- SparkWordCount: How to create a simple JavaRDD in Spark (see the sketch after this list).
2- MaptoDouble: How to generate general statistics about an RDD in Spark.
3- SparkAverage: How to compute the average of a set of numbers in Spark.
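As a rough sketch of the word-count idea referenced above, combining Java 8 lambdas with the Spark Java API (the input path `data/tweets.txt` is illustrative, not the actual file in the repo):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Arrays;

public class WordCountSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("WordCountSketch").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // The input path is illustrative; any plain-text file works
        JavaRDD<String> lines = sc.textFile("data/tweets.txt");

        // Split lines into words, pair each word with 1, and sum the counts per word
        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);

        counts.take(10).forEach(pair -> System.out.println(pair._1() + ": " + pair._2()));

        sc.close();
    }
}
```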
1- SparkSampling: Basic Spark sampling using the functions sample and takeSample (see the sketch below).
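A minimal sketch of both sampling calls (the input data, fraction, and sample size are arbitrary):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class SamplingSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SamplingSketch").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // An RDD of the numbers 1..100 to sample from
        JavaRDD<Integer> numbers = sc.parallelize(
                IntStream.rangeClosed(1, 100).boxed().collect(Collectors.toList()));

        // sample returns a new RDD containing roughly 10% of the elements (without replacement)
        JavaRDD<Integer> tenPercent = numbers.sample(false, 0.1);
        System.out.println("sampled elements: " + tenPercent.count());

        // takeSample returns exactly 5 elements to the driver as a local list
        List<Integer> five = numbers.takeSample(false, 5);
        System.out.println("takeSample: " + five);

        sc.close();
    }
}
```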