Deanwampler master #7
Merged: 6 commits, Jun 2, 2016
39 changes: 24 additions & 15 deletions notebooks/WhyScala.md
@@ -7,6 +7,7 @@

* Scala Days NYC, May 5th, 2016
* GOTO Chicago, May 24, 2016
* Strata + Hadoop World London, June 3, 2016
* Scala Days Berlin, June 16th, 2016

See also the [Spark Notebook](http://spark-notebook.io) version of this content, available at [github.com/data-fellas/scala-for-data-science](https://github.com/data-fellas/scala-for-data-science).
@@ -17,6 +18,8 @@ While Python and R are traditional languages of choice for Data Science, [Spark]

However, using one language for all of the work has advantages, such as a simpler software development process: one set of build, test, and deployment techniques, one set of coding conventions, and so on.

If you want a thorough introduction to Scala, see [Dean's book](http://shop.oreilly.com/product/0636920033073.do).

So, what are the advantages, as well as the disadvantages, of Scala?

## 1. Functional Programming Plus Objects
@@ -37,7 +40,7 @@ Scala also implements some _functional_ features using _object-oriented inheritance_
* **R:** As a Statistics language, R is more functional than object-oriented.
* **Java:** An object-oriented language, but with recently introduced functional constructs: _lambdas_ (anonymous functions) and collection operations that follow a more _functional_ style rather than an _imperative_ one (i.e., where mutating the collection is embraced).

There are a few differences between Java's and Scala's approaches to OOP and FP that are worth mentioning specifically:

### 1a. Traits vs. Interfaces
Scala's object model adds a _trait_ feature, which is a more powerful concept than Java 8 interfaces. Before Java 8, there was no [mixin composition](https://en.wikipedia.org/wiki/Mixin) capability in Java, where composition is generally [preferred over inheritance](https://en.wikipedia.org/wiki/Composition_over_inheritance).
@@ -47,7 +50,7 @@ Imagine that you want to define reusable logging code and mix it into other classes.
Scala traits fully support mixin composition by supporting both field and method definitions, with flexible rules for overriding behavior once the traits are mixed into classes.
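For illustration, here is a minimal sketch of a trait mixed into a class (the `Logging` trait and `DataLoader` class are hypothetical names, not taken from the notebook):

```scala
// A trait can define both fields and concrete methods, then be mixed into classes.
trait Logging {
  val prefix: String = "LOG"                           // a field with a default value
  def log(message: String): Unit = println(s"$prefix: $message")
}

// Mixin composition: DataLoader gets the logging behavior without inheriting from a base class.
class DataLoader extends Logging {
  def load(path: String): Unit = log(s"loading $path")
}

new DataLoader().load("airports.json")                 // prints "LOG: loading airports.json"
```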

### 1b. Java Streams
When you use the Java 8 collections, you can convert the traditional collections to a "stream", which is lazy and gives you more functional operations. However, the conversions back and forth can sometimes be tedious, e.g., converting to a stream for functional processing, then converting back in order to pass the results to older APIs, etc. Scala collections are more consistently functional.
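For contrast, a small Scala sketch, where no conversion step is needed because the standard collections support functional operations directly:

```scala
// No stream conversion required: map and filter are available on the collection itself.
List(1, 2, 3, 4).map(i => i * 2).filter(_ > 4)   // List(6, 8)
```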

### The Virtue of Functional Collections
Let's examine how concisely we can operate on a collection of values in Scala and Spark.
@@ -84,7 +87,7 @@ Let's compare a Scala collections calculation vs. the same thing in Spark; how m
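The Scala-collections version of the calculation is elided in this diff view. Here is a minimal sketch of what it presumably looks like, assuming an `isPrime` helper defined along these lines:

```scala
// A simple (inefficient) primality test, sufficient for the numbers 1 to 100.
def isPrime(n: Int): Boolean =
  n > 1 && (2 to math.sqrt(n).toInt).forall(n % _ != 0)

// Pair each number with its primality, group by primality, then count each group.
(1 to 100).
  map(i => (i, isPrime(i))).
  groupBy(tuple => tuple._2).
  map(tuple => (tuple._1, tuple._2.size))
```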
This produces the results:
```scala
res16: scala.collection.immutable.Map[Boolean,Int] = Map(
  false -> 75, true -> 25)
```
Note that for the numbers between 1 and 100, inclusive, exactly 1/4 of them are prime!

@@ -95,7 +98,7 @@ Note how similar the following code is to the previous example. After constructi
However, because Spark collections are "lazy" by default (i.e., not evaluated until we ask for results), we explicitly print the results so Spark evaluates them!

```scala
val rddPrimes = sparkContext.parallelize(1 to 100).
map(i => (i, isPrime(i))).
groupBy(tuple => tuple._2).
map(tuple => (tuple._1, tuple._2.size))
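// The action that forces evaluation is elided in this diff view;
// presumably something like `rddPrimes.collect`, which returns a local Array of the results.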
```

@@ -106,7 +109,7 @@ This produces the result:
```scala
rddPrimes: org.apache.spark.rdd.RDD[(Boolean, Int)] =
MapPartitionsRDD[4] at map at <console>:61
res18: Array[(Boolean, Int)] = Array((false,75), (true,25))
```

Note the inferred type, an `RDD` with records of type `(Boolean, Int)`, meaning two-element tuples.
@@ -123,19 +126,19 @@ What about the other languages?

## 2. Interpreter (REPL)

In the notebook, we've already been using the Scala interpreter (a.k.a. the REPL: Read, Eval, Print, Loop) behind the scenes. It makes notebooks like this one possible!

What about the other languages?

* **Python:** Also has an interpreter, and [iPython/Jupyter](https://ipython.org/) was one of the first widely-used notebook environments.
* **R:** Also has an interpreter and notebook/IDE environments.
* **Java:** Does _not_ have an interpreter and can't be programmed in a notebook environment. However, Java 9 will have a REPL, after 20+ years!

## 3. Tuple Syntax
In data, you work with records of `n` fields (for some value of `n`) all the time. Support for `n`-element _tuples_ is very convenient and Scala has a shorthand syntax for instantiating tuples. We used it twice previously to return two-element tuples in the anonymous functions passed to the `map` methods above:

```scala
sparkContext.parallelize(1 to 100).
map(i => (i, isPrime(i))). // <-- here
groupBy(tuple => tuple._2).
  map(tuple => (tuple._1, tuple._2.size)) // <-- here
```
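For example, here is the shorthand on its own (the values are hypothetical, just for illustration):

```scala
val record = ("Boeing Field", "Seattle", "WA")   // a Tuple3[String, String, String]
record._1                                        // "Boeing Field"; fields are accessed as _1, _2, ...
```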
@@ -174,7 +177,7 @@ This is one of the most powerful features you'll find in most functional languag
Let's rewrite our previous primes example:

```scala
sparkContext.parallelize(1 to 100).
map(i => (i, isPrime(i))).
groupBy{ case (_, primality) => primality}. // Syntax: { case pattern => body }
map{ case (primality, values) => (primality, values.size) } . // same here
```

@@ -184,7 +187,7 @@
The output is:
```scala
(true,25)
(false,75)
```

Note the `case` keyword and `=>` separating the pattern from the body to execute if the pattern matches.
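The same `case` syntax also works in a standalone `match` expression; here is a minimal sketch using a two-element tuple:

```scala
// Pattern matching on a (number, isPrime) tuple; `n` is bound by the first pattern.
(3, true) match {
  case (n, true)  => s"$n is prime"
  case (_, false) => "not prime"
}
```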
@@ -268,7 +271,7 @@ j: Int = 10
Recall our previous Spark example, where we wrote nothing about types, but they were inferred:

```scala
sparkContext.parallelize(1 to 100).
map(i => (i, isPrime(i))).
groupBy{ case(_, primality) => primality }.
  map{ case (primality, values) => (primality, values.size) }
```
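Type annotations are still allowed when you want them; here is a small sketch of inferred vs. explicit declarations, in the spirit of the `j: Int = 10` REPL output referenced above:

```scala
val i = 10              // type inferred; the REPL reports i: Int = 10
val j: Int = 10         // explicit annotation; equivalent result, j: Int = 10
val pair = (1, "one")   // inferred tuple type: (Int, String)
```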
@@ -330,11 +333,10 @@ Get the root directory of the notebooks:
```scala
val root = sys.env("NOTEBOOKS_DIR")
```

Load the airports data into a [DataFrame](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame).

```scala
val airportsDF = sqlContext.read.json(s"$root/airports.json")
```

Note that the "schema" is inferred from the JSON and shown by the REPL (by calling `DataFrame.toString`).
@@ -345,6 +347,12 @@ airportsDF: org.apache.spark.sql.DataFrame = [airport: string, city: string, cou

We cache the results, so Spark will keep the data in memory since we'll run a few queries over it. `DataFrame.show` is convenient for displaying the first `N` records (20 by default).

```scala
airportsDF.cache
airportsDF.show
```

Here's the output of `show`:
```
+--------------------+------------------+-------+----+-----------+------------+-----+
| airport| city|country|iata| lat| long|state|
```

@@ -378,7 +386,7 @@ Now we can show the idiomatic DataFrame API (DSL) in action:
```scala
val grouped = airportsDF.groupBy($"state", $"country").count.orderBy($"count".desc)
grouped.printSchema
grouped.show(100) // 50 states + territories < 100
```

Here is the output:
@@ -460,7 +468,7 @@ What about the other languages?
* **Java:** Limited to so-called _fluent_ APIs, similar to our collections and RDD examples above.

## 9. And a Few Other Things...
There are many more Scala features that the other languages don't have or don't support as nicely. Some are actually quite significant for general programming tasks, but they are used less frequently in Spark code. Here they are, for completeness.

### 9A. Singletons Are a Built-in Feature
Implement the _Singleton Design Pattern_ without special logic to ensure there's only one instance.
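A minimal sketch (the object name and members are hypothetical):

```scala
// `object` declares a singleton; exactly one instance exists, created lazily on first use.
object JobConfig {
  val appName = "WhyScala"
  def describe: String = s"app = $appName"
}

JobConfig.describe   // no `new`, no getInstance() boilerplate
```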
@@ -713,6 +721,7 @@
```java
SuccessOrFailure<? extends Object> sof = null;
sof = new Success<String>("foo");
```

This is harder for the user, who has to understand what's okay in this case: both what the designer intended and some technical rules of type theory.

It's much better if the _designer_ of `SuccessOrFailure[T]`, who understands the desired behavior, defines the allowed variance behavior at the _definition site_, which Scala supports. Recall from above:
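The earlier Scala definition is elided in this diff view; here is a minimal sketch of declaration-site variance (the constructor details are assumptions, but the `+T` annotation is the point):

```scala
// The `+T` makes SuccessOrFailure covariant in T, decided once by the type's designer.
sealed trait SuccessOrFailure[+T]
case class Success[T](value: T) extends SuccessOrFailure[T]
case class Failure(error: Throwable) extends SuccessOrFailure[Nothing]

// Users get the expected subtyping for free, with no wildcards at the use site:
val sof: SuccessOrFailure[Any] = Success("foo")
```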
Binary file modified notebooks/WhyScala.pdf