
First pass at a spark platform for summingbird #502

Merged
merged 37 commits into twitter:develop on May 21, 2014

Conversation

isnotinvain (Contributor)

First pass at a spark platform for summingbird.

Outstanding issues:

  • Time-based batching / a BatchedStore equivalent is not yet supported. Time is plumbed through everywhere and used for non-commutative semigroups, but batching doesn't really come into play yet.
  • Logic for finding the maximal timespan covered by the various sources / stores is not yet implemented
  • Logic for lookups / left joins respecting time (e.g. returning what a value was for a given key at a given time) is not yet implemented
  • Decide whether PlatformPlanner is useful or not
  • Test running real jobs on a real cluster (so far tested only once, with a wordcount)
  • SparkPlatform is stateful but shouldn't be
  • SparkPlatform submits blocking Spark 'actions' serially; it should make (careful) use of a FuturePool instead, though we may run into Spark thread-safety issues there (see the sketch after this list). I actually had this implemented, but didn't want to spend time finding out whether Spark is thread-safe in this way until I had something working in local mode first.
  • Writing the tests (SparkLaws) led me to think that there should really be a platform-agnostic set of laws, or at least a PlatformLaws base class. These tests overlap heavily with the scalding platform's. A PlatformLaws test suite would also better describe the contract and expected behavior of all platforms.
  • There are currently no optimizations in the planning stage (except that commutativity avoids imposing a sort and uses reduceByKey)
  • The core Spark types that need access to a SparkContext should be refactored in terms of a Reader monad
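
A minimal sketch of the FuturePool shape referenced above. ParallelActions and runAll are hypothetical names, and whether a single SparkContext tolerates concurrent job submission is exactly the open thread-safety question, so treat this as the shape rather than a guarantee:

import java.util.concurrent.Executors
import com.twitter.util.{Await, Future, FuturePool}

object ParallelActions {
  private val pool = FuturePool(Executors.newFixedThreadPool(4))

  // each action is a blocking Spark "action" (e.g. a count or a save);
  // the pool turns them into Futures so they can run concurrently
  def runAll(actions: Seq[() => Unit]): Unit =
    Await.result(Future.collect(actions.map(a => pool(a()))))
}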

timeSpan: Interval[Timestamp],
deltas: RDD[(Timestamp, (K, V))],
commutativity: Commutativity,
@transient semigroup: Semigroup[V]): MergeResult[K, V] = {
Contributor:

Semigroup extends Serializable, so why does it need to be transient?

Contributor (Author):

Maybe it doesn't need to be in that case. I was just trying to force as much as possible through the externalizer. I wasn't sure which things need this vs. which don't. Maybe only closures need this?

Contributor (Author):

Going to keep it this way just to be defensive.
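
A sketch of the defensive pattern under discussion, assuming chill's Externalizer: wrap the semigroup so the closures capture the wrapper rather than the semigroup itself, and only impose a sort when the semigroup is non-commutative. mergeDeltas is a hypothetical helper (not the PR's code), Commutative / NonCommutative are stand-ins for the PR's Commutativity type, and Long stands in for Timestamp:

import scala.reflect.ClassTag

import com.twitter.algebird.Semigroup
import com.twitter.chill.Externalizer
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext._ // pair-RDD functions on circa-2014 Spark

object MergeSketch {
  sealed trait Commutativity
  case object Commutative extends Commutativity
  case object NonCommutative extends Commutativity

  // deltas are (key, (timestamp, value))
  def mergeDeltas[K: ClassTag, V: ClassTag](
      deltas: RDD[(K, (Long, V))],
      commutativity: Commutativity,
      semigroup: Semigroup[V]): RDD[(K, V)] = {
    // Externalizer handles serializing the wrapped value (Java
    // serialization when it works, Kryo otherwise), so the closures
    // below capture the wrapper, never `semigroup` directly
    val sg = Externalizer(semigroup)
    commutativity match {
      case Commutative =>
        // order is irrelevant: combine map-side, no sort needed
        deltas.mapValues(_._2).reduceByKey(sg.get.plus(_, _))
      case NonCommutative =>
        // order matters: group, sort each key's deltas by time, fold in order
        deltas.groupByKey().mapValues { tvs =>
          tvs.toSeq.sortBy(_._1).map(_._2).reduceLeft(sg.get.plus(_, _))
        }
    }
  }
}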


// A source promises to never return values outside of timeSpan
trait SparkSource[T] extends Serializable {
  def rdd(sc: SparkContext, timeSpan: Interval[Timestamp]): RDD[(Timestamp, T)]
}
Contributor:

An implicit SparkContext, a Reader monad, or the store itself could provide this. It seems like it doesn't need to be an argument?
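
A sketch of this suggestion with hypothetical names (not the PR's API): defer the SparkContext through a minimal hand-rolled Reader so sources compose without threading sc through every call. Interval and Timestamp are the types the PR already uses:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// a minimal Reader monad over some environment C
case class Reader[C, +A](run: C => A) {
  def map[B](f: A => B): Reader[C, B] = Reader(c => f(run(c)))
  def flatMap[B](f: A => Reader[C, B]): Reader[C, B] =
    Reader(c => f(run(c)).run(c))
}

// hypothetical alternative to SparkSource: the timeSpan stays explicit,
// the SparkContext is only supplied when the whole plan is run
trait SparkSourceViaReader[T] extends Serializable {
  def rdd(timeSpan: Interval[Timestamp]): Reader[SparkContext, RDD[(Timestamp, T)]]
}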

@isnotinvain isnotinvain changed the title First pass at a spark platform for summingbird -- DO NOT MERGE! First pass at a spark platform for summingbird May 20, 2014
// some shorthand type aliases
type Prod[T] = Producer[P, T]
type TailProd[T] = TailProducer[P, T]
type Visited = Map[Prod[_], P#Plan[_]]
Contributor:

What happens if you try to do

type Visited = Map[Prod[_], P#Plan[T]]

or

type Visited[T] = Map[Prod[_], P#Plan[T]]

Does either of those work? It would clean up the need for casting.

Contributor (Author):

No, the T in Plan[T] isn't fixed; this is a Map of many different kinds of plans, unfortunately.

A Plan is basically an RDD, so if you have an RDD[T] and map it to an RDD[U], you now have a Plan[U].
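
A sketch of why the cast shows up, using the planner's aliases from above (visit and lookup are hypothetical helpers): the map is heterogeneous, so insertion forgets T and lookups must recover it at the use site, where it is safe by construction:

// inside the planner, with Prod, P#Plan and Visited as aliased above
def visit[T](p: Prod[T], plan: P#Plan[T], visited: Visited): Visited =
  visited + (p -> plan)

def lookup[T](p: Prod[T], visited: Visited): Option[P#Plan[T]] =
  // safe by construction: visit only ever pairs a Prod[T] with a Plan[T]
  visited.get(p).map(_.asInstanceOf[P#Plan[T]])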

jcoveney added a commit that referenced this pull request May 21, 2014
First pass at a spark platform for summingbird
@jcoveney jcoveney merged commit d4a7fb0 into twitter:develop May 21, 2014
jcoveney (Contributor):

Great work Alex. Now add those issues and let's keep summing.
