First pass at a spark platform for summingbird #502
Conversation
Conflicts: project/Build.scala
timeSpan: Interval[Timestamp],
deltas: RDD[(Timestamp, (K, V))],
commutativity: Commutativity,
@transient semigroup: Semigroup[V]): MergeResult[K, V] = {
Semigroup extends Serializable, so why does it need to be transient?
Maybe it doesn't need to be in that case. I was just trying to force as much as possible through the externalizer. I wasn't sure which things need this and which don't. Maybe only closures need it?
Going to keep it this way just to be defensive.
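For context on what `@transient` buys (or costs) here, a minimal stdlib-only sketch with no Spark involved — the `Box` class and `roundTrip` helper are hypothetical, but the mechanism is standard Java serialization, which is what Spark uses when shipping closures. A transient field is skipped during serialization and comes back as `null`, so it is only safe when the value can be recomputed or re-supplied on the other side:

```scala
import java.io._

// Hypothetical holder: one normal field, one marked @transient.
class Box(val kept: String, @transient val dropped: String) extends Serializable

// Round-trip a value through Java serialization, as Spark does for closures.
def roundTrip[T <: Serializable](t: T): T = {
  val bytes = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(bytes)
  out.writeObject(t)
  out.close()
  val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
  in.readObject().asInstanceOf[T]
}

val copy = roundTrip(new Box("stays", "goes"))
// copy.kept survives the round trip; copy.dropped is null afterwards,
// because the transient field was never written.
```

This is why marking an already-Serializable `Semigroup` as `@transient` is "defensive" only if something else (an externalizer, a re-injection on deserialization) restores it; otherwise it would silently arrive as `null`.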
// A source promises to never return values outside of timeSpan
trait SparkSource[T] extends Serializable {
  def rdd(sc: SparkContext, timeSpan: Interval[Timestamp]): RDD[(Timestamp, T)]
An implicit SparkContext, a reader monad, or the store itself could provide it. Seems like it doesn't need to be an argument?
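One way to read the implicit-parameter suggestion, sketched with placeholder stand-ins for `SparkContext`, `RDD`, `Timestamp`, and `Interval` so it compiles standalone (the real APIs differ): the context moves to an implicit parameter list, so call sites stop threading it explicitly.

```scala
// Placeholder stand-ins for the real types, just so the sketch compiles.
class SparkContext
case class RDD[T](items: Seq[T])
case class Timestamp(millis: Long)
case class Interval[T](low: T, high: T)

// The suggested shape: SparkContext supplied implicitly rather than
// passed as an explicit argument at every call site.
trait SparkSource[T] extends Serializable {
  def rdd(timeSpan: Interval[Timestamp])(implicit sc: SparkContext): RDD[(Timestamp, T)]
}

// A trivial in-memory source to show usage.
val src = new SparkSource[String] {
  def rdd(timeSpan: Interval[Timestamp])(implicit sc: SparkContext): RDD[(Timestamp, String)] =
    RDD(Seq((Timestamp(0L), "hello")))
}

implicit val sc: SparkContext = new SparkContext
val out = src.rdd(Interval(Timestamp(0L), Timestamp(100L))) // sc resolved implicitly
```

The trade-off is the usual one with implicits: less plumbing at call sites, but the dependency on a live `SparkContext` becomes invisible in the signature's first parameter list.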
// some shorthand type aliases
type Prod[T] = Producer[P, T]
type TailProd[T] = TailProducer[P, T]
type Visited = Map[Prod[_], P#Plan[_]]
What happens if you try

type Visited = Map[Prod[_], P#Plan[T]]

or

type Visited[T] = Map[Prod[_], P#Plan[T]]

Does either work? It would clean up the need for casting.
No, the T in Plan[T] isn't fixed; this is a Map of many different kinds of plans, unfortunately.
The Plan is basically an RDD, so if you have an RDD[T] and then map it to an RDD[U], you now have a Plan[U].
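A stdlib-only sketch of why the cast is hard to avoid (`Plan` stubbed as a bare case class, and keys simplified to `String` in place of the erased producers): once plans of different element types live in one `Map`, the value type degrades to the existential `Plan[_]`, and recovering a `Plan[T]` takes an `asInstanceOf` backed by the planner's knowledge of which key holds which type.

```scala
// Stub for the platform's Plan; in the real code this wraps an RDD.
case class Plan[T](value: T)

// Entries carry different element types, so the only common value type
// is the existential Plan[_].
val visited: Map[String, Plan[_]] =
  Map("ints" -> Plan(42), "strings" -> Plan("done"))

// Lookups come back as Plan[_]; the caller casts to the type it knows
// this key was planned with. Type-unsafe, but invariant by construction.
val intPlan = visited("ints").asInstanceOf[Plan[Int]]
```

A `Visited[T]` alias would fix a single `T` for the whole map, which is exactly what this structure cannot do.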
Great work Alex. Now add those issues and let's keep summing.
First pass at a spark platform for summingbird.
Outstanding issues:

- Batching doesn't really come into play yet.
- Thread-safety issues there. I actually had this implemented but didn't want to spend the time finding out whether Spark is thread-safe in this way until I had something that worked in local mode first.
- There's a lot of overlap with the scalding platform in these tests. Having a PlatformLaws test suite would also better describe the contract and expected behavior of all platforms.
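A hypothetical shape for such a PlatformLaws suite — every name below is invented for illustration; nothing like this exists in the PR. The idea is a trait parameterized on the platform that states properties any implementation must satisfy, so in-memory, scalding, and spark platforms all run the same checks:

```scala
// Hypothetical: laws any platform's run cycle must satisfy.
trait PlatformLaws {
  // Run a pipeline over the input; each platform supplies its own engine.
  def run(input: Seq[Int]): Seq[Int]

  // Example law: an identity pipeline must preserve its input exactly.
  def identityLaw(input: Seq[Int]): Boolean = run(input) == input
}

// A trivial in-memory "platform" satisfying the law, for the shared tests.
object InMemoryPlatform extends PlatformLaws {
  def run(input: Seq[Int]): Seq[Int] = input
}

val holds = InMemoryPlatform.identityLaw(Seq(1, 2, 3))
```

Each real platform would mix in the trait and get the full shared suite for free, making the cross-platform contract executable rather than implied by duplicated tests.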