Skip to content

[SPARK-11290][STREAMING] Basic implementation of trackStateByKey #9256

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 27 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
27 commits
Select commit Hold shift + click to select a range
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -44,18 +44,6 @@ object StatefulNetworkWordCount {

StreamingExamples.setStreamingLogLevels()

val updateFunc = (values: Seq[Int], state: Option[Int]) => {
val currentCount = values.sum

val previousCount = state.getOrElse(0)

Some(currentCount + previousCount)
}

val newUpdateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])]) => {
iterator.flatMap(t => updateFunc(t._2, t._3).map(s => (t._1, s)))
}

val sparkConf = new SparkConf().setAppName("StatefulNetworkWordCount")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since updateStateByKey isn't removed, maybe it's better to keep this example and create a new file for trackStateByKey.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we want to encourage people to move to the new API, it might be ok to not have the old one.

// Create the context with a 1 second batch size
val ssc = new StreamingContext(sparkConf, Seconds(1))
Expand All @@ -71,9 +59,16 @@ object StatefulNetworkWordCount {
val wordDstream = words.map(x => (x, 1))

// Update the cumulative count using updateStateByKey
// This will give a Dstream made of state (which is the cumulative count of the words)
val stateDstream = wordDstream.updateStateByKey[Int](newUpdateFunc,
new HashPartitioner (ssc.sparkContext.defaultParallelism), true, initialRDD)
// This will give a DStream made of state (which is the cumulative count of the words)
val trackStateFunc = (batchTime: Time, word: String, one: Option[Int], state: State[Int]) => {
val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
val output = (word, sum)
state.update(sum)
Some(output)
}

val stateDstream = wordDstream.trackStateByKey(
StateSpec.function(trackStateFunc).initialState(initialRDD))
stateDstream.print()
ssc.start()
ssc.awaitTermination()
Expand Down
193 changes: 193 additions & 0 deletions streaming/src/main/scala/org/apache/spark/streaming/State.scala
Original file line number Diff line number Diff line change
@@ -0,0 +1,193 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package org.apache.spark.streaming

import scala.language.implicitConversions

import org.apache.spark.annotation.Experimental

/**
* :: Experimental ::
* Abstract class for getting and updating the tracked state in the `trackStateByKey` operation of
* a [[org.apache.spark.streaming.dstream.PairDStreamFunctions pair DStream]] (Scala) or a
* [[org.apache.spark.streaming.api.java.JavaPairDStream JavaPairDStream]] (Java).
*
* Scala example of using `State`:
* {{{
* // A tracking function that maintains an integer state and return a String
* def trackStateFunc(data: Option[Int], state: State[Int]): Option[String] = {
* // Check if state exists
* if (state.exists) {
* val existingState = state.get // Get the existing state
* val shouldRemove = ... // Decide whether to remove the state
* if (shouldRemove) {
* state.remove() // Remove the state
* } else {
* val newState = ...
* state.update(newState) // Set the new state
* }
* } else {
* val initialState = ...
* state.update(initialState) // Set the initial state
* }
* ... // return something
* }
*
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Todo: Add example in doc.

* }}}
*
* Java example:
* {{{
* TODO(@zsxwing)
* }}}
*/
@Experimental
sealed abstract class State[S] {

/** Whether the state already exists */
def exists(): Boolean

/**
* Get the state if it exists, otherwise it will throw `java.util.NoSuchElementException`.
* Check with `exists()` whether the state exists or not before calling `get()`.
*
* @throws java.util.NoSuchElementException If the state does not exist.
*/
def get(): S

/**
* Update the state with a new value.
*
* State cannot be updated if it has been already removed (that is, `remove()` has already been
* called) or it is going to be removed due to timeout (that is, `isTimingOut()` is `true`).
*
* @throws java.lang.IllegalArgumentException If the state has already been removed, or is
* going to be removed
*/
def update(newState: S): Unit

/**
* Remove the state if it exists.
*
* State cannot be updated if it has been already removed (that is, `remove()` has already been
* called) or it is going to be removed due to timeout (that is, `isTimingOut()` is `true`).
*/
def remove(): Unit
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be delete or destroy? Not sure if we have used similar terminology elsewhere. Also it would be good to state the semantics of calling this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW I have no strong feeling here other than that it match existing things, if we have them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only semantically similar thing is a HashMap, and there both Scala and Java HashMap uses remove()


/**
* Whether the state is timing out and going to be removed by the system after the current batch.
* This timeout can occur if timeout duration has been specified in the
* [[org.apache.spark.streaming.StateSpec StatSpec]] and the key has not received any new data
* for that timeout duration.
*/
def isTimingOut(): Boolean

/**
* Get the state as an [[scala.Option]]. It will be `Some(state)` if it exists, otherwise `None`.
*/
@inline final def getOption(): Option[S] = if (exists) Some(get()) else None

@inline final override def toString(): String = {
getOption.map { _.toString }.getOrElse("<state not set>")
}
}

/** Internal implementation of the [[State]] interface */
private[streaming] class StateImpl[S] extends State[S] {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did we separate this into another class? I don't see any other subclasses of State or great reasons to create them

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did it so that we can separate the publicly visible interfaces, from the private[streaming] ones. private[streaming] ones are still publicly accessible from Java, so I was trying to avoid that. Other than than I dont really have a strong reason. If that isnt a strong enough reason to have a separate abstract class and concrete implementation, then I can merge them into one.


private var state: S = null.asInstanceOf[S]
private var defined: Boolean = false
private var timingOut: Boolean = false
private var updated: Boolean = false
private var removed: Boolean = false

// ========= Public API =========
override def exists(): Boolean = {
defined
}

override def get(): S = {
if (defined) {
state
} else {
throw new NoSuchElementException("State is not set")
}
}

override def update(newState: S): Unit = {
require(!removed, "Cannot update the state after it has been removed")
require(!timingOut, "Cannot update the state that is timing out")
state = newState
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this required for defensive guard require(!updated, "cannot update the state this is already updated")?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, if the user accidentally tries to update the state that is going to be removed by timeout anyways, the system should throw an error rather than silently allowing him/her to update without actually being updated.

defined = true
updated = true
}

override def isTimingOut(): Boolean = {
timingOut
}

override def remove(): Unit = {
require(!timingOut, "Cannot remove the state that is timing out")
require(!removed, "Cannot remove the state that has already been removed")
defined = false
updated = false
removed = true
}

// ========= Internal API =========

/** Whether the state has been marked for removing */
def isRemoved(): Boolean = {
removed
}

/** Whether the state has been been updated */
def isUpdated(): Boolean = {
updated
}

/**
* Update the internal data and flags in `this` to the given state option.
* This method allows `this` object to be reused across many state records.
*/
def wrap(optionalState: Option[S]): Unit = {
optionalState match {
case Some(newState) =>
this.state = newState
defined = true

case None =>
this.state = null.asInstanceOf[S]
defined = false
}
timingOut = false
removed = false
updated = false
}

/**
* Update the internal data and flags in `this` to the given state that is going to be timed out.
* This method allows `this` object to be reused across many state records.
*/
def wrapTiminoutState(newState: S): Unit = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to contain a typo? It's also not clear how this is different from wrap, from the comment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, will update comment.

this.state = newState
defined = true
timingOut = true
removed = false
updated = false
}
}
Loading