Commit e2dd124

better comments for spark scripts
1 parent a7c0cc5 commit e2dd124

6 files changed: +11 -2 lines

spark-scripts/src/main/scala/SimpleApp.scala

+1 -1

@@ -4,7 +4,7 @@
  */
 import org.apache.spark.sql.SparkSession
 
-// level 1: get spark streaming to work
+// level 1: get spark to work from an sbt project
 object SimpleApp {
   def main(args: Array[String]) {
     val logFile = "file:///home/ubuntu/projects/java-podcast-processor/spark-scripts/README.md" // Should be some file on your system

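Level 1 here is about getting Spark to run from an sbt project at all. For context, a minimal build.sbt for that kind of setup could look like the following sketch; the Scala and Spark versions and the Kafka connector coordinates are assumptions, not values taken from this repo:

// build.sbt sketch for a standalone Spark project (versions are assumed)
name := "spark-scripts"

version := "0.1"

scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  // "provided" because spark-submit supplies Spark itself at runtime
  "org.apache.spark" %% "spark-sql" % "2.4.5" % "provided",
  // needed by the Kafka streaming scripts in this commit
  "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.4.5"
)

With this in place, `sbt package` produces a jar that can be handed to `spark-submit --class SimpleApp`.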
spark-scripts/src/main/scala/SparkAggKafkaStreamingTest.scala

+1
@@ -12,6 +12,7 @@ import org.apache.kafka.common.serialization.StringDeserializer
 object SparkAggKafkaStreamingTest {
   def main (args: Array[String]) {
     /*
+     * Level 4: Run aggregations on these Spark streams that are consuming Kafka topics
      * for each topic, get timestamp of first and last event that occurs.
      * Also get the average time interval between messages (avgDiffSec)
      */

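The Level 4 comment describes the aggregation itself: per topic, the first and last event timestamps plus the average gap between messages. A minimal sketch of that computation with Structured Streaming might look like the following; the bootstrap server and topic name are placeholders, and avgDiffSec is computed here as the total span divided by the number of gaps:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Sketch: per-topic first/last event timestamps and average seconds between messages
object SparkAggSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("SparkAggSketch").getOrCreate()
    import spark.implicits._

    // The Kafka source exposes `topic` and `timestamp` columns on every row
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // placeholder
      .option("subscribe", "test")                         // placeholder topic
      .load()

    val stats = events
      .groupBy($"topic")
      .agg(
        min($"timestamp").as("firstEvent"),
        max($"timestamp").as("lastEvent"),
        count("*").as("eventCount")
      )
      // total span divided by the number of gaps (count - 1)
      .withColumn(
        "avgDiffSec",
        (unix_timestamp($"lastEvent") - unix_timestamp($"firstEvent")) / ($"eventCount" - 1)
      )

    stats.writeStream
      .outputMode("complete")
      .format("console")
      .start()
      .awaitTermination()
  }
}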
spark-scripts/src/main/scala/SparkKafkaStreamingAvgTimeDiff.scala

+2
@@ -23,6 +23,8 @@ object SparkKafkaStreamingAvgTimeDiff {
     import spark.implicits._
 
     /*
+     * Level 6: Do it on our actual podcast kafka topics
+     *
      * Currently using the `podcast` topic as the "action" and the `episode` topic as the reaction for our proof of concept
      * This way we can do things like:
      * - test how many episodes were successfully parsed out, by comparing our `episode` kafka topic with the episodeCount returned from the itunes api call

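Since Level 6 points the job at the real `podcast` and `episode` topics, the episodeCount check mentioned above could be sketched as below. This assumes each `episode` message is a JSON value carrying a podcast identifier; the `podcastId` field name and the schema are hypothetical, not taken from this repo:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Sketch: count parsed episodes per podcast from the `episode` topic,
// so the running totals can be compared against the episodeCount from the itunes api
object EpisodeCountSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("EpisodeCountSketch").getOrCreate()
    import spark.implicits._

    // Hypothetical message shape: JSON with a podcastId field
    val episodeSchema = new StructType().add("podcastId", StringType)

    val parsedCounts = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // placeholder
      .option("subscribe", "episode")
      .load()
      .select(from_json($"value".cast("string"), episodeSchema).as("e"))
      .groupBy($"e.podcastId")
      .count()

    parsedCounts.writeStream
      .outputMode("complete")
      .format("console")
      .start()
      .awaitTermination()
  }
}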
spark-scripts/src/main/scala/SparkKafkaStreamingAvgTimeDiffTest.scala

+5
@@ -12,8 +12,13 @@ import org.apache.kafka.common.serialization.StringDeserializer
 object SparkKafkaStreamingAvgTimeDiffTest {
   def main (args: Array[String]) {
     /*
+     * Level 5: Get the average time between events across two different kafka topics, specifically when the 2nd topic is consuming from the 1st, so we see a "reaction time".
+     * This is done with a `select where` clause that filters on `action_value = reaction_value`.
+     * E.g., if in `test` there is an event with value `event1`, it won't be paired with events in topic `test-reaction` with value `event3`, but only with events with value `event1`.
+     *
      * this is close to what we want to do, but only uses fake topics (`test` as an action, and `test-reaction` as reaction).
      * we can then take a producer running in a terminal session and send events to these, just to make sure that our logic is working correctly, before we run this on our actual kafka topics
+     *
      * For the final product, see spark-scripts/src/main/scala/SparkKafkaStreamingAvgTimeDiff.scala
      */

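The Level 5 pairing logic described in this comment block amounts to a stream-stream join whose condition does the `action_value = reaction_value` filtering. A sketch of that idea follows; the column names, watermark, and 10-minute join window are assumptions added so the join stays bounded, not details from the actual script:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Sketch: pair each event in `test` with the matching event in `test-reaction`
// and compute the reaction time in seconds for each pair
object ReactionTimeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("ReactionTimeSketch").getOrCreate()
    import spark.implicits._

    def readTopic(topic: String, prefix: String) =
      spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092") // placeholder
        .option("subscribe", topic)
        .load()
        .select(
          $"value".cast("string").as(s"${prefix}_value"),
          $"timestamp".as(s"${prefix}_ts")
        )
        .withWatermark(s"${prefix}_ts", "1 minute") // assumed lateness bound

    val actions   = readTopic("test", "action")
    val reactions = readTopic("test-reaction", "reaction")

    // Only pair events with matching payloads (event1 with event1, never event3),
    // within a bounded window so the join state can be cleaned up
    val pairs = actions.join(
      reactions,
      expr("""action_value = reaction_value AND
              reaction_ts BETWEEN action_ts AND action_ts + interval 10 minutes""")
    )

    val diffs = pairs.withColumn(
      "diffSec",
      unix_timestamp($"reaction_ts") - unix_timestamp($"action_ts")
    )

    diffs.writeStream
      .outputMode("append") // stream-stream joins emit in append mode
      .format("console")
      .start()
      .awaitTermination()
  }
}

Averaging diffSec over these pairs is what gives the avgDiffSec the other scripts report.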
spark-scripts/src/main/scala/SparkKafkaStreamingTest.scala

+1 -1

@@ -6,7 +6,7 @@ import org.apache.spark.sql.kafka010._
 import org.apache.kafka.clients.consumer.ConsumerRecord
 import org.apache.kafka.common.serialization.StringDeserializer
 
-// level 2: get Sparks Streaming to work with kafka topics
+// level 3: get Spark Streaming to work with kafka topics
 object SparkKafkaStreamingTest {
   def main (args: Array[String]) {
     /*

spark-scripts/src/main/scala/SparkStreamingTest.scala

+1
@@ -1,6 +1,7 @@
 import org.apache.spark.sql.functions._
 import org.apache.spark.sql.SparkSession
 
+// level 2: get spark streaming to work
 // singleton class (our main). Runs a word count over network (localhost:9999)
 object SparkStreamingTest {
   def main (args: Array[String]) {

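The level 2 script is described as a word count over a network socket on localhost:9999. For reference, the standard Structured Streaming version of that exercise looks roughly like this sketch:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Sketch: read lines from localhost:9999, split them into words, count each word
object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("WordCountSketch").getOrCreate()
    import spark.implicits._

    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    val counts = lines
      .select(explode(split($"value", "\\s+")).as("word"))
      .groupBy("word")
      .count()

    counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()
      .awaitTermination()
  }
}

Feeding it is as simple as running `nc -lk 9999` in another terminal and typing lines.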