Commit e43173b

Migrated multi-line comments from Scala examples to README, as spark-shell doesn't handle multiline charset comment well.
1 parent 4acdd00 commit e43173b

2 files changed: +14 -12 lines changed

README.md

Lines changed: 14 additions & 0 deletions
@@ -37,6 +37,20 @@ You can also append `-i <file.scala>` to execute a scala file via the spark shel
 spark-shell --conf "spark.mongodb.input.uri=mongodb://mongodb:27017/spark.times" --conf "spark.mongodb.output.uri=mongodb://mongodb/spark.output" --packages org.mongodb.spark:mongo-spark-connector_${SCALA_VERSION}:${MONGO_SPARK_VERSION} -i ./examples.scala
 ```
 
+#### Additional Comments
+
+For code block in [examples.scalaL14-25](spark/files/examples.scala#L14-L25), this is an example of grouping.
+For example if you have 4 documents of :
+
+```js
+{ "doc": "A", "timestamp" : ISODate("2016-02-15T00:43:04.686Z"), "myid" : 1 }
+{ "doc": "B", "timestamp" : ISODate("2016-02-15T00:43:06.310Z"), "myid" : 2 }
+{ "doc": "C", "timestamp" : ISODate("2016-01-03T00:43:07.534Z"), "myid" : 1 }
+{ "doc": "D", "timestamp" : ISODate("2016-01-03T00:43:09.214Z"), "myid" : 2 }
+```
+
+The code block will group by `myid` and sort by latest timestamp, which would return only two documents, `doc:A` and `doc:B`. The grouping removes duplicate of `myid`s by returning only documents with the latest timestamp.
+
 ### More Information.
 
 See related article:

spark/files/examples.scala

Lines changed: 0 additions & 12 deletions
@@ -11,18 +11,6 @@ println("Input Count: " + rdd.count)
 println("Input documents: ")
 rdd.foreach(println)
 
-/*
-PROCESSING
-For example, if you have 4 documents of :
-
-{ "doc": "A", "timestamp" : ISODate("2016-02-15T00:43:04.686Z"), "myid" : 1 }
-{ "doc": "B", "timestamp" : ISODate("2016-02-15T00:43:06.310Z"), "myid" : 2 }
-{ "doc": "C", "timestamp" : ISODate("2016-01-03T00:43:07.534Z"), "myid" : 1 }
-{ "doc": "D", "timestamp" : ISODate("2016-01-03T00:43:09.214Z"), "myid" : 2 }
-
-Group by `myid` sort latest timestamp, would return only two documents, doc:A and doc:B.
-Removing duplicates of myid’s by returning only documents with the latest timestamp.
-*/
 import org.joda.time.DateTime
 val outputRDD = rdd.map(
 (tuple)=>((tuple.get("myid")), (tuple.get("timestamp")))
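
The grouping described in the README addition (one document per `myid`, keeping the one with the latest timestamp) can be illustrated without Spark or MongoDB. The following is a minimal sketch using plain Scala collections. The `Doc` case class, the `LatestPerId` object, and the `latestPerId` function are hypothetical names introduced for illustration. They are not part of `examples.scala`, which works on an RDD loaded through the MongoDB Spark connector.

```scala
import java.time.Instant

// Hypothetical stand-in for a MongoDB document; the real example reads BSON documents into an RDD.
final case class Doc(doc: String, timestamp: Instant, myid: Int)

object LatestPerId {
  // For each myid, keep only the document with the most recent timestamp.
  // Results are sorted by myid so the output order is deterministic.
  def latestPerId(docs: Seq[Doc]): Seq[Doc] =
    docs
      .groupBy(_.myid)                          // myid -> all documents sharing that id
      .values
      .map(_.maxBy(_.timestamp.toEpochMilli))   // newest document in each group
      .toSeq
      .sortBy(_.myid)

  // The four sample documents from the README addition.
  val sample: Seq[Doc] = Seq(
    Doc("A", Instant.parse("2016-02-15T00:43:04.686Z"), 1),
    Doc("B", Instant.parse("2016-02-15T00:43:06.310Z"), 2),
    Doc("C", Instant.parse("2016-01-03T00:43:07.534Z"), 1),
    Doc("D", Instant.parse("2016-01-03T00:43:09.214Z"), 2)
  )

  def main(args: Array[String]): Unit =
    latestPerId(sample).foreach(println)
}
```

With the sample data, `latestPerId` keeps documents A and B, matching the behaviour the README describes. On an RDD, the same idea would more likely be expressed as a keyed reduction rather than an in-memory `groupBy`, but that depends on how the rest of `examples.scala` is written, which this diff does not show.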
