[SPARK-29823][MLLIB] Improper persist strategy in mllib.clustering.KMeans.run()

amanomer · srowen · commit 8c2bf64743e8 · 2019-11-13T08:16:06.000-06:00
### What changes were proposed in this pull request?
Persist `zippedData`, the RDD that is passed to `runAlgorithm`, instead of `norms`.

### Why are the changes needed?
To fix an improper persist strategy: `norms` was persisted, but the RDD that `runAlgorithm` actually consumes is `zippedData`.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Manually

Closes #26483 from amanomer/SPARK-29823.

Authored-by: Aman Omer <amanomer1996@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
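As a rough illustration of why the persisted RDD matters, the sketch below counts how often the closure of a zipped-and-mapped RDD runs across two actions, with and without persisting that RDD. It is not part of the patch: `PersistSketch`, `mapInvocations`, the accumulator name, and the sample vectors are all hypothetical, and a plain tuple stands in for `VectorWithNorm`, which is private to Spark's clustering package. It assumes a local Spark installation with `spark-mllib` on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical sketch, not taken from the patch.
object PersistSketch {

  /** Runs two actions on a zipped RDD and returns how often its map closure executed. */
  def mapInvocations(sc: SparkContext, persistZipped: Boolean): Long = {
    val calls = sc.longAccumulator("zip-map-calls")
    val data = sc.parallelize(Seq(
      Vectors.dense(1.0, 2.0), Vectors.dense(3.0, 4.0),
      Vectors.dense(5.0, 6.0), Vectors.dense(7.0, 8.0)), numSlices = 2)

    val norms = data.map(Vectors.norm(_, 2.0))
    // A tuple stands in for VectorWithNorm (assumption made for this sketch only).
    val zipped = data.zip(norms).map { case (v, norm) =>
      calls.add(1L) // record every execution of the closure
      (v, norm)
    }

    if (persistZipped) zipped.persist()
    zipped.count() // first pass over the zipped data
    zipped.count() // second pass, standing in for a further iteration
    if (persistZipped) zipped.unpersist()

    calls.sum
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("persist-sketch").setMaster("local[2]"))
    try {
      println(s"not persisted: ${mapInvocations(sc, persistZipped = false)} closure calls")
      println(s"persisted:     ${mapInvocations(sc, persistZipped = true)} closure calls")
    } finally {
      sc.stop()
    }
  }
}
```

In a local run without task retries, the unpersisted variant should execute the closure once per element for each action, while the persisted variant should execute it only during the first action. Persisting only `norms`, as the old code did, would still leave the zip and map to be redone on every pass.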
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala b/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala
@@ -223,12 +223,12 @@ class KMeans private (
 
     // Compute squared norms and cache them.
     val norms = data.map(Vectors.norm(_, 2.0))
-    norms.persist()
     val zippedData = data.zip(norms).map { case (v, norm) =>
       new VectorWithNorm(v, norm)
     }
+    zippedData.persist()
     val model = runAlgorithm(zippedData, instr)
-    norms.unpersist()
+    zippedData.unpersist()
 
     // Warn at the end of the run as well, for increased visibility.
     if (data.getStorageLevel == StorageLevel.NONE) {