
Commit 456c11f

mnazbro authored and JoshRosen committed
[SPARK-5440][pyspark] Add toLocalIterator to pyspark rdd
Since Java and Scala both have the ability to iterate over partitions via the "toLocalIterator" function, Python should have that same ability.

Author: Michael Nazario <mnazario@palantir.com>

Closes apache#4237 from mnazario/feature/toLocalIterator and squashes the following commits:

1c58526 [Michael Nazario] Fix documentation off by one error
0cdc8f8 [Michael Nazario] Add toLocalIterator to PySpark
1 parent 9b18009 commit 456c11f
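
A usage sketch (not part of the commit), assuming an active SparkContext named `sc` as in the doctest below: unlike `collect()`, which materializes the entire RDD on the driver at once, `toLocalIterator()` pulls one partition at a time.

# Usage sketch; assumes an active SparkContext `sc` (not part of this commit).
rdd = sc.parallelize(range(100), 4)   # 4 partitions of 25 elements each

# collect() pulls all 100 elements to the driver in one shot.
everything = rdd.collect()

# toLocalIterator() runs one job per partition, so the driver holds
# at most one partition (25 elements here) in memory at a time.
total = 0
for x in rdd.toLocalIterator():
    total += x  # handle each element as it streams in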

File tree

1 file changed: +14 -0 lines changed

python/pyspark/rdd.py

Lines changed: 14 additions & 0 deletions
@@ -2059,6 +2059,20 @@ def countApproxDistinct(self, relativeSD=0.05):
         hashRDD = self.map(lambda x: portable_hash(x) & 0xFFFFFFFF)
         return hashRDD._to_java_object_rdd().countApproxDistinct(relativeSD)
 
+    def toLocalIterator(self):
+        """
+        Return an iterator that contains all of the elements in this RDD.
+        The iterator will consume as much memory as the largest partition in this RDD.
+        >>> rdd = sc.parallelize(range(10))
+        >>> [x for x in rdd.toLocalIterator()]
+        [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
+        """
+        partitions = xrange(self.getNumPartitions())
+        for partition in partitions:
+            rows = self.context.runJob(self, lambda x: x, [partition])
+            for row in rows:
+                yield row
+
 
 class PipelinedRDD(RDD):
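
A note on the memory bound stated in the docstring: `runJob` returns the requested partition's rows as a fully materialized list on the driver, so the generator holds at most one partition at a time. A stripped-down plain-Python sketch of the same pattern, where `fetch_partition` is a hypothetical stand-in for `self.context.runJob`:

def to_local_iterator(num_partitions, fetch_partition):
    # `fetch_partition` is a hypothetical stand-in for context.runJob:
    # it returns one partition's rows as a fully materialized list.
    for i in range(num_partitions):
        rows = fetch_partition(i)  # peak driver memory: one partition
        for row in rows:
            yield row              # stream elements to the caller

Running one job per partition trades scheduling overhead for a bounded driver footprint: the peak cost is the largest single partition, not the whole RDD.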

Comments (0)