Commit 1568336

misleading task number of groupByKey
"By default, this uses only 8 parallel tasks to do the grouping." is a big misleading. Please refer to #389 detail is as following code : <code> def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = { val bySize = (Seq(rdd) ++ others).sortBy(_.partitions.size).reverse for (r <- bySize if r.partitioner.isDefined) { return r.partitioner.get } if (rdd.context.conf.contains("spark.default.parallelism")) { new HashPartitioner(rdd.context.defaultParallelism) } else { new HashPartitioner(bySize.head.partitions.size) } } </code>
1 parent 037fe4d commit 1568336

File tree

1 file changed (+1, -1)

docs/scala-programming-guide.md

Lines changed: 1 addition & 1 deletion
@@ -189,7 +189,7 @@ The following tables list the transformations and actions currently supported (s
 <tr>
 <td> <b>groupByKey</b>([<i>numTasks</i>]) </td>
 <td> When called on a dataset of (K, V) pairs, returns a dataset of (K, Seq[V]) pairs. <br />
-<b>Note:</b> By default, this uses only 8 parallel tasks to do the grouping. You can pass an optional <code>numTasks</code> argument to set a different number of tasks.
+<b>Note:</b> By default, if the RDD already has a partitioner, the number of tasks matches that partitioner's number of partitions; otherwise it is the value of <code>spark.default.parallelism</code> if that property is set, and otherwise the number of partitions of the RDD. You can pass an optional <code>numTasks</code> argument to set a different number of tasks.
 </td>
 </tr>
 <tr>
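To make the revised note concrete, a small sketch of the <code>numTasks</code> override, reusing the hypothetical <code>pairs</code> RDD from the sketch above:

// An explicit numTasks argument takes precedence over all of the
// defaults described in the note, producing exactly that many tasks.
val grouped = pairs.groupByKey(16)
println(grouped.partitions.size) // 16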
