[SPARK-3886] [PySpark] use AutoBatchedSerializer by default #2740
Conversation
QA tests have started for PR 2740 at commit

QA tests have finished for PR 2740 at commit

Test PASSed.
```diff
- Java object. Set 1 to disable batching or -1 to use an
- unlimited batch size.
+ Java object. Set 1 to disable batching, or 0 to choose batch size
+ based on size of objects automaticly, or -1 to use an unlimited
```
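The three special values described in the doc can be read as a dispatch on the batching strategy. Here is a hypothetical sketch (names and structure are illustrative only, not PySpark's actual implementation) covering the fixed-size cases; the new default of 0 would pick the batch size automatically and is omitted here:

```python
def batches(items, batch_size):
    """Group items per the batchSize convention described above:
    -1 makes a single unlimited batch, 1 disables batching (one
    item per batch), and any other positive n yields fixed-size
    batches of n. (Illustrative sketch; not PySpark's code.)"""
    items = list(items)
    if batch_size == -1:
        # unlimited: the whole partition becomes one batch
        yield items
        return
    n = 1 if batch_size == 1 else batch_size
    for i in range(0, len(items), n):
        yield items[i:i + n]
```

For example, `batches(range(5), 2)` yields `[0, 1]`, `[2, 3]`, `[4]`, while `batches(range(5), -1)` yields the whole list at once.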
Spelling: automatically. How about "Set 1 to disable batching, 0 to automatically choose the batch size based on object sizes, or -1 to use an unlimited batch size"?
Aside from a minor doc typo, this looks good to me, especially since AutoBatchedSerializer already exists and has been tested.
QA tests have started for PR 2740 at commit
I tried a small experiment to test this out:

```python
import os
from pyspark import SparkContext, SparkConf

conf = SparkConf().set("spark.executor.memory", "2g")
sc = SparkContext(conf=conf)

mb = 1000000

def inflateDataSize(x):
    return bytearray(os.urandom(1 * mb))

sc.parallelize(range(1000), 10).map(inflateDataSize).cache().count()
```

Prior to this patch, the Python worker's memory consumption would steadily grow while it attempted to batch together 100 MB of data per task, whereas now the memory usage remains constant because we emit smaller batches more often (since the objects are big).

Thanks for updating the docs. This looks good to me, so I'm going to merge it into master.
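The constant memory usage comes from the grow/shrink heuristic in auto-batching: double the batch while serialized batches stay below a size target, shrink it when they blow far past it. A simplified, self-contained sketch of that idea (not Spark's exact code; the real serializer's target and bounds may differ):

```python
import pickle

def auto_batches(items, target=1 << 16):
    """Yield (item_count, serialized_size) per batch, doubling the
    batch size while pickled batches are smaller than `target` and
    halving it when they exceed 10x the target. A sketch of the
    auto-batching idea, not Spark's implementation."""
    batch = 1
    it = iter(items)
    while True:
        chunk = []
        for _ in range(batch):
            try:
                chunk.append(next(it))
            except StopIteration:
                break
        if not chunk:
            return
        data = pickle.dumps(chunk)
        yield len(chunk), len(data)
        if len(data) < target:
            batch *= 2          # small objects: batch more of them
        elif len(data) > 10 * target and batch > 1:
            batch //= 2         # large objects: shrink the batch

# With 1 MB objects, every batch holds a single element, so the
# worker never has to buffer many of them at once.
big = list(auto_batches(bytearray(1_000_000) for _ in range(4)))
```

With tiny objects the batch size doubles each round, so small records still get amortized serialization overhead, while the 1 MB objects from the experiment above stay in batches of one.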
QA tests have finished for PR 2740 at commit

Test PASSed.
Use AutoBatchedSerializer by default, which chooses the proper batch size based on the size of the serialized objects, keeping each serialized batch in the range [64k, 640k]. In the JVM, the serializer also tracks the objects in a batch to detect duplicates, so a larger batch may cause OOM in the JVM.

Author: Davies Liu <davies.liu@gmail.com>

Closes apache#2740 from davies/batchsize and squashes the following commits:

52cdb88 [Davies Liu] update docs
185f2b9 [Davies Liu] use AutoBatchedSerializer by default
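The duplicate tracking mentioned in the commit message is, on the Python side, analogous to what `pickle` already does within a single `dumps` call: repeated references inside one batch are memoized, so serializing them together stores the shared object only once. A quick self-contained illustration (the trade-off being that a larger batch dedupes across more objects but also pins more of them in memory at once):

```python
import pickle

obj = b"x" * 1000
batch = [obj] * 10  # ten references to the same 1 KB object

# Pickling the batch in one call memoizes obj: it is stored once,
# with the remaining nine entries encoded as back-references.
together = len(pickle.dumps(batch))
# Pickling each reference separately stores the payload ten times.
separate = sum(len(pickle.dumps(o)) for o in batch)
```

Here `together` is far smaller than `separate`, which is why batching duplicated records can shrink the serialized output, at the cost of holding the whole batch live during serialization.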