[SPARK-1912] fix compress memory issue during reduce #860
Conversation
Can one of the admins verify this patch?
Jenkins, test this please.
Merged build triggered.
Merged build started.
Merged build finished. All automated tests passed.
All automated tests passed.
@@ -329,8 +329,21 @@ private[spark] class BlockManager(
    * never deletes (recent) items.
    */
   def getLocalFromDisk(blockId: BlockId, serializer: Serializer): Option[Iterator[Any]] = {
-    diskStore.getValues(blockId, serializer).orElse(
-      sys.error("Block " + blockId + " not found on disk, though it should be"))
+    class LazyProxyIterator(f: => Iterator[Any]) extends Iterator[Any] {
Do you mind adding some inline comments on why we are doing this? (basically your pull request description)
@rxin Thanks for your advice! I have added the comment and removed the empty lines.
+    }
+
+    if (diskStore.contains(blockId)) {
+      Some(new LazyProxyIterator(diskStore.getValues(blockId, serializer).get))
Doesn't this introduce a race condition, because you're calling contains before getValues? If the block is removed in that time, you'll have a problem. It would be better to change BlockManager.dataDeserialize to use the lazy iterator.
@mateiz That's a good idea! I have moved the lazy iterator into dataDeserialize.
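For context, the diff above only shows the class header of LazyProxyIterator, so here is a minimal sketch of what such a lazy proxy iterator looks like; the body is a plausible reconstruction, not the exact committed code:

class LazyProxyIterator(f: => Iterator[Any]) extends Iterator[Any] {
  // The by-name parameter f is only evaluated on first access, so the
  // compression stream behind the real iterator is created on demand.
  lazy val proxy = f
  override def hasNext: Boolean = proxy.hasNext
  override def next(): Any = proxy.next()
}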
Jenkins, this is ok to test
Merged build triggered.
Merged build started.
Merged build finished. All automated tests passed.
All automated tests passed.
-    val stream = wrapForCompression(blockId, new ByteBufferInputStream(bytes, true))
-    serializer.newInstance().deserializeStream(stream).asIterator
+    def doWork() = {
Maybe call this getIterator
Actually it could also be clearer to write it like this:
lazy val iterator = {
val stream = wrapForCompression(blockId, new ByteBufferInputStream(bytes, true))
serializer.newInstance().deserializeStream(stream).asIterator
}
Thanks for the changes. Made a couple more comments but I think it's almost good to go.
@mateiz Does lazy val have any performance penalty compared with a plain function here?
lazy val shouldn't be worse than what you have now. Anyway maybe it's better to leave it as a function to make it clearer. But just call the function getIterator.
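Putting these review comments together, the reworked dataDeserialize should look roughly like the sketch below. This is assembled from the diff and the commit log (wrapForCompression, ByteBufferInputStream, and LazyProxyIterator all appear above), not a verbatim copy of the merged code:

def dataDeserialize(blockId: BlockId, bytes: ByteBuffer, serializer: Serializer): Iterator[Any] = {
  // Kept as a def rather than a lazy val, per the discussion above.
  def getIterator = {
    val stream = wrapForCompression(blockId, new ByteBufferInputStream(bytes, true))
    serializer.newInstance().deserializeStream(stream).asIterator
  }
  // The compression stream is now created only when the first element is read,
  // and every caller of dataDeserialize gets the laziness for free, with no
  // separate contains() check to race against.
  new LazyProxyIterator(getIterator)
}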
Merged build triggered.
Merged build started.
Merged build finished. All automated tests passed.
All automated tests passed.
I don't think there would be any perf issues with this, but it might be good to do a perf run to see if there is any perf penalty. There are possibly two perf downsides: 1. contention on the lazy val's lock, and 2. the extra branch. However, I think in this case the iterator is consumed by a single thread, so the JVM can get rid of the lock, and branch prediction should work well for the extra branch...
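For background on the "lazy val's lock" mentioned here: scalac compiles a lazy val to double-checked initialization guarded by the enclosing object's monitor. A simplified hand-written equivalent (the real encoding uses a volatile bitmap field rather than a boolean) looks like this:

class Holder(make: () => Iterator[Any]) {
  @volatile private var initialized = false
  private var cached: Iterator[Any] = _
  // Roughly what `lazy val iterator = make()` desugars to.
  def iterator: Iterator[Any] = {
    if (!initialized) {
      synchronized {
        if (!initialized) {
          cached = make()   // runs once; later calls skip the lock
          initialized = true
        }
      }
    }
    cached
  }
}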
When we need to read a compressed block, we first create a compression stream instance (LZF or Snappy) and use it to wrap that block. Say a reducer task needs to read 1000 local shuffle blocks: it first prepares to read all 1000 blocks, which means creating 1000 compression stream instances to wrap them. Initializing a compression instance allocates some memory, so having many instances alive at the same time is a problem. In practice the reducer reads the shuffle blocks one by one, so we can initialize the compression instances lazily.

Author: Wenchen Fan(Cloud) <cloud0fan@gmail.com>

Closes #860 from cloud-fan/fix-compress and squashes the following commits:

0924a6b [Wenchen Fan(Cloud)] rename 'doWork' into 'getIterator'
07f32c2 [Wenchen Fan(Cloud)] move the LazyProxyIterator to dataDeserialize
d80c426 [Wenchen Fan(Cloud)] remove empty lines in short class
2c8adb2 [Wenchen Fan(Cloud)] add inline comment
8ebff77 [Wenchen Fan(Cloud)] fix compress memory issue during reduce
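To make the memory issue concrete, here is a self-contained toy demo (hypothetical; GZIP stands in for LZF/Snappy, and the block count and sizes are made up) contrasting eager and lazy stream creation:

import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

object LazyStreamsDemo {
  // Each decompression stream allocates internal buffers on construction;
  // that per-instance allocation is the memory cost at issue in this PR.
  def gzip(data: Array[Byte]): Array[Byte] = {
    val bos = new ByteArrayOutputStream()
    val gz = new GZIPOutputStream(bos)
    gz.write(data)
    gz.close()
    bos.toByteArray
  }

  def main(args: Array[String]): Unit = {
    val blocks = Seq.fill(1000)(gzip(Array.fill(64)(1.toByte)))

    // Eager (the problem): all 1000 streams, and their buffers, alive at once.
    // val streams = blocks.map(b => new GZIPInputStream(new ByteArrayInputStream(b)))

    // Lazy (the idea behind the fix): Iterator.map is lazy, so each stream
    // exists only while its block is actually being read.
    val lazyStreams = blocks.iterator.map(b => new GZIPInputStream(new ByteArrayInputStream(b)))
    lazyStreams.foreach(_.close())
  }
}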