[SPARK-17491] Close serialization stream to fix wrong answer bug in putIteratorAsBytes() #15043

Conversation

@JoshRosen (Contributor) commented Sep 10, 2016

What changes were proposed in this pull request?

MemoryStore.putIteratorAsBytes() may silently lose values when used with KryoSerializer because it does not properly close the serialization stream before attempting to deserialize the already-serialized values, which may cause values buffered in Kryo's internal buffers to not be read.

This is the root cause behind a user-reported "wrong answer" bug in PySpark caching reported by @bennoleslie on the Spark user mailing list in a thread titled "pyspark persist MEMORY_ONLY vs MEMORY_AND_DISK". Due to Spark 2.0's automatic use of KryoSerializer for "safe" types (such as byte arrays, primitives, etc.), this misuse of serializers manifested itself as silent data corruption rather than a StreamCorruptedException (which you might get from JavaSerializer).

The minimal fix, implemented here, is to close the serialization stream before attempting to deserialize written values. In addition, this patch adds several additional assertions / precondition checks to prevent misuse of PartiallySerializedBlock and ChunkedByteBufferOutputStream.
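To illustrate the failure mode, here is a minimal, self-contained sketch (not Spark code) that uses java.io buffering as a stand-in for Kryo's internal buffer: bytes still sitting in the serializer's buffer are invisible to a reader that consumes the underlying output before close() is called, so trailing records silently disappear.

```scala
import java.io._

object CloseBeforeReadBack {
  // Read back as many ints as the byte array actually contains.
  private def readInts(bytes: Array[Byte]): Seq[Int] = {
    val in = new DataInputStream(new ByteArrayInputStream(bytes))
    val result = Seq.newBuilder[Int]
    try { while (true) result += in.readInt() } catch { case _: EOFException => }
    result.result()
  }

  def main(args: Array[String]): Unit = {
    val underlying = new ByteArrayOutputStream()
    // Stand-in for a SerializationStream with internal buffering (like Kryo's Output buffer).
    val serStream = new DataOutputStream(new BufferedOutputStream(underlying, 4096))
    (1 to 1000).foreach(i => serStream.writeInt(i))

    // Bug: reading the underlying bytes before close() misses whatever is still buffered.
    val withoutClose = readInts(underlying.toByteArray)

    // Fix: close (and therefore flush) the serialization stream first, then read back.
    serStream.close()
    val withClose = readInts(underlying.toByteArray)

    // Prints: records read without close(): 0, with close(): 1000
    println(s"records read without close(): ${withoutClose.size}, with close(): ${withClose.size}")
  }
}
```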

How was this patch tested?

The original bug was masked by an invalid assert in the memory store test cases: the old assert compared two results record-by-record with zip but didn't first check that the lengths of the two collections were equal, causing missing records to go unnoticed. The updated test case reproduced this bug.
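For illustration only (hypothetical values, not the actual suite), the flaw in the old assertion and the stricter check look roughly like this:

```scala
val expected = Seq(1, 2, 3, 4, 5)
val actual   = Seq(1, 2, 3)  // suppose some trailing records were silently dropped

// Old-style check: zip truncates to the shorter collection, so missing records pass unnoticed.
expected.zip(actual).foreach { case (e, a) => assert(e == a) }

// Checking the lengths (or comparing the whole collections) catches the missing records.
assert(expected.length == actual.length)
assert(expected == actual)
```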

In addition, I added a new PartiallySerializedBlockSuite to unit test that component.

@SparkQA commented Sep 10, 2016

Test build #65199 has finished for PR 15043 at commit 35a32e7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@lins05 (Contributor) commented Sep 10, 2016

Did a simple test and it does fix the bug. One interesting thing (without this patch) is that while records.count() returns a smaller number than the actual count, the Spark UI still shows the correct record count; in my test case it was 2999808 vs. 3000000.

[Two Spark UI screenshots showing the record counts]

@JoshRosen (Contributor, Author) commented Sep 12, 2016

#15056 also touches this code and creates a new test suite for this component so I'd prefer to merge that PR first.

Edit: upon further inspection I think these could be merged independently.

@@ -782,6 +785,9 @@ private[storage] class PartiallySerializedBlock[T](
   * `close()` on it to free its resources.
   */
  def valuesIterator: PartiallyUnrolledIterator[T] = {
+    // Close the serialization stream so that the serializer's internal buffers are freed and any
+    // "end-of-stream" markers can be written out so that `unrolled` is a valid serialized stream.
+    serializationStream.close()

Reviewer comment (Contributor):

It seems like `unrolled` may basically be invalid until serializationStream.close() is called.

But it looks like valuesIterator is not the only place where unrolled is used.

@SparkQA commented Sep 14, 2016

Test build #65341 has finished for PR 15043 at commit 2f43e69.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 14, 2016

Test build #65393 has finished for PR 15043 at commit ccf929f.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen (Contributor, Author) commented:

Jenkins, retest this please.

@SparkQA commented Sep 14, 2016

Test build #65390 has finished for PR 15043 at commit c4e50e6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 14, 2016

Test build #65397 has finished for PR 15043 at commit ccf929f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 15, 2016

Test build #65407 has finished for PR 15043 at commit daf447b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}
consumed = true

@srinathshankar (Contributor) Sep 15, 2016:

Hmm, why set consumed = true here? What's the problem with calling verifyNotConsumedAndNotDiscarded() twice?

@JoshRosen (Author):

I was being overly clever; it's clearer to just set consumed = true after each of the verifyNotConsumedAndNotDiscarded() call sites.
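A small, hypothetical sketch of the guard pattern being discussed (simplified class and messages, not the actual PartiallySerializedBlock code), with consumed set at each call site:

```scala
// Simplified stand-in for a block that may be read back exactly once and never after discard().
class GuardedBlock {
  private var consumed = false
  private var discarded = false

  private def verifyNotConsumedAndNotDiscarded(): Unit = {
    require(!consumed, "can only call one of finishWritingToStream() or valuesIterator(), and only once")
    require(!discarded, "cannot use a discarded block")
  }

  def finishWritingToStream(): Unit = {
    verifyNotConsumedAndNotDiscarded()
    consumed = true  // set at the call site rather than inside the verify helper
    // ... stream out the serialized chunks ...
  }

  def valuesIterator(): Iterator[Int] = {
    verifyNotConsumedAndNotDiscarded()
    consumed = true
    Iterator.empty  // ... deserialize and return the values ...
  }

  def discard(): Unit = {
    discarded = true
    // ... free buffers and unroll memory ...
  }
}
```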

serializationStream = serializationStream,
redirectableOutputStream,
unrollMemory = unrollMemory,
memoryMode = MemoryMode.ON_HEAP,

@srinathshankar (Contributor) Sep 15, 2016:

What happens if the memory mode is OFF_HEAP? Is that relevant?

@JoshRosen (Author):

It should only affect the memory accounting for unroll memory. If you're caching a block at a serialized storage level and are using off-heap caching, then it's possible for the unrolled memory to be off-heap (so the ChunkedByteBufferOutputStream will be using a DirectBuffer allocator). In this case we need to count this as off-heap unroll memory so that Spark's off-heap allocations can respect the configured off-heap memory limit.

Given that off-heap caching (and thus off-heap unrolling) is a relatively new experimental feature, it's entirely possible that there are accounting bugs within this path. I'm going to try to expand this test suite to also exercise that case just to be 100% sure that we're accounting properly.
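A hedged sketch of that accounting rule (illustrative types; the enum mirrors Spark's MemoryMode but the rest is not Spark's API): the allocator backing the unroll buffer follows the memory mode, so the reserved unroll memory has to be charged to that same mode.

```scala
import java.nio.ByteBuffer

object UnrollAccountingSketch {
  sealed trait MemoryMode
  case object OnHeap extends MemoryMode
  case object OffHeap extends MemoryMode

  // The buffer allocator is chosen by memory mode: heap ByteBuffers vs. direct (off-heap) buffers.
  def allocatorFor(mode: MemoryMode): Int => ByteBuffer = mode match {
    case OnHeap  => size => ByteBuffer.allocate(size)
    case OffHeap => size => ByteBuffer.allocateDirect(size)
  }

  // Whichever mode allocated the buffer is the pool that must be charged for unroll memory;
  // charging the on-heap pool for a direct buffer would let off-heap usage exceed its limit.
  def reserveUnroll(mode: MemoryMode, bytes: Long): Unit =
    println(s"charging $bytes bytes of unroll memory to the $mode pool")

  def main(args: Array[String]): Unit = {
    val mode: MemoryMode = OffHeap
    val chunk = allocatorFor(mode)(4 * 1024 * 1024)
    reserveUnroll(mode, chunk.capacity().toLong)
  }
}
```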

new PartiallyUnrolledIterator(
memoryStore,
unrollMemory,
unrolled = CompletionIterator[T, Iterator[T]](unrolledIter, discard()),
unrolled = CompletionIterator[T, Iterator[T]](unrolledIter, unrolledBuffer.dispose()),

@srinathshankar (Contributor) Sep 15, 2016:

Why the change from discard() to dispose()? You've made discard() idempotent, right? Does the caller have to manually release memory after the iterator is consumed?

@JoshRosen (Author):

There was a subtle bug in the old code where the call to dispose() would end up freeing unroll memory for the buffer, but that same memory would also be freed by PartiallyUnrolledIterator itself. This was exposed by the Mockito verify calls in the new test suite.

Given that unrolledBuffer.toInputStream(dispose = true) will also handle disposal of the buffer, I don't think we even need this CompletionIterator here. Let me see about removing it.
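Roughly how a Mockito verify surfaces such a double free (hypothetical trait and sizes; the real suite verifies MemoryStore.releaseUnrollMemoryForThisTask): the release must happen exactly once per buffer.

```scala
import org.mockito.Mockito

object DoubleFreeCheck {
  trait UnrollMemoryReleaser {
    def releaseUnroll(bytes: Long): Unit
  }

  def main(args: Array[String]): Unit = {
    val releaser = Mockito.mock(classOf[UnrollMemoryReleaser])

    releaser.releaseUnroll(1024L) // freed once the partially-unrolled iterator is fully consumed
    releaser.releaseUnroll(1024L) // freed again by the buffer's dispose/discard path: the bug

    // Fails with TooManyActualInvocations because the release happened twice.
    Mockito.verify(releaser, Mockito.times(1)).releaseUnroll(1024L)
  }
}
```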

Mockito.verify(partiallySerializedBlock.invokePrivate(getRedirectableOutputStream())).close()

val deserializedItems = valuesIterator.toArray.toSeq
Mockito.verify(memoryStore).releaseUnrollMemoryForThisTask(

Reviewer comment (Contributor):

What's the code path that makes releaseUnroll be called, given that the completion callback returned by valuesIterator has changed from discard to dispose?

@JoshRosen (Author):

Any non-freed unroll memory will be automatically freed at the end of the task (as part of the Executor or Task code itself). Before the task has completed, though, PartiallyUnrolledIterator will free the specified amount of unroll memory once the unrolled iterator is fully consumed.
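A minimal sketch of that "release once fully consumed" behavior (a simplified stand-in, not Spark's CompletionIterator or PartiallyUnrolledIterator): the completion callback fires exactly once, when the wrapped iterator is exhausted.

```scala
// Runs `completion` once, the first time the wrapped iterator reports it is exhausted.
class OnExhausted[A](sub: Iterator[A], completion: () => Unit) extends Iterator[A] {
  private var fired = false
  override def hasNext: Boolean = {
    val more = sub.hasNext
    if (!more && !fired) { fired = true; completion() }
    more
  }
  override def next(): A = sub.next()
}

object OnExhaustedDemo {
  def main(args: Array[String]): Unit = {
    var unrollMemoryFreed = false
    val it = new OnExhausted(Iterator(1, 2, 3), () => unrollMemoryFreed = true)

    it.foreach(_ => ())        // fully consume the iterator
    assert(unrollMemoryFreed)  // freed mid-task, before the task-completion cleanup
    println("unroll memory released on full consumption")
  }
}
```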

@JoshRosen (Author) left a comment

Thanks for the good review feedback. I think that there might indeed be a pre-existing bug related to off-heap unroll accounting here, so let me try to also catch that via strengthened test cases. I'll update this patch to address your feedback.

@JoshRosen (Contributor, Author) commented Sep 15, 2016

Alright, I've updated this to address the latest round of review feedback. I did manage to spot a memory-accounting problem with off-heap memory because PartiallyUnrolledIterator had hardcoded the use of MemoryMode.ON_HEAP.

(Whoops, didn't mean to submit this comment early; my laptop trackpad glitched out and caused a bunch of spurious mouse clicks).

@JoshRosen JoshRosen closed this Sep 15, 2016
@JoshRosen JoshRosen reopened this Sep 15, 2016
@SparkQA commented Sep 16, 2016

Test build #65465 has finished for PR 15043 at commit 0d70774.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srinathshankar (Contributor) left a comment

This looks good. I think a test dimension with memory mode = OFF_HEAP would be useful.

}
}

test("cannot call valuesIterator() after finishWritingToStream()") {

Reviewer comment (Contributor):

Minor: you can probably combine test("cannot call valuesIterator() more than once") and test("cannot call finishWritingToStream() after valuesIterator()") into one test. The same applies to the calls made after finishWritingToStream().

@@ -33,7 +33,7 @@ class PartiallyUnrolledIteratorSuite extends SparkFunSuite with MockitoSugar {
val rest = (unrollSize until restSize + unrollSize).iterator

val memoryStore = mock[MemoryStore]
-    val joinIterator = new PartiallyUnrolledIterator(memoryStore, unrollSize, unroll, rest)
+    val joinIterator = new PartiallyUnrolledIterator(memoryStore, ON_HEAP, unrollSize, unroll, rest)

Reviewer comment (Contributor):

We should look into trying to test this with OFF_HEAP as well.

@JoshRosen (Contributor, Author) commented:

Jenkins, retest this please.

@JoshRosen (Contributor, Author) commented:

I agree that we should add more off-heap tests, but I'd like to do it in another patch so that we can get this one merged faster to unblock the 2.0.1 RC.

In terms of testing off-heap, I think that one of the best high-level tests / asserts would be to strengthen the releaseUnrollMemory() checks so that inappropriately releasing unroll memory during a task throws an exception during tests. Today there are some circumstances where unroll memory can only be released at the end of a task (such as an iterator backed by an unrolled block that is only partially consumed before the task ends), so the calls to release unroll memory have been tolerant of too much memory being released (they just release min(actualMemory, requestedToRelease)). However, this is only appropriate to do at the end of the task, so we should strengthen the asserts to only allow it there; this would have caught the memory-mode mixup that I fixed here.
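A hedged sketch of the tolerant versus strengthened release logic described above (illustrative names, not Spark's MemoryManager API):

```scala
// Tracks unroll memory held by one task, in bytes.
final class UnrollPool(private var held: Long) {

  // Today's tolerant behavior: over-release is silently clamped to what is actually held.
  def releaseTolerant(requested: Long): Long = {
    val released = math.min(held, requested)
    held -= released
    released
  }

  // Strengthened behavior proposed above: over-release is only legal at task end, so during
  // the task it fails fast and would surface accounting bugs (like a memory-mode mixup) in tests.
  def releaseStrict(requested: Long, atTaskEnd: Boolean): Long = {
    require(atTaskEnd || requested <= held,
      s"tried to release $requested bytes but only $held bytes of unroll memory are held")
    releaseTolerant(requested)
  }
}
```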

I'm going to retest this and if it passes tests then I'll merge to master and branch-2.0. I'll add the new tests described above in a followup.

@SparkQA commented Sep 17, 2016

Test build #65541 has finished for PR 15043 at commit 0d70774.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen (Contributor, Author) commented:

I believe that this latest test failure is caused by a known flaky PySpark test, so I'm going to merge this now and will monitor tests afterwards.

asfgit pushed a commit that referenced this pull request Sep 17, 2016
[SPARK-17491] Close serialization stream to fix wrong answer bug in putIteratorAsBytes()

Author: Josh Rosen <joshrosen@databricks.com>

Closes #15043 from JoshRosen/partially-serialized-block-values-iterator-bugfix.

(cherry picked from commit 8faa521)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
@asfgit asfgit closed this in 8faa521 Sep 17, 2016
@JoshRosen JoshRosen deleted the partially-serialized-block-values-iterator-bugfix branch September 17, 2016 18:48
wgtmac pushed a commit to wgtmac/spark that referenced this pull request Sep 19, 2016
[SPARK-17491] Close serialization stream to fix wrong answer bug in putIteratorAsBytes()

Author: Josh Rosen <joshrosen@databricks.com>

Closes apache#15043 from JoshRosen/partially-serialized-block-values-iterator-bugfix.