[SPARK-7873] Allow KryoSerializerInstance to create multiple streams at the same time #6415

JoshRosen · 2015-05-26T17:26:37Z

This is a somewhat obscure bug, but I think that it will seriously impact KryoSerializer users who use custom registrators which disable auto-reset. When auto-reset is disabled, this breaks some of our shuffle paths, which actually end up creating multiple OutputStreams from the same shared SerializerInstance (which is unsafe).

This was introduced by a patch (SPARK-3386) which enables serializer re-use in some of the shuffle paths, since constructing new serializer instances is actually pretty costly for KryoSerializer. We had already fixed another corner-case (SPARK-7766) bug related to this, but missed this one.

I think that the root problem here is that KryoSerializerInstance can be used in a way which is unsafe even within a single thread, e.g. by creating multiple open OutputStreams from the same instance or by interleaving deserialize and deserializeStream calls. I considered a smaller patch which adds assertions to guard against this type of "misuse" but abandoned that approach after I realized how convoluted the Scaladoc became.
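To make the hazard concrete, here is a toy model of why two open streams must not share one stateful serializer when its state is never reset between writes. `ToyCodec` and `SharedCodecDemo` are hypothetical names invented for this illustration. They are not Kryo's API and only mimic the idea of a reference table that persists across calls.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a serializer with reference tracking and
// auto-reset disabled. This is NOT Kryo's API; it only models shared state.
final class ToyCodec {
  // Values already written, mapped to the id used for back-references.
  private val seen = mutable.Map.empty[String, Int]

  // Emits the full value the first time and a back-reference afterwards.
  def write(value: String, out: mutable.Buffer[String]): Unit =
    seen.get(value) match {
      case Some(id) => out += s"ref:$id"
      case None =>
        seen(value) = seen.size
        out += s"val:$value"
    }
}

object SharedCodecDemo {
  // Writes the same value to two independent outputs using the given codecs.
  def run(first: ToyCodec, second: ToyCodec): (List[String], List[String]) = {
    val out1 = mutable.Buffer.empty[String]
    val out2 = mutable.Buffer.empty[String]
    first.write("a", out1)
    second.write("a", out2)
    (out1.toList, out2.toList)
  }
}
```

When both outputs share one `ToyCodec`, the second output contains only a back-reference to a value that was written to the first output, so a reader of the second output alone cannot resolve it. With one codec per output, each output is self-contained.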

This patch fixes this bug by making it legal to create multiple streams from the same KryoSerializerInstance. Internally, KryoSerializerInstance now implements a borrowKryo() / releaseKryo() API that's backed by a "pool" of capacity 1. Each call to a KryoSerializerInstance method will borrow the Kryo, do its work, then release the serializer instance back to the pool. If the pool is empty and we need an instance, it will allocate a new Kryo on-demand. This makes it safe for multiple OutputStreams to be opened from the same serializer. If we try to release a Kryo back to the pool but the pool already contains a Kryo, then we'll just discard the new Kryo. I don't think there's a clear benefit to having a larger pool since our usages tend to fall into two cases, a) where we only create a single OutputStream and b) where we create a huge number of OutputStreams with the same lifecycle, then destroy the KryoSerializerInstance (this is what's happening in the bypassMergeSort code path that my regression test hits).
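The borrow / release scheme described above can be sketched roughly as follows. This is a minimal illustration, not the code in the patch: `FakeKryo`, `PooledInstance`, `ToyStream`, and `createdCount` are hypothetical names, and `FakeKryo` stands in for a real Kryo object so that the example has no external dependencies.

```scala
// Hypothetical stand-in for an expensive-to-construct Kryo instance.
final class FakeKryo(val id: Int)

// Sketch of a serializer instance backed by a "pool" of capacity 1.
final class PooledInstance {
  private var created = 0
  // The single pool slot: holds one idle instance, or null when empty.
  private var cached: FakeKryo = null

  // Number of instances allocated so far (for illustration only).
  def createdCount: Int = created

  // Hands out the cached instance if present, otherwise allocates on demand.
  def borrowKryo(): FakeKryo =
    if (cached != null) {
      val k = cached
      cached = null
      k
    } else {
      created += 1
      new FakeKryo(created)
    }

  // Returns an instance to the pool; if the slot is occupied, it is discarded.
  def releaseKryo(k: FakeKryo): Unit =
    if (cached == null) cached = k

  // Every stream borrows its own instance for its whole lifetime.
  def serializeStream(): ToyStream = new ToyStream(this)
}

final class ToyStream(owner: PooledInstance) {
  private var kryo: FakeKryo = owner.borrowKryo()
  // Captured at construction so it stays readable after close().
  val kryoId: Int = kryo.id

  // Releases the borrowed instance exactly once; repeated calls are no-ops.
  def close(): Unit =
    if (kryo != null) {
      owner.releaseKryo(kryo)
      kryo = null
    }
}
```

In this sketch, two streams opened at the same time get distinct instances, and after both are closed only one instance is retained for the next borrower. The null check in `close()` is one way to keep a duplicate `close()` from releasing the same instance twice.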

SparkQA · 2015-05-26T17:39:53Z

Test build #33528 has finished for PR 6415 at commit 7350886.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

pwendell · 2015-05-26T18:14:11Z

/cc @rxin

pwendell · 2015-05-26T18:15:03Z

Jenkins, retest this please.

JoshRosen · 2015-05-26T18:15:40Z

Pushed a new test demonstrating another problem: back-to-back deserialize / deserializeStream calls aren't safe, even if you close the stream first. This was a problem in the old code, too, and can lead to silent data corruption.

SparkQA · 2015-05-26T19:30:00Z

Test build #33529 has finished for PR 6415 at commit 9816e8f.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- implicit class DslLogicalPlan(val logicalPlan: LogicalPlan)

SparkQA · 2015-05-26T20:58:20Z

Test build #33534 has finished for PR 6415 at commit ab457ca.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class KryoSerializationStream(
- class KryoDeserializationStream(

JoshRosen · 2015-05-26T22:01:34Z

Alright, this should be ready for a first pass of reviews. I'm going to work on updating the comments shortly.

JoshRosen · 2015-05-26T22:02:08Z

Whoops, meant to post this comment earlier:

I think that the root problem here is that KryoSerializerInstance can be used in a way which is unsafe even within a single thread, e.g. by creating multiple open OutputStreams from the same instance or by interleaving deserialize and deserializeStream calls. I considered a smaller patch which adds assertions to guard against this type of "misuse" but abandoned that approach after I realized how convoluted the Scaladoc became.

I just pushed a WIP commit which illustrates my proposed fix. In a nutshell, I think that rather than using a single Kryo instance in KryoSerializerInstance, we should implement a borrowKryo() / releaseKryo() API that's backed by a "pool" of capacity 1. In the common case, every call to a KryoSerializerInstance method will borrow the Kryo, do its work, then release the serializer instance back to the pool. If the pool is empty and we need an instance, we'll allocate a new Kryo on-demand. This makes it safe for multiple OutputStreams to be opened from the same serializer. If we try to release a Kryo back to the pool but the pool already contains a Kryo, then we'll just discard the new Kryo. I don't think there's a clear benefit to having a larger pool since our usages tend to fall into two cases, a) where we only create a single OutputStream and b) where we create a huge number of OutputStreams with the same lifecycle, then destroy the KryoSerializerInstance (this is what's happening in the bypassMergeSort code path that my failing test hits).

SparkQA · 2015-05-26T23:28:18Z

Test build #33542 has finished for PR 6415 at commit 3f1da96.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class KryoSerializationStream(
- class KryoDeserializationStream(

zsxwing · 2015-05-27T00:45:21Z

LGTM

JoshRosen · 2015-05-27T00:47:12Z

Thanks for reviewing. Don't merge this yet; I need to update the description and comments.

JoshRosen · 2015-05-27T18:52:44Z

I've revised the pull request description and pushed a new commit which adds a few more comments.

SparkQA · 2015-05-27T19:03:18Z

Test build #33603 has finished for PR 6415 at commit ba55d20.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class KryoSerializationStream(
- class KryoDeserializationStream(

JoshRosen · 2015-05-27T20:02:39Z

Jenkins, retest this please.

SparkQA · 2015-05-27T21:21:21Z

Test build #33610 has finished for PR 6415 at commit ba55d20.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-05-27T23:56:09Z

Test build #33620 has finished for PR 6415 at commit 00b402e.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class KryoSerializationStream(
- class KryoDeserializationStream(

JoshRosen · 2015-05-27T23:56:54Z

Jenkins, retest this please.

JoshRosen · 2015-05-27T23:56:59Z

Flaky Python test...

SparkQA · 2015-05-28T01:51:17Z

Test build #33628 has finished for PR 6415 at commit 00b402e.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class KryoSerializationStream(
- class KryoDeserializationStream(

JoshRosen · 2015-05-28T02:52:26Z

@zsxwing @pwendell unless you have additional feedback, I think that this should now be good to go for the next RC.

zsxwing · 2015-05-28T02:53:21Z

LGTM

pwendell · 2015-05-28T03:18:34Z

Okay, merging this - thanks guys.

…at the same time

This is a somewhat obscure bug, but I think that it will seriously impact KryoSerializer users who use custom registrators which disabled auto-reset. When auto-reset is disabled, then this breaks things in some of our shuffle paths which actually end up creating multiple OutputStreams from the same shared SerializerInstance (which is unsafe).

This was introduced by a patch (SPARK-3386) which enables serializer re-use in some of the shuffle paths, since constructing new serializer instances is actually pretty costly for KryoSerializer. We had already fixed another corner-case (SPARK-7766) bug related to this, but missed this one.

I think that the root problem here is that KryoSerializerInstance can be used in a way which is unsafe even within a single thread, e.g. by creating multiple open OutputStreams from the same instance or by interleaving deserialize and deserializeStream calls. I considered a smaller patch which adds assertions to guard against this type of "misuse" but abandoned that approach after I realized how convoluted the Scaladoc became.

This patch fixes this bug by making it legal to create multiple streams from the same KryoSerializerInstance. Internally, KryoSerializerInstance now implements a `borrowKryo()` / `releaseKryo()` API that's backed by a "pool" of capacity 1. Each call to a KryoSerializerInstance method will borrow the Kryo, do its work, then release the serializer instance back to the pool. If the pool is empty and we need an instance, it will allocate a new Kryo on-demand. This makes it safe for multiple OutputStreams to be opened from the same serializer. If we try to release a Kryo back to the pool but the pool already contains a Kryo, then we'll just discard the new Kryo.

I don't think there's a clear benefit to having a larger pool since our usages tend to fall into two cases, a) where we only create a single OutputStream and b) where we create a huge number of OutputStreams with the same lifecycle, then destroy the KryoSerializerInstance (this is what's happening in the bypassMergeSort code path that my regression test hits).

Author: Josh Rosen <joshrosen@databricks.com>

Closes #6415 from JoshRosen/SPARK-7873 and squashes the following commits:

00b402e [Josh Rosen] Initialize eagerly to fix a failing test
ba55d20 [Josh Rosen] Add explanatory comments
3f1da96 [Josh Rosen] Guard against duplicate close()
ab457ca [Josh Rosen] Sketch a loan/release based solution.
9816e8f [Josh Rosen] Add a failing test showing how deserialize() and deserializeStream() can interfere.
7350886 [Josh Rosen] Add failing regression test for SPARK-7873

(cherry picked from commit 852f4de)
Signed-off-by: Patrick Wendell <patrick@databricks.com>


Add failing regression test for SPARK-7873

7350886

JoshRosen mentioned this pull request May 26, 2015

[SPARK-7766] KryoSerializerInstance reuse is unsafe when auto-reset is disabled #6293

Closed

Add a failing test showing how deserialize() and deserializeStream() can interfere.

9816e8f

Sketch a loan/release based solution.

ab457ca

This makes it safe to invoke all SerializerInstance methods at any time, including the creation of multiple open OutputStreams from the same KryoSerializerInstance.

Guard against duplicate close()

3f1da96

Add explanatory comments

ba55d20

JoshRosen changed the title ~~[SPARK-7873] [WIP] Fix another bug related to KryoSerializerInstance re-use in sort-shuffle~~ [SPARK-7873] Fix another bug related to KryoSerializerInstance re-use in sort-shuffle May 27, 2015

JoshRosen changed the title ~~[SPARK-7873] Fix another bug related to KryoSerializerInstance re-use in sort-shuffle~~ [SPARK-7873] Allow KryoSerializerInstance to create multiple streams at the same time May 27, 2015

Initialize eagerly to fix a failing test

00b402e

asfgit closed this in 852f4de May 28, 2015

JoshRosen deleted the SPARK-7873 branch July 7, 2017 23:54
