[SPARK-7766] KryoSerializerInstance reuse is unsafe when auto-reset is disabled #6293


Closed
JoshRosen wants to merge 2 commits into apache:master from JoshRosen/kryo-instance-reuse-bug

Conversation

JoshRosen
Contributor

SPARK-3386 / #5606 modified the shuffle write path to re-use serializer instances across multiple calls to DiskBlockObjectWriter. It turns out that this introduced a very rare bug when using KryoSerializer: if auto-reset is disabled and reference-tracking is enabled, we end up re-using the same serializer instance to write multiple output streams without calling reset() between writes. Objects serialized into one file can then contain back-references to objects in earlier files, causing errors during deserialization.

This patch fixes this bug by calling reset() at the start of serialize() and serializeStream(). I also added a regression test which demonstrates that this problem only occurs when auto-reset is disabled and reference-tracking is enabled.
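The failure mode can be illustrated with a toy Python model (not Spark's Scala code or the real Kryo API; all names here are made up for illustration): a serializer with reference tracking remembers objects it has already written and emits a back-reference the second time. If the reference table is not reset between output files, the second file contains a reference that points into the first file's table, so it cannot be deserialized on its own.

```python
# Toy model of reference-tracking serialization (illustrative only).
class ToyRefTrackingSerializer:
    def __init__(self):
        self._refs = {}  # object id -> reference index (write side)

    def reset(self):
        self._refs.clear()

    def write(self, stream, obj):
        key = id(obj)
        if key in self._refs:
            stream.append(("ref", self._refs[key]))  # back-reference only
        else:
            self._refs[key] = len(self._refs)
            stream.append(("obj", obj))              # full value

def read_stream(stream):
    """Deserialize one stream in isolation, as a shuffle reader would."""
    seen, out = [], []
    for tag, payload in stream:
        if tag == "obj":
            seen.append(payload)
            out.append(payload)
        else:  # "ref" must resolve within *this* stream
            out.append(seen[payload])  # IndexError if the ref crosses streams
    return out

ser = ToyRefTrackingSerializer()
shared = [1, 2, 3]

file1, file2 = [], []
ser.write(file1, shared)   # full value goes into file 1
# BUG: no reset() between output files
ser.write(file2, shared)   # only ("ref", 0) goes into file 2

read_stream(file1)         # fine
try:
    read_stream(file2)     # the reference points into file 1's table
except IndexError:
    print("file2 is corrupt: dangling cross-file reference")

# The fix mirrors this patch: reset() when opening each new stream.
ser.reset()
file3 = []
ser.write(file3, shared)
assert read_stream(file3) == [shared]
```

In the toy model, as in the patch, a single reset() at stream-open time is enough: within one stream, back-references stay self-contained.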

@AmplabJenkins

Merged build triggered.

@JoshRosen
Contributor Author

/cc @carrino and @rxin for review.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented May 20, 2015

Test build #33168 has started for PR 6293 at commit e19726d.

@JoshRosen JoshRosen changed the title [SPARK-7766] KryoSerializerInstance reuse is unsafe when auto-flush is disabled [SPARK-7766] KryoSerializerInstance reuse is unsafe when auto-reset is disabled May 20, 2015
@JoshRosen
Contributor Author

Whoops, the title used to read "auto-flush" instead of "auto-reset"... fixed now.

@carrino

carrino commented May 20, 2015

+1

@SparkQA

SparkQA commented May 20, 2015

Test build #33168 has finished for PR 6293 at commit e19726d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Merged build finished. Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/33168/

@JoshRosen
Contributor Author

@rxin, in terms of managing the risk of merging this fix vs. reverting the original performance optimization, do you think we should loop in a Kryo expert to look this over?

@rxin
Contributor

rxin commented May 21, 2015

Maybe send an email to kryo-users? It's pretty active there.

@sryza
Contributor

sryza commented May 21, 2015

In #4450, we check whether auto-reset is enabled and fall back to the slow path if not. It might be worth doing the same here?

@JoshRosen
Contributor Author

@sryza, I considered taking that approach but decided against it for a few reasons:

  • Calling reset() between creating serialization streams should be pretty cheap, because it's a relatively inexpensive operation and we only call it once per stream.
  • Calling reset() once per serialize() call is slightly more expensive in relative terms, but we make relatively few calls to serialize() (I think we mainly / only use Serializer.serialize for serializing closures).
  • If we guard the reset call with a check of whether auto-reset is enabled, then we probably need a boolean field to track that state rather than calling getAutoReset() each time. It might be fine to convert getAutoReset into a lazy val or to otherwise cache its return value, if there aren't any ordering / initialization concerns here.

As far as I know, auto-reset calls reset() after each object in the stream, whereas here we call it only a handful of times, so I don't expect a performance penalty. For simplicity's sake, then, we might want to avoid the extra logic / checks if they're not necessary.
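The trade-off discussed above can be sketched with another toy model (illustrative names, not Spark's actual classes): option A resets unconditionally at stream-open, as this patch does; option B caches the auto-reset flag once, like a lazy val, and guards the reset.

```python
class FakeKryo:
    """Minimal stand-in for a Kryo instance (illustrative, not the real API)."""
    def __init__(self, auto_reset):
        self._auto_reset = auto_reset
        self.resets = 0

    def get_auto_reset(self):
        return self._auto_reset

    def reset(self):
        self.resets += 1

class UnconditionalReset:
    """What this patch does: reset() once at the start of every stream."""
    def __init__(self, kryo):
        self.kryo = kryo

    def open_stream(self):
        self.kryo.reset()

class GuardedReset:
    """The alternative: cache the flag once, then guard each reset."""
    def __init__(self, kryo):
        self.kryo = kryo
        self.auto_reset = kryo.get_auto_reset()  # computed once, not per call

    def open_stream(self):
        if not self.auto_reset:  # skip when Kryo resets itself per object
            self.kryo.reset()

k1, k2 = FakeKryo(auto_reset=True), FakeKryo(auto_reset=True)
UnconditionalReset(k1).open_stream()
GuardedReset(k2).open_stream()
print(k1.resets, k2.resets)
```

The guarded variant avoids one redundant reset() per stream when auto-reset is on, at the cost of the extra state and checks described above; since reset() runs only once per stream either way, the unconditional form is the simpler choice.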

@JoshRosen
Contributor Author

I sent an email to kryo-users to see if anyone wants to help sanity-check this fix: https://groups.google.com/forum/#!topic/kryo-users/nS-QoGQ9wwE

@JoshRosen
Contributor Author

Unless there's more feedback, I'd like to merge this for 1.4 in order to fix this bug.

@rxin
Contributor

rxin commented May 22, 2015

Changes LGTM.

@JoshRosen
Contributor Author

Alright, merging to master and branch-1.4 (1.4.0). Thanks to everyone who helped review!

@asfgit asfgit closed this in eac0069 May 22, 2015
asfgit pushed a commit that referenced this pull request May 22, 2015
[SPARK-7766] KryoSerializerInstance reuse is unsafe when auto-reset is disabled

SPARK-3386 / #5606 modified the shuffle write path to re-use serializer instances across multiple calls to DiskBlockObjectWriter. It turns out that this introduced a very rare bug when using `KryoSerializer`: if auto-reset is disabled and reference-tracking is enabled, then we'll end up re-using the same serializer instance to write multiple output streams without calling `reset()` between write calls, which can lead to cases where objects in one file may contain references to objects that are in previous files, causing errors during deserialization.

This patch fixes this bug by calling `reset()` at the start of `serialize()` and `serializeStream()`. I also added a regression test which demonstrates that this problem only occurs when auto-reset is disabled and reference-tracking is enabled.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #6293 from JoshRosen/kryo-instance-reuse-bug and squashes the following commits:

e19726d [Josh Rosen] Add fix for SPARK-7766.
71845e3 [Josh Rosen] Add failing regression test to trigger Kryo re-use bug

(cherry picked from commit eac0069)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
@JoshRosen
Contributor Author

I discovered that this didn't entirely fix the problem, since there are some corner-cases where we attempt to construct multiple open OutputStreams from the same serializer instance. I've opened #6415 to demonstrate this bug and discuss possible fixes. One solution might be to just revert all of the serializer re-use related patches until we can come up with a design that's easier to prove safe.

@JoshRosen JoshRosen deleted the kryo-instance-reuse-bug branch May 26, 2015 17:29
@carrino

carrino commented May 26, 2015

Another option is to just force auto-reset to be on. I think these fixes are important enough. I'm seeing about a 13% perf hit from using reflection to create these Kryo serializers, so rolling this back is kind of the opposite of what I was hoping for.

@JoshRosen
Contributor Author

I've updated #6415 to attempt to make it safe to create multiple open serialization streams from the same KryoSerializerInstance. I think that the approach in my new patch will also help to avoid other corner-cases related to calling deserialize while a deserializeStream is already open. The fixes there will help to guard against future misuse of shared serializer instances by removing all of the corner-cases where certain method invocation orders were unsafe.
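One shape that approach can take is a borrow/release pool, sketched here as a toy model (the names and structure are illustrative; the actual design is in #6415): the first open stream borrows the cached instance, any additional concurrent stream gets a fresh instance of its own, and the cached one is returned on close. That way no two open streams ever share mutable serializer state.

```python
class ToyKryo:
    """Stand-in for a heavyweight serializer instance (illustrative only)."""
    pass

class PooledSerializerInstance:
    def __init__(self):
        self._cached = ToyKryo()   # reuse this instance when it's free
        self._available = True

    def borrow(self):
        if self._available:
            self._available = False
            return self._cached
        return ToyKryo()           # a second open stream gets its own instance

    def release(self, kryo):
        if kryo is self._cached:   # freshly created instances are discarded
            self._available = True

inst = PooledSerializerInstance()
a = inst.borrow()
b = inst.borrow()      # a is still open, so b is a separate instance
assert a is not b      # no shared mutable state between open streams
inst.release(a)
c = inst.borrow()
assert c is a          # the cached instance is reused once released
```

The common case (one stream at a time) still pays for only one instance; the overlapping-streams corner case pays for an extra allocation instead of producing corrupt output.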

jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
6 participants