-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-7766] KryoSerializerInstance reuse is unsafe when auto-reset is disabled #6293
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Merged build triggered. |
Merged build started. |
Test build #33168 has started for PR 6293 at commit |
Whoops, the title used to read "auto-flush" instead of "auto-reset"... fixed now. |
+1 |
Test build #33168 has finished for PR 6293 at commit
|
Merged build finished. Test PASSed. |
Test PASSed. |
@rxin, in terms of managing risk in merging this fix vs. reverting the original perf. optimization, do you think we should loop in a Kryo expert to look this over? |
Maybe send an email to kryo-users? It's pretty active there. |
For #4450, we check if auto-reset is set to true and resort to the slow path if not. It might be worth doing the same? |
@sryza, I considered taking that approach but decided against it for a few different reasons:
As far as I know, auto-reset calls reset after each object in the stream, whereas here we're only calling it very few times, so I don't suspect that this will have a performance penalty. Therefore, for simplicity's sake we might want to avoid the extra logic / checks if they're not necessary. |
I sent an email to kro-users to see if anyone wants to help sanity-check this fix: https://groups.google.com/forum/#!topic/kryo-users/nS-QoGQ9wwE |
Unless there's more feedback, I'd like to merge this for 1.4 in order to fix this bug. |
Changes LGTM. |
Alright, merging to master and branch-1.4 (1.4.0). Thanks to everyone who helped review! |
…s disabled SPARK-3386 / #5606 modified the shuffle write path to re-use serializer instances across multiple calls to DiskBlockObjectWriter. It turns out that this introduced a very rare bug when using `KryoSerializer`: if auto-reset is disabled and reference-tracking is enabled, then we'll end up re-using the same serializer instance to write multiple output streams without calling `reset()` between write calls, which can lead to cases where objects in one file may contain references to objects that are in previous files, causing errors during deserialization. This patch fixes this bug by calling `reset()` at the start of `serialize()` and `serializeStream()`. I also added a regression test which demonstrates that this problem only occurs when auto-reset is disabled and reference-tracking is enabled. Author: Josh Rosen <joshrosen@databricks.com> Closes #6293 from JoshRosen/kryo-instance-reuse-bug and squashes the following commits: e19726d [Josh Rosen] Add fix for SPARK-7766. 71845e3 [Josh Rosen] Add failing regression test to trigger Kryo re-use bug (cherry picked from commit eac0069) Signed-off-by: Josh Rosen <joshrosen@databricks.com>
I discovered that this didn't entirely fix the problem, since there are some corner-cases where we attempt to construct multiple open OutputStreams from the same serializer instance. I've opened #6415 to demonstrate this bug and discuss possible fixes. One solution might be to just revert all of the serializer re-use related patches until we can come up with a design that's easier to prove safe. |
Another option is to just force autoclose to be on. I think these fixes are
|
I've updated #6415 to attempt to make it safe to create multiple open serialization streams from the same KryoSerializerInstance. I think that the approach in my new patch will also help to avoid other corner-cases related to calling |
…s disabled SPARK-3386 / apache#5606 modified the shuffle write path to re-use serializer instances across multiple calls to DiskBlockObjectWriter. It turns out that this introduced a very rare bug when using `KryoSerializer`: if auto-reset is disabled and reference-tracking is enabled, then we'll end up re-using the same serializer instance to write multiple output streams without calling `reset()` between write calls, which can lead to cases where objects in one file may contain references to objects that are in previous files, causing errors during deserialization. This patch fixes this bug by calling `reset()` at the start of `serialize()` and `serializeStream()`. I also added a regression test which demonstrates that this problem only occurs when auto-reset is disabled and reference-tracking is enabled. Author: Josh Rosen <joshrosen@databricks.com> Closes apache#6293 from JoshRosen/kryo-instance-reuse-bug and squashes the following commits: e19726d [Josh Rosen] Add fix for SPARK-7766. 71845e3 [Josh Rosen] Add failing regression test to trigger Kryo re-use bug
…s disabled SPARK-3386 / apache#5606 modified the shuffle write path to re-use serializer instances across multiple calls to DiskBlockObjectWriter. It turns out that this introduced a very rare bug when using `KryoSerializer`: if auto-reset is disabled and reference-tracking is enabled, then we'll end up re-using the same serializer instance to write multiple output streams without calling `reset()` between write calls, which can lead to cases where objects in one file may contain references to objects that are in previous files, causing errors during deserialization. This patch fixes this bug by calling `reset()` at the start of `serialize()` and `serializeStream()`. I also added a regression test which demonstrates that this problem only occurs when auto-reset is disabled and reference-tracking is enabled. Author: Josh Rosen <joshrosen@databricks.com> Closes apache#6293 from JoshRosen/kryo-instance-reuse-bug and squashes the following commits: e19726d [Josh Rosen] Add fix for SPARK-7766. 71845e3 [Josh Rosen] Add failing regression test to trigger Kryo re-use bug
…s disabled SPARK-3386 / apache#5606 modified the shuffle write path to re-use serializer instances across multiple calls to DiskBlockObjectWriter. It turns out that this introduced a very rare bug when using `KryoSerializer`: if auto-reset is disabled and reference-tracking is enabled, then we'll end up re-using the same serializer instance to write multiple output streams without calling `reset()` between write calls, which can lead to cases where objects in one file may contain references to objects that are in previous files, causing errors during deserialization. This patch fixes this bug by calling `reset()` at the start of `serialize()` and `serializeStream()`. I also added a regression test which demonstrates that this problem only occurs when auto-reset is disabled and reference-tracking is enabled. Author: Josh Rosen <joshrosen@databricks.com> Closes apache#6293 from JoshRosen/kryo-instance-reuse-bug and squashes the following commits: e19726d [Josh Rosen] Add fix for SPARK-7766. 71845e3 [Josh Rosen] Add failing regression test to trigger Kryo re-use bug
SPARK-3386 / #5606 modified the shuffle write path to re-use serializer instances across multiple calls to DiskBlockObjectWriter. It turns out that this introduced a very rare bug when using
KryoSerializer
: if auto-reset is disabled and reference-tracking is enabled, then we'll end up re-using the same serializer instance to write multiple output streams without callingreset()
between write calls, which can lead to cases where objects in one file may contain references to objects that are in previous files, causing errors during deserialization.This patch fixes this bug by calling
reset()
at the start ofserialize()
andserializeStream()
. I also added a regression test which demonstrates that this problem only occurs when auto-reset is disabled and reference-tracking is enabled.