Skip to content

JVM IPC Deserialization uses BinaryFormatter, which is now Deprecated for OWASP CWE #1131

Closed
@cutecycle

Description

@cutecycle

Describe the bug
We'll be getting SYSLIB0011 errors for the way the Broadcast, Worker, and RDD are formatting streams.

My current understanding:

  • .NET Spark communicates with a spark workers and makes broadcast variables JVM compatible by using a system of converting CLR objects to JVM objects.
  • This is all currently performed raw via BinaryFormatter.
  • Untrusted binary deserialization of data with undefined schema can result in RCE.
  • The guidance is "go to xml or json/bson", but we can't just do IPC with those formats?
  • In theory, protobuf/Arrow could allow us whatever level of control over the individual bytes may be necessary to continue byte-level IPC with the JVM
  • The missing piece is a definition of the stream format that would go over the socket?

Historical example: jni4net (also uses BinaryFormatter, but has a some level of struct definition:
*https://github.com/jni4net/jni4net/blob/ac2189c37253710e7b729797631419b0bf3b8559/jni4net.tested.n/src/generated/net/sf/jni4net/tested/JavaInstanceFields.generated.cs#L16

Notes:
Can we write directly to MemoryStream?
#1112 (comment)

"passing protobufs between Java and C using JNI":
https://medium.com/@dhaval.durve/passing-protobufs-between-java-and-native-c-code-using-jni-9808b60f6d2c

An equivalent of this CVE, and the object filter used to resolve it
https://security.snyk.io/vuln/SNYK-PYTHON-PYSPARK-3021140

https://github.com/apache/spark/pull/18166/files#diff-6a1d1601920af68466d7c30dc02170468abbe408138734c00d50d2ba1b81ba35R179

BinaryFormatter Guidance:
https://learn.microsoft.com/en-us/dotnet/standard/serialization/binaryformatter-security-guide

Arrow buffers:
https://arrow.apache.org/docs/python/ipc.html

BinaryFormatter Marshaller in ProtoBuf.net:
https://github.com/protobuf-net/protobuf-net.Grpc/blob/main/tests/protobuf-net.Grpc.Test.Integration/CustomMarshaller.cs

Protobuf scalar bytes for arbitrary byte lengths:
https://developers.google.com/protocol-buffers/docs/proto3#scalar

Wind down plan in dotnet:
dotnet/designs@bd0a066

"Is binary serialization inherently unsafe?"
https://stackoverflow.com/a/66825699

pyspark's implementation of this is based on py4j; they were going to use protobuf but opted for strings
https://github.com/py4j/py4j/blob/b4514ecd40ea121a35f9cf50bbf2ccea95354245/py4j-python/src/py4j/protocol.py#L9
https://github.com/py4j/py4j/blob/1f8a0b6dc216f16092d9c1b2556897eec8653a62/py4j-python/src/py4j/java_gateway.py#L1737

Though I will say... things seem to be BinaryFormatter all the way down?
https://github.com/protobuf-net/protobuf-net/search?q=binaryformatter

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions