JVM IPC Deserialization uses BinaryFormatter, which is now Deprecated for OWASP CWE #1131

**Describe the bug**
We'll be getting SYSLIB0011 (BinaryFormatter obsoletion) errors because of the way Broadcast, Worker, and RDD format their streams.

My current understanding:

* .NET Spark communicates with Spark workers and makes broadcast variables JVM compatible by converting CLR objects to JVM objects.
* This is all currently performed raw via BinaryFormatter.
* Untrusted binary deserialization of data with undefined schema can result in RCE.
* The guidance is "go to xml or json/bson", but we can't just do IPC with those formats?
* In theory, protobuf/Arrow could give us whatever level of control over individual bytes is necessary to continue byte-level IPC with the JVM.
* The missing piece is a definition of the stream format that would go over the socket?
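
To make the "definition of the stream format" idea concrete, here is a minimal sketch of an explicit, schema-first frame written from the JVM side. The layout (one type tag byte, a 4-byte big-endian length, then the payload) and the names `FrameCodec`, `TAG_INT`, `TAG_STRING`, `TAG_BYTES`, and `MAX_PAYLOAD` are assumptions made up for illustration. This is not the protocol .NET Spark currently speaks.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical frame layout (an assumption, not the real .NET Spark wire format):
//   [1 byte type tag][4 byte big-endian payload length][payload bytes]
final class FrameCodec {
    static final byte TAG_INT = 1;
    static final byte TAG_STRING = 2;
    static final byte TAG_BYTES = 3;
    static final int MAX_PAYLOAD = 16 * 1024 * 1024; // refuse oversized frames up front

    static byte[] encodeInt(int value) {
        return frame(TAG_INT, ByteBuffer.allocate(4).putInt(value).array());
    }

    static byte[] encodeString(String value) {
        return frame(TAG_STRING, value.getBytes(StandardCharsets.UTF_8));
    }

    static byte[] encodeBytes(byte[] value) {
        return frame(TAG_BYTES, value.clone());
    }

    private static byte[] frame(byte tag, byte[] payload) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeByte(tag);
            out.writeInt(payload.length); // DataOutputStream writes big-endian
            out.write(payload);
            out.flush();
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Decodes one frame. Only the three known tags are accepted; no type
    // names are read from the stream, so the sender cannot choose what
    // gets instantiated.
    static Object decode(byte[] frame) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(frame));
            byte tag = in.readByte();
            int length = in.readInt();
            if (length < 0 || length > MAX_PAYLOAD) {
                throw new IllegalArgumentException("bad payload length: " + length);
            }
            byte[] payload = new byte[length];
            in.readFully(payload);
            switch (tag) {
                case TAG_INT:
                    if (length != 4) {
                        throw new IllegalArgumentException("int payload must be 4 bytes");
                    }
                    return ByteBuffer.wrap(payload).getInt();
                case TAG_STRING:
                    return new String(payload, StandardCharsets.UTF_8);
                case TAG_BYTES:
                    return payload;
                default:
                    throw new IllegalArgumentException("unknown type tag: " + tag);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The point of the sketch is that the reader decides which types exist. The stream only carries a tag from a closed set, which is the property BinaryFormatter lacks. A protobuf or Arrow schema would play the same role with a real specification behind it.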

Historical example: jni4net (also uses BinaryFormatter, but has some level of struct definition):
https://github.com/jni4net/jni4net/blob/ac2189c37253710e7b729797631419b0bf3b8559/jni4net.tested.n/src/generated/net/sf/jni4net/tested/JavaInstanceFields.generated.cs#L16

Notes: 
Can we write directly to MemoryStream?
https://github.com/dotnet/spark/pull/1112#discussion_r1094785106

"passing protobufs between Java and C using JNI": 
https://medium.com/@dhaval.durve/passing-protobufs-between-java-and-native-c-code-using-jni-9808b60f6d2c

An equivalent CVE in PySpark, and the object filter used to resolve it:
https://security.snyk.io/vuln/SNYK-PYTHON-PYSPARK-3021140

https://github.com/apache/spark/pull/18166/files#diff-6a1d1601920af68466d7c30dc02170468abbe408138734c00d50d2ba1b81ba35R179
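
For reference, the general shape of an allow-list object filter on the JVM side can be sketched as below. This is written in the spirit of the filter linked above and is not a copy of it. The class name `AllowListObjectInputStream` and its helper methods are hypothetical. Java 9+ also ships `java.io.ObjectInputFilter` for the same purpose.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InvalidClassException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamClass;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.Set;

// Hypothetical allow-list stream: refuses to resolve any class whose name
// is not explicitly permitted, before an instance of it is created.
final class AllowListObjectInputStream extends ObjectInputStream {
    private final Set<String> allowedClassNames;

    AllowListObjectInputStream(InputStream in, Set<String> allowedClassNames) throws IOException {
        super(in);
        this.allowedClassNames = allowedClassNames;
    }

    @Override
    protected Class<?> resolveClass(ObjectStreamClass desc)
            throws IOException, ClassNotFoundException {
        // Called for each class descriptor in the stream, including
        // serializable superclasses, so those must be listed as well.
        if (!allowedClassNames.contains(desc.getName())) {
            throw new InvalidClassException(desc.getName(), "class not in allow list");
        }
        return super.resolveClass(desc);
    }

    // Convenience helper for the sketch: plain Java serialization to bytes.
    static byte[] serialize(Serializable value) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
                out.writeObject(value);
            }
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Deserializes with the allow list applied; a rejected class surfaces
    // as an UncheckedIOException wrapping InvalidClassException.
    static Object deserialize(byte[] data, Set<String> allowedClassNames) {
        try (AllowListObjectInputStream in =
                 new AllowListObjectInputStream(new ByteArrayInputStream(data), allowedClassNames)) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A filter like this narrows what an attacker can instantiate, but it keeps the serialized-object-graph format. The BinaryFormatter guidance linked below argues against relying on that kind of restriction on the .NET side, so this is more of a stopgap than an answer to the stream format question.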

BinaryFormatter Guidance: 
https://learn.microsoft.com/en-us/dotnet/standard/serialization/binaryformatter-security-guide

Arrow buffers:
https://arrow.apache.org/docs/python/ipc.html

BinaryFormatter Marshaller in protobuf-net.Grpc:
https://github.com/protobuf-net/protobuf-net.Grpc/blob/main/tests/protobuf-net.Grpc.Test.Integration/CustomMarshaller.cs

Protobuf scalar `bytes` for arbitrary byte lengths:
https://developers.google.com/protocol-buffers/docs/proto3#scalar

Wind down plan in dotnet: 
https://github.com/dotnet/designs/pull/141/commits/bd0a0661f9d248ed31a354d27ad026efd6719690

"Is binary serialization inherently unsafe?" 
https://stackoverflow.com/a/66825699

PySpark's implementation of this is based on py4j; they were going to use protobuf but opted for strings:
https://github.com/py4j/py4j/blob/b4514ecd40ea121a35f9cf50bbf2ccea95354245/py4j-python/src/py4j/protocol.py#L9
https://github.com/py4j/py4j/blob/1f8a0b6dc216f16092d9c1b2556897eec8653a62/py4j-python/src/py4j/java_gateway.py#L1737
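
As a rough illustration of the "strings instead of protobuf" approach, a type-tagged text encoding could look like the sketch below. It is loosely modelled on the idea only. The class `TaggedText`, the tag letters, and the escaping rules are assumptions for this example and should not be read as py4j's actual wire format.

```java
// Hypothetical type-tagged text encoding: the first character names the
// type, the rest of the line is the value rendered as text.
final class TaggedText {
    static String encode(Object value) {
        if (value == null) return "n";
        if (value instanceof Integer) return "i" + value;
        if (value instanceof Boolean) return "b" + value;
        if (value instanceof Double) return "d" + value;
        if (value instanceof String) return "s" + escape((String) value);
        throw new IllegalArgumentException("unsupported type: " + value.getClass().getName());
    }

    static Object decode(String line) {
        if (line.isEmpty()) {
            throw new IllegalArgumentException("empty value");
        }
        String body = line.substring(1);
        switch (line.charAt(0)) {
            case 'n':
                return null;
            case 'i':
                return Integer.parseInt(body);
            case 'b':
                if (!body.equals("true") && !body.equals("false")) {
                    throw new IllegalArgumentException("bad boolean: " + body);
                }
                return Boolean.valueOf(body);
            case 'd':
                return Double.parseDouble(body);
            case 's':
                return unescape(body);
            default:
                throw new IllegalArgumentException("unknown type tag: " + line.charAt(0));
        }
    }

    // Keep each value on one line: escape backslashes first, then newlines.
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\n", "\\n");
    }

    private static String unescape(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\' && i + 1 < s.length()) {
                char next = s.charAt(++i);
                out.append(next == 'n' ? '\n' : next);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

As with an explicit binary frame, the receiver only understands a closed set of tags, so nothing in the stream can name an arbitrary type to construct. The cost is text parsing overhead and hand-written escaping.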

Though I will say... things seem to be BinaryFormatter all the way down? 
https://github.com/protobuf-net/protobuf-net/search?q=binaryformatter
