Description
Describe the bug
We will get SYSLIB0011 (BinaryFormatter obsoletion) errors because of the way Broadcast, Worker, and RDD serialize their streams.
My current understanding:
- .NET for Apache Spark communicates with Spark workers and makes broadcast variables JVM-compatible by converting CLR objects to JVM objects.
- This is currently all done with raw BinaryFormatter serialization.
- Untrusted binary deserialization of data with an undefined schema can result in remote code execution (RCE).
- The guidance is "go to XML or JSON/BSON", but can we actually do IPC with those formats?
- In theory, protobuf/Arrow could give us whatever level of control over individual bytes is necessary to continue byte-level IPC with the JVM.
- The missing piece is a definition of the stream format that would go over the socket?
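To make the "missing piece" concrete, here is a minimal sketch of what an explicit stream format could look like: a one-byte type tag followed by a fixed-size or length-prefixed payload, with the reader rejecting anything outside the schema. It is written in Python only for brevity. The tag values, function names, and the `max_len` limit are all hypothetical and are not part of any existing .NET for Apache Spark protocol. Integers are big-endian, which is the byte order Java's `DataInputStream` reads.

```python
import io
import struct

# Hypothetical type tags, invented for this sketch.
TAG_INT64 = 1
TAG_STRING = 2
TAG_BYTES = 3


def write_value(stream, value):
    """Write one value as a 1-byte tag plus a fixed or length-prefixed payload."""
    if isinstance(value, bool):
        # bool is a subclass of int in Python; keep it out of the schema.
        raise TypeError("bool is not part of this schema")
    if isinstance(value, int):
        stream.write(struct.pack(">bq", TAG_INT64, value))
    elif isinstance(value, str):
        data = value.encode("utf-8")
        stream.write(struct.pack(">bi", TAG_STRING, len(data)))
        stream.write(data)
    elif isinstance(value, (bytes, bytearray)):
        stream.write(struct.pack(">bi", TAG_BYTES, len(value)))
        stream.write(value)
    else:
        raise TypeError(f"type {type(value).__name__} is not part of this schema")


def _read_exact(stream, n):
    """Read exactly n bytes from a buffered stream or fail."""
    data = stream.read(n)
    if len(data) != n:
        raise EOFError("truncated frame")
    return data


def read_value(stream, max_len=1 << 20):
    """Read one value; unknown tags and oversized lengths are rejected."""
    (tag,) = struct.unpack(">b", _read_exact(stream, 1))
    if tag == TAG_INT64:
        (value,) = struct.unpack(">q", _read_exact(stream, 8))
        return value
    if tag in (TAG_STRING, TAG_BYTES):
        (length,) = struct.unpack(">i", _read_exact(stream, 4))
        if length < 0 or length > max_len:
            raise ValueError("length out of bounds")
        payload = _read_exact(stream, length)
        return payload.decode("utf-8") if tag == TAG_STRING else payload
    raise ValueError(f"unknown tag {tag}")


# Round trip through an in-memory stream.
buf = io.BytesIO()
for item in (42, "broadcast", b"\x00\x01"):
    write_value(buf, item)
buf.seek(0)
decoded = [read_value(buf) for _ in range(3)]
```

The point of the sketch is that the reader only ever constructs an integer, a string, or a byte array, so no type information from the wire decides what gets instantiated. That is the property BinaryFormatter lacks.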
Historical example: jni4net (also uses BinaryFormatter, but has some level of struct definition):
https://github.com/jni4net/jni4net/blob/ac2189c37253710e7b729797631419b0bf3b8559/jni4net.tested.n/src/generated/net/sf/jni4net/tested/JavaInstanceFields.generated.cs#L16
Notes:
Can we write directly to MemoryStream?
#1112 (comment)
"passing protobufs between Java and C using JNI":
https://medium.com/@dhaval.durve/passing-protobufs-between-java-and-native-c-code-using-jni-9808b60f6d2c
An equivalent CVE in PySpark, and the object filter used to resolve it:
https://security.snyk.io/vuln/SNYK-PYTHON-PYSPARK-3021140
BinaryFormatter Guidance:
https://learn.microsoft.com/en-us/dotnet/standard/serialization/binaryformatter-security-guide
Arrow buffers:
https://arrow.apache.org/docs/python/ipc.html
BinaryFormatter marshaller in protobuf-net.Grpc:
https://github.com/protobuf-net/protobuf-net.Grpc/blob/main/tests/protobuf-net.Grpc.Test.Integration/CustomMarshaller.cs
Protobuf scalar bytes for arbitrary byte lengths:
https://developers.google.com/protocol-buffers/docs/proto3#scalar
BinaryFormatter wind-down plan in dotnet:
dotnet/designs@bd0a066
"Is binary serialization inherently unsafe?"
https://stackoverflow.com/a/66825699
PySpark's implementation of this is based on py4j; they were going to use protobuf but opted for strings:
https://github.com/py4j/py4j/blob/b4514ecd40ea121a35f9cf50bbf2ccea95354245/py4j-python/src/py4j/protocol.py#L9
https://github.com/py4j/py4j/blob/1f8a0b6dc216f16092d9c1b2556897eec8653a62/py4j-python/src/py4j/java_gateway.py#L1737
Though I will say... things seem to be BinaryFormatter all the way down?
https://github.com/protobuf-net/protobuf-net/search?q=binaryformatter