* This will become a table of contents (this text will be scraped).
{:toc}
Since Spark 3.4.0 release, [Spark SQL](sql-programming-guide.html) provides built-in support for reading and writing protobuf data.
## Deploying
The `spark-protobuf` module is external and not included in `spark-submit` or `spark-shell` by default.
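If you build the session programmatically, one way to pull the module in is through the `spark.jars.packages` configuration. This is a minimal sketch, assuming Spark 3.4.0 built for Scala 2.12; the artifact coordinate must match your Spark and Scala versions.

{% highlight python %}
from pyspark.sql import SparkSession

# Resolve the external spark-protobuf artifact at session start.
# The coordinate below is an assumption (Spark 3.4.0, Scala 2.12); adjust it to
# match your deployment, or pass the equivalent --packages flag to spark-submit.
spark = (
    SparkSession.builder
    .appName("protobuf-quickstart")
    .config("spark.jars.packages", "org.apache.spark:spark-protobuf_2.12:3.4.0")
    .getOrCreate()
)
{% endhighlight %}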
Spark SQL schema is generated based on the protobuf descriptor file or protobuf class passed to `from_protobuf` and `to_protobuf`. The specified protobuf class or protobuf descriptor file must match the data, otherwise, the behavior is undefined: it may fail or return arbitrary results.
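As a quick, hedged illustration of this point (the DataFrame, message name, and descriptor path below are placeholders): the struct produced by `from_protobuf` takes its schema entirely from the descriptor, which can be inspected with `printSchema`.

{% highlight python %}
from pyspark.sql.protobuf.functions import from_protobuf

# Assumptions: `binary_df` has a binary column `value` holding encoded
# "AppEvent" messages, and the descriptor file was generated by protoc.
parsed = binary_df.select(
    from_protobuf("value", "AppEvent", "/path/to/app_event.desc").alias("event"))

# The fields nested under `event` mirror the AppEvent message definition.
parsed.printSchema()
{% endhighlight %}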
<div class="codetabs">
<div data-lang="python" markdown="1">
<div class="d-none">
This div is only used to make markdown editor/viewer happy and does not display on web
```python
</div>
{% highlight python %}
from pyspark.sql.protobuf.functions import from_protobuf, to_protobuf
# from_protobuf and to_protobuf provide two schema choices. Via Protobuf descriptor file,
# or via shaded Java class.
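# The rest of this listing is a minimal sketch. Assumptions (all placeholders):
# a Kafka topic whose binary `value` column carries Protobuf-encoded "AppEvent"
# messages, and a descriptor file generated with `protoc --descriptor_set_out`.
desc_file_path = "/path/to/app_event.desc"

df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "host1:port1")
    .option("subscribe", "topic1")
    .load()
)

# Decode the binary Kafka value into a struct, filter on a decoded field,
# then re-encode the struct back into Protobuf bytes.
output = (
    df.select(from_protobuf("value", "AppEvent", desc_file_path).alias("event"))
      .where('event.name == "alice"')
      .select(to_protobuf("event", "AppEvent", desc_file_path).alias("event"))
)
{% endhighlight %}

</div>
</div>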
## Supported types for Protobuf -> Spark SQL conversion
Currently Spark supports reading [protobuf scalar types](https://developers.google.com/protocol-buffers/docs/proto3#scalar), [enum types](https://developers.google.com/protocol-buffers/docs/proto3#enum), [nested types](https://developers.google.com/protocol-buffers/docs/proto3#nested), and [map types](https://developers.google.com/protocol-buffers/docs/proto3#maps) under Protobuf messages.
In addition to these types, `spark-protobuf` also introduces support for Protobuf `OneOf` fields, which allow you to handle messages that can have multiple possible sets of fields, but only one set present at a time. This is useful when the data you are working with is not always in the same format and you need to handle messages with different sets of fields without encountering errors.
<table class="table">
  <thead><tr><th><b>Protobuf type</b></th><th><b>Spark SQL type</b></th></tr></thead>
  <tr>
    <td>OneOf</td>
    <td>Struct</td>
  </tr>
</table>
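Below is a hedged sketch of reading a message with a `OneOf` group. The message definition (shown as a comment), the column and message names, and the descriptor path are illustrative assumptions rather than fixed names.

{% highlight python %}
from pyspark.sql.protobuf.functions import from_protobuf

# Assumed schema compiled into /path/to/payment.desc:
#   message Payment {
#     oneof method {
#       string card_number = 1;
#       string bank_account = 2;
#     }
#   }
# `df` is assumed to have a binary column `value` with encoded Payment messages.
decoded = df.select(
    from_protobuf("value", "Payment", "/path/to/payment.desc").alias("payment"))

# The OneOf group surfaces as nullable struct fields; for any given row, only
# the member that was actually set is expected to be non-null.
decoded.printSchema()
decoded.show(truncate=False)
{% endhighlight %}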
It also supports reading the Protobuf types [Timestamp](https://developers.google.com/protocol-buffers/docs/reference/google.protobuf#timestamp) and [Duration](https://developers.google.com/protocol-buffers/docs/reference/google.protobuf#duration).
## Supported types for Spark SQL -> Protobuf conversion
Spark supports writing all Spark SQL types to Protobuf. For most types, the mapping from Spark types to Protobuf types is straightforward (e.g. IntegerType gets converted to int).
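For the write path, `to_protobuf` accepts either of the two schema choices mentioned earlier. The sketch below uses the compiled Protobuf Java class variant; the class name is a placeholder, `struct_df` is assumed to have a struct column `event` matching that message, and the jar containing the (shaded) generated class is assumed to be on the classpath.

{% highlight python %}
from pyspark.sql.protobuf.functions import from_protobuf, to_protobuf

# When no descriptor file is passed, the second argument is interpreted as the
# fully qualified name of the generated Protobuf Java class (placeholder below).
encoded = struct_df.select(
    to_protobuf("event", "org.example.protos.AppEvent").alias("value"))

# Round-trip the binary column back into a struct using the same class.
decoded = encoded.select(
    from_protobuf("value", "org.example.protos.AppEvent").alias("event"))
{% endhighlight %}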
## Handling circular references in protobuf fields
One common issue that can arise when working with Protobuf data is the presence of circular references. In Protobuf, a circular reference occurs when a field refers back to itself or to another field that refers back to the original field. This can cause issues when parsing the data, as it can result in infinite loops or other unexpected behavior.
To address this issue, the latest version of `spark-protobuf` introduces a new feature: the ability to check for circular references through field types. This allows users to use the `recursive.fields.max.depth` option to specify the maximum number of levels of recursion to allow when parsing the schema. By default, `spark-protobuf` does not permit recursive fields (`recursive.fields.max.depth` is -1). However, you can set this option to a value between 0 and 10 if needed.
Setting `recursive.fields.max.depth` to 0 drops all recursive fields, setting it to 1 allows it to be recursed once, and setting it to 2 allows it to be recursed twice. A `recursive.fields.max.depth` value greater than 10 is not allowed, as it can lead to performance issues and even stack overflows.
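The option is passed through the `options` map of `from_protobuf`. A minimal sketch, with placeholder message name, descriptor path, and input DataFrame, allowing the schema to unroll two levels of recursion:

{% highlight python %}
from pyspark.sql.protobuf.functions import from_protobuf

# `df` is assumed to have a binary column `value` with encoded Person messages.
parsed = df.select(
    from_protobuf(
        "value", "Person", "/path/to/person.desc",
        options={"recursive.fields.max.depth": "2"}).alias("person"))
parsed.printSchema()
{% endhighlight %}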
The SQL schema for the protobuf message below will vary based on the value of `recursive.fields.max.depth`.
<div data-lang="proto" markdown="1">

<div class="d-none">
This div is only used to make markdown editor/viewer happy and does not display on web

```proto
</div>

{% highlight proto %}
syntax = "proto3";
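// Illustrative recursive message (an assumption, not a fixed example): the
// `friend` field refers back to its own message type, which is the shape that
// `recursive.fields.max.depth` governs.
message Person {
  string name = 1;
  Person friend = 2;
}
{% endhighlight %}
</div>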