
Commit 0257b77

[SPARK-45532][DOCS] Restore codetabs for the Protobuf Data Source Guide
### What changes were proposed in this pull request?

This PR restores the [Protobuf Data Source Guide](https://spark.apache.org/docs/latest/sql-data-sources-protobuf.html#python)'s code tabs, which #40614 removed for markdown syntax fixes. In this PR, we introduce a hidden div to hold the markdown code-block marker, which keeps both Liquid and markdown happy.

### Why are the changes needed?

Improve doc readability and consistency.

### Does this PR introduce _any_ user-facing change?

Yes, doc change.

### How was this patch tested?

#### Doc build

![image](https://github.com/apache/spark/assets/8326978/8aefeee0-92b2-4048-a3f6-108e4c3f309d)

#### Markdown editor and viewer

![image](https://github.com/apache/spark/assets/8326978/283b0820-390a-4540-8713-647c40f956ac)

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43361 from yaooqinn/SPARK-45532.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
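As a sketch of the approach described above (the Python tab is shown; the Scala, Java, and proto tabs in the diff below follow the same shape), each tab wraps its Jekyll `{% highlight %}` block in a `codetabs` div, and the raw markdown fence is tucked into a `d-none` div. The highlighted-code comment is an illustrative placeholder, not content from this commit:

````markdown
<div class="codetabs">

<div data-lang="python" markdown="1">

<div class="d-none">
This div is only used to make markdown editor/viewer happy and does not display on web

```python
</div>

{% highlight python %}
# Python example code for this tab goes here (illustrative placeholder).
{% endhighlight %}

<div class="d-none">
```
</div>

</div>

</div>
````

On the published site the `d-none` divs are hidden by CSS, so readers only see the `{% highlight %}` output; markdown previewers, which do not evaluate Liquid, render the hidden fences as an ordinary code block instead.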
1 parent 96bac6c commit 0257b77


docs/sql-data-sources-protobuf.md

Lines changed: 150 additions & 93 deletions
@@ -18,7 +18,10 @@ license: |
   limitations under the License.
 ---
 
-Since Spark 3.4.0 release, [Spark SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html) provides built-in support for reading and writing protobuf data.
+* This will become a table of contents (this text will be scraped).
+{:toc}
+
+Since Spark 3.4.0 release, [Spark SQL](sql-programming-guide.html) provides built-in support for reading and writing protobuf data.
 
 ## Deploying
 The `spark-protobuf` module is external and not included in `spark-submit` or `spark-shell` by default.
@@ -46,45 +49,53 @@ Kafka key-value record will be augmented with some metadata, such as the ingesti
 
 Spark SQL schema is generated based on the protobuf descriptor file or protobuf class passed to `from_protobuf` and `to_protobuf`. The specified protobuf class or protobuf descriptor file must match the data, otherwise, the behavior is undefined: it may fail or return arbitrary results.
 
-### Python
+<div class="codetabs">
+
+<div data-lang="python" markdown="1">
+
+<div class="d-none">
+This div is only used to make markdown editor/viewer happy and does not display on web
+
 ```python
+</div>
+
+{% highlight python %}
+
 from pyspark.sql.protobuf.functions import from_protobuf, to_protobuf
 
-# `from_protobuf` and `to_protobuf` provides two schema choices. Via Protobuf descriptor file,
+# from_protobuf and to_protobuf provide two schema choices. Via Protobuf descriptor file,
 # or via shaded Java class.
 # give input .proto protobuf schema
-# syntax = "proto3"
+# syntax = "proto3"
 # message AppEvent {
-#   string name = 1;
-#   int64 id = 2;
-#   string context = 3;
+#   string name = 1;
+#   int64 id = 2;
+#   string context = 3;
 # }
-
-df = spark\
-  .readStream\
-  .format("kafka")\
-  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")\
-  .option("subscribe", "topic1")\
-  .load()
+df = spark\
+  .readStream\
+  .format("kafka")\
+  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")\
+  .option("subscribe", "topic1")\
+  .load()
 
 # 1. Decode the Protobuf data of schema `AppEvent` into a struct;
 # 2. Filter by column `name`;
 # 3. Encode the column `event` in Protobuf format.
 # The Protobuf protoc command can be used to generate a protobuf descriptor file for give .proto file.
-output = df\
-  .select(from_protobuf("value", "AppEvent", descriptorFilePath).alias("event"))\
-  .where('event.name == "alice"')\
-  .select(to_protobuf("event", "AppEvent", descriptorFilePath).alias("event"))
+output = df\
+  .select(from_protobuf("value", "AppEvent", descriptorFilePath).alias("event"))\
+  .where('event.name == "alice"')\
+  .select(to_protobuf("event", "AppEvent", descriptorFilePath).alias("event"))
 
 # Alternatively, you can decode and encode the SQL columns into protobuf format using protobuf
 # class name. The specified Protobuf class must match the data, otherwise the behavior is undefined:
 # it may fail or return arbitrary result. To avoid conflicts, the jar file containing the
 # 'com.google.protobuf.*' classes should be shaded. An example of shading can be found at
 # https://github.com/rangadi/shaded-protobuf-classes.
-
-output = df\
-  .select(from_protobuf("value", "org.sparkproject.spark_protobuf.protobuf.AppEvent").alias("event"))\
-  .where('event.name == "alice"')
+output = df\
+  .select(from_protobuf("value", "org.sparkproject.spark_protobuf.protobuf.AppEvent").alias("event"))\
+  .where('event.name == "alice"')
 
 output.printSchema()
 # root
@@ -94,61 +105,75 @@ output.printSchema()
 #  |    |-- context: string (nullable = true)
 
 output = output
-  .select(to_protobuf("event", "org.sparkproject.spark_protobuf.protobuf.AppEvent").alias("event"))
-
-query = output\
-  .writeStream\
-  .format("kafka")\
-  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")\
-  .option("topic", "topic2")\
-  .start()
+  .select(to_protobuf("event", "org.sparkproject.spark_protobuf.protobuf.AppEvent").alias("event"))
+
+query = output\
+  .writeStream\
+  .format("kafka")\
+  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")\
+  .option("topic", "topic2")\
+  .start()
+
+{% endhighlight %}
+
+<div class="d-none">
 ```
+</div>
+
+</div>
+
+<div data-lang="scala" markdown="1">
+
+<div class="d-none">
+This div is only used to make markdown editor/viewer happy and does not display on web
 
-### Scala
 ```scala
+</div>
+
+{% highlight scala %}
 import org.apache.spark.sql.protobuf.functions._
 
-// `from_protobuf` and `to_protobuf` provides two schema choices. Via Protobuf descriptor file,
+// `from_protobuf` and `to_protobuf` provides two schema choices. Via the protobuf descriptor file,
 // or via shaded Java class.
 // give input .proto protobuf schema
-// syntax = "proto3"
+// syntax = "proto3"
 // message AppEvent {
-//   string name = 1;
-//   int64 id = 2;
-//   string context = 3;
+//   string name = 1;
+//   int64 id = 2;
+//   string context = 3;
 // }
 
 val df = spark
-  .readStream
-  .format("kafka")
-  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
-  .option("subscribe", "topic1")
-  .load()
+  .readStream
+  .format("kafka")
+  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
+  .option("subscribe", "topic1")
+  .load()
 
 // 1. Decode the Protobuf data of schema `AppEvent` into a struct;
 // 2. Filter by column `name`;
 // 3. Encode the column `event` in Protobuf format.
 // The Protobuf protoc command can be used to generate a protobuf descriptor file for give .proto file.
 val output = df
-  .select(from_protobuf($"value", "AppEvent", descriptorFilePath) as $"event")
-  .where("event.name == \"alice\"")
-  .select(to_protobuf($"user", "AppEvent", descriptorFilePath) as $"event")
+  .select(from_protobuf($"value", "AppEvent", descriptorFilePath) as $"event")
+  .where("event.name == \"alice\"")
+  .select(to_protobuf($"user", "AppEvent", descriptorFilePath) as $"event")
 
 val query = output
-  .writeStream
-  .format("kafka")
-  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
-  .option("topic", "topic2")
-  .start()
+  .writeStream
+  .format("kafka")
+  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
+  .option("topic", "topic2")
+  .start()
 
 // Alternatively, you can decode and encode the SQL columns into protobuf format using protobuf
 // class name. The specified Protobuf class must match the data, otherwise the behavior is undefined:
 // it may fail or return arbitrary result. To avoid conflicts, the jar file containing the
 // 'com.google.protobuf.*' classes should be shaded. An example of shading can be found at
 // https://github.com/rangadi/shaded-protobuf-classes.
 var output = df
-  .select(from_protobuf($"value", "org.example.protos..AppEvent") as $"event")
-  .where("event.name == \"alice\"")
+  .select(from_protobuf($"value", "org.example.protos..AppEvent") as $"event")
+  .where("event.name == \"alice\"")
 
 output.printSchema()
 // root
@@ -160,54 +185,67 @@ output.printSchema()
 output = output.select(to_protobuf($"event", "org.sparkproject.spark_protobuf.protobuf.AppEvent") as $"event")
 
 val query = output
-  .writeStream
-  .format("kafka")
-  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
-  .option("topic", "topic2")
-  .start()
+  .writeStream
+  .format("kafka")
+  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
+  .option("topic", "topic2")
+  .start()
+
+{% endhighlight %}
+
+<div class="d-none">
 ```
+</div>
+</div>
+
+<div data-lang="java" markdown="1">
+
+<div class="d-none">
+This div is only used to make markdown editor/viewer happy and does not display on web
 
-### Java
 ```java
+</div>
+
+{% highlight java %}
 import static org.apache.spark.sql.functions.col;
 import static org.apache.spark.sql.protobuf.functions.*;
 
-// `from_protobuf` and `to_protobuf` provides two schema choices. Via Protobuf descriptor file,
+// `from_protobuf` and `to_protobuf` provides two schema choices. Via the protobuf descriptor file,
 // or via shaded Java class.
 // give input .proto protobuf schema
-// syntax = "proto3"
+// syntax = "proto3"
 // message AppEvent {
-//   string name = 1;
-//   int64 id = 2;
-//   string context = 3;
+//   string name = 1;
+//   int64 id = 2;
+//   string context = 3;
 // }
 
 Dataset<Row> df = spark
-  .readStream()
-  .format("kafka")
-  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
-  .option("subscribe", "topic1")
-  .load();
+  .readStream()
+  .format("kafka")
+  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
+  .option("subscribe", "topic1")
+  .load();
 
 // 1. Decode the Protobuf data of schema `AppEvent` into a struct;
 // 2. Filter by column `name`;
 // 3. Encode the column `event` in Protobuf format.
 // The Protobuf protoc command can be used to generate a protobuf descriptor file for give .proto file.
 Dataset<Row> output = df
-  .select(from_protobuf(col("value"), "AppEvent", descriptorFilePath).as("event"))
-  .where("event.name == \"alice\"")
-  .select(to_protobuf(col("event"), "AppEvent", descriptorFilePath).as("event"));
+  .select(from_protobuf(col("value"), "AppEvent", descriptorFilePath).as("event"))
+  .where("event.name == \"alice\"")
+  .select(to_protobuf(col("event"), "AppEvent", descriptorFilePath).as("event"));
 
 // Alternatively, you can decode and encode the SQL columns into protobuf format using protobuf
 // class name. The specified Protobuf class must match the data, otherwise the behavior is undefined:
 // it may fail or return arbitrary result. To avoid conflicts, the jar file containing the
 // 'com.google.protobuf.*' classes should be shaded. An example of shading can be found at
 // https://github.com/rangadi/shaded-protobuf-classes.
 Dataset<Row> output = df
-  .select(
-    from_protobuf(col("value"),
-      "org.sparkproject.spark_protobuf.protobuf.AppEvent").as("event"))
-  .where("event.name == \"alice\"")
+  .select(
+    from_protobuf(col("value"),
+      "org.sparkproject.spark_protobuf.protobuf.AppEvent").as("event"))
+  .where("event.name == \"alice\"")
 
 output.printSchema()
 // root
@@ -221,19 +259,28 @@ output = output.select(
     "org.sparkproject.spark_protobuf.protobuf.AppEvent").as("event"));
 
 StreamingQuery query = output
-  .writeStream()
-  .format("kafka")
-  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
-  .option("topic", "topic2")
-  .start();
+  .writeStream()
+  .format("kafka")
+  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
+  .option("topic", "topic2")
+  .start();
+
+{% endhighlight %}
+
+<div class="d-none">
 ```
+</div>
+</div>
+
+</div>
 
 ## Supported types for Protobuf -> Spark SQL conversion
+
 Currently Spark supports reading [protobuf scalar types](https://developers.google.com/protocol-buffers/docs/proto3#scalar), [enum types](https://developers.google.com/protocol-buffers/docs/proto3#enum), [nested type](https://developers.google.com/protocol-buffers/docs/proto3#nested), and [maps type](https://developers.google.com/protocol-buffers/docs/proto3#maps) under messages of Protobuf.
 In addition to the these types, `spark-protobuf` also introduces support for Protobuf `OneOf` fields. which allows you to handle messages that can have multiple possible sets of fields, but only one set can be present at a time. This is useful for situations where the data you are working with is not always in the same format, and you need to be able to handle messages with different sets of fields without encountering errors.
 
-<table class="table">
-<tr><th><b>Protobuf type</b></th><th><b>Spark SQL type</b></th></tr>
+<table class="table table-striped">
+<thead><tr><th><b>Protobuf type</b></th><th><b>Spark SQL type</b></th></tr></thead>
 <tr>
   <td>boolean</td>
   <td>BooleanType</td>
@@ -282,16 +329,12 @@ In addition to the these types, `spark-protobuf` also introduces support for Pro
   <td>OneOf</td>
   <td>Struct</td>
 </tr>
-<tr>
-  <td>Any</td>
-  <td>StructType</td>
-</tr>
 </table>
 
 It also supports reading the following Protobuf types [Timestamp](https://developers.google.com/protocol-buffers/docs/reference/google.protobuf#timestamp) and [Duration](https://developers.google.com/protocol-buffers/docs/reference/google.protobuf#duration)
 
-<table class="table">
-<tr><th><b>Protobuf logical type</b></th><th><b>Protobuf schema</b></th><th><b>Spark SQL type</b></th></tr>
+<table class="table table-striped">
+<thead><tr><th><b>Protobuf logical type</b></th><th><b>Protobuf schema</b></th><th><b>Spark SQL type</b></th></tr></thead>
 <tr>
   <td>duration</td>
   <td>MessageType{seconds: Long, nanos: Int}</td>
@@ -305,10 +348,11 @@ It also supports reading the following Protobuf types [Timestamp](https://develo
 </table>
 
 ## Supported types for Spark SQL -> Protobuf conversion
+
 Spark supports the writing of all Spark SQL types into Protobuf. For most types, the mapping from Spark types to Protobuf types is straightforward (e.g. IntegerType gets converted to int);
 
-<table class="table">
-<tr><th><b>Spark SQL type</b></th><th><b>Protobuf type</b></th></tr>
+<table class="table table-striped">
+<thead><tr><th><b>Spark SQL type</b></th><th><b>Protobuf type</b></th></tr></thead>
 <tr>
   <td>BooleanType</td>
   <td>boolean</td>
@@ -356,15 +400,23 @@ Spark supports the writing of all Spark SQL types into Protobuf. For most types,
 </table>
 
 ## Handling circular references protobuf fields
+
 One common issue that can arise when working with Protobuf data is the presence of circular references. In Protobuf, a circular reference occurs when a field refers back to itself or to another field that refers back to the original field. This can cause issues when parsing the data, as it can result in infinite loops or other unexpected behavior.
-To address this issue, the latest version of spark-protobuf introduces a new feature: the ability to check for circular references through field types. This allows users use the `recursive.fields.max.depth` option to specify the maximum number of levels of recursion to allow when parsing the schema. By default, `spark-protobuf` will not permit recursive fields by setting `recursive.fields.max.depth` to -1. However, you can set this option to 0 to 10 if needed.
+To address this issue, the latest version of spark-protobuf introduces a new feature: the ability to check for circular references through field types. This allows users use the `recursive.fields.max.depth` option to specify the maximum number of levels of recursion to allow when parsing the schema. By default, `spark-protobuf` will not permit recursive fields by setting `recursive.fields.max.depth` to -1. However, you can set this option to 0 to 10 if needed.
 
 Setting `recursive.fields.max.depth` to 0 drops all recursive fields, setting it to 1 allows it to be recursed once, and setting it to 2 allows it to be recursed twice. A `recursive.fields.max.depth` value greater than 10 is not allowed, as it can lead to performance issues and even stack overflows.
 
 SQL Schema for the below protobuf message will vary based on the value of `recursive.fields.max.depth`.
 
-```proto
-syntax = "proto3"
+<div data-lang="proto" markdown="1">
+<div class="d-none">
+This div is only used to make markdown editor/viewer happy and does not display on web
+
+```protobuf
+</div>
+
+{% highlight protobuf %}
+syntax = "proto3"
 message Person {
   string name = 1;
   Person bff = 2
@@ -376,4 +428,9 @@ message Person {
 0: struct<name: string, bff: null>
 1: struct<name string, bff: <name: string, bff: null>>
 2: struct<name string, bff: <name: string, bff: struct<name: string, bff: null>>> ...
-```
+
+{% endhighlight %}
+<div class="d-none">
+```
+</div>
+</div>
