[SPARK-29223][SQL][SS] New option to specify timestamp on all subscribing topic-partitions in Kafka source #32609
Conversation
Mainly looks good, left some questions/nits.
case Some(json) => SpecificOffsetRangeLimit(JsonUtils.partitionOffsets(json))
case None => defaultOffsets
}
// The order below represents "preferences"
Specifying both the global timestamp and a specific timestamp per partition was added to the test (case 1. vs 2. fallback), which is good. In order to cover all "preferences", maybe we can add a case where all config options are added.
Yeah, I changed the test a bit to cover two cases: all options, and timestamp per partition vs offset. Thanks for the suggestion.
val tsStr = params(globalOffsetTimestampOptionKey).trim
try {
  val ts = tsStr.toLong
  return GlobalTimestampRangeLimit(ts)
Super nit: if we put cases 2 and 3 into an else branch, then we don't need the return statement.
I intended to avoid a deeper level of indentation, but if ~ else if ~ else would achieve the same without return. Will address.
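For illustration, here is a minimal, self-contained sketch of the control-flow change being discussed. The type names and option keys below are placeholders rather than the connector's actual ones; only the shape of the if / else if / else chain matters.

```scala
// Placeholder types standing in for the connector's range-limit classes.
sealed trait RangeLimit
case class GlobalTimestampLimit(ts: Long) extends RangeLimit
case class PerPartitionTimestampLimit(json: String) extends RangeLimit
case object DefaultLimit extends RangeLimit

// Chaining the "preferences" with if / else if / else removes the early return.
def resolveLimit(params: Map[String, String]): RangeLimit = {
  if (params.contains("globalTimestamp")) {
    // Preference 1: a single timestamp applied to all topic-partitions.
    GlobalTimestampLimit(params("globalTimestamp").trim.toLong)
  } else if (params.contains("timestampJson")) {
    // Preference 2: per-partition timestamps given as JSON.
    PerPartitionTimestampLimit(params("timestampJson"))
  } else {
    // Preference 3: fall back to the default behavior.
    DefaultLimit
  }
}
```

The try/catch around the numeric parse shown in the original snippet is omitted here for brevity; the real code still needs it to handle non-numeric input.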
/**
 * Resolves the specific offsets based on timestamp per all topic-partitions being subscribed.
 * The returned offset for each partition is the earliest offset whose timestamp is greater
 * than or equal to the given timestamp in the corresponding partition. If the matched offset
Nit: "If the matched offset doesn't exist" is a bit of an odd construct. Either it's matched or it doesn't exist.
Let me refine the wording. Basically it queries Kafka and we are explaining the mechanism, so it seems OK to say "If Kafka doesn't return the matched offset".
### Details on timestamp offset options

The returned offset for each partition is the earliest offset whose timestamp is greater than or equal to the given timestamp in the corresponding partition.
The behavior varies across options if the matched offset doesn't exist - check the description of each option.
Same as down below in the file.
The behavior varies across options if the matched offset doesn't exist - check the description of each option.

Spark simply passes the timestamp information to <code>KafkaConsumer.offsetsForTimes</code>, and doesn't interpret or reason about the value.
For more details on <code>KafkaConsumer.offsetsForTimes</code>, please refer <a href="https://kafka.apache.org/21/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html#offsetsForTimes-java.util.Map-">javadoc</a> for details.
Putting exact version information into the doc needs attention from time to time: https://kafka.apache.org/21/...
If we assume Kafka is not breaking the API, then we can put latest instead of 21, though I'm not sure Kafka has such a link.
BTW, why 21? We're on <kafka.version>2.8.0</kafka.version> and the feature requires minimum 0.10.1.0.
If I understand correctly, there's no notion of "latest", so we picked the version we used at that time. (Worth noting that the content was added when we added the timestamp offset.)
I'm OK with either raising the version to the one we use in 3.2 or lowering the version to the minimum.
I've double-checked and found no "latest". I think from a maintenance perspective it would be best to lower the version to the minimum. In that case the API must remain the same and we won't have to touch it from time to time. WDYT?
That makes sense. Hopefully the Kafka community maintains all versions of the doc, so the risk of a broken link is relatively low. I'll update the link to 0.10.1.x. Thanks!
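As an editorial aside on the doc excerpt above: Spark hands the timestamp to KafkaConsumer.offsetsForTimes without interpreting it. A minimal standalone sketch of that call follows; the helper function, consumer, and partition list are hypothetical, only offsetsForTimes itself is Kafka's API, and the collection converters assume Scala 2.13.

```scala
import java.{util => ju}
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.consumer.{KafkaConsumer, OffsetAndTimestamp}
import org.apache.kafka.common.TopicPartition

// Hypothetical helper: apply one timestamp to every subscribed topic-partition.
def offsetsForGlobalTimestamp(
    consumer: KafkaConsumer[Array[Byte], Array[Byte]],
    partitions: Seq[TopicPartition],
    timestamp: Long): Map[TopicPartition, Option[OffsetAndTimestamp]] = {
  // Build the per-partition search map, repeating the same timestamp everywhere.
  val search: ju.Map[TopicPartition, java.lang.Long] =
    partitions.map(tp => tp -> java.lang.Long.valueOf(timestamp)).toMap.asJava
  // Kafka returns null for a partition when no record has a timestamp greater
  // than or equal to the requested one; callers decide whether that is an
  // error or a fallback, which is exactly where the option behaviors differ.
  consumer.offsetsForTimes(search).asScala.toMap.map {
    case (tp, result) => tp -> Option(result)
  }
}
```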
@@ -370,16 +384,11 @@ The following configurations are optional:
<td>none (the value of <code>startingOffsets</code> will apply)</td>
Change to "next preference is ..." for consistency?
Nice finding!
looks okay overall.
When the hardcoded version discussion is resolved, it's good to go from my perspective.
LGTM.
LGTM
 * @param timestamp the timestamp.
 * @param failsOnNoMatchingOffset whether to fail the query when no matched offset can be found.
 */
def fetchGlobalTimestampBasedOffsets(timestamp: Long,
super nit for the code style, maybe a new line for the params?
Nice finding! Fixed it.
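For reference, the wrapped-parameter style being asked for would look roughly like the sketch below. The second parameter name is taken from the scaladoc in the excerpt; the return type and body are placeholders so the sketch compiles standalone.

```scala
// Placeholder return type; the real method returns the connector's offset type.
case class ResolvedOffsets(offsets: Map[String, Long])

def fetchGlobalTimestampBasedOffsets(
    timestamp: Long,
    failsOnNoMatchingOffset: Boolean): ResolvedOffsets = {
  // Real lookup elided; the point is only the wrapped parameter list above.
  ResolvedOffsets(Map.empty)
}
```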
I'm merging this as I got approvals with comments for a few nits, and I addressed all of them. Thanks all for the review!
lgtm too
@HeartSaVioR Thanks for this option. It makes sense to use the same timestamp across the subscribed topics. @viirya Will it be a part of the Spark 3.2.0 release?
Yes. This was merged before the 3.2 branch cut.
What changes were proposed in this pull request?
This patch is a follow-up of SPARK-26848 (#23747). In SPARK-26848, we decided to open the possibility for end users to set an individual timestamp per partition. But in many cases, specifying a timestamp represents the intention to go back to a specific timestamp and reprocess records, which should be applied to all topics and partitions.
This patch proposes a way to set a global timestamp across the topic-partitions the source is subscribing to, so that end users can easily set all offsets to a specific timestamp. To make configuring the timestamp easier, the new options only accept "a" (single) timestamp for the start/end timestamp.
New options introduced in this PR:
Both options receive the timestamp as a string.
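As an illustration of how the single global timestamp is meant to be used, a minimal sketch follows. It assumes the new starting-side option is named startingTimestamp as introduced in this PR; the broker address, topic names, and epoch-millis value are placeholders.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-global-timestamp").getOrCreate()

// One timestamp, applied uniformly to every partition of every subscribed topic.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")  // placeholder broker
  .option("subscribe", "topicA,topicB")             // placeholder topics
  .option("startingTimestamp", "1622505600000")     // epoch millis, passed as a string
  .load()
```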
There are priorities among the options regarding starting/ending offsets, as we will have three options for start offsets and another three for end offsets. The priorities are as follows:
Why are the changes needed?
The existing option to specify a timestamp as offset is quite verbose if there are a lot of partitions across topics. Suppose there are hundreds of partitions in a topic; the JSON would have to contain the same timestamp hundreds of times.
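To make the verbosity concrete, here is a sketch of the pre-existing per-partition form, assuming the option name startingOffsetsByTimestamp from the earlier SPARK-26848 work; the topic, partition count, and values are made up, and the SparkSession from the earlier sketch is reused.

```scala
// Every partition must be listed explicitly, all repeating the same timestamp.
val perPartitionJson =
  """{"topicA": {"0": 1622505600000, "1": 1622505600000, "2": 1622505600000,
    |            "3": 1622505600000, "4": 1622505600000}}""".stripMargin

val dfOld = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("subscribe", "topicA")
  .option("startingOffsetsByTimestamp", perPartitionJson)
  .load()
```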
Also, the number of partitions can change, which requires either:
Neither approach is even "acceptable" if we're dealing with an ad-hoc query; nobody wants to write code more complicated than the query itself. Flink provides an option to specify a timestamp for all topic-partitions, like this PR, and doesn't even provide an option to specify the timestamp per topic-partition.
With this PR, end users are only required to provide a single timestamp value; there is no more complicated JSON format whose structure end users need to learn.
Does this PR introduce any user-facing change?
Yes, this PR introduces two new options, described in the above section.
The doc changes are as follows:
How was this patch tested?
New UTs covering the new functionality. Also manually tested via simple batch & streaming queries.
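For context, this is the kind of ad-hoc batch query such manual testing would involve; the option names startingTimestamp and endingTimestamp are assumed, the broker, topic, and timestamps are placeholders, and the SparkSession from the earlier sketch is reused.

```scala
// Batch read bounded by a start and end timestamp across all partitions.
val batchDf = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("subscribe", "topicA")
  .option("startingTimestamp", "1622505600000")
  .option("endingTimestamp", "1622592000000")
  .load()

batchDf.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)").show()
```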