[SPARK-33224][SS][WEBUI] Add watermark gap information into SS UI page #30427


Closed
wants to merge 4 commits

Conversation

HeartSaVioR
Contributor

@HeartSaVioR HeartSaVioR commented Nov 19, 2020

What changes were proposed in this pull request?

This PR proposes to add watermark gap information to the SS UI page. Please refer to the screenshots below to see what we'd like to show in the UI.

Screen Shot 2020-11-19 at 6 56 38 PM

Please note that this PR doesn't plot the watermark value itself: knowing the gap between the actual wall clock and the watermark looks more useful than the absolute value.

Why are the changes needed?

Watermark is one of the major metrics end users need to track for stateful queries. The watermark defines "when" output will be emitted in append mode, so knowing the gap between the wall clock and the watermark (derived from input data) is very helpful for setting expectations about the output.
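To make the quantity concrete, here is a dependency-free sketch (the helper name is hypothetical, not from the patch) of what "watermark gap" means: the batch timestamp minus the global watermark, both of which `StreamingQueryProgress` reports as ISO-8601 strings.

```scala
import java.time.Instant
import java.time.temporal.ChronoUnit

// Hypothetical helper mirroring what the UI plots: the gap between the
// batch timestamp and the global watermark for that batch.
def watermarkGapSeconds(batchTimestamp: String, watermark: String): Long =
  ChronoUnit.SECONDS.between(Instant.parse(watermark), Instant.parse(batchTimestamp))

// A batch triggered at 12:00:20 whose watermark is 12:00:05 shows a 15-second gap.
println(watermarkGapSeconds("2020-11-19T12:00:20Z", "2020-11-19T12:00:05Z"))
```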

Does this PR introduce any user-facing change?

Yes, SS UI query page will contain the watermark gap information.

How was this patch tested?

Basic UT added. Manually tested with two queries:

simple case

You'll see a consistent watermark gap of (15 seconds + α): 10 seconds come from the delay in the watermark definition, and 5 seconds from the trigger interval.

import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.functions.{avg, max, min, window}
import spark.implicits._ // for the $"..." syntax; both imports are implicit in spark-shell

spark.conf.set("spark.sql.shuffle.partitions", "10")

val query = spark
  .readStream
  .format("rate")
  .option("rowsPerSecond", 1000)
  .option("rampUpTime", "10s")
  .load()
  .selectExpr("timestamp", "mod(value, 100) as mod", "value")
  .withWatermark("timestamp", "10 seconds")
  .groupBy(window($"timestamp", "1 minute", "10 seconds"), $"mod")
  .agg(max("value").as("max_value"), min("value").as("min_value"), avg("value").as("avg_value"))
  .writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("5 seconds"))
  .outputMode("append")
  .start()

query.awaitTermination()

Screen Shot 2020-11-19 at 7 00 21 PM

complicated case

This randomizes the timestamp, hence producing a random watermark gap. The gap won't be smaller than 15 seconds, as described earlier.

import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.functions.{avg, max, min, window}
import spark.implicits._ // for the $"..." syntax; both imports are implicit in spark-shell

spark.conf.set("spark.sql.shuffle.partitions", "10")

val query = spark
  .readStream
  .format("rate")
  .option("rowsPerSecond", 1000)
  .option("rampUpTime", "10s")
  .load()
  .selectExpr("*", "CAST(CAST(timestamp AS BIGINT) - CAST((RAND() * 100000) AS BIGINT) AS TIMESTAMP) AS tsMod")
  .selectExpr("tsMod", "mod(value, 100) as mod", "value")
  .withWatermark("tsMod", "10 seconds")
  .groupBy(window($"tsMod", "1 minute", "10 seconds"), $"mod")
  .agg(max("value").as("max_value"), min("value").as("min_value"), avg("value").as("avg_value"))
  .writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("5 seconds"))
  .outputMode("append")
  .start()

query.awaitTermination()

Screen Shot 2020-11-19 at 6 56 47 PM

@HeartSaVioR
Contributor Author

One thing to think about: should we automatically scale the unit for the watermark gap? I just picked seconds, which looks neither too small nor too big, but if the input event time is delayed by hours the number is going to be huge. (That's definitely not a good signal, though.)
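For reference, a minimal sketch of what such auto-scaling could look like (purely hypothetical; the function and its thresholds are not part of this PR):

```scala
// Hypothetical unit scaling for the watermark gap; thresholds are arbitrary.
def scaleGap(seconds: Long): String =
  if (seconds < 120) s"$seconds s"
  else if (seconds < 7200) s"${seconds / 60} min"
  else s"${seconds / 3600} h"

println(scaleGap(15))     // small gaps stay in seconds
println(scaleGap(172800)) // hour-scale delays become readable
```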

@SparkQA

SparkQA commented Nov 19, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35952/

@gaborgsomogyi
Contributor

I've just had a quick look at this, and I don't think switching is needed. We don't do it where bytes are shown. If I were a user and suddenly saw a different axis meaning on a graph, I would be confused.

Contributor

@gaborgsomogyi gaborgsomogyi left a comment


Looks good; I haven't tested it manually yet, so that's still to come.

query: StreamingQueryUIData,
minBatchTime: Long,
maxBatchTime: Long,
jsCollector: JsCollector): NodeBuffer = {
Contributor


Not sure what complications it would cause in generateStatTable, but I think we can return Node here.

Contributor Author


Yeah, I simply copied and pasted, and wondered why it required multiple nodes (hence &+). My bad.

Contributor Author

@HeartSaVioR HeartSaVioR Nov 20, 2020


Changed to Seq[Node] as I don't see a way to instantiate empty Node.

Contributor


Option?

Contributor Author

@HeartSaVioR HeartSaVioR Nov 23, 2020


Changing to Option[Node] would force us to wrap <tr>...</tr> in an Option, whereas leaving the return type as NodeBuffer or Seq[Node] doesn't. (scala.xml.Node has an interesting design: Node extends NodeSeq.)

I think either NodeBuffer or Seq[Node] is simpler.
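A dependency-free sketch of the trade-off (using a stand-in case class instead of `scala.xml.Node`, so this is an illustration rather than the actual code): `Seq` naturally expresses "zero or more rows" and concatenates without unwrapping, which `Option` would not.

```scala
// Stand-in for scala.xml.Node, to keep the sketch free of the scala-xml dependency.
case class Node(label: String)

// Returning Seq[Node] lets an absent row contribute nothing to the table.
def watermarkRow(hasWatermark: Boolean): Seq[Node] =
  if (hasWatermark) Seq(Node("tr")) else Seq.empty

// Rows concatenate directly; no Option unwrapping needed at the call site.
val rows = watermarkRow(true) ++ watermarkRow(false)
println(rows.size)
```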

Contributor


Now I see, and I agree it isn't worth the hassle.

<tr>
<td style="vertical-align: middle;">
<div style="width: 160px;">
<div><strong>Global Watermark Gap {SparkUIUtils.tooltip("The gap between timestamp and global watermark for the batch.", "right")}</strong></div>
Contributor


I understand that timestamp here means "now", but maybe we can be more explicit.

Contributor Author


Yes. Probably better to say batch timestamp explicitly.

@@ -51,6 +53,7 @@ class UISeleniumSuite extends SparkFunSuite with WebBrowser with Matchers with B
val conf = new SparkConf()
.setMaster(master)
.setAppName("ui-test")
.set(SHUFFLE_PARTITIONS, 5)
Contributor


Just curious: is this to speed up the unit test so it doesn't start 200 tasks?

Contributor Author


Yes. Once I changed the active query a bit to have a watermark set, it struggled to make progress within 30 seconds (meaning the UI check failed, as there was no queryProgress). This fixed the issue.

Contributor


Had a similar problem before; just wanted to double-check. Thanks!

@SparkQA

SparkQA commented Nov 19, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35952/

Member

@dongjoon-hyun dongjoon-hyun left a comment


Nice. Thank you, @HeartSaVioR.

> One thing to think about: should we automatically scale the unit for watermark gap?

For your design choice: do we automatically scale the unit in the other graphs? If not, I agree with your decision (seconds).

@dongjoon-hyun
Member

cc @viirya

@gaborgsomogyi
Contributor

I've double-checked the graphs manually and they work fine.

@gaborgsomogyi
Contributor

do we support automatically scale the unit in the other graph

No

@SparkQA

SparkQA commented Nov 19, 2020

Test build #131348 has finished for PR 30427 at commit d82702a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

HeartSaVioR commented Nov 19, 2020

AFAIK we don't support auto-scaling for the other graphs. That sounds like an improvement, but I'm not a FE engineer, and we don't seem to rely on a graph library (which might provide such rich functionality) but implement our own, which makes improvements harder.

The value can go very high if you set the watermark's additional delay to a couple of hours or even more (for example, 48 hours = 172,800 seconds). If the differences in the watermark gap among batches are tiny compared to the additional delay, the graph will just keep showing a nearly horizontal line. While scaling the unit could be confusing, adjusting the min/max of the y axis might be helpful. I'm just hesitant to make that change, as all existing graphs have 0 as the minimum y-axis value.
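As a sketch of the y-axis idea (hypothetical; nothing like this is implemented in the PR): padding around the observed min/max instead of anchoring at 0 keeps small batch-to-batch variations visible even around a large gap.

```scala
// Hypothetical y-axis range: pad by 10% of the spread (at least 1 unit)
// instead of always starting the axis at 0.
def yAxisRange(values: Seq[Double]): (Double, Double) = {
  val (lo, hi) = (values.min, values.max)
  val pad = math.max((hi - lo) * 0.1, 1.0)
  (math.max(0.0, lo - pad), hi + pad)
}

// Gaps around 48 hours varying by a few seconds stay distinguishable.
println(yAxisRange(Seq(172800.0, 172805.0, 172810.0)))
```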

@HyukjinKwon
Member

cc @xuanyuanking too FYI

@dongjoon-hyun
Member

Thank you for the confirmation, @gaborgsomogyi and @HeartSaVioR !

@HeartSaVioR
Contributor Author

cc. @tdas @zsxwing @jose-torres @sarutak as well

@SparkQA

SparkQA commented Nov 20, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35983/

@SparkQA

SparkQA commented Nov 20, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35983/

@viirya
Member

viirya commented Nov 20, 2020

Watermark is the one of major metrics the end users need to track for stateful queries. Watermark defines "when" the output will be emitted for append mode, hence knowing how much gap between wall clock and watermark (input data) is very helpful to make expectation of the output.

Hmm, my question is: the watermark should be derived from event time instead of processing time (I think that's what "wall clock" means here?). In the examples, it looks like the event time equals the processing time, IIUC. So once the event time in the data differs from the processing time, is this graph still useful?

@HeartSaVioR
Contributor Author

HeartSaVioR commented Nov 20, 2020

The complicated case in the manual test demonstrates the "event time processing" use case. Please take a look at the code to see how I randomize the event timestamp in the input rows.

Technically, the graph is almost meaningless with processing time, because the event timestamp will be nearly the same as the batch timestamp. Even if the query is lagging, once the next batch is launched, the event timestamps of the inputs will match the batch timestamp.

The graph will be helpful if users are either using "ingest time" (not timestamped by Spark, but timestamped when ingested into the input storage), which can show the processing lag, or "event time", which is the best case for showing the gap.
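The point above can be sketched with a toy model (plain numbers, not Spark code): take watermark = max event time seen minus the delay, and gap = batch timestamp minus watermark.

```scala
val delay = 10L // watermark delay in seconds

// gap = batchTs - watermark, where watermark = maxEventTs - delay; all in seconds.
def watermarkGap(batchTs: Long, maxEventTs: Long): Long =
  batchTs - (maxEventTs - delay)

println(watermarkGap(100L, 100L)) // processing time: events match the clock, gap stays at the delay
println(watermarkGap(100L, 40L))  // event time lagging 60 s behind: the gap reflects the lag
```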

(Figure: Event Time vs. Processing Time)

The figure is borrowed from the excellent articles below. If you haven't read them, I strongly recommend reading them, or the book "Streaming Systems".

https://www.oreilly.com/radar/the-world-beyond-batch-streaming-101/
https://www.oreilly.com/radar/the-world-beyond-batch-streaming-102/

@viirya
Member

viirya commented Nov 20, 2020

The complicated case in manual test demonstrates the use case of "event time processing". Please take a look at the code how I randomize the event timestamp in input rows.

Am I missing anything? The two code snippets are the same.

@HeartSaVioR
Contributor Author

Sorry, copy & paste error. Just updated.

@viirya
Member

viirya commented Nov 20, 2020

Technically, the graph is almost meaningless on processing time, because the event timestamp would be nearly same as batch timestamp. Even the query is lagging, once the next batch is launched, the event timestamp of inputs will be matched to the batch timestamp.

The graph will be helpful if they're either using "ingest time" (not timestamped by Spark, but timestamped when ingested to the input storage) which could show the lag of process, or using "event time" which is the best case of showing the gap.

The gap is calculated as the difference between the batch timestamp (this should be processing time, right? Because the trigger clock is SystemClock by default) and the watermark. Maybe my previous question wasn't clear. If we process historical data or simulation data, the event time could be far from the processing time. For example, if we process data from 2010 to 2019, the gap would now be current time - 2010-xx-xx...?

@HeartSaVioR
Contributor Author

HeartSaVioR commented Nov 20, 2020

If we process history data or some simulation data, the event time could be far different to processing time. For example, if we process some data from 2010 to 2019, now the gap is current time - 2010-xx-xx...?

You understand it correctly, though that's just one of the use cases. Given that they are running a "streaming workload", one of the main goals is to capture the recent outputs (e.g. trends). Watermark would still work for such historical use cases as well, but what to plot to provide value in that situation remains an open question. (What would be the "ideal" timestamp to calculate the gap against in this case?)

EDIT: for that case, adjusting the range of the y axis would probably help; otherwise we only see the line plotted as nearly horizontal, like what I commented above in #30427 (comment).

@SparkQA

SparkQA commented Nov 20, 2020

Test build #131380 has finished for PR 30427 at commit 2f1081a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

</div>
</td>
<td class="watermark-gap-timeline">{graphUIDataForWatermark.generateTimelineHtml(jsCollector)}</td>
<td class="watermark-gap-timeline">{graphUIDataForWatermark.generateHistogramHtml(jsCollector)}</td>
Member


watermark-gap-histogram?

Contributor Author


My bad. Thanks for finding!

@SparkQA

SparkQA commented Nov 20, 2020

Test build #131405 has finished for PR 30427 at commit d19fd10.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 20, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36009/

Copy link
Member

@xuanyuanking xuanyuanking left a comment


😂 Sorry for the last-minute comment again (missed the ping...). I'm also OK with addressing/discussing the comment in a follow-up if this is ready to go, since this PR does address the general cases of SS. Posting my LGTM.

@HeartSaVioR
Contributor Author

retest this, please

@SparkQA

SparkQA commented Nov 23, 2020

Test build #131571 has started for PR 30427 at commit d19fd10.

@SparkQA

SparkQA commented Nov 23, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36172/

@SparkQA

SparkQA commented Nov 23, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36172/

@SparkQA

SparkQA commented Nov 24, 2020

Test build #131596 has finished for PR 30427 at commit d19fd10.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sarutak
Member

sarutak commented Nov 24, 2020

retest this please.

@SparkQA

SparkQA commented Nov 24, 2020

Test build #131622 has finished for PR 30427 at commit d19fd10.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sarutak
Member

sarutak commented Nov 24, 2020

retest this please.

@SparkQA

SparkQA commented Nov 24, 2020

Test build #131630 has finished for PR 30427 at commit d19fd10.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gaborgsomogyi
Contributor

retest this please

@SparkQA

SparkQA commented Nov 24, 2020

Test build #131649 has finished for PR 30427 at commit d19fd10.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member

viirya commented Nov 24, 2020

Jenkins seems unstable. GA actually passed, so I think it should be okay.

@dongjoon-hyun
Member

Yep. The master branch is fixed via 048a982 .

@dongjoon-hyun
Member

Retest this please.

@SparkQA

SparkQA commented Nov 24, 2020

Test build #131692 has finished for PR 30427 at commit d19fd10.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

Just rebased. I'll merge once either GitHub Actions or Jenkins is happy with the change.

@sarutak sarutak changed the title [SPARK-33224][SS] Add watermark gap information into SS UI page [SPARK-33224][SS][WEBUI] Add watermark gap information into SS UI page Nov 24, 2020
@SparkQA

SparkQA commented Nov 25, 2020

Test build #131704 has finished for PR 30427 at commit a6db726.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

Thanks all for reviewing! Merged to master.

@HeartSaVioR HeartSaVioR deleted the SPARK-33224 branch November 25, 2020 04:13
HeartSaVioR pushed a commit that referenced this pull request Dec 1, 2020
…d if built with Scala 2.13

### What changes were proposed in this pull request?

This PR fixes an issue that the histogram and timeline aren't rendered in the `Streaming Query Statistics` page if we built Spark with Scala 2.13.

![before-fix-the-issue](https://user-images.githubusercontent.com/4736016/100612855-f543d700-3356-11eb-90d9-ede57b8b3f4f.png)
![NaN_Error](https://user-images.githubusercontent.com/4736016/100612879-00970280-3357-11eb-97cf-43978bbe2d3a.png)

The reason is [`maxRecordRate` can be `NaN`](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/streaming/ui/StreamingQueryStatisticsPage.scala#L371) for Scala 2.13.

The `NaN` is the result of [`query.recentProgress.map(_.inputRowsPerSecond).max`](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/streaming/ui/StreamingQueryStatisticsPage.scala#L372) when the first element of `query.recentProgress.map(_.inputRowsPerSecond)` is `NaN`.
Actually, the comparison logic for the `Double` type changed in Scala 2.13.
scala/bug#12107
scala/scala#6410

So this issue happens as of Scala 2.13.

The root cause of the `NaN` is [here](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala#L164).
This `NaN` seems to be the initial value of `inputTimeSec`, so I think `Double.PositiveInfinity` is more suitable than `NaN`, and this change resolves the issue.
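The arithmetic behind that choice can be checked directly (a sketch of the rationale, not the patch itself): dividing a row count by `NaN` propagates `NaN` into the derived rate, while dividing by `Double.PositiveInfinity` yields a finite 0.0 that renders and compares sanely.

```scala
val numRows = 1000.0

println(numRows / Double.NaN)              // NaN poisons every derived rate
println(numRows / Double.PositiveInfinity) // 0.0: "no data yet" stays finite
```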

### Why are the changes needed?

To make sure we can use the histogram/timeline with Scala 2.13.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

First, I built with the following commands.
```
$ dev/change-scala-version.sh 2.13
$ build/sbt -Phive -Phive-thriftserver -Pscala-2.13 package
```

Then, I ran the following query (borrowed from #30427).
```
import org.apache.spark.sql.streaming.Trigger
val query = spark
  .readStream
  .format("rate")
  .option("rowsPerSecond", 1000)
  .option("rampUpTime", "10s")
  .load()
  .selectExpr("*", "CAST(CAST(timestamp AS BIGINT) - CAST((RAND() * 100000) AS BIGINT) AS TIMESTAMP) AS tsMod")
  .selectExpr("tsMod", "mod(value, 100) as mod", "value")
  .withWatermark("tsMod", "10 seconds")
  .groupBy(window($"tsMod", "1 minute", "10 seconds"), $"mod")
  .agg(max("value").as("max_value"), min("value").as("min_value"), avg("value").as("avg_value"))
  .writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("5 seconds"))
  .outputMode("append")
  .start()
```

Finally, I confirmed that the timeline and histogram are rendered.
![after-fix-the-issue](https://user-images.githubusercontent.com/4736016/100612736-c9285600-3356-11eb-856d-7e53cc656c36.png)


Closes #30546 from sarutak/ss-nan.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
8 participants