[SPARK-33223][SS][UI]Structured Streaming Web UI state information #30151

Closed
wants to merge 7 commits

Conversation

gaborgsomogyi
Contributor

@gaborgsomogyi commented Oct 26, 2020

What changes were proposed in this pull request?

The Structured Streaming UI does not contain state information. This PR adds it.

Why are the changes needed?

State information is missing from the Structured Streaming UI.

Does this PR introduce any user-facing change?

Additional UI elements appear.

How was this patch tested?

Existing unit tests + manual test.
[Screenshot: 2020-10-30 at 15:14:21]

@HeartSaVioR
Contributor

Could you please paste screenshots here, since this PR addresses a UI change? That would make the proposal easier to review. Thanks in advance!

@gaborgsomogyi
Contributor Author

Sure, I just wanted to check the Jenkins tests first. Once the PR is polished I'll add the UI screenshot.

@SparkQA commented Oct 26, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34886/

@SparkQA commented Oct 26, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34886/

@SparkQA commented Oct 26, 2020

Test build #130285 has finished for PR 30151 at commit eb581b8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gaborgsomogyi
Contributor Author

@HeartSaVioR here it is. The operator ID looks a bit ugly in the UI, but operators don't have names.
[Screenshot: 2020-10-27 at 11:37:10]

@HeartSaVioR
Contributor

Thanks for taking the screenshot. General comments:

  1. Could you please run the query long enough with input data to see whether the graphs produce correct and valuable results?
  2. Could you also run a stream-stream join and see how long the added graphs take? Would it be helpful to display them separately, without knowing which operator each number represents?

@gaborgsomogyi
Contributor Author

Could you please run the query long enough with input data to see whether the graphs produce correct and valuable results?

I've just created a sample in-memory app to show the snapshot, but going forward we need a more sophisticated app (like the suggested stream-stream join) with a longer execution time.

Could you also run a stream-stream join and see how long the added graphs take? Would it be helpful to display them separately, without knowing which operator each number represents?

I'll create such an app instead of the current in-memory one and run tests with it. What exactly do you mean by separate?

@HeartSaVioR
Contributor

What exactly do you mean by separate?

I meant having "accumulated" graphs across multiple state stores vs. having graphs per state store. If you use a stream-stream join, 12 (6 * 2) graphs will come up, and to see the overall memory usage end users have to accumulate these values by themselves.

Having graphs per state store may be helpful for stream-stream joins when there's a skew between the left and right side (either in the volume of the inputs or a difference in the evict condition), but they can probably be hidden by default and shown on demand via "details" (a separate page?).

Btw I guess the loadedMapCacheHitCount graph can be dropped unless requested on demand; as long as things are working without crashes or Spark bugs it will always increment properly.
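For reference, accumulating these values manually today means summing the per-operator entries reported through the progress API. A minimal sketch, assuming query is an already started StreamingQuery handle:

```scala
// Minimal sketch: sum per-operator state metrics from the last progress event.
// "query" is assumed to be a running org.apache.spark.sql.streaming.StreamingQuery.
val progress = query.lastProgress
val totalStateRows   = progress.stateOperators.map(_.numRowsTotal).sum
val totalStateMemory = progress.stateOperators.map(_.memoryUsedBytes).sum
println(s"total state rows: $totalStateRows, state memory used: $totalStateMemory bytes")
```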

@gaborgsomogyi
Contributor Author

Let me create a stream-stream join app to test with, and then we can discuss the details of what/how/where to aggregate.
Some preliminary opinions:

to see the overall memory usage end users have to accumulate these values by themselves

I agree, it would be good to show a summary, but independent graphs are also needed to see which one is problematic.

Having graphs per state store may be helpful for stream-stream joins when there's a skew between the left and right side (either in the volume of the inputs or a difference in the evict condition), but they can probably be hidden by default and shown on demand via "details" (a separate page?).

Yeah, having 3-4 operators would make the UI a horror. I'll start to experiment with a separate-page-per-operator approach.

Btw I guess the loadedMapCacheHitCount graph can be dropped unless requested on demand; as long as things are working without crashes or Spark bugs it will always increment properly.

loadedMapCacheHitCount comes from the custom metrics, which have been taken over as-is: https://github.com/apache/spark/pull/30151/files#diff-e2de3487a935d91466e94189dc6d74dfe545a80a2a24a6da73cffbc55e32f6eaR261
If we want to show such values selectively, maybe we can create a blacklist config for it (of course in a separate JIRA).
Just a quick idea: spark.sql.streaming.ui.disabledCustomMetrics=foo,bar. WDYT?
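A minimal sketch of how such an exclude config could work; the config key is only the proposal above (not an existing Spark config) and the helper is hypothetical:

```scala
// Hypothetical sketch: hide custom metric graphs listed in a comma-separated
// exclude config such as the proposed spark.sql.streaming.ui.disabledCustomMetrics.
def visibleCustomMetrics(allMetricNames: Seq[String], excludeConfValue: String): Seq[String] = {
  val disabled = excludeConfValue.split(",").map(_.trim).filter(_.nonEmpty).toSet
  allMetricNames.filterNot(disabled.contains)
}

// Example: drop the cache-hit graph discussed above, keep the rest.
val shown = visibleCustomMetrics(
  Seq("loadedMapCacheHitCount", "loadedMapCacheMissCount", "stateOnCurrentVersionSizeBytes"),
  "loadedMapCacheHitCount")
// shown == Seq("loadedMapCacheMissCount", "stateOnCurrentVersionSizeBytes")
```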

@HeartSaVioR
Contributor

HeartSaVioR commented Oct 28, 2020

If we would like to enable/disable graphs, a checkbox would probably be a better option. Some Spark UI pages already work that way.

@gaborgsomogyi
Contributor Author

gaborgsomogyi commented Oct 28, 2020

Are these checkboxes dynamically generated? Custom metrics are custom, so anything can be added there. Unless one knows the potential values in advance, it's hard to put a checkbox for them.

@HeartSaVioR
Contributor

Ah, you're right. That said, we may need to drop the graphs for custom metrics, as other state store providers wouldn't have them.

@gaborgsomogyi
Contributor Author

Based on our discussion I've created SPARK-33287 since we need some kind of exclude mechanism to implement it.

@gaborgsomogyi
Contributor Author

I've just created an example stream-stream join test app to ease the testing: https://github.com/gaborgsomogyi/spark-stream-stream-join-test
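For context, a minimal sketch of what such a stream-stream join workload looks like (the linked repository is the actual test app; this standalone example with two rate sources is only illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

// Illustrative stream-stream join: two rate sources joined on a derived key,
// so both sides keep state and the new state graphs in the UI receive data.
object StreamStreamJoinExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("stream-stream-join-example")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    val left = spark.readStream.format("rate").option("rowsPerSecond", 10).load()
      .select(($"value" % 10).as("leftKey"), $"timestamp".as("leftTime"))
      .withWatermark("leftTime", "10 seconds")

    val right = spark.readStream.format("rate").option("rowsPerSecond", 10).load()
      .select(($"value" % 10).as("rightKey"), $"timestamp".as("rightTime"))
      .withWatermark("rightTime", "10 seconds")

    val joined = left.join(
      right,
      expr("leftKey = rightKey AND rightTime BETWEEN leftTime AND leftTime + interval 5 seconds"))

    joined.writeStream
      .format("console")
      .option("truncate", "false")
      .start()
      .awaitTermination()
  }
}
```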

@SparkQA commented Oct 29, 2020

Test build #130409 has finished for PR 30151 at commit ba23c1a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 29, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35013/

@SparkQA commented Oct 29, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35013/

@gaborgsomogyi
Contributor Author

gaborgsomogyi commented Oct 29, 2020

Switched to aggregated values. I think this approach is more useful from a user perspective, so we can go on with it. WDYT?
I think the graph names need to be improved, but first I would like to agree on the approach.
cc @xuanyuanking
[Screenshot: 2020-10-29 at 15:24:19]

@SparkQA commented Oct 29, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35016/

@SparkQA commented Oct 29, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35016/

@SparkQA commented Oct 29, 2020

Test build #130412 has finished for PR 30151 at commit 4adc856.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor

HeartSaVioR commented Oct 30, 2020

The UI change looks OK to me. I'll probably apply the PR on my end and play with it.

Btw, do you have some TODOs left before getting rid of the WIP tag? Just to determine when to start reviewing the codebase.

@gaborgsomogyi
Contributor Author

Btw, do you have some TODOs left before getting rid of the WIP tag? Just to determine when to start reviewing the codebase.

I would like to do a review of my own code and execute some extra tests (1-2 hour endurance runs, etc.). I hope I can remove the WIP today.

@gaborgsomogyi
Contributor Author

I've executed the following applications for an hour each:

  • no state
  • one operator
  • 2 operators

The response time of the UI is stable.
Additionally I made some minor fixes in the PR.

@gaborgsomogyi changed the title [WIP][SPARK-33223][SS][UI]Structured Streaming Web UI state information [SPARK-33223][SS][UI]Structured Streaming Web UI state information Oct 30, 2020
@SparkQA commented Oct 30, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35064/

@SparkQA commented Oct 30, 2020

Test build #130455 has finished for PR 30151 at commit 3d11793.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 30, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35064/

@SparkQA commented Oct 30, 2020

Test build #130459 has finished for PR 30151 at commit 3d11793.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor

cc. @sarutak as well

@HeartSaVioR (Contributor) left a comment


Looks OK to me, only minor comments. I manually verified with these queries:

  1. no stateful operator
  2. time-window query
  3. the query in SPARK-33259, which demonstrates the correctness issue on multiple stream-stream joins - dropped rows are presented in the graph.

For case 3), it would probably be better to have a details page showing these graphs per stateful operator, so we can see which operator has dropped rows, but that can be done as a separate JIRA issue (and separate PR).

@HeartSaVioR
Contributor

Btw, thinking out loud, would it be helpful to have a table showing (batch ID, timestamp, SQL execution IDs with link, Jobs with link)? Or a table showing (batch ID, timestamp, basic metrics) with a details page that contains the relevant SQL execution information with link and the relevant Job information with link.

General monitoring would be done with the SS UI page, and on this page we would probably figure out the problematic batches we would like to look into in detail. That connection is lost here, and as of now we need to find the relevant information manually.

@gaborgsomogyi
Contributor Author

gaborgsomogyi commented Nov 3, 2020

Without super deep consideration it makes sense. This page adds only high-level information about the state. When things look bad from the high level, users are interested in more granular information to find the root cause. One step on this road could be to show state information on a per-state basis (I haven't created a JIRA for this yet). Once we've added custom metrics + watermark info, we can experiment with what information is most valuable to users in finding the root causes of unhealthy queries.

@SparkQA commented Nov 3, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35164/

@SparkQA commented Nov 3, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35164/

@HeartSaVioR
Contributor

Looks OK to me. If you could come up with some sort of testing, even better, but I'm not sure that's something we can require for a UI change; as the comment said, we don't have effective UI tests for SS - #30151 (comment)

@SparkQA commented Nov 3, 2020

Test build #130563 has finished for PR 30151 at commit 291cf8a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing
Member

zsxwing commented Nov 3, 2020

Thanks for adding this. This is pretty useful. Could you add some simple tests in https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/streaming/ui/UISeleniumSuite.scala#L125 to make sure we at least exercise this code in our tests?
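A rough sketch of the kind of check that could be added, using the ScalaTest Selenium DSL the suite already relies on; the page URL, the webUiPort/query variables, the surrounding imports, and asserting only one of the new graph headings are all assumptions, not the final test:

```scala
// Rough sketch only: check that a new state graph heading renders on the
// statistics page. "webUiPort" and "query" are assumed to come from the test setup.
eventually(timeout(30.seconds), interval(100.milliseconds)) {
  go to s"http://localhost:$webUiPort/StreamingQuery/statistics/?id=${query.runId}"
  val headings = findAll(cssSelector("strong")).map(_.text).toList
  headings should contain ("Aggregated Number Of State Rows Dropped By Watermark")
}
```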

@gaborgsomogyi
Contributor Author

Sure, I've started to look at what tests are possible to add. I'm doing some other experiments, so it will take some time.

@SparkQA commented Nov 9, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35398/

@SparkQA commented Nov 9, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35398/

@SparkQA commented Nov 9, 2020

Test build #130789 has finished for PR 30151 at commit 02a14fc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gaborgsomogyi
Contributor Author

retest this please

@SparkQA commented Nov 9, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35404/

@SparkQA commented Nov 9, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35404/

@SparkQA commented Nov 9, 2020

Test build #130795 has finished for PR 30151 at commit 02a14fc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR (Contributor) left a comment


+1
As a simple test has been added, merging to master. If someone wants to add more detailed tests, please volunteer and submit a follow-up PR.

@HeartSaVioR
Contributor

HeartSaVioR commented Nov 10, 2020

Thanks all for reviewing and thanks @gaborgsomogyi for the contribution. Merged to master.

@gaborgsomogyi
Contributor Author

Thanks all for taking care!

<tr>
<td style="vertical-align: middle;">
<div style="width: 160px;">
<div><strong>Aggregated Number Of State Rows Dropped By Watermark {SparkUIUtils.tooltip("Aggregated number of state rows dropped by watermark.", "right")}</strong></div>
Contributor

I figured this out while working on my own change - the rows dropped by watermark are not state rows. It's a bit confusing, but they're input rows for "stateful operators". I'll make a follow-up PR to correct this.


@yeskarthik
Contributor

@HeartSaVioR @gaborgsomogyi can you please clarify why the Structured Streaming UI is rendered as a static page instead of exposing the data through an API, as we do for DStream monitoring?
