[SPARK-33223][SS][UI]Structured Streaming Web UI state information #30151

Closed
wants to merge 7 commits

Conversation

gaborgsomogyi
Contributor

@gaborgsomogyi commented Oct 26, 2020

What changes were proposed in this pull request?

The Structured Streaming UI does not contain state information. This PR adds it.

Why are the changes needed?

State information is missing from the Structured Streaming UI.

Does this PR introduce any user-facing change?

Additional UI elements appear.

How was this patch tested?

Existing unit tests + manual test.
[Screenshot: 2020-10-30 at 15:14:21]

@HeartSaVioR
Contributor

Could you please paste screenshots here, since this PR addresses a UI change? That would make the proposal easier to review. Thanks in advance!

@gaborgsomogyi
Contributor Author

Sure, I just wanted to check the Jenkins tests first. Once the PR is polished I'll add the UI screenshot.

@SparkQA commented Oct 26, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34886/

@SparkQA commented Oct 26, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34886/

@SparkQA commented Oct 26, 2020

Test build #130285 has finished for PR 30151 at commit eb581b8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gaborgsomogyi
Contributor Author

@HeartSaVioR here it is. The operator ID looks a bit ugly in the UI, but operators don't have names.
[Screenshot: 2020-10-27 at 11:37:10]

@HeartSaVioR
Contributor

Thanks for taking the screenshot. General comments:

  1. Could you please run the query long enough with input data to see whether the graphs produce correct and valuable results?
  2. Could you also run a stream-stream join and see how long the added graphs take? Would it be helpful to display them separately, without knowing which operator each number represents?

@gaborgsomogyi
Contributor Author

Could you please run the query long enough with input data to see whether the graphs produce correct and valuable results?

I've just created a sample in-memory app to show the snapshot, but going forward we need a more sophisticated app (like the suggested stream-stream join) with a longer execution time.

Could you also run a stream-stream join and see how long the added graphs take? Would it be helpful to display them separately, without knowing which operator each number represents?

I'll create such an app instead of the current in-memory one and run tests with it. What exactly do you mean by separate?

@HeartSaVioR
Contributor

What exactly do you mean by separate?

I meant having "accumulated" graphs across multiple state stores vs. having graphs per state store. If you use a stream-stream join, 12 (6 * 2) graphs will come up, and to see the overall memory usage end users have to accumulate these values by themselves.

Having graphs per state store may be helpful for stream-stream joins when there's a skew between the left and right side (either in the volume of the inputs or a difference in the evict condition), but they can probably be hidden by default and shown on demand via "details" (a separate page?).

Btw I guess the loadedMapCacheHitCount graph can be dropped unless requested on demand; as long as things are working without crashes or Spark bugs it will always increment properly.
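For reference, accumulating these values manually today means summing the per-operator entries reported through the progress API. A minimal sketch, assuming query is an already started StreamingQuery handle:

```scala
// Minimal sketch: sum per-operator state metrics from the last progress event.
// "query" is assumed to be a running org.apache.spark.sql.streaming.StreamingQuery.
val progress = query.lastProgress
val totalStateRows   = progress.stateOperators.map(_.numRowsTotal).sum
val totalStateMemory = progress.stateOperators.map(_.memoryUsedBytes).sum
println(s"total state rows: $totalStateRows, state memory used: $totalStateMemory bytes")
```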

@gaborgsomogyi
Contributor Author

Let me create a stream-stream join app to test with, and then we can discuss the details of what/how/where to aggregate.
Some preliminary opinions:

to see the overall memory usage end users have to accumulate these values by themselves

I agree, it would be good to show a summary, but independent graphs are also needed to see which one is problematic.

Having graphs per state store may be helpful for stream-stream joins when there's a skew between the left and right side (either in the volume of the inputs or a difference in the evict condition), but they can probably be hidden by default and shown on demand via "details" (a separate page?).

Yeah, having 3-4 operators would make the UI a horror. I'll start to experiment with a separate-page-per-operator approach.

Btw I guess the loadedMapCacheHitCount graph can be dropped unless requested on demand; as long as things are working without crashes or Spark bugs it will always increment properly.

loadedMapCacheHitCount comes from the custom metrics, which have been taken over as-is: https://github.com/apache/spark/pull/30151/files#diff-e2de3487a935d91466e94189dc6d74dfe545a80a2a24a6da73cffbc55e32f6eaR261
If we want to show such values selectively, maybe we can create a blacklist config for it (of course in a separate JIRA).
Just a quick idea: spark.sql.streaming.ui.disabledCustomMetrics=foo,bar. WDYT?
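A minimal sketch of how such an exclude config could work; the config key is only the proposal above (not an existing Spark config) and the helper is hypothetical:

```scala
// Hypothetical sketch: hide custom metric graphs listed in a comma-separated
// exclude config such as the proposed spark.sql.streaming.ui.disabledCustomMetrics.
def visibleCustomMetrics(allMetricNames: Seq[String], excludeConfValue: String): Seq[String] = {
  val disabled = excludeConfValue.split(",").map(_.trim).filter(_.nonEmpty).toSet
  allMetricNames.filterNot(disabled.contains)
}

// Example: drop the cache-hit graph discussed above, keep the rest.
val shown = visibleCustomMetrics(
  Seq("loadedMapCacheHitCount", "loadedMapCacheMissCount", "stateOnCurrentVersionSizeBytes"),
  "loadedMapCacheHitCount")
// shown == Seq("loadedMapCacheMissCount", "stateOnCurrentVersionSizeBytes")
```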

@HeartSaVioR
Contributor

HeartSaVioR commented Oct 28, 2020

If we would like to enable/disable graphs, a checkbox would probably be a better option. Some Spark UI pages already work that way.

@gaborgsomogyi
Contributor Author

gaborgsomogyi commented Oct 28, 2020

Are these checkboxes dynamically generated? Custom metrics are custom, so anything can be added there. Unless one knows the potential values in advance, it's hard to put a checkbox for them.

@HeartSaVioR
Contributor

Ah, you're right. That said, we may need to drop the graphs for custom metrics, as other state store providers wouldn't have them.

@gaborgsomogyi
Contributor Author

Based on our discussion I've created SPARK-33287 since we need some kind of exclude mechanism to implement it.

@gaborgsomogyi
Contributor Author

I've just created an example stream-stream join test app to ease the testing: https://github.com/gaborgsomogyi/spark-stream-stream-join-test
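For context, a minimal sketch of what such a stream-stream join workload looks like (the linked repository is the actual test app; this standalone example with two rate sources is only illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

// Illustrative stream-stream join: two rate sources joined on a derived key,
// so both sides keep state and the new state graphs in the UI receive data.
object StreamStreamJoinExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("stream-stream-join-example")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    val left = spark.readStream.format("rate").option("rowsPerSecond", 10).load()
      .select(($"value" % 10).as("leftKey"), $"timestamp".as("leftTime"))
      .withWatermark("leftTime", "10 seconds")

    val right = spark.readStream.format("rate").option("rowsPerSecond", 10).load()
      .select(($"value" % 10).as("rightKey"), $"timestamp".as("rightTime"))
      .withWatermark("rightTime", "10 seconds")

    val joined = left.join(
      right,
      expr("leftKey = rightKey AND rightTime BETWEEN leftTime AND leftTime + interval 5 seconds"))

    joined.writeStream
      .format("console")
      .option("truncate", "false")
      .start()
      .awaitTermination()
  }
}
```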

@SparkQA commented Oct 29, 2020

Test build #130409 has finished for PR 30151 at commit ba23c1a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 29, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35013/

@SparkQA commented Oct 29, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35013/

@gaborgsomogyi
Contributor Author

gaborgsomogyi commented Oct 29, 2020

Switched to aggregated values. I think this approach is more useful from a user perspective, so we can go on with it. WDYT?
I think the graph names need to be improved, but first I would like to agree on the approach.
cc @xuanyuanking
[Screenshot: 2020-10-29 at 15:24:19]

@SparkQA commented Oct 29, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35016/

@SparkQA commented Oct 29, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35016/

@SparkQA commented Oct 29, 2020

Test build #130412 has finished for PR 30151 at commit 4adc856.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor

HeartSaVioR commented Oct 30, 2020

The UI change looks OK to me. I'll probably apply the PR on my end and play with it.

Btw, do you have some TODOs left before getting rid of the WIP tag? Just to determine when to start reviewing the codebase.

@gaborgsomogyi
Contributor Author

Btw, do you have some TODOs left before getting rid of the WIP tag? Just to determine when to start reviewing the codebase.

I would like to do a review of my own code and execute some extra tests (1-2 hour endurance runs, etc.). I hope I can remove the WIP today.

@gaborgsomogyi
Contributor Author

I've executed the following applications for an hour each:

  • no state
  • one operator
  • 2 operators

The response time of the UI is stable.
Additionally I made some minor fixes in the PR.

@gaborgsomogyi changed the title [WIP][SPARK-33223][SS][UI]Structured Streaming Web UI state information [SPARK-33223][SS][UI]Structured Streaming Web UI state information Oct 30, 2020
@SparkQA commented Oct 30, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35064/

@SparkQA commented Oct 30, 2020

Test build #130455 has finished for PR 30151 at commit 3d11793.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 30, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35064/

@SparkQA commented Oct 30, 2020

Test build #130459 has finished for PR 30151 at commit 3d11793.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor

cc. @sarutak as well

@HeartSaVioR (Contributor) left a comment


Looks OK to me, only minor comments. I manually verified with these queries:

  1. no stateful operator
  2. time-window query
  3. the query in SPARK-33259, which demonstrates the correctness issue on multiple stream-stream joins - dropped rows are presented in the graph.

For case 3), it would probably be better to have a details page showing these graphs per stateful operator, so we can see which operator has dropped rows, but that can be done as a separate JIRA issue (and separate PR).

@HeartSaVioR
Contributor

Btw, thinking out loud, would it be helpful to have a table showing (batch ID, timestamp, SQL execution IDs with link, Jobs with link)? Or a table showing (batch ID, timestamp, basic metrics) with a details page that contains the relevant SQL execution information with link and the relevant Job information with link.

General monitoring would be done with the SS UI page, and on this page we would probably figure out the problematic batches we would like to look into in detail. That connection is lost here, and as of now we need to find the relevant information manually.

@gaborgsomogyi
Contributor Author

gaborgsomogyi commented Nov 3, 2020

Without super deep consideration it makes sense. This page adds only high-level information about the state. When things look bad from the high level, users are interested in more granular information to find the root cause. One step on this road could be to show state information on a per-state basis (I haven't created a JIRA for this yet). Once we've added custom metrics + watermark info, we can experiment with what information is most valuable to users in finding the root causes of unhealthy queries.

@SparkQA commented Nov 3, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35164/

@SparkQA commented Nov 3, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35164/

@HeartSaVioR
Contributor

Looks OK to me. If you could come up with some sort of testing, even better, but I'm not sure that's something we can require for a UI change; as the comment said, we don't have effective UI tests for SS - #30151 (comment)

@SparkQA commented Nov 3, 2020

Test build #130563 has finished for PR 30151 at commit 291cf8a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing
Member

zsxwing commented Nov 3, 2020

Thanks for adding this. This is pretty useful. Could you add some simple tests in https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/streaming/ui/UISeleniumSuite.scala#L125 to make sure we at least exercise this code in our tests?
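A rough sketch of the kind of check that could be added, using the ScalaTest Selenium DSL the suite already relies on; the page URL, the webUiPort/query variables, the surrounding imports, and asserting only one of the new graph headings are all assumptions, not the final test:

```scala
// Rough sketch only: check that a new state graph heading renders on the
// statistics page. "webUiPort" and "query" are assumed to come from the test setup.
eventually(timeout(30.seconds), interval(100.milliseconds)) {
  go to s"http://localhost:$webUiPort/StreamingQuery/statistics/?id=${query.runId}"
  val headings = findAll(cssSelector("strong")).map(_.text).toList
  headings should contain ("Aggregated Number Of State Rows Dropped By Watermark")
}
```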

@gaborgsomogyi
Contributor Author

Sure, I've started to look at what tests are possible to add. I'm doing some other experiments, so it will take some time.

@SparkQA commented Nov 9, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35398/

@SparkQA commented Nov 9, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35398/

@SparkQA commented Nov 9, 2020

Test build #130789 has finished for PR 30151 at commit 02a14fc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gaborgsomogyi
Contributor Author

retest this please

@SparkQA commented Nov 9, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35404/

@SparkQA commented Nov 9, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35404/

@SparkQA commented Nov 9, 2020

Test build #130795 has finished for PR 30151 at commit 02a14fc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR (Contributor) left a comment


+1
As a simple test has been added, merging to master. If someone wants to add more detailed tests, please volunteer and submit a follow-up PR.

@HeartSaVioR
Contributor

HeartSaVioR commented Nov 10, 2020

Thanks all for reviewing and thanks @gaborgsomogyi for the contribution. Merged to master.

@gaborgsomogyi
Contributor Author

Thanks all for taking care!

<tr>
<td style="vertical-align: middle;">
<div style="width: 160px;">
<div><strong>Aggregated Number Of State Rows Dropped By Watermark {SparkUIUtils.tooltip("Aggregated number of state rows dropped by watermark.", "right")}</strong></div>
Contributor

I figured this out while working on my own change - the rows dropped by watermark are not state rows. It's a bit confusing, but they're input rows for "stateful operators". I'll make a follow-up PR to correct this.


@yeskarthik
Contributor

@HeartSaVioR @gaborgsomogyi can you please clarify why the Structured Streaming UI is rendered as a static page instead of exposing the data through an API, as we do for DStream monitoring?
