[SPARK-17243] [Web UI] Spark 2.0 History Server won't load with very large application history #14835

Closed
ajbozarth wants to merge 7 commits into apache:master from ajbozarth:spark17243

Conversation

@ajbozarth (Member) commented Aug 26, 2016

What changes were proposed in this pull request?

With the new History Server the summary page loads the application list via the REST API, which makes it very slow, or impossible, to load with a large (10K+) application history. This PR fixes that by adding a `spark.history.ui.maxApplications` conf to limit the number of applications the History Server displays. This is accomplished using a new optional `limit` param for the `applications` API. (Note this only applies to what the summary page displays; all the Application UIs are still accessible if the user knows the App ID and goes to the Application UI directly.)
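
For illustration (this sketch is not part of this pr's code), the new param could be exercised like this once a History Server is running on its default port; the limit of 50 is an arbitrary example:

```scala
// Sketch: fetch a capped application list from the History Server REST API.
// localhost:18080 is the default History Server address; adjust as needed.
import scala.io.Source

object LimitParamDemo {
  def main(args: Array[String]): Unit = {
    val url = "http://localhost:18080/api/v1/applications?limit=50"
    // The response is a JSON array with at most 50 application entries.
    println(Source.fromURL(url).mkString)
  }
}
```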

I've also added a new test for the `limit` param in `HistoryServerSuite.scala`.

How was this patch tested?

Manual testing and dev/run-tests

@ajbozarth (Member Author) commented Aug 26, 2016

I ended up not using the limit param in my solution, since getting a conf value to the js function that calls the api was not exactly simple (my biggest issue with switching to js-based pages for the ui), but given its overall usefulness I decided to leave it in this pr as a new feature. If you think I should open a separate JIRA/pr for it, I can.

@ajbozarth (Member Author)

@srowen @steveloughran can you take a look at this? You two were the ones who commented on the JIRA, thanks!

@@ -101,3 +101,4 @@ org.apache.spark.scheduler.ExternalClusterManager
.*\.sql
.Rbuildignore
org.apache.spark.deploy.yarn.security.ServiceCredentialProvider
spark-warehouse

Contributor:

not sure about this change

@ajbozarth (Member Author):

Oh whoops, I must have committed that by accident; dev/run-tests won't pass without it, so I've been adding/removing it every time.

@steveloughran (Contributor)

I'll let other people review the source code in detail, except to note that people calling the REST API may want to ask for all entries, even if the web view asks for fewer.

  1. The REST API is documented in docs/monitoring.md; changes need to go in there.
  2. Tests go in HistoryServerSuite.

Test-wise, I'd add something to the end of `test("incomplete apps get refreshed")`, making another API call on the app list through `getContentAndCode` and setting `?limit=1` to verify 0 comes back.
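
Something like this sketch (untested; it assumes `getContentAndCode` keeps the (status, body, error) shape it has elsewhere in the suite):

```scala
// At the end of test("incomplete apps get refreshed"); HttpServletResponse
// and the Matchers syntax are already available in HistoryServerSuite.
val (code, jsonOpt, errOpt) = getContentAndCode("applications?limit=1")
code should be (HttpServletResponse.SC_OK)
jsonOpt should not be None
// then parse jsonOpt.get and assert on the number of applications returned
```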

@ajbozarth (Member Author)

@steveloughran Some follow-up and clarifications:
Regarding your comment about people calling the REST API: you are correct that my change limits the applications available through the api on the history server. This was my intended behavior, but if you think it'd be better, I can switch to my fallback implementation outlined below.

Alternate implementation that only limits the history server and not the api:
I pass the conf value to the js file via a global js var with a getter and setter (I've done this previously with `threadDumpEnabled` in `ExecutorsPage.scala` going to `executors.js`); see the sketch below.
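
As a sketch (illustrative names, modeled on the `threadDumpEnabled` wiring rather than the exact code in this pr):

```scala
// Server side (e.g. HistoryPage.scala): read the conf and emit a script tag
// that hands the value to the page's JS. `parent.conf` and `setAppLimit`
// are illustrative names, not necessarily what the final code uses.
val maxApplications =
  parent.conf.getInt("spark.history.ui.maxApplications", Int.MaxValue)
val content = <script>setAppLimit({maxApplications})</script>
// JS side (historypage.js) would keep a global with a getter/setter, e.g.:
//   var appLimit = -1;
//   function setAppLimit(num) { appLimit = num; }
```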

  1. I will add this.
  2. I added the one test already. For the test you referenced, I'm a bit confused about what you're recommending: why would `?limit=1` return 0? Or are you saying to call it after the last line, where it deletes the log? If so, how would that test the limit param?

Thanks

@ajbozarth (Member Author)

So I decided to switch to my alternate implementation after thinking hard about how detrimental limiting the api would be. I also updated the docs with my new conf and api param. I will push these changes once I double-check my tests.

@SparkQA commented Aug 26, 2016

Test build #64502 has finished for PR 14835 at commit 6163f13.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 27, 2016

Test build #64508 has finished for PR 14835 at commit 172dc56.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@QQshu1 commented Aug 27, 2016

I have a question: if I have just one application and it is very large, it still can't be loaded this way. How can that scenario be handled?

@ajbozarth (Member Author)

@QQshu1 I'm not sure what problem you're having, but this pr solves the specific problem that the call to fetch the application-list JSON when loading the history server takes too long when the application list is too long.

<td>Int.MaxValue</td>
<td>
The number of applications to display on the history summary page. Application UIs are
still available even if they are not displayed on the history summary page.

Contributor:

We might want to clarify this; I assume it's if you put in the URL to the application directly.

@ajbozarth (Member Author):

You are correct, I'll reword it to use "URL"

@tgravescs (Contributor)

I need to look at it in more detail, but at one point (and it looks like this has changed) I thought spark.history.retainedApplications affected the number being displayed as well. So at the very least I think we should update the docs to clarify what it does vs. this.

@@ -100,6 +100,7 @@ class HistoryServerSuite extends SparkFunSuite with BeforeAndAfter with Matchers
"minDate app list json" -> "applications?minDate=2015-02-10",
"maxDate app list json" -> "applications?maxDate=2015-02-10",
"maxDate2 app list json" -> "applications?maxDate=2015-02-03T16:42:40.000GMT",
"limit app list json" -> "applications?limit=3",

Contributor:

What happens with limit=-1? I assume the take will just return an empty map, but good to try it.

@ajbozarth (Member Author):

limit=-1 will actually return the whole list; I just pass the int to take, and that's how that function works. I thought this was a good default value (which is why it's the default value in the js file).


Contributor:

ok, sounds fine.


Contributor:

I was worried about what would happen if limit was unset, but the existing tests (lines 180-182) all cover that.

@ajbozarth (Member Author)

@tgravescs I'll double-check, but I believe spark.history.retainedApplications now affects the max number of cached applications, not the total listed. I was under the same assumption previously and had suggested that workaround in the JIRA.
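
To illustrate the distinction (a sketch with arbitrary values; only the second conf is new in this pr):

```scala
import org.apache.spark.SparkConf

// Illustration only: the two settings govern different things.
val conf = new SparkConf()
  // how many loaded application UIs the History Server caches in memory
  .set("spark.history.retainedApplications", "50")
  // how many applications the summary page lists (new in this pr)
  .set("spark.history.ui.maxApplications", "5000")
```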

@tgravescs (Contributor)

Yep, I believe you are correct, and if so I think we should just update the docs for it to describe that. If you don't mind doing it along with this, that would be great.

@ajbozarth (Member Author)

@tgravescs pushed those doc updates if you want to take a look; I'm not 100% sure on my wording.

@SparkQA commented Aug 29, 2016

Test build #64595 has finished for PR 14835 at commit bbb9718.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ajbozarth (Member Author)

retest this please

@SparkQA commented Aug 29, 2016

Test build #64601 has finished for PR 14835 at commit bbb9718.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

- The number of application UIs to retain. If this cap is exceeded, then the oldest
- applications will be removed.
+ The number of application UIs to retain in the cache. If this cap is exceeded, then the oldest
+ applications will be removed from the cache.

Contributor:

Kind of a nit here, but perhaps we should change this to "The number of applications to retain in the cache". I'm wondering if "UIs" could still be confused with the number displayed. And perhaps after the last sentence add something like: "If an application is not in the cache, it will have to be loaded from disk if it's accessed from the UI."

@ajbozarth (Member Author):

Thanks, that'll help me reword it a bit

@ajbozarth (Member Author)

@tgravescs I edited the wording, but I feel that the word "UI" needs to be there to make sure they understand that it's the UI being stored, not some copy of the application. I thought the phrase "UI data" got this across, and I chose to move the sentence around to clear up any misunderstandings.

@SparkQA commented Aug 30, 2016

Test build #64664 has finished for PR 14835 at commit 8da22e3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs (Contributor)

changes look good. thanks!

One interesting thing I found while trying this out is that the number displayed in the table can actually be greater than the maxApplications config, due to some applications possibly having multiple attempts. The REST API will return the correct number of applications; it's just that the datatable may split them based on sorting order. Not a big deal for this, just an observation.

@asfgit closed this in f7beae6 Aug 30, 2016
@tgravescs (Contributor)

Merged into the 2.1 branch. I can see this being either a defect or a feature. If you or others want it in 2.0.1, it will need a separate pr, as there were conflicts with the change due to the executor page going to datatables.

@ajbozarth (Member Author)

Thanks @tgravescs, I'll quickly open a pr for 2.0.1 and ping you on it.

@ajbozarth deleted the spark17243 branch August 30, 2016 21:42
ajbozarth added a commit to ajbozarth/spark that referenced this pull request Aug 30, 2016
[SPARK-17243] [Web UI] Spark 2.0 History Server won't load with very large application history

With the new History Server the summary page loads the application list via the REST API, which makes it very slow, or impossible, to load with a large (10K+) application history. This PR fixes that by adding the `spark.history.ui.maxApplications` conf to limit the number of applications the History Server displays. This is accomplished using a new optional `limit` param for the `applications` api. (Note this only applies to what the summary page displays; all the Application UIs are still accessible if the user knows the App ID and goes to the Application UI directly.)

I've also added a new test for the `limit` param in `HistoryServerSuite.scala`

Manual testing and dev/run-tests

Author: Alex Bozarth <ajbozart@us.ibm.com>

Closes apache#14835 from ajbozarth/spark17243.
asfgit pushed a commit that referenced this pull request Aug 31, 2016
[SPARK-17243] [Web UI] Spark 2.0 History Server won't load with very large application history

## What changes were proposed in this pull request?

Backport of #14835, addressing merge conflicts.

With the new History Server the summary page loads the application list via the REST API, which makes it very slow, or impossible, to load with a large (10K+) application history. This PR fixes that by adding the `spark.history.ui.maxApplications` conf to limit the number of applications the History Server displays. This is accomplished using a new optional `limit` param for the `applications` api. (Note this only applies to what the summary page displays; all the Application UIs are still accessible if the user knows the App ID and goes to the Application UI directly.)

I've also added a new test for the `limit` param in `HistoryServerSuite.scala`

## How was this patch tested?

Manual testing and dev/run-tests

Author: Alex Bozarth <ajbozart@us.ibm.com>

Closes #14886 from ajbozarth/spark17243-branch-2.0.