Run list perf optimizations #149
Comments
@yebrahim, the bottleneck is most likely the number of simultaneous queries made to the API server and the underlying DB. I suggest the following improvements:
I can't get request service time from the API server; @IronPan can maybe comment on this. In any case, here's a waterfall of requests vs. render time:
The average request time shown here is about 40ms, and I think all of that is network latency. I might be wrong here, of course; there's no way for me to tell. Some profiling on the API server is needed.
One more thing: for each request, we actually make two network hops instead of one, since we use the frontend webserver as a proxy.
Does the frontend reuse the same connection pool to make requests, or does it start a new connection for each request? If the latter, that could be the cause of the latency.
What's a connection pool?
Here is a link: https://en.wikipedia.org/wiki/Connection_pool. The article discusses database connections, but the same idea applies to service connections. To summarize, the UI should establish connections to the service only a few times and keep reusing them (that's the pool of connections). Re-creating a new connection for every request leads to a significant increase in latency.
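To make the reuse concrete, here's a minimal Node sketch (the host, port, and helper name are hypothetical placeholders, not the actual frontend proxy code): a single keep-alive agent is shared across requests, so calls to the API server reuse existing sockets instead of opening a new connection every time.

```typescript
import * as http from 'http';

// Shared keep-alive agent: sockets to the API server are reused across
// requests instead of paying a fresh TCP handshake for every call.
const apiAgent = new http.Agent({ keepAlive: true, maxSockets: 10 });

// Hypothetical helper; the host and port are placeholders, not the real config.
function getFromApiServer(path: string): Promise<string> {
  return new Promise((resolve, reject) => {
    const req = http.request(
      { host: 'ml-pipeline-api-server', port: 8888, path, agent: apiAgent },
      res => {
        let body = '';
        res.on('data', chunk => (body += chunk));
        res.on('end', () => resolve(body));
      },
    );
    req.on('error', reject);
    req.end();
  });
}
```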
Maybe we should consider using tools such as https://github.com/prometheus/client_golang to instrument backend perf.
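That library targets the Go backend; purely to keep the examples here in one language, the same idea is sketched below with the Node prom-client package (the metric name, labels, and buckets are made up for illustration):

```typescript
import * as promClient from 'prom-client';

// Hypothetical metric; the name, labels, and buckets are placeholders.
const requestLatency = new promClient.Histogram({
  name: 'api_request_duration_seconds',
  help: 'Latency of requests made to the API server',
  labelNames: ['endpoint'],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5],
});

// Wrap any backend call to record how long it took.
async function timed<T>(endpoint: string, call: () => Promise<T>): Promise<T> {
  const stopTimer = requestLatency.startTimer({ endpoint });
  try {
    return await call();
  } finally {
    stopTimer();
  }
}

// Prometheus then scrapes the collected metrics, e.g. from a /metrics route:
//   res.end(await promClient.register.metrics());
```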
We're bound by network latency here. Some more numbers:
Numbers are all in milliseconds, and were gathered using curl both inside the API server's pod and outside the cluster. I can see there's potential for improving the list runs server endpoint's performance, but that is still negligible compared to the observed difference between inside and outside the pod, which I can only attribute to network latency. I'm not sure there's a way to improve the latency itself; I'll look into that next. It seems to me the low-hanging fruit is to stop blocking on all requests before rendering the UI. For the list runs call, for example, it may be enough to show the data as it comes in, rather than waiting to gather every bit of data.
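As a rough sketch of what "show the data as it comes in" could look like, assuming hypothetical Apis helpers and a React-style setState (not the actual component code): render the run rows as soon as listRuns returns, then fill in the per-run details as each request resolves.

```typescript
// Hypothetical shapes and helpers, for illustration only; the real frontend
// code is organized differently.
interface Run {
  id: string;
  name: string;
  pipelineId: string;
  pipelineName?: string;
}
declare const Apis: {
  listRuns(experimentId: string): Promise<Run[]>;
  getPipeline(pipelineId: string): Promise<{ name: string }>;
};
declare function setState(
  update: { runs: Run[] } | ((prev: { runs: Run[] }) => { runs: Run[] }),
): void;

async function loadRuns(experimentId: string): Promise<void> {
  // Render the table as soon as the run list arrives, with placeholder
  // pipeline names...
  const runs = await Apis.listRuns(experimentId);
  setState({ runs });

  // ...then fill in each pipeline name as its request resolves, instead of
  // blocking the first paint on every getPipeline call.
  runs.forEach(async run => {
    const pipeline = await Apis.getPipeline(run.pipelineId);
    setState(prev => ({
      runs: prev.runs.map(r =>
        r.id === run.id ? { ...r, pipelineName: pipeline.name } : r,
      ),
    }));
  });
}
```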
@yebrahim, the latency looks OK for a network call (~100ms). Rather than spending time on rendering the UI progressively, would it be possible to reduce the number of queries that are made to the backend? This could be done by adding some optimized, paginated queries to the backend.
I'd like to look a little more into the network latency before dismissing it, just to make sure we're not missing anything. Regardless, progressive rendering is low-hanging fruit; it should be less work than changing/adding APIs and using them. That said, an API change seems like the best resolution to the problem. Ideally we'd switch to something like GraphQL, but barring that, we can modify the list response body to include references to other objects (namely experiments and pipelines). This should improve list performance by about 3x.
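One possible shape for such a list response (the field names are hypothetical, just to illustrate embedding the references):

```typescript
// Hypothetical response shape, for illustration only: the list call embeds the
// referenced pipeline/experiment names so the table renders in one round trip,
// instead of one getPipeline/getExperiment request per row.
interface RunListItem {
  id: string;
  name: string;
  status: string;
  createdAt: string;
  pipelineRef: { id: string; name: string };    // was a separate getPipeline call
  experimentRef?: { id: string; name: string }; // was a separate getExperiment call
}

interface ListRunsResponse {
  runs: RunListItem[];
  nextPageToken?: string;
}
```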
Agree that an API change is the best solution. We can make returning the extra data optional. Switching to GraphQL would be great; if you would take this on (starting with a design proposal), that would be a fantastic contribution. I am not clear about the 3x improvement: when does the current performance start to degrade? Depending on the number of rows to render, how many queries are made today? How many queries would be made after your change? Is it possible to reduce the number of queries to just a couple, no matter the number of rows?
The 3x figure is actually a conservative estimate, based on the analysis in this comment. We have enough data to render by the time we've listed the runs for a given experiment (or all runs), but we still wait until we get the experiment's jobs, then each run's pipeline, then each run's experiment. The number of queries doesn't change; it's just that they'd be dispatched in parallel, with the UI rendering the data incrementally as it becomes available.
Did you check how the performance scales? What if there are 10 jobs or runs in the DB? 100? 1000? 10,000? 100,000? 1,000,000? What if the UI is rendering 10, 100, 200? |
Thanks for the suggestion; that did surface another issue, which can be seen in this screenshot. These requests were all started in parallel (within ~50ms of each other), but Chrome has a maximum of 6 concurrent connections to any given domain (see here), and it seems like most modern browsers do the same. The HTTP/1 spec suggests two (see section 8.1.4 here). So the requests are queued up in batches of 6. FYI: HTTP/2 multiplexes requests over the same connection. Changing the API seems like our best bet: either moving to GraphQL, or supporting fetching the details of multiple objects in a single request.
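A sketch of the second option, with a hypothetical endpoint path and query parameter (nothing like this exists in the API today):

```typescript
// Hypothetical batched endpoint; the path and the `ids` parameter don't exist
// today and are shown only to illustrate the idea.
async function getRunDetailsBatch(ids: string[]): Promise<unknown[]> {
  const response = await fetch(`/apis/runs/batch?ids=${ids.join(',')}`);
  return (await response.json()) as unknown[];
}

// One request for all visible rows instead of one per row, so nothing queues
// behind the browser's six-connections-per-host limit:
//   const details = await getRunDetailsBatch(runs.map(r => r.id));
```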
This might be unrelated, but the pipeline dashboard UI on our GKE cluster was really slow yesterday (we had probably around 100 runs). I cleaned up the pods used by previous workflows, following this comment: #844 (comment). After that, the speed was back to normal.
Closing for now, given the improvements to list queries.
This is a quick analysis of the experiment details page performance. Most of the time spent is because we have to make multiple consecutive requests to load all the information we need. We currently do this:
- Call the `getExperiment` API to get the experiment's details (name, description, etc.).
- Call the `listJobs` API to get all recurring jobs in this experiment.
- Call the `listRuns` API to show the first page of runs in this experiment.
- For each run in the list, call the `getRun` API to get its details (name, status, duration, etc.).
- For each run, call `getPipeline` on its pipeline ID, in order to show the pipeline name.
- For each run, call `getExperiment` on its experiment ID, if any, to show the experiment name. This is not needed when listing runs of a given experiment, but it's technical debt we accumulated, since we're using the same component to list runs everywhere.
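A rough sketch of that request chain, with hypothetical Apis helpers, to show why the page load time is roughly the sum of all the round trips:

```typescript
// Each await blocks the next call, so nothing renders until the whole chain
// (including the per-run calls) has completed. Helpers are hypothetical.
declare const Apis: {
  getExperiment(id: string): Promise<{ name: string }>;
  listJobs(experimentId: string): Promise<unknown[]>;
  listRuns(experimentId: string): Promise<Array<{
    id: string;
    pipelineId: string;
    experimentId?: string;
  }>>;
  getRun(runId: string): Promise<unknown>;
  getPipeline(pipelineId: string): Promise<{ name: string }>;
};

async function loadExperimentDetailsPage(experimentId: string): Promise<void> {
  await Apis.getExperiment(experimentId);          // name, description, ...
  await Apis.listJobs(experimentId);               // recurring jobs
  const runs = await Apis.listRuns(experimentId);  // first page of runs
  for (const run of runs) {
    await Apis.getRun(run.id);                     // run details
    await Apis.getPipeline(run.pipelineId);        // pipeline name
    if (run.experimentId) {
      await Apis.getExperiment(run.experimentId);  // experiment name
    }
  }
}
```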
Some low-hanging perf improvements can be obtained by doing the following: