Run list perf optimizations #149
Comments
@yebrahim, the bottleneck is most likely the number of simultaneous queries made to the API server and the underlying DB. I suggest the following improvements:
I can't get request service time from the API server; @IronPan can maybe comment on this. In any case, here's a waterfall of requests vs. render time:
The average request time shown here is about 40ms, and I think all of that is network latency. I might be wrong here, of course; there's no way for me to tell. Some profiling on the API server is needed.
One more thing: for each request, we actually make two network hops instead of one, since we use the frontend webserver as a proxy.
Does the frontend reuse the same connection pool to make requests, or does it start a new connection for each request? If the latter, that could be the cause of the latency.
What's a connection pool?
Here is a link: https://en.wikipedia.org/wiki/Connection_pool. The article discusses database connections, but the same idea applies to service connections. To summarize, the UI should establish connections to the service only a few times and keep reusing them (that's the pool of connections). Re-creating a new connection for every request leads to a significant increase in latency.
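To make the reuse concrete, here's a minimal Node sketch (the host, port, and helper name are hypothetical placeholders, not the actual frontend proxy code): a single keep-alive agent is shared across requests, so calls to the API server reuse existing sockets instead of opening a new connection every time.

```typescript
import * as http from 'http';

// Shared keep-alive agent: sockets to the API server are reused across
// requests instead of paying a fresh TCP handshake for every call.
const apiAgent = new http.Agent({ keepAlive: true, maxSockets: 10 });

// Hypothetical helper; the host and port are placeholders, not the real config.
function getFromApiServer(path: string): Promise<string> {
  return new Promise((resolve, reject) => {
    const req = http.request(
      { host: 'ml-pipeline-api-server', port: 8888, path, agent: apiAgent },
      res => {
        let body = '';
        res.on('data', chunk => (body += chunk));
        res.on('end', () => resolve(body));
      },
    );
    req.on('error', reject);
    req.end();
  });
}
```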
Maybe we should consider using tools such as https://github.com/prometheus/client_golang to instrument backend perf.
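That library targets the Go backend; purely to keep the examples here in one language, the same idea is sketched below with the Node prom-client package (the metric name, labels, and buckets are made up for illustration):

```typescript
import * as promClient from 'prom-client';

// Hypothetical metric; the name, labels, and buckets are placeholders.
const requestLatency = new promClient.Histogram({
  name: 'api_request_duration_seconds',
  help: 'Latency of requests made to the API server',
  labelNames: ['endpoint'],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5],
});

// Wrap any backend call to record how long it took.
async function timed<T>(endpoint: string, call: () => Promise<T>): Promise<T> {
  const stopTimer = requestLatency.startTimer({ endpoint });
  try {
    return await call();
  } finally {
    stopTimer();
  }
}

// Prometheus then scrapes the collected metrics, e.g. from a /metrics route:
//   res.end(await promClient.register.metrics());
```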
We're bound by network latency here. Some more numbers:
Numbers are all in milliseconds, and were gathered using curl both inside the API server's pod and outside the cluster. I can see there's potential for improving the list runs server endpoint's performance, but that is still negligible compared to the observed difference between inside and outside the pod, which I can only attribute to network latency. I'm not sure there's a way to improve the latency itself; I'll look into that next. It seems to me the low-hanging fruit is to stop blocking on all requests before rendering the UI. For the list runs call, for example, it may be enough to show the data as it comes in, rather than waiting to gather every bit of data.
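As a rough sketch of what "show the data as it comes in" could look like, assuming hypothetical Apis helpers and a React-style setState (not the actual component code): render the run rows as soon as listRuns returns, then fill in the per-run details as each request resolves.

```typescript
// Hypothetical shapes and helpers, for illustration only; the real frontend
// code is organized differently.
interface Run {
  id: string;
  name: string;
  pipelineId: string;
  pipelineName?: string;
}
declare const Apis: {
  listRuns(experimentId: string): Promise<Run[]>;
  getPipeline(pipelineId: string): Promise<{ name: string }>;
};
declare function setState(
  update: { runs: Run[] } | ((prev: { runs: Run[] }) => { runs: Run[] }),
): void;

async function loadRuns(experimentId: string): Promise<void> {
  // Render the table as soon as the run list arrives, with placeholder
  // pipeline names...
  const runs = await Apis.listRuns(experimentId);
  setState({ runs });

  // ...then fill in each pipeline name as its request resolves, instead of
  // blocking the first paint on every getPipeline call.
  runs.forEach(async run => {
    const pipeline = await Apis.getPipeline(run.pipelineId);
    setState(prev => ({
      runs: prev.runs.map(r =>
        r.id === run.id ? { ...r, pipelineName: pipeline.name } : r,
      ),
    }));
  });
}
```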
@yebrahim, the latency looks OK for a network call (~100ms). Rather than spending time on rendering the UI progressively, would it be possible to reduce the number of queries that are made to the backend? This could be done by adding some optimized, paginated queries to the backend.
I'd like to look a little more into the network latency before dismissing it, just to make sure we're not missing anything. Regardless, progressive rendering is low-hanging fruit; it should be less work than changing/adding APIs and using them. That said, an API change seems like the best resolution to the problem. Ideally we'd switch to something like GraphQL, but barring that, we can modify the list response body to include references to other objects (namely experiments and pipelines). This should improve list performance by about 3x.
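One possible shape for such a list response (the field names are hypothetical, just to illustrate embedding the references):

```typescript
// Hypothetical response shape, for illustration only: the list call embeds the
// referenced pipeline/experiment names so the table renders in one round trip,
// instead of one getPipeline/getExperiment request per row.
interface RunListItem {
  id: string;
  name: string;
  status: string;
  createdAt: string;
  pipelineRef: { id: string; name: string };    // was a separate getPipeline call
  experimentRef?: { id: string; name: string }; // was a separate getExperiment call
}

interface ListRunsResponse {
  runs: RunListItem[];
  nextPageToken?: string;
}
```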
Agree that an API change is the best solution. We can make returning the extra data optional. Switching to GraphQL would be great; if you would take this on (starting with a design proposal), that would be a fantastic contribution. I am not clear about the 3x improvement: when does the current performance start to degrade? Depending on the number of rows to render, how many queries are made today? How many queries would be made after your change? Is it possible to reduce the number of queries to just a couple, no matter the number of rows?
The 3x figure is actually a conservative estimate, based on the analysis in this comment. We have enough data to render by the time we've listed the runs for a given experiment (or all runs), but we still wait until we get the experiment's jobs, then each run's pipeline, then each run's experiment. The number of queries doesn't change; it's just that they'd be dispatched in parallel, with the UI rendering the data incrementally as it becomes available.
Did you check how the performance scales? What if there are 10 jobs or runs in the DB? 100? 1000? 10,000? 100,000? 1,000,000? What if the UI is rendering 10, 100, 200? |
Thanks for the suggestion; that did surface another issue, which can be seen in this screenshot. These requests were all started in parallel (within ~50ms of each other), but Chrome has a maximum of 6 concurrent connections to any given domain (see here), and it seems like most modern browsers do the same. The HTTP/1 spec suggests two (see section 8.1.4 here). So the requests are queued up in batches of 6. FYI: HTTP/2 multiplexes requests over the same connection. Changing the API seems like our best bet: either moving to GraphQL, or supporting fetching the details of multiple objects in a single request.
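A sketch of the second option, with a hypothetical endpoint path and query parameter (nothing like this exists in the API today):

```typescript
// Hypothetical batched endpoint; the path and the `ids` parameter don't exist
// today and are shown only to illustrate the idea.
async function getRunDetailsBatch(ids: string[]): Promise<unknown[]> {
  const response = await fetch(`/apis/runs/batch?ids=${ids.join(',')}`);
  return (await response.json()) as unknown[];
}

// One request for all visible rows instead of one per row, so nothing queues
// behind the browser's six-connections-per-host limit:
//   const details = await getRunDetailsBatch(runs.map(r => r.id));
```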
This might be unrelated, but the pipeline dashboard UI on our GKE cluster was really slow yesterday (we had probably around 100 runs). I cleaned up the pods used by previous workflows, following this comment: #844 (comment). After that, the speed was back to normal.
Closing for now, given the improvements to list queries.
This is a quick analysis of the experiment details page performance. Most of the time spent is because we have to make multiple consecutive requests to load all the information we need. We currently do this:
- Call the `getExperiment` API to get the experiment's details (name, description, etc.).
- Call the `listJobs` API to get all recurring jobs in this experiment.
- Call the `listRuns` API to show the first page of runs in this experiment.
- For each run in the list, call the `getRun` API to get its details (name, status, duration, etc.).
- For each run, call `getPipeline` on its pipeline ID, in order to show the pipeline name.
- For each run, call `getExperiment` on its experiment ID, if any, to show the experiment name. This is not needed when listing runs of a given experiment, but it's technical debt we accumulated, since we're using the same component to list runs everywhere.
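A rough sketch of that request chain, with hypothetical Apis helpers, to show why the page load time is roughly the sum of all the round trips:

```typescript
// Each await blocks the next call, so nothing renders until the whole chain
// (including the per-run calls) has completed. Helpers are hypothetical.
declare const Apis: {
  getExperiment(id: string): Promise<{ name: string }>;
  listJobs(experimentId: string): Promise<unknown[]>;
  listRuns(experimentId: string): Promise<Array<{
    id: string;
    pipelineId: string;
    experimentId?: string;
  }>>;
  getRun(runId: string): Promise<unknown>;
  getPipeline(pipelineId: string): Promise<{ name: string }>;
};

async function loadExperimentDetailsPage(experimentId: string): Promise<void> {
  await Apis.getExperiment(experimentId);          // name, description, ...
  await Apis.listJobs(experimentId);               // recurring jobs
  const runs = await Apis.listRuns(experimentId);  // first page of runs
  for (const run of runs) {
    await Apis.getRun(run.id);                     // run details
    await Apis.getPipeline(run.pipelineId);        // pipeline name
    if (run.experimentId) {
      await Apis.getExperiment(run.experimentId);  // experiment name
    }
  }
}
```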
Some low-hanging perf improvements can be obtained by doing the following: