Improvements that significantly reduce the chances of request timeouts #7042

hubertp · 2023-06-15T19:06:26Z

Pull Request Description

Request Timeouts started plaguing IDE due to numerous executionContext/***Visualization requests. While caused by a bug they revealed a bigger problem in the Language Server when serving large amounts of requests:

Long and short lived jobs are fighting for various locks. Lock contention leads to some jobs waiting for a longer than desired leading to unexpected request timeouts. Increasing timeout value is just delaying the problem.
Requests coming from IDE are served almost instantly and handled by various commands. Commands can issue further jobs that serve request. We apparently have and always had a single-thread thread pool for serving such jobs, leading to immediate thread starvation.

Both reasons increase the chances of Request Timeouts when dealing with a large number of requests. For 2) I noticed that while we used to set the enso-runtime-server.jobParallelism option descriptor key to some machine-dependent value (most likely > 1), the value set would only be available for instrumentation. JobExecutionEngine where it is actually used would always get the default, i.e. a single-threaded ThreadPool. This means that this option descriptor was simply misused since its introduction. Moved that option to runtime options so that it can be set and retrieved during normal operation.

Adding parallelism intensified problem 1), because now we could execute multiple jobs and they would compete for resources. It also revealed a scenario for a yet another deadlock scenario, due to invalid order of lock acquisition. See ExecuteJob vs UpsertVisualisationJob order for details.

Still, a number of requests would continue to randomly timeout due to lock contention. It became apparent that
Attach/Modify/Detach-VisualisationCmd should not wait until a triggered UpsertVisualisationJob sends a response to the client; long and short lived jobs will always compete for resources and we cannot guarantee that they will not timeout that way. That is why the response is sent immediately from the command handler and not from the job executed after it.

This brings another problematic scenario:

AttachVisualisationCmd is executed, response sent to the client, UpsertVisualisationJob scheduled.
In the meantime ModifyVisualisationCmd comes and fails; command cannot find the visualization that will only be added by UpsertVisualisationJob, which might have not yet been scheduled to run.

Remedied that by checking visualisation-related jobs that are still in progress. It also allowed for cancelling jobs which results wouldn't be used anyway (ModifyVisualisationCmd sends its own UpsertVisualisationJob). This is not a theoretical scenario, it happened frequently on IDE startup.

This change does not fully solve the rather problematic setup of numerous locks, which are requested by short and long lived jobs. A better design should still be investigated. But it significantly reduces the chances of Request Timeouts which IDE had to deal with.

With this change I haven't been able to experience Request Timeouts for relatively modest projects anymore.

I added the possibility of logging wait times for locks to better investigate further problems.

Closes #7005

Checklist

Please ensure that the following checklist has been satisfied before submitting the PR:

The documentation has been updated, if necessary.
Screenshots/screencasts have been attached, if there are any visual changes. For interactive or animated visual changes, a screencast is preferred.
All code follows the
Scala,
Java,
and
Rust
style guides. In case you are using a language not listed above, follow the Rust style guide.
All code has been tested:
- Unit tests have been written where possible.
- If GUI codebase was changed, the GUI was tested when built using ./run ide build.

Request Timeouts started plaguing IDE due to numerous `executionContext/***Visualization` requests. While caused by a bug they revealed a bigger problem in the Language Server when serving large amounts of requests: 1) Long and short lived jobs are fighting for various locks. Lock contention leads to some jobs waiting for a long time until they can be executed. Increasing timeout wait is not a solution. 2) Requests coming from IDE are served almost instantly and handled by various commands. Commands can issue further jobs that serve the request. We apparently have and always had a single-thread thread pool for serving such jobs, leading to immediate thread starvation. Both reasons increase the chances of Request Timeouts. For 2) I noticed that while we used to set the `enso-runtime-server.jobParallelism` option descriptor key to some machine-dependent value (most likely > 1), the value set would only be available for instrumentation. `JobExecutionEngine` where it is actually used would always get the default, i.e. a single-threaded ThreadPool. Option descriptors were simply misused, since the option was introduced. Moved that option to runtime options so that it can be set and retrieved during normal operation. Adding parallelism intensified problem 1, because now we could execute multiple jobs and they would compete for resources. It also revealed a scenario for a yet another deadlock scenario, due to invalid order of lock acquisition. See `ExecuteJob` vs `UpsertVisualisationJob` order for details. Still, a number of requests would continue to randomly timeout due to lock contention. It became apparent that `Attach/Modify/Detach-VisualisationCmd` should not wait until a triggered `UpsertVisualisationJob` sends a response to the client; long and short lived jobs will always compete for resources and we cannot guarantee that they will not timeout that way. So a response is sent quicker from the command handler. This brings another problematic scenario: 1. `AttachVisualisationCmd` is executed, response sent to the client, `UpsertVisualisationJob` scheduled 2. In the meantime `ModifyVisualisationCmd` comes and fails because it cannot find the visualization that will only be added by `UpsertVisualisationJob` that hasn't yet been scheduled to run. Remedied that by checking visualisation-related jobs that are still in progress. It also allowed for cancelling jobs which results wouldn't be used anyway (`ModifyVisualisationCmd` sends its own `UpsertVisualisationJob`). This is not a theoretical scenario, it happened frequently due to IDE requests. This change does not fully solve the rather problematic setup of numerous locks, which are requested by short and long lived jobs. A better design should still be investigated. But it significantly reduces the chances of Request Timeouts which IDE had to deal with. I haven't been able to experience Request Timeouts for relatively modest projects anymore. I added the possibility of logging wait times for locks to better investigate further problems.

4e6

Looks good

4e6 · 2023-06-16T14:21:25Z

engine/runtime/src/main/java/org/enso/interpreter/runtime/EnsoContext.java

+    var n = getOption(RuntimeOptions.JOB_PARALLELISM_KEY);
+    var base = n == null ? 1 : n.intValue();
+    var optimal = Math.round(base * 0.5);
+    return optimal < 1 ? 1 : (int) optimal;


Yeah, I didn't really feel like doing something smarter here is necessary. We already have another thread pool for commands and akka/zio executors.

#7042) Request Timeouts started plaguing IDE due to numerous `executionContext/***Visualization` requests. While caused by a bug they revealed a bigger problem in the Language Server when serving large amounts of requests: 1) Long and short lived jobs are fighting for various locks. Lock contention leads to some jobs waiting for a longer than desired leading to unexpected request timeouts. Increasing timeout value is just delaying the problem. 2) Requests coming from IDE are served almost instantly and handled by various commands. Commands can issue further jobs that serve request. We apparently have and always had a single-thread thread pool for serving such jobs, leading to immediate thread starvation. Both reasons increase the chances of Request Timeouts when dealing with a large number of requests. For 2) I noticed that while we used to set the `enso-runtime-server.jobParallelism` option descriptor key to some machine-dependent value (most likely > 1), the value set would **only** be available for instrumentation. `JobExecutionEngine` where it is actually used would always get the default, i.e. a single-threaded ThreadPool. This means that this option descriptor was simply misused since its introduction. Moved that option to runtime options so that it can be set and retrieved during normal operation. Adding parallelism intensified problem 1), because now we could execute multiple jobs and they would compete for resources. It also revealed a scenario for a yet another deadlock scenario, due to invalid order of lock acquisition. See `ExecuteJob` vs `UpsertVisualisationJob` order for details. Still, a number of requests would continue to randomly timeout due to lock contention. It became apparent that `Attach/Modify/Detach-VisualisationCmd` should not wait until a triggered `UpsertVisualisationJob` sends a response to the client; long and short lived jobs will always compete for resources and we cannot guarantee that they will not timeout that way. That is why the response is sent immediately from the command handler and not from the job executed after it. This brings another problematic scenario: 1. `AttachVisualisationCmd` is executed, response sent to the client, `UpsertVisualisationJob` scheduled. 2. In the meantime `ModifyVisualisationCmd` comes and fails; command cannot find the visualization that will only be added by `UpsertVisualisationJob`, which might have not yet been scheduled to run. Remedied that by checking visualisation-related jobs that are still in progress. It also allowed for cancelling jobs which results wouldn't be used anyway (`ModifyVisualisationCmd` sends its own `UpsertVisualisationJob`). This is not a theoretical scenario, it happened frequently on IDE startup. This change does not fully solve the rather problematic setup of numerous locks, which are requested by short and long lived jobs. A better design should still be investigated. But it significantly reduces the chances of Request Timeouts which IDE had to deal with. With this change I haven't been able to experience Request Timeouts for relatively modest projects anymore. I added the possibility of logging wait times for locks to better investigate further problems. Closes #7005

Evaluating visualization expression may trigger a full compilation. A change in #7042 went a bit too far and led to a situation when there could be compilations running at the same time leading to a rather obscure `RedefinedMethodException` when the compilation on one thread already finished. This will make the logic correct again at the price of potentially slowing the processing of visualization.

Evaluating visualization expression may trigger a full compilation. A change in #7042 went a bit too far and led to a situation when there could be compilations running at the same time leading to a rather obscure `RedefinedMethodException` when the compilation on one thread already finished. This will make the logic correct again at the price of potentially slowing the processing of visualization. Closes #8296. # Important Notes Should make visualizations a bit more stable as well.

hubertp added 3 commits June 16, 2023 10:47

update documentation

7506578

fmt after upgrade

403824b

hubertp force-pushed the wip/hubert/7005-mitigate-request-timeouts branch from 5c4f646 to 403824b Compare June 16, 2023 09:01

hubertp added 2 commits June 16, 2023 11:45

More logs

2b304ae

Changelog

1a6c104

hubertp marked this pull request as ready for review June 16, 2023 09:53

hubertp requested review from 4e6, JaroslavTulach and Akirathan as code owners June 16, 2023 09:53

nits

03e6f73

4e6 approved these changes Jun 16, 2023

View reviewed changes

hubertp added the CI: Ready to merge This PR is eligible for automatic merge label Jun 16, 2023

mergify bot merged commit e529f7b into develop Jun 16, 2023

mergify bot deleted the wip/hubert/7005-mitigate-request-timeouts branch June 16, 2023 17:57

hubertp mentioned this pull request Nov 24, 2023

RedefinedMethodException when evaluating visualization argument #8296

Closed

hubertp mentioned this pull request Nov 27, 2023

Ensure compilation is run with a compilation lock #8395

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improvements that significantly reduce the chances of request timeouts #7042

Improvements that significantly reduce the chances of request timeouts #7042

hubertp commented Jun 15, 2023 •

edited

Loading

4e6 left a comment

4e6 Jun 16, 2023

hubertp Jun 16, 2023

Improvements that significantly reduce the chances of request timeouts #7042

Improvements that significantly reduce the chances of request timeouts #7042

Conversation

hubertp commented Jun 15, 2023 • edited Loading

Pull Request Description

Checklist

4e6 left a comment

Choose a reason for hiding this comment

4e6 Jun 16, 2023

Choose a reason for hiding this comment

hubertp Jun 16, 2023

Choose a reason for hiding this comment

hubertp commented Jun 15, 2023 •

edited

Loading