
[Proposal] Adding control-queue monitor with dedicated orchestrator to each control-queue. #1058

Open
wants to merge 9 commits into base: main
Conversation


@pasaini-microsoft pasaini-microsoft commented Apr 3, 2024

Motivation:

Issue: Unpredictable time to detect (TTD) for any orchestration stuck in a control-queue.

  • We have been facing issues where DTF orchestrations would get stuck at random. Because customer load in our service is irregular, it was hard to know upfront whether an orchestration would be processed or would get stuck.
  • More often than not, customers would reach out with incidents complaining that their requests had not completed for a long time.
  • Motivation: reduce the TTD for finding whether an orchestration can be stuck or waiting forever in a control-queue, irrespective of the cause.
  • This is where the heartbeat orchestration comes into the picture: it evaluates the control-queue while sitting outside the system (say the system is completely down and cannot emit any metrics, or is emitting false/partial metrics).
  • We built this system locally in our service, and it has been detecting the issue with 100% accuracy. We built monitors around it to raise ICMs and mitigate the situation manually using TSGs.

Issue: Manual effort in identifying the impacted worker and running mitigation steps.

  • After adding heartbeat orchestrations, the next issue was that identifying the impacted worker and triggering mitigation still took manual steps and active engineering cycles.
  • Motivation: detect the impacted worker and perform the mitigation steps automatically.
  • This is where the monitor for these heartbeat orchestrations, and for the owner of each control-queue, comes in.
  • We built this system locally in our service, and it has been identifying the impacted worker predictably and auto-mitigating the issue.

With both these systems in place, we are able to run DTF with no orchestrations stuck, and without any manual steps for detection and mitigation. To bring this capability to all DTF users, we propose the change below.

Proposals:

Adding a control-queue monitor with a dedicated orchestrator for each control-queue.

This change consists of 2 main portions:

  1. Addition of an orchestration, with exactly one instance per control-queue.
  • This orchestrator (aka ControlQueueHeartbeatTaskOrchestrator) is very simple and quick to run.
  • It just waits for some (configurable) interval, logs a heartbeat message, and continues itself as new.
  • It validates that the context is correct, in case the partition count changes (it assumes the partition count doesn't change; otherwise it may end up closing the orchestrators).
  • It provides a way for the user to supply a delegate (callback) to run with each heartbeat.
  • It gets the partition-count information and creates instance-ids such that one is allotted to each control-queue.
  2. Addition of monitoring of these orchestrations based on their last processed time.
  • With one instance id per control-queue, the monitor checks the last-updated time of each orchestration instance and measures it against a threshold.
  • If an orchestration is found stuck, an appropriate message is logged.
  • It does a bit more work to find the point-in-time owner of each control-queue, to pin down the impacted TaskHubWorker owning that control-queue.
  • It also provides a way for the user to supply a delegate (callback) to run when it detects any anomaly (a stuck orchestration, a failure to fetch the owner or the orchestration instance, or a timeout in any of these).
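The per-partition instance-id scheme in item 1 can be sketched as follows. This is a hypothetical illustration in Python, not the PR's implementation: it assumes DurableTask.AzureStorage-style placement, where an orchestration lands on the control queue given by a hash of its instance id modulo the partition count. FNV-1a is used here only as a stand-in for the provider's actual hash function, and all names (`heartbeat_instance_ids`, the `control-queue-heartbeat` prefix) are made up for the sketch. The idea is simply to search suffixes until every control queue has exactly one dedicated instance id:

```python
def fnv1a_32(s: str) -> int:
    """32-bit FNV-1a hash (a stand-in for the storage provider's hash)."""
    h = 2166136261
    for b in s.encode("utf-8"):
        h ^= b
        h = (h * 16777619) & 0xFFFFFFFF
    return h


def partition_of(instance_id: str, partition_count: int) -> int:
    # Assumed placement rule: hash of the instance id modulo partition count.
    return fnv1a_32(instance_id) % partition_count


def heartbeat_instance_ids(partition_count: int,
                           prefix: str = "control-queue-heartbeat") -> list:
    """Create exactly one instance id per control queue by trying suffixes
    until every partition has been assigned an id."""
    ids = [None] * partition_count
    suffix = 0
    while any(i is None for i in ids):
        candidate = f"{prefix}-{suffix}"
        p = partition_of(candidate, partition_count)
        if ids[p] is None:
            ids[p] = candidate
        suffix += 1
    return ids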
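The detection side (item 2) then reduces to a per-instance staleness check. A minimal sketch, with hypothetical names, assuming the monitor can fetch each heartbeat instance's last-updated time from the instance store:

```python
from datetime import datetime, timedelta, timezone


def find_stuck_queues(last_updated_utc: dict,
                      threshold: timedelta,
                      now: datetime = None) -> list:
    """Return the heartbeat instance ids (one per control queue) whose
    orchestration has not made progress within the threshold."""
    now = now or datetime.now(timezone.utc)
    return [instance_id
            for instance_id, updated in last_updated_utc.items()
            if now - updated > threshold]
```

For each id returned, the monitor would look up the point-in-time owner of the corresponding control queue and invoke the user-supplied anomaly callback, as described above.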

/// <summary>
///
/// </summary>
public class FileWriter
Author

This will be removed and structured logging will be added.

@davidmrdavid davidmrdavid changed the title Adding control-queue monitor with dedicated orchestrator to each control-queue. [Proposal] Adding control-queue monitor with dedicated orchestrator to each control-queue. Apr 8, 2024
@davidmrdavid
Collaborator

Hi @pasaini-microsoft: Thanks for opening this PR. As we discussed internally, I'll loop in the other DTFx/DF maintainers for discussion regarding the scope of the PR, as well as the design. I'm personally most excited about the ability to create an instanceID targeting a specific partition, so at the very least I'd like to get that in.

It would help greatly if you could update the PR description to contain not just a list of the changes, but also a small motivation and background behind this change. For example, you can explain that you're using this utility in your own app to help you detect stuck orchestrations, and so on. Thanks!

@davidmrdavid davidmrdavid added dt.azurestorage DurableTask.AzureStorage enhancement labels Apr 8, 2024
Collaborator

@davidmrdavid davidmrdavid left a comment

Finished a partial pass. Please see my comment here for my thoughts on how I'd like to move this forward: #1058 (review)

Comment on lines 187 to 201
/// Time interval between control queue heartbeat orchestration.
/// </summary>
public TimeSpan ControlQueueHearbeatOrchestrationInterval { get; set; } = DefaultControlQueueHearbeatOrchestrationInterval;

/// <summary>
/// Time interval between control queue heartbeat orchestration.
/// </summary>
public TimeSpan ControlQueueOrchHeartbeatDetectionInterval { get; set; } = DefaultControlQueueOrchHeartbeatDetectionInterval;

/// <summary>
/// Time interval between control queue heartbeat orchestration.
/// </summary>
public TimeSpan ControlQueueOrchHeartbeatDetectionThreshold { get; set; } = DefaultControlQueueOrchHeartbeatDetectionThreshold;

/// <summary>
Collaborator
why do these all have the same description?

Author
updated.

Comment on lines +13 to +16
/// <summary>
/// The orchestration instance for control-queue is stuck.
/// </summary>
OrchestrationInstanceStuck,
Collaborator
nit - a note for myself. I'd rather not include the word "stuck" in these states, because a heartbeat cannot definitively conclude that the orchestration is stuck: the problem could be something else. I'm not sure what the right term should be, though, so I'll follow up on this thread later with some suggestions.

Comment on lines 7 to 8
InvalidInput,
InputContextMismatch
Collaborator
Can you please tell me a bit more about how one would enter these states?

Author
added comments.

Author
Usually it happens if someone updates the partition-count.

Comment on lines 90 to 100
// Ensuring this section doesn't run again.
// This queues the user provided callback without waiting for it to finish.
// This is to keep heartbeat orchestrator thin and fast.
if (!context.IsReplaying)
{
// No wait to complete provided delegate. The current orchestrator need to be very thin and quick to run.
bool isQueued = ThreadPool.QueueUserWorkItem(async (_) =>
{
await this.callBack(context.OrchestrationInstance, controlQueueHeartbeatTaskContextInput, controlQueueHeartbeatTaskContextInit, this.cancellationTokenSrc.Token);
});
}
Collaborator
I realize this was on @cgillum's suggestion, but I still think it's worth calling out in a comment why we're not following the DF orchestrator coding constraint of not having threading in orchestrators.

Author
updated.

Comment on lines +8 to +15
/// <summary>
/// Control-queue heartbeat orchestrator.
/// This is supposed to be initialized with ControlQueueHeartbeatTaskContext informing orchestrator about configuration of taskhubworker and heartbeat interval.
/// </summary>
internal class ControlQueueHeartbeatTaskOrchestrator : TaskOrchestration<ControlQueueHeartbeatTaskResult, ControlQueueHeartbeatTaskInputContext>
{
public const string OrchestrationName = "ControlQueueHeartbeatTaskOrchestrator";

Collaborator

@davidmrdavid davidmrdavid Apr 8, 2024

Given that this orchestrator is rather simple to write, my intuition is that customers wouldn't mind writing these themselves so long as they have the right utilities to let them schedule it on every partition.

With that in mind, my recommendation would be to make this orchestrator part of our samples, and to instead prioritize merging the work done to help users schedule an orchestrator in a given partitionID. I think this latter utility function (create an instanceID that is guaranteed to land on a given partition ID) has value beyond this queue-monitoring scenario, so it's easier to justify building that into the framework. Ideally, it would also be less code to maintain as well :-)

Author

The reasons for keeping it within DurableTask.AzureStorage were:

  • Not giving users the choice of writing their own orchestrators, as those might have gaps, like when to continue-as-new vs. complete.
  • The instance-ids are dedicated in such a way that one can reverse-engineer from the instance id which control-queue is stuck.
  • The owner-id information is not publicly available, and it is required for detection.

Though it is possible to allow orchestrations of arbitrary type, or to ship this just as a sample, that would increase the possibility of false positives if a user creates a faulty orchestrator.

Collaborator

I agree with @davidmrdavid here. Focusing this work to just a feature of manual control of what partition an orchestration goes to is something I am open to accepting.

As for the heartbeat orchestration, I'd have to think more on if we even want it as a sample. Even in samples we could have issues opened on us regarding it. Is that something we want to take ownership of? Or could you host the sample yourself on github? Or even package it up into your own nuget?

@jviau
Collaborator

jviau commented Apr 8, 2024

I love the idea of monitoring partition queue processing, but I don't think via an orchestration is the right way to do this. This is a health check after all, and there is a health check ecosystem in .NET. We should evaluate leveraging that. Metrics (#785) may also be a good avenue for this feature.

https://www.nuget.org/packages/Microsoft.Extensions.Diagnostics.HealthChecks/9.0.0-preview.2.24128.4

@pasaini-microsoft
Author


Thanks jviau. The reason for using an orchestration was to keep the detection as independent from the control-queue processing system as possible. This helps detect that a control-queue is stuck for any reason, including when the TaskHubWorker is simply absent or some control-queue just misses being owned by any worker. This basically helps avoid false negatives.
