Akka.Persistence HealthChecks #7842

Aaronontheweb · 2025-09-24T21:43:24Z

Changes

Implements #7840

Checklist

For significant changes, please ensure that the following have been completed (delete if not relevant):

This change follows the Akka.NET API Compatibility Guidelines.
This change follows the Akka.NET Wire Compatibility Guidelines.
I have reviewed my own pull request.
Design discussion issue Akka.Persistence HealthCheck API #7840
Changes in public API reviewed, if any.
I have added website documentation for this feature.

Latest `dev` Benchmarks

Include data from the relevant benchmark prior to this change here.

This PR's Benchmarks

Include data from after this change here.


Aaronontheweb

Detailed my changes so far. Looking for feedback before I move on to implementing the SnapshotStore checks.

src/core/Akka.Persistence.Tests/JournalHealthCheckSpec.cs

Aaronontheweb · 2025-09-24T21:50:51Z

src/core/Akka.Persistence/Journal/AsyncWriteJournal.cs

+        /// <summary>
+        /// Set to <c>true</c> when a fatal error has occurred, i.e., the Akka.Persistence configuration is illegal
+        /// </summary>
+        private bool _hasFatalError;


If we have fatal errors, such as configuration errors, in the CTOR we should try to capture those.

Would it be worth it to have an enum to distinguish the types of known errors?

Thinking with a devops hat, a fatal error due to config vs a fatal error due to a condition that forces a journal shutdown/restart (Is that a thing or am I misremembering? 😇) would be useful to track... but maybe I'm forgetting things here (probably?).

As an additional thought... IDK if it's a requirement for all plugins buuuut a simple check like getting the max ordering ID would be nice.

I could just capture the Exception itself in a nullable field - that's probably easiest.

There's only a handful of "fatal" exceptions - we treat the rest of the exceptions as "eventually recoverable"

> As an additional thought... IDK if it's a requirement for all plugins buuuut a simple check like getting the max ordering ID would be nice.

that might be a stretch just given some of the limitations across plugins
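To make the "capture the Exception itself in a nullable field" idea concrete, here is a minimal sketch. `FatalErrorTracker` and its members are hypothetical names used only for illustration; this is not the code that landed in `AsyncWriteJournal`.

```csharp
#nullable enable
using System;

// Hypothetical stand-in for the journal's internal state (assumption, not the PR's code).
public sealed class FatalErrorTracker
{
    // null means no fatal error has been observed so far.
    private Exception? _fatalError;

    public bool HasFatalError => _fatalError is not null;

    // Keep only the first fatal error; later failures are usually consequences of it.
    public void RecordFatal(Exception cause)
    {
        if (cause is null) throw new ArgumentNullException(nameof(cause));
        _fatalError ??= cause;
    }

    // The stored exception carries the error type and message, so no separate enum is needed.
    public string Describe() =>
        _fatalError is null
            ? "No fatal error recorded."
            : $"Fatal error: {_fatalError.GetType().Name}: {_fatalError.Message}";
}
```

Compared with a plain `bool`, the nullable field answers both "did something fatal happen?" and "what was it?", which is roughly what the enum suggestion was after.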


Aaronontheweb · 2025-09-24T21:57:59Z

src/core/Akka.Persistence/Journal/AsyncWriteJournal.cs

                case DeleteMessagesTo deleteMessagesTo:
                    HandleDeleteMessagesTo(deleteMessagesTo);
                    return true;
+                case CheckHealth checkHealth:


Messaging handler for the health check inside the journal.

Aaronontheweb · 2025-09-24T21:58:13Z

src/core/Akka.Persistence/Journal/AsyncWriteJournal.cs

        {
            var i = 0;
-            var enumerator = results?.GetEnumerator();
+            using var enumerator = results?.GetEnumerator();


Fixed a memory leak here.
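For readers unfamiliar with the pattern, the following self-contained sketch shows what `using var` changes: the enumerator is disposed when the enclosing scope exits, and a `null` resource is simply skipped. `TrackingSequence` and `EnumeratorDemo` are hypothetical types written for this illustration; how much the original omission mattered depends on the concrete enumerator type.

```csharp
#nullable enable
using System.Collections;
using System.Collections.Generic;

// Hypothetical sequence that counts how often its enumerators are disposed.
public sealed class TrackingSequence : IEnumerable<int>
{
    public int DisposeCount { get; private set; }

    public IEnumerator<int> GetEnumerator() => new Tracker(this);
    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();

    // Yields 0, 1, 2 and reports disposal back to its owner.
    private sealed class Tracker : IEnumerator<int>
    {
        private readonly TrackingSequence _owner;
        private int _index = -1;

        public Tracker(TrackingSequence owner) => _owner = owner;

        public int Current => _index;
        object IEnumerator.Current => Current;
        public bool MoveNext() => ++_index < 3;
        public void Reset() => _index = -1;
        public void Dispose() => _owner.DisposeCount++;
    }
}

public static class EnumeratorDemo
{
    public static int SumWithUsing(TrackingSequence? source)
    {
        // Disposed automatically when the method returns; no-op if source is null.
        using var enumerator = source?.GetEnumerator();
        var sum = 0;
        while (enumerator != null && enumerator.MoveNext())
            sum += enumerator.Current;
        return sum;
    }
}
```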

src/core/Akka.Persistence/Snapshot/SnapshotStore.cs

Aaronontheweb · 2025-09-24T21:59:57Z

src/core/Akka.Persistence/JournalProtocol.cs

+    {
+        public CheckHealth(CancellationToken cancellationToken)
+        {
+            CancellationToken = cancellationToken;


Health check messaging protocol - these messages are all tagged with INoSerializationVerificationNeeded so it's safe to pass the CancellationToken around via them.
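A rough sketch of why this works: a `CancellationToken` is only meaningful inside the process that owns its `CancellationTokenSource`, so it can ride on a message as long as that message never crosses the wire. The types below (`ILocalOnlyMessage`, `CheckHealthRequest`, `HealthProbe`) are hypothetical stand-ins and not the actual protocol classes from this PR.

```csharp
using System.Threading;
using System.Threading.Tasks;

// Hypothetical marker playing the role of Akka's INoSerializationVerificationNeeded.
public interface ILocalOnlyMessage { }

// Immutable request that carries the caller's token to the handler.
public sealed class CheckHealthRequest : ILocalOnlyMessage
{
    public CheckHealthRequest(CancellationToken cancellationToken)
    {
        CancellationToken = cancellationToken;
    }

    public CancellationToken CancellationToken { get; }
}

public static class HealthProbe
{
    // The handler honors the token it received through the message.
    public static async Task<string> RunAsync(CheckHealthRequest request)
    {
        request.CancellationToken.ThrowIfCancellationRequested();
        await Task.Yield(); // placeholder for the real asynchronous check
        return "Healthy";
    }
}
```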

Aaronontheweb · 2025-09-24T22:00:22Z

src/core/Akka.Persistence/Persistence.cs

+        /// <param name="journalPluginId">The HOCON id of the Akka.Persistence plugin./</param>
+        /// <param name="cancellationToken">An optional cancellation token.</param>
+        /// <returns>A <see cref="PersistenceHealthCheckResult"/> with health status and possibly a descriptive message.</returns>
+        public async Task<PersistenceHealthCheckResult> CheckJournalHealthAsync(string journalPluginId,


Convenience method for invoking the health check - if the journalPluginId is not found this method will err.

to11mtm

Left some thoughts. Might poke at it more but these were the hard hitters.

to11mtm · 2025-09-24T22:35:39Z

src/core/Akka.Persistence/Journal/AsyncWriteJournal.cs

+        {
+            if(_breaker.IsHalfOpen)
+                return Task.FromResult(new PersistenceHealthCheckResult(PersistenceHealthStatus.Degraded, 
+                    $"Circuit breaker is half-open, some operations may be failing intermittently with error: {_breaker.LastCaughtException?.Message ?? "N/A"}"));


Two thoughts looking at this:

First, IMO it would be nice if we had some way to track 'last success' (and maybe even have auto-polling as a config option to track it).

Second is a question, could we have a more 'structured' output instead of just a formatted string? It at least makes it easier to handle such events from a parsing standpoint (esp if we are able to add time or possibly other things, need to keep looking at this to know what's up)

> Second is a question, could we have a more 'structured' output instead of just a formatted string? It at least makes it easier to handle such events from a parsing standpoint (esp if we are able to add time or possibly other things, need to keep looking at this to know what's up)

As long as it can fit into a HealthCheckResult:

public HealthCheckResult(HealthStatus status, string? description = null, Exception? exception = null, IReadOnlyDictionary<string, object>? data = null)
{
    Status = status;
    Description = description;
    Exception = exception;
    Data = data ?? _emptyReadOnlyDictionary;
}

Then that should be fine.
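One way to read that constraint: anything structured has to travel in the status, the description, the exception, or the key/value data bag. The sketch below mirrors that four-part shape with hypothetical local types (`ProbeStatus`, `ProbeResult`, `ProbeResults`) and made-up data keys; it is an illustration under those assumptions, not the `PersistenceHealthCheckResult` that was eventually added.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical types that mirror the status/description/exception/data shape.
public enum ProbeStatus { Healthy, Degraded, Unhealthy }

public sealed record ProbeResult(
    ProbeStatus Status,
    string? Description = null,
    Exception? Exception = null,
    IReadOnlyDictionary<string, object>? Data = null);

public static class ProbeResults
{
    // Builds a degraded result whose details can be read without parsing a string.
    public static ProbeResult Degraded(
        string pluginId, string breakerState, Exception? lastError, DateTimeOffset observedAt)
    {
        var data = new Dictionary<string, object>
        {
            ["pluginId"] = pluginId,         // assumed key name
            ["breakerState"] = breakerState, // assumed key name
            ["observedAt"] = observedAt      // assumed key name
        };

        return new ProbeResult(
            ProbeStatus.Degraded,
            $"Circuit breaker is {breakerState}",
            lastError,
            data);
    }
}
```

Keeping the exception as an object and the extra fields in the dictionary means a consumer can filter on `breakerState` or a timestamp directly instead of matching on message text.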

to11mtm · 2025-09-24T22:42:45Z

src/core/Akka.Persistence/Journal/AsyncWriteJournal.cs

        }

-        private void ProcessResults(IImmutableList<Exception> results, int atomicWriteCount, WriteMessages writeMessage, IActorRef resequencer,
+        private static void ProcessResults(IImmutableList<Exception> results, int atomicWriteCount, WriteMessages writeMessage, IActorRef resequencer,


Soooo this is safe and I get why it may have happened. I would, however, suggest double-checking this against Persistence SQL (or another plugin of choice that uses AsyncWriteJournal with some extra logic behind it) just in case the jumps between method tables cause an issue.

(To be clear I'm probably overthinking this BUUUUUT It would still be good to know the difference just in case 😇🤷‍♂️)

This method was always private so it can't have side-effects in other plugins.

oh wow I'm dumb I meant to say the benchmarks.

Main side effect I'm thinking of is code locality.

But, again, probably overthinking it.

to11mtm · 2025-09-24T22:52:06Z

src/core/Akka.Persistence/Journal/AsyncWriteJournal.cs

+        /// <summary>
+        /// Set to <c>true</c> when a fatal error has occurred, i.e., the Akka.Persistence configuration is illegal
+        /// </summary>
+        private bool _hasFatalError;


Would it be worth it to have an enum to distinguish the types of known errors?

Thinking with a devops hat, a fatal error due to config vs a fatal error due to a condition that forces a journal shutdown/restart (Is that a thing or am I misremembering? 😇) would be useful to track... but maybe I'm forgetting things here (probably?).

As an additional thought... IDK if it's a requirement for all plugins buuuut a simple check like getting the max ordering ID would be nice.

to11mtm · 2025-09-24T22:57:19Z

src/core/Akka.Persistence/Journal/AsyncWriteJournal.cs

+            if(_breaker.IsOpen)
+                return Task.FromResult(new PersistenceHealthCheckResult(PersistenceHealthStatus.Degraded, 
+                    $"Circuit breaker is open, some operations may be failing intermittently with error: with error: {_breaker.LastCaughtException?.Message ?? "N/A"}"));
+            return Task.FromResult(_hasFatalError ? new PersistenceHealthCheckResult(PersistenceHealthStatus.Unhealthy, "Fatal error has occurred. The ActorSystem must be restarted.") 


nitpick: Formatting here makes this a PITA to read, probably best to split the ternary into more lines.

That's how I originally had it (the way you suggested) until Rider nitpicked me. Never doubt your vibe I guess 🤷
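For reference, the decision order in that hunk can be sketched with early returns instead of a trailing ternary. `BreakerHealth.Evaluate` is a hypothetical helper using plain strings in place of the real result type, and the final `Healthy` fall-through is an assumption because the hunk above is cut off before the ternary's else branch.

```csharp
// Hypothetical helper; plain tuples stand in for PersistenceHealthCheckResult.
public static class BreakerHealth
{
    public static (string Status, string Description) Evaluate(
        bool isHalfOpen, bool isOpen, bool hasFatalError)
    {
        // Breaker states are checked first, in the same order as the diff.
        if (isHalfOpen)
            return ("Degraded", "Circuit breaker is half-open");

        if (isOpen)
            return ("Degraded", "Circuit breaker is open");

        // One branch per line instead of a long ternary expression.
        if (hasFatalError)
            return ("Unhealthy", "Fatal error has occurred. The ActorSystem must be restarted.");

        // Assumed default: nothing wrong was detected.
        return ("Healthy", "");
    }
}
```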

src/core/Akka.Persistence/Journal/AsyncWriteJournal.cs

src/core/Akka.Persistence.Tests/JournalHealthCheckSpec.cs

Aaronontheweb

Detailed my changes

Aaronontheweb · 2025-09-30T18:01:23Z

src/core/Akka.API.Tests/verify/CoreAPISpec.ApprovePersistence.DotNet.verified.txt

These are all of the new public message and enum types we added to support health checks.

Aaronontheweb · 2025-09-30T18:01:52Z

src/core/Akka.Persistence.Tests/JournalHealthCheckSpec.cs

+                                               akka.actor.serialize-messages = off
+
+                                   """;
+        return TestConfigs.TestSchedulerConfig


We use the TestScheduler here to drive the CircuitBreaker resets.


Aaronontheweb · 2025-09-30T18:02:40Z

src/core/Akka.Persistence.Tests/JournalHealthCheckSpec.cs

+        var pluginHealth = await Extension.CheckJournalHealthAsync("akka.persistence.journal.failing-open", cts.Token);
+
+        Assert.Equal(PersistenceHealthStatus.Degraded, pluginHealth.Status);
+        Assert.Contains("Circuit breaker is open", pluginHealth.Description);


Probably not necessary to be this specific, but wanted to illustrate that leveraging the CircuitBreaker is a good default health heuristic.

Aaronontheweb · 2025-09-30T18:03:10Z

src/core/Akka.Persistence.Tests/JournalHealthCheckSpec.cs

+        testScheduler.Advance(TimeSpan.FromSeconds(1));
+
+        // Give the transition time to complete
+        await Task.Delay(100);


Even though we can use the TestScheduler to advance time, the circuit breaker still has to asynchronously perform its update.
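A fixed delay is the simplest way to wait for that asynchronous transition. As a possible alternative, a test could poll for the expected state with a deadline; the sketch below shows the idea with a hypothetical `Poll.UntilAsync` helper that is not part of this PR or of any library API.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

// Hypothetical test helper: re-checks a condition until it holds or the timeout elapses.
public static class Poll
{
    public static async Task<bool> UntilAsync(
        Func<bool> condition, TimeSpan timeout, TimeSpan interval)
    {
        var elapsed = Stopwatch.StartNew();
        while (true)
        {
            // Succeed as soon as the condition is observed.
            if (condition()) return true;

            // Give up once the deadline has passed.
            if (elapsed.Elapsed >= timeout) return false;

            await Task.Delay(interval);
        }
    }
}
```

The trade-off is that polling returns as soon as the state changes and fails explicitly on timeout, while a fixed `Task.Delay` is shorter to write but can be either too long or too short on a loaded build agent.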


Aaronontheweb · 2025-09-30T18:04:43Z

src/core/Akka.Persistence/Journal/AsyncWriteJournal.cs

+        {
+            if(_breaker.IsHalfOpen)
+                return Task.FromResult(new PersistenceHealthCheckResult(PersistenceHealthStatus.Degraded, 
+                    $"Circuit breaker is half-open, some operations may be failing intermittently", _breaker.LastCaughtException, _defaultHealthCheckTags));


We record the PersistenceHealthStatus, a description, the last exception caught by the CircuitBreaker, and our default tags - which just include the plugin id.

Aaronontheweb · 2025-09-30T18:09:56Z

src/core/Akka.Persistence/Journal/AsyncWriteJournal.cs

                    return true;
+                case CheckJournalHealth checkHealth:
+                    var sender = Sender;
+                    CheckHealthAsync(checkHealth.CancellationToken)


Actual handling of the health check invocation

Aaronontheweb · 2025-09-30T18:10:09Z

src/core/Akka.Persistence/Journal/AsyncWriteJournal.cs


            async Task ExecuteHighestSequenceNr()
            {
-                void CompleteHighSeqNo(long highSeqNo)


Moved this below the return

Aaronontheweb · 2025-09-30T18:10:56Z

src/core/Akka.Persistence/Persistence.cs

+        /// <param name="snapshotStorePluginId">The HOCON id of the Akka.Persistence plugin.</param>
+        /// <param name="cancellationToken">An optional cancellation token.</param>
+        /// <returns>A <see cref="PersistenceHealthCheckResult"/> with health status and possibly a descriptive message.</returns>
+        public async Task<PersistenceHealthCheckResult> CheckSnapshotStoreHealthAsync(string snapshotStorePluginId,


Convenience method for checking the snapshot store plugin health.

Aaronontheweb · 2025-09-30T18:11:28Z

All of the failing unit tests are racy and not related to these changes.

Arkatufus

LGTM

Aaronontheweb · 2025-09-30T23:42:49Z

I'll need to port this to dev

* Akka.Persistence: add health check support to `AsyncWriteJournal` (akkadotnet#7840)
* added messaging protocol to support plugin health check
* Added tests for basic Akka.Persistence health checks
  this is mostly a sanity test. I don't want to get sucked into testing the `CircuitBreaker` necessarily either
* added structured output to health check results
* fix compilation errors
* added failure specs
* implemented `SnapshotStore` health checks
* renamed test class
* SnapshotStoreHealthCheckSpecs
* API approvals


Aaronontheweb added 3 commits September 24, 2025 16:03

Akka.Persistence: add health check support to AsyncWriteJournal

7b39e76

akkadotnet#7840

added messaging protocol to support plugin health check

4dc2d6b

Added tests for basic Akka.Persistence health checks

822d7ab

this is mostly a sanity test. I don't want to get sucked into testing the `CircuitBreaker` necessarily either

Aaronontheweb added the akka-persistence label Sep 24, 2025

Aaronontheweb commented Sep 24, 2025


Aaronontheweb mentioned this pull request Sep 24, 2025

Akka.Persistence HealthCheck API #7840

Closed

to11mtm reviewed Sep 24, 2025


Aaronontheweb added 7 commits September 29, 2025 17:02

added structured output to health check results

1454316

fix compilation errors

c8b35d5

added failure specs

3e7d428

implemented SnapshotStore health checks

150f77e

renamed test class

4a79ae3

SnapshotStoreHealthCheckSpecs

153d434

API approvals

0a79d08

Aaronontheweb changed the title ~~[WIP] Akka.Persistence HealthChecks~~ Akka.Persistence HealthChecks Sep 30, 2025

Aaronontheweb marked this pull request as ready for review September 30, 2025 17:58

Aaronontheweb commented Sep 30, 2025


Aaronontheweb mentioned this pull request Sep 30, 2025

Adding Akka.Persistence health checks akkadotnet/Akka.Hosting#662

Merged

6 tasks

Arkatufus approved these changes Sep 30, 2025


Aaronontheweb merged commit ba5e5f8 into akkadotnet:v1.5 Sep 30, 2025
6 of 11 checks passed

Aaronontheweb deleted the akka-persistence-healthchecks branch September 30, 2025 23:42

Arkatufus added this to the 1.5.51 milestone Oct 1, 2025

Arkatufus mentioned this pull request Oct 1, 2025

Update RELEASE_NOTES.md for 1.5.51 release #7844

Merged

dependabot bot mentioned this pull request Oct 1, 2025

Bump Akka.TestKit.Xunit2 from 1.5.49 to 1.5.51 petabridge/akka-bootcamp#410

Closed

Aaronontheweb mentioned this pull request Oct 1, 2025

Port Akka.Persistence HealthChecks (#7842) #7845

Merged

dependabot bot mentioned this pull request Oct 2, 2025

Bump Akka.Cluster.TestKit from 1.5.47 to 1.5.51 akkadotnet/Akka.MultiNodeTestRunner#325

Merged
