Skip to content

feat: Retry failed Event requests#123

Open
darielkremov wants to merge 4 commits intoPostHog:mainfrom
darielkremov:dariel/retry-requests
Open

feat: Retry failed Event requests#123
darielkremov wants to merge 4 commits intoPostHog:mainfrom
darielkremov:dariel/retry-requests

Conversation

@darielkremov
Copy link
Contributor

Added Polly to retry failed Event requests, as recommended in .NET docs. As the various client SDKs differ in configuration I've set up a minimal config that can be easily adjusted and extended to more APIs (for example flags, config, etc) if needed. The default settings try to match js/python client configs. We cover batch and capture endpoints for now, but not any of the flags or config endpoints to stay consistent with other clients.
We can easily adjust the exceptions or ApiResult statuses that are being retried if needed, for now it retries ApiException and timeouts, other exceptions throw directly.

A ResiliencePipeline can be passed to new PostHogClient(), if null it will create pipeline with default settings from PostHogOptions, or if no retry needed - ResiliencePipeline.Empty executes the callbacks directly with no extra logic.

Honestly, the added test is a bit more of an integration test than unit, but at least helps verify the behavior and executes quickly.

Closes #7

@darielkremov darielkremov requested a review from haacked as a code owner October 26, 2025 17:39
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

7 files reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

darielkremov and others added 2 commits October 26, 2025 19:42
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
@github-actions
Copy link
Contributor

Hey, this PR has been open for a while. Someone should look at it, or we can just close it if it's no longer relevant

@github-actions github-actions bot added the stale label Nov 10, 2025
@haacked haacked removed the stale label Dec 14, 2025
Copy link
Collaborator

@haacked haacked left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Appreciate this PR! I have a few must fix issues and a few nits.

Also, would be nice to add more test coverage: Right now, only one test covers retry behavior. It makes sense to add tests for:

  1. Success after 1-2 retries (not just exhaustion)
  2. Both /batch AND /capture endpoints
  3. 4xx errors should NOT retry
  4. 5xx errors SHOULD retry
  5. HttpRequestException handling
  6. Configuration options (MaxRetries=0, custom delays)
  7. Cancellation during retry
  8. Default pipeline construction

Comment on lines +69 to +74
ShouldHandle = args => args.Outcome switch
{
{ Exception: ApiException } => PredicateResult.True(),
{ Exception: TimeoutRejectedException } => PredicateResult.True(),
_ => PredicateResult.False()
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ShouldHandle predicate here retries all ApiException instances without checking HTTP status codes. This means 400/401/403/404 errors are retried, but these are permanent failures that will never succeed.

Also, I believe we should handle HttpRequestException for network-level failures (DNS, connection refused, socket errors). Maybe something like:

Suggested change
ShouldHandle = args => args.Outcome switch
{
{ Exception: ApiException } => PredicateResult.True(),
{ Exception: TimeoutRejectedException } => PredicateResult.True(),
_ => PredicateResult.False()
}
ShouldHandle = args => args.Outcome switch
{
{ Exception: ApiException { Status: >= HttpStatusCode.InternalServerError } } => PredicateResult.True(),
{ Exception: ApiException { Status: HttpStatusCode.TooManyRequests } } => PredicateResult.True(),
{ Exception: TimeoutRejectedException } => PredicateResult.True(),
{ Exception: HttpRequestException } => PredicateResult.True(),
_ => PredicateResult.False()
}

}
.AddRetry(new RetryStrategyOptions
{
MaxRetryAttempts = _options.Value.MaxRetries,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without UseJitter = true, simultaneous failures could cause thundering herd effects.

Suggested change
MaxRetryAttempts = _options.Value.MaxRetries,
MaxRetryAttempts = _options.Value.MaxRetries,
UseJitter = true,

public Uri HostUrl { get; set; } = new("https://us.i.posthog.com");

/// <summary>
/// Timeout in milliseconds for any calls. Defaults to 10 seconds.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: It's a TimeSpan, not milliseconds.

Suggested change
/// Timeout in milliseconds for any calls. Defaults to 10 seconds.
/// Timeout for any calls. Defaults to 10 seconds.

{ Exception: TimeoutRejectedException } => PredicateResult.True(),
_ => PredicateResult.False()
},
OnRetry = args =>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The retry pipeline here is nearly identical to the one created in PostHogClient. Might make sense to create an internal factory method for creating the pipeline and use that method in both places.

// This test is pretty expensive because it dynamically compiles and loads an assembly.
// Consider alternatives.
[Fact]
[Fact(Skip= "Takes long time to run because it dynamically compiles and loads an assembly.")]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems unrelated to this PR.

/// <summary>
/// Number of times to retry a failed request. Defaults to 3.
/// </summary>
public int MaxRetries { get; set; } = 3;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably validate this (and the other new settings) in the PostHogClient ctor.

For example, we may want to have a max value on the allowed retries. Make sure its not negative. Etc.

Same thing for the timespans. Make sure they're not less than TimeSpan.Zero. Etc.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds retry functionality to the PostHog .NET SDK using the Polly resilience library. Failed event requests to the /batch and /capture endpoints will now be automatically retried with exponential backoff to handle transient failures. The default configuration uses 3 retry attempts with a 3-second base delay, capped at 10 seconds, and a 10-second request timeout, matching settings from other PostHog client SDKs.

Key changes:

  • Added Polly.Core dependency for resilience patterns (retry, timeout)
  • Introduced configurable retry and timeout settings in PostHogOptions
  • Wrapped event API calls with a ResiliencePipeline that handles ApiException and TimeoutRejectedException

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
src/PostHog/PostHog.csproj Adds Polly.Core package reference for resilience functionality
src/PostHog/Config/PostHogOptions.cs Adds new configuration properties for request timeout, retry attempts, retry delay, and maximum retry delay
src/PostHog/PostHogClient.cs Adds ResiliencePipeline parameter and builds default pipeline with retry and timeout strategies when not provided
src/PostHog/Api/PostHogApiClient.cs Wraps batch and capture endpoint calls with ResiliencePipeline execution for automatic retries
tests/TestLibrary/TestContainer.cs Updates PostHogClient instantiation to pass ResiliencePipeline.Empty for deterministic test behavior
tests/TestLibrary/Fakes/FakeHttpMessageHandlerExtensions.cs Adds helper method to inject exceptions into batch responses for retry testing
tests/UnitTests/PostHogClientTests.cs Adds comprehensive retry behavior test and marks slow compilation test as skipped

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +42 to +43
/// Number of times to retry a failed request. Defaults to 3.
/// </summary>
Copy link

Copilot AI Dec 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation should clarify what types of requests are retried. According to the PR description, retries apply to event capture requests (batch and capture endpoints) but not to feature flag or config endpoints. This should be documented here to set clear expectations about which operations will be retried.

Suggested change
/// Number of times to retry a failed request. Defaults to 3.
/// </summary>
/// Number of times to retry a failed event capture request (batch and capture endpoints). Defaults to 3.
/// </summary>
/// <remarks>
/// Retries apply only to event capture requests (batch and capture endpoints). Feature flag and config endpoint requests are not retried.
/// </remarks>

Copilot uses AI. Check for mistakes.
var flushTask2 = client.FlushAsync();
await Task.Yield();

// Advance fake time past timeout to trigger the retry
Copy link

Copilot AI Dec 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment could be clearer about what's happening. The test advances time by 16 seconds to exceed the 15-second timeout configured on line 705, which causes a TimeoutRejectedException that triggers a retry. Consider revising the comment to: "Advance fake time past the 15-second timeout to cause a TimeoutRejectedException that will be retried" for better clarity.

Suggested change
// Advance fake time past timeout to trigger the retry
// Advance fake time past the 15-second timeout to cause a TimeoutRejectedException that will be retried

Copilot uses AI. Check for mistakes.
Comment on lines +63 to +77
.AddRetry(new RetryStrategyOptions
{
MaxRetryAttempts = _options.Value.MaxRetries,
Delay = _options.Value.RetryDelay,
MaxDelay = _options.Value.MaxRetryDelay,
BackoffType = DelayBackoffType.Exponential,
ShouldHandle = args => args.Outcome switch
{
{ Exception: ApiException } => PredicateResult.True(),
{ Exception: TimeoutRejectedException } => PredicateResult.True(),
_ => PredicateResult.False()
}
})
.AddTimeout(_options.Value.RequestTimeout)
.Build();
Copy link

Copilot AI Dec 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The timeout is applied after the retry strategy, which means it applies to the entire retry sequence. With the default configuration (MaxRetries=3, RetryDelay=3s, MaxRetryDelay=10s, exponential backoff), the maximum time for all retries is 3+6+10=19 seconds, but RequestTimeout is only 10 seconds. This means the operation will timeout before exhausting all retry attempts. The order should typically be reversed - add timeout before retry to apply timeout per-request, or adjust the RequestTimeout value to account for the total retry sequence duration (e.g., at least 20+ seconds to allow all retries).

Copilot uses AI. Check for mistakes.
// This test is pretty expensive because it dynamically compiles and loads an assembly.
// Consider alternatives.
[Fact]
[Fact(Skip= "Takes long time to run because it dynamically compiles and loads an assembly.")]
Copy link

Copilot AI Dec 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Skip parameter assignment is missing a space after the equals sign. For consistency with C# conventions and the rest of the codebase, it should be 'Skip = "..."' instead of 'Skip= "..."'.

Suggested change
[Fact(Skip= "Takes long time to run because it dynamically compiles and loads an assembly.")]
[Fact(Skip = "Takes long time to run because it dynamically compiles and loads an assembly.")]

Copilot uses AI. Check for mistakes.
BackoffType = DelayBackoffType.Exponential,
ShouldHandle = args => args.Outcome switch
{
{ Exception: ApiException } => PredicateResult.True(),
Copy link

Copilot AI Dec 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The retry strategy retries all ApiException instances regardless of HTTP status code. This means that client errors like 400 (Bad Request), 401 (Unauthorized), 403 (Forbidden), and 404 (Not Found) will be retried, which typically shouldn't be retried as they won't succeed on retry. Consider filtering ApiException based on the Status property to only retry transient errors (5xx server errors, 429 Too Many Requests, 408 Request Timeout, etc.) and not retry permanent client errors (4xx except certain codes like 429).

Suggested change
{ Exception: ApiException } => PredicateResult.True(),
{ Exception: ApiException apiEx } when
(apiEx.Status >= 500 && apiEx.Status < 600) || // 5xx server errors
apiEx.Status == 429 || // Too Many Requests
apiEx.Status == 408 // Request Timeout
=> PredicateResult.True(),

Copilot uses AI. Check for mistakes.
Comment on lines +69 to +74
ShouldHandle = args => args.Outcome switch
{
{ Exception: ApiException } => PredicateResult.True(),
{ Exception: TimeoutRejectedException } => PredicateResult.True(),
_ => PredicateResult.False()
}
Copy link

Copilot AI Dec 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The retry strategy doesn't handle HttpRequestException, which is commonly thrown for network-related failures like connection timeouts, DNS failures, or connection refused errors. These are typically transient errors that should be retried. Consider adding HttpRequestException to the ShouldHandle predicate to retry network-level failures in addition to ApiException and TimeoutRejectedException.

Copilot uses AI. Check for mistakes.
/// <param name="httpClientFactory">Creates <see cref="HttpClient"/> for making requests to PostHog's API.</param>
/// <param name="taskScheduler">Used to run tasks on the background.</param>
/// <param name="timeProvider">The time provider <see cref="TimeProvider"/> to use to determine time.</param>
/// <param name="resiliencePipeline">The <see cref="Polly.ResiliencePipeline"/> to use when making event requests to PostHog's API.</param>
Copy link

Copilot AI Dec 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The XML documentation should clarify that when resiliencePipeline is null, a default pipeline will be created using the retry and timeout settings from PostHogOptions. It should also mention that passing ResiliencePipeline.Empty will disable retry/timeout behavior entirely. This would help users understand the different configuration options available.

Suggested change
/// <param name="resiliencePipeline">The <see cref="Polly.ResiliencePipeline"/> to use when making event requests to PostHog's API.</param>
/// <param name="resiliencePipeline">
/// The <see cref="Polly.ResiliencePipeline"/> to use when making event requests to PostHog's API.
/// If <c>null</c>, a default pipeline will be created using the retry and timeout settings from <see cref="PostHogOptions"/>.
/// Passing <see cref="Polly.ResiliencePipeline.Empty"/> disables retry and timeout behavior entirely.
/// </param>

Copilot uses AI. Check for mistakes.
public Uri HostUrl { get; set; } = new("https://us.i.posthog.com");

/// <summary>
/// Timeout in milliseconds for any calls. Defaults to 10 seconds.
Copy link

Copilot AI Dec 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation incorrectly states "Timeout in milliseconds" but the property type is TimeSpan and the default is set using TimeSpan.FromSeconds(10). The documentation should be updated to accurately describe this as a timeout duration or time span, not specifically in milliseconds.

Suggested change
/// Timeout in milliseconds for any calls. Defaults to 10 seconds.
/// Timeout duration for any calls. Defaults to 10 seconds.

Copilot uses AI. Check for mistakes.
@github-actions
Copy link
Contributor

Hey, this PR has been open for a while. Someone should look at it, or we can just close it if it's no longer relevant

@github-actions github-actions bot added the stale label Dec 29, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Retry failed requests

3 participants