Skip to content

Conversation

@alexluong
Copy link
Collaborator

fixes #571

Fix: Treat webhook connection errors as delivery errors (retry) instead of system errors (DLQ)

Problem

Connection errors (DNS failures, connection refused, etc.) were being sent to DLQ instead of being retried. This happened because these errors were classified as "pre-delivery errors" rather than "delivery errors".

Solution

Network errors are now treated as delivery errors, which means they will be acknowledged and scheduled for retry instead of being nacked and sent to DLQ.

Error Classification

Code Cause Type Behavior
dns_error Domain doesn't exist Delivery Error ack + retry
connection_refused Server not running Delivery Error ack + retry
connection_reset Connection dropped Delivery Error ack + retry
network_unreachable Network unavailable Delivery Error ack + retry
timeout I/O timeout or deadline Delivery Error ack + retry
tls_error Certificate/TLS failure Delivery Error ack + retry
redirect_error Too many redirects Delivery Error ack + retry
network_error Other network issues Delivery Error ack + retry
canceled Context canceled (shutdown) System Error nack + requeue

The fix applies to both webhook and webhook_standard destination types.

alexluong and others added 7 commits December 16, 2025 15:06
* test: destwebhook.config.custom_headers tests

* feat: implement destwebhook.config.custom_headers

* feat: port custom headers implementation to destwebhookstandard

* docs: openapi.yaml

* chore: remove unused func

* feat: key value map portal ui

* fix: Fix custom header display and form logic

* chore: Remove left over console log

---------

Co-authored-by: Alexandre Bouchard <alexbouchardd@gmail.com>
* feat: port simple-json-match from js to golang

* feat: destination model filter field

* feat: destination filter api

* refactor: centralize destination event matching

* test: publishmq filter

* test: e2e filter test

* docs: openapi filter unset documentation

* test: e2e destination api with filter

* feat: portal filter ui

* chore: Filter UI display improvements and syntax guide

---------

Co-authored-by: Alexandre Bouchard <alexbouchardd@gmail.com>
…ilter (#599)

* feat: implement portal ui config

* chore: setup eslint & prettier for portal

* chore: gofmt

* docs: generate config
* test: e2e suite testing destination disable

* fix: check alert threshold larger or equal
@vercel
Copy link

vercel bot commented Dec 16, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Review Updated (UTC)
outpost-docs Ready Ready Preview, Comment Dec 16, 2025 0:54am
outpost-website Ready Ready Preview, Comment Dec 16, 2025 0:54am

@alexluong alexluong changed the title Dlq fix: treat webhook connection errors as delivery errors (retry) instead of system errors Dec 16, 2025
@alexluong alexluong changed the base branch from main to v0.10.0 December 16, 2025 12:55
@alexbouchardd
Copy link
Contributor

The fix applies to both webhook and webhook_standard destination types.

Can you clarify what the behavior is for say RabbitMQ if the queue is unavailable / has incorrect credentials? That should also retry. Anything that is outside of Outpost control really, should retry and not go to DLQ

@alexluong
Copy link
Collaborator Author

Can you clarify what the behavior is for say RabbitMQ if the queue is unavailable / has incorrect credentials?

I can look into RabbitMQ later. What do you think if we move forward with this PR and have RabbitMQ as a follow-up?

As a whole we need to be a bit more intentional about the error handling for every destination type. This PR focuses on webhook for now.

If you prefer we handle everything in this PR, happy to do so as well.

@alexbouchardd
Copy link
Contributor

alexbouchardd commented Dec 16, 2025

I'm just using RabbitMQ as an example of all other destination types here but no, I think that's fundamental design for the error handling where destination specific config shouldn't cause message to go to DQL and that's a generalized implementation rather then destination per destination

@alexluong
Copy link
Collaborator Author

Here's the delivery logic

deliverymq message handler:
  pre_delivery_err, delivery_err = deliver(destination, event)
  if pre_delivery_err -> nack
  if delivery_err -> schedule retry
  ack

It just happens that in webhook, the HTTP request is splitted into 2 steps. I am also just now realized that this PR basically will ignore cases where the system loses network, or if its DNS resolver has issue. That's why I said it's not that simple, at least from what I'm seeing.

For example, this is the step in RabbitMQ that would qualify as PreDelivery

CleanShot 2025-12-16 at 22 46 46

This could be because auth issue. Or RabbitMQ destination is down. But what if it's because some system issue? How can we differentiate?

@alexbouchardd
Copy link
Contributor

If we don't have a connection to rabbitMQ (or any destination for that matter) it's a delivery error, not "pre delivery". It's not "Outpost" fault. The only things that would outpost fault would be unexpected data in the message, errors fetch the destination config (Redis calls), parsing or application logic errors

@alexbouchardd
Copy link
Contributor

alexbouchardd commented Dec 16, 2025

Ultimately it's about who's responsible for the issue, the operator or the tenant

@alexluong
Copy link
Collaborator Author

I get that, I guess it doesn't quite click for me what's the simplest way to differentiate.

For example, when we see an error as "DNS resolution" or "connection reset" error or "generic network error", is that 100% of the time destination error?

@alexbouchardd
Copy link
Contributor

Yes it is, or least we treat it that way since its transient

@alexluong
Copy link
Collaborator Author

okay noted, let me noodle on this and apply this idea to the other destinations.

For now, I'd like to move forward with v0.10.0 without this PR, then focus on the CH implementation, then I'll get back to this. Do let me know if you prefer I prioritize this instead.

Base automatically changed from v0.10.0 to main December 17, 2025 13:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Clarification on DLQ usage and handling of user-caused delivery failures in SQS setups

3 participants