Rudderstack becomes extremely slow when you have one destination down #4953

Open
@fackyhigh

Description

Describe the bug
We run the data plane in k8s with 60 pods. When Grafana shows that one of our Webhook destinations is down, Rudderstack becomes extremely slow: webhook delivery time increases dramatically, the rt table count grows from 2 to ~30 per PostgreSQL pod, and webhook event sync lag goes from 10 seconds to almost one hour.

Steps to reproduce the bug

  1. Configure multiple webhooks as destinations for one source.
  2. Generate load.
  3. Make one of the webhooks fail.
  4. Watch the rt table count, webhook delivery time, and event sync lag grow.
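
For steps 1 and 3, a pair of stub webhook endpoints can stand in for real destinations. The following is only a minimal sketch, not our production setup: the ports (8081, 8082), the 503 response, and the helper names `make_handler` and `start_webhook` are assumptions made for illustration. It simulates a destination that answers with an error; a destination that is down because it times out or refuses connections may behave differently.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def make_handler(status_code):
    """Build a handler class that answers every POST with status_code."""

    class WebhookHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            # Drain the request body so the client can finish sending.
            length = int(self.headers.get("Content-Length", 0))
            self.rfile.read(length)
            self.send_response(status_code)
            self.send_header("Content-Length", "0")
            self.end_headers()

        def log_message(self, format, *args):
            # Suppress per-request logging to keep the console quiet under load.
            pass

    return WebhookHandler


def start_webhook(port, status_code, host="127.0.0.1"):
    """Start a stub webhook in a daemon thread and return the server object."""
    server = ThreadingHTTPServer((host, port), make_handler(status_code))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


if __name__ == "__main__":
    healthy = start_webhook(8081, 200)  # destination that stays up
    failing = start_webhook(8082, 503)  # destination that always fails
    print("healthy webhook on :8081, failing webhook on :8082; Ctrl+C to stop")
    try:
        threading.Event().wait()  # block until interrupted
    except KeyboardInterrupt:
        healthy.shutdown()
        failing.shutdown()
```

To use it against a cluster, the stubs would need to listen on an address the data plane can reach (for example `host="0.0.0.0"`), and each URL would be configured as a separate Webhook destination on the same source.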

Expected behavior
When one destination is down, the system should keep running fast, and the outage should not affect other destinations.

Screenshots
[two screenshots attached]

Any additional context
Rudderstack version is 1.28.1

Please tell us what to tweak so that Rudderstack keeps working as usual when one destination goes down. It would also be appreciated if you could share how the retry logic actually works and why it affects other destinations.
