Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add blog re post of how Zapier uses KEDA #687

Merged
merged 2 commits into from
Mar 9, 2022
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
22 changes: 22 additions & 0 deletions content/blog/2022-03-10-how-zapier-uses-keda.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
+++
title = "How Zapier uses KEDA"
date = 2022-03-10
author = "Ratnadeep Debnath (Zapier)"
aliases = [
"/blog/how-zapier-uses-keda"
]
+++

[RabbitMQ](https://www.rabbitmq.com/) is at the heart of Zap processing at [Zapier](https://zapier.com). We enqueue messages to RabbitMQ for each step in a Zap. These messages get consumed by our backend workers, which run on [Kubernetes](https://kubernetes.io). To keep up with the varying task loads in Zapier we need to scale our workers with our message backlog.

For a long time, we scaled with CPU-based autoscaling using Kubernetes native Horizontal Pod Autoscale (HPA), where more tasks led to more processing, increasing CPU usage, and triggering our workers' autoscaling. It seemed to work pretty well, except for certain edge cases.

We do a lot of [blocking I/O](https://medium.com/coderscorner/tale-of-client-server-and-socket-a6ef54a74763) in Python (we don’t use an event-based loop in our workers written in Python). This means that we could have a fleet of workers idling on blocking I/O with low CPU profiles while the queue keeps growing unbounded, as low CPU usage would prevent autoscaling from kicking in. In a situation where workers are idle while waiting for I/O, we could have a growing backlog of messages that a CPU-based autoscaler would miss. This situation allowed a traffic jam to form and introduce delays in processing Zap tasks.

Ideally, we would like to scale our workers on both CPU and our backlog of ready messages in RabbitMQ. Unfortunately, Kubernetes’ native HPA does not support scaling based on RabbitMQ queue length out of the box. There is a potential solution by collecting RabbitMQ metrics in Prometheus, creating a custom metrics server, and configuring HPA to use these metrics. However, this is a lot of work and why reinvent the wheel when there’s KEDA.

We have installed KEDA in our Kubernetes clusters and started opting into KEDA for autoscaling. Our goal is to autoscale our workers not just based on CPU usage, but also based on the number of ready messages in RabbitMQ queues they are consuming from.

At Zapier, we use KEDA to autoscale our workers not just based on CPU usage, but also based on the number of ready messages in RabbitMQ queues they are consuming from. We monitor our KEDA setup in Grafana using Prometheus metrics, and use Prometheus rules to alert on errors. Using KEDA to autoscale our workers significantly prevented delays in our Zap processing due to blocked I/O calls. We are slowly updating apps at Zapier to use KEDA.

tomkerkhove marked this conversation as resolved.
Show resolved Hide resolved
**Read the rest of this [post on how Zapier uses KEDA](https://www.cncf.io/blog/2022/01/21/keda-at-zapier/) on the Cloud Native Computing Foundation blog.**