Add additional metrics for worker saturation analysis and scaling #11

Open: wants to merge 1 commit into base: master
25 changes: 25 additions & 0 deletions lib/yabeda/sidekiq.rb
@@ -34,6 +34,10 @@ module Sidekiq
gauge :active_processes, tags: [], comment: "The number of active Sidekiq worker processes."
gauge :queue_latency, tags: %i[queue], comment: "The queue latency, the difference in seconds since the oldest job in the queue was enqueued"

gauge :concurrency, tags: [], comment: "The total number of jobs that can be run at a time across all processes."
gauge :available_workers, tags: [], comment: "The number of workers available for new jobs across all processes."
Member

From the metric name alone, I can't tell whether it refers to processes or threads; I had to read the collector's code below to find out. Let's clarify:

Suggested change
gauge :available_workers, tags: [], comment: "The number of workers available for new jobs across all processes."
gauge :available_threads, tags: [], comment: "The number of threads available for job processing across all processes."

Author

This name was chosen to be consistent with how active workers are already reported; it looks like Sidekiq uses the term “worker” to mean “thread”.

gauge :saturation, tags: [], comment: "Percentage of workers available for new jobs across all processes."

histogram :job_latency, comment: "The job latency, the difference in seconds between enqueued and running time",
unit: :seconds, per: :job,
tags: %i[queue worker],
@@ -59,6 +63,27 @@ module Sidekiq
sidekiq_queue_latency.set({ queue: queue.name }, queue.latency)
end

# Process-level metrics. These come from a common pool, but we can calculate them as global values.
# The "quiet" flag (set when the process receives TSTP signal) is only available in the global ProcessSet,
# so we may as well get everything from there.
process_set = ::Sidekiq::ProcessSet.new
total_concurrency = 0
total_available_workers = 0
process_set.each do |process|
concurrency = process['concurrency']
busy_workers = process['busy']
available_workers = (process['quiet'] == 'true') ? 0 : (concurrency - busy_workers)

total_concurrency += concurrency
total_available_workers += available_workers
end
# Use available_workers instead of busy_workers here because we want quieted processes to report as full.
saturation = 1 - (total_available_workers.to_f / total_concurrency)

sidekiq_concurrency.set({}, total_concurrency)
sidekiq_available_workers.set({}, total_available_workers)
sidekiq_saturation.set({}, saturation)

# That is quite slow if your retry set is large
# I don't want to enable it by default
# retries_by_queues =
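
For readers who want to check the arithmetic in the collector block, here is a minimal standalone sketch of the same aggregation. It does not load Sidekiq: the `processes` array stands in for the hashes yielded by `::Sidekiq::ProcessSet#each`, and the method name `saturation_summary` and the sample values are hypothetical. As computed in the diff, saturation is the fraction of worker threads that are not available for new jobs, in the range 0 to 1. The zero-concurrency guard is an assumption added for this sketch and is not part of the diff.

```ruby
# Standalone sketch of the aggregation from the collector block.
# `processes` mimics the hashes yielded by ::Sidekiq::ProcessSet#each.
def saturation_summary(processes)
  total_concurrency = 0
  total_available = 0

  processes.each do |process|
    concurrency = process['concurrency']
    busy = process['busy']
    # A quieted process accepts no new jobs, so count it as fully occupied.
    available = process['quiet'] == 'true' ? 0 : (concurrency - busy)

    total_concurrency += concurrency
    total_available += available
  end

  # Assumption (not in the diff): report 0.0 when no processes are running,
  # since dividing by a zero total would otherwise yield NaN.
  saturation =
    if total_concurrency.zero?
      0.0
    else
      1 - (total_available.to_f / total_concurrency)
    end

  {
    concurrency: total_concurrency,
    available_workers: total_available,
    saturation: saturation
  }
end

# Hypothetical sample: one active process and one quieted process.
sample = [
  { 'concurrency' => 10, 'busy' => 5, 'quiet' => 'false' },
  { 'concurrency' => 10, 'busy' => 2, 'quiet' => 'true' }
]
summary = saturation_summary(sample)
```

With this sample, the quieted process contributes its full concurrency to the total but no available threads, so 5 of 20 threads are available and saturation comes out at 0.75.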