Add basic Prometheus instrumentation for workers #111
Merged
This PR adds basic Prometheus instrumentation for workers, i.e. both Aleph workers and ingest-file workers will expose these metrics once upgraded to the latest servicelayer version.
The following metrics are exposed:
- `task_started_total`: Counter for the total number of tasks that have started processing. Has labels for `stage` (e.g. `ingest`, `reindex`, `xref`, …) and `retries` (this makes it possible to find out how many tasks were successfully processed only after n retries).
- `task_succeeded_total`: Counter for the total number of successfully processed tasks. Has labels for `stage` and `retries`.
- `task_failed_total`: Counter for the total number of failed tasks. This includes tasks that failed non-permanently and will be retried as well as tasks that have already exhausted the maximum number of retries. In addition to the `stage` and `retries` labels, this has a `failed_permanently` label.
- `task_duration_seconds`: Histogram that tracks task processing duration. Histogram metrics require setting fixed buckets. As task duration can vary quite a bit depending on stage (e.g. reindex tasks are usually super fast, whereas xrefs take longer), I’ve set them up to cover a wide range for now (250ms to 24h). We might want to adjust them later based on the metrics we collect in production.
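For orientation, here is a minimal sketch of how metrics like the ones listed above could be declared and updated with the `prometheus_client` Python library. It is not the code from this PR: the bucket boundaries, the `MAX_RETRIES` value, the `run_task` helper, and the choice of a single `stage` label on the histogram are all assumptions made for illustration.

```python
import time

from prometheus_client import CollectorRegistry, Counter, Histogram

# A private registry keeps this sketch isolated from the process-wide default.
registry = CollectorRegistry()

# Hypothetical bucket boundaries in seconds, spanning 250 ms to 24 h.
DURATION_BUCKETS = (0.25, 1, 5, 30, 60, 300, 1800, 3600, 21600, 86400)

# Hypothetical retry limit, used only to derive the failed_permanently label.
MAX_RETRIES = 3

TASK_STARTED = Counter(
    "task_started_total",
    "Total number of tasks that have started processing",
    ["stage", "retries"],
    registry=registry,
)
TASK_SUCCEEDED = Counter(
    "task_succeeded_total",
    "Total number of successfully processed tasks",
    ["stage", "retries"],
    registry=registry,
)
TASK_FAILED = Counter(
    "task_failed_total",
    "Total number of failed tasks",
    ["stage", "retries", "failed_permanently"],
    registry=registry,
)
TASK_DURATION = Histogram(
    "task_duration_seconds",
    "Task processing duration in seconds",
    ["stage"],  # assumption: the label set of the real histogram may differ
    buckets=DURATION_BUCKETS,
    registry=registry,
)


def run_task(stage, retries, handler):
    """Run handler() and update the counters and the histogram around it."""
    labels = {"stage": stage, "retries": str(retries)}
    TASK_STARTED.labels(**labels).inc()
    start = time.monotonic()
    try:
        handler()
    except Exception:
        permanent = "true" if retries >= MAX_RETRIES else "false"
        TASK_FAILED.labels(failed_permanently=permanent, **labels).inc()
        return False
    # Duration is only observed for successful tasks in this sketch.
    TASK_DURATION.labels(stage=stage).observe(time.monotonic() - start)
    TASK_SUCCEEDED.labels(**labels).inc()
    return True
```

Calling `run_task("ingest", 0, some_handler)` increments `task_started_total` and then either `task_succeeded_total` (plus one histogram observation) or `task_failed_total`. Values can be inspected with `registry.get_sample_value(...)`.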
I think it would be nice if we could also expose the wait time of a task (difference between the time the task was created and the time a worker started processing it) and/or the total time from creation to success/failure, as these are metrics that are directly relevant to end users. However, I don’t think we currently store task creation time.
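As a rough illustration of the wait-time idea, the sketch below assumes a task record that carries a creation timestamp. Since creation time is not currently stored, the `Task` class, its `created_at` field, and the `wait_seconds` helper are all hypothetical.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Task:
    """Hypothetical task record that remembers when it was created."""

    stage: str
    # Wall-clock timestamp (seconds since the epoch), set when the task is queued.
    created_at: float = field(default_factory=time.time)


def wait_seconds(task, started_at=None):
    """Return how long the task waited between creation and start of processing."""
    if started_at is None:
        started_at = time.time()
    # Clamp at zero in case the clocks of producer and worker disagree slightly.
    return max(0.0, started_at - task.created_at)
```

Wall-clock time is used here because the task is created and processed in different processes, where monotonic clocks are not comparable. The resulting value could then be observed in a second histogram alongside `task_duration_seconds`.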
Technically, `task_succeeded_total` is redundant: under the hood, histogram metrics store several other metrics, including the sum of all observations and the count of observations. However, these are often named confusingly (e.g. `task_duration_seconds_count`). Also, `task_succeeded_total` has an additional `retries` label, so I think it’s worth tracking the success count separately.

How to test this
Start Aleph on your computer.
(You can follow the same steps for the Aleph worker.)
The logs should include a message indicating that a metrics server is running on port 9090. You should now be able to access the metrics from your host machine.
Open the Aleph UI, upload a few documents, and check the exposed metrics again.
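If you prefer to check the task metrics programmatically, a small helper like the following could be used. It is only a sketch: the URL assumes the metrics port is reachable from the host as `localhost:9090`, the function names are made up for this example, and the parser ignores the optional timestamp field of the Prometheus text format.

```python
from urllib.request import urlopen


def fetch_metrics(url="http://localhost:9090/"):
    """Download the metrics page as text (the URL is an assumption)."""
    with urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8")


def task_samples(text):
    """Return a dict of all task_* samples found in Prometheus text output."""
    samples = {}
    for line in text.splitlines():
        # Skip blank lines and the "# HELP" / "# TYPE" comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        # The value is the last space-separated field on the line.
        name_and_labels, _, value = line.rpartition(" ")
        if name_and_labels.startswith("task_"):
            samples[name_and_labels] = float(value)
    return samples
```

Running `task_samples(fetch_metrics())` before and after uploading documents should show the `task_started_total` and `task_succeeded_total` samples for the relevant stages increasing.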