Add basic Prometheus instrumentation for workers #111

Merged (3 commits) on Oct 13, 2023



@tillprochaska tillprochaska commented Sep 20, 2023

This PR adds basic Prometheus instrumentation for workers, i.e. both Aleph workers and ingest-file workers will expose these metrics once upgraded to the latest servicelayer version.

The following metrics are exposed:

  • task_started_total: Counter for the total number of tasks that have started processing. Has labels for stage (e.g. ingest, reindex, xref, …) and retries (this makes it possible to find out how many tasks were successfully processed only after n retries).

  • task_succeeded_total: Counter for the total number of successfully processed tasks. Has labels for stage and retries.

  • task_failed_total: Counter for the total number of failed tasks. This includes tasks that failed non-permanently and will be retried as well as tasks that have already exhausted the maximum number of retries. In addition to the stage and retries labels this has a failed_permanently label.

  • task_duration_seconds: Histogram that tracks task processing duration. Histogram metrics require setting fixed buckets. As task duration can vary quite a bit depending on stage (e.g. reindex tasks are usually very fast, whereas xref tasks take longer), I’ve set the buckets up to cover a wide range for now (250ms to 24h). We might want to adjust them later based on the metrics we collect in production.
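
For readers less familiar with the Prometheus Python client, here is a minimal sketch of how the four metrics listed above could be declared and updated with `prometheus_client`. It is not the code from this PR. The `run_task` helper, the `MAX_RETRIES` value, the `stage` label on the histogram, the string format of `failed_permanently`, and the concrete bucket boundaries are all assumptions for illustration (the description only says the buckets span 250ms to 24h).

```python
import time

from prometheus_client import CollectorRegistry, Counter, Histogram

# A dedicated registry keeps this sketch isolated from the global default one.
REGISTRY = CollectorRegistry()

TASK_STARTED = Counter(
    "task_started_total", "Tasks that started processing",
    ["stage", "retries"], registry=REGISTRY,
)
TASK_SUCCEEDED = Counter(
    "task_succeeded_total", "Successfully processed tasks",
    ["stage", "retries"], registry=REGISTRY,
)
TASK_FAILED = Counter(
    "task_failed_total", "Failed tasks (retryable and permanent)",
    ["stage", "retries", "failed_permanently"], registry=REGISTRY,
)
TASK_DURATION = Histogram(
    "task_duration_seconds", "Task processing duration in seconds",
    ["stage"],  # assumption: the PR text does not list the histogram labels
    # Assumed boundaries covering 250ms to 24h; the real values may differ.
    buckets=(0.25, 1, 5, 15, 60, 300, 900, 3600, 21600, 86400),
    registry=REGISTRY,
)

MAX_RETRIES = 3  # hypothetical retry limit


def run_task(stage, retries, handler):
    """Run `handler` and record the metrics around it (hypothetical helper)."""
    labels = {"stage": stage, "retries": str(retries)}
    TASK_STARTED.labels(**labels).inc()
    start = time.monotonic()
    try:
        handler()
    except Exception:
        # A failure is permanent once the retry budget is exhausted.
        permanent = retries >= MAX_RETRIES
        TASK_FAILED.labels(
            failed_permanently=str(permanent).lower(), **labels
        ).inc()
        return False
    # Duration is observed for successful tasks only in this sketch.
    TASK_DURATION.labels(stage=stage).observe(time.monotonic() - start)
    TASK_SUCCEEDED.labels(**labels).inc()
    return True
```

Keeping `retries` as a label means each retry count becomes its own time series, which is what allows queries such as "tasks that only succeeded after n retries".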

I think it would be nice if we could also expose the wait time of a task (difference between the time the task was created and the time a worker started processing it) and/or the total time from creation to success/failure, as these are metrics that are directly relevant to end users. However, I don’t think we currently store task creation time.
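
To make the two proposed measurements concrete, here is a small sketch of the arithmetic. It assumes each task carried a creation timestamp, which, as noted above, is not currently stored. The function names and the `created_at`, `started_at` and `finished_at` parameters are hypothetical.

```python
from datetime import datetime, timezone


def wait_seconds(created_at, started_at):
    """Time a task spent queued before a worker picked it up."""
    # Clamp at zero to guard against small clock differences between hosts.
    return max((started_at - created_at).total_seconds(), 0.0)


def total_seconds(created_at, finished_at):
    """Time from task creation until it succeeded or failed."""
    return max((finished_at - created_at).total_seconds(), 0.0)


created = datetime(2023, 9, 20, 8, 0, 0, tzinfo=timezone.utc)
started = datetime(2023, 9, 20, 8, 0, 30, tzinfo=timezone.utc)
finished = datetime(2023, 9, 20, 8, 2, 0, tzinfo=timezone.utc)

queue_time = wait_seconds(created, started)      # 30 seconds in the queue
end_to_end = total_seconds(created, finished)    # 120 seconds overall
```

Either value could then be fed into a histogram in the same way as the processing duration.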

Technically, task_succeeded_total is redundant: under the hood, a histogram metric exposes several additional series, including the sum of all observations and the number of observations. However, these are often named confusingly (e.g. task_duration_seconds_count). Also, task_succeeded_total has an additional retries label, so I think it’s worth tracking the success count separately.
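
The redundancy is easier to see with a toy model of what a histogram keeps track of. The sketch below is a simplified, standard-library stand-in for a Prometheus histogram, not the client library itself. The class name `SketchHistogram` and the bucket boundaries are assumptions for illustration.

```python
from bisect import bisect_left


class SketchHistogram:
    """Toy histogram that mimics the series a Prometheus histogram exposes."""

    def __init__(self, name, buckets):
        self.name = name
        self.bounds = sorted(buckets)
        # One counter per upper bound, plus a final one for +Inf.
        self.bucket_counts = [0] * (len(self.bounds) + 1)
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        # Index of the first bucket whose upper bound is >= value.
        self.bucket_counts[bisect_left(self.bounds, value)] += 1
        self.sum += value
        self.count += 1

    def samples(self):
        """Render cumulative buckets, sum and count as exposition-style lines."""
        lines = []
        cumulative = 0
        for bound, n in zip(self.bounds + [float("inf")], self.bucket_counts):
            cumulative += n
            le = "+Inf" if bound == float("inf") else str(bound)
            lines.append(f'{self.name}_bucket{{le="{le}"}} {cumulative}')
        lines.append(f"{self.name}_sum {self.sum}")
        lines.append(f"{self.name}_count {self.count}")
        return lines


histogram = SketchHistogram("task_duration_seconds", [0.25, 1.0, 5.0])
for duration in (0.1, 0.25, 3.0, 10.0):
    histogram.observe(duration)
lines = histogram.samples()
```

If durations are only observed for successful tasks, the `_count` series carries the same number as a success counter, just without the retries label.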

How to test this

Start Aleph on your computer.

# Stop the current ingest-file container
docker compose -f docker-compose.dev.yml stop ingest-file

# Start a new ingest-file container
docker compose -f docker-compose.dev.yml run --rm --env PROMETHEUS_ENABLED=true -p=9090:9090 ingest-file

# Install servicelayer from git
pip install git+https://github.com/alephdata/servicelayer@feature/prometheus

# Run the ingest-file worker
ingestors process

(The same steps apply analogously to the Aleph worker.)

The logs should include a message indicating that a metrics server is running on port 9090. You should now be able to access the metrics from your host machine:

curl http://localhost:9090/metrics

Open the Aleph UI, upload a few documents, and check the exposed metrics again.
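
To compare the counters before and after uploading documents without reading the raw output by eye, a small script along these lines may help. It uses only the standard library and a simplified parser for the Prometheus text format. The helper names `fetch_metrics`, `parse_metrics` and `total_by_stage` are hypothetical, and the URL assumes the port mapping from the steps above.

```python
import re
from urllib.request import urlopen

# Simplified patterns for "name{label="value",...} value" sample lines.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def fetch_metrics(url="http://localhost:9090/metrics"):
    """Download the raw metrics text from the worker's metrics endpoint."""
    with urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8")


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples, skipping comments."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def total_by_stage(samples, metric_name):
    """Sum one metric across all other labels, grouped by its stage label."""
    totals = {}
    for name, labels, value in samples:
        if name == metric_name:
            stage = labels.get("stage", "")
            totals[stage] = totals.get(stage, 0.0) + value
    return totals
```

Calling `total_by_stage(parse_metrics(fetch_metrics()), "task_succeeded_total")` once before and once after the upload should show the per-stage totals increasing.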

@tillprochaska tillprochaska marked this pull request as draft September 20, 2023 08:26
@tillprochaska tillprochaska marked this pull request as ready for review September 20, 2023 16:05
@tillprochaska tillprochaska linked an issue Sep 20, 2023 that may be closed by this pull request

@stchris stchris left a comment


Looks great! I'm eager to get this going!

@Rosencrantz

@tillprochaska What's preventing this from being merged?

@stchris stchris merged commit a294c28 into main Oct 13, 2023
1 check passed
@stchris stchris deleted the feature/prometheus branch October 13, 2023 10:55
Successfully merging this pull request may close these issues.

FEATURE: Metrics