Metrics #3216

Merged
merged 20 commits into from Jan 16, 2024

Conversation

tillprochaska
Contributor

@tillprochaska tillprochaska commented Jul 17, 2023

This PR exposes metrics in the Prometheus format. The PR also includes changes to the Aleph Helm chart to make it easy to collect these metrics, e.g. when using the prometheus-operator.

Related PRs:

Implemented metrics:

  • System information (Aleph version, FtM version)
  • API
    • HTTP request metrics (duration, count by endpoint, status)
    • Streamed entities (count)
  • Platform information (Python version)
  • Users
    • Users (total)
    • Blocked users (total)
    • Active users (total by period)
    • Authentication attempts (failed, successful)
    • Signups (count)
  • Features
    • Bookmarks (total)
    • Entity sets (total by entity set type)
    • Users with at least one bookmark (total)
    • Users with at least one entity set (total by entity set type)
    • Users with at least one investigation created (total)
    • Users with at least one file uploaded (total)
  • Collections
    • Number of collections (total by category)
    • Collections with at least one file uploaded (total by category)
    • Collections with at least one non-file entity (total by category)
  • Queues
    • Queued tasks (total by stage, status)
  • Workers
    • Tasks processed (count by success/failed, stage, retries)
    • Task duration (seconds by stage)
  • Ingest
    • Tasks processed (count by success/failed, ingestor)
    • Tasks duration (seconds by ingestor)
    • Bytes processed (bytes by ingestor)
    • PDF cache hits/misses
  • Xref
    • Number of xref'ed entities and mentions
    • Number of matches per entity/mention
    • Duration of ES candidates query (query processing time incl. & excl. network/serialization/… overhead)
    • Duration to process a single scroll batch of entities

Testing:
Follow these steps to test the changes locally:

  • Set PROMETHEUS_ENABLED=true in your aleph.env file and restart Aleph.

  • You can view metrics exposed by the API at http://localhost:9100 (in the container).

  • Run docker compose -f docker-compose.dev.yml run --rm api bash.

  • Run gunicorn --bind 0.0.0.0:9100 aleph.metrics.exporter:app to start the metrics exporter.

  • You can view the metrics exposed by the exporter at http://localhost:9100 within the container you just started.

Implementation considerations:

  • Our API service is served using Gunicorn with multiple worker processes. Prometheus requires using a special multiprocess mode for this scenario, which has a few limitations, e.g. custom collectors are not supported out of the box. We’re using custom collectors to expose metrics that are not collected during runtime of the API or workers, but instead are exposed based on data fetched from the SQL database or Redis (e.g. number of tasks in queues or number of users in the database).

    There is a way to make custom collectors work while using the multiprocess mode at the same time, but it adds some complexity to the setup (see Custom collectors and multiprocess mode prometheus/client_python#943), so I eventually decided to run a Prometheus exporter as a separate sidecar service to expose custom collector metrics. This also makes it very easy to scrape API runtime metrics at a different interval than custom collector metrics. A minimal sketch of this sidecar exporter pattern is included after this list.

  • API metrics are exposed on a separate port (and not at the same port as the API) to ensure metrics are not exposed publicly.

  • prometheus_flask_exporter is a third-party exporter to expose Prometheus metrics about a Flask web application. I’m not using it though, as it’s actually quite straightforward to collect basic HTTP request metrics, and I’ve found a direct integration easier to adjust to our needs and more transparent when debugging (vs. using an external extension like prometheus_flask_exporter).
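
As an illustration of what such a direct integration can look like, here is a minimal sketch (not the exact code in this PR; the metric name and labels are examples only) of a Flask app recording basic request metrics with a pair of hooks:

import time

from flask import Flask, g, request
from prometheus_client import Histogram

# Example metric; the actual metric names and labels in this PR may differ.
REQUEST_DURATION = Histogram(
    "aleph_api_request_duration_seconds",
    "Duration of HTTP requests by endpoint, method and status",
    ["api_endpoint", "method", "status"],
)

app = Flask(__name__)

@app.before_request
def start_timer():
    g.request_started_at = time.monotonic()

@app.after_request
def observe_request(response):
    duration = time.monotonic() - g.request_started_at
    REQUEST_DURATION.labels(
        api_endpoint=request.endpoint or "unknown",
        method=request.method,
        status=response.status_code,
    ).observe(duration)
    return response

For the sidecar exporter mentioned in the first consideration, the general prometheus_client pattern looks roughly like the sketch below (with a made-up UserCollector, not the actual aleph.metrics.exporter module): custom collectors are registered on a dedicated registry, which is served as a small WSGI app that Gunicorn can run.

from prometheus_client import CollectorRegistry, make_wsgi_app
from prometheus_client.core import GaugeMetricFamily
from prometheus_client.registry import Collector

class UserCollector(Collector):
    """Fetches fresh data from the database on every scrape."""

    def collect(self):
        # Hard-coded value for illustration; the real collector runs a SQL count.
        yield GaugeMetricFamily("aleph_users", "Total number of users", value=42)

registry = CollectorRegistry()
registry.register(UserCollector())

# Serve with e.g.: gunicorn --bind 0.0.0.0:9100 exporter:app
app = make_wsgi_app(registry)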

To do:

  • Delete contents of Gunicorn multiproc directory when (re-)starting the api service, create directory if it doesn’t exist.
  • Check whether multiproc directory is set on application startup
  • Adjust Gunicorn config to mark process as dead when a worker process exits (see the sketch after this list)
  • Opt-in to Prometheus metrics using configuration option
  • Tests?
  • Adjust Helm charts (and test the changes locally?)
  • Add some inline documentation for the metrics modules
  • Release a new servicelayer version
  • Upgrade servicelayer in ingest-file
  • Upgrade servicelayer in Aleph
  • Release ingest-file
  • Upgrade ingest-file in Aleph
  • Add xref metrics
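
For the Gunicorn item above: the multiprocess mode of prometheus_client documents a child_exit server hook for this, roughly as in the sketch below (the actual Gunicorn config in this repo may look different):

# gunicorn.conf.py (sketch)
from prometheus_client import multiprocess

def child_exit(server, worker):
    # Mark the exited worker's metric files as dead so that the multiprocess
    # collector no longer aggregates stale live gauge values from it.
    multiprocess.mark_process_dead(worker.pid)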

@tillprochaska tillprochaska linked an issue Jul 17, 2023 that may be closed by this pull request
@tillprochaska tillprochaska force-pushed the feature/3214-metrics branch 2 times, most recently from 4100216 to 56ee131 on July 19, 2023 15:40
@Rosencrantz
Contributor

Thanks Till. An amazing writeup as usual. Just looking through the list of information you've managed to start collecting is really impressive.

One thing that I'd like your input on, because I think it may be something that could be useful for us, is looking at inactivity. A few examples:

  • Inactive users (anyone that hasn't logged in or used Aleph for a period of time)
  • Inactive datasets (a collection or investigation that hasn't been opened or used for a period of time)
  • Empty datasets (anything with 0 entities or documents)

Would be interested to know your thoughts here?

@tillprochaska
Contributor Author

Inactive users (anyone that hasn't logged in or used Aleph for a period of time)

Prometheus is a perfect fit for exposing time-based metrics like task queue length, number of processed tasks, average task processing time, number and response time for API requests by endpoint etc. However, there is another set of metrics we’d like to collect to answer questions like “Which datasets are accessed most?” or “How many daily/monthly active users do we have?”. These metrics cannot be implemented using just Prometheus in a straightforward way – or rather, they could be implemented using just Prometheus, but that’s a bad idea. Here’s why:

In Prometheus, you can add labels to your metrics. For example, you might have a counter metric called “user_session_created_total” that is incremented every time a user signs in. You can also specify an arbitrary number of labels, e.g. you might pass an auth_method label that could be oauth or password. You can then also use labels when querying your metrics, e.g. to find out how many sessions were created using the password auth method you’d do something like user_session_created_total{auth_method="password"}.

It might be tempting to do the same thing with user IDs, too. Just add a user_id label to the metric and now we’re tracking how often individual users signed in, right?

Unfortunately, it’s a bit more complex. Under the hood, Prometheus stores every combination of label values as a separate time series in its database. So if you have an auth_method label that can have two values (password and oauth), that results in two time series stored in the database. But the user_id label could have as many values as you have users in your Aleph instance -- which could easily get slow and expensive.

The good thing is that we actually do not need to know when and how often exactly a specific user signed in. All we want to know is how many users have been active in the past 7 or 30 days etc. The solution for this is actually quite simple:

  • Add a separate last_login_at timestamp to the table in Postgres that stores user information.
  • Whenever a user signs in, update that timestamp. (We have a rather short session length by default, so the time of last sign-in should be a good enough proxy for recent activity.)
  • Expose a metric active_users with a period label (with a limited set of values, e.g. 7d, 30d, 1y). The values for this metric can be easily calculated by running a SQL query.

I actually already prototyped this, but didn’t want to include it in the initial PR to keep it easy to review.
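
For illustration, such a metric could be implemented with the same custom collector pattern as the other gauges in this PR; the sketch below assumes a hypothetical last_login_at column on the Role model (the metric name and set of periods are also illustrative):

from datetime import datetime, timedelta

from prometheus_client.core import GaugeMetricFamily

from aleph.model import Role

# Hypothetical set of periods exposed as label values.
PERIODS = {
    "7d": timedelta(days=7),
    "30d": timedelta(days=30),
    "1y": timedelta(days=365),
}

def _active_users():
    gauge = GaugeMetricFamily(
        "aleph_active_users",
        "Number of users active within the given period",
        labels=["period"],
    )
    now = datetime.utcnow()
    for period, delta in PERIODS.items():
        # Assumes a (hypothetical) last_login_at timestamp updated on sign-in,
        # and that Role.all_users() returns a SQLAlchemy query.
        count = Role.all_users().filter(Role.last_login_at >= now - delta).count()
        gauge.add_metric([period], count)
    return gauge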

Inactive datasets (a collection or investigation that hasn't be opened or used for a period of time)

This is similar to the use case above, but I think the solution is a bit more complex for two reasons:

  • What qualifies as an active dataset? Does that mean someone had to open the dataset? Is it enough for at least one entity from that dataset to have been included in the results for a search? At least one xref match?

  • While updating the last_login_at timestamp for a user once per session is not a performance bottleneck, writing to the database every time a dataset is accessed would likely be a bad idea.

So this is likely going to be slightly more complex if we actually want to implement it.

Empty datasets (anything with 0 entities or documents)

Haven’t thought about this one in detail yet. I think we might be able to leverage the collection stats we already compute and cache (but they are sometimes incorrect or out of date, so that would be reflected in the metric).

@tillprochaska tillprochaska force-pushed the feature/3214-metrics branch 2 times, most recently from f023b74 to f1c7e92 on October 3, 2023 08:55
@tillprochaska tillprochaska force-pushed the feature/3214-metrics branch 2 times, most recently from dd6e557 to ef8f482 on October 16, 2023 22:42
@tillprochaska tillprochaska marked this pull request as ready for review October 16, 2023 22:44
@tillprochaska tillprochaska force-pushed the feature/3214-metrics branch 5 times, most recently from 1c2a818 to 0b3bd35 on November 10, 2023 14:52
@tillprochaska tillprochaska force-pushed the feature/3214-metrics branch 4 times, most recently from 2ee48ff to eb947d0 on November 19, 2023 14:17
@tillprochaska tillprochaska changed the title from "[WIP] Metrics" to "Metrics" on Nov 20, 2023
XREF_CANDIDATES_QUERY_ROUNDTRIP_DURATION = Histogram(
"aleph_xref_candidates_query_roundtrip_duration_seconds",
"Roundtrip duration of the candidates query (incl. network, serialization etc.)",
)
Contributor Author

One additional metric related to xref that would be interesting is how long it takes to process a batch of entities from a single ES scroll response. This could be used to alert us when we are getting close to the scroll timeout.

I haven’t implemented this so far because the individual scroll requests are abstracted away by the scan helper from the ES Python client, so it would require a non-trivial amount of work and would increase complexity.
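
For context, a roundtrip histogram like the one shown above is typically observed with the client's timer helper; illustrative usage only (es, entities_index and candidates_query are placeholder names, not the actual call site in this PR):

with XREF_CANDIDATES_QUERY_ROUNDTRIP_DURATION.time():
    response = es.search(index=entities_index, body=candidates_query)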

Comment on lines +37 to +43
def _users(self):
return GaugeMetricFamily(
"aleph_users",
"Total number of users",
value=Role.all_users().count(),
)
Contributor Author

This metric could be extended in the future. In particular, it might be interesting to expose the number of users that have been active within the past 24h, 7d, 30d etc. This requires some additional, non-trivial work because we’d need to track when users last signed in, so I decided not to implement it in this PR.

Comment on lines +44 to +61
def _collections(self):
gauge = GaugeMetricFamily(
"aleph_collections",
"Total number of collections by category",
labels=["category"],
)

query = (
Collection.all()
.with_entities(Collection.category, func.count())
.group_by(Collection.category)
)

for category, count in query:
gauge.add_metric([category], count)

return gauge
Contributor Author

This metric could be extended in the future to expose the number of collections by country.

table.drop(checkfirst=True)
table.drop(bind=db.engine, checkfirst=True)
Contributor Author

This isn’t strictly related to this PR, but I think it might have been overlooked during the SQLAlchemy 2 migration and it caused migrations to fail in my development environment.

Contributor

@stchris stchris left a comment

Looks good to me, I left some comments. Feels good to have instrumentation, thanks for your hard work, @tillprochaska !

@tillprochaska tillprochaska force-pushed the feature/3214-metrics branch 3 times, most recently from bd252e1 to 5495d71 on January 15, 2024 16:48
tillprochaska and others added 20 commits January 15, 2024 18:00
Prometheus Operator also uses the "endpoint" label and automatically renames "endpoint" labels exposed by the metrics endpoint to "exported_endpoints" which is ugly.
Even though it is considered an anti-pattern to add a prefix with the name of the software or component to metrics (according to the official Prometheus documentation), I have decided to add a prefix. I’ve found that this makes it much easier to find relevant metrics. The main disadvantage of per-component prefixes is that queries become slightly more complex if you want to query the same metric (e.g. HTTP request duration) across multiple components. This isn’t super important in our case though, so I think the trade-off is acceptable.
Although I'm not 100% sure, the exposed port 3000 probably is a left-over from the past, possibly when convert-document was still part of ingest-file. The network policy prevented Prometheus from scraping ingest-file metrics (and as the metrics port is now the only port exposed by ingest-file, it should otherwise be unnecessary).
There’s no need to do batched metric increments until this becomes a performance bottleneck.
I copied the boilerplate for custom collectors from the docs without thinking about it too much, but inheriting from `object` really isn’t necessary anymore in Python 3.

The Prometheus client also exports an abstract `Collector` class -- it doesn’t do anything except providing type hints for the `collect` method which is nice.
Contributor

@stchris stchris left a comment

Seems fine to me! Can't wait to see this in action. Thanks for all the work on this, @tillprochaska !

@tillprochaska tillprochaska merged commit e5eba0d into develop Jan 16, 2024
3 checks passed
simonwoerpel pushed a commit to investigativedata/aleph that referenced this pull request Apr 22, 2024
* Add Prometheus instrumentation

Closes alephdata#3214

* Fix missing bind argument

* Run Prometheus exporter as a separate service

* Expose number of streaming requests and number of streamed entities as metrics

* Expose number of auth attempts as Prometheus metrics

* Update Helm chart to expose metrics endpoints, setup ServiceMonitors

* Handle requests without Authz object gracefully

* Rename Prometheus label to "api_endpoint" to prevent naming clashes

Prometheus Operator also uses the "endpoint" label and automatically renames "endpoint" labels exposed by the metrics endpoint to "exported_endpoints" which is ugly.

* Add xref metrics

* Use common prefix for all metric names

Even though it is considered an anti-pattern to add a prefix with the name of the software or component to metrics (according to the official Prometheus documentation), I have decided to add a prefix. I’ve found that this makes it much easier to find relevant metrics. The main disadvantage of per-component prefixes is that queries become slightly more complex if you want to query the same metric (e.g. HTTP request duration) across multiple components. This isn’t super important in our case though, so I think the trade-off is acceptable.

* Expose Python platform information as Prometheus metrics

* Remove unused port, network policy from K8s specs

Although I'm not 100% sure, the exposed port 3000 probably is a left-over from the past, possibly when convert-document was still part of ingest-file. The network policy prevented Prometheus from scraping ingest-file metrics (and as the metrics port is now the only port exposed by ingest-file, it should otherwise be unnecessary).

* Use keyword args to set Prometheus metric labels

As suggested by @stchris

* Bump servicelayer from 1.22.0 to 1.22.1

* Simplify entity streaming metrics code

There’s no need to do batched metric increments until this becomes a performance bottleneck.

* Limit maximum size of Prometheus multiprocessing directory

* Do not let collector classes inherit from `object`

I copied the boilerplate for custom collectors from the docs without thinking about it too much, but inheriting from `object` really isn’t necessary anymore in Python 3.

The Prometheus client also exports an abstract `Collector` class -- it doesn’t do anything except providing type hints for the `collect` method which is nice.

* Add `aleph_` prefix to Prometheus API metrics

* Fix metrics name (singular -> plural)

* Add documentation on how to test Prometheus instrumentation in local Kubernetes cluster
Successfully merging this pull request may close these issues: FEATURE: Metrics