Metrics #3216
Conversation
Thanks Till. An amazing writeup as usual. Just looking through the list of information you've managed to start collecting is really impressive. One thing that I'd like your input on, because I think it may be something that could be useful for us, is looking at inactivity. A few examples:

Would be interested to know your thoughts here?
Prometheus is a perfect fit for exposing time-based metrics like task queue length, number of processed tasks, average task processing time, or the number and response time of API requests by endpoint. However, there is another set of metrics we’d like to collect to answer questions like “Which datasets are accessed most?” or “How many daily/monthly active users do we have?”. These metrics cannot be implemented using just Prometheus in a straightforward way – or rather, they could be implemented using just Prometheus, but that’s a bad idea. Here’s why:

In Prometheus, you can add labels to your metrics. For example, you might have a counter metric called `user_session_created_total` that is incremented every time a user signs in, and you can attach an arbitrary number of labels to it (say, the authentication method that was used).

It might be tempting to do the same thing with user IDs, too: just add a user ID label and you could break the metric down by user. Unfortunately, it’s a bit more complex. Under the hood, Prometheus stores every combination of label values as a separate time series in its database. So if you have an instance with many users, a user ID label results in one time series per user, and that kind of high label cardinality quickly becomes a performance problem.

The good thing is that we do not actually need to know when and how often exactly a specific user signed in. All we want to know is how many users have been active in the past 7 or 30 days etc. The solution for this is actually quite simple: track only when each user was last active and expose aggregate counts per time window.
I actually already prototyped this, but didn’t want to include it in the initial PR to keep it easy to review.
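For illustration only (this is not the actual prototype): a custom collector could compute such counts on every scrape from a per-user timestamp. The `Role.last_login_at` column, the window boundaries, and the import path are assumptions.

```python
from datetime import datetime, timedelta

from prometheus_client.core import GaugeMetricFamily
from prometheus_client.registry import Collector

from aleph.model import Role  # import path assumed


class ActiveUsersCollector(Collector):
    """Sketch: expose active-user counts without per-user labels."""

    # Bounded label set: one time series per window, not per user.
    WINDOWS = {
        "24h": timedelta(hours=24),
        "7d": timedelta(days=7),
        "30d": timedelta(days=30),
    }

    def collect(self):
        gauge = GaugeMetricFamily(
            "aleph_active_users",
            "Number of users active within a given time window",
            labels=["window"],
        )
        now = datetime.utcnow()
        for label, delta in self.WINDOWS.items():
            count = (
                Role.all_users()
                # `last_login_at` is hypothetical; it would need to be
                # updated whenever a user signs in.
                .filter(Role.last_login_at >= now - delta)
                .count()
            )
            gauge.add_metric([label], count)
        yield gauge
```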
This is similar to the use case above, but I think the solution is a bit more complex for two reasons:
So this is likely going to be slightly more complex if we actually want to implement it.
Haven’t thought about this one in detail yet. I think we might be able to leverage the collection stats we already compute and cache (but they are sometimes incorrect or out of date, so that would be reflected in the metric).
XREF_CANDIDATES_QUERY_ROUNDTRIP_DURATION = Histogram(
    "aleph_xref_candidates_query_roundtrip_duration_seconds",
    "Roundtrip duration of the candidates query (incl. network, serialization etc.)",
)
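For context, a histogram like this is typically observed by wrapping the query call, e.g. via the client’s `time()` context manager. A minimal usage sketch (`query_candidates` is a placeholder, not the actual helper):

```python
def xref_candidates(entity):
    # Measures the full round trip, including network and serialization.
    with XREF_CANDIDATES_QUERY_ROUNDTRIP_DURATION.time():
        return query_candidates(entity)
```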
One additional metric related to xref that would be interesting is how long it takes to process a batch of entities from a single ES scroll response. This could be used to alert us when we are getting close to the scroll timeout.

I haven’t implemented this so far because the individual scroll requests are abstracted away by the `scan` helper from the ES Python client, so it would require a non-trivial amount of work and would increase complexity.
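Purely as an illustration of why this would mean bypassing the `scan` helper: per-batch timing could look roughly like the sketch below. The metric name, function name, and parameters are made up.

```python
import time

from prometheus_client import Histogram

# Hypothetical metric, not part of this PR.
XREF_SCROLL_BATCH_PROCESSING_DURATION = Histogram(
    "aleph_xref_scroll_batch_processing_duration_seconds",
    "Time spent processing a single batch of entities from an ES scroll response",
)


def iter_scroll_batches(es, index, query, scroll="5m", size=1000):
    """Yield batches of hits using raw scroll requests instead of the
    `scan` helper, observing how long the caller spends on each batch."""
    res = es.search(index=index, body=query, scroll=scroll, size=size)
    while res["hits"]["hits"]:
        start = time.monotonic()
        yield res["hits"]["hits"]
        # Execution resumes here once the caller asks for the next batch,
        # i.e. after it has finished processing the previous one.
        XREF_SCROLL_BATCH_PROCESSING_DURATION.observe(time.monotonic() - start)
        res = es.scroll(scroll_id=res["_scroll_id"], scroll=scroll)
```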
def _users(self):
    return GaugeMetricFamily(
        "aleph_users",
        "Total number of users",
        value=Role.all_users().count(),
    )
This metric could be extended in the future. In particular, it might be interesting to expose the number of users that have been active within the past 24h, 7d, 30d etc. This requires some additional, non-trivial work because we’d need to track when users last signed in, so I decided not to implement it in this PR.
def _collections(self):
    gauge = GaugeMetricFamily(
        "aleph_collections",
        "Total number of collections by category",
        labels=["category"],
    )

    query = (
        Collection.all()
        .with_entities(Collection.category, func.count())
        .group_by(Collection.category)
    )

    for category, count in query:
        gauge.add_metric([category], count)

    return gauge
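For context on how methods like `_users` and `_collections` are hooked up: a custom collector yields the metric families from its `collect()` method, which the client library calls whenever metrics are scraped. A simplified sketch (the class name is illustrative):

```python
from prometheus_client.core import GaugeMetricFamily
from prometheus_client.registry import Collector


class DatabasesCollector(Collector):
    """Sketch: compute metrics from the SQL database on every scrape
    instead of tracking them at runtime."""

    def collect(self):
        yield self._users()
        yield self._collections()

    # _users() and _collections() as shown in the diff above ...
```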
This metric could be extended in the future to expose the number of collections by country.
- table.drop(checkfirst=True)
+ table.drop(bind=db.engine, checkfirst=True)
This isn’t strictly related to this PR, but I think it might have been overlooked during the SQLAlchemy 2 migration and it caused migrations to fail in my development environment.
Looks good to me, I left some comments. Feels good to have instrumentation, thanks for your hard work, @tillprochaska !
Prometheus Operator also uses the "endpoint" label and automatically renames "endpoint" labels exposed by the metrics endpoint to "exported_endpoints", which is ugly.
Even though it is considered an anti-pattern to add a prefix with the name of the software or component to metrics (according to the official Prometheus documentation), I have decided to add a prefix. I’ve found that this makes it much easier to find relevant metrics. The main disadvantage of per-component prefixes is that queries become slightly more complex if you want to query the same metric (e.g. HTTP request duration) across multiple components. This isn’t super important in our case though, so I think the trade-off is acceptable.
Although I'm not 100% sure, the exposed port 3000 is probably a left-over from the past, possibly from when convert-document was still part of ingest-file. The network policy prevented Prometheus from scraping ingest-file metrics (and as the metrics port is now the only port exposed by ingest-file, it should otherwise be unnecessary).
As suggested by @stchris
There’s no need to do batched metric increments until this becomes a performance bottleneck.
I copied the boilerplate for custom collectors from the docs without thinking about it too much, but inheriting from `object` really isn’t necessary anymore in Python 3. The Prometheus client also exports an abstract `Collector` class -- it doesn’t do anything except provide type hints for the `collect` method, which is nice.
Seems fine to me! Can't wait to see this in action. Thanks for all the work on this, @tillprochaska !
* Add Prometheus instrumentation

  Closes alephdata#3214

* Fix missing bind argument
* Run Prometheus exporter as a separate service
* Expose number of streaming requests and number of streamed entities as metrics
* Expose number of auth attempts as Prometheus metrics
* Update Helm chart to expose metrics endpoints, setup ServiceMonitors
* Handle requests without Authz object gracefully
* Rename Prometheus label to "api_endpoint" to prevent naming clashes

  Prometheus Operator also uses the "endpoint" label and automatically renames "endpoint" labels exposed by the metrics endpoint to "exported_endpoints", which is ugly.

* Add xref metrics
* Use common prefix for all metric names

  Even though it is considered an anti-pattern to add a prefix with the name of the software or component to metrics (according to the official Prometheus documentation), I have decided to add a prefix. I’ve found that this makes it much easier to find relevant metrics. The main disadvantage of per-component prefixes is that queries become slightly more complex if you want to query the same metric (e.g. HTTP request duration) across multiple components. This isn’t super important in our case though, so I think the trade-off is acceptable.

* Expose Python platform information as Prometheus metrics
* Remove unused port, network policy from K8s specs

  Although I'm not 100% sure, the exposed port 3000 is probably a left-over from the past, possibly from when convert-document was still part of ingest-file. The network policy prevented Prometheus from scraping ingest-file metrics (and as the metrics port is now the only port exposed by ingest-file, it should otherwise be unnecessary).

* Use keyword args to set Prometheus metric labels

  As suggested by @stchris

* Bump servicelayer from 1.22.0 to 1.22.1
* Simplify entity streaming metrics code

  There’s no need to do batched metric increments until this becomes a performance bottleneck.

* Limit maximum size of Prometheus multiprocessing directory
* Do not let collector classes inherit from `object`

  I copied the boilerplate for custom collectors from the docs without thinking about it too much, but inheriting from `object` really isn’t necessary anymore in Python 3. The Prometheus client also exports an abstract `Collector` class -- it doesn’t do anything except provide type hints for the `collect` method, which is nice.

* Add `aleph_` prefix to Prometheus API metrics
* Fix metrics name (singular -> plural)
* Add documentation on how to test Prometheus instrumentation in local Kubernetes cluster
This PR exposes metrics in the Prometheus format. The PR also includes changes to the Aleph Helm chart to make it easy to collect these metrics, e.g. when using the prometheus-operator.
Related PRs:
Implemented metrics:
Testing:
Follow these steps to test the changes locally:
1. Set `PROMETHEUS_ENABLED=true` in your `aleph.env` file and restart Aleph.
2. You can view metrics exposed by the API at http://localhost:9100 (in the container).
3. Run `docker compose -f docker-compose.dev.yml run --rm api bash`.
4. Run `gunicorn --bind 0.0.0.0:9100 aleph.metrics.exporter:app` to start the metrics exporter.
5. You can view the metrics exposed by the exporter at http://localhost:9100 within the container you just started.

Implementation considerations:
Our API service is served using Gunicorn with multiple worker processes. Prometheus requires using a special multiprocess mode for this scenario, which has a few limitations, e.g. custom collectors are not supported out of the box. We’re using custom collectors to expose metrics that are not collected during runtime of the API or workers, but instead are based on data fetched from the SQL database or Redis (e.g. the number of tasks in queues or the number of users in the database).
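For reference, the standard prometheus_client multiprocess setup looks roughly like this (a sketch, not necessarily the exact wiring used in this PR):

```python
# Sketch: aggregate metrics across Gunicorn worker processes.
# Requires the PROMETHEUS_MULTIPROC_DIR environment variable to point to a
# writable directory shared by all worker processes.
from prometheus_client import CollectorRegistry, make_wsgi_app, multiprocess

registry = CollectorRegistry()
multiprocess.MultiProcessCollector(registry)

# WSGI app serving the aggregated metrics, e.g. bound to a separate port.
metrics_app = make_wsgi_app(registry)
```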
There is a way to make custom collectors work while using the multiprocess mode at the same time, but it adds some complexity to the setup (see “Custom collectors and multiprocess mode”, prometheus/client_python#943), so I eventually decided to run a Prometheus exporter as a separate sidecar service to expose custom collector metrics. This also makes it very easy to scrape API runtime metrics at a different interval than custom collector metrics.
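Such a sidecar exporter can use its own registry with only the custom collectors registered. A sketch, matching the `gunicorn --bind 0.0.0.0:9100 aleph.metrics.exporter:app` invocation from the testing steps above (the module contents and the `DatabasesCollector` class are assumptions):

```python
# Sketch of a standalone exporter app that only serves metrics computed from
# the SQL database or Redis via custom collectors.
from prometheus_client import CollectorRegistry, make_wsgi_app

registry = CollectorRegistry()
# DatabasesCollector is the illustrative collector sketched further up.
registry.register(DatabasesCollector())

# WSGI entry point, e.g. `gunicorn aleph.metrics.exporter:app`.
app = make_wsgi_app(registry)
```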
API metrics are exposed on a separate port (and not on the same port as the API) to ensure metrics are not exposed publicly.
`prometheus_flask_exporter` is a third-party exporter to expose Prometheus metrics about a Flask web application. I’m not using it though, as it’s actually quite straightforward to collect basic HTTP request metrics, and I’ve found a direct integration easier to adjust to our needs and more transparent when debugging (vs. using an external extension like `prometheus_flask_exporter`).

To do:
- `api` service, create directory if it doesn’t exist.