Metrics #3216

Merged
merged 20 commits into from Jan 16, 2024

Conversation

tillprochaska
Contributor

@tillprochaska tillprochaska commented Jul 17, 2023

This PR exposes metrics in the Prometheus format. The PR also includes changes to the Aleph Helm chart to make it easy to collect these metrics, e.g. when using the prometheus-operator.

Related PRs:

Implemented metrics:

  • System information (Aleph version, FtM version)
  • API
    • HTTP request metrics (duration, count by endpoint, status)
    • Streamed entities (count)
  • Platform information (Python version)
  • Users
    • Users (total)
    • Blocked users (total)
    • Active users (total by period)
    • Authentication attempts (failed, successful)
    • Signups (count)
  • Features
    • Bookmarks (total)
    • Entity sets (total by entity set type)
    • Users with at least one bookmark (total)
    • Users with at least one entity set (total by entity set type)
    • Users with at least one investigation created (total)
    • Users with at least one file uploaded (total)
  • Collections
    • Number of collections (total by category)
    • Collections with at least one file uploaded (total by category)
    • Collections with at least one non-file entity (total by category)
  • Queues
    • Queued tasks (total by stage, status)
  • Workers
    • Tasks processed (count by success/failed, stage, retries)
    • Task duration (seconds by stage)
  • Ingest
    • Tasks processed (count by success/failed, ingestor)
    • Tasks duration (seconds by ingestor)
    • Bytes processed (bytes by ingestor)
    • PDF cache hits/misses
  • Xref
    • Number of xref'ed entities and mentions
    • Number of matches per entity/mention
    • Duration of ES candidates query (query processing time incl. & excl. network/serialization/… overhead)
    • Duration to process a single scroll batch of entities

Testing:
Follow these steps to test the changes locally:

  • Set PROMETHEUS_ENABLED=true in your aleph.env file and restart Aleph.

  • You can view metrics exposed by the API at http://localhost:9100 (in the container).

  • Run docker compose -f docker-compose.dev.yml run --rm api bash.

  • Run gunicorn --bind 0.0.0.0:9100 aleph.metrics.exporter:app to start the metrics exporter.

  • You can view the metrics exposed by the exporter at http://localhost:9100 within the container you just started.

Implementation considerations:

  • Our API service is served using Gunicorn with multiple worker processes. Prometheus requires using a special multiprocess mode for this scenario, which has a few limitations, e.g. custom collectors are not supported out of the box. We’re using custom collectors to expose metrics that are not collected during runtime of the API or workers, but instead are exposed based on data fetched from the SQL database or Redis (e.g. number of tasks in queues or number of users in the database).

    There is a way to make custom collectors work while using the multiprocess mode at the same time, but it adds some complexity to the setup (see Custom collectors and multiprocess mode prometheus/client_python#943), so I eventually decided to run a Prometheus exporter as a separate sidecar service to expose custom collector metrics. This also makes it very easy to scrape API runtime metrics at a different interval than custom collector metrics. A minimal sketch of this sidecar exporter pattern is included after this list.

  • API metrics are exposed on a separate port (and not at the same port as the API) to ensure metrics are not exposed publicly.

  • prometheus_flask_exporter is a third-party exporter to expose Prometheus metrics about a Flask web application. I’m not using it though, as it’s actually quite straightforward to collect basic HTTP request metrics, and I’ve found a direct integration easier to adjust to our needs and more transparent when debugging (vs. using an external extension like prometheus_flask_exporter).
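
As an illustration of what such a direct integration can look like, here is a minimal sketch (not the exact code in this PR; the metric name and labels are examples only) of a Flask app recording basic request metrics with a pair of hooks:

import time

from flask import Flask, g, request
from prometheus_client import Histogram

# Example metric; the actual metric names and labels in this PR may differ.
REQUEST_DURATION = Histogram(
    "aleph_api_request_duration_seconds",
    "Duration of HTTP requests by endpoint, method and status",
    ["api_endpoint", "method", "status"],
)

app = Flask(__name__)

@app.before_request
def start_timer():
    g.request_started_at = time.monotonic()

@app.after_request
def observe_request(response):
    duration = time.monotonic() - g.request_started_at
    REQUEST_DURATION.labels(
        api_endpoint=request.endpoint or "unknown",
        method=request.method,
        status=response.status_code,
    ).observe(duration)
    return response

For the sidecar exporter mentioned in the first consideration, the general prometheus_client pattern looks roughly like the sketch below (with a made-up UserCollector, not the actual aleph.metrics.exporter module): custom collectors are registered on a dedicated registry, which is served as a small WSGI app that Gunicorn can run.

from prometheus_client import CollectorRegistry, make_wsgi_app
from prometheus_client.core import GaugeMetricFamily
from prometheus_client.registry import Collector

class UserCollector(Collector):
    """Fetches fresh data from the database on every scrape."""

    def collect(self):
        # Hard-coded value for illustration; the real collector runs a SQL count.
        yield GaugeMetricFamily("aleph_users", "Total number of users", value=42)

registry = CollectorRegistry()
registry.register(UserCollector())

# Serve with e.g.: gunicorn --bind 0.0.0.0:9100 exporter:app
app = make_wsgi_app(registry)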

To do:

  • Delete contents of Gunicorn multiproc directory when (re-)starting the api service, create directory if it doesn’t exist.
  • Check whether multiproc directory is set on application startup
  • Adjust Gunicorn config to mark process as dead when a worker process exits (see the sketch after this list)
  • Opt-in to Prometheus metrics using configuration option
  • Tests?
  • Adjust Helm charts (and test the changes locally?)
  • Add some inline documentation for the metrics modules
  • Release a new servicelayer version
  • Upgrade servicelayer in ingest-file
  • Upgrade servicelayer in Aleph
  • Release ingest-file
  • Upgrade ingest-file in Aleph
  • Add xref metrics
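
For the Gunicorn item above: the multiprocess mode of prometheus_client documents a child_exit server hook for this, roughly as in the sketch below (the actual Gunicorn config in this repo may look different):

# gunicorn.conf.py (sketch)
from prometheus_client import multiprocess

def child_exit(server, worker):
    # Mark the exited worker's metric files as dead so that the multiprocess
    # collector no longer aggregates stale live gauge values from it.
    multiprocess.mark_process_dead(worker.pid)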

@tillprochaska tillprochaska linked an issue Jul 17, 2023 that may be closed by this pull request
@tillprochaska tillprochaska force-pushed the feature/3214-metrics branch 2 times, most recently from 4100216 to 56ee131 on July 19, 2023 15:40
@Rosencrantz
Contributor

Thanks Till. An amazing writeup as usual. Just looking through the list of information you've managed to start collecting is really impressive.

One thing that I'd like your input on, because I think it may be something that could be useful for us, is looking at inactivity. A few examples:

  • Inactive users (anyone that hasn't logged in or used Aleph for a period of time)
  • Inactive datasets (a collection or investigation that hasn't been opened or used for a period of time)
  • Empty datasets (anything with 0 entities or documents)

Would be interested to know your thoughts here?

@tillprochaska
Contributor Author

Inactive users (anyone that hasn't logged in or used Aleph for a period of time)

Prometheus is a perfect fit for exposing time-based metrics like task queue length, number of processed tasks, average task processing time, number and response time for API requests by endpoint etc. However, there is another set of metrics we’d like to collect to answer questions like “Which datasets are accessed most?” or “How many daily/monthly active users do we have?”. These metrics cannot be implemented using just Prometheus in a straightforward way – or rather, they could be implemented using just Prometheus, but that’s a bad idea. Here’s why:

In Prometheus, you can add labels to your metrics. For example, you might have a counter metric called “user_session_created_total” that is incremented every time a user signs in. You can also specify an arbitrary number of labels, e.g. you might pass an auth_method label that could be oauth or password. You can then also use labels when querying your metrics, e.g. to find out how many sessions were created using the password auth method you’d do something like user_session_created_total{auth_method="password"}.

It might be tempting to do the same thing with user IDs, too. Just add a user_id label to the metric and now we’re tracking how often individual users signed in, right?

Unfortunately, it’s a bit more complex. Under the hood, Prometheus stores every combination of label values as a separate time series in its database. So if you have an auth_method label that can have two values (password and oauth), that results in two time series stored in the database. But the user_id label could have as many values as you have users in your Aleph instance -- which could easily get slow and expensive.

The good thing is that we actually do not need to know when and how often exactly a specific user signed in. All we want to know is how many users have been active in the past 7 or 30 days etc. The solution for this is actually quite simple:

  • Add a separate last_login_at timestamp to the table in Postgres that stores user information.
  • Whenever a user signs in, update that timestamp. (We have a rather short session length by default, so the time of last sign-in should be a good enough proxy for recent activity.)
  • Expose a metric active_users with a period label (with a limited set of values, e.g. 7d, 30d, 1y). The values for this metric can be easily calculated by running a SQL query.

I actually already prototyped this, but didn’t want to include it in the initial PR to keep it easy to review.
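
For illustration, such a metric could be implemented with the same custom collector pattern as the other gauges in this PR; the sketch below assumes a hypothetical last_login_at column on the Role model (the metric name and set of periods are also illustrative):

from datetime import datetime, timedelta

from prometheus_client.core import GaugeMetricFamily

from aleph.model import Role

# Hypothetical set of periods exposed as label values.
PERIODS = {
    "7d": timedelta(days=7),
    "30d": timedelta(days=30),
    "1y": timedelta(days=365),
}

def _active_users():
    gauge = GaugeMetricFamily(
        "aleph_active_users",
        "Number of users active within the given period",
        labels=["period"],
    )
    now = datetime.utcnow()
    for period, delta in PERIODS.items():
        # Assumes a (hypothetical) last_login_at timestamp updated on sign-in,
        # and that Role.all_users() returns a SQLAlchemy query.
        count = Role.all_users().filter(Role.last_login_at >= now - delta).count()
        gauge.add_metric([period], count)
    return gauge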

Inactive datasets (a collection or investigation that hasn't be opened or used for a period of time)

This is similar to the use case above, but I think the solution is a bit more complex for two reasons:

  • What qualifies as an active dataset? Does that mean someone had to open the dataset? Is it enough for at least one entity from that dataset to have been included in the results for a search? At least one xref match?

  • While updating the last_login_at timestamp for a user once per session is not a performance bottleneck, writing to the database every time a dataset is accessed would likely be a bad idea.

So this is likely going to be slightly more complex if we actually want to implement it.

Empty datasets (anything with 0 entities or documents)

Haven’t thought about this one in detail yet. I think we might be able to leverage the collection stats we already compute and cache (but they are sometimes incorrect or out of date, so that would be reflected in the metric).

@tillprochaska tillprochaska force-pushed the feature/3214-metrics branch 2 times, most recently from f023b74 to f1c7e92 on October 3, 2023 08:55
@tillprochaska tillprochaska force-pushed the feature/3214-metrics branch 2 times, most recently from dd6e557 to ef8f482 on October 16, 2023 22:42
@tillprochaska tillprochaska marked this pull request as ready for review October 16, 2023 22:44
@tillprochaska tillprochaska force-pushed the feature/3214-metrics branch 5 times, most recently from 1c2a818 to 0b3bd35 on November 10, 2023 14:52
@tillprochaska tillprochaska force-pushed the feature/3214-metrics branch 4 times, most recently from 2ee48ff to eb947d0 on November 19, 2023 14:17
@tillprochaska tillprochaska changed the title from "[WIP] Metrics" to "Metrics" on Nov 20, 2023
XREF_CANDIDATES_QUERY_ROUNDTRIP_DURATION = Histogram(
"aleph_xref_candidates_query_roundtrip_duration_seconds",
"Roundtrip duration of the candidates query (incl. network, serialization etc.)",
)
Contributor Author

One additional metric related to xref that would be interesting is how long it takes to process a batch of entities from a single ES scroll response. This could be used to alert us when we are getting close to the scroll timeout.

I haven’t implemented this so far because the individual scroll requests are abstracted away by the scan helper from the ES Python client, so it would require a non-trivial amount of work and would increase complexity.
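
For context, a roundtrip histogram like the one shown above is typically observed with the client's timer helper; illustrative usage only (es, entities_index and candidates_query are placeholder names, not the actual call site in this PR):

with XREF_CANDIDATES_QUERY_ROUNDTRIP_DURATION.time():
    response = es.search(index=entities_index, body=candidates_query)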

Comment on lines +37 to +43
def _users(self):
return GaugeMetricFamily(
"aleph_users",
"Total number of users",
value=Role.all_users().count(),
)
Contributor Author

This metric could be extended in the future. In particular, it might be interesting to expose the number of users that have been active within the past 24h, 7d, 30d etc. This requires some additional, non-trivial work because we’d need to track when users last signed in, so I decided not to implement it in this PR.

Comment on lines +44 to +61
def _collections(self):
gauge = GaugeMetricFamily(
"aleph_collections",
"Total number of collections by category",
labels=["category"],
)

query = (
Collection.all()
.with_entities(Collection.category, func.count())
.group_by(Collection.category)
)

for category, count in query:
gauge.add_metric([category], count)

return gauge
Contributor Author

This metric could be extended in the future to expose the number of collections by country.

table.drop(checkfirst=True)
table.drop(bind=db.engine, checkfirst=True)
Contributor Author

This isn’t strictly related to this PR, but I think it might have been overlooked during the SQLAlchemy 2 migration and it caused migrations to fail in my development environment.

Contributor

@stchris stchris left a comment

Looks good to me, I left some comments. Feels good to have instrumentation, thanks for your hard work, @tillprochaska !

@tillprochaska tillprochaska force-pushed the feature/3214-metrics branch 3 times, most recently from bd252e1 to 5495d71 on January 15, 2024 16:48
tillprochaska and others added 20 commits January 15, 2024 18:00
Prometheus Operator also uses the "endpoint" label and automatically renames "endpoint" labels exposed by the metrics endpoint to "exported_endpoints" which is ugly.
Even though it is considered an anti-pattern to add a prefix with the name of the software or component to metrics (according to the official Prometheus documentation), I have decided to add a prefix. I’ve found that this makes it much easier to find relevant metrics. The main disadvantage of per-component prefixes is that queries become slightly more complex if you want to query the same metric (e.g. HTTP request duration) across multiple components. This isn’t super important in our case though, so I think the trade-off is acceptable.
Although I'm not 100% sure, the exposed port 3000 probably is a left-over from the past, possibly when convert-document was still part of ingest-file. The network policy prevented Prometheus from scraping ingest-file metrics (and as the metrics port is now the only port exposed by ingest-file, it should otherwise be unnecessary).
There’s no need to do batched metric increments until this becomes a performance bottleneck.
I copied the boilerplate for custom collectors from the docs without thinking about it too much, but inheriting from `object` really isn’t necessary anymore in Python 3.

The Prometheus client also exports an abstract `Collector` class -- it doesn’t do anything except providing type hints for the `collect` method which is nice.
Contributor

@stchris stchris left a comment

Seems fine to me! Can't wait to see this in action. Thanks for all the work on this, @tillprochaska !

@tillprochaska tillprochaska merged commit e5eba0d into develop Jan 16, 2024
3 checks passed
simonwoerpel pushed a commit to investigativedata/aleph that referenced this pull request Apr 22, 2024
* Add Prometheus instrumentation

Closes alephdata#3214

* Fix missing bind argument

* Run Prometheus exporter as a separate service

* Expose number of streaming requests and number of streamed entities as metrics

* Expose number of auth attempts as Prometheus metrics

* Update Helm chart to expose metrics endpoints, setup ServiceMonitors

* Handle requests without Authz object gracefully

* Rename Prometheus label to "api_endpoint" to prevent naming clashes

Prometheus Operator also uses the "endpoint" label and automatically renames "endpoint" labels exposed by the metrics endpoint to "exported_endpoints" which is ugly.

* Add xref metrics

* Use common prefix for all metric names

Even though it is considered an anti-pattern to add a prefix with the name of the software or component to metrics (according to the official Prometheus documentation), I have decided to add a prefix. I’ve found that this makes it much easier to find relevant metrics. The main disadvantage of per-component prefixes is that queries become slightly more complex if you want to query the same metric (e.g. HTTP request duration) across multiple components. This isn’t super important in our case though, so I think the trade-off is acceptable.

* Expose Python platform information as Prometheus metrics

* Remove unused port, network policy from K8s specs

Although I'm not 100% sure, the exposed port 3000 probably is a left-over from the past, possibly when convert-document was still part of ingest-file. The network policy prevented Prometheus from scraping ingest-file metrics (and as the metrics port is now the only port exposed by ingest-file, it should otherwise be unnecessary).

* Use keyword args to set Prometheus metric labels

As suggested by @stchris

* Bump servicelayer from 1.22.0 to 1.22.1

* Simplify entity streaming metrics code

There’s no need to do batched metric increments until this becomes a performance bottleneck.

* Limit maximum size of Prometheus multiprocessing directory

* Do not let collector classes inherit from `object`

I copied the boilerplate for custom collectors from the docs without thinking about it too much, but inheriting from `object` really isn’t necessary anymore in Python 3.

The Prometheus client also exports an abstract `Collector` class -- it doesn’t do anything except providing type hints for the `collect` method which is nice.

* Add `aleph_` prefix to Prometheus API metrics

* Fix metrics name (singular -> plural)

* Add documentation on how to test Prometheus instrumentation in local Kubernetes cluster
Successfully merging this pull request may close these issues: FEATURE: Metrics