Merge pull request #699 from psardana/prometheus_integration
* added initial config for prometheus integration in opal server

* feat(data_update_publisher.py): add data_update_latency metric to track latency of data update events
feat(prometheus_metrics.py): create data_update_latency histogram to monitor latency of data update events

* refactor(api.py, data_update_publisher.py): update import paths for metrics to use opal_server.metrics.prometheus_metrics for better organization
chore(requirements.txt): add prometheus_client to dependencies for metrics tracking functionality

* feat(data_update_publisher.py): add data_update_count_per_topic metric to track updates per topic
feat(prometheus_metrics.py): introduce data_update_count_per_topic counter for monitoring data updates by topic

* feat(metrics): add new metrics for policy updates and bundle requests to enhance observability
fix(api.py): increment policy bundle request count and measure latency for bundle generation
fix(callbacks.py): observe size of changed directories in policy update notifications
fix(task.py): track policy update count and latency when triggering policy watcher

* moved prometheus metrics to opal common

* scopes and security prometheus metrics added

* added client metrics endpoint and total active clients metric

* data topic subscribed by client

* added token type in prometheus metric

* added labels to the metrics for data and policy updates

* added labels in token requests generations and errors

* added more labels for prometheus metrics for scope

* added metrics for opal client

* added docker compose example with prometheus

* fixed metric labels

* added documentation

* added open telemetry traces and metrics

* added metrics and traces in documentation

* added scope id as an attribute

* renamed docker compose

* fixed how span is being used

* added documentation

* fixed descriptions

* removed top level code and protected metrics end point

* fixes for tracing spans

* fix metrics end point

* fixed docker compose and removed logging exporter from otel

* Fixed pre-commit

---------

Co-authored-by: Dan Yishai <danyi1212@users.noreply.github.com>
danyi1212 authored Dec 11, 2024
2 parents d027000 + 4b24dc0 commit 141a4a6
Showing 26 changed files with 938 additions and 72 deletions.
96 changes: 96 additions & 0 deletions docker/docker-compose-with-prometheus-and-otel.yml
@@ -0,0 +1,96 @@
services:
  broadcast_channel:
    image: postgres:alpine
    environment:
      - POSTGRES_DB=postgres
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=postgres
    volumes:
      - postgres_data:/var/lib/postgresql/data

  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.114.0
    volumes:
      - ./docker_files/otel-collector-config.yaml:/etc/otelcol/config.yaml
    command: ["--config", "/etc/otelcol/config.yaml"]
    ports:
      - "4317:4317"
      - "8888:8888"
    networks:
      - opal-network

  prometheus:
    image: prom/prometheus:v2.45.0
    volumes:
      - ./docker_files/prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    ports:
      - "9090:9090"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--web.enable-lifecycle'
    networks:
      - opal-network
    depends_on:
      - otel-collector

  grafana:
    image: grafana/grafana:9.5.3
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    depends_on:
      - prometheus
    networks:
      - opal-network

  opal_server:
    image: permitio/opal-server:latest
    environment:
      - OPAL_BROADCAST_URI=postgres://postgres:postgres@broadcast_channel:5432/postgres
      - UVICORN_NUM_WORKERS=4
      - OPAL_POLICY_REPO_URL=https://github.com/permitio/opal-example-policy-repo
      - OPAL_POLICY_REPO_POLLING_INTERVAL=30
      - OPAL_DATA_CONFIG_SOURCES={"config":{"entries":[{"url":"http://opal_server:7002/policy-data","topics":["policy_data"],"dst_path":"/static"}]}}
      - OPAL_LOG_FORMAT_INCLUDE_PID=true
      - OPAL_ENABLE_OPENTELEMETRY_TRACING=true
      - OPAL_ENABLE_OPENTELEMETRY_METRICS=true
      - OPAL_OPENTELEMETRY_OTLP_ENDPOINT="otel-collector:4317"
    ports:
      - "7002:7002"
    depends_on:
      - broadcast_channel
      - otel-collector
    networks:
      - opal-network

  opal_client:
    image: permitio/opal-client:latest
    environment:
      - OPAL_SERVER_URL=http://opal_server:7002
      - OPAL_LOG_FORMAT_INCLUDE_PID=true
      - OPAL_INLINE_OPA_LOG_FORMAT=http
    ports:
      - "7766:7000"
      - "8181:8181"
    depends_on:
      - opal_server
      - otel-collector
    command: sh -c "exec ./wait-for.sh opal_server:7002 --timeout=20 -- ./start.sh"
    networks:
      - opal-network

networks:
  opal-network:
    driver: bridge
volumes:
  postgres_data:
  prometheus_data:
  grafana_data:
25 changes: 25 additions & 0 deletions docker/docker_files/otel-collector-config.yaml
@@ -0,0 +1,25 @@
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  prometheus:
    endpoint: "0.0.0.0:8888"
  debug:
    verbosity: detailed

processors:
  batch:

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
14 changes: 14 additions & 0 deletions docker/docker_files/prometheus.yml
@@ -0,0 +1,14 @@
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'opal_server'
    static_configs:
      - targets: ['opal_server:7002']
    metrics_path: '/metrics'

  - job_name: 'opal_client'
    static_configs:
      - targets: ['opal_client:7000']
    metrics_path: '/metrics'
8 changes: 6 additions & 2 deletions documentation/docs/getting-started/configuration.mdx
@@ -25,7 +25,7 @@ Please use this table as a reference.
| LOG_FILE_COMPRESSION | | |
| LOG_FILE_SERIALIZE | Serialize log messages in file into json format (useful for log aggregation platforms) | |
| LOG_FILE_LEVEL |
| LOG_DIAGNOSE | Include diagnosis in log messages | |
| LOG_DIAGNOSE | Include diagnosis in log messages | |
| STATISTICS_ENABLED | Collect statistics about OPAL clients. | |
| STATISTICS_ADD_CLIENT_CHANNEL | The topic to update about the new OPAL clients connection. | |
| STATISTICS_REMOVE_CLIENT_CHANNEL | The topic to update about the OPAL clients disconnection. | |
@@ -40,7 +40,11 @@ Please use this table as a reference.
| AUTH_PUBLIC_KEY | | |
| AUTH_JWT_ALGORITHM | JWT algorithm. See possible values [here](https://pyjwt.readthedocs.io/en/stable/algorithms.html). | |
| AUTH_JWT_AUDIENCE | | |
| AUTH_JWT_ISSUER | | |
| AUTH_JWT_ISSUER | | |
| ENABLE_OPENTELEMETRY_TRACING | Set if OPAL should enable tracing with OpenTelemetry | |
| ENABLE_OPENTELEMETRY_METRICS | Set if OPAL should enable metrics with OpenTelemetry | |
| OPENTELEMETRY_OTLP_ENDPOINT | The OpenTelemetry OTLP endpoint to send traces to; set only if ENABLE_OPENTELEMETRY_TRACING is enabled | |


## OPAL Server Configuration Variables

116 changes: 116 additions & 0 deletions documentation/docs/tutorials/monitoring_opal.mdx
@@ -11,6 +11,7 @@ There are multiple ways you can monitor your OPAL deployment:
- **Health-checks** - OPAL exposes HTTP health check endpoints ([See below](#health-checks))
- [**Callbacks**](/tutorials/healthcheck_policy_and_update_callbacks#-data-update-callbacks) - Using the callback webhooks feature - having OPAL-clients report their updates
- **Statistics** - Using the built-in statistics feature in OPAL ([See below](#opal-statistics))
- **OpenTelemetry Metrics and Tracing** - OPAL can expose metrics and tracing information using OpenTelemetry for monitoring ([See below](#opentelemetry-metrics-and-tracing)).

## Health checks

@@ -52,3 +53,118 @@ Available through `/pubsub_client_info` api route on the server.
### Caveats:
- When `UVICORN_NUM_WORKERS > 1`, retrieved information would only include clients connected to the replying server process.
- This is an early access feature and is likely to change. Backward compatibility is not guaranteed.

## OpenTelemetry Metrics and Tracing

OPAL supports exporting metrics and tracing information using OpenTelemetry, which can be integrated with various monitoring and observability tools.

### Enabling OpenTelemetry Metrics and Tracing

To enable OpenTelemetry metrics and tracing, you need to set the following environment variables in both OPAL server and OPAL client:

```
OPAL_ENABLE_OPENTELEMETRY_TRACING=true
OPAL_ENABLE_OPENTELEMETRY_METRICS=true
OPAL_OPENTELEMETRY_OTLP_ENDPOINT=<your-otel-collector-endpoint>
```

- `OPAL_ENABLE_OPENTELEMETRY_TRACING`: Set to `true` to enable tracing.
- `OPAL_ENABLE_OPENTELEMETRY_METRICS`: Set to `true` to enable metrics.
- `OPAL_OPENTELEMETRY_OTLP_ENDPOINT`: Set to the endpoint of the OpenTelemetry Collector (for example, `otel-collector:4317`).
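
As a rough pre-flight check, the three variables can be validated before starting a deployment. The sketch below is not part of OPAL: the function name `check_otel_env` and the set of accepted truthy strings are assumptions made for illustration, and OPAL's own parsing may differ. It flags a missing endpoint when tracing is on, and an endpoint wrapped in literal quote characters, which can happen when a quoted value is written in a list-style Compose `environment` entry.

```python
import os


def check_otel_env(env=None):
    """Return (settings, problems) for the OpenTelemetry-related variables.

    `env` is any mapping of variable names to strings; defaults to os.environ.
    The truthy strings accepted here are an assumption, not OPAL's parser.
    """
    env = os.environ if env is None else env

    def flag(name):
        return env.get(name, "false").strip().lower() in ("1", "true", "yes")

    settings = {
        "tracing": flag("OPAL_ENABLE_OPENTELEMETRY_TRACING"),
        "metrics": flag("OPAL_ENABLE_OPENTELEMETRY_METRICS"),
        "endpoint": env.get("OPAL_OPENTELEMETRY_OTLP_ENDPOINT", "").strip(),
    }

    problems = []
    # Traces are sent to the collector, so tracing without an endpoint is suspicious.
    if settings["tracing"] and not settings["endpoint"]:
        problems.append("tracing is enabled but no OTLP endpoint is set")
    # Literal quotes in the value usually mean the quoting was not stripped.
    if settings["endpoint"][:1] in ('"', "'") or settings["endpoint"][-1:] in ('"', "'"):
        problems.append("endpoint contains literal quote characters")
    return settings, problems


if __name__ == "__main__":
    current, issues = check_otel_env()
    print(current)
    for issue in issues:
        print("warning:", issue)
```

Running the script inside the server or client container prints the effective settings and any warnings.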

### Exposing Metrics and Traces

- Both the server and client will expose a `/metrics` endpoint that returns metrics in Prometheus format.
- Traces are exported to the configured OpenTelemetry Collector endpoint using OTLP over gRPC.
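
To inspect what a `/metrics` endpoint returns, a small stdlib-only reader is enough. This is a sketch under assumptions: the helper names are hypothetical, the parser handles only simple sample lines of the Prometheus text format, and the optional bearer token is an assumption about how a protected endpoint might be accessed in a given deployment.

```python
import urllib.request


def parse_prometheus_text(text):
    """Parse Prometheus text exposition into {series: float}.

    Comment lines (# HELP / # TYPE) and blank lines are skipped.
    An optional trailing timestamp is ignored.
    """
    samples = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Series name includes the label set, which may contain spaces.
            end = line.rindex("}") + 1
            series, rest = line[:end], line[end:].split()
            value = rest[0]
        else:
            parts = line.split()
            series, value = parts[0], parts[1]
        samples[series] = float(value)
    return samples


def fetch_metrics(base_url, token=None, timeout=5.0):
    """GET <base_url>/metrics and return the parsed samples."""
    request = urllib.request.Request(base_url.rstrip("/") + "/metrics")
    if token:
        # Assumption: the deployment protects the endpoint with a bearer token.
        request.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_prometheus_text(response.read().decode("utf-8"))
```

With the Compose example from this commit, `fetch_metrics("http://localhost:7002")` would target the server and `fetch_metrics("http://localhost:7766")` the client, following the published port mappings.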

### Available Metrics and Traces

Below is a list of the available metrics and traces in OPAL, along with their types, available tags (attributes), and explanations.

#### OPAL Server Metrics and Traces

##### 1) `opal_server_data_update`
- **Type**: Trace
- **Description**: Represents a data update operation in the OPAL server. This trace spans the process of publishing data updates to clients.
- **Attributes**:
  - `topics_count`: Number of topics involved in the data update.
  - `entries_count`: Number of data update entries.
  - Additional attributes related to errors or execution time.

##### 2) `opal_server_policy_update`
- **Type**: Trace
- **Description**: Represents a policy update operation in the OPAL server. This trace spans the process of checking for policy changes and notifying clients.
- **Attributes**:
  - Information about the policy repository, such as commit hashes.
  - Errors encountered during the update process.

##### 3) `opal_server_policy_bundle_request`
- **Type**: Trace
- **Description**: Represents a request for a policy bundle from a client. This trace spans the process of generating and serving the policy bundle to the client.
- **Attributes**:
  - `bundle.type`: The type of bundle (full or diff).
  - `bundle.size`: The size of the bundle in number of files or bytes.
  - `scope_id`: The scope identifier if scopes are used.

##### 4) `opal_server_policy_bundle_size`
- **Type**: Metric (Histogram)
- **Unit**: Files
- **Description**: Records the size of the policy bundles served by the OPAL server. The size is measured in the number of files included in the bundle.
- **Attributes**:
  - `type`: The type of bundle (full or diff).

##### 5) `opal_server_active_clients`
- **Type**: Metric (UpDownCounter)
- **Description**: Tracks the number of active clients connected to the OPAL server.
- **Attributes**:
  - `client_id`: The unique identifier of the client.
  - `source`: The source host and port of the client (e.g., 192.168.1.10:34567).
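
Because `opal_server_active_clients` carries `client_id` and `source` attributes, each connected client shows up as its own series, and the overall client count is the sum across those series. The following stdlib-only model illustrates that behavior. It is not the OpenTelemetry API: the class `UpDownCounterModel` and the sample client values are hypothetical.

```python
from collections import defaultdict


class UpDownCounterModel:
    """Toy model of an up-down counter keyed by its attribute set."""

    def __init__(self):
        self._series = defaultdict(int)

    def add(self, amount, attributes):
        # Each distinct attribute set is tracked as a separate series.
        key = tuple(sorted(attributes.items()))
        self._series[key] += amount

    def series(self):
        return dict(self._series)

    def total(self):
        # The fleet-wide value is the sum over all attribute sets.
        return sum(self._series.values())


active_clients = UpDownCounterModel()
client_a = {"client_id": "client-a", "source": "192.168.1.10:34567"}
client_b = {"client_id": "client-b", "source": "192.168.1.11:40112"}

active_clients.add(1, client_a)   # client-a connects
active_clients.add(1, client_b)   # client-b connects
active_clients.add(-1, client_a)  # client-a disconnects

print(active_clients.total())
```

In a dashboard, the same idea applies: aggregate over the per-client series rather than reading a single series to get the number of connected clients.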

#### OPAL Client Metrics and Traces

##### 1) `opal_client_data_subscriptions`
- **Type**: Metric (UpDownCounter)
- **Description**: Tracks the number of data subscriptions per client.
- **Attributes**:
  - `client_id`: The unique identifier of the client.
  - `topic`: The topic to which the client is subscribed.

##### 2) `opal_client_data_update_trigger`
- **Type**: Trace
- **Description**: Represents the operation of triggering a data update via the API in the OPAL client.
- **Attributes**:
  - `source`: The source of the trigger (e.g., API).
  - Errors encountered during the trigger.

##### 3) `opal_client_data_update_apply`
- **Type**: Trace
- **Description**: Represents the application of a data update within the OPAL client. This trace spans the process of fetching and applying data updates from the server.
- **Attributes**:
  - Execution time.
  - Errors encountered during the update.

##### 4) `opal_client_policy_update_apply`
- **Type**: Trace
- **Description**: Represents the application of a policy update within the OPAL client. This trace spans the process of fetching and applying policy updates from the server.
- **Attributes**:
  - Execution time.
  - Errors encountered during the update.

##### 5) `opal_client_policy_store_status`
- **Type**: Metric (Observable Gauge)
- **Description**: Indicates the current status of the policy store's authentication type used by the OPAL client.
- **Attributes**:
  - `auth_type`: The authentication type configured for the policy store (e.g., TOKEN, OAUTH, NONE).
- **Value**: The metric has a value of 1 when the policy store is active with the specified authentication type.

### Example

To monitor OPAL using Prometheus and Grafana, a ready-to-use Docker Compose configuration is provided in the `docker` directory of the repository, in the file `docker-compose-with-prometheus-and-otel.yml`.

Run the following command from the repository root to start the stack, including Prometheus and Grafana:

```
docker compose -f docker/docker-compose-with-prometheus-and-otel.yml up
```

This setup will start Prometheus to scrape metrics from OPAL server and client, and Grafana to visualize the metrics.
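
Once the stack is running, one way to confirm that Prometheus is actually scraping both targets is to query the built-in `up` series through the Prometheus HTTP API. The sketch below assumes Prometheus is reachable at `http://localhost:9090` (the port published in the Compose file) and that the job names are `opal_server` and `opal_client` as in `prometheus.yml`; the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def scrape_health(payload):
    """Map each job label to True/False from an instant-query response for `up`.

    If a job has several targets, the last one seen wins; the example
    configuration has one static target per job.
    """
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    health = {}
    for item in payload["data"]["result"]:
        job = item["metric"].get("job", "<unknown>")
        _timestamp, value = item["value"]  # value is returned as a string
        health[job] = value == "1"
    return health


def query_up(prometheus_url="http://localhost:9090", timeout=5.0):
    """Run the instant query `up` against the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": "up"})
    url = prometheus_url.rstrip("/") + "/api/v1/query?" + query
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return scrape_health(json.load(response))


if __name__ == "__main__":
    for job, is_up in sorted(query_up().items()):
        print(f"{job}: {'up' if is_up else 'DOWN'}")
```

A job reported as down usually points at an unreachable target or a `/metrics` endpoint that rejects the scrape, which is worth checking before building Grafana dashboards on top of it.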