Commit 21051b5

Author: Miguel Varela Ramos

Observability docs (#1930)

1 parent 8149714 commit 21051b5

File tree

8 files changed: +350 -122 lines changed

docs/clusters/aws/logging.md

Lines changed: 0 additions & 41 deletions
This file was deleted.

docs/clusters/gcp/logging.md

Lines changed: 0 additions & 28 deletions
This file was deleted.

docs/summary.md

Lines changed: 3 additions & 2 deletions
```diff
@@ -48,6 +48,9 @@
 * [Python packages](workloads/dependencies/python-packages.md)
 * [System packages](workloads/dependencies/system-packages.md)
 * [Custom images](workloads/dependencies/images.md)
+* Observability
+  * [Logging](workloads/observability/logging.md)
+  * [Metrics](workloads/observability/metrics.md)

 ## Clusters

@@ -56,7 +59,6 @@
 * [Update](clusters/aws/update.md)
 * [Auth](clusters/aws/auth.md)
 * [Security](clusters/aws/security.md)
-* [Logging](clusters/aws/logging.md)
 * [Spot instances](clusters/aws/spot.md)
 * [Networking](clusters/aws/networking/index.md)
 * [Custom domain](clusters/aws/networking/custom-domain.md)
@@ -66,7 +68,6 @@
 * [Uninstall](clusters/aws/uninstall.md)
 * GCP
 * [Install](clusters/gcp/install.md)
-* [Logging](clusters/gcp/logging.md)
 * [Credentials](clusters/gcp/credentials.md)
 * [Setting up kubectl](clusters/gcp/kubectl.md)
 * [Uninstall](clusters/gcp/uninstall.md)
```

docs/workloads/batch/metrics.md

Lines changed: 32 additions & 0 deletions
# Metrics

## Custom user metrics

It is possible to export custom user metrics by adding the `metrics_client` argument to the predictor constructor. Below is an example of how to use the metrics client with the `PythonPredictor` type; the implementation is similar for the other predictor types.

```python
class PythonPredictor:
    def __init__(self, config, metrics_client):
        self.metrics = metrics_client

    def predict(self, payload):
        # --- my predict code here ---
        result = ...

        # increment a counter named "my_counter" with the tag model:v1
        self.metrics.increment(metric="my_counter", value=1, tags={"model": "v1"})

        # set the value of a gauge named "my_gauge" with the tag model:v1
        self.metrics.gauge(metric="my_gauge", value=42, tags={"model": "v1"})

        # record a value in a histogram named "my_histogram" with the tag model:v1
        self.metrics.histogram(metric="my_histogram", value=100, tags={"model": "v1"})
```

Refer to the [observability documentation](../observability/metrics.md#custom-user-metrics) for more information on custom metrics.

**Note**: to be fault tolerant, the metrics client pushes metrics over the UDP protocol, so no exception is thrown if a metrics push fails.
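To make the fault-tolerance point concrete, here is a minimal, hypothetical sketch of a StatsD-style UDP push. This is not Cortex's actual client implementation; the port and wire format are assumptions for illustration. Because UDP is connectionless, `sendto` hands the datagram to the OS and returns immediately even when no collector is listening, which is why a failed push never raises in the request path:

```python
import socket

def push_metric(metric: str, value: float, metric_type: str, tags: dict,
                host: str = "127.0.0.1", port: int = 9125) -> None:
    """Send one metric as a StatsD-style UDP datagram (hypothetical wire format)."""
    tag_str = ",".join(f"{k}:{v}" for k, v in tags.items())
    payload = f"{metric}:{value}|{metric_type}|#{tag_str}"
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        # UDP send is fire-and-forget: it succeeds locally even if no
        # collector is listening on the destination port
        sock.sendto(payload.encode("utf-8"), (host, port))

# pushing to a port where (most likely) nothing listens still does not raise
push_metric("my_counter", 1, "c", {"model": "v1"})
```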
docs/workloads/observability/logging.md

Lines changed: 112 additions & 0 deletions
# Logging

Cortex provides an out-of-the-box logging solution that requires no configuration. By default, logs are collected with FluentBit for every API kind and are exported to each cloud provider's logging solution. While developing, it is also possible to view the logs of a single API replica through the `cortex logs` command.

## Cortex logs command

The Cortex CLI provides a command to quickly check the logs of a single API replica while debugging.

To check the logs of an API, run one of the following commands:

```shell
# RealtimeAPI
cortex logs <api_name>

# BatchAPI or TaskAPI
cortex logs <api_name> <job_id>  # the job needs to be in a running state
```

**Important:** this method won't show the logs of all the API replicas, and is therefore not a complete logging solution.
## Logs on AWS

For AWS clusters, logs are pushed to [CloudWatch](https://console.aws.amazon.com/cloudwatch/home) using fluent-bit. A log group with the same name as your cluster is created to store your logs. API logs are tagged with labels to help with log aggregation and filtering.

Below are some sample CloudWatch Logs Insights queries:

**RealtimeAPI:**

```text
fields @timestamp, log
| filter labels.apiName="<INSERT API NAME>"
| filter labels.apiKind="RealtimeAPI"
| sort @timestamp asc
| limit 1000
```

**BatchAPI:**

```text
fields @timestamp, log
| filter labels.apiName="<INSERT API NAME>"
| filter labels.jobID="<INSERT JOB ID>"
| filter labels.apiKind="BatchAPI"
| sort @timestamp asc
| limit 1000
```

**TaskAPI:**

```text
fields @timestamp, log
| filter labels.apiName="<INSERT API NAME>"
| filter labels.jobID="<INSERT JOB ID>"
| filter labels.apiKind="TaskAPI"
| sort @timestamp asc
| limit 1000
```
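The same queries can also be run from the AWS CLI. The snippet below is a sketch (it assumes the GNU `date` utility and uses the cluster name placeholder as the log group name):

```shell
# start a Logs Insights query over the last hour of the cluster's log group
QUERY_ID=$(aws logs start-query \
  --log-group-name "<INSERT CLUSTER NAME>" \
  --start-time "$(date -d '1 hour ago' +%s)" \
  --end-time "$(date +%s)" \
  --query-string 'fields @timestamp, log | filter labels.apiKind="RealtimeAPI" | limit 20' \
  --query queryId --output text)

# fetch the results once the query has completed
aws logs get-query-results --query-id "$QUERY_ID"
```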
## Logs on GCP

For GCP clusters, logs are pushed to [StackDriver](https://console.cloud.google.com/logs/query) using fluent-bit. API logs are tagged with labels to help with log aggregation and filtering.

Below are some sample StackDriver queries:

**RealtimeAPI:**

```text
resource.type="k8s_container"
resource.labels.cluster_name="<INSERT CLUSTER NAME>"
labels.apiKind="RealtimeAPI"
labels.apiName="<INSERT API NAME>"
```

**BatchAPI:**

```text
resource.type="k8s_container"
resource.labels.cluster_name="<INSERT CLUSTER NAME>"
labels.apiKind="BatchAPI"
labels.apiName="<INSERT API NAME>"
labels.jobID="<INSERT JOB ID>"
```

**TaskAPI:**

```text
resource.type="k8s_container"
resource.labels.cluster_name="<INSERT CLUSTER NAME>"
labels.apiKind="TaskAPI"
labels.apiName="<INSERT API NAME>"
labels.jobID="<INSERT JOB ID>"
```

Make sure to navigate to the project containing your cluster and to adjust the time range accordingly before running queries.
## Structured logging

You can use Cortex's logger in your Python code to log in JSON. This enriches your logs with Cortex's metadata and enables you to add custom metadata to them.

See the structured logging docs for each API kind:

- [RealtimeAPI](../../workloads/realtime/predictors.md#structured-logging)
- [BatchAPI](../../workloads/batch/predictors.md#structured-logging)
- [TaskAPI](../../workloads/task/definitions.md#structured-logging)
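As a generic illustration of the idea (this is not Cortex's actual logger API, just a sketch built on Python's standard library), structured JSON logging emits one machine-parseable JSON object per log line, with custom metadata merged in:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line with extra metadata."""
    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname,
            "message": record.getMessage(),
            # custom metadata attached via the `extra` argument of the log call
            **getattr(record, "labels", {}),
        }
        return json.dumps(entry)

logger = logging.getLogger("api")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# prints one JSON line containing level, message, and the custom labels
logger.info("prediction served", extra={"labels": {"model": "v1", "request_id": "abc123"}})
```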
docs/workloads/observability/metrics.md

Lines changed: 150 additions & 0 deletions
# Metrics

A Cortex cluster includes a deployment of Prometheus for metrics collection and a deployment of Grafana for visualization. You can monitor your APIs with the Grafana dashboards that ship with Cortex, or even add custom metrics and dashboards.

## Accessing the dashboard

The dashboard URL is displayed when you run a `cortex get <api_name>` command.

Alternatively, you can access it at `http://<operator_url>/dashboard`. Run the following command to get the operator URL:

```shell
cortex env list
```
If your operator load balancer is configured to be internal, there are a few options for accessing the dashboard:

1. Access the dashboard from a machine that has VPC Peering configured to your cluster's VPC, or that is inside your cluster's VPC
1. Run `kubectl port-forward -n default grafana-0 3000:3000` to forward Grafana's port to your local machine, and access the dashboard at [http://localhost:3000/](http://localhost:3000/) (see instructions for setting up `kubectl` on [AWS](../../clusters/aws/kubectl.md) or [GCP](../../clusters/gcp/kubectl.md))
1. Set up VPN access to your cluster's VPC ([AWS docs](https://docs.aws.amazon.com/vpc/latest/userguide/vpn-connections.html))
### Default credentials

The dashboard is protected with username / password authentication, which by default are:

- Username: admin
- Password: admin

You will be prompted to change the admin user's password the first time you log in.

Grafana allows managing the access of several users, as well as managing teams. For more information on this topic, check the [Grafana documentation](https://grafana.com/docs/grafana/latest/manage-users/).

### Selecting an API

You can select one or more APIs to visualize in the top left corner of the dashboard.

![](https://user-images.githubusercontent.com/7456627/107375721-57545180-6ae9-11eb-9474-ba58ad7eb0c5.png)

### Selecting a time range

Grafana allows you to select the time range over which the metrics are visualized. You can do so in the top right corner of the dashboard.

![](https://user-images.githubusercontent.com/7456627/107376148-d9dd1100-6ae9-11eb-8c2b-c678b41ade01.png)

**Note: Cortex only retains a maximum of 2 weeks' worth of data at any moment in time**

### Available dashboards

More than one dashboard is available by default. You can view the available dashboards from the Grafana menu: `Dashboards -> Manage -> Cortex folder`.

The dashboards that Cortex ships with are the following:

- RealtimeAPI
- BatchAPI
- Cluster resources
- Node resources
## Exposed metrics

Cortex exposes additional metrics through Prometheus that can be potentially useful. To check the available metrics, access the `Explore` menu in Grafana and press the `Metrics` button.

![](https://user-images.githubusercontent.com/7456627/107377492-515f7000-6aeb-11eb-9b46-909120335060.png)

You can use any of these metrics to set up your own dashboards.
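For example, once you have found a metric name in the `Explore` view, a PromQL expression such as the following could back a custom dashboard panel (the metric and label names here are hypothetical placeholders):

```text
# per-model rate of increase of a counter over the last 5 minutes
sum(rate(my_counter[5m])) by (model)
```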
## Custom user metrics

It is possible to export your own custom metrics by using the `MetricsClient` class in your predictor code. This allows you to create custom metrics from your deployed API that can later be used in your own custom dashboards.

Code examples of how to use custom metrics for each API kind can be found here:

- [RealtimeAPI](../realtime/metrics.md#custom-user-metrics)
- [BatchAPI](../batch/metrics.md#custom-user-metrics)
- [TaskAPI](../task/metrics.md#custom-user-metrics)

### Metric types

Currently, only 3 metric types are supported; each is converted to its respective Prometheus type:

- [Counter](https://prometheus.io/docs/concepts/metric_types/#counter) - a cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart.
- [Gauge](https://prometheus.io/docs/concepts/metric_types/#gauge) - a single numerical value that can arbitrarily go up and down.
- [Histogram](https://prometheus.io/docs/concepts/metric_types/#histogram) - samples observations (usually things like request durations or response sizes) and counts them in configurable buckets. It also provides a sum of all observed values.
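The semantics of the three types can be illustrated with a toy in-memory sketch. This is purely illustrative (it is neither Cortex's nor Prometheus's implementation), but it captures the behavioral contract of each type:

```python
import bisect

class ToyCounter:
    """Monotonic: the value can only increase (or be reset to zero on restart)."""
    def __init__(self):
        self.value = 0.0
    def increment(self, amount: float = 1.0):
        if amount < 0:
            raise ValueError("counters can only go up")
        self.value += amount

class ToyGauge:
    """A single numerical value that can arbitrarily go up and down."""
    def __init__(self):
        self.value = 0.0
    def set(self, value: float):
        self.value = value

class ToyHistogram:
    """Counts observations in configurable buckets and sums all observed values."""
    def __init__(self, buckets=(50, 100, 250, 500)):
        self.buckets = sorted(buckets)
        self.counts = [0] * (len(self.buckets) + 1)  # last slot is the +Inf bucket
        self.total = 0.0
    def observe(self, value: float):
        # bucket upper bounds are inclusive, as in Prometheus
        self.counts[bisect.bisect_left(self.buckets, value)] += 1
        self.total += value
```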
### Pushing metrics

- Counter

```python
metrics.increment('my_counter', value=1, tags={"tag": "tag_name"})
```

- Gauge

```python
metrics.gauge('active_connections', value=1001, tags={"tag": "tag_name"})
```

- Histogram

```python
metrics.histogram('inference_time_milliseconds', 120, tags={"tag": "tag_name"})
```
### Metrics client class reference

```python
from typing import Dict


class MetricsClient:

    def gauge(self, metric: str, value: float, tags: Dict[str, str] = None):
        """
        Record the value of a gauge.

        Example:
        >>> metrics.gauge('active_connections', 1001, tags={"protocol": "http"})
        """
        pass

    def increment(self, metric: str, value: float = 1, tags: Dict[str, str] = None):
        """
        Increment the value of a counter.

        Example:
        >>> metrics.increment('model_calls', 1, tags={"model_version": "v1"})
        """
        pass

    def histogram(self, metric: str, value: float, tags: Dict[str, str] = None):
        """
        Set the value in a histogram metric.

        Example:
        >>> metrics.histogram('inference_time_milliseconds', 120, tags={"model_version": "v1"})
        """
        pass
```
