Address feedback
Drop security manager service check

Reorganize check as an OpenMetricsBaseCheck subclass

Fix E2E tests

Update docs

Fix service checks: can_connect -> prometheus.health
Florimond Manca committed Oct 2, 2020
1 parent a1d86cd commit 38e8b19
Showing 16 changed files with 99 additions and 226 deletions.
53 changes: 10 additions & 43 deletions azure_iot_edge/README.md
@@ -6,6 +6,8 @@

Use the Datadog-Azure IoT Edge integration to collect metrics and health status from IoT Edge devices.

**Note**: this integration requires IoT Edge runtime version 1.0.10 or above.

## Setup

Follow the instructions below to install and configure this check for an IoT Edge device running on a device host.
@@ -22,35 +24,6 @@ It is recommended to configure the IoT Edge device so that the Agent runs as a c

Follow the steps below to configure the IoT Edge device, runtime modules, and the Datadog Agent to start collecting IoT Edge metrics.

1. Edit your IoT Edge device `config.yaml` file:
- **Linux**: Make sure the `connect.management_uri` and `listen.management_uri` options point to a Unix Domain Socket (note that this is the default configuration). For example:

```yaml
# /etc/iotedge/config.yaml

connect:
management_uri: "unix:///var/run/iotedge/mgmt.sock"
# ...

listen:
management_uri: "fd://iotedge.mgmt.socket"
# ...
```

- **Windows**: Make sure the `connect.management_uri` and `listen.management_uri` options point to an HTTP endpoint. For example:

```yaml
# /etc/iotedge/config.yaml
connect:
management_uri: "http://localhost:15580"
# ...
listen:
management_uri: "http://localhost:15580"
# ...
```

1. Configure the **Edge Agent** runtime module as follows:
- Image version must be `1.0.10` or above.
- Under "Environment Variables", experimental metrics must be enabled by adding these environment variables (note the double underscores):
@@ -74,15 +47,12 @@ Follow the steps below to configure the IoT Edge device, runtime modules, and th
"HostConfig": {
"NetworkMode": "default",
"Env": ["NetworkId=azure-iot-edge"],
"Binds": [
"/var/run/docker.sock:/var/run/docker.sock",
"/var/run/iotedge/mgmt.sock:/var/run/iotedge/mgmt.sock"
]
"Binds": [ "/var/run/docker.sock:/var/run/docker.sock"]
},
"Labels": {
"com.datadoghq.ad.check_names": "[\"azure_iot_edge\"]",
"com.datadoghq.ad.init_configs": "[{}]",
"com.datadoghq.ad.instances": "[{\"edge_hub_prometheus_url\": \"http://edgeHub:9600/metrics\", \"edge_agent_prometheus_url\": \"http://edgeAgent:9600/metrics\", \"security_manager_management_api_url\": \"unix:///var/run/iotedge/mgmt.sock\"}]"
"com.datadoghq.ad.instances": "[{\"edge_hub_prometheus_url\": \"http://edgeHub:9600/metrics\", \"edge_agent_prometheus_url\": \"http://edgeAgent:9600/metrics\"}]"
}
}
```
@@ -91,13 +61,13 @@ Follow the steps below to configure the IoT Edge device, runtime modules, and th
{
"HostConfig": {
"NetworkMode": "default",
"Env": ["NetworkId=azure-iot-edge"],
"Binds": ["/var/run/docker.sock:/var/run/docker.sock"]
"Env": ["NetworkId=nat"],
"Binds": ["//./pipe/iotedge_moby_engine:/./pipe/docker_engine"]
},
"Labels": {
"com.datadoghq.ad.check_names": "[\"azure_iot_edge\"]",
"com.datadoghq.ad.init_configs": "[{}]",
"com.datadoghq.ad.instances": "[{\"edge_hub_prometheus_url\": \"http://edgeHub:9600/metrics\", \"edge_agent_prometheus_url\": \"http://edgeAgent:9600/metrics\", \"security_manager_management_api_url\": \"http://host.docker.internal:15580/\"]}]"
"com.datadoghq.ad.instances": "[{\"edge_hub_prometheus_url\": \"http://edgeHub:9600/metrics\", \"edge_agent_prometheus_url\": \"http://edgeAgent:9600/metrics\"}]"
}
}
```
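The `Labels` values above are JSON documents stored as strings, which is why every inner quote is escaped. As a minimal sketch, assuming the container create options are generated by a script rather than typed by hand, `json.dumps` can produce that escaping. The helper name `build_ad_labels` is hypothetical and is not part of the integration.

```python
import json


def build_ad_labels(edge_hub_url, edge_agent_url):
    """Return Autodiscovery labels for the azure_iot_edge check (hypothetical helper)."""
    instance = {
        "edge_hub_prometheus_url": edge_hub_url,
        "edge_agent_prometheus_url": edge_agent_url,
    }
    # Each label value is itself a JSON document encoded as a string.
    return {
        "com.datadoghq.ad.check_names": json.dumps(["azure_iot_edge"]),
        "com.datadoghq.ad.init_configs": json.dumps([{}]),
        "com.datadoghq.ad.instances": json.dumps([instance]),
    }


labels = build_ad_labels("http://edgeHub:9600/metrics", "http://edgeAgent:9600/metrics")

# Serializing the enclosing object escapes the inner quotes, as in the README snippets.
create_options = json.dumps({"Labels": labels}, indent=2)
print(create_options)
```

Generating the labels this way avoids hand-escaping mistakes such as an unbalanced bracket inside the `com.datadoghq.ad.instances` string.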
@@ -118,14 +88,11 @@ See [metadata.csv][8] for a list of metrics provided by this check.

### Service Checks

**azure.iot_edge.security_manager.can_connect**:
Returns `CRITICAL` if the Agent is unable to reach the Security Manager management API. Returns `OK` otherwise.

**azure.iot_edge.edge_agent.can_connect**:
**azure.iot_edge.edge_agent.prometheus.health**:
Returns `CRITICAL` if the Agent is unable to reach the Edge Agent metrics Prometheus endpoint. Returns `OK` otherwise.

**azure.iot_edge.edge_hub.can_connect**:
Returns `CRITICAL` if the forest state is `critical`; `WARNING` if it is `maintenance`, `offline`, or `at-risk`; and `OK` otherwise.
**azure.iot_edge.edge_hub.prometheus.health**:
Returns `CRITICAL` if the Agent is unable to reach the Edge Hub metrics Prometheus endpoint. Returns `OK` otherwise.
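
The renamed service checks read as three dot-separated parts: the check namespace, the component namespace, and a `prometheus.health` suffix. The sketch below only illustrates that naming pattern. It assumes plain dot-joining and uses a hypothetical helper; the real names are composed inside the Datadog base check classes, which this diff does not show.

```python
CHECK_NAMESPACE = "azure.iot_edge"  # value of __NAMESPACE__ in check.py
HEALTH_SUFFIX = "prometheus.health"  # suffix of the renamed service checks


def service_check_name(component_namespace):
    """Join the three parts with dots (illustrative helper, not a Datadog API)."""
    return ".".join([CHECK_NAMESPACE, component_namespace, HEALTH_SUFFIX])


for component in ("edge_agent", "edge_hub"):
    print(service_check_name(component))
```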

### Events

7 changes: 0 additions & 7 deletions azure_iot_edge/assets/configuration/spec.yaml
@@ -7,13 +7,6 @@ files:
- template: init_config/default
- template: instances
options:
- name: security_manager_management_api_url
description: |
The URL of the management API exposed by the Security Manager, for health check purposes.
required: true
value:
type: string
example: http://localhost:15580
- name: edge_hub_prometheus_url
description: |
The URL where Edge Hub metrics are exposed via Prometheus.
18 changes: 2 additions & 16 deletions azure_iot_edge/assets/service_checks.json
@@ -1,26 +1,12 @@
[
{
"agent_version": "6.24.0",
"integration": "Azure IoT Edge",
"groups": [
"host"
],
"check": "azure.iot_edge.security_manager.can_connect",
"statuses": [
"ok",
"critical"
],
"name": "Security Manager health",
"description": "Returns `CRITICAL` if the Agent is unable to reach the Security Manager management API. Returns `OK` otherwise."
},
{
"agent_version": "6.24.0",
"integration": "Azure IoT Edge",
"groups": [
"host",
"endpoint"
],
"check": "azure.iot_edge.edge_agent.can_connect",
"check": "azure.iot_edge.edge_agent.prometheus.health",
"statuses": [
"ok",
"critical"
@@ -35,7 +21,7 @@
"host",
"endpoint"
],
"check": "azure.iot_edge.edge_hub.can_connect",
"check": "azure.iot_edge.edge_hub.prometheus.health",
"statuses": [
"ok",
"critical"
81 changes: 7 additions & 74 deletions azure_iot_edge/datadog_checks/azure_iot_edge/check.py
@@ -1,89 +1,22 @@
 # (C) Datadog, Inc. 2020-present
 # All rights reserved
 # Licensed under a 3-clause BSD style license (see LICENSE)
-import json
 from typing import cast

-from datadog_checks.base import AgentCheck, OpenMetricsBaseCheck
+from datadog_checks.base import OpenMetricsBaseCheck

 from .config import Config
 from .types import Instance


-class AzureIoTEdgeCheck(AgentCheck):
+class AzureIoTEdgeCheck(OpenMetricsBaseCheck):
     __NAMESPACE__ = 'azure.iot_edge' # Child of `azure.` namespace.

     def __init__(self, name, init_config, instances):
-        super(AzureIoTEdgeCheck, self).__init__(name, init_config, instances)
-        self._config = Config(cast(Instance, self.instance), check_namespace=self.__NAMESPACE__)
-        self._edge_hub_check = OpenMetricsBaseCheck(name, init_config, [self._config.edge_hub_instance])
-        self._edge_agent_check = OpenMetricsBaseCheck(name, init_config, [self._config.edge_agent_instance])
-
-        # Need a custom metric transformer due to version info being located in a JSON-encoded string.
-        edge_agent_metric_transformers = {'edgeAgent_metadata': self._transform_version_metadata}
-        scraper_config = self._edge_agent_check.get_scraper_config(self._config.edge_agent_instance)
-        scraper_config['_default_metric_transformers'].update(edge_agent_metric_transformers)
-
-    @AgentCheck.metadata_entrypoint
-    def _transform_version_metadata(self, metric, scraper_config):
-        """
-        Submit version metadata from an Edge Agent metadata metric instance.
-        """
-        # NOTE: `metric` looks like this:
-        # edgeAgent_metadata{...,edge_agent_version="...",host_information="{\"...\", \"Version\": \"1.0.10~rc2\"}"}
-        # See: https://github.com/Azure/iotedge/blob/1.0.10-rc2/doc/BuiltInMetrics.md#edgeagent
-
-        labels = metric.samples[0][OpenMetricsBaseCheck.SAMPLE_LABELS] # type: dict
-
-        host_information = labels.get('host_information')
-        if host_information is None:
-            self.log.debug('Label "host_information" not found, skipping version metadata')
-            return
-
-        try:
-            host_info = json.loads(host_information) # type: dict
-        except json.JSONDecodeError as exc:
-            self.log.debug('Error decoding host information, skipping version metadata: %r', exc)
-            return
-
-        # NOTE: Security Manager and Edge Agent SemVer versions are usually the same, but this is not guaranteed.
-        # (The user can configure the Edge Agent module version in the IoT Edge web UI on the Azure portal.)
-        security_manager_version = host_info.get('Version')
-        if security_manager_version is None:
-            self.log.debug('Key "Version" not found in host_information, skipping version metadata')
-            return
-
-        self.set_metadata('version', security_manager_version)
+        self._config = Config(cast(Instance, instances[0]))
+        super(AzureIoTEdgeCheck, self).__init__(name, init_config, self._config.prometheus_instances)

     def check(self, _):
-        self._check_security_manager_health()
-
-        # NOTE: This check consumes configuration from a single instance, and then delegates the Edge Agent and Edge Hub
-        # Prometheus endpoints checks to to two OpenMetrics checks, using a composition approach.
-        # Compared to requiring separate instances for monitor the Edge Agent, Edge Hub, and the Security Manager,
-        # we keep the "1 instance = 1 IoT Edge device" mental model, which provides better configuration UX for users
-        # (as having to configure 3 separate instances is more error-prone).
-        # The composition approach also hopefully makes the code more natural and easier to understand, compared
-        # to subclassing from OpenMetricsBaseCheck and having to dig into its internals such as `.process()` and
-        # scraper configurations.
-
-        # Sync Agent-assigned check ID so that metrics of these sub-checks are reported as coming from this check.
-        self._edge_hub_check.check_id = self.check_id
-        self._edge_agent_check.check_id = self.check_id
-
-        self._edge_hub_check.check(self._config.edge_hub_instance)
-        self._edge_agent_check.check(self._config.edge_agent_instance)
-
-    def _check_security_manager_health(self):
-        try:
-            self.http.get(self._config.security_manager_management_api_url)
-        except Exception as exc:
-            status = self.CRITICAL
-            message = str(exc)
-        else:
-            # The endpoint is responding, which means the management API server is running and accessible to the Agent,
-            # so it's fair to assume that the security manager is running and in good shape.
-            status = self.OK
-            message = ''
-
-        self.service_check('security_manager.can_connect', status, message=message, tags=self._config.tags)
+        for instance in self._config.prometheus_instances:
+            scraper_config = self.get_scraper_config(instance)
+            self.process(scraper_config)
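
The version metadata source changes in this commit. The removed transformer decoded the JSON-encoded `host_information` label, while the new configuration reads the plain `edge_agent_version` label. The sketch below contrasts the two lookups on a plain dictionary. It is an illustration only: the function names are hypothetical, the sample values are taken from the comments and fixtures in this diff, and `host_information` is trimmed to its `Version` key.

```python
import json

# Sample labels of the `edgeAgent_metadata` metric (values from this commit's fixtures).
labels = {
    "edge_agent_version": "1.0.10-rc2.34217022 (029016ef1bf82dec749161d95c6b73aa5ee9baf1)",
    "host_information": json.dumps({"Version": "1.0.10~rc2"}),
}


def security_manager_version(labels):
    """Old lookup: decode the JSON-encoded `host_information` label."""
    raw = labels.get("host_information")
    if raw is None:
        return None
    try:
        return json.loads(raw).get("Version")
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None


def edge_agent_version(labels):
    """New lookup: read the `edge_agent_version` label as-is."""
    return labels.get("edge_agent_version")


print(security_manager_version(labels))
print(edge_agent_version(labels))
```

The two values describe different components, so the reported version can differ from what the previous implementation submitted.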
34 changes: 16 additions & 18 deletions azure_iot_edge/datadog_checks/azure_iot_edge/config.py
@@ -17,45 +17,43 @@ class Config(object):
     Encapsulates the validation of an `instance` dictionary while improving type information.
     """

-    def __init__(self, instance, check_namespace):
-        # type: (Instance, str) -> None
-        self._check_namespace = check_namespace
-
+    def __init__(self, instance):
+        # type: (Instance) -> None
         tags = instance.get('tags', [])

         if not isinstance(tags, list):
             raise ConfigurationError('tags {!r} must be a list (got {!r})'.format(tags, type(tags)))

-        self.tags = tags # type: List[str]
-
-        security_manager_management_api_url = instance.get('security_manager_management_api_url')
-        if not security_manager_management_api_url:
-            raise ConfigurationError('option "security_manager_management_api_url" is required')
-
-        self.security_manager_management_api_url = security_manager_management_api_url
-
         edge_hub_prometheus_url = instance.get('edge_hub_prometheus_url')
         if not edge_hub_prometheus_url:
             raise ConfigurationError('option "edge_hub_prometheus_url" is required')

-        self.edge_hub_instance = self._create_prometheus_instance(
-            edge_hub_prometheus_url, namespace='edge_hub', metrics=EDGE_HUB_METRICS, tags=self.tags
+        edge_hub_instance = self._create_prometheus_instance(
+            edge_hub_prometheus_url, namespace='edge_hub', metrics=EDGE_HUB_METRICS, tags=tags
         )

         edge_agent_prometheus_url = instance.get('edge_agent_prometheus_url')
         if not edge_agent_prometheus_url:
             raise ConfigurationError('option "edge_agent_prometheus_url" is required')

-        self.edge_agent_instance = self._create_prometheus_instance(
-            edge_agent_prometheus_url, namespace='edge_agent', metrics=EDGE_AGENT_METRICS, tags=self.tags
+        edge_agent_instance = self._create_prometheus_instance(
+            edge_agent_prometheus_url, namespace='edge_agent', metrics=EDGE_AGENT_METRICS, tags=tags
         )

+        # Configure version metadata collection.
+        edge_agent_instance['metadata_metric_name'] = 'edgeAgent_metadata'
+        edge_agent_instance['metadata_label_map'] = {'version': 'edge_agent_version'}
+
+        self.prometheus_instances = [
+            edge_hub_instance,
+            edge_agent_instance,
+        ]
+
     def _create_prometheus_instance(self, url, namespace, metrics, tags):
         # type: (str, str, list, List[str]) -> InstanceType
         return {
             'prometheus_url': url,
-            # NOTE: `__NAMESPACE__` is not honored by the OpenMetricsBaseCheck, so we have to insert it manually.
-            'namespace': '{}.{}'.format(self._check_namespace, namespace),
+            'namespace': namespace,
             'metrics': metrics,
             'tags': tags,
             'exclude_labels': [
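
The validation flow in this hunk can be tried without the Agent installed. The following is a simplified sketch, not the integration's code: `ConfigurationError` is a local stand-in for the Datadog exception, `build_prometheus_instances` is a hypothetical function name, and the `metrics` and `exclude_labels` keys are left out because their values are not visible in this diff.

```python
class ConfigurationError(Exception):
    """Local stand-in for the Datadog exception of the same name."""


def build_prometheus_instances(instance):
    """Expand one user-facing instance into one scraper instance per endpoint."""
    tags = instance.get("tags", [])
    if not isinstance(tags, list):
        raise ConfigurationError("tags {!r} must be a list (got {!r})".format(tags, type(tags)))

    built = []
    for option, namespace in (
        ("edge_hub_prometheus_url", "edge_hub"),
        ("edge_agent_prometheus_url", "edge_agent"),
    ):
        url = instance.get(option)
        if not url:
            raise ConfigurationError('option "{}" is required'.format(option))
        built.append({"prometheus_url": url, "namespace": namespace, "tags": tags})

    # Version metadata is read from a label of the Edge Agent metadata metric.
    built[1]["metadata_metric_name"] = "edgeAgent_metadata"
    built[1]["metadata_label_map"] = {"version": "edge_agent_version"}
    return built


instances = build_prometheus_instances(
    {
        "edge_hub_prometheus_url": "http://edgeHub:9600/metrics",
        "edge_agent_prometheus_url": "http://edgeAgent:9600/metrics",
        "tags": ["env:testing"],
    }
)
print([item["namespace"] for item in instances])
```

Keeping a single user-facing instance preserves the "1 instance = 1 IoT Edge device" model described in the removed comment in `check.py`, while the check itself works with two scraper instances.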
@@ -13,15 +13,10 @@ init_config:
#
instances:

## @param security_manager_management_api_url - string - required
## The URL of the management API exposed by the Security Manager, for health check purposes.
#
- security_manager_management_api_url: http://localhost:15580

## @param edge_hub_prometheus_url - string - required
## The URL where Edge Hub metrics are exposed via Prometheus.
#
edge_hub_prometheus_url: http://edgeHub:9600/metrics
- edge_hub_prometheus_url: http://edgeHub:9600/metrics

## @param edge_agent_prometheus_url - string - required
## The URL where Edge Agent metrics are exposed via Prometheus.
1 change: 0 additions & 1 deletion azure_iot_edge/datadog_checks/azure_iot_edge/types.py
@@ -8,7 +8,6 @@
{
'edge_hub_prometheus_url': str,
'edge_agent_prometheus_url': str,
'security_manager_management_api_url': str,
'tags': List[str],
},
total=False,
5 changes: 2 additions & 3 deletions azure_iot_edge/tests/common.py
@@ -14,10 +14,10 @@
IOT_EDGE_IOTHUB_HOSTNAME = 'iot-edge-dev-hub.azure-devices.net'

MOCK_SERVER_PORT = 9678
MOCK_SECURITY_MANAGER_MANAGEMENT_API_URL = 'http://localhost:{}/mgmt.json'.format(MOCK_SERVER_PORT)
MOCK_EDGE_HUB_PROMETHEUS_URL = 'http://localhost:{}/metrics/edge_hub.txt'.format(MOCK_SERVER_PORT)
MOCK_EDGE_AGENT_PROMETHEUS_URL = 'http://localhost:{}/metrics/edge_agent.txt'.format(MOCK_SERVER_PORT)
MOCK_IOT_EDGE_VERSION = ('1', '0', '10', '1.0.10~rc2') # Defined in Edge Agent fixtures.
# Defined in Edge Agent fixtures.
MOCK_EDGE_AGENT_VERSION = ('1', '0', '10', '1.0.10-rc2.34217022 (029016ef1bf82dec749161d95c6b73aa5ee9baf1)')

CUSTOM_TAGS = ['env:testing']

@@ -336,7 +336,6 @@
E2E_IOT_EDGE_DEVICE_CA_PK = os.path.join(HERE, 'tls', 'private', 'new-device.key.pem')

E2E_NETWORK = 'iot-edge-network' # External, create it using `$ docker network create`.
E2E_SECURITY_MANAGER_MANAGEMENT_API_URL = 'http://localhost:15580/'
E2E_EDGE_HUB_PROMETHEUS_URL = 'http://localhost:9601/metrics'
E2E_EDGE_AGENT_PROMETHEUS_URL = 'http://localhost:9602/metrics'
E2E_EXTRA_SPAWNED_CONTAINERS = [
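
The `MOCK_EDGE_AGENT_VERSION` tuple in this file appears to hold the major, minor, and patch components followed by the raw version string. A minimal sketch of how such a tuple relates to the raw label value is shown below. It assumes a leading `MAJOR.MINOR.PATCH` prefix and uses a hypothetical helper name; the Agent performs its own version parsing, which is not part of this diff.

```python
import re

RAW_VERSION = "1.0.10-rc2.34217022 (029016ef1bf82dec749161d95c6b73aa5ee9baf1)"


def split_version(raw):
    """Return (major, minor, patch, raw), or None if no numeric prefix is found."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", raw)
    if match is None:
        return None
    major, minor, patch = match.groups()
    return (major, minor, patch, raw)


print(split_version(RAW_VERSION))
```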
2 changes: 2 additions & 0 deletions azure_iot_edge/tests/compose/device/rund.sh
@@ -22,6 +22,8 @@ agent:
auth: {}
hostname: "edgehub"
connect:
# Use an HTTP endpoint, because mounting Unix sockets is not supported on Docker for macOS.
# See: https://github.com/docker/for-mac/issues/483
management_uri: "http://$IOT_EDGE_DEVICE_HOSTNAME:15580"
workload_uri: "http://$IOT_EDGE_DEVICE_HOSTNAME:15581"
listen:
2 changes: 1 addition & 1 deletion azure_iot_edge/tests/compose/docker-compose-tls.yaml
@@ -28,7 +28,7 @@ services:
networks:
- iot-edge-network
labels:
com.datadoghq.ad.logs: '[{"source": "azure_iot_edge", "service": "azure_iot_edge_dev"}]'
com.datadoghq.ad.logs: '[{"source": "azure.iot_edge", "service": "azure_iot_edge_dev"}]'

networks:
iot-edge-network:
2 changes: 1 addition & 1 deletion azure_iot_edge/tests/compose/docker-compose.yaml
@@ -21,7 +21,7 @@ services:
networks:
- iot-edge-network
labels:
com.datadoghq.ad.logs: '[{"source": "azure_iot_edge", "service": "azure_iot_edge_dev"}]'
com.datadoghq.ad.logs: '[{"source": "azure.iot_edge", "service": "azure_iot_edge_dev"}]'

networks:
iot-edge-network:
1 change: 0 additions & 1 deletion azure_iot_edge/tests/compose/mock_server/data/mgmt.json

This file was deleted.
