Add Azure IoT Edge integration (#7465)
* Add skeleton

* Working Docker setup for 1.0.9

* Attempt 1.0.10-rc2 setup

* Finalize RC2 dev setup

* Fix double-endpoint setup, implement scraping of Prometheus endpoints

* Update CI config

* Add config class, add failing integration test

* Successfully collect and test metrics, improve env up/down robustness

* Make tests pass

* Use local mock server for CI tests

* Add Edge Agent metrics

* Update codecov config

* Tweak exclude_labels

* Fix invalid manifest

* Add edgeHub metrics

* Document mock server metrics generation

* Fix Python 2 tests compatibility

* Assert E2E tags

* Skip E2E tests if IOT_EDGE_CONNSTR is missing

* Use Windows-compatible mock server setup

* Add security daemon health service check

* Simplify prometheus url config, add config tests

* Fix style, fix Windows test compat

* Verify service check in e2e

* Fix check class name case

* Add config spec

* Add logs to config spec and test env

* Use auto-discovery for log collection

* Enable log collection via Docker labels

* Set required properties in config spec

* Reorganize config options order

* Loosen wait conditions

* Update namespace to azure.iot_edge

* Add version metadata collection

* Update manifest.json

* Check types

* Write up metadata.csv

* Fill in service_checks.json

* Add TLS support to E2E environment

* Add code comment about single-instance and composition approaches

* Drop note about setting certs in config.yaml

This is already done automatically by the E2E environment

* Write up README

* Lingo: security daemon -> security manager

* Add recommended monitors

* Apply no-brainer suggestions

Co-authored-by: Florian Veaux <florian.veaux@datadoghq.com>

* Update version metadata transformer

* Address feedback

Drop security manager service check

Reorganize check as an OpenMetricsBaseCheck subclass

Fix E2E tests

Update docs

Fix service checks: can_connect -> prometheus.health

* Move instance config to Edge Agent labels

* Apply suggestions from docs review

Co-authored-by: Kari Halsted <12926135+kayayarai@users.noreply.github.com>

* Fix type of renotify_interval in monitors json

Co-authored-by: Florian Veaux <florian.veaux@datadoghq.com>
Co-authored-by: Kari Halsted <12926135+kayayarai@users.noreply.github.com>
3 people authored Oct 22, 2020
1 parent d038330 commit de79a68
Showing 50 changed files with 2,797 additions and 1 deletion.
2 changes: 1 addition & 1 deletion .azure-pipelines/changes.yml
@@ -27,7 +27,7 @@ jobs:
- template: './templates/test-single-windows.yml'
parameters:
job_name: Changed
- check: '--changed datadog_checks_base datadog_checks_dev active_directory aspdotnet disk dns_check dotnetclr exchange_server iis pdh_check sqlserver tcp_check win32_event_log windows_service wmi_check'
+ check: '--changed datadog_checks_base datadog_checks_dev active_directory aspdotnet azure_iot_edge disk dns_check dotnetclr exchange_server iis pdh_check sqlserver tcp_check win32_event_log windows_service wmi_check'
display: Windows
pip_cache_config:
key: 'pip | $(Agent.OS) | datadog_checks_base/datadog_checks/base/data/agent_requirements.in'
6 changes: 6 additions & 0 deletions .azure-pipelines/templates/test-all-checks.yml
@@ -49,6 +49,12 @@ jobs:
- checkName: aspdotnet
displayName: ASP.NET
os: windows
- checkName: azure_iot_edge
displayName: Azure IoT Edge
os: linux
- checkName: azure_iot_edge
displayName: Azure IoT Edge
os: windows
- checkName: btrfs
displayName: Btrfs
os: linux
9 changes: 9 additions & 0 deletions .codecov.yml
@@ -51,6 +51,10 @@ coverage:
target: 75
flags:
- apache
Azure IoT Edge:
target: 75
flags:
- azure_iot_edge
Btrfs:
target: 75
flags:
@@ -565,6 +569,11 @@ flags:
paths:
- aspdotnet/datadog_checks/aspdotnet
- aspdotnet/tests
azure_iot_edge:
carryforward: true
paths:
- azure_iot_edge/datadog_checks/azure_iot_edge
- azure_iot_edge/tests
btrfs:
carryforward: true
paths:
2 changes: 2 additions & 0 deletions azure_iot_edge/CHANGELOG.md
@@ -0,0 +1,2 @@
# CHANGELOG - Azure IoT Edge

10 changes: 10 additions & 0 deletions azure_iot_edge/MANIFEST.in
@@ -0,0 +1,10 @@
graft datadog_checks
graft tests

include MANIFEST.in
include README.md
include requirements.in
include requirements-dev.txt
include manifest.json

global-exclude *.py[cod] __pycache__
114 changes: 114 additions & 0 deletions azure_iot_edge/README.md
@@ -0,0 +1,114 @@
# Agent Check: Azure IoT Edge

## Overview

[Azure IoT Edge][1] is a fully managed service to deploy Cloud workloads to run on Internet of Things (IoT) Edge devices via standard containers.

Use the Datadog-Azure IoT Edge integration to collect metrics and health status from IoT Edge devices.

**Note**: This integration requires IoT Edge runtime version 1.0.10 or above.

## Setup

Follow the instructions below to install and configure this check for an IoT Edge device running on a device host.

### Installation

The Azure IoT Edge check is included in the [Datadog Agent][2] package.

No additional installation is needed on your device.

### Configuration

Configure the IoT Edge device so that the Agent runs as a custom module. Follow the Microsoft documentation on [deploying Azure IoT Edge modules][3] for information on installing and working with custom modules for Azure IoT Edge.

Follow the steps below to configure the IoT Edge device, runtime modules, and the Datadog Agent to start collecting IoT Edge metrics.

1. Configure the **Edge Agent** runtime module as follows:
    - Image version must be `1.0.10` or above.
    - Under "Create Options", add the following `Labels`. Edit the `com.datadoghq.ad.instances` label as appropriate. See the [sample azure_iot_edge.d/conf.yaml][5] for all available configuration options. See the documentation on [Docker Integrations Autodiscovery][6] for more information on labels-based integration configuration.

        ```json
        "Labels": {
            "com.datadoghq.ad.check_names": "[\"azure_iot_edge\"]",
            "com.datadoghq.ad.init_configs": "[{}]",
            "com.datadoghq.ad.instances": "[{\"edge_hub_prometheus_url\": \"http://edgeHub:9600/metrics\", \"edge_agent_prometheus_url\": \"http://edgeAgent:9600/metrics\"}]"
        }
        ```

    - Under "Environment Variables", enable experimental metrics by adding these environment variables (note the double underscores):
        - `ExperimentalFeatures__Enabled`: `true`
        - `ExperimentalFeatures__EnableMetrics`: `true`

1. Configure the **Edge Hub** runtime module as follows:
    - Image version must be `1.0.10` or above.
    - Under "Environment Variables", enable experimental metrics by adding these environment variables (note the double underscores):
        - `ExperimentalFeatures__Enabled`: `true`
        - `ExperimentalFeatures__EnableMetrics`: `true`

1. Install and configure the Datadog Agent as a **custom module**:
    - Set the module name. For example: `datadog-agent`.
    - Set the Agent image URI. For example: `datadog/agent:7`.
    - Under "Environment Variables", configure your `DD_API_KEY`. You may also set extra Agent configuration here (see [Agent Environment Variables][4]).
    - Under "Container Create Options", enter the following configuration based on your device OS. **Note**: `NetworkId` must correspond to the network name set in the device `config.yaml` file.

        - Linux:
            ```json
            {
                "HostConfig": {
                    "NetworkMode": "default",
                    "Env": ["NetworkId=azure-iot-edge"],
                    "Binds": ["/var/run/docker.sock:/var/run/docker.sock"]
                }
            }
            ```
        - Windows:
            ```json
            {
                "HostConfig": {
                    "NetworkMode": "default",
                    "Env": ["NetworkId=nat"],
                    "Binds": ["//./pipe/iotedge_moby_engine:/./pipe/docker_engine"]
                }
            }
            ```

    - Save the Datadog Agent custom module.

1. Save and deploy changes to your device configuration.
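
The Autodiscovery label values in step 1 are JSON documents serialized into strings, which is why the quotes are escaped. As a minimal sketch, the labels could be generated rather than hand-escaped. The helper name `build_autodiscovery_labels` is hypothetical and not part of the integration; the label keys and URLs are the ones shown above.

```python
import json


def build_autodiscovery_labels(edge_hub_url, edge_agent_url):
    """Return Docker labels for the Edge Agent module (hypothetical helper)."""
    instance = {
        "edge_hub_prometheus_url": edge_hub_url,
        "edge_agent_prometheus_url": edge_agent_url,
    }
    # Each label value is itself JSON, serialized to a string.
    return {
        "com.datadoghq.ad.check_names": json.dumps(["azure_iot_edge"]),
        "com.datadoghq.ad.init_configs": json.dumps([{}]),
        "com.datadoghq.ad.instances": json.dumps([instance]),
    }


labels = build_autodiscovery_labels(
    "http://edgeHub:9600/metrics", "http://edgeAgent:9600/metrics"
)
# Serializing the outer document escapes the inner quotes automatically.
print(json.dumps({"Labels": labels}, indent=2))
```

The printed fragment can be pasted into the module's "Create Options"; only the two URLs need to change if the runtime modules expose metrics elsewhere.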

### Validation

Once the Agent has been deployed to the device, [run the Agent's status subcommand][7] and look for `azure_iot_edge` under the Checks section.

## Data Collected

### Metrics

See [metadata.csv][8] for a list of metrics provided by this check.

### Service Checks

**azure.iot_edge.edge_agent.prometheus.health**:
Returns `CRITICAL` if the Agent is unable to reach the Edge Agent metrics Prometheus endpoint. Returns `OK` otherwise.

**azure.iot_edge.edge_hub.prometheus.health**:
Returns `CRITICAL` if the Agent is unable to reach the Edge Hub metrics Prometheus endpoint. Returns `OK` otherwise.
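
To illustrate the semantics of these two service checks, here is a rough sketch that maps endpoint reachability to a status. It is not the integration's code (the commit message says the check builds on `OpenMetricsBaseCheck`); the function `prometheus_health` and its injectable `fetch` parameter are assumptions made for the example.

```python
from urllib.request import urlopen

OK = "OK"
CRITICAL = "CRITICAL"


def _http_get(url, timeout=5):
    # Plain HTTP GET; raises an OSError subclass on connection or HTTP errors.
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def prometheus_health(url, fetch=_http_get):
    """Return CRITICAL if the endpoint cannot be reached, OK otherwise."""
    try:
        fetch(url)
    except OSError:
        return CRITICAL
    return OK


def unreachable(url):
    # Stub standing in for a device whose metrics endpoint is down.
    raise OSError("connection refused")


print(prometheus_health("http://edgeAgent:9600/metrics", fetch=unreachable))
```

Passing a stub for `fetch` keeps the example runnable off-device; on a device, the default performs a real request against the given URL.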

### Events

Azure IoT Edge does not include any events.

## Troubleshooting

Need help? Contact [Datadog support][9].

[1]: https://azure.microsoft.com/en-us/services/iot-edge/
[2]: https://docs.datadoghq.com/agent/
[3]: https://docs.microsoft.com/en-us/azure/iot-edge/how-to-deploy-modules-portal
[4]: https://docs.datadoghq.com/agent/guide/environment-variables/
[5]: https://github.com/DataDog/integrations-core/blob/master/azure_iot_edge/datadog_checks/azure_iot_edge/data/conf.yaml.example
[6]: https://docs.datadoghq.com/agent/docker/integrations/
[7]: https://docs.datadoghq.com/agent/guide/agent-commands/#agent-status-and-information
[8]: https://github.com/DataDog/integrations-core/blob/master/azure_iot_edge/metadata.csv
[9]: https://docs.datadoghq.com/help/
24 changes: 24 additions & 0 deletions azure_iot_edge/assets/configuration/spec.yaml
@@ -0,0 +1,24 @@
name: Azure IoT Edge
files:
- name: azure_iot_edge.yaml
  options:
  - template: init_config
    options:
    - template: init_config/default
  - template: instances
    options:
    - name: edge_hub_prometheus_url
      description: |
        The URL where Edge Hub metrics are exposed via Prometheus.
      required: true
      value:
        type: string
        example: http://edgeHub:9600/metrics
    - name: edge_agent_prometheus_url
      description: |
        The URL where Edge Agent metrics are exposed via Prometheus.
      required: true
      value:
        type: string
        example: http://edgeAgent:9600/metrics
    - template: instances/default
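
The spec above marks both URL options as required strings. A minimal sketch of what enforcing that could look like follows. The commit mentions a config class, but it is not shown here, so `validate_instance` and `REQUIRED_OPTIONS` are hypothetical names used only for illustration.

```python
REQUIRED_OPTIONS = ("edge_hub_prometheus_url", "edge_agent_prometheus_url")


def validate_instance(instance):
    """Check that an instance dict carries both required URL options."""
    missing = [name for name in REQUIRED_OPTIONS if not instance.get(name)]
    if missing:
        raise ValueError("missing required option(s): " + ", ".join(missing))
    for name in REQUIRED_OPTIONS:
        # The spec declares both options with `type: string`.
        if not isinstance(instance[name], str):
            raise TypeError("option {!r} must be a string".format(name))
    return {name: instance[name] for name in REQUIRED_OPTIONS}


config = validate_instance(
    {
        "edge_hub_prometheus_url": "http://edgeHub:9600/metrics",
        "edge_agent_prometheus_url": "http://edgeAgent:9600/metrics",
    }
)
print(sorted(config))
```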
Empty file.
31 changes: 31 additions & 0 deletions azure_iot_edge/assets/monitors/disk_usage.json
@@ -0,0 +1,31 @@
{
"name": "[Azure IoT Edge] IoT Edge device {{host}} is running out of available disk space",
"type": "query alert",
"query": "max(last_1h):avg:azure.iot_edge.edge_agent.available_disk_space_bytes{*} by {host} / avg:azure.iot_edge.edge_agent.total_disk_space_bytes{*} by {host}.rollup(max, 60) * 100 < 10",
"message": "Please check device {{host}}, as Edge Agent reports that available disk space has dropped below {{threshold}}%.",
"tags": [
"integration:azure_iot_edge"
],
"options": {
"notify_audit": false,
"locked": false,
"timeout_h": 0,
"silenced": {},
"include_tags": true,
"no_data_timeframe": null,
"require_full_window": true,
"new_host_delay": 300,
"notify_no_data": false,
"renotify_interval": 0,
"escalation_message": "",
"thresholds": {
"critical": 10,
"warning": 25,
"critical_recovery": 11,
"warning_recovery": 26
}
},
"recommended_monitor_metadata": {
"description": "Triggers an alert when an IoT Edge device is running out of available disk space"
}
}
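
The arithmetic behind the disk usage monitor query can be sketched as follows: take the percentage of available disk space for each sample in the window, then compare the window's maximum against the thresholds. This is a simplified illustration only; it ignores the `rollup` and per-host grouping that Datadog applies, and the function name `disk_alert_level` is hypothetical.

```python
def disk_alert_level(samples, critical=10.0, warning=25.0):
    """Classify (available_bytes, total_bytes) samples from one window."""
    if not samples:
        raise ValueError("at least one sample is required")
    percentages = []
    for available, total in samples:
        if total <= 0:
            raise ValueError("total disk space must be positive")
        percentages.append(available / total * 100)
    # max(last_1h): alert only if even the best sample is below the threshold.
    best = max(percentages)
    if best < critical:
        return "critical"
    if best < warning:
        return "warning"
    return "ok"


# Free space stayed under 10% for the whole window.
print(disk_alert_level([(5, 100), (8, 100)]))
```

Because the query aggregates with `max`, a single sample above a threshold within the hour is enough to keep the monitor from triggering at that level.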
32 changes: 32 additions & 0 deletions azure_iot_edge/assets/monitors/edgehub_retries.json
@@ -0,0 +1,32 @@
{
"name": "[Azure IoT Edge] Rate of Edge Hub operations retries is higher than usual on device {{host}}",
"type": "query alert",
"query": "avg(last_1h):anomalies(per_minute(avg:azure.iot_edge.edge_hub.operation_retry_total{*} by {host}), 'basic', 2, direction='above', alert_window='last_15m', interval=60, count_default_zero='true') >= 1",
"message": "Please check device {{host}}, as Edge Hub reports a rate of operation retries of {{value}} per minute, which is higher than usual.",
"tags": [
"integration:azure_iot_edge"
],
"options": {
"notify_audit": false,
"locked": false,
"timeout_h": 0,
"new_host_delay": 300,
"require_full_window": false,
"notify_no_data": false,
"renotify_interval": 0,
"escalation_message": "",
"no_data_timeframe": null,
"include_tags": true,
"thresholds": {
"critical": 1,
"critical_recovery": 0
},
"threshold_windows": {
"trigger_window": "last_15m",
"recovery_window": "last_15m"
}
},
"recommended_monitor_metadata": {
"description": "Notifies when rate of Edge Hub operation retries is higher than usual"
}
}
32 changes: 32 additions & 0 deletions azure_iot_edge/assets/monitors/iothub_syncs.json
@@ -0,0 +1,32 @@
{
"name": "[Azure IoT Edge] Rate of unsuccessful syncs with IoT Hub is higher than usual on device {{host}}",
"type": "query alert",
"query": "avg(last_1h):anomalies(per_minute(avg:azure.iot_edge.edge_agent.unsuccessful_iothub_syncs_total{*} by {host}), 'basic', 2, direction='above', alert_window='last_15m', interval=60, count_default_zero='true') >= 1",
"message": "Number of unsuccessful syncs between Edge Agent and IoT Hub on device {{host}} is at {{value}} per minute, which is higher than usual.",
"tags": [
"integration:azure_iot_edge"
],
"options": {
"notify_audit": false,
"locked": false,
"timeout_h": 0,
"new_host_delay": 300,
"require_full_window": false,
"notify_no_data": false,
"renotify_interval": 0,
"escalation_message": "",
"no_data_timeframe": null,
"include_tags": true,
"thresholds": {
"critical": 1,
"critical_recovery": 0
},
"threshold_windows": {
"trigger_window": "last_15m",
"recovery_window": "last_15m"
}
},
"recommended_monitor_metadata": {
"description": "Notifies when unsuccessful syncs between Edge Agent and IoT Hub are higher than usual"
}
}
31 changes: 31 additions & 0 deletions azure_iot_edge/assets/monitors/memory_usage.json
@@ -0,0 +1,31 @@
{
"name": "[Azure IoT Edge] IoT Edge device {{host}} is running out of memory",
"type": "query alert",
"query": "max(last_1h):avg:azure.iot_edge.edge_agent.used_memory_bytes{*} by {host} / avg:azure.iot_edge.edge_agent.total_memory_bytes{*} by {host}.rollup(max, 60) * 100 > 80",
"message": "Please check device {{host}}, as Edge Agent reports usage of more than {{threshold}}% of available RAM for the last hour.",
"tags": [
"integration:azure_iot_edge"
],
"options": {
"notify_audit": false,
"locked": false,
"timeout_h": 0,
"silenced": {},
"include_tags": true,
"no_data_timeframe": null,
"require_full_window": true,
"new_host_delay": 300,
"notify_no_data": false,
"renotify_interval": 0,
"escalation_message": "",
"thresholds": {
"critical": 80,
"warning": 65,
"critical_recovery": 79,
"warning_recovery": 64
}
},
"recommended_monitor_metadata": {
"description": "Triggers an alert when an IoT Edge device is running out of memory"
}
}
32 changes: 32 additions & 0 deletions azure_iot_edge/assets/service_checks.json
@@ -0,0 +1,32 @@
[
{
"agent_version": "6.24.0",
"integration": "Azure IoT Edge",
"groups": [
"host",
"endpoint"
],
"check": "azure.iot_edge.edge_agent.prometheus.health",
"statuses": [
"ok",
"critical"
],
"name": "Edge Agent health",
"description": "Returns `CRITICAL` if the Agent is unable to reach the Edge Agent metrics Prometheus endpoint. Returns `OK` otherwise."
},
{
"agent_version": "6.24.0",
"integration": "Azure IoT Edge",
"groups": [
"host",
"endpoint"
],
"check": "azure.iot_edge.edge_hub.prometheus.health",
"statuses": [
"ok",
"critical"
],
"name": "Edge Hub health",
"description": "Returns `CRITICAL` if the Agent is unable to reach the Edge Hub metrics Prometheus endpoint. Returns `OK` otherwise."
}
]
4 changes: 4 additions & 0 deletions azure_iot_edge/datadog_checks/__init__.py
@@ -0,0 +1,4 @@
# (C) Datadog, Inc. 2020-present
# All rights reserved
# Licensed under a 3-clause BSD style license (see LICENSE)
__path__ = __import__('pkgutil').extend_path(__path__, __name__) # type: ignore
4 changes: 4 additions & 0 deletions azure_iot_edge/datadog_checks/azure_iot_edge/__about__.py
@@ -0,0 +1,4 @@
# (C) Datadog, Inc. 2020-present
# All rights reserved
# Licensed under a 3-clause BSD style license (see LICENSE)
__version__ = '0.0.1'
7 changes: 7 additions & 0 deletions azure_iot_edge/datadog_checks/azure_iot_edge/__init__.py
@@ -0,0 +1,7 @@
# (C) Datadog, Inc. 2020-present
# All rights reserved
# Licensed under a 3-clause BSD style license (see LICENSE)
from .__about__ import __version__
from .check import AzureIoTEdgeCheck

__all__ = ['__version__', 'AzureIoTEdgeCheck']