
Commit de79a68

florimondmanca, FlorianVeaux, and kayayarai authored
Add Azure IoT Edge integration (#7465)
* Add skeleton
* Working Docker setup for 1.0.9
* Attempt 1.0.10-rc2 setup
* Finalize RC2 dev setup
* Fix double-endpoint setup, implement scraping of Prometheus endpoints
* Update CI config
* Add config class, add failing integration test
* Successfully collect and test metrics, improve env up/down robustness
* Make tests pass
* Use local mock server for CI tests
* Add Edge Agent metrics
* Update codecov config
* Tweak exclude_labels
* Fix invalid manifest
* Add edgeHub metrics
* Document mock server metrics generation
* Fix Python 2 tests compatibility
* Assert E2E tags
* Skip E2E tests if IOT_EDGE_CONNSTR is missing
* Use Windows-compatible mock server setup
* Add security daemon health service check
* Simplify prometheus url config, add config tests
* Fix style, fix Windows test compat
* Verify service check in e2e
* Fix check class name case
* Add config spec
* Add logs to config spec and test env
* Use auto-discovery for log collection
* Enable log collection via Docker labels
* Set required properties in config spec
* Reorganize config options order
* Loosen wait conditions
* Update namespace to azure.iot_edge
* Add version metadata collection
* Update manifest.json
* Check types
* Write up metadata.csv
* Fill in service_checks.json
* Add TLS support to E2E environment
* Add code comment about single-instance and composition approaches
* Drop note about setting certs in config.yaml
  This is already done automatically by the E2E environment
* Write up README
* Lingo: security daemon -> security manager
* Add recommended monitors
* Apply no-brainer suggestions
  Co-authored-by: Florian Veaux <florian.veaux@datadoghq.com>
* Update version metadata transformer
* Address feedback
  Drop security manager service check
  Reorganize check as an OpenMetricsBaseCheck subclass
  Fix E2E tests
  Update docs
  Fix service checks: can_connect -> prometheus.health
* Move instance config to Edge Agent labels
* Apply suggestions from docs review
  Co-authored-by: Kari Halsted <12926135+kayayarai@users.noreply.github.com>
* Fix type of renotify_interval in monitors json

Co-authored-by: Florian Veaux <florian.veaux@datadoghq.com>
Co-authored-by: Kari Halsted <12926135+kayayarai@users.noreply.github.com>
1 parent d038330 commit de79a68

50 files changed, +2797 -1 lines changed

.azure-pipelines/changes.yml

Lines changed: 1 addition & 1 deletion
@@ -27,7 +27,7 @@ jobs:
   - template: './templates/test-single-windows.yml'
     parameters:
       job_name: Changed
-      check: '--changed datadog_checks_base datadog_checks_dev active_directory aspdotnet disk dns_check dotnetclr exchange_server iis pdh_check sqlserver tcp_check win32_event_log windows_service wmi_check'
+      check: '--changed datadog_checks_base datadog_checks_dev active_directory aspdotnet azure_iot_edge disk dns_check dotnetclr exchange_server iis pdh_check sqlserver tcp_check win32_event_log windows_service wmi_check'
       display: Windows
       pip_cache_config:
         key: 'pip | $(Agent.OS) | datadog_checks_base/datadog_checks/base/data/agent_requirements.in'

.azure-pipelines/templates/test-all-checks.yml

Lines changed: 6 additions & 0 deletions
@@ -49,6 +49,12 @@ jobs:
   - checkName: aspdotnet
     displayName: ASP.NET
     os: windows
+  - checkName: azure_iot_edge
+    displayName: Azure IoT Edge
+    os: linux
+  - checkName: azure_iot_edge
+    displayName: Azure IoT Edge
+    os: windows
   - checkName: btrfs
     displayName: Btrfs
     os: linux

.codecov.yml

Lines changed: 9 additions & 0 deletions
@@ -51,6 +51,10 @@ coverage:
         target: 75
         flags:
           - apache
+      Azure IoT Edge:
+        target: 75
+        flags:
+          - azure_iot_edge
       Btrfs:
         target: 75
         flags:
@@ -565,6 +569,11 @@ flags:
     paths:
       - aspdotnet/datadog_checks/aspdotnet
       - aspdotnet/tests
+  azure_iot_edge:
+    carryforward: true
+    paths:
+      - azure_iot_edge/datadog_checks/azure_iot_edge
+      - azure_iot_edge/tests
   btrfs:
     carryforward: true
     paths:

azure_iot_edge/CHANGELOG.md

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+# CHANGELOG - Azure IoT Edge
+

azure_iot_edge/MANIFEST.in

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+graft datadog_checks
+graft tests
+
+include MANIFEST.in
+include README.md
+include requirements.in
+include requirements-dev.txt
+include manifest.json
+
+global-exclude *.py[cod] __pycache__

azure_iot_edge/README.md

Lines changed: 114 additions & 0 deletions
@@ -0,0 +1,114 @@
+# Agent Check: Azure IoT Edge
+
+## Overview
+
+[Azure IoT Edge][1] is a fully managed service to deploy Cloud workloads to run on Internet of Things (IoT) Edge devices via standard containers.
+
+Use the Datadog-Azure IoT Edge integration to collect metrics and health status from IoT Edge devices.
+
+**Note**: This integration requires IoT Edge runtime version 1.0.10 or above.
+
+## Setup
+
+Follow the instructions below to install and configure this check for an IoT Edge device running on a device host.
+
+### Installation
+
+The Azure IoT Edge check is included in the [Datadog Agent][2] package.
+
+No additional installation is needed on your device.
+
+### Configuration
+
+Configure the IoT Edge device so that the Agent runs as a custom module. Follow the Microsoft documentation on [deploying Azure IoT Edge modules][3] for information on installing and working with custom modules for Azure IoT Edge.
+
+Follow the steps below to configure the IoT Edge device, runtime modules, and the Datadog Agent to start collecting IoT Edge metrics.
+
+1. Configure the **Edge Agent** runtime module as follows:
+    - Image version must be `1.0.10` or above.
+    - Under "Create Options", add the following `Labels`. Edit the `com.datadoghq.ad.instances` label as appropriate. See the [sample azure_iot_edge.d/conf.yaml][5] for all available configuration options. See the documentation on [Docker Integrations Autodiscovery][6] for more information on labels-based integration configuration.
+
+        ```json
+        "Labels": {
+            "com.datadoghq.ad.check_names": "[\"azure_iot_edge\"]",
+            "com.datadoghq.ad.init_configs": "[{}]",
+            "com.datadoghq.ad.instances": "[{\"edge_hub_prometheus_url\": \"http://edgeHub:9600/metrics\", \"edge_agent_prometheus_url\": \"http://edgeAgent:9600/metrics\"}]"
+        }
+        ```
+
+    - Under "Environment Variables", enable experimental metrics by adding these environment variables (note the double underscores):
+        - `ExperimentalFeatures__Enabled`: `true`
+        - `ExperimentalFeatures__EnableMetrics`: `true`
+
+1. Configure the **Edge Hub** runtime module as follows:
+    - Image version must be `1.0.10` or above.
+    - Under "Environment Variables", enable experimental metrics by adding these environment variables (note the double underscores):
+        - `ExperimentalFeatures__Enabled`: `true`
+        - `ExperimentalFeatures__EnableMetrics`: `true`
+
+1. Install and configure the Datadog Agent as a **custom module**:
+    - Set the module name. For example: `datadog-agent`.
+    - Set the Agent image URI. For example: `datadog/agent:7`.
+    - Under "Environment Variables", configure your `DD_API_KEY`. You may also set extra Agent configuration here (see [Agent Environment Variables][4]).
+    - Under "Container Create Options", enter the following configuration based on your device OS. **Note**: `NetworkId` must correspond to the network name set in the device `config.yaml` file.
+
+        - Linux:
+            ```json
+            {
+                "HostConfig": {
+                    "NetworkMode": "default",
+                    "Env": ["NetworkId=azure-iot-edge"],
+                    "Binds": ["/var/run/docker.sock:/var/run/docker.sock"]
+                }
+            }
+            ```
+        - Windows:
+            ```json
+            {
+                "HostConfig": {
+                    "NetworkMode": "default",
+                    "Env": ["NetworkId=nat"],
+                    "Binds": ["//./pipe/iotedge_moby_engine:/./pipe/docker_engine"]
+                }
+            }
+            ```
+
+    - Save the Datadog Agent custom module.
+
+1. Save and deploy changes to your device configuration.
+
+### Validation
+
+Once the Agent has been deployed to the device, [run the Agent's status subcommand][7] and look for `azure_iot_edge` under the Checks section.
+
+## Data Collected
+
+### Metrics
+
+See [metadata.csv][8] for a list of metrics provided by this check.
+
+### Service Checks
+
+**azure.iot_edge.edge_agent.prometheus.health**:
+Returns `CRITICAL` if the Agent is unable to reach the Edge Agent metrics Prometheus endpoint. Returns `OK` otherwise.
+
+**azure.iot_edge.edge_hub.prometheus.health**:
+Returns `CRITICAL` if the Agent is unable to reach the Edge Hub metrics Prometheus endpoint. Returns `OK` otherwise.
+
+### Events
+
+Azure IoT Edge does not include any events.
+
+## Troubleshooting
+
+Need help? Contact [Datadog support][9].
+
+[1]: https://azure.microsoft.com/en-us/services/iot-edge/
+[2]: https://docs.datadoghq.com/agent/
+[3]: https://docs.microsoft.com/en-us/azure/iot-edge/how-to-deploy-modules-portal
+[4]: https://docs.datadoghq.com/agent/guide/environment-variables/
+[5]: https://github.com/DataDog/integrations-core/blob/master/azure_iot_edge/datadog_checks/azure_iot_edge/data/conf.yaml.example
+[6]: https://docs.datadoghq.com/agent/docker/integrations/
+[7]: https://docs.datadoghq.com/agent/guide/agent-commands/#agent-status-and-information
+[8]: https://github.com/DataDog/integrations-core/blob/master/azure_iot_edge/metadata.csv
+[9]: https://docs.datadoghq.com/help/
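
Note on the README's Autodiscovery labels: each label value is itself a JSON document serialized into a string, which is why the quotes are escaped in the README. A minimal sketch of generating the labels instead of hand-escaping them is shown below. The helper name `build_autodiscovery_labels` is hypothetical and not part of this commit; the label keys, option names, and URLs are the ones shown in the README.

```python
import json


def build_autodiscovery_labels(edge_hub_url, edge_agent_url):
    """Build the Docker labels for the Edge Agent module (hypothetical helper)."""
    instance = {
        "edge_hub_prometheus_url": edge_hub_url,
        "edge_agent_prometheus_url": edge_agent_url,
    }
    # Each label value must be a string containing JSON, hence json.dumps here.
    return {
        "com.datadoghq.ad.check_names": json.dumps(["azure_iot_edge"]),
        "com.datadoghq.ad.init_configs": json.dumps([{}]),
        "com.datadoghq.ad.instances": json.dumps([instance]),
    }


labels = build_autodiscovery_labels(
    "http://edgeHub:9600/metrics",
    "http://edgeAgent:9600/metrics",
)
# Serializing the whole mapping a second time produces the escaped quotes
# seen in the README snippet.
print(json.dumps({"Labels": labels}, indent=4))
```

The printed object can be pasted under "Create Options" for the Edge Agent module, assuming the default module hostnames and port used in the README.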
Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
+name: Azure IoT Edge
+files:
+- name: azure_iot_edge.yaml
+  options:
+  - template: init_config
+    options:
+    - template: init_config/default
+  - template: instances
+    options:
+    - name: edge_hub_prometheus_url
+      description: |
+        The URL where Edge Hub metrics are exposed via Prometheus.
+      required: true
+      value:
+        type: string
+        example: http://edgeHub:9600/metrics
+    - name: edge_agent_prometheus_url
+      description: |
+        The URL where Edge Agent metrics are exposed via Prometheus.
+      required: true
+      value:
+        type: string
+        example: http://edgeAgent:9600/metrics
+    - template: instances/default
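
The spec above marks both Prometheus URLs as required strings. The commit message mentions a config class, but its code is not visible in this part of the diff, so the following is only a rough sketch of the validation those two `required: true` options imply. `InstanceConfig`, `ConfigurationError`, and `parse_instance` are hypothetical names, and the sketch is Python 3 only.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Option names taken from the config spec; both are marked `required: true`.
REQUIRED_OPTIONS = ("edge_hub_prometheus_url", "edge_agent_prometheus_url")


class ConfigurationError(ValueError):
    """Raised when an instance is missing or misdefines an option (hypothetical)."""


@dataclass(frozen=True)
class InstanceConfig:
    edge_hub_prometheus_url: str
    edge_agent_prometheus_url: str


def parse_instance(instance):
    """Validate a raw instance mapping and return a typed config object."""
    values = {}
    for option in REQUIRED_OPTIONS:
        value = instance.get(option)
        if value is None:
            raise ConfigurationError("option {!r} is required".format(option))
        if not isinstance(value, str):
            raise ConfigurationError("option {!r} must be a string".format(option))
        parsed = urlparse(value)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ConfigurationError("option {!r} must be an HTTP(S) URL".format(option))
        values[option] = value
    return InstanceConfig(**values)


config = parse_instance({
    "edge_hub_prometheus_url": "http://edgeHub:9600/metrics",
    "edge_agent_prometheus_url": "http://edgeAgent:9600/metrics",
})
print(config)
```

Failing early on a missing URL gives a clearer error than a scrape failure later on; whether the real check validates the URL scheme is an assumption of this sketch.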

azure_iot_edge/assets/dashboards/azure_iot_edge_overview.json

Whitespace-only changes.
Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
+{
+  "name": "[Azure IoT Edge] IoT Edge device {{host}} is running out of available disk space",
+  "type": "query alert",
+  "query": "max(last_1h):avg:azure.iot_edge.edge_agent.available_disk_space_bytes{*} by {host} / avg:azure.iot_edge.edge_agent.total_disk_space_bytes{*} by {host}.rollup(max, 60) * 100 < 10",
+  "message": "Please check device {{host}}, as Edge Agent reports that available disk space has dropped below {{threshold}}%.",
+  "tags": [
+    "integration:azure_iot_edge"
+  ],
+  "options": {
+    "notify_audit": false,
+    "locked": false,
+    "timeout_h": 0,
+    "silenced": {},
+    "include_tags": true,
+    "no_data_timeframe": null,
+    "require_full_window": true,
+    "new_host_delay": 300,
+    "notify_no_data": false,
+    "renotify_interval": 0,
+    "escalation_message": "",
+    "thresholds": {
+      "critical": 10,
+      "warning": 25,
+      "critical_recovery": 11,
+      "warning_recovery": 26
+    }
+  },
+  "recommended_monitor_metadata": {
+    "description": "Triggers an alert when an IoT Edge device is running out of available disk space"
+  }
+}
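
The disk space monitor above divides available bytes by total bytes, scales to a percentage, and compares the result against the `critical` (10) and `warning` (25) thresholds. The arithmetic can be sketched as follows. This is an illustration only, not how Datadog evaluates monitors: `classify_available_disk` is a hypothetical name, and the sketch ignores the `last_1h` evaluation window, the rollup, and the recovery thresholds.

```python
def classify_available_disk(available_bytes, total_bytes, critical=10, warning=25):
    """Map a disk reading to a monitor state using the thresholds above."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    # Multiply before dividing to keep integer inputs exact for as long as possible.
    percent_available = available_bytes * 100 / total_bytes
    if percent_available < critical:
        return "critical"
    if percent_available < warning:
        return "warning"
    return "ok"


# Example readings for a 64 GB device (values are made up for illustration).
total = 64 * 10**9
for available in (40 * 10**9, 12 * 10**9, 3 * 10**9):
    print(available, classify_available_disk(available, total))
```

Note that the comparison is strict, matching the `< 10` in the query: a device with exactly 10% available space is still in the warning state.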
Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
+{
+  "name": "[Azure IoT Edge] Rate of Edge Hub operation retries is higher than usual on device {{host}}",
+  "type": "query alert",
+  "query": "avg(last_1h):anomalies(per_minute(avg:azure.iot_edge.edge_hub.operation_retry_total{*} by {host}), 'basic', 2, direction='above', alert_window='last_15m', interval=60, count_default_zero='true') >= 1",
+  "message": "Please check device {{host}}, as Edge Hub reports a rate of operation retries of {{value}} per minute, which is higher than usual.",
+  "tags": [
+    "integration:azure_iot_edge"
+  ],
+  "options": {
+    "notify_audit": false,
+    "locked": false,
+    "timeout_h": 0,
+    "new_host_delay": 300,
+    "require_full_window": false,
+    "notify_no_data": false,
+    "renotify_interval": 0,
+    "escalation_message": "",
+    "no_data_timeframe": null,
+    "include_tags": true,
+    "thresholds": {
+      "critical": 1,
+      "critical_recovery": 0
+    },
+    "threshold_windows": {
+      "trigger_window": "last_15m",
+      "recovery_window": "last_15m"
+    }
+  },
+  "recommended_monitor_metadata": {
+    "description": "Notifies when rate of Edge Hub operation retries is higher than usual"
+  }
+}
Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
+{
+  "name": "[Azure IoT Edge] Rate of unsuccessful syncs with IoT Hub is higher than usual on device {{host}}",
+  "type": "query alert",
+  "query": "avg(last_1h):anomalies(per_minute(avg:azure.iot_edge.edge_agent.unsuccessful_iothub_syncs_total{*} by {host}), 'basic', 2, direction='above', alert_window='last_15m', interval=60, count_default_zero='true') >= 1",
+  "message": "Number of unsuccessful syncs between Edge Agent and IoT Hub on device {{host}} is at {{value}} per minute, which is higher than usual.",
+  "tags": [
+    "integration:azure_iot_edge"
+  ],
+  "options": {
+    "notify_audit": false,
+    "locked": false,
+    "timeout_h": 0,
+    "new_host_delay": 300,
+    "require_full_window": false,
+    "notify_no_data": false,
+    "renotify_interval": 0,
+    "escalation_message": "",
+    "no_data_timeframe": null,
+    "include_tags": true,
+    "thresholds": {
+      "critical": 1,
+      "critical_recovery": 0
+    },
+    "threshold_windows": {
+      "trigger_window": "last_15m",
+      "recovery_window": "last_15m"
+    }
+  },
+  "recommended_monitor_metadata": {
+    "description": "Notifies when unsuccessful syncs between Edge Agent and IoT Hub are higher than usual"
+  }
+}
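
The two anomaly monitors above wrap cumulative `*_total` counters in `per_minute(...)` before anomaly detection is applied. As a rough illustration of what a per-minute rate over a counter means, the sketch below converts `(timestamp_seconds, counter_value)` samples into rates. It is not Datadog's implementation of `per_minute` or `anomalies`; the function name `per_minute_rates` and the choice to skip counter resets are assumptions of this sketch.

```python
def per_minute_rates(samples):
    """Turn (timestamp_seconds, counter_value) samples into per-minute rates.

    Returns a list of (timestamp_seconds, rate) pairs, one per consecutive
    pair of samples. Pairs where the counter decreased (for example after a
    module restart) or where time did not advance are skipped.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0 or v1 < v0:
            continue
        rates.append((t1, (v1 - v0) * 60 / elapsed))
    return rates


# Made-up retry counter scraped once a minute.
retries = [(0, 0), (60, 3), (120, 3), (180, 9)]
print(per_minute_rates(retries))
```

A flat stretch yields a rate of zero, which is the baseline the anomaly detection would compare later spikes against.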
Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
+{
+  "name": "[Azure IoT Edge] IoT Edge device {{host}} is running out of memory",
+  "type": "query alert",
+  "query": "max(last_1h):avg:azure.iot_edge.edge_agent.used_memory_bytes{*} by {host} / avg:azure.iot_edge.edge_agent.total_memory_bytes{*} by {host}.rollup(max, 60) * 100 > 80",
+  "message": "Please check device {{host}}, as Edge Agent reports usage of more than {{threshold}}% of available RAM for the last hour.",
+  "tags": [
+    "integration:azure_iot_edge"
+  ],
+  "options": {
+    "notify_audit": false,
+    "locked": false,
+    "timeout_h": 0,
+    "silenced": {},
+    "include_tags": true,
+    "no_data_timeframe": null,
+    "require_full_window": true,
+    "new_host_delay": 300,
+    "notify_no_data": false,
+    "renotify_interval": 0,
+    "escalation_message": "",
+    "thresholds": {
+      "critical": 80,
+      "warning": 65,
+      "critical_recovery": 79,
+      "warning_recovery": 64
+    }
+  },
+  "recommended_monitor_metadata": {
+    "description": "Triggers an alert when an IoT Edge device is running out of memory"
+  }
+}
Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
+[
+  {
+    "agent_version": "6.24.0",
+    "integration": "Azure IoT Edge",
+    "groups": [
+      "host",
+      "endpoint"
+    ],
+    "check": "azure.iot_edge.edge_agent.prometheus.health",
+    "statuses": [
+      "ok",
+      "critical"
+    ],
+    "name": "Edge Agent health",
+    "description": "Returns `CRITICAL` if the Agent is unable to reach the Edge Agent metrics Prometheus endpoint. Returns `OK` otherwise."
+  },
+  {
+    "agent_version": "6.24.0",
+    "integration": "Azure IoT Edge",
+    "groups": [
+      "host",
+      "endpoint"
+    ],
+    "check": "azure.iot_edge.edge_hub.prometheus.health",
+    "statuses": [
+      "ok",
+      "critical"
+    ],
+    "name": "Edge Hub health",
+    "description": "Returns `CRITICAL` if the Agent is unable to reach the Edge Hub metrics Prometheus endpoint. Returns `OK` otherwise."
+  }
+]
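
Both service checks above report `CRITICAL` when the Prometheus endpoint cannot be reached and `OK` otherwise. According to the commit message, the real check is an `OpenMetricsBaseCheck` subclass, so these service checks presumably come from the base class rather than from custom code. The standalone sketch below only illustrates the OK/CRITICAL semantics using the standard library; `prometheus_health` is a hypothetical function and is not part of the integration.

```python
import urllib.error
import urllib.request

# Datadog service check status codes.
OK = 0
CRITICAL = 2


def prometheus_health(url, timeout=5.0):
    """Return OK if the endpoint can be fetched, CRITICAL otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read()
    except (urllib.error.URLError, OSError, ValueError):
        # URLError covers HTTP errors and refused connections, OSError covers
        # socket timeouts, ValueError covers malformed URLs.
        return CRITICAL
    return OK
```

On a device, this would be called with the same URLs as the instance config, for example `prometheus_health("http://edgeAgent:9600/metrics")`, assuming the caller runs on the same Docker network as the runtime modules.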
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+# (C) Datadog, Inc. 2020-present
+# All rights reserved
+# Licensed under a 3-clause BSD style license (see LICENSE)
+__path__ = __import__('pkgutil').extend_path(__path__, __name__)  # type: ignore
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+# (C) Datadog, Inc. 2020-present
+# All rights reserved
+# Licensed under a 3-clause BSD style license (see LICENSE)
+__version__ = '0.0.1'
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+# (C) Datadog, Inc. 2020-present
+# All rights reserved
+# Licensed under a 3-clause BSD style license (see LICENSE)
+from .__about__ import __version__
+from .check import AzureIoTEdgeCheck
+
+__all__ = ['__version__', 'AzureIoTEdgeCheck']
