Add Azure IoT Edge integration (#7465)
* Add skeleton
* Working Docker setup for 1.0.9
* Attempt 1.0.10-rc2 setup
* Finalize RC2 dev setup
* Fix double-endpoint setup, implement scraping of Prometheus endpoints
* Update CI config
* Add config class, add failing integration test
* Successfully collect and test metrics, improve env up/down robustness
* Make tests pass
* Use local mock server for CI tests
* Add Edge Agent metrics
* Update codecov config
* Tweak exclude_labels
* Fix invalid manifest
* Add edgeHub metrics
* Document mock server metrics generation
* Fix Python 2 tests compatibility
* Assert E2E tags
* Skip E2E tests if IOT_EDGE_CONNSTR is missing
* Use Windows-compatible mock server setup
* Add security daemon health service check
* Simplify prometheus url config, add config tests
* Fix style, fix Windows test compat
* Verify service check in e2e
* Fix check class name case
* Add config spec
* Add logs to config spec and test env
* Use auto-discovery for log collection
* Enable log collection via Docker labels
* Set required properties in config spec
* Reorganize config options order
* Loosen wait conditions
* Update namespace to azure.iot_edge
* Add version metadata collection
* Update manifest.json
* Check types
* Write up metadata.csv
* Fill in service_checks.json
* Add TLS support to E2E environment
* Add code comment about single-instance and composition approaches
* Drop note about setting certs in config.yaml
  This is already done automatically by the E2E environment
* Write up README
* Lingo: security daemon -> security manager
* Add recommended monitors
* Apply no-brainer suggestions
  Co-authored-by: Florian Veaux <florian.veaux@datadoghq.com>
* Update version metadata transformer
* Address feedback
  Drop security manager service check
  Reorganize check as an OpenMetricsBaseCheck subclass
  Fix E2E tests
  Update docs
  Fix service checks: can_connect -> prometheus.health
* Move instance config to Edge Agent labels
* Apply suggestions from docs review
  Co-authored-by: Kari Halsted <12926135+kayayarai@users.noreply.github.com>
* Fix type of renotify_interval in monitors json

Co-authored-by: Florian Veaux <florian.veaux@datadoghq.com>
Co-authored-by: Kari Halsted <12926135+kayayarai@users.noreply.github.com>
1 parent d038330, commit de79a68
Showing 50 changed files with 2,797 additions and 1 deletion.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
# CHANGELOG - Azure IoT Edge | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,10 @@ | ||
graft datadog_checks | ||
graft tests | ||
|
||
include MANIFEST.in | ||
include README.md | ||
include requirements.in | ||
include requirements-dev.txt | ||
include manifest.json | ||
|
||
global-exclude *.py[cod] __pycache__ |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,114 @@ | ||
# Agent Check: Azure IoT Edge | ||
|
||
## Overview | ||
|
||
[Azure IoT Edge][1] is a fully managed service to deploy Cloud workloads to run on Internet of Things (IoT) Edge devices via standard containers. | ||
|
||
Use the Datadog-Azure IoT Edge integration to collect metrics and health status from IoT Edge devices. | ||
|
||
**Note**: This integration requires IoT Edge runtime version 1.0.10 or above. | ||
|
||
## Setup | ||
|
||
Follow the instructions below to install and configure this check for an IoT Edge device running on a device host. | ||
|
||
### Installation | ||
|
||
The Azure IoT Edge check is included in the [Datadog Agent][2] package. | ||
|
||
No additional installation is needed on your device. | ||
|
||
### Configuration | ||
|
||
Configure the IoT Edge device so that the Agent runs as a custom module. Follow the Microsoft documentation on [deploying Azure IoT Edge modules][3] for information on installing and working with custom modules for Azure IoT Edge. | ||
|
||
Follow the steps below to configure the IoT Edge device, runtime modules, and the Datadog Agent to start collecting IoT Edge metrics. | ||
|
||
1. Configure the **Edge Agent** runtime module as follows: | ||
- Image version must be `1.0.10` or above. | ||
- Under "Create Options", add the following `Labels`. Edit the `com.datadoghq.ad.instances` label as appropriate. See the [sample azure_iot_edge.d/conf.yaml][5] for all available configuration options. See the documentation on [Docker Integrations Autodiscovery][6] for more information on labels-based integration configuration. | ||
|
||
```json | ||
"Labels": { | ||
"com.datadoghq.ad.check_names": "[\"azure_iot_edge\"]", | ||
"com.datadoghq.ad.init_configs": "[{}]", | ||
"com.datadoghq.ad.instances": "[{\"edge_hub_prometheus_url\": \"http://edgeHub:9600/metrics\", \"edge_agent_prometheus_url\": \"http://edgeAgent:9600/metrics\"}]" | ||
} | ||
``` | ||
|
||
- Under "Environment Variables", enable experimental metrics by adding these environment variables (note the double underscores): | ||
- `ExperimentalFeatures__Enabled`: `true` | ||
- `ExperimentalFeatures__EnableMetrics`: `true` | ||
|
||
1. Configure the **Edge Hub** runtime module as follows: | ||
- Image version must be `1.0.10` or above. | ||
- Under "Environment Variables", enable experimental metrics by adding these environment variables (note the double underscores): | ||
- `ExperimentalFeatures__Enabled`: `true` | ||
- `ExperimentalFeatures__EnableMetrics`: `true` | ||
|
||
1. Install and configure the Datadog Agent as a **custom module**: | ||
- Set the module name. For example: `datadog-agent`. | ||
- Set the Agent image URI. For example: `datadog/agent:7`. | ||
- Under "Environment Variables", configure your `DD_API_KEY`. You may also set extra Agent configuration here (see [Agent Environment Variables][4]). | ||
- Under "Container Create Options", enter the following configuration based on your device OS. **Note**: `NetworkId` must correspond to the network name set in the device `config.yaml` file. | ||
|
||
- Linux: | ||
```json | ||
{ | ||
"HostConfig": { | ||
"NetworkMode": "default", | ||
"Env": ["NetworkId=azure-iot-edge"], | ||
"Binds": ["/var/run/docker.sock:/var/run/docker.sock"] | ||
} | ||
} | ||
``` | ||
- Windows: | ||
```json | ||
{ | ||
"HostConfig": { | ||
"NetworkMode": "default", | ||
"Env": ["NetworkId=nat"], | ||
"Binds": ["//./pipe/iotedge_moby_engine:/./pipe/docker_engine"] | ||
} | ||
} | ||
``` | ||
|
||
- Save the Datadog Agent custom module. | ||
|
||
1. Save and deploy changes to your device configuration. | ||
|
||
### Validation | ||
|
||
Once the Agent has been deployed to the device, [run the Agent's status subcommand][7] and look for `azure_iot_edge` under the Checks section. | ||
|
||
## Data Collected | ||
|
||
### Metrics | ||
|
||
See [metadata.csv][8] for a list of metrics provided by this check. | ||
|
||
### Service Checks | ||
|
||
**azure.iot_edge.edge_agent.prometheus.health**: | ||
Returns `CRITICAL` if the Agent is unable to reach the Edge Agent metrics Prometheus endpoint. Returns `OK` otherwise. | ||
|
||
**azure.iot_edge.edge_hub.prometheus.health**: | ||
Returns `CRITICAL` if the Agent is unable to reach the Edge Hub metrics Prometheus endpoint. Returns `OK` otherwise. | ||
|
||
### Events | ||
|
||
Azure IoT Edge does not include any events. | ||
|
||
## Troubleshooting | ||
|
||
Need help? Contact [Datadog support][9]. | ||
|
||
[1]: https://azure.microsoft.com/en-us/services/iot-edge/ | ||
[2]: https://docs.datadoghq.com/agent/ | ||
[3]: https://docs.microsoft.com/en-us/azure/iot-edge/how-to-deploy-modules-portal | ||
[4]: https://docs.datadoghq.com/agent/guide/environment-variables/ | ||
[5]: https://github.com/DataDog/integrations-core/blob/master/azure_iot_edge/datadog_checks/azure_iot_edge/data/conf.yaml.example | ||
[6]: https://docs.datadoghq.com/agent/docker/integrations/ | ||
[7]: https://docs.datadoghq.com/agent/guide/agent-commands/#agent-status-and-information | ||
[8]: https://github.com/DataDog/integrations-core/blob/master/azure_iot_edge/metadata.csv | ||
[9]: https://docs.datadoghq.com/help/ |
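
The Autodiscovery label values in the README above are JSON documents embedded as strings, which is why the quotes are escaped. As a minimal sketch, assuming you would rather generate the labels than escape them by hand, the following Python builds the same `Labels` object. The helper name `build_autodiscovery_labels` is hypothetical and is not part of this commit.

```python
import json


def build_autodiscovery_labels(edge_hub_url, edge_agent_url):
    """Return Docker labels for the Edge Agent module (hypothetical helper).

    Each label value is itself a JSON-encoded string, matching the
    escaped form shown in the README.
    """
    instance = {
        "edge_hub_prometheus_url": edge_hub_url,
        "edge_agent_prometheus_url": edge_agent_url,
    }
    return {
        "com.datadoghq.ad.check_names": json.dumps(["azure_iot_edge"]),
        "com.datadoghq.ad.init_configs": json.dumps([{}]),
        "com.datadoghq.ad.instances": json.dumps([instance]),
    }


labels = build_autodiscovery_labels(
    "http://edgeHub:9600/metrics",
    "http://edgeAgent:9600/metrics",
)

# Serializing the outer object adds the backslash escapes automatically.
print(json.dumps({"Labels": labels}, indent=2))
```

Decoding each label value with `json.loads` should give back the check name list, the empty init config, and the instance dictionary, which is a quick way to confirm the escaping is correct before pasting the result into "Create Options".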
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,24 @@ | ||
name: Azure IoT Edge | ||
files: | ||
- name: azure_iot_edge.yaml | ||
options: | ||
- template: init_config | ||
options: | ||
- template: init_config/default | ||
- template: instances | ||
options: | ||
- name: edge_hub_prometheus_url | ||
description: | | ||
The URL where Edge Hub metrics are exposed via Prometheus. | ||
required: true | ||
value: | ||
type: string | ||
example: http://edgeHub:9600/metrics | ||
- name: edge_agent_prometheus_url | ||
description: | | ||
The URL where Edge Agent metrics are exposed via Prometheus. | ||
required: true | ||
value: | ||
type: string | ||
example: http://edgeAgent:9600/metrics | ||
- template: instances/default |
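
The spec above marks both Prometheus URL options as required strings. As an illustrative sketch only (the check's actual config class is not shown in this part of the diff), a pre-deployment validation of an instance dictionary could look like the following. `REQUIRED_OPTIONS` and `missing_required_options` are hypothetical names.

```python
# Option names taken from the config spec; the helper itself is an assumption.
REQUIRED_OPTIONS = ("edge_hub_prometheus_url", "edge_agent_prometheus_url")


def missing_required_options(instance):
    """Return the required options that are absent, empty, or not strings."""
    missing = []
    for name in REQUIRED_OPTIONS:
        value = instance.get(name)
        if not isinstance(value, str) or not value:
            missing.append(name)
    return missing


instance = {"edge_hub_prometheus_url": "http://edgeHub:9600/metrics"}
print(missing_required_options(instance))
```

Returning the list of problems, rather than raising on the first one, lets a caller report every missing option in a single message.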
Empty file.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,31 @@ | ||
{ | ||
"name": "[Azure IoT Edge] IoT Edge device {{host}} is running out of available disk space", | ||
"type": "query alert", | ||
"query": "max(last_1h):avg:azure.iot_edge.edge_agent.available_disk_space_bytes{*} by {host} / avg:azure.iot_edge.edge_agent.total_disk_space_bytes{*} by {host}.rollup(max, 60) * 100 < 10", | ||
"message": "Please check device {{host}}, as Edge Agent reports that available disk space has dropped below {{threshold}}%.", | ||
"tags": [ | ||
"integration:azure_iot_edge" | ||
], | ||
"options": { | ||
"notify_audit": false, | ||
"locked": false, | ||
"timeout_h": 0, | ||
"silenced": {}, | ||
"include_tags": true, | ||
"no_data_timeframe": null, | ||
"require_full_window": true, | ||
"new_host_delay": 300, | ||
"notify_no_data": false, | ||
"renotify_interval": 0, | ||
"escalation_message": "", | ||
"thresholds": { | ||
"critical": 10, | ||
"warning": 25, | ||
"critical_recovery": 11, | ||
"warning_recovery": 26 | ||
} | ||
}, | ||
"recommended_monitor_metadata": { | ||
"description": "Triggers an alert when an IoT Edge device is running out of available disk space" | ||
} | ||
} |
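
The disk space monitor above divides available bytes by total bytes, scales the ratio to a percentage, and compares it with the `critical` (10) and `warning` (25) thresholds. The sketch below reproduces that arithmetic for a single data point. It does not model the `max(last_1h)` evaluation window or the recovery thresholds, and the function names are hypothetical.

```python
def available_disk_percent(available_bytes, total_bytes):
    """Percentage of disk space still available on the device."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return available_bytes / total_bytes * 100


def disk_alert_level(percent, critical=10, warning=25):
    """Classify one data point using the monitor's 'below threshold' logic."""
    if percent < critical:
        return "critical"
    if percent < warning:
        return "warning"
    return "ok"


GIB = 1024 ** 3
percent = available_disk_percent(8 * GIB, 100 * GIB)
print(disk_alert_level(percent))
```

With 8 GiB free out of 100 GiB, the ratio is below the 10% critical threshold, so this data point would count toward a critical alert.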
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,32 @@ | ||
{ | ||
"name": "[Azure IoT Edge] Rate of Edge Hub operation retries is higher than usual on device {{host}}", | |
"type": "query alert", | ||
"query": "avg(last_1h):anomalies(per_minute(avg:azure.iot_edge.edge_hub.operation_retry_total{*} by {host}), 'basic', 2, direction='above', alert_window='last_15m', interval=60, count_default_zero='true') >= 1", | ||
"message": "Please check device {{host}}, as Edge Hub reports a rate of operation retries of {{value}} per minute, which is higher than usual.", | ||
"tags": [ | ||
"integration:azure_iot_edge" | ||
], | ||
"options": { | ||
"notify_audit": false, | ||
"locked": false, | ||
"timeout_h": 0, | ||
"new_host_delay": 300, | ||
"require_full_window": false, | ||
"notify_no_data": false, | ||
"renotify_interval": 0, | ||
"escalation_message": "", | ||
"no_data_timeframe": null, | ||
"include_tags": true, | ||
"thresholds": { | ||
"critical": 1, | ||
"critical_recovery": 0 | ||
}, | ||
"threshold_windows": { | ||
"trigger_window": "last_15m", | ||
"recovery_window": "last_15m" | ||
} | ||
}, | ||
"recommended_monitor_metadata": { | ||
"description": "Notifies when rate of Edge Hub operation retries is higher than usual" | ||
} | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,32 @@ | ||
{ | ||
"name": "[Azure IoT Edge] Rate of unsuccessful syncs with IoT Hub is higher than usual on device {{host}}", | ||
"type": "query alert", | ||
"query": "avg(last_1h):anomalies(per_minute(avg:azure.iot_edge.edge_agent.unsuccessful_iothub_syncs_total{*} by {host}), 'basic', 2, direction='above', alert_window='last_15m', interval=60, count_default_zero='true') >= 1", | ||
"message": "Number of unsuccessful syncs between Edge Agent and IoT Hub on device {{host}} is at {{value}} per minute, which is higher than usual.", | ||
"tags": [ | ||
"integration:azure_iot_edge" | ||
], | ||
"options": { | ||
"notify_audit": false, | ||
"locked": false, | ||
"timeout_h": 0, | ||
"new_host_delay": 300, | ||
"require_full_window": false, | ||
"notify_no_data": false, | ||
"renotify_interval": 0, | ||
"escalation_message": "", | ||
"no_data_timeframe": null, | ||
"include_tags": true, | ||
"thresholds": { | ||
"critical": 1, | ||
"critical_recovery": 0 | ||
}, | ||
"threshold_windows": { | ||
"trigger_window": "last_15m", | ||
"recovery_window": "last_15m" | ||
} | ||
}, | ||
"recommended_monitor_metadata": { | ||
"description": "Notifies when unsuccessful syncs between Edge Agent and IoT Hub are higher than usual" | ||
} | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,31 @@ | ||
{ | ||
"name": "[Azure IoT Edge] IoT Edge device {{host}} is running out of memory", | ||
"type": "query alert", | ||
"query": "max(last_1h):avg:azure.iot_edge.edge_agent.used_memory_bytes{*} by {host} / avg:azure.iot_edge.edge_agent.total_memory_bytes{*} by {host}.rollup(max, 60) * 100 > 80", | ||
"message": "Please check device {{host}}, as Edge Agent reports usage of more than {{threshold}}% of available RAM for the last hour.", | ||
"tags": [ | ||
"integration:azure_iot_edge" | ||
], | ||
"options": { | ||
"notify_audit": false, | ||
"locked": false, | ||
"timeout_h": 0, | ||
"silenced": {}, | ||
"include_tags": true, | ||
"no_data_timeframe": null, | ||
"require_full_window": true, | ||
"new_host_delay": 300, | ||
"notify_no_data": false, | ||
"renotify_interval": 0, | ||
"escalation_message": "", | ||
"thresholds": { | ||
"critical": 80, | ||
"warning": 65, | ||
"critical_recovery": 79, | ||
"warning_recovery": 64 | ||
} | ||
}, | ||
"recommended_monitor_metadata": { | ||
"description": "Triggers an alert when an IoT Edge device is running out of memory" | ||
} | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,32 @@ | ||
[ | ||
{ | ||
"agent_version": "6.24.0", | ||
"integration": "Azure IoT Edge", | ||
"groups": [ | ||
"host", | ||
"endpoint" | ||
], | ||
"check": "azure.iot_edge.edge_agent.prometheus.health", | ||
"statuses": [ | ||
"ok", | ||
"critical" | ||
], | ||
"name": "Edge Agent health", | ||
"description": "Returns `CRITICAL` if the Agent is unable to reach the Edge Agent metrics Prometheus endpoint. Returns `OK` otherwise." | ||
}, | ||
{ | ||
"agent_version": "6.24.0", | ||
"integration": "Azure IoT Edge", | ||
"groups": [ | ||
"host", | ||
"endpoint" | ||
], | ||
"check": "azure.iot_edge.edge_hub.prometheus.health", | ||
"statuses": [ | ||
"ok", | ||
"critical" | ||
], | ||
"name": "Edge Hub health", | ||
"description": "Returns `CRITICAL` if the Agent is unable to reach the Edge Hub metrics Prometheus endpoint. Returns `OK` otherwise." | ||
} | ||
] |
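
Both service checks above report `CRITICAL` when the corresponding Prometheus endpoint cannot be reached and `OK` otherwise. The following is a rough standalone sketch of that rule using only the Python 3 standard library. It is not the check's implementation, which per the commit message builds on `OpenMetricsBaseCheck`, and the names `fetch_metrics` and `prometheus_health` are hypothetical.

```python
import urllib.request


def fetch_metrics(url, timeout=5.0):
    """Download the raw Prometheus exposition text from an endpoint."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def prometheus_health(url, fetch=fetch_metrics):
    """Return "OK" if the endpoint can be fetched, "CRITICAL" otherwise.

    `fetch` is injectable so the rule can be exercised without a network.
    """
    try:
        fetch(url)
    except (OSError, ValueError):
        # URLError and HTTPError subclass OSError; malformed URLs raise ValueError.
        return "CRITICAL"
    return "OK"


def unreachable(url):
    """Stand-in fetcher that simulates a refused connection."""
    raise ConnectionRefusedError("connection refused")


print(prometheus_health("http://edgeHub:9600/metrics", fetch=unreachable))
```

Passing the fetcher as a parameter keeps the status rule separate from the transport, so the same function can be pointed at a mock server in tests.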
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
# (C) Datadog, Inc. 2020-present | ||
# All rights reserved | ||
# Licensed under a 3-clause BSD style license (see LICENSE) | ||
__path__ = __import__('pkgutil').extend_path(__path__, __name__) # type: ignore |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
# (C) Datadog, Inc. 2020-present | ||
# All rights reserved | ||
# Licensed under a 3-clause BSD style license (see LICENSE) | ||
__version__ = '0.0.1' |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
# (C) Datadog, Inc. 2020-present | ||
# All rights reserved | ||
# Licensed under a 3-clause BSD style license (see LICENSE) | ||
from .__about__ import __version__ | ||
from .check import AzureIoTEdgeCheck | ||
|
||
__all__ = ['__version__', 'AzureIoTEdgeCheck'] |