Add Azure IoT Edge integration #7465

Merged
merged 65 commits into master from fm/iot_edge
Oct 22, 2020

florimondmanca
Contributor

@florimondmanca florimondmanca commented Sep 1, 2020

What does this PR do?

Add integration for monitoring Azure IoT Edge.

Motivation

Additional Notes

  • Docker-based test and dev environment
  • Edge Agent metrics and service check
  • Edge Hub metrics and service check
  • Windows tests?
  • Metadata
    • manifest.json
    • metadata.csv
    • service_checks.json
    • Recommended monitor
  • Version metadata
  • Authentication support -> nothing extra to support, added a -tls E2E env and instructions below.
  • Updated CI templates
  • Unit and integration tests
  • E2E tests
  • Windows E2E validation (Tested on a Windows VM on Azure)
  • Investigate if the Agent can be deployed as a module on IoT Edge -> Yes
  • Write README, especially docs on configuring devices
  • Log collection PoC (collection via existing docker log integration)
  • Log collection support (docs, tests setup) -> Later
  • OOTB dashboard -> Later

Misc notes:

  • No service check was added for the Security Manager, as it caused more config / implementation / compatibility issues than benefits (service checks for EdgeAgent and EdgeHub should be enough for now).

QA Notes

Basic verifications

  • Run the tests, and make sure they're passing locally.
  • Set up your E2E environment following the instructions in tests/README.md. You'll need to connect to portal.azure.com and retrieve the device connection string. If you're unsure how to connect there, hint: I wrote up an Azure IoT Edge testing environment guide (can't link to it here, but it's in our Wiki) that contains that info. If you're still unsure, ping me!
  • Start an env
  • Make sure all these containers are up and running, and that their logs don't indicate errors:
    iot-edge-device
    edgeHub
    edgeAgent
    SimulatedTemperatureSensor
  • The IoT Edge web UI on the Azure portal should show the device as "200 - OK", and all 3 modules (edgeHub, edgeAgent, SimulatedTemperatureSensor) should show as "Running" with exit code 0.
  • Verify the Datadog Agent status (docker exec -it <container> agent status).
  • Run a check manually (ddev env check), make sure it reports sensible metrics, and that all service checks are OK.
  • Update the configuration options, such as one of the 2 Prometheus URLs, and make sure the check doesn't catastrophically crash, i.e. that only the appropriate service checks report as CRITICAL, and only the associated metrics go missing.
  • Configure your ddev to send metrics and logs to staging (or the dev org), then start the env again, and make sure metrics and logs show up in the Datadog UI. If you're unsure how to do this, there should be hints in the testing environment guide, otherwise feel free to ping me!
  • Make sure metadata is in sync: review tile, metrics, service checks, recommended monitors.
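To make the "doesn't catastrophically crash" expectation concrete, here is a minimal sketch of the behaviour to look for: each Prometheus endpoint is scraped independently, so breaking one URL only turns that endpoint's service check CRITICAL. This is an illustration, not the check's actual code: `scrape_statuses`, `fake_fetch`, and the URLs are hypothetical names made up for the example.

```python
from typing import Callable, Dict

# Datadog service check status codes (OK = 0, CRITICAL = 2).
OK, CRITICAL = 0, 2


def scrape_statuses(endpoints: Dict[str, str], fetch: Callable[[str], str]) -> Dict[str, int]:
    """Scrape each endpoint independently and return one status per endpoint."""
    statuses = {}
    for name, url in endpoints.items():
        try:
            fetch(url)
        except Exception:
            # A failure here only affects this endpoint's status.
            statuses[name] = CRITICAL
        else:
            statuses[name] = OK
    return statuses


# Hypothetical URLs (assumption): the Edge Hub one is treated as broken below.
ENDPOINTS = {
    "edge_agent": "http://edge-agent.invalid/metrics",
    "edge_hub": "http://edge-hub.invalid/metrics",
}


def fake_fetch(url: str) -> str:
    # Stand-in for an HTTP GET: fails only for the Edge Hub URL.
    if "edge-hub" in url:
        raise ConnectionError("unreachable: " + url)
    return "# metrics payload"


statuses = scrape_statuses(ENDPOINTS, fake_fetch)
```

With the Edge Hub URL broken, only the `edge_hub` entry should end up CRITICAL, which mirrors what `ddev env check` is expected to show after editing one of the two URLs.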

Linux VM testing

Motivation: make sure the integration works well against a standard host-based IoT Edge security manager. (The one we use in E2E is Docker-based, but that's not how users will typically run their security manager.)

  • Follow the instructions in the Azure IoT Edge testing environment guide to set up a Linux VM with a host-based IoT Edge installation.
  • Verify the Agent status (reminder: the Agent runs as a custom module there, so you can inspect the datadog-agent container).
  • Run a check manually on the Agent container, make sure the output looks good.

TLS-enabled devices

Motivation: make sure the integration works well when the IoT Edge security manager uses custom certs (aka acts as a "transparent gateway"). Note: by default, the security manager will generate throw-away certs and let IoT Hub know, so in practice TLS is always used. I just went through this setup to make sure there really wasn't anything else we needed to do on the integration side, so here are the steps if you want to verify that yourself.

  • Follow the instructions in tests/tls/README.md to set up test certificates for yourself. This will require you to run a script, upload a root CA cert to IoT Hub, generate a verification code, create a verification cert, and finally upload this verification cert to IoT Hub. Make sure you don't change any of the generated filenames, as the -tls E2E environments depend on them.
  • Start and run a check over env py27-tls or py38-tls. Make sure to inspect iot-edge-device logs to verify that certs were correctly taken into account (in particular, the manager shouldn't start in "quickstart mode"):
<6>2020-09-28T14:18:18Z [INFO] - Configuring certificates...
<6>2020-09-28T14:18:18Z [INFO] - Configuring the Device CA certificate using "/private/device.cert.pem".
<6>2020-09-28T14:18:18Z [INFO] - Configuring the Device private key using "/private/device.key.pem".
<6>2020-09-28T14:18:18Z [INFO] - Configuring the trusted CA certificates using "/private/root.cert.pem".
<6>2020-09-28T14:18:18Z [INFO] - Finished configuring provisioning environment variables and certificates.
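If you'd rather not eyeball the logs, the verification above can be sketched as a small script. This is only a convenience sketch: the expected fragments are copied from the log excerpt above, `missing_cert_log_lines` is a hypothetical helper name, and it assumes you have captured the container logs into a string yourself (for example from `docker logs iot-edge-device`).

```python
from typing import List

# Fragments copied from the expected iot-edge-device log output.
EXPECTED_CERT_LOG_FRAGMENTS = [
    'Configuring the Device CA certificate using "/private/device.cert.pem"',
    'Configuring the Device private key using "/private/device.key.pem"',
    'Configuring the trusted CA certificates using "/private/root.cert.pem"',
    "Finished configuring provisioning environment variables and certificates",
]


def missing_cert_log_lines(logs: str) -> List[str]:
    """Return the expected fragments that do not appear anywhere in the logs."""
    return [fragment for fragment in EXPECTED_CERT_LOG_FRAGMENTS if fragment not in logs]


# Example with a partial log: the trusted CA and "Finished" lines are missing.
partial_logs = (
    '<6>2020-09-28T14:18:18Z [INFO] - Configuring the Device CA certificate using "/private/device.cert.pem".\n'
    '<6>2020-09-28T14:18:18Z [INFO] - Configuring the Device private key using "/private/device.key.pem".\n'
)
missing = missing_cert_log_lines(partial_logs)
```

An empty result means all four certificate-related lines were found. Checking that the manager did not start in "quickstart mode" still needs a manual look.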

Review checklist (to be filled by reviewers)

  • Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
  • PR title must be written as a CHANGELOG entry (see why)
  • File changes must correspond to the primary purpose of the PR as described in the title (small unrelated changes should have their own PR)
  • PR must have changelog/ and integration/ labels attached

@codecov

codecov bot commented Sep 1, 2020

Codecov Report

Merging #7465 into master will decrease coverage by 8.50%.
The diff coverage is 81.73%.

Impacted Files Coverage Δ
azure_iot_edge/tests/e2e_utils.py 37.83% <37.83%> (ø)
azure_iot_edge/tests/conftest.py 54.76% <54.76%> (ø)
...ot_edge/datadog_checks/azure_iot_edge/__about__.py 100.00% <100.00%> (ø)
...iot_edge/datadog_checks/azure_iot_edge/__init__.py 100.00% <100.00%> (ø)
...re_iot_edge/datadog_checks/azure_iot_edge/check.py 100.00% <100.00%> (ø)
...e_iot_edge/datadog_checks/azure_iot_edge/config.py 100.00% <100.00%> (ø)
..._iot_edge/datadog_checks/azure_iot_edge/metrics.py 100.00% <100.00%> (ø)
...re_iot_edge/datadog_checks/azure_iot_edge/types.py 100.00% <100.00%> (ø)
azure_iot_edge/tests/common.py 100.00% <100.00%> (ø)
azure_iot_edge/tests/test_check.py 100.00% <100.00%> (ø)
... and 293 more

@florimondmanca florimondmanca marked this pull request as ready for review September 30, 2020 12:14
Member

@FlorianVeaux FlorianVeaux left a comment

Great job with the testing 👍

florimondmanca and others added 3 commits October 1, 2020 15:51
Co-authored-by: Florian Veaux <florian.veaux@datadoghq.com>
Drop security manager service check

Reorganize check as an OpenMetricsBaseCheck subclass

Fix E2E tests

Update docs

Fix service checks: can_connect -> prometheus.health
FlorianVeaux previously approved these changes Oct 2, 2020
Member

@FlorianVeaux FlorianVeaux left a comment

Great job 💯 🚢

@ruthnaebeck ruthnaebeck added the editorial review label Oct 2, 2020
Contributor

@kayayarai kayayarai left a comment

Some docs style edit comments. Thank you!

Florimond Manca and others added 3 commits October 9, 2020 16:35
athap previously requested changes Oct 19, 2020
Contributor

@athap athap left a comment

See my comments about renotify_interval, which needs to be an int, not a string. Thanks
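For illustration, a type check along these lines would catch the issue. This is a sketch under assumptions: it supposes `renotify_interval` sits under an `options` object in the monitor JSON, as in Datadog monitor definitions, and `renotify_interval_errors` is a hypothetical helper, not part of ddev or this PR.

```python
import json
from typing import List


def renotify_interval_errors(monitor_json: str) -> List[str]:
    """Return error messages if options.renotify_interval is present but not an int."""
    monitor = json.loads(monitor_json)
    # Assumption: the field lives under "options" in the monitor definition.
    value = monitor.get("options", {}).get("renotify_interval")
    errors = []
    # bool is a subclass of int in Python, so reject it explicitly.
    if value is not None and (isinstance(value, bool) or not isinstance(value, int)):
        errors.append(
            "renotify_interval must be an int, got {}: {!r}".format(type(value).__name__, value)
        )
    return errors


# A string value is flagged; an integer value passes.
bad = renotify_interval_errors('{"options": {"renotify_interval": "0"}}')
good = renotify_interval_errors('{"options": {"renotify_interval": 0}}')
```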

@florimondmanca florimondmanca dismissed stale reviews from athap and kayayarai October 21, 2020 13:59

Addressed

@florimondmanca florimondmanca merged commit de79a68 into master Oct 22, 2020
@florimondmanca florimondmanca deleted the fm/iot_edge branch October 22, 2020 10:16
github-actions bot pushed a commit that referenced this pull request Oct 22, 2020
* Add skeleton

* Working Docker setup for 1.0.9

* Attempt 1.0.10-rc2 setup

* Finalize RC2 dev setup

* Fix double-endpoint setup, implement scraping of Prometheus endpoints

* Update CI config

* Add config class, add failing integration test

* Successfully collect and test metrics, improve env up/down robustness

* Make tests pass

* Use local mock server for CI tests

* Add Edge Agent metrics

* Update codecov config

* Tweak exclude_labels

* Fix invalid manifest

* Add edgeHub metrics

* Document mock server metrics generation

* Fix Python 2 tests compatibility

* Assert E2E tags

* Skip E2E tests if IOT_EDGE_CONNSTR is missing

* Use Windows-compatible mock server setup

* Add security daemon health service check

* Simplify prometheus url config, add config tests

* Fix style, fix Windows test compat

* Verify service check in e2e

* Fix check class name case

* Add config spec

* Add logs to config spec and test env

* Use auto-discovery for log collection

* Enable log collection via Docker labels

* Set required properties in config spec

* Reorganize config options order

* Loosen wait conditions

* Update namespace to azure.iot_edge

* Add version metadata collection

* Update manifest.json

* Check types

* Write up metadata.csv

* Fill in service_checks.json

* Add TLS support to E2E environment

* Add code comment about single-instance and composition approaches

* Drop note about setting certs in config.yaml

This is already done automatically by the E2E environment

* Write up README

* Lingo: security daemon -> security manager

* Add recommended monitors

* Apply no-brainer suggestions

Co-authored-by: Florian Veaux <florian.veaux@datadoghq.com>

* Update version metadata transformer

* Address feedback

Drop security manager service check

Reorganize check as an OpenMetricsBaseCheck subclass

Fix E2E tests

Update docs

Fix service checks: can_connect -> prometheus.health

* Move instance config to Edge Agent labels

* Apply suggestions from docs review

Co-authored-by: Kari Halsted <12926135+kayayarai@users.noreply.github.com>

* Fix type of renotify_interval in monitors json

Co-authored-by: Florian Veaux <florian.veaux@datadoghq.com>
Co-authored-by: Kari Halsted <12926135+kayayarai@users.noreply.github.com> de79a68
Labels
dev/testing, documentation, editorial review, integration/azure_iot_edge
7 participants