UnitsAvailable alerts are firing constantly #564

Closed
facundofc opened this issue Mar 27, 2023 · 15 comments
Labels: bug (Something isn't working)

Comments

@facundofc

On a recent deployment we're seeing these alerts firing all the time (literally, stuck to "firing"):

  • ArgoUnitIsUnavailable
  • DexAuthUnitIsUnavailable
  • JupyterControllerUnitIsUnavailable
  • MetacontrollerUnitIsUnavailable
  • MinioUnitIsUnavailable
  • TrainingOperatorUnitIsUnavailable

Looking at the up metric (which these alert rules query), we see that these are alternating between 1 and 0 every 45 seconds (this is a sample from the argo controller, query being: up{juju_application="argo-controller",juju_..."="..."}[10m]):

1 @1679925736.77
0 @1679925781.26
1 @1679925796.77
0 @1679925841.26
1 @1679925856.77
0 @1679925901.26
1 @1679925916.77
0 @1679925961.26
1 @1679925976.77
0 @1679926021.26
1 @1679926036.77
0 @1679926081.26
1 @1679926096.77
0 @1679926141.26
1 @1679926156.77
0 @1679926201.26
1 @1679926216.77
0 @1679926261.26
1 @1679926276.77
0 @1679926321.26
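
One way to read the sample above, sketched in Python: splitting the samples by value shows that the 1s and the 0s each arrive on their own steady 60-second cadence, with every 0 landing about 44.5 seconds after the preceding 1. That pattern is what two separately scraped targets matching the same selector would produce (one healthy, one failing), although the sample alone does not prove it. The `samples` list is copied from the output above; the helper name `intervals` is made up for this illustration.

```python
# Samples copied from the `up` query output above: (value, unix_timestamp).
samples = [
    (1, 1679925736.77), (0, 1679925781.26),
    (1, 1679925796.77), (0, 1679925841.26),
    (1, 1679925856.77), (0, 1679925901.26),
    (1, 1679925916.77), (0, 1679925961.26),
    (1, 1679925976.77), (0, 1679926021.26),
    (1, 1679926036.77), (0, 1679926081.26),
    (1, 1679926096.77), (0, 1679926141.26),
    (1, 1679926156.77), (0, 1679926201.26),
    (1, 1679926216.77), (0, 1679926261.26),
    (1, 1679926276.77), (0, 1679926321.26),
]


def intervals(timestamps):
    """Return the gaps between consecutive timestamps, rounded to 0.01 s."""
    return [round(b - a, 2) for a, b in zip(timestamps, timestamps[1:])]


# Split the single result into the timestamps that reported 1 and those that reported 0.
ups = [ts for value, ts in samples if value == 1]
downs = [ts for value, ts in samples if value == 0]

print("gaps between 1s:", set(intervals(ups)))
print("gaps between 0s:", set(intervals(downs)))
print("offset of each 0 after the preceding 1:",
      {round(d - u, 2) for u, d in zip(ups, downs)})
```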

Separately from this flapping behavior, the for: duration of these alerts (at least for argo) is set to 0m, which seems too sensitive for production envs.

dparv added a commit to dparv/argo-operators that referenced this issue Mar 28, 2023
As stated in issue canonical/bundle-kubeflow#564
the duration for alerts for argo is set to 0m, which is too low for
prod environments. We need to change to at least 5m to prevent the
flapping behavior.

Closes-Bug: canonical/bundle-kubeflow#564
dparv added a commit to dparv/training-operator that referenced this issue Mar 28, 2023
As stated in issue canonical/bundle-kubeflow#564
the duration for alerts for argo is set to 0m, which is too low for prod
environments. We need to change to at least 5m to prevent the flapping behavior.

Partial-Bug: canonical/bundle-kubeflow#564
dparv added a commit to dparv/dex-auth-operator that referenced this issue Mar 28, 2023
As stated in issue canonical/bundle-kubeflow#564
the duration for alerts for argo is set to 0m, which is too low for prod
environments. We need to change to at least 5m to prevent the flapping behavior.

Partial-Bug: canonical/bundle-kubeflow#564
dparv added a commit to dparv/notebook-operators that referenced this issue Mar 28, 2023
As stated in issue canonical/bundle-kubeflow#564
the duration for alerts for argo is set to 0m, which is too low for prod
environments. We need to change to at least 5m to prevent the flapping behavior.

Partial-Bug: canonical/bundle-kubeflow#564
dparv added a commit to dparv/metacontroller-operator that referenced this issue Mar 28, 2023
As stated in issue canonical/bundle-kubeflow#564
the duration for alerts for argo is set to 0m, which is too low for prod
environments. We need to change to at least 5m to prevent the flapping behavior.

Partial-Bug: canonical/bundle-kubeflow#564
dparv added a commit to dparv/minio-operator that referenced this issue Mar 28, 2023
As stated in issue canonical/bundle-kubeflow#564
the duration for alerts for argo is set to 0m, which is too low for prod
environments. We need to change to at least 5m to prevent the flapping behavior.

Partial-Bug: canonical/bundle-kubeflow#564
beliaev-maksim pushed a commit to canonical/argo-operators that referenced this issue Mar 29, 2023
As stated in issue canonical/bundle-kubeflow#564
the duration for alerts for argo is set to 0m, which is too low for
prod environments. We need to change to at least 5m to prevent the
flapping behavior.

Closes-Bug: canonical/bundle-kubeflow#564

Co-authored-by: Diko Parvanov <diko.parvanov@canonical.com>
beliaev-maksim pushed a commit to canonical/metacontroller-operator that referenced this issue Mar 29, 2023
As stated in issue canonical/bundle-kubeflow#564
the duration for alerts for argo is set to 0m, which is too low for prod
environments. We need to change to at least 5m to prevent the flapping behavior.

Partial-Bug: canonical/bundle-kubeflow#564

Co-authored-by: Diko Parvanov <diko.parvanov@canonical.com>
beliaev-maksim pushed a commit to canonical/minio-operator that referenced this issue Mar 29, 2023
As stated in issue canonical/bundle-kubeflow#564
the duration for alerts for argo is set to 0m, which is too low for prod
environments. We need to change to at least 5m to prevent the flapping behavior.

Partial-Bug: canonical/bundle-kubeflow#564

Co-authored-by: Diko Parvanov <diko.parvanov@canonical.com>
beliaev-maksim pushed a commit to canonical/dex-auth-operator that referenced this issue Mar 29, 2023
As stated in issue canonical/bundle-kubeflow#564
the duration for alerts for argo is set to 0m, which is too low for prod
environments. We need to change to at least 5m to prevent the flapping behavior.

Partial-Bug: canonical/bundle-kubeflow#564

Co-authored-by: Diko Parvanov <diko.parvanov@canonical.com>
beliaev-maksim pushed a commit to canonical/training-operator that referenced this issue Mar 29, 2023
As stated in issue canonical/bundle-kubeflow#564
the duration for alerts for argo is set to 0m, which is too low for prod
environments. We need to change to at least 5m to prevent the flapping behavior.

Partial-Bug: canonical/bundle-kubeflow#564

Co-authored-by: Diko Parvanov <diko.parvanov@canonical.com>
beliaev-maksim pushed a commit to canonical/notebook-operators that referenced this issue Mar 29, 2023
As stated in issue canonical/bundle-kubeflow#564
the duration for alerts for argo is set to 0m, which is too low for prod
environments. We need to change to at least 5m to prevent the flapping behavior.

Partial-Bug: canonical/bundle-kubeflow#564

Co-authored-by: Diko Parvanov <diko.parvanov@canonical.com>
@i-chvets
Contributor

Fix is merged.

@facundofc
Author

@i-chvets, the changes pushed by @dparv (changing the for: from 0 to 5m) are not a fix for this issue. I believe this should be reopened, as the metrics are flapping or stuck at 0 outright (as in the dex-auth case). That needs to be addressed here or, if it was addressed elsewhere, pointed out here.

Thanks!
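
To illustrate the point about for:, here is a minimal Python sketch of how a pending period interacts with a flapping series and with a series stuck at 0. It is a simplified model, not Prometheus's implementation: it assumes the rule is evaluated once per sample, and the function name `firing_times` and both example series are hypothetical.

```python
def firing_times(samples, for_seconds):
    """Return the timestamps at which an `up == 0` style alert would be firing.

    Simplified model: the alert turns pending at the first 0 sample and fires
    once the value has stayed 0 for at least `for_seconds`. Any sample with
    value 1 resets the pending state.
    """
    firing = []
    pending_since = None
    for value, ts in samples:
        if value == 0:
            if pending_since is None:
                pending_since = ts
            if ts - pending_since >= for_seconds:
                firing.append(ts)
        else:
            pending_since = None
    return firing


# Hypothetical series: one sample every 60 s over 10 minutes.
flapping = [(i % 2, 60.0 * i) for i in range(11)]   # alternates 0, 1, 0, ...
stuck_at_zero = [(0, 60.0 * i) for i in range(11)]  # like the dex-auth case

for name, series in [("flapping", flapping), ("stuck at 0", stuck_at_zero)]:
    for for_seconds in (0, 300):
        count = len(firing_times(series, for_seconds))
        print(f"{name}, for={for_seconds}s -> firing at {count} evaluations")
```

Under these assumptions, a 5m pending period silences the flapping series entirely and only delays the alert for the series stuck at 0. In both cases the underlying `up` values are unchanged, so the scrape problem itself remains.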

@orfeas-k
Contributor

orfeas-k commented Aug 9, 2023

Thank you @facundofc for letting us know about the issue not having been fixed. In order to better understand the issue, our team will need some more information.

  1. Are there specific steps to follow in order to reproduce the issue?
  2. Do you deploy these charms alone or through the bundle?
  3. Is there a specific test environment or a CI where this has run? That would be of great help too.
  4. What would be the expected behaviour? I understand that we don't want the flapping between 1 and 0, but what would we expect them to be? Also, should alerts move from the firing stage to a next one?

@orfeas-k orfeas-k reopened this Aug 9, 2023
@orfeas-k orfeas-k added the bug (Something isn't working) and question (Further information is requested from the issue opener) labels and removed the bug (Something isn't working) label Aug 9, 2023
DnPlas added a commit to canonical/training-operator that referenced this issue Feb 13, 2024
* fix: expose metrics port using kubernetes_service_patch lib

This commit ensures the metrics port is exposed in the Kubernetes Service for the training-operator
using the kubernetes_service_patch lib. This makes the metrics endpoint reachable from an external
prometheus scraper.
This commit also changes the unit tests slightly to adapt to the added service patcher.

Part of canonical/bundle-kubeflow#564

* tests: add an assertion for checking unit is available

The test_prometheus_grafana_integration test case was querying prometheus and checking
that the request returned successfully and that the application name and model were
listed correctly. To make this test case more accurate, we can add an assertion that
also checks that the unit is available; this way we avoid issues like the one described in
canonical/bundle-kubeflow#564.

Part of canonical/bundle-kubeflow#564
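
As a rough illustration of what such a "unit is available" assertion could look like (this is not the charms' actual test code), the Python sketch below checks the JSON body of a Prometheus instant query for `up`. The response shape follows the Prometheus HTTP API (`/api/v1/query`); the function name `assert_unit_is_available` and the example payload are hypothetical.

```python
def assert_unit_is_available(response, application):
    """Fail unless `application` has at least one `up` series and all are 1.

    `response` is the decoded JSON body of a Prometheus instant query for `up`.
    """
    assert response["status"] == "success"
    series = [
        result for result in response["data"]["result"]
        if result["metric"].get("juju_application") == application
    ]
    assert series, f"no up series found for {application}"
    for result in series:
        # Instant-query values are [timestamp, "value-as-string"] pairs.
        _timestamp, value = result["value"]
        assert float(value) == 1.0, f"{result['metric']} reports up={value}"


# Hypothetical payload in the shape returned by the Prometheus HTTP API.
healthy = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {"__name__": "up", "juju_application": "training-operator"},
                "value": [1707820000.0, "1"],
            }
        ],
    },
}

assert_unit_is_available(healthy, "training-operator")
print("unit is available")
```

Checking only that the query succeeds and that the application is listed would pass even when `up` is 0, which is the gap the added assertion is meant to close.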
DnPlas added a commit to canonical/notebook-operators that referenced this issue Feb 13, 2024
* fix: expose metrics port using kubernetes_service_patch lib

This commit ensures the metrics port is exposed in the Kubernetes Service for the jupyter-controller
using the kubernetes_service_patch lib. This makes the metrics endpoint reachable from an external
prometheus scraper.
This commit also changes the unit tests slightly to adapt to the added service patcher.

Part of canonical/bundle-kubeflow#564

* tests: add an assertion for checking unit is available

The test_prometheus_grafana_integration test case was querying prometheus and checking
that the request returned successfully and that the application name and model were
listed correctly. To make this test case more accurate, we can add an assertion that
also checks that the unit is available; this way we avoid issues like the one described in
canonical/bundle-kubeflow#564.

Part of canonical/bundle-kubeflow#564
DnPlas added a commit to canonical/dex-auth-operator that referenced this issue Feb 13, 2024
* fix: set telemetry config value, patch service, update tests

This commit ensures the configuration value for the telemetry setting is correctly passed to the
workload configuration value. With this we ensure the workload is correctly exposing metrics in the
desired endpoint so they can be scraped by prometheus.
With this change we also must ensure that the K8s Service correctly exposes the ports of the endpoints
that this workload has (for metrics and the actual dex service).
Finally, a bit of refactoring work had to be done in the unit test to match the recent changes, and the
kubernetes_service_patch library this charm uses has been bumped v0 -> v1.

Part of canonical/bundle-kubeflow#563

* tests: add an assertion for checking unit is available

The test_prometheus_grafana_integration test case was querying prometheus and checking
that the request returned successfully and that the application name and model were
listed correctly. To make this test case more accurate, we can add an assertion that
also checks that the unit is available; this way we avoid issues like the one described in
canonical/bundle-kubeflow#564.

Part of canonical/bundle-kubeflow#564
DnPlas added a commit to canonical/minio-operator that referenced this issue Feb 13, 2024
* fix: set prometheus authentication variable

This variable allows public access without authentication for prometheus metrics.

Part of canonical/bundle-kubeflow#564

* tests: add an assertion for checking unit is available

The test_prometheus_grafana_integration test case was querying prometheus and checking
that the request returned successfully and that the application name and model were
listed correctly. To make this test case more accurate, we can add an assertion that
also checks that the unit is available; this way we avoid issues like the one described in
canonical/bundle-kubeflow#564.

Part of canonical/bundle-kubeflow#564
DnPlas added a commit to canonical/seldon-core-operator that referenced this issue Feb 13, 2024
The metrics endpoint configuration had two scrape jobs, one for the
regular metrics endpoint, and a second one based on a dynamic list of
targets. The latter was causing the prometheus scraper to try and scrape
metrics from *:80/metrics, which is not a valid endpoint. This was
causing the UnitsUnavailable alert to fire constantly because that job
was reporting back that the endpoint was not available.
This new job was introduced by #94
with no apparent justification. Because the seldon charm has changed
since that PR, and the endpoint it is configuring is not valid, this
commit will remove the extra job.

This commit also refactors the MetricsEndpointProvider instantiation and
removes the metrics-port config option as this value should not change.

Finally, this commit changes the alert rule's for: duration from 0m to 5m, as
this duration is more appropriate for production environments.

Part of canonical/bundle-kubeflow#564
DnPlas added a commit to canonical/seldon-core-operator that referenced this issue Feb 13, 2024
The test_prometheus_grafana_integration test case was querying prometheus and checking
that the request returned successfully and that the application name and model were
listed correctly. To make this test case more accurate, we can add an assertion that
also checks that the unit is available; this way we avoid issues like the one described in
canonical/bundle-kubeflow#564.

Part of canonical/bundle-kubeflow#564

skip: fix test
DnPlas added a commit to canonical/seldon-core-operator that referenced this issue Feb 13, 2024
* fix: correctly configure one scrape job to avoid firing alerts

The metrics endpoint configuration had two scrape jobs, one for the
regular metrics endpoint, and a second one based on a dynamic list of
targets. The latter was causing the prometheus scraper to try and scrape
metrics from *:80/metrics, which is not a valid endpoint. This was
causing the UnitsUnavailable alert to fire constantly because that job
was reporting back that the endpoint was not available.
This new job was introduced by #94
with no apparent justification. Because the seldon charm has changed
since that PR, and the endpoint it is configuring is not valid, this
commit will remove the extra job.

This commit also refactors the MetricsEndpointProvider instantiation and
removes the metrics-port config option as this value should not change.

Finally, this commit changes the alert rule's for: duration from 0m to 5m, as
this duration is more appropriate for production environments.

Part of canonical/bundle-kubeflow#564

* tests: add an assertion for checking unit is available

The test_prometheus_grafana_integration test case was querying prometheus and checking
that the request returned successfully and that the application name and model were
listed correctly. To make this test case more accurate, we can add an assertion that
also checks that the unit is available; this way we avoid issues like the one described in
canonical/bundle-kubeflow#564.

Part of canonical/bundle-kubeflow#564
DnPlas added a commit to canonical/training-operator that referenced this issue Feb 13, 2024
)

* fix: expose metrics port using kubernetes_service_patch lib

This commit ensures the metrics port is exposed in the Kubernetes Service for the training-operator
using the kubernetes_service_patch lib. This makes the metrics endpoint reachable from an external
prometheus scraper.
This commit also changes the unit tests slightly to adapt to the added service patcher.

Part of canonical/bundle-kubeflow#564

* tests: add an assertion for checking unit is available

The test_prometheus_grafana_integration test case was querying prometheus and checking
that the request returned successfully and that the application name and model were
listed correctly. To make this test case more accurate, we can add an assertion that
also checks that the unit is available; this way we avoid issues like the one described in
canonical/bundle-kubeflow#564.

Part of canonical/bundle-kubeflow#564
DnPlas added a commit to canonical/dex-auth-operator that referenced this issue Feb 13, 2024
…186)

* fix: set telemetry config value, patch service, update tests

This commit ensures the configuration value for the telemetry setting is correctly passed to the
workload configuration value. With this we ensure the workload is correctly exposing metrics in the
desired endpoint so they can be scraped by prometheus.
With this change we also must ensure that the K8s Service correctly exposes the ports of the endpoints
that this workload has (for metrics and the actual dex service).
Finally, a bit of refactoring work had to be done in the unit test to match the recent changes, and the
kubernetes_service_patch library this charm uses has been bumped v0 -> v1.

Part of canonical/bundle-kubeflow#563

* tests: add an assertion for checking unit is available

The test_prometheus_grafana_integration test case was querying prometheus and checking
that the request returned successfully and that the application name and model were
listed correctly. To make this test case more accurate, we can add an assertion that
also checks that the unit is available; this way we avoid issues like the one described in
canonical/bundle-kubeflow#564.

Part of canonical/bundle-kubeflow#564
DnPlas added a commit to canonical/metacontroller-operator that referenced this issue Feb 13, 2024
…101)

* fix: create a Service for the workload and fix the metrics collector

This charm was not deploying any Service for the workload container,
which is fine for its regular functions, but causes an issue when the
Prometheus scraper tries to reach the metrics endpoint.
This commit adds a Service that is attached to the WORKLOAD (the
container inside the Pod that gets created by the StatefulSet we are
applying manually) so that the metrics from it can be reached correctly.
Because of that, the MetricsEndpointProvider's target has to be refactored
to point to the correct service. In a previous version of this charm,
the target was pointing to the charm's container, which does not have
any metrics endpoint, causing the issues reported in
canonical/bundle-kubeflow#564.

Part of canonical/bundle-kubeflow#564

* tests: add an assertion for checking unit is available

The test_prometheus_grafana_integration test case was querying prometheus and checking
that the request returned successfully and that the application name and model were
listed correctly. To make this test case more accurate, we can add an assertion that
also checks that the unit is available; this way we avoid issues like the one described in
canonical/bundle-kubeflow#564.

Part of canonical/bundle-kubeflow#564
DnPlas added a commit to canonical/metacontroller-operator that referenced this issue Feb 14, 2024
…101) (#102)

* fix: create a Service for the workload and fix the metrics collector

This charm was not deploying any Service for the workload container,
which is fine for its regular functions, but causes an issue when the
Prometheus scraper tries to reach the metrics endpoint.
This commit adds a Service that is attached to the WORKLOAD (the
container inside the Pod that gets created by the StatefulSet we are
applying manually) so that the metrics from it can be reached correctly.
Because of that, the MetricsEndpointProvider's target has to be refactored
to point to the correct service. In a previous version of this charm,
the target was pointing to the charm's container, which does not have
any metrics endpoint, causing the issues reported in
canonical/bundle-kubeflow#564.

Part of canonical/bundle-kubeflow#564

* tests: add an assertion for checking unit is available

The test_prometheus_grafana_integration test case was querying prometheus and checking
that the request returned successfully and that the application name and model were
listed correctly. To make this test case more accurate, we can add an assertion that
also checks that the unit is available; this way we avoid issues like the one described in
canonical/bundle-kubeflow#564.

Part of canonical/bundle-kubeflow#564
DnPlas added a commit to canonical/seldon-core-operator that referenced this issue Feb 14, 2024
* fix: correctly configure one scrape job to avoid firing alerts

The metrics endpoint configuration had two scrape jobs, one for the
regular metrics endpoint, and a second one based on a dynamic list of
targets. The latter was causing the prometheus scraper to try and scrape
metrics from *:80/metrics, which is not a valid endpoint. This was
causing the UnitsUnavailable alert to fire constantly because that job
was reporting back that the endpoint was not available.
This new job was introduced by #94
with no apparent justification. Because the seldon charm has changed
since that PR, and the endpoint it is configuring is not valid, this
commit will remove the extra job.

This commit also refactors the MetricsEndpointProvider instantiation and
removes the metrics-port config option as this value should not change.

Finally, this commit changes the alert rule's for: duration from 0m to 5m, as
this duration is more appropriate for production environments.

Part of canonical/bundle-kubeflow#564

* tests: add an assertion for checking unit is available

The test_prometheus_grafana_integration test case was querying prometheus and checking
that the request returned successfully and that the application name and model were
listed correctly. To make this test case more accurate, we can add an assertion that
also checks that the unit is available; this way we avoid issues like the one described in
canonical/bundle-kubeflow#564.

Part of canonical/bundle-kubeflow#564
DnPlas added a commit to canonical/notebook-operators that referenced this issue Feb 14, 2024
)

* fix: expose metrics port using kubernetes_service_patch lib

This commit ensures the metrics port is exposed in the Kubernetes Service for the jupyter-controller
using the kubernetes_service_patch lib. This makes the metrics endpoint reachable from an external
prometheus scraper.
This commit also changes the unit tests slightly to adapt to the added service patcher.

Part of canonical/bundle-kubeflow#564

* tests: add an assertion for checking unit is available

The test_prometheus_grafana_integration test case was querying prometheus and checking
that the request returned successfully and that the application name and model were
listed correctly. To make this test case more accurate, we can add an assertion that
also checks that the unit is available; this way we avoid issues like the one described in
canonical/bundle-kubeflow#564.

Part of canonical/bundle-kubeflow#564
@DnPlas
Contributor

DnPlas commented Feb 16, 2024

All PRs have been merged, so we can close this issue. Feel free to re-open if this is still an issue.

@DnPlas DnPlas closed this as completed Feb 16, 2024

7 participants