Added prometheus alert RRM #938

Merged (58 commits) on Dec 22, 2023

Commits
17f27b3
Added prometheus alert RRM
ganeshrvel Jun 16, 2023
17dc4ff
Merge branch 'master' into feature/Robusta-Resources-Management
ganeshrvel Jun 16, 2023
6149da7
fixed MAX_ALLOWED_RULES_PER_CRD
ganeshrvel Jun 16, 2023
d82dfdb
Reorganized the code
ganeshrvel Jun 17, 2023
438f34a
Improved the RRM alert creation
ganeshrvel Jun 21, 2023
10cb2c3
Merge branch 'master' into feature/Robusta-Resources-Management
ganeshrvel Jun 21, 2023
6ff3de0
fixed the ModuleNotFoundError: No module named 'postgrest_py' error.
ganeshrvel Jun 21, 2023
68ccc67
Merge remote-tracking branch 'origin/feature/Robusta-Resources-Manage…
ganeshrvel Jun 21, 2023
7b37675
instead of delete crd object, replace has been added to prevent noisy…
ganeshrvel Jun 22, 2023
4ba43b6
refactored the alert config protocols to only accept the latest chang…
ganeshrvel Jun 26, 2023
d28d7bd
Merge branch 'master' into feature/Robusta-Resources-Management
ganeshrvel Jun 28, 2023
e56c2b5
optimized the alerts config to use fewer resources than before.
ganeshrvel Jun 29, 2023
d8d445f
improved the alerts config
ganeshrvel Jun 30, 2023
47976b7
Merge branch 'master' into feature/Robusta-Resources-Management
ganeshrvel Jun 30, 2023
7d53b7b
removed postgrest package
ganeshrvel Jun 30, 2023
a602ceb
Fixing SupabaseDal dependencies
ganeshrvel Jun 30, 2023
3e17354
work in progress
pablos44 Jul 2, 2023
4f13655
cleaned and reorganized the code
ganeshrvel Jul 5, 2023
891d9c1
Added better try catch errors
ganeshrvel Jul 5, 2023
2296ad7
Added some improvements to error handling
ganeshrvel Jul 6, 2023
aed984e
Added docs improvements
ganeshrvel Jul 6, 2023
1a75387
Testing module error - 1
ganeshrvel Jul 7, 2023
943f024
add account resource fecther interface
arikalon1 Jul 7, 2023
a80e9ab
add account resource fecther interface
arikalon1 Jul 7, 2023
a302cd1
add account resource fecther interface
arikalon1 Jul 7, 2023
b8005be
add account resource fecther interface
arikalon1 Jul 7, 2023
91febe6
Fixed `TypeError: 'type' object is not subscriptable`
ganeshrvel Jul 7, 2023
1a093a5
Merge branch 'master' into feature/Robusta-Resources-Management
ganeshrvel Jul 7, 2023
e64bf03
Removed bool return from init_resources instead used try catch
ganeshrvel Jul 10, 2023
861c10e
Merge remote-tracking branch 'origin/feature/Robusta-Resources-Manage…
ganeshrvel Jul 10, 2023
6831b45
Merge branch 'master' into feature/Robusta-Resources-Management
ganeshrvel Jul 10, 2023
2318427
Added RRM alerts recording rules handling
ganeshrvel Jul 20, 2023
e228d7c
Replaced the supabase col `has_alerts_config_installed` with runner v…
ganeshrvel Jul 21, 2023
5507f78
Merge remote-tracking branch 'origin/feature/Robusta-Resources-Manage…
ganeshrvel Jul 21, 2023
693c65f
replaced enable_prometheus_stack params in the create/accounts api wi…
ganeshrvel Jul 24, 2023
d5591c0
Merge branch 'master' into feature/Robusta-Resources-Management
ganeshrvel Jul 25, 2023
9f25d84
Merge branch 'master' into feature/Robusta-Resources-Management
ganeshrvel Jul 27, 2023
5b1b6d2
Added service account permissions
ganeshrvel Jul 28, 2023
5bf829a
tiny cleanup
ganeshrvel Aug 8, 2023
d2a13c8
Modified the rrm to suite the new data structure
ganeshrvel Sep 8, 2023
f4afa2d
Optimized the error handling and cache invalidation of alerts config.
ganeshrvel Sep 20, 2023
d9a4f08
fixed labels and annotations
ganeshrvel Sep 20, 2023
206620d
fixed bug related to clearing residual rules
ganeshrvel Sep 21, 2023
e08375c
improved max_iterations logic
ganeshrvel Sep 21, 2023
c1031cd
Merge branch 'master' into feature/Robusta-Resources-Management
ganeshrvel Sep 22, 2023
c3c6c1e
Merge branch 'master' into feature/Robusta-Resources-Management
ganeshrvel Oct 17, 2023
01171c0
Merge branch 'master' into feature/Robusta-Resources-Management
ganeshrvel Oct 26, 2023
0b15e04
Fixed a bug where enable managed prometheus alerts flag was not influ…
ganeshrvel Oct 26, 2023
7143c40
Merge branch 'master' into feature/Robusta-Resources-Management
ganeshrvel Nov 6, 2023
08c4978
Merge branch 'master' into feature/Robusta-Resources-Management
ganeshrvel Nov 20, 2023
780e93a
Added improvements to the RRM and alerts config.
ganeshrvel Nov 28, 2023
11d1f1e
Added MANAGED_CONFIGURATION_ENABLED checks
ganeshrvel Nov 29, 2023
45c4c9b
Merge branch 'master' into feature/Robusta-Resources-Management
ganeshrvel Dec 18, 2023
1656c35
Merge branch 'master' into feature/Robusta-Resources-Management
ganeshrvel Dec 19, 2023
36e597c
Improvements to Alerts config.
ganeshrvel Dec 20, 2023
7a78bde
Merge branch 'master' into feature/Robusta-Resources-Management
ganeshrvel Dec 20, 2023
b7c7b1d
CR fixes
arikalon1 Dec 22, 2023
978416b
Merge branch 'master' into feature/Robusta-Resources-Management
arikalon1 Dec 22, 2023
14 changes: 14 additions & 0 deletions helm/robusta/templates/runner-service-account.yaml
@@ -44,6 +44,20 @@ rules:
- watch
- patch

{{- if .Values.monitorHelmReleases }}
Review comment (Contributor): Why is this related to helm releases?

- apiGroups:
- "monitoring.coreos.com"
resources:
- prometheusrules
verbs:
- get
- list
- delete
- create
- patch
- update
{{end}}

- apiGroups:
- ""
resources:
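The rule above grants the runner CRUD access to PrometheusRule objects in the monitoring.coreos.com group. A minimal sketch of the kind of call that needs those permissions, using the official kubernetes Python client; the namespace default is a placeholder and this is not code from the PR:

from kubernetes import client, config

def list_prometheus_rules(namespace: str = "default"):
    # Needs the get/list verbs on prometheusrules granted by the rule above.
    config.load_incluster_config()  # use config.load_kube_config() when running outside a cluster
    api = client.CustomObjectsApi()
    return api.list_namespaced_custom_object(
        group="monitoring.coreos.com",
        version="v1",
        namespace=namespace,
        plural="prometheusrules",
    )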
2 changes: 2 additions & 0 deletions helm/robusta/templates/runner.yaml
@@ -52,6 +52,8 @@ spec:
value: {{ .Release.Name | quote }}
- name: PROMETHEUS_ENABLED
value: {{ .Values.enablePrometheusStack | quote}}
- name: MANAGED_PROMETHEUS_ALERTS_ENABLED
Review comment (Contributor): Rename to MANAGED_CONFIGURATION_ENABLED

value: {{ .Values.enableManagedPrometheusAlerts | quote}}
- name: SEND_ADDITIONAL_TELEMETRY
value: {{ .Values.runner.sendAdditionalTelemetry | quote }}
- name: LOG_LEVEL
1 change: 1 addition & 0 deletions helm/robusta/values.yaml
@@ -74,6 +74,7 @@ lightActions:

# install prometheus, alert-manager, and grafana along with Robusta?
enablePrometheusStack: false
enableManagedPrometheusAlerts: false
Review comment (Contributor): I think this should be renamed to enabledManagedConfiguration, a single value for all types of managed configurations.

enableServiceMonitors: true
monitorHelmReleases: true

2 changes: 1 addition & 1 deletion playbooks/robusta_playbooks/k8s_resource_enrichments.py
@@ -90,7 +90,7 @@ def to_pod_row(pod: Pod, cluster_name: str) -> List:
]


def get_related_pods(resource) -> list[Pod]:
def get_related_pods(resource) -> List[Pod]:
kind: str = resource.kind or ""
if kind not in supported_resources:
raise ActionException(ErrorCodes.RESOURCE_NOT_SUPPORTED, f"Related pods is not supported for resource {kind}")
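The change from list[Pod] to List[Pod] matters on older interpreters: built-in generics only became subscriptable in Python 3.9, so list[Pod] raises TypeError: 'type' object is not subscriptable on Python 3.8, the error referenced in commit 91febe6. A minimal standalone illustration, not code from the PR:

from typing import List

def related_names(items: List[str]) -> List[str]:   # portable to Python 3.7/3.8
    return sorted(items)

# def related_names(items: list[str]) -> list[str]: # on Python 3.8 this line raises
#                                                   # TypeError: 'type' object is not subscriptable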
2 changes: 1 addition & 1 deletion src/robusta/core/discovery/discovery.py
@@ -362,7 +362,7 @@ def discovery_process() -> DiscoveryResults:
# we use map here to deduplicate and pick only the latest release data
helm_releases_map[decoded_release_row.get_service_key()] = decoded_release_row
except Exception as e:
logging.error(f"an error occured while decoding helm releases: {e}")
logging.error(f"an error occurred while decoding helm releases: {e}")

continue_ref = secrets.metadata._continue
if not continue_ref:
11 changes: 11 additions & 0 deletions src/robusta/core/model/cluster_status.py
@@ -1,3 +1,5 @@
from datetime import datetime

from pydantic.main import BaseModel, List, Optional


@@ -18,6 +20,7 @@ class ActivityStats(BaseModel):
alertManagerConnection: bool
prometheusConnection: bool
prometheusRetentionTime: str
managedPrometheusAlerts: bool


class ClusterStatus(BaseModel):
@@ -29,3 +32,11 @@ class ClusterStatus(BaseModel):
ttl_hours: int
stats: ClusterStats
activity_stats: ActivityStats


class Account(BaseModel):
Review comment (Contributor): I think this class is not in use, is it?

id: str
account_type: int
is_test_account: bool
name: Optional[str]
creation_date: datetime
5 changes: 5 additions & 0 deletions src/robusta/core/model/env_vars.py
@@ -41,6 +41,7 @@ def load_bool(env_var, default: bool):

PROMETHEUS_REQUEST_TIMEOUT_SECONDS = float(os.environ.get("PROMETHEUS_REQUEST_TIMEOUT_SECONDS", 90.0))
PROMETHEUS_ENABLED = os.environ.get("PROMETHEUS_ENABLED", "false").lower() == "true"
MANAGED_PROMETHEUS_ALERTS_ENABLED = os.environ.get("MANAGED_PROMETHEUS_ALERTS_ENABLED", "false").lower() == "true"
PROMETHEUS_SSL_ENABLED = os.environ.get("PROMETHEUS_SSL_ENABLED", "false").lower() == "true"

INCOMING_REQUEST_TIME_WINDOW_SECONDS = int(os.environ.get("INCOMING_REQUEST_TIME_WINDOW_SECONDS", 3600))
@@ -95,4 +96,8 @@ def load_bool(env_var, default: bool):

PROMETHEUS_ERROR_LOG_PERIOD_SEC = int(os.environ.get("DISCOVERY_MAX_BATCHES", 14400))

RRM_PERIOD_SEC = int(os.environ.get("RRM_PERIOD_SEC", 90))

MAX_ALLOWED_RULES_PER_CRD_ALERT = int(os.environ.get("MAX_ALLOWED_RULES_PER_CRD_ALERT", 600))

IMAGE_REGISTRY = os.environ.get("IMAGE_REGISTRY", "us-central1-docker.pkg.dev/genuine-flight-317411/devel")
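MAX_ALLOWED_RULES_PER_CRD_ALERT caps how many alert rules go into a single generated PrometheusRule object, presumably to keep each object within Kubernetes size limits. A hypothetical sketch of how such a cap is typically applied by chunking the rule list; not necessarily the PR's implementation:

import os
from typing import Any, Dict, List

MAX_ALLOWED_RULES_PER_CRD_ALERT = int(os.environ.get("MAX_ALLOWED_RULES_PER_CRD_ALERT", 600))

def chunk_rules(rules: List[Dict[str, Any]]) -> List[List[Dict[str, Any]]]:
    # Split a flat rule list into groups of at most MAX_ALLOWED_RULES_PER_CRD_ALERT,
    # so each group can become its own PrometheusRule object.
    return [
        rules[i : i + MAX_ALLOWED_RULES_PER_CRD_ALERT]
        for i in range(0, len(rules), MAX_ALLOWED_RULES_PER_CRD_ALERT)
    ]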
4 changes: 2 additions & 2 deletions src/robusta/core/sinks/pagerduty/pagerduty_sink.py
@@ -1,5 +1,5 @@
import logging
from typing import Dict, Any, Optional
from typing import Dict, Any, Optional, List

import requests

@@ -233,7 +233,7 @@ def __to_unformatted_text_for_alerts(block: BaseBlock) -> str:
return ""

@staticmethod
def __to_unformatted_text_for_changes(block: KubernetesDiffBlock) -> Optional[list[str]]:
def __to_unformatted_text_for_changes(block: KubernetesDiffBlock) -> Optional[List[str]]:
return list(map(
lambda diff: diff.formatted_path,
block.diffs,
142 changes: 116 additions & 26 deletions src/robusta/core/sinks/robusta/dal/supabase_dal.py
@@ -1,14 +1,16 @@
import json
import logging
import time
from typing import Any, Dict, List

from datetime import datetime
from typing import Any, Dict, List, Optional
import requests
from postgrest.base_request_builder import BaseFilterRequestBuilder
from postgrest.utils import sanitize_param
from postgrest.types import ReturnMethod
from supabase import create_client
from supabase.lib.client_options import ClientOptions

from robusta.core.model.cluster_status import ClusterStatus
from robusta.core.model.cluster_status import ClusterStatus, Account
from robusta.core.model.env_vars import SUPABASE_LOGIN_RATE_LIMIT_SEC, SUPABASE_TIMEOUT_SECONDS
from robusta.core.model.helm_release import HelmRelease
from robusta.core.model.jobs import JobInfo
@@ -19,8 +21,10 @@
from robusta.core.reporting.base import Finding
from robusta.core.reporting.blocks import EventsBlock, EventsRef, ScanReportBlock, ScanReportRow
from robusta.core.reporting.consts import EnrichmentAnnotation
from robusta.core.sinks.robusta import RobustaSinkParams
from robusta.core.sinks.robusta.dal.model_conversion import ModelConversion
from robusta.core.sinks.robusta.rrm.account_resource_fetcher import AccountResourceFetcher
from robusta.core.sinks.robusta.rrm.types import AccountResource, ResourceKind, \
AccountResourceStatusType, AccountResourceStatusInfo

SERVICES_TABLE = "Services"
NODES_TABLE = "Nodes"
@@ -33,19 +37,23 @@
UPDATE_CLUSTER_NODE_COUNT = "update_cluster_node_count"
SCANS_RESULT_TABLE = "ScansResults"
RESOURCE_EVENTS = "ResourceEvents"
ACCOUNT_RESOURCE_TABLE = "AccountResource"
ACCOUNT_RESOURCE_STATUS_TABLE = "AccountResourceStatus"
ACCOUNTS_TABLE = "Accounts"
Review comment (Contributor): not used



class SupabaseDal:
class SupabaseDal(AccountResourceFetcher):
def __init__(
self,
url: str,
key: str,
account_id: str,
email: str,
password: str,
sink_params: RobustaSinkParams,
cluster_name: str,
signing_key: str,
self,
url: str,
key: str,
account_id: str,
email: str,
password: str,
sink_name: str,
persist_events: bool,
cluster_name: str,
signing_key: str,
):
self.url = url
self.key = key
@@ -58,7 +66,8 @@ def __init__(
self.sign_in_time = 0
self.sign_in()
self.client.auth.on_auth_state_change(self.__update_token_patch)
self.sink_params = sink_params
self.sink_name = sink_name
self.persist_events = persist_events
self.signing_key = signing_key

def __to_db_scanResult(self, scanResult: ScanReportRow) -> Dict[Any, Any]:
@@ -111,7 +120,7 @@ def persist_finding(self, finding: Finding):
evidence = ModelConversion.to_evidence_json(
account_id=self.account_id,
cluster_id=self.cluster,
sink_name=self.sink_params.name,
sink_name=self.sink_name,
signing_key=self.signing_key,
finding_id=finding.id,
enrichment=enrichment,
@@ -172,15 +181,8 @@ def get_active_services(self) -> List[ServiceInfo]:
res = (
self.client.table(SERVICES_TABLE)
.select(
"name",
"type",
"namespace",
"classification",
"config",
"ready_pods",
"total_pods",
"is_helm_release",
)
"name", "type", "namespace", "classification", "config", "ready_pods", "total_pods",
"is_helm_release")
.filter("account_id", "eq", self.account_id)
.filter("cluster", "eq", self.cluster)
.filter("deleted", "eq", False)
@@ -280,6 +282,14 @@ def publish_nodes(self, nodes: List[NodeInfo]):
self.handle_supabase_error()
raise

@staticmethod
def custom_filter_request_builder(frq: BaseFilterRequestBuilder, operator: str,
criteria: str) -> BaseFilterRequestBuilder:
key, val = sanitize_param(operator), f"{criteria}"
frq.params = frq.params.set(key, val)

return frq

def get_active_jobs(self) -> List[JobInfo]:
try:
res = (
@@ -495,13 +505,93 @@ def persist_events_block(self, block: EventsBlock):
def persist_platform_blocks(self, enrichment: Enrichment, finding_id):
blocks = enrichment.blocks
for i, block in enumerate(blocks):
if isinstance(block, EventsBlock) and self.sink_params.persist_events and block.events:
if isinstance(block, EventsBlock) and self.persist_events and block.events:
self.persist_events_block(block)
event = block.events[0]
blocks[i] = EventsRef(name=event.name, namespace=event.namespace, kind=event.kind.lower())
if isinstance(block, ScanReportBlock):
self.persist_scan(block)

def get_account_resources(
self, resource_kind: ResourceKind, updated_at: Optional[datetime]
) -> List[AccountResource]:
try:
query_builder = (
self.client.table(ACCOUNT_RESOURCE_TABLE)
.select("entity_id", "resource_kind", "clusters_target_set", "resource_state", "deleted", "enabled",
"updated_at")
.filter("resource_kind", "eq", resource_kind)
Review comment (Contributor): for equals, use .eq and not .filter("xx", "eq", ...); see the .eq() sketch after this function's diff.

.filter("account_id", "eq", self.account_id)
)
if updated_at:
Review comment (Contributor): It will be much easier to understand if this is called latest_revision or something similar.
Reply (Contributor Author): done

query_builder.gt("updated_at", updated_at.isoformat())
else:
# in the initial db fetch don't include the deleted records.
# in the subsequent db fetch allow even the deleted records so that they can be removed from the cluster
query_builder.filter("deleted", "eq", False)
Review comment (Contributor): use .eq(...)

query_builder.filter("enabled", "eq", True)

query_builder = SupabaseDal.custom_filter_request_builder(
query_builder,
operator="or",
criteria=f'(clusters_target_set.cs.["*"], clusters_target_set.cs.["{self.cluster}"])',
)
query_builder = query_builder.order(column="updated_at", desc=False)

res = query_builder.execute()
except Exception as e:
msg = f"Failed to get existing account resources (supabase) error: {e}"
Review comment (Contributor): use exc_info=True, to get the real stack trace

logging.error(msg)
self.handle_supabase_error()
raise Exception(msg)

account_resources: List[AccountResource] = []
for data in res.data:
resource_state = data["resource_state"]
Review comment (Contributor): why do you initialize this field differently than all the others?

resource = AccountResource(
entity_id=data["entity_id"],
resource_kind=data["resource_kind"],
clusters_target_set=data["clusters_target_set"],
resource_state=resource_state,
deleted=data["deleted"],
enabled=data["enabled"],
updated_at=data["updated_at"],
)

account_resources.append(resource)

return account_resources
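A minimal sketch of the query shape the review comments above ask for, using the postgrest builder's .eq() shorthand instead of .filter("x", "eq", ...) and logging failures with exc_info=True. The table, column names, and builder methods are taken from this diff; the function name and the exact column list are illustrative:

import logging
from typing import List

def fetch_enabled_resources(client, account_id: str, resource_kind: str) -> List[dict]:
    try:
        res = (
            client.table("AccountResource")
            .select("entity_id", "resource_kind", "resource_state", "updated_at")
            .eq("resource_kind", resource_kind)   # instead of .filter("resource_kind", "eq", ...)
            .eq("account_id", account_id)
            .eq("deleted", False)
            .eq("enabled", True)
            .order(column="updated_at", desc=False)
            .execute()
        )
        return res.data
    except Exception:
        # exc_info=True preserves the full stack trace in the log
        logging.error("Failed to get existing account resources (supabase)", exc_info=True)
        raise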

def __to_db_account_resource_status(self,
status_type: Optional[AccountResourceStatusType],
Review comment (Contributor): status_type seems to be mandatory, why is it marked as Optional?

info: Optional[AccountResourceStatusInfo]) -> Dict[Any, Any]:
Review comment (Contributor): if info is Optional, please provide a default value None, and call without info where not required:
info: Optional[AccountResourceStatusInfo] = None


data = {
"account_id": self.account_id,
"cluster_id": self.cluster,
"status": status_type,
"info": info.dict() if info else None,
"updated_at": "now()",
"latest_revision": "now()"}
Review comment (Contributor): The latest_revision shouldn't be the current time; it should be the date of the updated resources, i.e. the latest date of the resources synced from the AccountResource table (see the sketch after this function's diff).


if status_type != AccountResourceStatusType.error:
data["synced_revision"] = "now()"
Review comment (Contributor): same here, it's not now(); it should indicate which data revision is synchronized.


return data
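A hedged sketch of the revision handling the reviewer asks for above: latest_revision (and synced_revision on success) should carry the newest updated_at of the resources actually synced from the AccountResource table, not the current time. The helper below is an assumption, not the PR's code:

from datetime import datetime
from typing import List, Optional

def latest_revision_of(updated_ats: List[datetime]) -> Optional[datetime]:
    # Newest updated_at among the synced resources; None when nothing has been synced yet.
    return max(updated_ats) if updated_ats else None

# Usage sketch: data["latest_revision"] = latest_revision_of(synced_timestamps)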

def set_account_resource_status(
self, status_type: Optional[AccountResourceStatusType],
info: Optional[AccountResourceStatusInfo]
):
try:
data = self.__to_db_account_resource_status(status_type=status_type, info=info)

self.client.table(ACCOUNT_RESOURCE_STATUS_TABLE).upsert(data).execute()
except Exception as e:
logging.error(f"Failed to persist resource events error: {e}")
Review comment (Contributor): use exc_info, and a correct error message

self.handle_supabase_error()
raise

def __rpc_patch(self, func_name: str, params: dict) -> Dict[str, Any]:
"""
Supabase client is async. Sync impl of rpc call
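Taken together, a rough sketch of how these pieces could drive a periodic sync loop. RRM_PERIOD_SEC and the get_account_resources signature are from this PR, while the class name, the reconcile placeholder, and the loop structure are assumptions about how a caller might use them, not the PR's runner code:

import logging
import time
from datetime import datetime
from typing import List, Optional

RRM_PERIOD_SEC = 90  # from env_vars.py in this PR

class ResourceSyncLoop:
    # Pull-based sync: every RRM_PERIOD_SEC, ask the fetcher (SupabaseDal implements
    # AccountResourceFetcher in this PR) for resources changed since the last pass.
    def __init__(self, fetcher, resource_kind: str):
        self.fetcher = fetcher
        self.resource_kind = resource_kind
        self.last_updated_at: Optional[datetime] = None

    def reconcile(self, resources: List) -> None:
        # Placeholder for creating/patching/deleting PrometheusRule objects in the cluster.
        logging.info("would reconcile %d account resources", len(resources))

    def run_forever(self) -> None:
        while True:
            try:
                resources = self.fetcher.get_account_resources(self.resource_kind, self.last_updated_at)
                if resources:
                    self.reconcile(resources)
                    # assumes AccountResource.updated_at parses to a comparable datetime
                    self.last_updated_at = max(r.updated_at for r in resources)
            except Exception:
                logging.error("RRM sync iteration failed", exc_info=True)
            time.sleep(RRM_PERIOD_SEC)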