[GH-1303] Google Cloud Logging Alerts #486

Merged
merged 4 commits into from
Aug 18, 2021
10 changes: 3 additions & 7 deletions api/src/wfl/api/routes.clj
@@ -135,21 +135,17 @@
 (defn exception-handler
   "Top level exception handler. Prefer to use status and message
    from EXCEPTION and fallback to the provided STATUS and MESSAGE."
-  [status message exception {:keys [uri] :as _request}]
+  [status _ exception _]
   {:status (or (:status (ex-data exception)) status)
-   :body (-> (when-let [cause (.getCause exception)]
-               {:cause (ExceptionUtils/getRootCauseMessage cause)})
-             (merge {:uri uri
-                     :message (or (.getMessage exception) message)
-                     :details (-> exception ex-data (dissoc :status))}))})
+   :body "An internal error has occurred during this request. The development team has been notified of this error."})

 (defn logging-exception-handler
   "Like [[exception-handler]] but also log information about the exception."
   [status message exception {:keys [uri] :as request}]
   (let [{:keys [body status] :as result}
         (exception-handler status message exception request)]
     (log/error (format "Server %s error at occurred at %s :" status uri))
-    (log/error (util/make-map exception body))
+    (log/error (str (util/make-map exception body)))
     result))

 (def exception-middleware
28 changes: 28 additions & 0 deletions docs/md/dev-monitoring.md
@@ -0,0 +1,28 @@
# Workflow Launcher Monitoring
Logs from stdout and stderr are sent to Google Cloud Logging (formerly Stackdriver), where they can be queried. From those logs, we can create metrics that track how the logged events develop over time. From those metrics, we can create alerts that are sent to notification channels of our choosing (Slack, email, SMS, Pub/Sub, etc.).

To create a metric via command line:
```
gcloud auth login
gcloud config set project PROJECT_ID
gcloud beta logging metrics create MY-METRIC-NAME --description="description goes here" --log-filter="filter goes here"
```

Log entries for WFL are located under the container name `workflow-launcher-api`, so logging queries for them should contain `resource.labels.container_name="workflow-launcher-api"`. To match log severities of error and above, also include `severity>=ERROR` in the metric filter. You can exclude specific items from the query with the `NOT` keyword (for example, `NOT "INFO: "` excludes messages that contain `"INFO: "`).

An example query for all WFL log entries of severity ERROR and above:
```
resource.labels.container_name="workflow-launcher-api"
severity>=ERROR
```
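
The query above can be turned into a single-line filter string for the metric command. The following is a minimal sketch, not part of WFL: the metric name `wfl-error-count` is hypothetical, the clauses are joined with an explicit `AND`, and the filter flag is assumed to be `--log-filter`, as in current gcloud releases. The command is printed rather than executed so that the sketch can be tried without touching a project.

```shell
# Hypothetical metric name, for illustration only.
METRIC_NAME="wfl-error-count"
CONTAINER="workflow-launcher-api"

# Join the two clauses of the example query with an explicit AND.
FILTER="resource.labels.container_name=\"${CONTAINER}\" AND severity>=ERROR"

# Optional variant that also excludes messages containing "INFO: ".
FILTER_NO_INFO="${FILTER} AND NOT \"INFO: \""

# Print the command with shell-safe quoting instead of running it.
printf '%q ' gcloud beta logging metrics create "${METRIC_NAME}" \
  --description="WFL log entries of severity ERROR and above" \
  --log-filter="${FILTER}"
echo
```

Once the printed command looks right, running it without the `printf '%q '` wrapper creates the metric in the project selected by `gcloud config set project`.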

To create an alert via command line:
```
gcloud auth login
gcloud config set project PROJECT_ID
gcloud alpha monitoring policies create --policy-from-file="path/to/file"
```

Example policies can be found here: https://cloud.google.com/monitoring/alerts/policies-in-json
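
As an illustration of what such a policy file might contain, here is a minimal sketch that writes one to disk. Everything specific in it is an assumption rather than WFL's actual configuration: the metric name `wfl-error-count`, the `k8s_container` resource type, the zero threshold, the 60-second windows, and the placeholder notification channel.

```shell
# Hypothetical values for illustration; replace with your own.
METRIC_NAME="wfl-error-count"
CHANNEL="projects/PROJECT_ID/notificationChannels/CHANNEL_ID"
POLICY_FILE="wfl-error-policy.json"

# User-defined log-based metrics live under logging.googleapis.com/user/.
cat > "${POLICY_FILE}" <<EOF
{
  "displayName": "WFL errors",
  "combiner": "OR",
  "conditions": [
    {
      "displayName": "WFL error rate above zero",
      "conditionThreshold": {
        "filter": "metric.type=\"logging.googleapis.com/user/${METRIC_NAME}\" AND resource.type=\"k8s_container\"",
        "comparison": "COMPARISON_GT",
        "thresholdValue": 0,
        "duration": "60s",
        "aggregations": [
          { "alignmentPeriod": "60s", "perSeriesAligner": "ALIGN_RATE" }
        ]
      }
    }
  ],
  "notificationChannels": ["${CHANNEL}"]
}
EOF

echo "Wrote ${POLICY_FILE}"
```

The resulting file can then be passed to `gcloud alpha monitoring policies create --policy-from-file="wfl-error-policy.json"`.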

When a metric goes over the threshold set by the policy, an alert is sent via the notification channels provided in the configuration, and an incident is created in Google Cloud Monitoring under Alerts. Incidents resolve themselves once the time series shows the metric back under the configured threshold.
1 change: 1 addition & 0 deletions docs/mkdocs.yml
@@ -38,6 +38,7 @@ nav:
 - Release Process: dev-release.md
 - Logging: dev-logging.md
 - Frontend: dev-frontend.md
+- Monitoring: dev-monitoring.md
 - Staged Workloads:
   - Overview: staged-workload.md
   - Source: source.md