Alerting for Cloud-native Applications with Prometheus

In the previous article we discussed the basics of Prometheus and the steps to integrate your application with a Prometheus server. In this article we will explore another part of a complete monitoring system: triggering alerts for your production environment.

First let's look at the components of the overall monitoring system and how they connect to each other.

Figure: Different components in the monitoring system

Creating Prometheus Alerts

Based on the metrics exposed by your application, you can configure different alerts in Prometheus to fulfill your business requirements. Using the metric created in the previous article, let's create one alert that triggers when your service is down or cannot be reached by the Prometheus server, and another that triggers when the volume of API requests exceeds a predetermined threshold.

First you need to have both your Prometheus and Alertmanager services running. Then connect Alertmanager to the Prometheus server by configuring the Alertmanager service endpoint inside the prometheus.yml config file of the Prometheus server. Below we have added the Alertmanager details and the location of the alert rule files for the Prometheus server.

global:
  scrape_interval: 15s
  evaluation_interval: 15s

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - 'alerts/*.yml'

Now we have to create the Prometheus alert rules by creating an alert file inside the alerts folder defined above, as shown below.

groups:
  - name: ExampleAlertGroup
    rules:
      - alert: YourServiceDown
        # 'up' is 0 when Prometheus fails to scrape the target
        expr: up{job="your_service"} == 0
        for: 1m
        labels:
          severity: "critical"
          type: "service"
          environment: "production"
        annotations:
          description: "Your Service {{ $labels.job }} instance {{ $labels.instance }} down"
          summary: "your service is down."
      - alert: RequestLimit
        # Per-instance request rate (requests per second) averaged over the last minute
        expr: sum by (instance) (rate(api_request_total[1m])) > 10
        for: 1m
        labels:
          severity: "warning"
          type: "service"
          environment: "production"
        annotations:
          summary: "Total request count Limit Exceeded (instance {{ $labels.instance }})"
          description: "Total request count Exceeded the Limit on node (> 10 / s) VALUE = {{ $value }} LABELS: {{ $labels }}"
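
Once the rule file is loaded, one way to check that the alerts are being evaluated is to read them back from the Prometheus HTTP API. The sketch below is a minimal example, not part of the original setup. It assumes Prometheus is reachable at http://localhost:9090 (the article does not state the address), and the helper names `alerts_by_state` and `fetch_alerts` are hypothetical. It reads the `/api/v1/alerts` endpoint and groups the active alerts by state.

```python
import json
import urllib.request

# Assumed address of the Prometheus server; adjust to your environment.
PROMETHEUS_URL = "http://localhost:9090"


def alerts_by_state(api_response):
    """Group alert names by state ("pending" or "firing").

    `api_response` is the decoded JSON body of GET /api/v1/alerts.
    """
    if api_response.get("status") != "success":
        raise ValueError("Prometheus API call did not succeed")
    grouped = {}
    for alert in api_response["data"]["alerts"]:
        name = alert["labels"].get("alertname", "<unnamed>")
        grouped.setdefault(alert["state"], []).append(name)
    return grouped


def fetch_alerts(base_url=PROMETHEUS_URL):
    """Fetch the currently active alerts from a running Prometheus server."""
    with urllib.request.urlopen(f"{base_url}/api/v1/alerts", timeout=10) as resp:
        return json.load(resp)


# Canned response shaped like the API output, so the example runs offline.
sample_response = {
    "status": "success",
    "data": {
        "alerts": [
            {"labels": {"alertname": "YourServiceDown", "severity": "critical"},
             "annotations": {}, "state": "firing", "value": "0e+00"},
            {"labels": {"alertname": "RequestLimit", "severity": "warning"},
             "annotations": {}, "state": "pending", "value": "1.2e+01"},
        ]
    },
}
print(alerts_by_state(sample_response))
```

Against a live server you would call `alerts_by_state(fetch_alerts())`. An alert stays in the "pending" state until its `for` duration has elapsed, and only then moves to "firing".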

Now we have to configure Alertmanager to send notifications for the alerts defined above. Below I have used two receivers: one for email alerts and one for MS Teams alerts.

global:
  smtp_smarthost: 'email-smtp.amazonaws.com:587'

route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'cluster']
  group_wait: 30s
  group_interval: 30m
  repeat_interval: 5h
  # All the above attributes are inherited by all child routes and can be overwritten on each.
  routes:
    - match_re:
        service: ^.*
      receiver: 'prometheus-msteams'
      continue: true
      routes:
        - match:
            severity: critical
          receiver: 'default-receiver'
          continue: true

receivers:
  - name: 'default-receiver'
    email_configs:
      - send_resolved: false
        to: 'mymail@abc.com'
        from: 'alerts@abc.com'
        auth_username: "XXXXXXXXXXXXX"
        auth_identity: "XXXXXXXXXXXXX"
        auth_password: "XXXXXXXXXXXXX"
  - name: prometheus-msteams
    webhook_configs:
      - url: "http://prom2teams-server:8089/v2/Connector"
        send_resolved: true

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster']

In the above config file, the URL configured for 'prometheus-msteams' is the endpoint of the Prom2teams service, which is explained later in this article.

group_by - The labels by which incoming alerts are grouped into a single notification. To aggregate by all possible labels, use '...' as the sole label name. This effectively disables aggregation entirely, passing through all alerts as-is. That is unlikely to be what you want unless you have a very low alert volume or your upstream notification system performs its own grouping, so we group alerts by 'alertname' and 'cluster' as in the example above.

group_wait - When an incoming alert creates a new group, wait at least 'group_wait' before sending the initial notification. This ensures that multiple alerts for the same group that start firing shortly after one another are batched together in the first notification.

group_interval - Once the first notification has been sent, wait 'group_interval' before sending a batch of new alerts that started firing for that group.

repeat_interval - Once a notification has been sent successfully, wait 'repeat_interval' before sending it again.

inhibit_rules - Inhibition rules let you mute a set of alerts while another alert is firing. We use this to mute warning-level notifications when the same alert is already firing at critical level.
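
To see how the three timing parameters interact, the following sketch models the notification schedule for a single alert group. It is a simplified model written for illustration, not Alertmanager's actual implementation: it ignores resolved alerts, retries, and inhibition, and the names `GroupTiming` and `notification_times` are hypothetical. The values used are the ones from the configuration above (30s, 30m, 5h), expressed in seconds.

```python
from dataclasses import dataclass


@dataclass
class GroupTiming:
    group_wait: int       # seconds to wait before the first notification
    group_interval: int   # seconds between checks of the group afterwards
    repeat_interval: int  # seconds before an unchanged notification is re-sent


def notification_times(timing, alert_starts, horizon):
    """Return the times (in seconds) at which one group would notify.

    alert_starts: times at which alerts in the group begin firing.
    horizon: stop simulating after this many seconds.
    """
    if not alert_starts:
        return []
    starts = sorted(alert_starts)
    sent = []
    notified_count = 0
    last_sent = None
    # The first check happens group_wait after the group is created.
    check = starts[0] + timing.group_wait
    while check <= horizon:
        firing = sum(1 for start in starts if start <= check)
        changed = firing != notified_count
        repeat_due = (last_sent is not None
                      and check - last_sent >= timing.repeat_interval)
        if changed or repeat_due:
            sent.append(check)
            notified_count = firing
            last_sent = check
        # Later checks happen every group_interval.
        check += timing.group_interval
    return sent


timing = GroupTiming(group_wait=30, group_interval=1800, repeat_interval=18000)
# Two alerts fire within the first 30s, a third one 100s after the first.
print(notification_times(timing, alert_starts=[0, 10, 100], horizon=20000))
# prints [30, 1830, 19830]
```

In this run the first two alerts are batched into one notification at 30s, the third alert waits for the next group check at 1830s, and the unchanged group is re-sent at 19830s, the first check at least five hours after the previous notification.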

Prometheus to MS Teams Integration

Integrating the alerts generated by Alertmanager with MS Teams is not straightforward. We need an intermediate component, the Prom2teams service, to connect Alertmanager with MS Teams.

Prom2teams is a web server built with Python that receives alert notifications from a previously configured Alertmanager instance and forwards them to MS Teams using defined connectors.
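
To make the role of this intermediate service concrete, here is a minimal sketch of the kind of translation it performs. This is not Prom2teams' actual code or template. It assumes the standard Alertmanager webhook payload and the legacy Teams "MessageCard" JSON format, and the function names `to_message_card` and `post_card` are hypothetical.

```python
import json
import urllib.request


def to_message_card(payload):
    """Build a Teams MessageCard dict from an Alertmanager webhook payload."""
    status = payload.get("status", "unknown")
    alerts = payload.get("alerts", [])
    sections = []
    for alert in alerts:
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        sections.append({
            "activityTitle": labels.get("alertname", "alert"),
            "activitySubtitle": annotations.get("summary", ""),
            "text": annotations.get("description", ""),
            # Show every label as a name/value row on the card.
            "facts": [{"name": key, "value": str(value)}
                      for key, value in sorted(labels.items())],
        })
    return {
        "@type": "MessageCard",
        "@context": "http://schema.org/extensions",
        # Green for resolved alerts, red otherwise (colors chosen arbitrarily).
        "themeColor": "2DC72D" if status == "resolved" else "D70000",
        "summary": f"{len(alerts)} alert(s) {status}",
        "title": f"Prometheus alert ({status})",
        "sections": sections,
    }


def post_card(webhook_url, card):
    """POST the card to a Teams incoming webhook URL and return the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(card).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as resp:
        return resp.status


# Example payload, trimmed to the fields used above.
example_payload = {
    "status": "firing",
    "alerts": [{
        "status": "firing",
        "labels": {"alertname": "YourServiceDown", "severity": "critical"},
        "annotations": {"summary": "your service is down."},
    }],
}
print(json.dumps(to_message_card(example_payload), indent=2))
```

Prom2teams does this rendering through a Jinja template, which is why its template file can be customized as described further below.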

So first we need to create a webhook connector in MS Teams and configure its URL in the Prom2teams config file.

Figure: Webhook connector creation in MS Teams

The webhook URL generated above should be configured in the config.ini file of the Prom2teams server.

[HTTP Server]
Host:
Port: 8089
[Microsoft Teams]
Connector: https://outlook.office.com/webhook/1231232232323
[Group Alerts]
Field:
[Log]
Level: INFO
[Template]
Path: /opt/prom2teams/helmconfig/teams.j2

The default template file for the Teams alert can be overridden or changed by pointing the Path parameter above at your own template.

The URL of this Prom2teams service is the webhook URL configured under 'prometheus-msteams' in the alertmanager.yml file shown earlier.

Now you should receive your alerts at the configured email address and MS Teams channel.
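
If you do not want to wait for a real incident, one way to exercise the notification path is to push a hand-made alert straight into Alertmanager through its `/api/v2/alerts` endpoint. The sketch below is an illustration under assumptions: the address http://localhost:9093 is taken from the prometheus.yml above, while the alert name `PipelineTest` and the helper names are hypothetical.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone

# Alertmanager address as configured in prometheus.yml above.
ALERTMANAGER_URL = "http://localhost:9093"


def build_test_alert(name="PipelineTest", severity="warning", minutes=5, now=None):
    """Build one alert object in the shape expected by POST /api/v2/alerts."""
    now = now or datetime.now(timezone.utc)
    return {
        "labels": {
            "alertname": name,
            "severity": severity,
            "environment": "production",
        },
        "annotations": {
            "summary": "Manual test alert",
            "description": "Sent by hand to verify the notification path.",
        },
        "startsAt": now.isoformat(),
        # Alertmanager treats the alert as resolved once endsAt has passed.
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }


def send_alerts(alerts, base_url=ALERTMANAGER_URL):
    """POST a list of alerts to Alertmanager and return the HTTP status code."""
    request = urllib.request.Request(
        f"{base_url}/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as resp:
        return resp.status


# Build (but do not send) an alert so the example runs without a server.
print(json.dumps(build_test_alert(), indent=2))
```

With Alertmanager running, `send_alerts([build_test_alert()])` submits the alert, and the notification should arrive after the configured 'group_wait' has elapsed.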

Figure: Sample email alert
Figure: Sample MS Teams alert

Dead Man’s Switch

A dead man's switch is a device or service that triggers an action when an expected signal stops arriving. In our case we use a separate service as a dead man's switch to raise an alert if Prometheus or Alertmanager itself fails. To achieve that, we configure a Watchdog alert, as given below, that Alertmanager routes to the dead man's switch endpoint.

The Watchdog alert is an "always firing" alert that ensures the entire alerting pipeline is functional.

groups:
  - name: meta
    rules:
      - alert: WatchdogAlert
        expr: vector(1)
        labels:
          severity: "critical"
          environment: "Production"
        annotations:
          description: This is a Watchdog alert to ensure that the entire Alerting pipeline is functional.
          summary: Watchdog Alerting

Below is the Alertmanager configuration for the dead man's switch.

routes:
  - match_re:
      alertname: WatchdogAlert
    receiver: 'cole'
    group_interval: 10s
    repeat_interval: 4m
    continue: false
receivers:
  - name: cole
    webhook_configs:
      - url: "http://deadman-switch:8080/ping/bpbn2earafu3t25o2900"
        send_resolved: false

Any application that exhibits the expected behavior can serve as the dead man's switch. One such simple application, written in Go, triggers an alert when the Watchdog alert stops arriving.
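
The expected behavior can be sketched in a few lines. The example below is not the Go application mentioned above. It is a hypothetical Python model of the core logic: every webhook call from Alertmanager counts as a ping, and the switch trips once no ping has arrived within a timeout. The 480-second timeout is an assumption, chosen as twice the 4m 'repeat_interval' configured above.

```python
import time


class DeadMansSwitch:
    """Trips when no ping has been received for longer than the timeout."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self._clock = clock          # injectable so the logic can be tested
        self._last_ping = clock()

    def ping(self):
        """Record a heartbeat; call this from the webhook handler."""
        self._last_ping = self._clock()

    def seconds_since_ping(self):
        return self._clock() - self._last_ping

    def is_tripped(self):
        """True when the Watchdog alert has stopped arriving."""
        return self.seconds_since_ping() > self.timeout_seconds


class FakeClock:
    """Manually advanced clock used to demonstrate the switch."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


clock = FakeClock()
switch = DeadMansSwitch(timeout_seconds=480, clock=clock)

clock.now = 240          # Watchdog notification arrives after 4 minutes
switch.ping()
clock.now = 600          # 6 minutes since the last ping: still healthy
print(switch.is_tripped())   # prints False
clock.now = 721          # just over 8 minutes of silence: pipeline is broken
print(switch.is_tripped())   # prints True
```

In a real service, a background loop would poll `is_tripped()` and notify an on-call channel through a path that does not depend on Prometheus or Alertmanager.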

In the next article, we will integrate more sophisticated dashboards into our monitoring system.

Thanks!

Senior Software Engineer | BSc (Hons) Engineering | CIMA | Autodidact | Knowledge-Seeker
