Alerting for Cloud-native Applications with Prometheus

Danuka Praneeth
5 min read · Mar 8, 2020

In the previous article, we discussed the basics of Prometheus and the steps to integrate your application with the Prometheus server. In this article, we will explore another part of a complete monitoring system: triggering alerts for your production environment.

First, let's look at the different components of the overall monitoring system and how they interconnect.

Different components in the monitoring system

Creating Prometheus Alerts

Based on the metrics exposed by your application, you can configure different alerts in Prometheus to fulfill your business requirements. Using the metric created in the previous article, let's create one alert that is triggered when your service is down or cannot be scraped by the Prometheus server, and another that is triggered when the total API request rate exceeds a predetermined value.
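
The previous article's service could be written in any language, but purely as an illustration, here is a minimal Go sketch of how the api_request_total counter referenced by the alert below could be exposed. It assumes the official client_golang library; the /api handler, port 8080, and package layout are my own assumptions, not the original application's code.

package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// apiRequests backs the 'api_request_total' metric used in the RequestLimit alert below.
var apiRequests = promauto.NewCounter(prometheus.CounterOpts{
	Name: "api_request_total",
	Help: "Total number of API requests served.",
})

func apiHandler(w http.ResponseWriter, r *http.Request) {
	apiRequests.Inc() // count every incoming request
	w.Write([]byte("ok"))
}

func main() {
	http.HandleFunc("/api", apiHandler)
	// Prometheus scrapes this endpoint as configured in prometheus.yml.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}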

First, you need to have both your Prometheus and Alertmanager services running. Then connect the Alertmanager to the Prometheus server by configuring the Alertmanager's service endpoint inside the prometheus.yml config file of the Prometheus server. Below, we have added the Alertmanager details and the location of the alert rule files for the Prometheus server.

global:
  scrape_interval: 15s
  evaluation_interval: 15s

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - 'alerts/*.yml'

Now we have to create the Prometheus alert rules in an alert file inside the alerts folder defined above, as given below.

groups:
  - name: ExampleAlertGroup
    rules:
      - alert: YourServiceDown
        expr: up{job="your_service"} == 0
        for: 1m
        labels:
          severity: "critical"
          type: "service"
          environment: "production"
        annotations:
          description: "Your Service {{ $labels.job }} instance {{ $labels.instance }} down"
          summary: "your service is down."
      - alert: RequestLimit
        expr: sum by (instance) (rate(api_request_total[1m])) > 10
        for: 1m
        labels:
          severity: warning
          type: "service"
          environment: "production"
        annotations:
          summary: "Total request rate limit exceeded (instance {{ $labels.instance }})"
          description: "Total request rate exceeded the limit on node (> 10 / s) VALUE = {{ $value }} LABELS: {{ $labels }}"

Now we have to configure the Alertmanager to send notifications based on the above alert rules. Below, I have used two notification endpoints: one for email alerts and one for Ms Teams alerts.

global:
  smtp_smarthost: email-smtp.amazonaws.com:587

route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'cluster']
  group_wait: 30s
  group_interval: 30m
  repeat_interval: 5h
  # All the above attributes are inherited by all child routes and can be overwritten on each.
  routes:
    - match_re:
        service: ^.*
      receiver: 'prometheus-msteams'
      continue: true
      routes:
        - match:
            severity: critical
          receiver: 'default-receiver'
          continue: true

receivers:
  - name: 'default-receiver'
    email_configs:
      - send_resolved: false
        to: 'mymail@abc.com'
        from: 'alerts@abc.com'
        auth_username: "XXXXXXXXXXXXX"
        auth_identity: "XXXXXXXXXXXXX"
        auth_password: "XXXXXXXXXXXXX"
  - name: prometheus-msteams
    webhook_configs:
      - url: "http://prom2teams-server:8089/v2/Connector"
        send_resolved: true

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster']

In the above config file, the URL configured for 'prometheus-msteams' is the endpoint of the Prom2teams service, which is explained later in this article.

group_by - To aggregate by all possible labels, use '...' as the sole label name. This effectively disables aggregation entirely, passing through all alerts as-is. That is unlikely to be what you want unless you have a very low alert volume or your upstream notification system performs its own grouping, so we aggregate alerts as shown in the example above.

group_wait - When a new group of alerts is created by an incoming alert, wait at least 'group_wait' before sending the initial notification. This ensures that multiple alerts for the same group that start firing shortly after one another are batched together in the first notification.

group_interval - After the first notification has been sent, wait 'group_interval' before sending a batch of new alerts that started firing for that group.

repeat_interval - If a notification has already been sent successfully, wait 'repeat_interval' before resending it.

inhibit_rules - Inhibition rules allow muting a set of alerts while another alert is firing. We use this to mute any warning-level notifications when the same alert is already critical.

Prometheus to Ms Teams Integration

Integrating the alerts generated by the Alertmanager with Ms Teams is not straightforward. We need an intermediate component, the Prom2teams service, to connect the Alertmanager with Ms Teams.

Prom2teams is a web server built with Python that receives alert notifications from a previously configured Alertmanager instance and forwards them to Ms Teams using the defined connectors.
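
A convenient way to run Prom2teams is as a container next to the Alertmanager. The Docker Compose sketch below is only an illustration: idealista/prom2teams is the commonly used community image, and the in-container config path is an assumption (based on the /opt/prom2teams path used later), so verify both against the version you deploy. The service name matches the prom2teams-server host referenced in the alertmanager.yml above.

version: "3"
services:
  prom2teams-server:        # host name referenced in alertmanager.yml
    image: idealista/prom2teams
    ports:
      - "8089:8089"         # same port as in config.ini below
    volumes:
      # Assumed path; adjust to wherever your image expects its configuration.
      - ./config.ini:/opt/prom2teams/config.ini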

So first, we need to create an incoming webhook connector in Ms Teams and configure the generated URL in the Prom2teams config file.

Webhook connector creation in Ms Teams

The webhook URL generated above should be configured in the config.ini file of the Prom2teams server.

[HTTP Server]
Host:
Port: 8089
[Microsoft Teams]
Connector: https://outlook.office.com/webhook/1231232232323
[Group Alerts]
Field:
[Log]
Level: INFO
[Template]
Path: /opt/prom2teams/helmconfig/teams.j2

The default template file for the Teams alert can be overridden or adjusted by changing the file configured in the Path parameter above.

The endpoint of this Prom2teams service is the webhook URL configured under 'prometheus-msteams' in the alertmanager.yml file above.

Now you should receive alerts at the configured email address and in the Ms Teams channel.

Sample email alert
Sample Ms Teams Alert

Dead Man’s Switch

A dead man's switch is a device or service designed so that an action is triggered when an expected signal from its operator stops arriving. In our case, we use another service as a dead man's switch to trigger an alert if the Prometheus Alertmanager itself fails. To achieve that, we configure a watchdog alert, as given below, for the dead man's switch endpoint configured in the Alertmanager.

The Watchdog alert is an "always firing" alert that ensures the entire alerting pipeline is functional.

groups:
  - name: meta
    rules:
      - alert: WatchdogAlert
        expr: vector(1)
        labels:
          severity: "critical"
          environment: "Production"
        annotations:
          description: This is a Watchdog alert to ensure that the entire Alerting pipeline is functional.
          summary: Watchdog Alerting

Below is the alert-manager configuration for the dead man’s switch.

routes:
  - match_re:
      alertname: WatchdogAlert
    receiver: 'cole'
    group_interval: 10s
    repeat_interval: 4m
    continue: false

receivers:
  - name: cole
    webhook_configs:
      - url: "http://deadman-switch:8080/ping/bpbn2earafu3t25o2900"
        send_resolved: false

We can use any application that exhibits the expected behavior as the dead man's switch, for example a small service written in Go that raises an alarm when the watchdog pings stop arriving, as sketched below.
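
The implementation behind the deadman-switch endpoint above is not shown here, but the idea is simple enough to sketch in Go. In the sketch below, the ping path, port, timeout, and notification action (a log line) are all illustrative assumptions: the service records every ping it receives from the Alertmanager, and if the pings stop arriving for longer than the expected repeat_interval, it raises an alarm through a channel that does not depend on the Prometheus stack.

package main

import (
	"log"
	"net/http"
	"sync"
	"time"
)

// timeout should be comfortably larger than the Alertmanager repeat_interval
// (4m above) so that a single missed ping is tolerated.
const timeout = 10 * time.Minute

var (
	mu       sync.Mutex
	lastPing = time.Now()
)

// pingHandler is called by the Alertmanager webhook for the Watchdog alert.
func pingHandler(w http.ResponseWriter, r *http.Request) {
	mu.Lock()
	lastPing = time.Now()
	mu.Unlock()
	w.WriteHeader(http.StatusOK)
}

// watch periodically checks whether the pings have stopped and, if so, raises
// an alarm (here just a log line; in practice an email, SMS or another webhook).
func watch() {
	for range time.Tick(time.Minute) {
		mu.Lock()
		silent := time.Since(lastPing)
		mu.Unlock()
		if silent > timeout {
			log.Printf("ALARM: no watchdog ping for %s - the alerting pipeline may be down", silent)
		}
	}
}

func main() {
	go watch()
	http.HandleFunc("/ping/", pingHandler) // e.g. /ping/bpbn2earafu3t25o2900
	log.Fatal(http.ListenAndServe(":8080", nil))
}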

In the next article, we will integrate more sophisticated dashboards to our monitoring system.

Thanks!
