Prometheus Alert Enrichment

Introduction

Robusta has special features for handling Prometheus alerts in Kubernetes clusters, including:

  1. Enrichers: actions that enrich alerts with extra information based on the alert type

  2. Silencers: actions that silence noisy alerts using more advanced methods than Prometheus/AlertManager's built-in silencing feature

When trying out these features, you can leave your existing alerting Slack channel in place and add a new channel for Robusta's improved Prometheus alerts. This lets you compare Robusta's alerting with Prometheus's built-in alerting side by side.

Each triggered action will add enrichment data to the finding. After all the triggered actions are executed, the findings and enrichments will be sent to the configured sinks.

Configure Robusta

Configure Prometheus AlertManager

Before you can enrich Prometheus alerts, you must forward them to Robusta by adding a webhook receiver to AlertManager.

See Prometheus for details.
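As a rough sketch, a webhook receiver in the AlertManager configuration might look like the following. The receiver name robusta and the runner URL (service name, namespace, and path) are assumptions for illustration, so check them against your own installation before using them.

```yaml
# AlertManager configuration fragment (a sketch, not a complete file)
route:
  receiver: robusta              # hypothetical receiver name
receivers:
- name: robusta
  webhook_configs:
  # Assumed in-cluster address of the Robusta runner; verify the
  # service name, namespace, and path for your installation.
  - url: http://robusta-runner.default.svc.cluster.local/api/alerts
    send_resolved: true          # also forward "resolved" notifications
```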

Let's look at the simplest possible configuration in values.yaml, which instructs Robusta to forward Prometheus alerts without any special enrichment:

builtinPlaybooks:
- triggers:
  - on_prometheus_alert: {}
  actions:
  - default_enricher: {}

The above configuration just forwards Prometheus alerts to the configured sinks. We haven't added any special enrichment yet. Below you can see how the default alert information looks in Slack:

Adding an Enricher

Now let's add an enricher to values.yaml which enriches the HostHighCpuLoad alert:

builtinPlaybooks:
- triggers:
  - on_prometheus_alert:
      alert_name: HostHighCpuLoad
  actions:
  - node_cpu_enricher: {}
- triggers:
  - on_prometheus_alert: {}
  actions:
  - default_enricher: {}

When using the above YAML, all Prometheus alerts are forwarded to the sinks unmodified, except for the HostHighCpuLoad alert, which is enriched as you can see below.

Note that adding an enricher to a specific alert doesn't stop other enrichers from running. Enrichers run in the order they appear in the values file.

It's highly recommended to always leave the default_enricher last, to add the default information to all alerts.

Analysis of node cpu usage, breakdown by pods

Make sure to check out the full list of enrichers to see what you can add.
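As a minimal sketch of combining several enrichers from that list, the configuration below attaches two node enrichers to the same alert and keeps default_enricher last. The choice of node_running_pods_enricher as the second enricher is only illustrative.

```yaml
builtinPlaybooks:
- triggers:
  - on_prometheus_alert:
      alert_name: HostHighCpuLoad
  actions:
  # Enrichers run in the order they appear in the file
  - node_cpu_enricher: {}
  - node_running_pods_enricher: {}
- triggers:
  - on_prometheus_alert: {}
  actions:
  # Kept last so every alert still gets the original message and labels
  - default_enricher: {}
```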

If you want to stop processing after a particular enricher, use the stop playbook parameter:

builtinPlaybooks:
- triggers:
  - on_prometheus_alert:
      alert_name: HostHighCpuLoad
  actions:
  - node_cpu_enricher: {}
  stop: True
- triggers:
  - on_prometheus_alert: {}
  actions:
  - default_enricher: {}

With this configuration, the HostHighCpuLoad alert will not include the default alert information.

Adding a Silencer

Let's silence KubePodCrashLooping alerts in the first ten minutes after a node (re)starts:

builtinPlaybooks:
- triggers:
  - on_prometheus_alert:
      alert_name: KubePodCrashLooping
  actions:
  - node_restart_silencer:
      post_restart_silence: 600 # seconds
- triggers:
  - on_prometheus_alert: {}
  actions:
  - default_enricher: {}

Full example

Here are all the above features working together:

builtinPlaybooks:
- triggers:
  - on_prometheus_alert:
      alert_name: KubePodCrashLooping
  actions:
  - node_restart_silencer:
      post_restart_silence: 600 # seconds
- triggers:
  - on_prometheus_alert:
      alert_name: HostHighCpuLoad
  actions:
  - node_cpu_enricher: {}
- triggers:
  - on_prometheus_alert: {}
  actions:
  - default_enricher: {}

Available Enrichers

General Enrichers

Default enricher

Playbook Action

Enrich an alert finding with the original message and labels.

By default, this enricher is last in the processing order, so it will be added to all alerts that aren't silenced.

This action can be run automatically.

Add this to your Robusta configuration (values.yaml when installing with Helm):

actions:
- default_enricher: {}
triggers:
- on_prometheus_alert: {}

The above is an example. Try customizing the trigger and parameters.

No action parameters

  • on_prometheus_alert

Template enricher

Playbook Action

Enrich an alert finding by adding a paragraph of templated markdown to the alert’s description. You can inject any of the alert’s Prometheus labels into the markdown.

A variable like $foo will be replaced by the value of the Prometheus label foo. If a label isn’t present then the text “<missing>” will be used instead.

Common variables to use are $alertname, $deployment, $namespace, and $node.

The template can include all markdown directives supported by Slack. Note that Slack markdown links use a different format than GitHub.

This action can be run automatically.

Add this to your Robusta configuration (values.yaml when installing with Helm):

actions:
- template_enricher:
    template: The alertname is $alertname and the pod is $pod
triggers:
- on_prometheus_alert: {}

The above is an example. Try customizing the trigger and parameters.
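As a hedged sketch of a template that uses Slack formatting, the configuration below attaches a templated paragraph to a single alert. The runbook URL on example.com is a placeholder, and the message wording is purely illustrative.

```yaml
builtinPlaybooks:
- triggers:
  - on_prometheus_alert:
      alert_name: KubePodCrashLooping
  actions:
  - template_enricher:
      # $pod and $namespace are replaced with the alert's label values;
      # a label that is not present is rendered as <missing>.
      # The link uses Slack's <url|text> form; the URL is a placeholder.
      template: "Pod *$pod* in namespace *$namespace* is crash looping. <https://example.com/runbooks/crashloop|Open the runbook>"
- triggers:
  - on_prometheus_alert: {}
  actions:
  - default_enricher: {}
```

The template is quoted so that characters such as < and | are read as a plain string by the YAML parser.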

optional:
template (str)

The enrichment templated markdown text

  • on_prometheus_alert

Logs enricher

Playbook Action

Enrich the alert finding with the pod logs, as a file. The pod to fetch logs for is determined by the alert’s pod label from Prometheus.

By default, if the alert has no pod this enricher will silently do nothing.

This action can be run automatically.

Add this to your Robusta configuration (values.yaml when installing with Helm):

actions:
- logs_enricher: {}
triggers:
- on_pod_all_changes: {}

The above is an example. Try customizing the trigger and parameters.

optional:
warn_on_missing_label (bool)

Send a warning if the alert doesn't have a pod label

  • on_pod_create

  • on_pod_all_changes

  • on_pod_delete

  • on_prometheus_alert

  • on_pod_update

This action can be manually triggered using the Robusta CLI:

robusta playbooks trigger logs_enricher name=POD_NAME namespace=POD_NAMESPACE 
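As a sketch of using this enricher with Prometheus alerts rather than pod change events, the configuration below attaches pod logs and pod events to the KubePodCrashLooping alert. Pairing it with pod_events_enricher is an illustrative choice, not a requirement.

```yaml
builtinPlaybooks:
- triggers:
  - on_prometheus_alert:
      alert_name: KubePodCrashLooping
  actions:
  - logs_enricher:
      # Send a warning instead of silently doing nothing when the
      # alert carries no pod label
      warn_on_missing_label: true
  - pod_events_enricher: {}
- triggers:
  - on_prometheus_alert: {}
  actions:
  - default_enricher: {}
```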

Alert definition enricher

Playbook Action

Enrich an alert finding with the Prometheus query that triggered the alert.

This action can be run automatically.

Add this to your Robusta configuration (values.yaml when installing with Helm):

actions:
- alert_definition_enricher: {}
triggers:
- on_prometheus_alert: {}

The above is an example. Try customizing the trigger and parameters.

No action parameters

  • on_prometheus_alert

Graph enricher

Playbook Action

Enrich the alert finding with a graph of the Prometheus query which triggered the alert.

This action can be run automatically.

Add this to your Robusta configuration (values.yaml when installing with Helm):

actions:
- graph_enricher:
    prometheus_url: http://prometheus-k8s.monitoring.svc.cluster.local:9090
triggers:
- on_prometheus_alert: {}

The above is an example. Try customizing the trigger and parameters.

optional:
prometheus_url (str)

Prometheus URL. If omitted, Robusta will try to find a Prometheus instance in the same cluster.

  • on_prometheus_alert
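A minimal sketch, assuming Prometheus runs in the same cluster as Robusta: the configuration below shows both the query behind an alert and a graph of it, leaving prometheus_url unset so that the instance is looked up automatically.

```yaml
builtinPlaybooks:
- triggers:
  - on_prometheus_alert:
      alert_name: HostHighCpuLoad
  actions:
  # Shows the Prometheus query that triggered the alert
  - alert_definition_enricher: {}
  # prometheus_url is omitted, so an in-cluster Prometheus is searched for
  - graph_enricher: {}
- triggers:
  - on_prometheus_alert: {}
  actions:
  - default_enricher: {}
```

If the automatic lookup does not find your Prometheus instance, set prometheus_url explicitly as in the example above.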

Node Enrichers

Node cpu enricher

Playbook Action

Enrich the finding with an analysis of the node's CPU usage. Collects information about the pods running on this node, their CPU request configuration, their actual CPU usage, and so on. This helps explain why a node's CPU usage is high.

This action can be run automatically.

Add this to your Robusta configuration (values.yaml when installing with Helm):

actions:
- node_cpu_enricher:
    prometheus_url: http://prometheus-k8s.monitoring.svc.cluster.local:9090
triggers:
- on_node_all_changes: {}

The above is an example. Try customizing the trigger and parameters.

optional:
prometheus_url (str)

Prometheus URL. If omitted, Robusta will try to find a Prometheus instance in the same cluster.

  • on_node_create

  • on_node_delete

  • on_prometheus_alert

  • on_node_update

  • on_node_all_changes

This action can be manually triggered using the Robusta CLI:

robusta playbooks trigger node_cpu_enricher name=NODE_NAME 

Oom killer enricher

Playbook Action

Enrich the finding with information regarding the node's OOM killer.

Add the list of pods on this node that were killed by the OOM killer.

This action can be run automatically.

Add this to your Robusta configuration (values.yaml when installing with Helm):

actions:
- oom_killer_enricher: {}
triggers:
- on_node_all_changes: {}

The above is an example. Try customizing the trigger and parameters.

No action parameters

  • on_node_create

  • on_node_delete

  • on_prometheus_alert

  • on_node_update

  • on_node_all_changes

This action can be manually triggered using the Robusta CLI:

robusta playbooks trigger oom_killer_enricher name=NODE_NAME 

Node status enricher

Playbook Action

Enrich the finding with the node's status conditions.

Can help troubleshoot node issues.

This action can be run automatically.

Add this to your Robusta configuration (values.yaml when installing with Helm):

actions:
- node_status_enricher: {}
triggers:
- on_node_all_changes: {}

The above is an example. Try customizing the trigger and parameters.

No action parameters

  • on_node_create

  • on_node_delete

  • on_prometheus_alert

  • on_node_update

  • on_node_all_changes

This action can be manually triggered using the Robusta CLI:

robusta playbooks trigger node_status_enricher name=NODE_NAME 

Node running pods enricher

Playbook Action

Enrich the finding with pods running on this node, along with the 'Ready' status of each pod.

This action can be run automatically.

Add this to your Robusta configuration (values.yaml when installing with Helm):

actions:
- node_running_pods_enricher: {}
triggers:
- on_node_all_changes: {}

The above is an example. Try customizing the trigger and parameters.

No action parameters

  • on_node_create

  • on_node_delete

  • on_prometheus_alert

  • on_node_update

  • on_node_all_changes

This action can be manually triggered using the Robusta CLI:

robusta playbooks trigger node_running_pods_enricher name=NODE_NAME 

Node allocatable resources enricher

Playbook Action

Enrich the finding with the node resources available for allocation.

Can help troubleshoot node issues.

This action can be run automatically.

Add this to your Robusta configuration (values.yaml when installing with Helm):

actions:
- node_allocatable_resources_enricher: {}
triggers:
- on_node_all_changes: {}

The above is an example. Try customizing the trigger and parameters.

No action parameters

  • on_node_create

  • on_node_delete

  • on_prometheus_alert

  • on_node_update

  • on_node_all_changes

This action can be manually triggered using the Robusta CLI:

robusta playbooks trigger node_allocatable_resources_enricher name=NODE_NAME 

Node bash enricher

Playbook Action

Execute the specified bash command on the target node. Enrich the finding with the command results.

This action can be run automatically.

Add this to your Robusta configuration (values.yaml when installing with Helm):

actions:
- node_bash_enricher:
    bash_command: ls -l /etc/data/db
triggers:
- on_node_all_changes: {}

The above is an example. Try customizing the trigger and parameters.

required:
bash_command (str)

Bash command to execute on the target.

  • on_node_create

  • on_node_delete

  • on_prometheus_alert

  • on_node_update

  • on_node_all_changes

This action can be manually triggered using the Robusta CLI:

robusta playbooks trigger node_bash_enricher name=NODE_NAME bash_command=BASH_COMMAND
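As an illustrative sketch, the configuration below runs a command when the HostHighCpuLoad alert fires and attaches its output to the finding. The specific command is an assumption: it presumes that ps and head are available in the environment where the command executes.

```yaml
builtinPlaybooks:
- triggers:
  - on_prometheus_alert:
      alert_name: HostHighCpuLoad
  actions:
  - node_bash_enricher:
      # Illustrative command: list the ten most CPU-hungry processes.
      # Quoted so the pipe and percent sign are kept as a plain string.
      bash_command: "ps aux --sort=-%cpu | head -n 10"
- triggers:
  - on_prometheus_alert: {}
  actions:
  - default_enricher: {}
```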

Pod Enrichers

Pod bash enricher

Playbook Action

Execute the specified bash command on the target pod. Enrich the finding with the command results.

This action can be run automatically.

Add this to your Robusta configuration (values.yaml when installing with Helm):

actions:
- pod_bash_enricher:
    bash_command: ls -l /etc/data/db
triggers:
- on_pod_all_changes: {}

The above is an example. Try customizing the trigger and parameters.

required:
bash_command (str)

Bash command to execute on the target.

  • on_pod_create

  • on_pod_all_changes

  • on_pod_delete

  • on_prometheus_alert

  • on_pod_update

This action can be manually triggered using the Robusta CLI:

robusta playbooks trigger pod_bash_enricher name=POD_NAME namespace=POD_NAMESPACE bash_command=BASH_COMMAND

Cpu throttling analysis enricher

Playbook Action

Enrich the finding with a deep analysis for the cause of the CPU throttling.

Includes recommendations for the identified cause.

This action can be run automatically.

Add this to your Robusta configuration (values.yaml when installing with Helm):

actions:
- cpu_throttling_analysis_enricher: {}
triggers:
- on_pod_all_changes: {}

The above is an example. Try customizing the trigger and parameters.

No action parameters

  • on_pod_create

  • on_pod_all_changes

  • on_pod_delete

  • on_prometheus_alert

  • on_pod_update

This action can be manually triggered using the Robusta CLI:

robusta playbooks trigger cpu_throttling_analysis_enricher name=POD_NAME namespace=POD_NAMESPACE 
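A hedged sketch of wiring this enricher to a Prometheus alert follows. The alert name CPUThrottlingHigh is an assumption based on commonly used Kubernetes alerting rules; substitute whatever name your rules actually define.

```yaml
builtinPlaybooks:
- triggers:
  - on_prometheus_alert:
      # Assumed alert name; replace with the one defined in your rules
      alert_name: CPUThrottlingHigh
  actions:
  - cpu_throttling_analysis_enricher: {}
- triggers:
  - on_prometheus_alert: {}
  actions:
  - default_enricher: {}
```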

Image pull backoff reporter

Playbook Action

Notify when an ImagePullBackoff occurs and determine the reason why.

This action can be run automatically.

Add this to your Robusta configuration (values.yaml when installing with Helm):

actions:
- image_pull_backoff_reporter: {}
triggers:
- on_pod_all_changes: {}

The above is an example. Try customizing the trigger and parameters.

optional:
rate_limit (int) = 3600

Rate limit for the execution of this action, in seconds.

  • on_pod_create

  • on_pod_all_changes

  • on_pod_delete

  • on_prometheus_alert

  • on_pod_update

This action can be manually triggered using the Robusta CLI:

robusta playbooks trigger image_pull_backoff_reporter name=POD_NAME namespace=POD_NAMESPACE 

Pod events enricher

Playbook Action

Enrich the finding with the pod events.

This action can be run automatically.

Add this to your Robusta configuration (values.yaml when installing with Helm):

actions:
- pod_events_enricher: {}
triggers:
- on_pod_all_changes: {}

The above is an example. Try customizing the trigger and parameters.

No action parameters

  • on_pod_create

  • on_pod_all_changes

  • on_pod_delete

  • on_prometheus_alert

  • on_pod_update

This action can be manually triggered using the Robusta CLI:

robusta playbooks trigger pod_events_enricher name=POD_NAME namespace=POD_NAMESPACE 

Daemonset Enrichers

Daemonset status enricher

Playbook Action

Enrich the finding with DaemonSet stats.

Includes recommendations for the identified cause.

This action can be run automatically.

Add this to your Robusta configuration (values.yaml when installing with Helm):

actions:
- daemonset_status_enricher: {}
triggers:
- on_daemonset_update: {}

The above is an example. Try customizing the trigger and parameters.

No action parameters

  • on_daemonset_delete

  • on_daemonset_create

  • on_prometheus_alert

  • on_daemonset_update

  • on_daemonset_all_changes

This action can be manually triggered using the Robusta CLI:

robusta playbooks trigger daemonset_status_enricher name=DAEMONSET_NAME namespace=DAEMONSET_NAMESPACE 

Daemonset misscheduled analysis enricher

Playbook Action

Enrich the alert finding with analysis and possible causes for the misscheduling, if the cause is identified.

Learn more: https://blog.florentdelannoy.com/blog/2020/kube-daemonset-misscheduled/

This action can be run automatically.

Add this to your Robusta configuration (values.yaml when installing with Helm):

actions:
- daemonset_misscheduled_analysis_enricher: {}
triggers:
- on_daemonset_update: {}

The above is an example. Try customizing the trigger and parameters.

No action parameters

  • on_daemonset_delete

  • on_daemonset_create

  • on_prometheus_alert

  • on_daemonset_update

  • on_daemonset_all_changes

This action can be manually triggered using the Robusta CLI:

robusta playbooks trigger daemonset_misscheduled_analysis_enricher name=DAEMONSET_NAME namespace=DAEMONSET_NAMESPACE 

Deployment Enrichers

Deployment status enricher

Playbook Action

Enrich the finding with deployment status conditions.

Usually these conditions can provide important information regarding possible issues.

This action can be run automatically.

Add this to your Robusta configuration (values.yaml when installing with Helm):

actions:
- deployment_status_enricher: {}
triggers:
- on_deployment_update: {}

The above is an example. Try customizing the trigger and parameters.

No action parameters

  • on_deployment_create

  • on_prometheus_alert

  • on_deployment_delete

  • on_deployment_all_changes

  • on_deployment_update

This action can be manually triggered using the Robusta CLI:

robusta playbooks trigger deployment_status_enricher name=DEPLOYMENT_NAME namespace=DEPLOYMENT_NAMESPACE 
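As a sketch, the configuration below adds deployment status conditions to a deployment-related Prometheus alert. The alert name KubeDeploymentReplicasMismatch is an assumption taken from commonly used Kubernetes alerting rules and may differ in your cluster.

```yaml
builtinPlaybooks:
- triggers:
  - on_prometheus_alert:
      # Assumed alert name; replace with the one defined in your rules
      alert_name: KubeDeploymentReplicasMismatch
  actions:
  - deployment_status_enricher: {}
- triggers:
  - on_prometheus_alert: {}
  actions:
  - default_enricher: {}
```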

Other Enrichers

Stack overflow enricher

Playbook Action

Enrich the alert finding with a button that, when clicked, shows the top StackOverflow search results for the alert name.

This action can be run automatically.

Add this to your Robusta configuration (values.yaml when installing with Helm):

actions:
- stack_overflow_enricher: {}
triggers:
- on_prometheus_alert: {}

The above is an example. Try customizing the trigger and parameters.

No action parameters

  • on_prometheus_alert

Available Silencers

Severity silencer

Playbook Action

Silence alert findings with the specified severity level.

This action can be run automatically.

Add this to your Robusta configuration (values.yaml when installing with Helm):

actions:
- severity_silencer:
    severity: warning
triggers:
- on_prometheus_alert: {}

The above is an example. Try customizing the trigger and parameters.

optional:
severity (str) = none

Severity level that should be silenced.

  • on_prometheus_alert
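A minimal sketch of placing this silencer ahead of the default enricher, so that alerts of the chosen severity are dropped before anything is sent to the sinks. The severity value info is illustrative; use whichever value your alerting rules put in the severity label.

```yaml
builtinPlaybooks:
- triggers:
  - on_prometheus_alert: {}
  actions:
  - severity_silencer:
      severity: info     # illustrative value
- triggers:
  - on_prometheus_alert: {}
  actions:
  # Only alerts that were not silenced above reach this enricher
  - default_enricher: {}
```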

Name silencer

Playbook Action

Silence named alerts.

This action can be run automatically.

Add this to your Robusta configuration (values.yaml when installing with Helm):

actions:
- name_silencer:
    names:
    - string
    - string
triggers:
- on_prometheus_alert: {}

The above is an example. Try customizing the trigger and parameters.

required:
names (str list)

List of alert names that should be silenced.

  • on_prometheus_alert
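As a sketch with concrete values in place of the string placeholders, the configuration below silences two alerts by name. Watchdog and InfoInhibitor are assumptions drawn from commonly used alerting rule sets; list the alert names that are noisy in your own cluster.

```yaml
builtinPlaybooks:
- triggers:
  - on_prometheus_alert: {}
  actions:
  - name_silencer:
      names:
      # Illustrative alert names; replace with your own
      - Watchdog
      - InfoInhibitor
- triggers:
  - on_prometheus_alert: {}
  actions:
  - default_enricher: {}
```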

Node restart silencer

Playbook Action

Silence alert findings for pods on a node that was recently restarted.

This action can be run automatically.

Add this to your Robusta configuration (values.yaml when installing with Helm):

actions:
- node_restart_silencer: {}
triggers:
- on_prometheus_alert: {}

The above is an example. Try customizing the trigger and parameters.

optional:
post_restart_silence (int) = 300

Period after a restart during which alerts are silenced, in seconds.

  • on_prometheus_alert

Daemonset misscheduled smart silencer

Playbook Action

Silence daemonset misscheduled alert finding if it's a known false alarm.

Checks whether the alert matches the issue described here: https://blog.florentdelannoy.com/blog/2020/kube-daemonset-misscheduled/

This action can be run automatically.

Add this to your Robusta configuration (values.yaml when installing with Helm):

actions:
- daemonset_misscheduled_smart_silencer: {}
triggers:
- on_prometheus_alert: {}

The above is an example. Try customizing the trigger and parameters.

No action parameters

  • on_prometheus_alert
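A hedged sketch of pairing this silencer with the matching analysis enricher: known false alarms are silenced first, and any alert that remains is enriched with the misscheduling analysis. The alert name KubeDaemonSetMisScheduled is an assumption based on commonly used Kubernetes alerting rules.

```yaml
builtinPlaybooks:
- triggers:
  - on_prometheus_alert:
      # Assumed alert name; replace with the one defined in your rules
      alert_name: KubeDaemonSetMisScheduled
  actions:
  - daemonset_misscheduled_smart_silencer: {}
- triggers:
  - on_prometheus_alert:
      alert_name: KubeDaemonSetMisScheduled
  actions:
  - daemonset_misscheduled_analysis_enricher: {}
- triggers:
  - on_prometheus_alert: {}
  actions:
  - default_enricher: {}
```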