Custom Automations

In previous tutorials, we configured automations. We used built-in actions and configured them in YAML.

In this tutorial, we will write a custom action in Python code.

For educational purposes, we'll automate the investigation of a short and made-up (but realistic) error scenario.

Note

It is recommended to read Automation basics before starting this guide.

The scenario

You want to create a new pod with the nginx image.

Being the smart person that you are, you decide to save time by copy-pasting an existing YAML file. You change the pod name and image to nginx.

apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: test
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  nodeSelector:
    spoiler: alert

You start the pod by running the following command:

$ kubectl apply -f nginx-pod.yaml
pod/nginx created

For some reason the pod doesn't start (note its "Pending" status):

$ kubectl get pods
NAME    READY   STATUS    RESTARTS   AGE
nginx   0/1     Pending   0          5h19m

You wait a few minutes, but it remains the same.

To investigate you look at the event log:

kubectl get event --field-selector involvedObject.name=nginx
LAST SEEN   TYPE      REASON             OBJECT      MESSAGE
64s         Warning   FailedScheduling   pod/nginx   0/1 nodes are available: 1 node(s) didn't match Pod's node affinity/selector.

Aha! "1 node(s) didn't match Pod's node affinity/selector." ALRIGHT!

Note

You can see this event on an informative timeline in Robusta UI. Check it out!

Wait, what does it mean? 😖 (Hint: Check the YAML config for the spoiler)

After searching online for some time, you find out that the YAML file you copied had a nodeSelector with the key-value pair "spoiler: alert", which means the pod can only be scheduled on nodes (machines) that have this label 🤦‍♂️.

From the docs:

nodeSelector is the simplest recommended form of node selection constraint. nodeSelector is a field of PodSpec. It specifies a map of key-value pairs. For the pod to be eligible to run on a node, the node must have each of the indicated key-value pairs as labels (it can have additional labels as well). The most common usage is one key-value pair.
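The rule the docs describe is a simple subset check: every key-value pair in the pod's nodeSelector must appear among the node's labels, and the node may have extra labels. A minimal Python sketch of that semantics (the label values here are made up for illustration):

```python
def node_matches(node_labels: dict, node_selector: dict) -> bool:
    # The node may carry extra labels; it only has to contain every selector pair.
    return all(node_labels.get(key) == value for key, value in node_selector.items())

# Hypothetical labels on our single node
node_labels = {"kubernetes.io/os": "linux", "disktype": "ssd"}

print(node_matches(node_labels, {"disktype": "ssd"}))   # True: the pair is present
print(node_matches(node_labels, {"spoiler": "alert"}))  # False: no node has this label
print(node_matches(node_labels, {}))                    # True: an empty selector matches any node
```

The second call is exactly our scenario: no node carries the "spoiler: alert" label, so the scheduler has nowhere to place the pod.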

So you comment out those lines, run kubectl apply again, and all is well.

apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: test
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
#  nodeSelector:
#    spoiler: alert

Wouldn't it be nice if we could automate the detection of issues like this?

Automating the detection with a Robusta Playbook

What do we need to do?

Note

Make sure to clean up the pod from the last section by running kubectl delete pod nginx

A playbook consists of two things:

  • Trigger - We’re going to use a built-in trigger

  • Action - We’re going to write our own action!

Finding the correct trigger

What is the correct trigger for the job? We can think of two triggers that may fit:

  • Creation of a new pod (because we create a new pod, ‘nginx’)

  • A Kubernetes Event is fired (because we ran kubectl get event to find out the scheduling error)

Let’s look at the triggers for Kubernetes (API Server) in the Trigger section. Go ahead and try to find one for each!

Okay! We find on_pod_create and on_event_create.

We'll use on_event_create in this tutorial because it will be easier to identify scheduling issues by looking at the event.

Writing the action

Now we need to write code that checks this event and reports it. To find the correct event class that matches our trigger, on_event_create, take a look at Events and Triggers.

Okay! We find out it’s EventEvent!

So we need to get the information, check for the scenario, and then report it (for more information about reporting, see Creating Findings).

Let’s name our action report_scheduling_failure, and write everything in a Python file:

from robusta.api import *

@action
def report_scheduling_failure(event: EventEvent):
    actual_event = event.get_event()

    print(f"This print will be shown in the robusta logs: {actual_event}")

    if actual_event.type.casefold() == "warning" and \
            actual_event.reason.casefold() == "failedscheduling" and \
            actual_event.involvedObject.kind.casefold() == "pod":
        _report_failed_scheduling(event, actual_event.involvedObject.name, actual_event.message)


def _report_failed_scheduling(event: EventEvent, pod_name: str, message: str):
    # this is how you send data to Slack or other destinations
    event.add_enrichment([
        MarkdownBlock(f"Failed to schedule a pod named '{pod_name}', error: {message}"),
    ])
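Before deploying, you can sanity-check the matching logic locally. The sketch below re-implements the same condition against stand-in event objects; types.SimpleNamespace and the sample messages are just placeholders for the real Kubernetes event model, not Robusta APIs:

```python
from types import SimpleNamespace

def is_pod_scheduling_failure(ev) -> bool:
    # Same condition as in report_scheduling_failure above
    return (
        ev.type.casefold() == "warning"
        and ev.reason.casefold() == "failedscheduling"
        and ev.involvedObject.kind.casefold() == "pod"
    )

# Stand-in for the FailedScheduling warning we saw in kubectl
failed = SimpleNamespace(
    type="Warning",
    reason="FailedScheduling",
    involvedObject=SimpleNamespace(kind="Pod", name="nginx"),
    message="0/1 nodes are available: 1 node(s) didn't match Pod's node affinity/selector.",
)

# Stand-in for a routine event that should NOT trigger a report
normal = SimpleNamespace(
    type="Normal",
    reason="Scheduled",
    involvedObject=SimpleNamespace(kind="Pod", name="nginx"),
    message="Successfully assigned default/nginx to node-1",
)

print(is_pod_scheduling_failure(failed))  # True
print(is_pod_scheduling_failure(normal))  # False
```

Extracting the condition into a pure function like this also makes it easy to grow the check later (e.g. matching other failure reasons) without touching the reporting code.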

Before we proceed, we need to enable local playbook repositories in Robusta.

Follow this quick guide to learn how to package your python file for Robusta: Custom playbook repositories

Use these debugging commands to make sure your action (report_scheduling_failure) is loaded:

robusta logs # get robusta logs, see errors
robusta playbooks list-dirs  # see if your custom action package was loaded

Let’s push the new action to Robusta, and then test it by triggering the action manually.

robusta playbooks push <PATH_TO_LOCAL_PLAYBOOK_FOLDER>
robusta playbooks trigger report_scheduling_failure name=robusta-runner-8cd69f7cb-g5bkb namespace=default seconds=5

Check your Slack channel:

Connecting the trigger to the action - a Playbook is born!

We need to add a custom playbook that runs this action to the generated_values.yaml.

# SNIP! existing contents of the file removed for clarity...

# This is your custom playbook
customPlaybooks:
- triggers:
  - on_event_create: {}
  actions:
  - report_scheduling_failure: {}

# This enables loading custom playbooks
playbooksPersistentVolume: true

Note

If you haven't already, make sure to clean up the pod from the last section by running kubectl delete pod nginx

Time to update Robusta’s config with the new generated_values.yaml:

helm upgrade robusta robusta/robusta --values=generated_values.yaml
robusta playbooks list # see all the playbooks. Run it after a few minutes

After a minute or two Robusta will be ready.

Let’s push the new action to Robusta:

robusta playbooks push <PATH_TO_PLAYBOOK_FOLDER>


Great!

Run the scenario from the first section again (re-creating the bad pod configuration), and you should see the failure reported in your Slack channel:

Cleaning up

kubectl delete pod nginx # delete the pod
robusta playbooks delete <PLAYBOOK_FOLDER> # remove the playbook we just added from Robusta

# Remove "customPlaybooks" and "playbooksPersistentVolume" from your config, and then run helm upgrade
helm upgrade robusta robusta/robusta --values=generated_values.yaml

Summary

We learned how to solve a real problem (a pod that fails to schedule) once, and have Robusta automate the investigation in the future for all our happy co-workers (and future us) to enjoy.

This example of an unschedulable pod is actually covered by Robusta out of the box (if you enable the built-in Prometheus stack), but you can see how easy it is to track any error you like and send it to a notification system with extra data.