A Complete Python Action
Note
It is recommended to read The Basics before starting this guide.
In this tutorial, we will write a custom action. For educational purposes, we'll automate the investigation of a made-up (but realistic) error scenario.
The scenario
Note
It is recommended to follow along with the described scenario and run all the commands on a test cluster.
Let's imagine that you want to create and deploy a new pod with the nginx image.
Being the smart person that you are, you decide to save time by copy-pasting an existing YAML file. You change the pod name and image to nginx.
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: test
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  nodeSelector:
    spoiler: alert
You start the pod by running the following command:
$ kubectl apply -f nginx-pod.yaml
pod/nginx created
For some reason the pod doesn't start (note its "Pending" status):
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx 0/1 Pending 0 5h19m
You wait a few minutes, but it remains the same.
To investigate, you look at the event log:
$ kubectl get event --field-selector involvedObject.name=nginx
LAST SEEN TYPE REASON OBJECT MESSAGE
64s Warning FailedScheduling pod/nginx 0/1 nodes are available: 1 node(s) didn't match Pod's node affinity/selector.
Aha! "1 node(s) didn't match Pod's node affinity/selector." ALRIGHT!
Note
You can see this event on an informative timeline in Robusta UI. Check it out!
Wait, what does it mean? (Hint: check the YAML config for the spoiler.)
After searching online for some time, you find out that the YAML file you copied had a "nodeSelector" with the key-value "spoiler: alert", which means the pod can only be scheduled on nodes (machines) that carry this label.
From the docs:
You can constrain a Pod so that it is restricted to run on particular node(s), or to prefer to run on particular nodes. There are several ways to do this and the recommended approaches all use label selectors to facilitate the selection. Often, you do not need to set any such constraints ...
So you comment out those lines, run kubectl apply again, and all is well.
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: test
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
#  nodeSelector:
#    spoiler: alert
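If you're curious, you can also confirm the mismatch from Python, the same language we'll use for the Robusta action below. This is an optional, illustrative sketch: it assumes the official kubernetes client package is installed (pip install kubernetes) and a kubeconfig for the test cluster is available. It simply lists each node's labels, and you'll see that none of them carries spoiler=alert.
# Optional: list node labels to see why the nodeSelector can't be satisfied.
from kubernetes import client, config

config.load_kube_config()  # use the current kubeconfig context
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    labels = node.metadata.labels or {}
    has_label = labels.get("spoiler") == "alert"
    print(f"{node.metadata.name}: spoiler=alert present? {has_label}")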
Wouldn't it be nice if we could automate the detection of issues like this?
Note
Make sure to clean up the pod from this section by running kubectl delete pod nginx
Automating the detection with a Robusta Playbook
What do we need to do?
A playbook consists of two things:
Trigger - we're going to use a built-in trigger
Action - we're going to write our own action!
Finding the correct trigger
What is the correct trigger for the job? We can think of two triggers that may fit:
Creation of a new pod (because we created a new pod, "nginx")
A Kubernetes Event being fired (because we ran kubectl get event to find the scheduling error)
Let's look at the trigger section about Kubernetes (API Server) and try to find triggers for both. Go ahead and try to find them!
Okay! We find on_pod_create and on_event_create.
We'll use on_event_create in this tutorial because it will be easier to identify scheduling issues by looking at the event.
Writing the action
Now we need to write code that checks this event and reports it. To find the correct event class that matches our trigger on_event_create, please take a look at Events and Triggers.
Okay! We find out it's EventEvent!
So we need to get the information, check for the scenario, and then report it (for more information about reporting it, see Creating Findings).
Let's name our action report_scheduling_failure and write everything in a Python file:
from robusta.api import *


@action
def report_scheduling_failure(event: EventEvent):  # We use EventEvent to get the event object.
    actual_event = event.get_event()
    print(f"This print will be shown in the robusta logs={actual_event}")
    if actual_event.type.casefold() == "Warning".casefold() and \
            actual_event.reason.casefold() == "FailedScheduling".casefold() and \
            actual_event.involvedObject.kind.casefold() == "Pod".casefold():
        _report_failed_scheduling(event, actual_event.involvedObject.name, actual_event.message)


def _report_failed_scheduling(event: EventEvent, pod_name: str, message: str):
    custom_message = ""
    if "affinity/selector" in message:
        custom_message = "Your pod has a node 'selector' configured, which means it can't just run on any node. For more info, see: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodeselector"

    # this is how you send data to slack or other destinations
    # Note - it is sometimes better to create a Finding object instead of calling event.add_enrichment, but that is out of the scope of this tutorial
    event.add_enrichment([
        MarkdownBlock(f"Failed to schedule a pod named '{pod_name}'!\nerror: {message}\n\n{custom_message}"),
    ])
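Before wiring this into Robusta, you can sanity-check the matching condition locally. The snippet below is our own illustrative sketch, not part of the Robusta API: matches_failed_scheduling and the SimpleNamespace stand-in merely mimic the event fields the action reads.
# Standalone sanity check of the matching logic above (illustrative only, plain Python).
from types import SimpleNamespace

def matches_failed_scheduling(ev) -> bool:
    # Mirrors the condition in report_scheduling_failure
    return (
        ev.type.casefold() == "warning"
        and ev.reason.casefold() == "failedscheduling"
        and ev.involvedObject.kind.casefold() == "pod"
    )

fake_event = SimpleNamespace(
    type="Warning",
    reason="FailedScheduling",
    involvedObject=SimpleNamespace(kind="Pod", name="nginx"),
    message="0/1 nodes are available: 1 node(s) didn't match Pod's node affinity/selector.",
)
print(matches_failed_scheduling(fake_event))  # prints: True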
Before we proceed, we need to enable local playbook repositories in Robusta.
Follow this quick guide to learn how to package your Python file for Robusta: Creating Playbook Repositories
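For orientation, the folder you push usually contains a small Python package plus packaging metadata. The layout below is an assumption for illustration only (the package and module names match the import line you'll see in the logs shortly); the Creating Playbook Repositories guide is the authoritative reference.
my-playbook-repo/                      # the <PATH_TO_LOCAL_PLAYBOOK_FOLDER> used in the push command below
├── pyproject.toml                     # packaging metadata for the playbook package (assumed; see the guide)
└── my_playbook_repo/
    └── report_scheduling_failure.py   # the Python file we wrote above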
Let's push the new action to Robusta:
robusta playbooks push <PATH_TO_LOCAL_PLAYBOOK_FOLDER>
Use these debugging commands to make sure your action (report_scheduling_failure) is loaded:
$ robusta logs # get robusta logs, see errors, see our playbook loaded
...
2022-08-03 10:53:14.116 INFO importing actions from my_playbook_repo.report_scheduling_failure
...
$ robusta playbooks list-dirs # see if your custom action package was loaded
======================================================================
Listing playbooks directories
======================================================================
======================================================================
Stored playbooks directories:
robusta-pending-pod-playbook
======================================================================
Connecting the trigger to the action - a playbook is born!
We need to add a custom playbook that runs this action to generated_values.yaml:
# SNIP! existing contents of the file removed for clarity...

# This is your custom playbook
customPlaybooks:
- triggers:
  - on_event_create: {}
  actions:
  - report_scheduling_failure: {}

# This enables loading custom playbooks
playbooksPersistentVolume: true
Time to update Robusta's config with the new generated_values.yaml:
helm upgrade robusta robusta/robusta --values=generated_values.yaml
After a minute or two Robusta will be ready. Let's run these commands to see that the new playbook is loaded:
$ robusta logs # get robusta logs, see no errors
...
...
$ robusta playbooks list # see all the playbooks. Run it after a few minutes
...
--------------------------------------
triggers:
- on_event_create: {}
actions:
- report_scheduling_failure: {}
--------------------------------------
...
Great!
Note
If you haven't already, make sure to clean up the pod from the last section by running kubectl delete pod nginx
Now for the final check, let's deploy the misconfigured pod again:
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: test
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  nodeSelector:
    spoiler: alert
And start the pod by running the following command:
$ kubectl apply -f nginx-pod.yaml
pod/nginx created
Now, check out the Slack channel (or whichever sink you configured); you should see the enrichment message sent by our action.
Cleaning up
kubectl delete pod nginx # delete the pod
robusta playbooks delete <PLAYBOOK_FOLDER> # remove the playbook we just added from Robusta
# Remove "customPlaybooks" and "playbooksPersistentVolume" from your config, and then run helm upgrade
helm upgrade robusta robusta/robusta --values=generated_values.yaml
Summary
We learned how to solve a real problem (a pod that fails to schedule) only once, and let Robusta automate the detection from now on, for all our happy co-workers (and our future selves) to enjoy.
This example of an unschedulable pod is actually covered by Robusta out of the box (if you enable the built-in Prometheus stack), but you can see how easy it is to track any error you like and send it to a notification system with extra data.