Custom Automations
######################################################

.. note::

    It is recommended to read :ref:`Automation Basics` before starting this guide.

In previous tutorials, we configured automations using built-in actions defined in YAML. In this tutorial, we will write a custom action in Python code.

For educational purposes, we'll automate the investigation of a short, made-up (but realistic) error scenario.

The scenario
---------------------------------------

.. note::

    It is recommended to follow along with the described scenario and run all the commands on a test cluster.

Let's imagine that you want to create and deploy a new pod with the ``nginx`` image. Being the smart person that you are, you decide to save time by copy-pasting an existing YAML file. You change the pod name and image to nginx.

.. code-block:: yaml

    apiVersion: v1
    kind: Pod
    metadata:
      name: nginx
      labels:
        env: test
    spec:
      containers:
      - name: nginx
        image: nginx
        imagePullPolicy: IfNotPresent
      nodeSelector:
        spoiler: alert

You start the pod by running the following command:

.. code-block:: bash

    $ kubectl apply -f nginx-pod.yaml
    pod/nginx created

For some reason the pod doesn't start (note the "Pending" status):

.. code-block:: bash

    $ kubectl get pods
    NAME    READY   STATUS    RESTARTS   AGE
    nginx   0/1     Pending   0          5h19m

You wait a few minutes, but nothing changes. To investigate, you look at the event log:

.. code-block:: bash

    $ kubectl get event --field-selector involvedObject.name=nginx
    LAST SEEN   TYPE      REASON             OBJECT      MESSAGE
    64s         Warning   FailedScheduling   pod/nginx   0/1 nodes are available: 1 node(s) didn't match Pod's node affinity/selector.

Aha! "1 node(s) didn't match Pod's node affinity/selector." ALRIGHT!

.. note::

    You can see this event on an informative timeline in the Robusta UI. Check it out!

Wait, what does it mean? 😖 (Hint: check the YAML config for the spoiler.)

After searching online for some time, you find out that the YAML file you copied had a ``nodeSelector`` with the key-value pair "spoiler: alert", which means the pod can only be scheduled on nodes (machines) that have this label 🤦‍♂️.

From the `Kubernetes docs <https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/>`_:

.. pull-quote::

    You can constrain a Pod so that it is restricted to run on particular node(s), or to prefer to run on particular nodes. There are several ways to do this and the recommended approaches all use label selectors to facilitate the selection. **Often, you do not need to set any such constraints** ...

So you comment out those lines, run ``kubectl apply`` again, and all is well.

.. code-block:: yaml

    apiVersion: v1
    kind: Pod
    metadata:
      name: nginx
      labels:
        env: test
    spec:
      containers:
      - name: nginx
        image: nginx
        imagePullPolicy: IfNotPresent
      # nodeSelector:
      #   spoiler: alert

Wouldn't it be nice if we could automate the detection of issues like this?

.. note::

    Make sure to clean up the pod from this section by running ``kubectl delete pod nginx``

Automating the detection with a Robusta Playbook
--------------------------------------------------

What do we need to do?
------------------------

A playbook consists of two things:

- Trigger - we're going to use a built-in trigger
- Action - we're going to write our own action!

Finding the correct trigger
------------------------------

What is the correct trigger for the job?
We can think of two triggers that may fit:

- Creation of a new pod (because we created a new pod, ``nginx``)
- A Kubernetes Event being fired (because we ran ``kubectl get event`` to find the scheduling error)

Let's look at the :ref:`Kubernetes (API Server)` trigger section and try to find triggers for both. Go ahead and try to find them!

Okay! We find ``on_pod_create`` and ``on_event_create``.

We'll use ``on_event_create`` in this tutorial, because it is easier to identify scheduling issues by looking at the event.

Writing the action
--------------------

Now we need to write code that checks this event and reports it.

To find the correct event class that matches our trigger ``on_event_create``, take a look at :ref:`Events and Triggers`. Okay! We find out it's ``EventEvent``!

So we need to get the information, check for the scenario, and then report it (for more information about reporting, see :ref:`Creating Findings`).

Let's name our action ``report_scheduling_failure`` and write everything in a Python file:

.. code-block:: python

    from robusta.api import *


    @action
    def report_scheduling_failure(event: EventEvent):
        # We use EventEvent to get the event object.
        actual_event = event.get_event()
        print(f"This print will be shown in the robusta logs={actual_event}")
        if actual_event.type.casefold() == "Warning".casefold() and \
                actual_event.reason.casefold() == "FailedScheduling".casefold() and \
                actual_event.involvedObject.kind.casefold() == "Pod".casefold():
            _report_failed_scheduling(event, actual_event.involvedObject.name, actual_event.message)


    def _report_failed_scheduling(event: EventEvent, pod_name: str, message: str):
        custom_message = ""
        if "affinity/selector" in message:
            custom_message = "Your pod has a node 'selector' configured, which means it can't just run on any node. For more info, see: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodeselector"

        # This is how you send data to Slack or other destinations.
        # Note - it is sometimes better to create a Finding object instead of calling
        # event.add_enrichment, but that is out of the scope of this tutorial.
        event.add_enrichment([
            MarkdownBlock(f"Failed to schedule a pod named '{pod_name}'!\nerror: {message}\n\n{custom_message}"),
        ])

Before we proceed, we need to enable local playbook repositories in Robusta. Follow this quick guide to learn how to package your Python file for Robusta: :ref:`Custom playbook repositories`

Let's push the new action to Robusta:

.. code-block:: bash

    robusta playbooks push

Use these debugging commands to make sure your action (``report_scheduling_failure``) is loaded:

.. code-block:: bash

    $ robusta logs                 # get robusta logs, see errors, see our playbook loaded
    ...
    2022-08-03 10:53:14.116 INFO     importing actions from my_playbook_repo.report_scheduling_failure
    ...

    $ robusta playbooks list-dirs  # see if your custom action package was loaded
    ======================================================================
    Listing playbooks directories
    ======================================================================
    ======================================================================
    Stored playbooks directories:
    robusta-pending-pod-playbook
    ======================================================================

Connecting the trigger to the action - a **playbook** is born!
---------------------------------------------------------------

We need to add a custom playbook that runs this action to ``generated_values.yaml``:

.. code-block:: yaml

    # SNIP!
    # existing contents of the file removed for clarity...

    # This is your custom playbook
    customPlaybooks:
    - triggers:
      - on_event_create: {}
      actions:
      - report_scheduling_failure: {}

    # This enables loading custom playbooks
    playbooksPersistentVolume: true

Time to update Robusta's config with the new ``generated_values.yaml``:

.. code-block:: bash

    helm upgrade robusta robusta/robusta --values=generated_values.yaml

After a minute or two Robusta will be ready. Let's run these commands to see that the new playbook is loaded:

.. code-block:: bash

    $ robusta logs            # get robusta logs, see no errors
    ...
    ...

    $ robusta playbooks list  # see all the playbooks; run it after a few minutes
    ...
    --------------------------------------
    triggers:
    - on_event_create: {}
    actions:
    - report_scheduling_failure: {}
    --------------------------------------
    ...

Great!

.. note::

    If you haven't already, make sure to clean up the pod from the last section by running ``kubectl delete pod nginx``

Now for the final check, let's deploy the misconfigured pod again:

.. code-block:: yaml

    apiVersion: v1
    kind: Pod
    metadata:
      name: nginx
      labels:
        env: test
    spec:
      containers:
      - name: nginx
        image: nginx
        imagePullPolicy: IfNotPresent
      nodeSelector:
        spoiler: alert

And start the pod by running the following command:

.. code-block:: bash

    $ kubectl apply -f nginx-pod.yaml
    pod/nginx created

Now check out the Slack channel (sink). For example:

.. admonition:: Example Slack Message

    .. image:: /images/example_report_scheduling_failure.png

Cleaning up
--------------

.. code-block:: bash

    kubectl delete pod nginx   # delete the pod
    robusta playbooks delete   # remove the playbook we just added from Robusta

    # Remove "customPlaybooks" and "playbooksPersistentVolume" from your config, and then run helm upgrade
    helm upgrade robusta robusta/robusta --values=generated_values.yaml

Summary
-------------------------------------

We learned how to solve a real problem (a pod that fails to schedule) just once, and have Robusta detect it automatically in the future for all our happy co-workers (and our future selves) to enjoy.

This example of an unschedulable pod is actually covered by Robusta out of the box (if you enable the built-in Prometheus stack), but it shows how easy it is to track any error you like and send it to a notification system with extra data.
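
To illustrate that last point, here is a minimal, untested sketch of the same pattern applied to a different error. The action name ``report_failed_mount`` and the ``FailedMount`` event reason are illustrative assumptions (swap in whatever error you want to track); everything else reuses only the API shown in this tutorial:

.. code-block:: python

    from robusta.api import *


    @action
    def report_failed_mount(event: EventEvent):
        # Same pattern as report_scheduling_failure: grab the raw Kubernetes event,
        # match on its type and reason, and send an enrichment to the configured sinks.
        # "FailedMount" is an illustrative reason chosen for this sketch.
        actual_event = event.get_event()
        if actual_event.type.casefold() == "Warning".casefold() and \
                actual_event.reason.casefold() == "FailedMount".casefold():
            event.add_enrichment([
                MarkdownBlock(
                    f"Failed to mount a volume for '{actual_event.involvedObject.name}'!\n"
                    f"error: {actual_event.message}"
                ),
            ])

You would wire it up exactly like before: list the new action under ``on_event_create`` in ``customPlaybooks``, push the playbook, and run ``helm upgrade``.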