Remediation - Robusta documentation

Playbook Action: alert_handling_job

Description

Creates a Kubernetes Job with the specified parameters.

In addition, the job pod receives the following alert parameters as environment variables:

ALERT_NAME

ALERT_STATUS

ALERT_OBJ_KIND - one of pod, deployment, node, job, daemonset, or None if the kind is unknown

ALERT_OBJ_NAME

ALERT_OBJ_NAMESPACE (If present)

ALERT_OBJ_NODE (If present)

ALERT_LABEL_{LABEL_NAME} for every label on the alert. For example, a label named foo becomes ALERT_LABEL_FOO.
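
As an illustration of how a job might consume the variables listed above, here is a minimal sketch that prints a few of them from a shell. The busybox image is an assumption (any image that ships a shell should work); the variable names are the ones listed above, and the variables are expanded by the shell inside the job pod.

```yaml
customPlaybooks:
- actions:
  - alert_handling_job:
      # Assumed image; replace with any image that contains a shell
      image: busybox
      command:
      - sh
      - -c
      # $ALERT_* variables are read from the job pod environment
      - 'echo "alert=$ALERT_NAME status=$ALERT_STATUS kind=$ALERT_OBJ_KIND name=$ALERT_OBJ_NAME"'
  triggers:
  - on_prometheus_alert: {}
```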

Example Config

Add this to your Robusta configuration (Helm values.yaml):

customPlaybooks:
- actions:
  - alert_handling_job:
      command:
      - perl
      - -Mbignum=bpi
      - -wle
      - print bpi(2000)
      image: string
  triggers:
  - on_prometheus_alert: {}

The above is an example. Try customizing the trigger and parameters.

Parameters

required:

image (str): The job image.

command (str list): The job command as array of strings

optional:

name (str) = robusta-action-job: Custom name for the job and job container.

namespace (str) = default: The created job namespace.

service_account (str): Job pod service account. If omitted, default is used.

restart_policy (str) = OnFailure: Job container restart policy

job_ttl_after_finished (int) = 120: Time to live, in seconds, after which a finished job is deleted. If omitted, jobs will not be deleted automatically.

notify (bool): Add a notification for creating the job.

wait_for_completion (bool) = True: Wait for the job to complete and attach its output. Only relevant when notify=true.

completion_timeout (int) = 300: Maximum number of seconds to wait for the job to complete. Only relevant when wait_for_completion=true.

backoff_limit (int): Specifies the number of retries before marking this job failed. Defaults to 6.

active_deadline_seconds (int): Specifies the duration in seconds, relative to the startTime, that the job may be active before the system tries to terminate it. The value must be a positive integer.

env (envvar list): Inject environment variables and secrets just like you do with a Kubernetes Job.
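
The optional parameters above can be combined in one playbook. The following is a sketch only: the image, job name, and the TARGET_ENV variable are hypothetical, and the env entry assumes the same name/value form used in a Kubernetes Job container spec.

```yaml
customPlaybooks:
- actions:
  - alert_handling_job:
      image: busybox                 # assumed image
      command: ["sh", "-c", "echo handling $ALERT_NAME"]
      name: alert-cleanup            # hypothetical job name
      namespace: default
      restart_policy: Never
      job_ttl_after_finished: 300    # delete the finished job after 5 minutes
      notify: true
      wait_for_completion: true      # only relevant because notify is true
      completion_timeout: 120
      backoff_limit: 2
      active_deadline_seconds: 100
      env:
      - name: TARGET_ENV             # hypothetical variable
        value: staging
  triggers:
  - on_prometheus_alert: {}
```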

Supported Triggers

on_prometheus_alert

Playbook Action: delete_pod

Description

Deletes a pod

Example Config

Add this to your Robusta configuration (Helm values.yaml):

customPlaybooks:
- actions:
  - delete_pod: {}
  triggers:
  - on_pod_delete: {}

The above is an example. Try customizing the trigger and parameters.

Parameters

No action parameters

Supported Triggers

This action can be manually triggered using the Robusta CLI:

robusta playbooks trigger delete_pod name=POD_NAME namespace=POD_NAMESPACE 

Playbook Action: delete_job

Description

Deletes the job from the cluster.

Example Config

Add this to your Robusta configuration (Helm values.yaml):

customPlaybooks:
- actions:
  - delete_job: {}
  triggers:
  - on_job_failure: {}

The above is an example. Try customizing the trigger and parameters.

Parameters

No action parameters

Supported Triggers

This action can be manually triggered using the Robusta CLI:

robusta playbooks trigger delete_job name=JOB_NAME namespace=JOB_NAMESPACE 

Playbook Action: alert_on_hpa_reached_limit

Description

Notifies when the HPA reaches its maximum replicas and allows fixing it.

Example Config

Add this to your Robusta configuration (Helm values.yaml):

customPlaybooks:
- actions:
  - alert_on_hpa_reached_limit: {}
  triggers:
  - on_horizontalpodautoscaler_update: {}

The above is an example. Try customizing the trigger and parameters.

Parameters

optional:

increase_pct (int) = 20: Increase the HPA max_replicas by this percentage.
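
To override the default increase, set increase_pct on the action. The value 50 below is only an example; for an HPA whose max_replicas is 10, a 50 percent increase corresponds to 15 replicas.

```yaml
customPlaybooks:
- actions:
  - alert_on_hpa_reached_limit:
      increase_pct: 50   # example value; the default is 20
  triggers:
  - on_horizontalpodautoscaler_update: {}
```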

Supported Triggers

Playbook Action: rollout_restart

Description

Performs a rollout restart on a Kubernetes workload. Supports events related to deployments, deploymentconfigs, daemonsets, and statefulsets.

Example Config

Add this to your Robusta configuration (Helm values.yaml):

customPlaybooks:
- actions:
  - rollout_restart: {}
  triggers:
  - on_prometheus_alert: {}

The above is an example. Try customizing the trigger and parameters.

Parameters

No action parameters

Supported Triggers

This action can be manually triggered using the Robusta CLI:

robusta playbooks trigger rollout_restart kind=RESOURCE_KIND name=RESOURCE_NAME 

Playbook Action: restart_named_rollout

Description

Performs a rollout restart on a named Argo Rollout.

Example Config

Add this to your Robusta configuration (Helm values.yaml):

customPlaybooks:
- actions:
  - restart_named_rollout:
      name: string
      namespace: string
  triggers:
  - on_prometheus_alert: {}

The above is an example. Try customizing the trigger and parameters.

Parameters

required:

name (str): Resource name

namespace (str): Resource namespace
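
A sketch with concrete values in place of the string placeholders. The rollout name, namespace, and alert name are hypothetical, and the alert_name filter on on_prometheus_alert is an assumption that is not described on this page.

```yaml
customPlaybooks:
- actions:
  - restart_named_rollout:
      name: checkout-rollout        # hypothetical rollout name
      namespace: production         # hypothetical namespace
  triggers:
  - on_prometheus_alert:
      # Assumed filter: only fire for this (hypothetical) alert
      alert_name: CheckoutHighErrorRate
```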

Supported Triggers

This action can be manually triggered using the Robusta CLI:

robusta playbooks trigger restart_named_rollout name=NAME namespace=NAMESPACE

Playbook Action: cordon

Description

Cordons a node, marking it as unschedulable.

Example Config

Add this to your Robusta configuration (Helm values.yaml):

customPlaybooks:
- actions:
  - cordon: {}
  triggers:
  - on_node_create: {}

The above is an example. Try customizing the trigger and parameters.

Parameters

No action parameters

Supported Triggers

This action can be manually triggered using the Robusta CLI:

robusta playbooks trigger cordon name=NODE_NAME 

Playbook Action: uncordon

Description

Uncordons a node, marking it as schedulable.

Example Config

Add this to your Robusta configuration (Helm values.yaml):

customPlaybooks:
- actions:
  - uncordon: {}
  triggers:
  - on_node_create: {}

The above is an example. Try customizing the trigger and parameters.

Parameters

No action parameters

Supported Triggers

This action can be manually triggered using the Robusta CLI:

robusta playbooks trigger uncordon name=NODE_NAME 

Playbook Action: drain

Description

Drains a node: marks it as unschedulable and evicts all pods from it. DaemonSet pods are skipped, as they tolerate unschedulable nodes by default.

Example Config

Add this to your Robusta configuration (Helm values.yaml):

customPlaybooks:
- actions:
  - drain: {}
  triggers:
  - on_node_create: {}

The above is an example. Try customizing the trigger and parameters.

Parameters

No action parameters

Supported Triggers

This action can be manually triggered using the Robusta CLI:

robusta playbooks trigger drain name=NODE_NAME 

Playbook Action: kubectl_command

Description

Runs a custom kubectl command inside a Kubernetes pod using a Job.

Use kubectl_command to run kubectl with dynamic placeholders:

$namespace: resource namespace

$kind: resource kind (e.g., Pod, Deployment)

$name: resource name

Example: Scale Down Deployment on Crash Loop

customPlaybooks:
- name: CrashLoopScaleDown
  triggers:
  - on_pod_crash_loop:
      restart_count: 3
  actions:
    - kubectl_command:
        description: "Scale Down Deployment"
        command: kubectl scale --replicas=0 deployment/payment-processing-worker -n $namespace

If the pod is in the production namespace, the command will be:

kubectl scale --replicas=0 deployment/payment-processing-worker -n production

Example: Delete Crashing Resource

This deletes the crashing resource by kind, name, and namespace:

kubectl delete $kind $name -n $namespace

For example, deleting a crashing pod named api-worker-1 in the staging namespace:

kubectl delete Pod api-worker-1 -n staging
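
Wrapped in a playbook, the delete command could look like the sketch below. It reuses the trigger from the scale-down example; the playbook name CrashLoopDelete and the description text are hypothetical.

```yaml
customPlaybooks:
- name: CrashLoopDelete            # hypothetical playbook name
  triggers:
  - on_pod_crash_loop:
      restart_count: 3
  actions:
  - kubectl_command:
      description: "Delete Crashing Resource"
      # $kind, $name and $namespace are filled in from the triggering resource
      command: kubectl delete $kind $name -n $namespace
```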

Example Config

Add this to your Robusta configuration (Helm values.yaml):

customPlaybooks:
- actions:
  - kubectl_command: {}
  triggers:
  - on_pod_create: {}

The above is an example. Try customizing the trigger and parameters.

Parameters

optional:

custom_annotations (str dict): Custom annotations to apply to the pod/job that runs the command.

command (str): The full kubectl command to run, formatted as a shell command string.

description (str): A description of the command that was run.

timeout (int) = 3600: The maximum time, in seconds, to wait for the kubectl command to complete.
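
A sketch that sets every parameter listed above. The describe command, the two-minute timeout, and the annotation key and value are illustrative choices, not defaults.

```yaml
customPlaybooks:
- actions:
  - kubectl_command:
      description: "Describe new pod"
      command: kubectl describe $kind $name -n $namespace
      timeout: 120                          # give up after two minutes
      custom_annotations:
        example.com/owner: platform-team    # hypothetical annotation
  triggers:
  - on_pod_create: {}
```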

Supported Triggers

any trigger

This action can be manually triggered using the Robusta CLI:

robusta playbooks trigger kubectl_command 
