
Design Patterns for Self-Healing Infrastructure: An Event-Driven Approach

Sritej Panchumarthi · Published: March 10, 2026 · Technical Report · 30 min read

Abstract
As infrastructure scales, manual incident response becomes unsustainable. This report details the architecture of a "Self-Healing Risk Engine" that automates the detection and remediation of common infrastructure failures. We present an event-driven design using Amazon EventBridge, AWS Step Functions, and AWS Systems Manager to close the loop between observability and action.
Key takeaway: Self-healing infrastructure should automate known, reversible failure modes. The mature pattern is not "let automation do anything"; it is observe, classify, remediate with guardrails, record evidence, and escalate when confidence is low.

1. Introduction

The transition from Senior to Staff Engineering is marked by a shift from incident resolution to systemic prevention. The goal is to create autonomous systems that can recover from known failure modes without human intervention.

This document outlines a production-grade architecture for automated remediation, addressing drift detection, security violations, and resource exhaustion.

2. The Architecture of Autonomy

We move beyond static alerting to an event-driven control loop. Infrastructure state changes are captured as discrete events, which trigger state machines responsible for diagnosis and remediation.

Fig 1. The Self-Healing Loop
[ AWS Resource ] --(State Change)--> [ EventBridge ]
                                          |
                                          v
                                  [ Step Functions ] <--- (State Machine)
                                          |
                                  +-------+-------+
                                  |               |
                            [ Diagnosis ]   [ Remediation ]
                            (Lambda/Py)     (SSM Document)
                                  |               |
                                  v               v
                            [ Jira Ticket ]  [ Slack Alert ]
                            (Auto-Closed)    (Notification)
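
One way to express the loop in Fig 1 is an Amazon States Language definition. The sketch below builds such a definition as a Python dictionary. The state names, the `$.diagnosis.known` field, and the placeholder ARNs are assumptions for illustration, and remediation is shown as a Lambda task for brevity rather than an SSM document.

```python
import json

# Placeholder ARNs (assumptions): substitute real Lambda function ARNs.
DIAGNOSE_ARN = "arn:aws:lambda:REGION:ACCOUNT_ID:function:diagnose"
REMEDIATE_ARN = "arn:aws:lambda:REGION:ACCOUNT_ID:function:remediate"
ESCALATE_ARN = "arn:aws:lambda:REGION:ACCOUNT_ID:function:escalate"


def build_state_machine():
    """Return an ASL definition: diagnose, then remediate or escalate."""
    return {
        "Comment": "Self-healing loop (illustrative sketch)",
        "StartAt": "Diagnose",
        "States": {
            "Diagnose": {
                "Type": "Task",
                "Resource": DIAGNOSE_ARN,
                "ResultPath": "$.diagnosis",
                "Next": "IsKnownFailure",
            },
            # Only failures the diagnosis step recognizes are remediated.
            "IsKnownFailure": {
                "Type": "Choice",
                "Choices": [{
                    "Variable": "$.diagnosis.known",
                    "BooleanEquals": True,
                    "Next": "Remediate",
                }],
                "Default": "Escalate",
            },
            "Remediate": {
                "Type": "Task",
                "Resource": REMEDIATE_ARN,
                "Retry": [{"ErrorEquals": ["States.TaskFailed"], "MaxAttempts": 2}],
                # Any error left after the retries falls through to a human.
                "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "Escalate"}],
                "End": True,
            },
            "Escalate": {"Type": "Task", "Resource": ESCALATE_ARN, "End": True},
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_state_machine(), indent=2))
```

The `Default` branch and the `Catch` clause both route to escalation, so anything the engine cannot classify or fix ends up with a person instead of being retried indefinitely.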

3. Case Study: Automated Disk Space Recovery

Scenario: A failure in log rotation leads to root volume exhaustion on a critical EC2 instance.

3.1 Detection

The CloudWatch Agent is configured to emit the `disk_used_percent` metric. An alarm is defined with a threshold of 85%.
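
As a rough sketch, the alarm could be created with boto3's `put_metric_alarm`. The alarm name, the dimension values (`path`, `device`, `fstype`), the period, and the evaluation count are assumptions; the dimension set must match exactly what the agent publishes in your environment, and `InstanceId` is only present if the agent is configured to append it.

```python
def disk_alarm_params(instance_id, path="/", device="nvme0n1p1",
                      fstype="xfs", threshold=85.0):
    """Build keyword arguments for CloudWatch put_metric_alarm."""
    return {
        "AlarmName": f"disk-used-{instance_id}",  # hypothetical naming scheme
        "Namespace": "CWAgent",                   # the agent's default namespace
        "MetricName": "disk_used_percent",
        "Dimensions": [
            {"Name": "InstanceId", "Value": instance_id},
            {"Name": "path", "Value": path},
            {"Name": "device", "Value": device},
            {"Name": "fstype", "Value": fstype},
        ],
        "Statistic": "Average",
        "Period": 300,             # seconds per datapoint (assumption)
        "EvaluationPeriods": 2,    # two consecutive breaches (assumption)
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "missing",
    }


def create_disk_alarm(instance_id, **overrides):
    """Create the alarm; needs boto3 and AWS credentials at call time."""
    import boto3
    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(**disk_alarm_params(instance_id, **overrides))
```

Keeping the parameter construction separate from the API call makes the alarm definition easy to review and unit test without touching AWS.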

3.2 Event Routing

The alarm state change triggers an EventBridge rule, routing the event payload to the remediation engine.

# Terraform: Event Rule
resource "aws_cloudwatch_event_rule" "disk_space" {
  name        = "auto-remediate-disk-space"
  description = "Triggered when disk usage > 85%"
  event_pattern = jsonencode({
    source      = ["aws.cloudwatch"]
    detail-type = ["CloudWatch Alarm State Change"]
    detail = {
      state = { value = ["ALARM"] }
      configuration = {
        # Leaf values in an event pattern must be arrays. Fields inside the
        # event's "metrics" array are matched by nesting objects directly.
        metrics = { metricStat = { metric = { name = ["disk_used_percent"] } } }
      }
    }
  })
}
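
For orientation, the event delivered by this rule has roughly the shape below. The sample is trimmed to the fields relevant here and all values are made up; `extract_instance_id` is a hypothetical helper that reads the instance ID defensively instead of assuming every key is present.

```python
# Trimmed, illustrative CloudWatch Alarm State Change event (values are made up).
SAMPLE_EVENT = {
    "source": "aws.cloudwatch",
    "detail-type": "CloudWatch Alarm State Change",
    "detail": {
        "alarmName": "disk-used-i-0123456789abcdef0",
        "state": {"value": "ALARM"},
        "configuration": {
            "metrics": [{
                "id": "m1",
                "metricStat": {
                    "metric": {
                        "namespace": "CWAgent",
                        "name": "disk_used_percent",
                        "dimensions": {"InstanceId": "i-0123456789abcdef0"},
                    },
                    "period": 300,
                    "stat": "Average",
                },
                "returnData": True,
            }],
        },
    },
}


def extract_instance_id(event):
    """Return the InstanceId of the first metric that has one, else None."""
    metrics = event.get("detail", {}).get("configuration", {}).get("metrics", [])
    for entry in metrics:
        dims = entry.get("metricStat", {}).get("metric", {}).get("dimensions", {})
        if "InstanceId" in dims:
            return dims["InstanceId"]
    return None
```

Returning `None` for an unexpected payload lets the caller escalate rather than crash on a `KeyError`.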

3.3 Remediation Logic

A Lambda function extracts the Instance ID from the event and invokes an AWS Systems Manager (SSM) Run Command to clear archived logs.

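# Python: Remediation Logic
import boto3

ssm = boto3.client('ssm')

def handler(event, context):
    # The alarm's first metric carries the instance ID as a dimension
    metric = event['detail']['configuration']['metrics'][0]['metricStat']['metric']
    instance_id = metric['dimensions']['InstanceId']

    # Clean logs first: trim the journal and delete compressed, rotated logs
    response = ssm.send_command(
        InstanceIds=[instance_id],
        DocumentName='AWS-RunShellScript',
        Parameters={'commands': ['journalctl --vacuum-time=1d', 'rm -f /var/log/*.gz']}
    )

    # Keep the command ID so the outcome can be checked and audited later
    command_id = response['Command']['CommandId']
    print(f"Remediation {command_id} sent to {instance_id}")
    return {'instance_id': instance_id, 'command_id': command_id}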
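
Sending the command does not confirm that it succeeded. A minimal sketch of a follow-up check is shown below, assuming the command ID returned by `send_command` (under `Command.CommandId`) is available. `wait_for_command`, its timeout values, and the `TimedOutWaiting` return value are illustrative, not part of the engine described here.

```python
import time

# Statuses after which an SSM command invocation will not change again.
TERMINAL_STATUSES = {"Success", "Failed", "Cancelled", "TimedOut"}


def wait_for_command(ssm, command_id, instance_id,
                     timeout=120, poll=5, sleep=time.sleep):
    """Poll get_command_invocation until a terminal status or the timeout.

    `ssm` is a boto3 SSM client (or any object with the same method).
    Returns the final status string, or "TimedOutWaiting" if we gave up.
    """
    waited = 0
    while True:
        result = ssm.get_command_invocation(
            CommandId=command_id, InstanceId=instance_id
        )
        status = result["Status"]
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout:
            return "TimedOutWaiting"
        sleep(poll)
        waited += poll
```

Anything other than `Success` is a reasonable trigger for escalation. A production version would also need to tolerate the short window right after `send_command` in which the invocation may not be queryable yet.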

4. Case Study: Security Group Drift

A developer accidentally opens port 22 (SSH) to 0.0.0.0/0. This is a critical-severity finding.

  1. Detection: AWS Config Rule `restricted-ssh` detects the non-compliant security group.
  2. Remediation: AWS Systems Manager Automation (`AWS-DisablePublicAccessForSecurityGroup`) is triggered automatically.
  3. Notification: The developer gets a Slack DM saying "I closed port 22 for you. Please use the VPN."
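
To make the detection step concrete, here is a small sketch of the check itself, written against the dictionary shape returned by EC2's `describe_security_groups`. It is not how the managed Config rule is implemented; `allows_public_ssh` is a hypothetical helper that could be used to re-verify a group before or after remediation.

```python
def allows_public_ssh(security_group):
    """True if any ingress rule exposes TCP port 22 to 0.0.0.0/0 or ::/0."""
    for perm in security_group.get("IpPermissions", []):
        protocol = perm.get("IpProtocol")
        if protocol == "-1":
            covers_ssh = True  # "-1" means all protocols and all ports
        elif protocol == "tcp":
            covers_ssh = perm.get("FromPort", 0) <= 22 <= perm.get("ToPort", 65535)
        else:
            covers_ssh = False
        if not covers_ssh:
            continue

        open_v4 = any(r.get("CidrIp") == "0.0.0.0/0"
                      for r in perm.get("IpRanges", []))
        open_v6 = any(r.get("CidrIpv6") == "::/0"
                      for r in perm.get("Ipv6Ranges", []))
        if open_v4 or open_v6:
            return True
    return False
```

Checking port ranges and IPv6 matters: a rule for ports 0-1024 or for `::/0` exposes SSH just as much as the textbook `22` to `0.0.0.0/0` case.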

5. Safety Mechanisms: The Circuit Breaker

Danger: What if the remediation logic is flawed and starts rebooting every server in production?

We implement a Circuit Breaker in DynamoDB.

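import time
import boto3

# Table name is illustrative; the partition key is 'id'.
table = boto3.resource('dynamodb').Table('remediation-circuit-breaker')

def check_circuit_breaker(instance_id, limit=3):
    # Atomically count remediations for this instance in the current clock hour
    resp = table.update_item(
        Key={'id': f"{instance_id}#{int(time.time() // 3600)}"},
        UpdateExpression="ADD #n :one",
        ExpressionAttributeNames={'#n': 'count'},  # 'count' is a reserved word
        ExpressionAttributeValues={':one': 1},
        ReturnValues='UPDATED_NEW')
    if resp['Attributes']['count'] > limit:
        raise RuntimeError("Circuit Breaker Open: Manual Intervention Required")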

6. Operationalizing Self-Healing Safely

Automated remediation needs the same engineering discipline as application code. Every action should have a known owner, a dry-run mode, an audit trail, a rollback plan, and a confidence threshold. If the system cannot classify the problem, it should create context for humans instead of guessing.

| Failure type | Safe automated response | Escalate when |
| --- | --- | --- |
| Disk pressure | Clean known cache/log paths and report freed space | Usage remains high or repeats more than three times per hour |
| Public SSH exposure | Remove public ingress and notify owner | The rule belongs to a break-glass account or protected security group |
| Certificate expiration | Renew managed certificate or open ticket with exact dependency path | External DNS validation or third-party ownership is required |

Production readiness checklist:
  • Every remediation has idempotent behavior and a maximum retry count.
  • All actions are recorded in CloudTrail and logs, and are reflected in ticket history and alert notifications.
  • Critical resources can opt out through tags such as AutoRemediate=false.
  • Humans can disable the remediation engine quickly during incidents.
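
The guardrails above can be combined into a single decision function that runs before any action is taken. The sketch below is one possible shape, not the engine's actual logic: the `Finding` fields, the `KNOWN_FAILURES` set, the 0.8 confidence threshold, and the returned labels are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Finding:
    resource_id: str
    failure_type: str
    confidence: float                      # 0.0 to 1.0, from the diagnosis step
    tags: dict = field(default_factory=dict)


# Failure modes with a reviewed, reversible remediation (illustrative).
KNOWN_FAILURES = {"disk_pressure", "public_ssh"}


def decide(finding, engine_enabled=True, dry_run=False, min_confidence=0.8):
    """Return one of: disabled, skip_opt_out, escalate, dry_run, remediate."""
    if not engine_enabled:
        return "disabled"          # global kill switch for use during incidents
    if finding.tags.get("AutoRemediate", "true").lower() == "false":
        return "skip_opt_out"      # resource owner opted out via tag
    if finding.failure_type not in KNOWN_FAILURES:
        return "escalate"          # unknown problem: give humans context
    if finding.confidence < min_confidence:
        return "escalate"          # known type, but diagnosis is unsure
    return "dry_run" if dry_run else "remediate"
```

Ordering the checks from most to least restrictive means the kill switch and opt-out tags always win, regardless of how confident the diagnosis is.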

7. Conclusion

Self-healing infrastructure transforms the SRE role from reactive incident response to proactive system design. By automating the resolution of known failure modes, organizations can achieve higher availability and allow engineers to focus on complex, novel problems.
