What failure modes are safe to auto-remediate?

Automate failures that are known, frequent, reversible, and diagnosable by machine: disk pressure from log growth, security group drift, expired managed certificates, stuck deployments with clean rollback paths, and unhealthy instances behind load balancers. Never automate novel failures, irreversible actions (data deletion, DNS cutover), or anything whose diagnosis requires human judgment.

Why use Step Functions instead of a single Lambda for remediation?

Remediation is a workflow, not a function: verify the alarm is real, check the circuit breaker, check opt-out tags, attempt the fix, verify the fix worked, record evidence, escalate on failure. Step Functions gives you explicit states, retries with backoff, timeouts, human-approval branches, and a visual audit trail of exactly what the machine decided and when — all of which a monolithic Lambda hides in log lines.

What is a remediation circuit breaker and why is it mandatory?

A circuit breaker counts remediation attempts per resource per time window (for example, 3 per hour in DynamoDB with TTL). When the count exceeds the threshold, automation stops and pages a human. It is mandatory because a flawed remediation in a feedback loop is an outage generator: the fix triggers the alarm, which triggers the fix. The breaker converts an infinite loop into a single page.

How do I safely roll out automated remediation to production?

Three stages. Stage 1: observe-only — the engine runs the full workflow but the action step only logs what it would do. Stage 2: notify-and-act with opt-out — remediation runs on resources unless tagged AutoRemediate=false, with every action posted to Slack. Stage 3: full automation with the circuit breaker and a global kill switch (one SSM parameter that pauses the engine). Promote between stages on evidence, typically two weeks per stage.

How should self-healing automation be tested?

Chaos-style fault injection in a staging environment: fill a disk with fallocate and verify the cleanup path frees space and resolves the alarm; open port 22 on a test security group and verify Config detects and the automation closes it; force the circuit breaker by triggering four remediations in an hour and verify it opens and pages. Every remediation path gets a game day before it gets production.

Does self-healing replace on-call engineers?

No — it changes what pages them. Automation handles the known, repetitive 60-80% of incidents (disk, drift, restarts), and the pages that remain are genuinely novel problems worth human attention. The on-call experience improves precisely because the boring pages disappear; the engineering skill shifts from firefighting to designing and auditing the automation.

Design Patterns for Self-Healing Infrastructure: An Event-Driven Approach

Sritej Panchumarthi · Published: March 10, 2026 · Updated: July 7, 2026 · Technical Report · 50 min read

Abstract
As infrastructure scale increases, manual incident response becomes unsustainable. This report details the complete architecture of a "Self-Healing Risk Engine" that automates detection and remediation of common infrastructure failures. We present an event-driven design using AWS EventBridge, Step Functions, Lambda, and Systems Manager that closes the loop between observability and action — specified to L3 depth, with the full state machine, working remediation code, layered safety mechanisms (circuit breakers, opt-out tags, kill switches, dry-run modes), and a chaos-style game-day practical to prove the system before production trusts it.

Key takeaway: Self-healing infrastructure should automate known, reversible failure modes. The mature pattern is not "let automation do anything"; it is observe, classify, remediate with guardrails, record evidence, and escalate when confidence is low. An automation you cannot instantly stop is not an automation — it is a liability with an API.

1. Introduction: From Incident Resolution to Systemic Prevention

The transition from senior to staff-level engineering is marked by a shift from resolving incidents to preventing their recurrence structurally. Consider the arithmetic: if your platform generates twenty disk-pressure pages, twelve security-group drift findings, and eight stuck-deployment alerts per month, that is forty interrupts. At 30–60 minutes of interrupt-driven engineering time each, roughly 20–40 engineer-hours a month go to executing procedures a machine could execute faster, more consistently, and with a complete audit trail.
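The estimate can be reproduced directly from the page counts and per-page cost quoted above. The following is a minimal sketch; the dictionary keys and the function name are illustrative, not part of the engine.

```python
# Monthly page counts and per-page cost as quoted in the introduction.
PAGES_PER_MONTH = {"disk_pressure": 20, "sg_drift": 12, "stuck_deployment": 8}
MINUTES_PER_PAGE = (30, 60)  # low and high estimate of interrupt cost per page


def monthly_toil_hours(pages, minutes_range):
    """Return (low, high) engineer-hours per month spent on these pages."""
    total_pages = sum(pages.values())
    low_minutes, high_minutes = minutes_range
    return total_pages * low_minutes / 60, total_pages * high_minutes / 60


low_hours, high_hours = monthly_toil_hours(PAGES_PER_MONTH, MINUTES_PER_PAGE)
print(f"{sum(PAGES_PER_MONTH.values())} pages/month: "
      f"{low_hours:.0f} to {high_hours:.0f} engineer-hours")
```

Swapping in your own paging data gives a quick first estimate of how much toil the automation could remove.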

This document outlines a production-grade architecture for automated remediation, addressing drift detection, security violations, and resource exhaustion — and, just as importantly, the safety engineering that makes such automation trustworthy.

2. The Architecture of Autonomy — L3 View

We move beyond static alerting to an event-driven control loop. Infrastructure state changes are captured as discrete events, which trigger state machines responsible for diagnosis, guarded remediation, verification, and evidence capture. The L3 diagram below shows every component, the IAM principal behind each action, and — critically — the safety interlocks (shaded) that gate every destructive step.

Fig 1. L3 Component Architecture — The Self-Healing Control Loop with Safety Interlocks

  SIGNAL SOURCES                         CONTROL PLANE
┌──────────────────┐
│ CloudWatch Alarms│──┐
│ (disk, CPU, 5xx) │  │   ┌─────────────────────────────────────────┐
├──────────────────┤  │   │ EventBridge default bus                 │
│ AWS Config Rules │──┼──►│  Rules (pattern-matched):               │
│ (drift findings) │  │   │  · alarm-state-change  → remediation SM │
├──────────────────┤  │   │  · config-noncompliant → remediation SM │
│ GuardDuty        │──┤   │  · guardduty-finding   → triage SM      │
│ (threat findings)│  │   │  · health-event        → notify only    │
├──────────────────┤  │   └────────────────┬────────────────────────┘
│ AWS Health /     │──┘                    │ (event payload as input)
│ Personal Health  │                       ▼
└──────────────────┘      ┌─────────────────────────────────────────┐
                          │ Step Functions "remediation-engine"     │
                          │ (Standard workflow — full audit trail)  │
                          │                                         │
                          │  [1] ValidateEvent (Lambda)             │
                          │       · re-query source: is it real?    │
                          │  [2] ▓ CheckKillSwitch ▓                │
                          │       · SSM param /selfheal/enabled     │
                          │  [3] ▓ CheckOptOutTag ▓                 │
                          │       · AutoRemediate=false → escalate  │
                          │  [4] ▓ CheckCircuitBreaker ▓            │
                          │       · DynamoDB attempts table         │
                          │       · >3/hr per resource → OPEN       │
                          │  [5] Classify (Lambda)                  │
                          │       · map finding → runbook ID        │
                          │       · unknown → escalate, never guess │
                          │  [6] Remediate                          │
                          │       · SSM Run Command / Automation    │
                          │       · scoped IAM: ONLY the documented │
                          │         actions for this runbook        │
                          │  [7] Verify (Lambda, after delay)       │
                          │       · did the metric/finding clear?   │
                          │  [8] RecordEvidence                     │
                          │       · DynamoDB history + S3 artifacts │
                          └───────┬─────────────────┬───────────────┘
                                  │ success         │ any guard fails
                                  ▼                 ▼
                       ┌──────────────────┐  ┌───────────────────────┐
                       │ Slack #ops-auto  │  │ PagerDuty escalation  │
                       │ "healed X on Y,  │  │ + Jira ticket with    │
                       │  evidence: link" │  │   full SM execution   │
                       │ Jira auto-closed │  │   history attached    │
                       └──────────────────┘  └───────────────────────┘

  ▓ shaded ▓ = safety interlock: evaluated BEFORE any action, every time
  IAM: engine role can ONLY invoke allow-listed SSM documents;
       each document's role can ONLY touch its documented resources

Three design decisions in this diagram carry most of the safety weight. Validation before action (step 1) re-queries the source of truth rather than trusting the event — alarms flap, and events arrive late. Classification maps to runbooks, never to improvisation (step 5) — if the finding doesn't match a documented runbook, the machine's only move is to escalate with context. Verification after action (step 7) closes the loop — a remediation that didn't fix the metric is a failure even if every API call succeeded.
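The "runbooks, never improvisation" rule in step 5 can be sketched as a plain lookup that falls back to UNKNOWN, which the state machine routes to escalation. This is a minimal sketch under assumptions: the allow-list contents, the DISK-001 identifier, and matching on a name prefix are hypothetical; only SG-001 and the UNKNOWN sentinel appear elsewhere in this report.

```python
UNKNOWN = "UNKNOWN"  # sentinel the state machine routes to Escalate

# Hypothetical allow-list: (event source, finding-name prefix) -> runbook ID.
RUNBOOKS = {
    ("aws.cloudwatch", "disk-used-"): "DISK-001",
    ("aws.config", "restricted-ssh"): "SG-001",
}


def classify(source: str, finding_name: str) -> str:
    """Map a finding to a documented runbook, or UNKNOWN if none matches."""
    for (allowed_source, prefix), runbook_id in RUNBOOKS.items():
        if source == allowed_source and finding_name.startswith(prefix):
            return runbook_id
    # No documented runbook: the only permitted move is to escalate.
    return UNKNOWN
```

Because the mapping is data rather than logic, adding a runbook is a reviewed change to the allow-list, and anything outside it cannot trigger an action.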

2.1 The State Machine, Concretely

{
  "Comment": "Self-healing remediation engine",
  "StartAt": "ValidateEvent",
  "States": {
    "ValidateEvent": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:...:function:validate-event",
      "Retry": [{"ErrorEquals": ["States.TaskFailed"], "MaxAttempts": 2, "BackoffRate": 2}],
      "Next": "CheckKillSwitch"
    },
    "CheckKillSwitch": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:...:function:check-kill-switch",
      "Next": "KillSwitchChoice"
    },
    "KillSwitchChoice": {
      "Type": "Choice",
      "Choices": [{"Variable": "$.engineEnabled", "BooleanEquals": false, "Next": "Escalate"}],
      "Default": "CheckGuards"
    },
    "CheckGuards": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:...:function:check-guards",
      "Comment": "Opt-out tag + circuit breaker in one call",
      "Next": "GuardChoice"
    },
    "GuardChoice": {
      "Type": "Choice",
      "Choices": [{"Variable": "$.guardsPassed", "BooleanEquals": false, "Next": "Escalate"}],
      "Default": "Classify"
    },
    "Classify": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:...:function:classify-finding",
      "Next": "KnownRunbook"
    },
    "KnownRunbook": {
      "Type": "Choice",
      "Choices": [{"Variable": "$.runbookId", "StringEquals": "UNKNOWN", "Next": "Escalate"}],
      "Default": "Remediate"
    },
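    "Remediate": {
      "Type": "Task",
      "Comment": "ResultPath keeps the original input so Verify and Escalate still see the resource context",
      "Resource": "arn:aws:states:::aws-sdk:ssm:startAutomationExecution",
      "Parameters": {
        "DocumentName.$": "$.runbookId",
        "Parameters.$": "$.runbookParams"
      },
      "ResultPath": "$.automation",
      "TimeoutSeconds": 600,
      "Catch": [{"ErrorEquals": ["States.ALL"], "ResultPath": "$.error", "Next": "Escalate"}],
      "Next": "WaitForSettle"
    },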
    "WaitForSettle": { "Type": "Wait", "Seconds": 120, "Next": "Verify" },
    "Verify": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:...:function:verify-remediation",
      "Next": "VerifyChoice"
    },
    "VerifyChoice": {
      "Type": "Choice",
      "Choices": [{"Variable": "$.resolved", "BooleanEquals": true, "Next": "RecordSuccess"}],
      "Default": "Escalate"
    },
    "RecordSuccess": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:...:function:record-evidence",
      "End": true
    },
    "Escalate": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:...:function:escalate-to-human",
      "Comment": "PagerDuty + Jira with full execution history",
      "End": true
    }
  }
}
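The check-kill-switch function referenced by the CheckKillSwitch state is not listed in this report. A minimal sketch of what it might look like follows, assuming the parameter stores the literal string "true" or "false" and that the function should fail closed (treat any read error as disabled). The optional ssm_client argument is an addition for testability, and the handler merges its result into the incoming event so later states keep their context.

```python
PARAM_NAME = "/selfheal/enabled"


def engine_enabled(ssm_client, param_name=PARAM_NAME) -> bool:
    """True only when the parameter is readable and holds 'true'."""
    try:
        response = ssm_client.get_parameter(Name=param_name)
    except Exception:
        # Fail closed: if the switch cannot be read, do not remediate.
        return False
    return response["Parameter"]["Value"].strip().lower() == "true"


def handler(event, context, ssm_client=None):
    if ssm_client is None:
        import boto3  # imported lazily so unit tests need no AWS access
        ssm_client = boto3.client("ssm")
    # Preserve the incoming payload and add the flag KillSwitchChoice reads.
    return {**event, "engineEnabled": engine_enabled(ssm_client)}
```

Failing closed means an IAM or throttling problem on the parameter read pauses automation rather than letting it run unsupervised.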

3. Case Study: Automated Disk Space Recovery

Scenario: A failure in log rotation leads to root volume exhaustion on a critical EC2 instance. Untreated, this progresses from degraded writes to a crashed service in hours.

3.1 Detection

The CloudWatch Agent emits disk_used_percent per mount point. Two thresholds matter: an 85% warning that triggers automation, and a 95% critical that pages a human immediately in parallel — automation gets a head start, but nobody bets the service on it.
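The two-threshold policy can be written as a small decision function. This is only an illustration of the policy; in the architecture described here the thresholds live in two CloudWatch alarms, and the action names below are hypothetical.

```python
WARN_PERCENT = 85.0   # automation starts here
CRIT_PERCENT = 95.0   # a human is paged here, in parallel with automation


def disk_actions(used_percent: float) -> list:
    """Return the actions that should be running at a given disk usage."""
    actions = []
    if used_percent >= WARN_PERCENT:
        actions.append("auto_remediate")
    if used_percent >= CRIT_PERCENT:
        actions.append("page_human")
    return actions
```

Note that crossing the critical threshold adds the page without cancelling the automation, which is what "in parallel" means above.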

3.2 Event Routing

# Terraform: Event rule → Step Functions target
resource "aws_cloudwatch_event_rule" "disk_space" {
  name        = "auto-remediate-disk-space"
  description = "Triggered when disk usage > 85%"
  event_pattern = jsonencode({
    source      = ["aws.cloudwatch"]
    detail-type = ["CloudWatch Alarm State Change"]
    detail = {
      state = { value = ["ALARM"] }
      alarmName = [{ prefix = "disk-used-" }]
    }
  })
}

resource "aws_cloudwatch_event_target" "to_engine" {
  rule     = aws_cloudwatch_event_rule.disk_space.name
  arn      = aws_sfn_state_machine.remediation_engine.arn
  role_arn = aws_iam_role.eventbridge_to_sfn.arn
}
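The event delivered by this rule is the raw alarm state change, while the remediation handler in the next section expects an instanceId field. One way to bridge the two, sketched here under the assumption that the alarm's metric carries an InstanceId dimension (the function name is illustrative), is to read the dimension out of the alarm configuration embedded in the event:

```python
from typing import Optional


def instance_id_from_alarm(event: dict) -> Optional[str]:
    """Return the InstanceId dimension of the alarm's metric, if present."""
    metrics = event.get("detail", {}).get("configuration", {}).get("metrics", [])
    for metric in metrics:
        dimensions = (metric.get("metricStat", {})
                            .get("metric", {})
                            .get("dimensions", {}))
        if "InstanceId" in dimensions:
            return dimensions["InstanceId"]
    # No instance dimension (for example a composite or math-only alarm).
    return None
```

A None result is a signal to escalate rather than guess which host the alarm refers to.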

3.3 Remediation Logic — Diagnose Before Deleting

The naive fix (rm -rf /var/log/*.gz) treats every disk-pressure event identically. Production remediation diagnoses first, acts on known-safe paths only, and reports exactly what it freed:

import time

import boto3

ssm = boto3.client("ssm")

# The ONLY paths automation may touch. Anything else → escalate.
SAFE_CLEANUP = [
    "journalctl --vacuum-time=2d",
    "find /var/log -name '*.gz' -mtime +3 -delete",
    "find /tmp -type f -atime +2 -delete",
    "docker system prune -f --filter 'until=48h' 2>/dev/null || true",
]

def handler(event, context):
    instance_id = event["instanceId"]

    # Phase 1: diagnose — what is actually consuming the disk?
    diag = ssm.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": [
            "df -h / | tail -1",
            "du -xhd1 /var 2>/dev/null | sort -rh | head -10",
        ]},
    )
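    diagnosis = wait_for_output(diag["Command"]["CommandId"], instance_id)
    if diagnosis["status"] != "Success":
        # No usable diagnosis means no evidence: skip cleanup so that the
        # Verify step finds usage unchanged and escalates to a human.
        return {"instanceId": instance_id, "diagnosis": diagnosis,
                "postCleanup": None}

    # Phase 2: act on safe paths only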
    fix = ssm.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": SAFE_CLEANUP + ["df -h / | tail -1"]},
    )
    result = wait_for_output(fix["Command"]["CommandId"], instance_id)

    return {
        "instanceId": instance_id,
        "diagnosis": diagnosis,      # evidence: what was consuming space
        "postCleanup": result,       # evidence: what freed, current usage
    }

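def wait_for_output(command_id, instance_id, timeout=120):
    """Poll until the command reaches a terminal state or `timeout` seconds pass."""
    for _ in range(timeout // 5):
        time.sleep(5)
        try:
            inv = ssm.get_command_invocation(CommandId=command_id,
                                             InstanceId=instance_id)
        except ssm.exceptions.InvocationDoesNotExist:
            continue   # invocation not registered yet; poll again
        if inv["Status"] in ("Success", "Failed", "TimedOut", "Cancelled"):
            return {"status": inv["Status"],
                    "stdout": inv["StandardOutputContent"][-2000:]}
    return {"status": "Timeout"}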

The Verify step (state machine step 7) then re-reads disk_used_percent. If usage is still above threshold after cleanup, the problem is not stale logs — it might be a runaway application writing data — and that is a human problem. The machine's diagnosis output becomes the first comment on the page.
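As a cheap complement to re-reading the CloudWatch metric, the final df line already captured in the cleanup output can be parsed to decide whether usage dropped below the threshold. The sketch below assumes the last non-empty stdout line is the `df -h /` row, as in the command list above; the function names and the 85% default are illustrative.

```python
import re
from typing import Optional


def used_percent_from_df(stdout: str) -> Optional[int]:
    """Extract the Use% value from the last non-empty line of df output."""
    lines = [line for line in stdout.splitlines() if line.strip()]
    if not lines:
        return None
    match = re.search(r"(\d+)%", lines[-1])
    return int(match.group(1)) if match else None


def verify(post_cleanup: Optional[dict], threshold: int = 85) -> dict:
    """Return the shape VerifyChoice expects: resolved is True only on proof."""
    used = None
    if post_cleanup and post_cleanup.get("status") == "Success":
        used = used_percent_from_df(post_cleanup.get("stdout", ""))
    # Missing or unparseable output counts as unresolved, so it escalates.
    return {"usedPercent": used, "resolved": used is not None and used < threshold}
```

Treating "could not measure" as "not resolved" keeps the default path pointed at a human.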

4. Case Study: Security Group Drift

A developer accidentally opens port 22 (SSH) to 0.0.0.0/0. This is a critical-severity finding: internet-wide scanners routinely probe newly exposed SSH ports within minutes, so the window before the first hostile connection attempt is short.

Fig 2. Security Drift Remediation Path — Detection to Closure in Under Two Minutes

  t+0s    Developer: authorize-security-group-ingress 0.0.0.0/0:22
             │
  t+~30s  AWS Config rule "restricted-ssh" evaluates → NON_COMPLIANT
             │
             ├──► EventBridge: config compliance-change event
             │        │
             ▼        ▼
  t+~60s  Remediation engine (guards pass, runbook = SG-001)
             │
             ├─ SSM Automation: AWS-DisablePublicAccessForSecurityGroup
             │    · removes ONLY the offending 0.0.0.0/0 rule
             │    · leaves all other rules intact
             │
  t+~90s  Verify: re-evaluate Config rule → COMPLIANT
             │
  t+~120s Evidence + notify:
             · Slack DM to rule author (from CloudTrail identity):
               "I closed public SSH on sg-0abc (rule SG-001).
                Use SSM Session Manager or the VPN. Evidence: "
             · Jira ticket created and auto-closed with diff
             · If the SG carries tag Protected=true → NO ACTION,
               page security on-call instead (break-glass SGs exist)

The notification detail matters more than it looks: the Slack DM goes to the specific engineer whose CloudTrail identity made the change, names the exact rule applied, and links the evidence. Automation that communicates like a competent colleague gets adopted; automation that silently mutates infrastructure gets disabled by the first team it surprises.

5. Safety Mechanisms in Depth

The nightmare scenario: flawed remediation logic in a feedback loop — the "fix" triggers the alarm, which triggers the "fix" — rebooting its way through production at machine speed. Every layer below exists to make that scenario structurally impossible.

5.1 The Circuit Breaker (DynamoDB, Atomic, With TTL)

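import time
import boto3
from botocore.exceptions import ClientError

ddb = boto3.client("dynamodb")
TABLE = "remediation-attempts"
MAX_ATTEMPTS = 3          # per resource
WINDOW_SECONDS = 3600     # per hour

def _conditional_update(key, update, condition, values) -> bool:
    """Run one conditional write; False means the condition did not hold."""
    try:
        ddb.update_item(TableName=TABLE, Key=key, UpdateExpression=update,
                        ConditionExpression=condition,
                        ExpressionAttributeValues=values)
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False
        raise

def check_circuit_breaker(resource_id: str, runbook_id: str) -> bool:
    """Returns True if remediation may proceed. Each write is atomic, so
    concurrent executions for the same resource cannot overshoot the limit."""
    now = int(time.time())
    key = {"pk": {"S": f"{resource_id}#{runbook_id}"}}
    one, now_n = {"N": "1"}, {"N": str(now)}

    # 1) No window yet, or the previous one has lapsed: start a new window.
    #    The expiry is compared explicitly because DynamoDB TTL deletion is
    #    asynchronous and can lag well behind expires_at.
    if _conditional_update(
            key,
            "SET attempts = :one, expires_at = :ttl",
            "attribute_not_exists(attempts) OR expires_at <= :now",
            {":one": one, ":now": now_n,
             ":ttl": {"N": str(now + WINDOW_SECONDS)}}):
        return True

    # 2) Window still active: take a slot only if one remains. expires_at is
    #    left untouched so the window stays fixed instead of sliding.
    #    False here means breaker OPEN → escalate to human.
    return _conditional_update(
        key,
        "SET attempts = attempts + :one",
        "attempts < :max AND expires_at > :now",
        {":one": one, ":now": now_n, ":max": {"N": str(MAX_ATTEMPTS)}})

The conditional writes are the point: two concurrent executions cannot both take the third slot. The window itself is enforced by comparing expires_at with the current time inside the condition, not by waiting for DynamoDB to delete the item, because TTL deletion runs in the background and can trail the expiry time by hours or days. TTL still earns its place: it garbage-collects stale rows with no cleanup job.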
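For unit tests and game-day rehearsal it can help to have a pure-Python model of the intended breaker semantics: at most three attempts per resource and runbook in a one-hour window, after which the window resets. The class below is a hypothetical test double with an injectable clock, not a replacement for the DynamoDB-backed check.

```python
import time
from typing import Callable, Dict, Tuple


class InMemoryBreaker:
    """Fixed-window attempt counter; a test double, not a production store."""

    def __init__(self, max_attempts: int = 3, window_seconds: int = 3600,
                 clock: Callable[[], float] = time.time):
        self.max_attempts = max_attempts
        self.window_seconds = window_seconds
        self.clock = clock
        # key -> (attempts used in the window, time the window expires)
        self._state: Dict[str, Tuple[int, float]] = {}

    def allow(self, resource_id: str, runbook_id: str) -> bool:
        key = f"{resource_id}#{runbook_id}"
        now = self.clock()
        attempts, expires_at = self._state.get(key, (0, 0.0))
        if now >= expires_at:
            # No window yet, or the old one lapsed: start a fresh window.
            self._state[key] = (1, now + self.window_seconds)
            return True
        if attempts < self.max_attempts:
            self._state[key] = (attempts + 1, expires_at)
            return True
        return False  # breaker open: escalate instead of remediating
```

Driving the clock by hand lets a test assert the Drill C expectation (three remediations, then a page) without waiting an hour.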

5.2 The Full Safety Stack

Layer	Mechanism	Protects against
Kill switch	SSM parameter `/selfheal/enabled`, checked on every execution before any guard or action runs	Any misbehavior: one CLI command pauses the entire engine
Opt-out tags	`AutoRemediate=false` honored before any action	Special-case resources (break-glass SGs, canary hosts, DBs mid-migration)
Circuit breaker	3 attempts / resource / hour, atomic in DynamoDB	Feedback loops and flapping resources
Runbook allow-list	Classifier maps to documented runbooks or escalates	Improvised, untested remediations
Scoped IAM	Engine may only invoke allow-listed SSM documents; each document's role touches only its documented resources	Blast radius of a compromised or buggy engine
Verification step	Re-check source of truth after acting	"Succeeded" API calls that didn't fix anything
Evidence trail	Every decision + output → DynamoDB history, S3 artifacts, CloudTrail	Unauditable automation; enables post-incident learning
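The first three layers of the table are evaluated in a fixed order before any action. The sketch below shows one way to express that ordering; the GuardResult type, the reason strings, and the breaker_allows callable are hypothetical names. The breaker is deliberately consulted last and lazily, on the assumption that checking it consumes an attempt slot, so a paused engine or an opted-out resource never uses one up.

```python
from dataclasses import dataclass
from typing import Callable, Mapping


@dataclass(frozen=True)
class GuardResult:
    guards_passed: bool
    reason: str


def check_guards(engine_enabled: bool,
                 tags: Mapping[str, str],
                 breaker_allows: Callable[[], bool]) -> GuardResult:
    """Evaluate kill switch, opt-out tag, then circuit breaker, in that order."""
    if not engine_enabled:
        return GuardResult(False, "kill-switch")
    if tags.get("AutoRemediate", "").lower() == "false":
        return GuardResult(False, "opt-out-tag")
    # Called only when the cheaper guards pass, because it records an attempt.
    if not breaker_allows():
        return GuardResult(False, "circuit-breaker-open")
    return GuardResult(True, "ok")
```

Returning a reason alongside the boolean gives the escalation message something concrete to say about why automation stood down.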

6. Hands-On Practical: Game Day for Your Automation

Never let a remediation path meet production before it has survived a staged fault injection. This game day takes about two hours in a staging account and proves all three critical properties: the fix works, the guards hold, and the kill switch kills.

6.1 Drill A — Disk Pressure (prove the fix works)

# On a staging instance, force disk usage past the threshold:
sudo fallocate -l 25G /var/log/gameday-filler.log
watch -n 30 df -h /     # observe usage cross 85%

# Expected within ~5 minutes:
#  1. CloudWatch alarm fires → state machine executes
#  2. Slack #ops-auto: diagnosis (top consumers) + freed space report
#  3. Verify step: if fallocate file wasn't in a safe path (it isn't!),
#     usage stays high → engine ESCALATES with the diagnosis attached.
#     ✔ This is the correct outcome: unknown consumer → human decision.

# Now test the happy path with reclaimable files:
sudo bash -c 'for i in $(seq 1 500); do
  gzip -c /var/log/syslog > /var/log/old-$i.gz; done'
sudo touch -d "5 days ago" /var/log/old-*.gz
# → alarm → cleanup deletes aged .gz files → verify passes → auto-close

6.2 Drill B — Security Drift (prove detection-to-closure time)

aws ec2 authorize-security-group-ingress \
  --group-id sg-STAGING123 --protocol tcp --port 22 --cidr 0.0.0.0/0
date +%s   # start the clock

# Expected: rule removed and Slack DM received in < 2 minutes.
# Then verify the protected-resource guard:
aws ec2 create-tags --resources sg-BREAKGLASS --tags Key=Protected,Value=true
aws ec2 authorize-security-group-ingress \
  --group-id sg-BREAKGLASS --protocol tcp --port 22 --cidr 0.0.0.0/0
# → NO auto-remediation; security on-call paged instead  ✔

6.3 Drill C — Circuit Breaker and Kill Switch (prove the guards hold)

# Trigger the same alarm 4 times within an hour (re-fill disk after each fix):
# attempts 1-3 → remediated; attempt 4 → breaker OPEN, PagerDuty page  ✔

# Kill switch: pause the entire engine in one command —
aws ssm put-parameter --name /selfheal/enabled --value "false" --overwrite
# trigger any alarm → execution short-circuits to Escalate  ✔
aws ssm put-parameter --name /selfheal/enabled --value "true" --overwrite

Game-day exit criteria (all must pass):

Happy path: fault injected → remediated → verified → evidence recorded → auto-closed.
Unknown-consumer path: engine escalates with diagnosis rather than guessing.
Protected resources: opt-out tag respected, humans paged.
Circuit breaker opens on the 4th attempt within the window.
Kill switch stops all remediation within one execution cycle.

7. Operationalizing Self-Healing Safely

Automated remediation needs the same engineering discipline as application code — and a staged rollout. Stage 1 (observe-only): the engine runs end-to-end but the Remediate step only logs its intended action; you review two weeks of "would have done" against reality. Stage 2 (act with opt-out): remediation runs unless tagged out, every action announced in Slack. Stage 3 (full automation): breaker + kill switch active, quarterly game days keep it honest.
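The observe-only stage amounts to wrapping the action in a gate that logs intent instead of executing. A minimal sketch follows; the Stage enum, the function name, and the log format are assumptions made for illustration.

```python
from enum import Enum
from typing import Callable


class Stage(Enum):
    OBSERVE_ONLY = 1
    ACT_WITH_OPT_OUT = 2
    FULL_AUTOMATION = 3


def run_remediation(stage: Stage, runbook_id: str,
                    action: Callable[[], object],
                    log: Callable[[str], None]) -> dict:
    """Execute the action unless the engine is still in observe-only mode."""
    if stage is Stage.OBSERVE_ONLY:
        # Record what would have happened so it can be reviewed later.
        log(f"[dry-run] would run {runbook_id}")
        return {"executed": False, "runbookId": runbook_id}
    result = action()
    log(f"ran {runbook_id}")
    return {"executed": True, "runbookId": runbook_id, "result": result}
```

Keeping the gate in one place means promotion between stages is a configuration change, not a code change in every runbook.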

Failure type	Safe automated response	Escalate when
Disk pressure	Clean known cache/log paths; report freed space with diagnosis	Usage remains high, unknown consumer, or >3 events/hour
Public SSH exposure	Remove public ingress; DM the author with evidence	Security group tagged Protected (break-glass) or owned by security tooling
Certificate expiration	Renew managed cert or open ticket with the exact dependency path	External DNS validation or third-party ownership required
Unhealthy ASG instance	Move to Standby (out of load-balancer rotation), capture logs + heap/thread dumps to S3, then terminate-and-replace	More than 20% of the fleet unhealthy (systemic, not instance-local)
Stuck deployment	Roll back via pipeline API if canary metrics regressed	Database migration was part of the release (rollback ≠ revert)
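The "escalate when" column is what keeps an instance-local fix from being applied to a systemic failure. For the unhealthy-instance row, the 20% rule can be sketched as a single predicate; the function and parameter names are illustrative.

```python
def should_replace_instance(unhealthy: int, fleet_size: int,
                            max_unhealthy_fraction: float = 0.20) -> bool:
    """True if replacing one instance is safe; False means escalate."""
    if fleet_size <= 0 or unhealthy <= 0:
        return False  # nothing to replace, or no fleet data to reason about
    # Above the fraction, the problem is likely systemic, not instance-local.
    return unhealthy / fleet_size <= max_unhealthy_fraction
```

With the default fraction, one or two unhealthy hosts in a fleet of ten are replaced automatically, while three or more hand the decision to a human.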

Production readiness checklist:

Every remediation is idempotent and carries a maximum retry count.
Every action is recorded in CloudTrail, the Step Functions execution history, and the ticket history, and is announced in the alert channel.
Critical resources can opt out via AutoRemediate=false.
Humans can disable the engine in one command during incidents (and have practiced doing so).
Each runbook names an owner and was game-day tested in the last quarter.

8. FAQ

What's safe to auto-remediate?
Known, frequent, reversible, machine-diagnosable failures: disk pressure, SG drift, managed cert renewal, unhealthy-instance replacement, clean rollbacks. Never: novel failures, irreversible actions, anything requiring judgment.

Why Step Functions instead of one Lambda?
Remediation is a workflow with guards, retries, verification, and escalation branches. Step Functions makes every decision visible and auditable; a monolithic Lambda hides them in logs.

Why is the circuit breaker mandatory?
A flawed fix in a feedback loop is an outage generator. The breaker converts an infinite loop into a single page. The DynamoDB conditional write in §5.1 makes it race-proof.

How do I roll this out safely?
Observe-only → act-with-opt-out → full automation, promoting on evidence roughly every two weeks. The kill switch exists from day one.

How is it tested?
Game days (§6): inject the fault, watch the loop close, force the breaker, throw the kill switch. Quarterly, forever.

Does this replace on-call?
No — it deletes the boring 60–80% of pages so the remaining ones deserve a human. The skill shifts from firefighting to automation design and audit.

9. Conclusion

Self-healing infrastructure transforms the SRE role from reactive incident response to proactive system design. The architecture is honest about its limits: it automates the known and escalates the novel, it acts only through documented, tested, scoped runbooks, and every action it takes is guarded, verified, and evidenced. Build the safety stack first, prove it on a game day, and the automation earns the only currency that matters in production: trust.