Why do AI assistants hallucinate Terraform arguments?

Training corpora mix multiple incompatible generations of Terraform: 0.11 HCL, 0.12+ syntax, community modules like terraform-aws-modules/eks, and raw provider resources across many provider versions. The model statistically blends them, producing arguments like enable_irsa = true on a raw aws_eks_cluster resource — an input that belongs to a community module, not the resource. The output is syntactically fluent and functionally invalid, which is precisely what makes it dangerous.

What is the confidence-competence gap in AI code generation?

AI assistants express uniform confidence regardless of actual competence: the tone of a correct boilerplate answer and a hallucinated IAM policy is identical. Human experts signal uncertainty; models do not. Reviewers therefore cannot use fluency as a proxy for correctness, and every claim about provider behavior, security semantics, or API contracts must be validated mechanically.

What kinds of infrastructure tasks are AI assistants reliably good at?

Tasks with stable, well-documented schemas heavily represented in training data: Kubernetes manifests (NetworkPolicies, Deployments), unit test scaffolding, boilerplate Python/Bash, SQL queries, regex, and one-shot format conversions. Reliability collapses on tasks requiring current provider-version knowledge, cross-resource security reasoning (IAM blast radius), or organization-specific context the model cannot see.

How should teams govern AI-generated infrastructure code?

Four controls: zero-trust adoption (AI code gets the same review as code from an unknown forum); data-leakage prevention (no secrets, PII, or network maps in prompts; enterprise instances with retention opt-outs); mandatory automated validation (tflint, checkov, kubeconform in pre-commit and CI catch hallucinated syntax mechanically); and the junior-engineer heuristic — treat output as work from a talented but inexperienced engineer with no security intuition and no memory of your production environment.

What prompt patterns produce better infrastructure code from AI assistants?

Constraint-rich prompts: pin the provider version, state security requirements explicitly (no public endpoints, no secret logging), demand a dry-run mode, and require validation commands in the answer. Vague prompts sample from the model's entire mixed-quality training distribution; constrained prompts collapse the distribution toward reviewable, current, secure output.

Do AI coding assistants make DevOps engineers less necessary?

They shift the premium skill from writing code to auditing it. Generating a plausible EKS module now takes seconds; knowing that its node group lacks a launch template with IMDSv2 enforcement still requires an engineer who understands the underlying system. Teams that pair assistants with strong review culture and mechanical guardrails get the productivity without inheriting the failure modes.

A Comparative Evaluation of AI Coding Assistants in Infrastructure-as-Code Workflows

Sritej Panchumarthi · Published: May 5, 2026 · Updated: July 7, 2026 · Technical Report · 50 min read

Abstract
Large Language Models have permeated software development, yet their efficacy in specialized domains like DevOps and cloud security remains under-explored. This study benchmarks leading AI assistants against real-world Infrastructure-as-Code tasks: Terraform module generation, IAM automation, and Kubernetes policy authoring. We document a significant "Confidence-Competence Gap" — a high incidence of hallucinated resource arguments and insecure default configurations delivered with uniform fluency. We then present the complete countermeasure: an L3 governance architecture that wraps assistants in mechanical validation, corrected reference implementations for every failure we found, constraint-based prompt patterns, and a reusable evaluation harness so teams can benchmark assistants against their own stack rather than trusting anyone's leaderboard.

Key takeaway: AI assistants are excellent accelerators for boilerplate, examples, and review checklists. They are not reliable authorities for provider-specific infrastructure behavior, IAM blast radius, or production security decisions unless paired with automated validation and human review. The premium skill has shifted from writing code to auditing it.

1. Introduction

The promise of AI-assisted coding is increased velocity. However, in the context of DevSecOps, velocity without verification is a liability. Unlike general-purpose application logic — where a bug throws an exception in staging — errors in Infrastructure-as-Code can mean an internet-exposed database, a leaked credential in build logs, or a five-figure cost overrun, all deployed at automation speed.

This report documents a six-month longitudinal study evaluating the reliability of AI assistants on staff-engineer-level infrastructure problems, followed by the governance architecture we now use to capture the productivity while containing the risk.

2. Methodology

We defined three task categories, each run repeatedly across assistants and scored on four axes — syntactic validity (does it parse?), functional validity (does it apply/deploy?), security posture (what would a pentester say?), and self-awareness (did the model flag its own uncertainty?):

Task A (IaC): Generate a multi-AZ AWS EKS cluster in Terraform with IRSA (IAM Roles for Service Accounts) enabled.
Task B (Security): Write a Python/boto3 script to rotate IAM access keys older than 90 days.
Task C (Orchestration): Create a Kubernetes NetworkPolicy isolating a frontend namespace.

Task	Syntactic validity	Functional validity	Security posture	Flagged own uncertainty?
A — Terraform EKS + IRSA	High	Low (hallucinated arguments)	Medium (public endpoint defaults)	Never
B — IAM key rotation	High	High	Low (secret logging)	Never
C — NetworkPolicy	High	High	High	N/A

3. The "Confidence-Competence" Gap

The biggest risk with AI assistants isn't that they are wrong — it's that they are confidently wrong. A human expert hedges when uncertain; a model delivers a hallucinated provider argument in the same authoritative tone as a textbook answer. Fluency stops being a signal of correctness, which quietly breaks the heuristics human reviewers have relied on for their entire careers.

Fig 1. Observed Assistant Reliability by Task Type — the Gap Widens with Context Depth

 Reliability
 (functional + secure)
   ▲
   │ ██ Boilerplate Python / Bash          ◄─ stable syntax,
   │ ██ Unit test scaffolding                 massive corpus
   │ ██ SQL queries · regex · conversions
   │ ▓▓ Kubernetes manifests               ◄─ stable schema,
   │ ▓▓ Dockerfiles                           well documented
   │ ▒▒ Terraform (current providers)      ◄─ corpus spans many
   │ ▒▒ Helm charts w/ deps                   incompatible versions
   │ ░░ IAM policies (blast radius)        ◄─ requires reasoning
   │ ░░ Cross-service security wiring         about YOUR context,
   │ ░░ Multi-resource failure modes          which model cannot see
   │ ·· Production debugging
   │ ·· Cost/perf tradeoff decisions
   └────────────────────────────────────────────────► Task depth
        ▲                                    ▲
        │ SAFE: generate, then lint          │ DANGER ZONE: generate,
        │                                    │ then EXPERT review +
        │                                    │ mechanical validation
   Confidence expressed by model: ████████ UNIFORM across all rows —
   that flat line IS the confidence-competence gap.

4. Case Study A: Terraform Hallucinations

Objective: Create an AWS EKS cluster module with IRSA enabled.

Observation: Assistants repeatedly emitted arguments that were deprecated or nonexistent in AWS Provider v4.0+:

# AI-generated (hallucination)
resource "aws_eks_cluster" "main" {
  name = "my-cluster"
  # This argument does not exist on this resource!
  enable_irsa = true
}

Root cause analysis: The training corpus mixes outdated Terraform (0.11/0.12 era), the popular community module terraform-aws-modules/eks (where enable_irsa is a valid input), and raw provider documentation. The model statistically conflates the module's interface with the raw resource's schema — producing code that is syntactically fluent and functionally invalid. terraform validate catches it in two seconds; a tired reviewer at 5 p.m. does not.
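The mechanical check that defeats this failure can be sketched in a few lines: compare each argument on a resource against the set the schema actually allows. The sketch below is illustrative only. `KNOWN_ARGUMENTS` is a hand-copied, incomplete subset assumed for the example (a real check would read the schema from `terraform providers schema -json` or simply run `terraform validate`), and the line-based parsing is a simplification, not a real HCL parser.

```python
import re

# ASSUMPTION: tiny, incomplete subset of top-level names, for illustration only.
# A real check would load the provider schema instead of hard-coding it.
KNOWN_ARGUMENTS = {
    "aws_eks_cluster": {"name", "role_arn", "version", "vpc_config", "tags"},
}

RESOURCE_RE = re.compile(r'resource\s+"(?P<type>\w+)"\s+"(?P<name>\w+)"\s*\{')
ARGUMENT_RE = re.compile(r'^\s*(?P<arg>\w+)\s*(=|\{)')


def unknown_arguments(hcl: str) -> list[tuple[str, str]]:
    """Return (resource_type, argument) pairs missing from KNOWN_ARGUMENTS.

    Only top-level arguments of each resource block are inspected, and
    everything after a '#' is treated as a comment (a simplification that
    would misread '#' inside string literals).
    """
    findings = []
    current = None  # resource type of the block being scanned, if any
    depth = 0       # brace nesting depth inside that block
    for line in hcl.splitlines():
        code = line.split("#", 1)[0]
        if current is None:
            match = RESOURCE_RE.search(code)
            if match:
                current, depth = match.group("type"), 1
            continue
        if depth == 1:
            match = ARGUMENT_RE.match(code)
            allowed = KNOWN_ARGUMENTS.get(current)
            if match and allowed is not None and match.group("arg") not in allowed:
                findings.append((current, match.group("arg")))
        depth += code.count("{") - code.count("}")
        if depth == 0:
            current = None
    return findings


HALLUCINATED = '''
resource "aws_eks_cluster" "main" {
  name = "my-cluster"
  # This argument does not exist on this resource!
  enable_irsa = true
}
'''

if __name__ == "__main__":
    for resource_type, argument in unknown_arguments(HALLUCINATED):
        print(f"{resource_type}: unknown argument {argument!r}")
```

The point is less the parser than the shape of the check: a set-membership test against a versioned schema needs no judgment, which is why it belongs to a machine rather than a reviewer.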

The corrected reference implementation — what IRSA actually requires (an OIDC provider bound to the cluster's issuer URL):

# Correct: IRSA = OIDC provider wired to the cluster issuer
resource "aws_eks_cluster" "main" {
  name     = "my-cluster"
  role_arn = aws_iam_role.cluster.arn
  version  = "1.29"

  vpc_config {
    subnet_ids              = module.vpc.private_subnets
    endpoint_public_access  = false          # AI default was true
    endpoint_private_access = true
  }
}

data "tls_certificate" "oidc" {
  url = aws_eks_cluster.main.identity[0].oidc[0].issuer
}

resource "aws_iam_openid_connect_provider" "eks" {
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = [data.tls_certificate.oidc.certificates[0].sha1_fingerprint]
  url             = aws_eks_cluster.main.identity[0].oidc[0].issuer
}

# Per-service role example: scoped trust to ONE service account
data "aws_iam_policy_document" "app_trust" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]
    principals {
      type        = "Federated"
      identifiers = [aws_iam_openid_connect_provider.eks.arn]
    }
    condition {
      test     = "StringEquals"
      variable = "${replace(aws_eks_cluster.main.identity[0].oidc[0].issuer, "https://", "")}:sub"
      values   = ["system:serviceaccount:app:app-sa"]
    }
  }
}

Note what the hallucination concealed: IRSA is not a boolean — it is an OIDC federation design with a per-service trust condition. The one-line fake answer didn't just fail to work; it hid an entire security architecture the engineer needed to understand.
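To make the trust condition concrete, here is a minimal Python sketch that builds the same per-service trust policy as a JSON document. The function name `irsa_trust_policy`, the account ID, and the issuer URL are placeholders assumed for illustration; only the policy shape mirrors the Terraform above.

```python
import json


def irsa_trust_policy(oidc_provider_arn: str, issuer_url: str,
                      namespace: str, service_account: str) -> dict:
    """Build a trust policy that lets exactly one service account assume a role.

    The condition key is the issuer URL without its https:// scheme, followed
    by ':sub', matching the replace() call in the Terraform example.
    Requires Python 3.9+ for str.removeprefix.
    """
    issuer = issuer_url.removeprefix("https://")
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Federated": oidc_provider_arn},
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    f"{issuer}:sub": f"system:serviceaccount:{namespace}:{service_account}",
                }
            },
        }],
    }


# Placeholder values, not real identifiers.
EXAMPLE_ISSUER = "https://oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE"
EXAMPLE_ARN = ("arn:aws:iam::111122223333:oidc-provider/"
               "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE")

if __name__ == "__main__":
    policy = irsa_trust_policy(EXAMPLE_ARN, EXAMPLE_ISSUER, "app", "app-sa")
    print(json.dumps(policy, indent=2))
```

Dropping the condition block would let any service account in the cluster assume the role, which is the kind of blast-radius detail a boolean flag hides.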

5. Case Study B: Security & IAM Risks

Objective: Write a Python script to rotate IAM access keys.

Critical finding: A leading assistant generated a script that logged the newly created secret key to stdout "for verification purposes":

# AI-generated (dangerous)
response = iam.create_access_key(UserName=user)
print(f"New Key Created: {response['AccessKey']['SecretAccessKey']}")  # ← SECURITY RISK

Security implication: Executed in CI (Jenkins, GitLab), the credential persists in plaintext build logs — readable by anyone with project read access, shipped to log aggregators, retained by backup systems. One generated line converts a routine rotation into a multi-system credential-exposure incident.

The corrected reference implementation — secrets flow only into Secrets Manager, with dry-run and old-key lifecycle handling:

import json
import boto3
from datetime import datetime, timezone, timedelta

iam = boto3.client("iam")
secrets = boto3.client("secretsmanager")
MAX_AGE = timedelta(days=90)

def rotate(user: str, dry_run: bool = True) -> dict:
    """Rotate the active access keys of `user` older than MAX_AGE (preview by default)."""
    keys = iam.list_access_keys(UserName=user)["AccessKeyMetadata"]
    actions = []
    for key in keys:
        age = datetime.now(timezone.utc) - key["CreateDate"]
        # skip young keys, and keys a previous run already deactivated
        if key["Status"] != "Active" or age < MAX_AGE:
            continue
        actions.append(f"rotate {key['AccessKeyId'][:8]}… (age {age.days}d)")
        if dry_run:
            continue

        # IAM allows at most two access keys per user, so this call fails
        # if the user already holds two; remove the spare key first
        new = iam.create_access_key(UserName=user)["AccessKey"]
        # secret goes to the vault, NEVER to stdout, logs, or return values;
        # the secret iam/<user>/access-key must already exist in Secrets Manager
        secrets.put_secret_value(
            SecretId=f"iam/{user}/access-key",
            SecretString=json.dumps({
                "AccessKeyId": new["AccessKeyId"],
                "SecretAccessKey": new["SecretAccessKey"],
            }),
        )
        # deactivate (don't delete) the old key: deactivation is reversible,
        # so a consumer that missed the new secret can be unblocked by setting
        # the key back to Active; deletion is a later, separate job
        iam.update_access_key(UserName=user,
                              AccessKeyId=key["AccessKeyId"],
                              Status="Inactive")
    return {"user": user, "dry_run": dry_run, "actions": actions}

Every difference between the two versions is a security decision the assistant silently made wrong: where secrets go, whether old keys are removed irreversibly or deactivated so that a bad rotation can be rolled back, and whether the operator can preview actions before execution.

6. Case Study C: Kubernetes Manifests — Where Models Shine

Objective: Create a NetworkPolicy denying all ingress except from the "frontend" namespace.

Result: High proficiency across assistants, with correct use of matchLabels, namespaceSelector, and the subtle deny-by-default semantics (an empty ingress list denies all traffic, while a list holding a single empty rule allows all of it). The lesson generalizes: AI models excel at declarative configuration with stable, versioned, well-documented schemas. The Kubernetes API changes deliberately and is documented exhaustively; Terraform providers change monthly across a fragmented corpus. Schema stability predicts assistant reliability better than task "difficulty" does.
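For reference, the two policy shapes behind that objective can be sketched as plain Python dictionaries and serialized to JSON, which kubectl accepts alongside YAML. The function names are hypothetical, and the sketch assumes a cluster recent enough to label every namespace with `kubernetes.io/metadata.name` automatically; on older clusters you would select on a label you apply yourself.

```python
import json


def deny_all_ingress(name: str, namespace: str) -> dict:
    """Select every pod in the namespace and allow no ingress at all."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "podSelector": {},           # empty selector = all pods in the namespace
            "policyTypes": ["Ingress"],
            "ingress": [],               # no rules = nothing is allowed in
        },
    }


def allow_ingress_from_namespace(name: str, namespace: str,
                                 source_namespace: str) -> dict:
    """Allow ingress only from pods that live in `source_namespace`."""
    policy = deny_all_ingress(name, namespace)
    policy["spec"]["ingress"] = [{
        "from": [{
            "namespaceSelector": {
                # ASSUMPTION: namespaces carry this label automatically.
                "matchLabels": {"kubernetes.io/metadata.name": source_namespace},
            },
        }],
    }]
    return policy


if __name__ == "__main__":
    # "backend" is a placeholder for the namespace being protected.
    print(json.dumps(allow_ingress_from_namespace(
        "allow-frontend-only", "backend", "frontend"), indent=2))
```

Because policies are additive, applying both objects to the same namespace yields the same result as the allow policy alone; keeping the deny-all policy separate simply makes the default explicit.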

7. The Governance Framework — L3 View

Based on these findings, we operate AI assistants inside a layered control architecture. The insight: you cannot review your way out of hallucinations at scale — you must make validation mechanical, and reserve human attention for what machines cannot check (blast radius, intent, context).

Fig 2. L3 Governance Architecture — AI-Generated Code from Prompt to Production

 ┌─ POLICY BOUNDARY: what may enter a prompt ──────────────────────┐
 │  ✗ secrets/tokens   ✗ PII   ✗ internal network maps             │
 │  ✓ enterprise instance · retention opt-out · audit logging      │
 └──────────────────────────────┬──────────────────────────────────┘
                                ▼
        Developer + AI assistant (IDE / chat)
        constraint-rich prompt (see §8): pinned versions,
        security requirements, demanded validation steps
                                │  generated code
                                ▼
 ┌─ TIER 1 · MECHANICAL (pre-commit, seconds) ─────────────────────┐
 │  terraform validate · tflint --minimum-failure-severity=error   │
 │  kubeconform -strict · pylint/ruff · gitleaks                   │
 │  CATCHES: hallucinated args, schema errors, planted secrets     │
 │  (Case A dies here, 2 s after being written)                    │
 └──────────────────────────────┬──────────────────────────────────┘
                                ▼
 ┌─ TIER 2 · SEMANTIC (CI, minutes) ───────────────────────────────┐
 │  checkov / tfsec  → public exposure, unencrypted storage        │
 │  semgrep custom rules → secret-logging patterns:                │
 │     "print.*SecretAccessKey" · "console.log.*password"          │
 │  (Case B dies here — mechanically, every time)                  │
 │  OPA policies → org rules AI cannot know (tagging, regions,     │
 │     approved module registry only)                              │
 └──────────────────────────────┬──────────────────────────────────┘
                                ▼
 ┌─ TIER 3 · HUMAN (MR review, focused) ───────────────────────────┐
 │  Reviewer attention is now SPENT ONLY on what machines can't:   │
 │  · blast radius: "what can this role/SG actually reach?"        │
 │  · intent match: "is this what the ticket asked for?"           │
 │  · context: "does this respect OUR failure domains?"            │
 │  MR template flags: [x] contains AI-generated code              │
 │  → routes to reviewer with domain expertise, not rubber stamp   │
 └──────────────────────────────┬──────────────────────────────────┘
                                ▼
                    plan/apply · progressive rollout
                    (same pipeline as human-written code —
                     provenance recorded, no separate lane)

Zero-trust code adoption: AI-generated code undergoes the same rigor as code from a public forum. It is not "correct by default" — it is plausible by default, which is worse.
Data-leakage prevention: strict prohibition on secrets, PII, or internal network maps in prompts; enterprise instances with retention opt-outs.
Automated validation: linters and policy scanners in pre-commit and CI catch hallucinated syntax and insecure patterns mechanically (Tiers 1–2 above).
The "junior engineer" heuristic: treat output as the work of a talented but inexperienced engineer — syntactically fluent, zero security intuition, no memory of your environment.

8. Prompt Patterns That Work Better for DevOps

The highest-quality outputs came from prompts that force the assistant to operate inside explicit constraints: provider versions, security requirements, failure criteria, and validation commands. A vague prompt samples from the model's entire mixed-quality training distribution; a constrained prompt collapses that distribution toward current, secure, reviewable output.

Weak prompt	Better prompt
Create an EKS cluster in Terraform.	Create an EKS cluster using AWS provider v5, no public endpoint, IRSA enabled through an OIDC provider (raw resources, not the community module), managed node groups with IMDSv2 required, and include validation commands with `terraform validate` and `tflint`.
Write an IAM rotation script.	Write a boto3 IAM access-key rotation script that never logs secrets, stores the new secret only in Secrets Manager, deactivates (not deletes) old keys, supports dry-run mode, and emits CloudWatch metrics.
Make a Kubernetes NetworkPolicy.	Generate a deny-by-default NetworkPolicy and a separate allow rule from namespace label `app=frontend`, then explain how to test it with a temporary pod.
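One way to keep prompts constraint-rich by habit is to assemble them from a small structure that refuses to stay silent about missing constraints. The sketch below is an assumption about how a team might do this, not something the study prescribes; `PromptSpec` and its fields are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class PromptSpec:
    """A prompt expressed as a task plus explicit constraints (Python 3.9+)."""
    task: str
    pinned_versions: dict[str, str] = field(default_factory=dict)
    security_requirements: list[str] = field(default_factory=list)
    validation_commands: list[str] = field(default_factory=list)

    def missing_constraints(self) -> list[str]:
        """Name every constraint category that was left empty."""
        checks = {
            "pinned versions": self.pinned_versions,
            "security requirements": self.security_requirements,
            "validation commands": self.validation_commands,
        }
        return [label for label, value in checks.items() if not value]

    def render(self) -> str:
        """Render the prompt text, one constraint per line."""
        lines = [self.task, ""]
        if self.pinned_versions:
            lines.append("Versions (do not use syntax from other versions):")
            lines += [f"- {name} {version}"
                      for name, version in self.pinned_versions.items()]
        if self.security_requirements:
            lines.append("Security requirements:")
            lines += [f"- {req}" for req in self.security_requirements]
        if self.validation_commands:
            lines.append("Include instructions to validate the result with:")
            lines += [f"- {cmd}" for cmd in self.validation_commands]
        lines.append("State explicitly anything you are unsure about.")
        return "\n".join(lines)


if __name__ == "__main__":
    spec = PromptSpec(
        task="Create an EKS cluster in Terraform using raw resources.",
        pinned_versions={"AWS provider": "v5"},
        security_requirements=["no public endpoint",
                               "IRSA through an OIDC provider",
                               "IMDSv2 required on node groups"],
        validation_commands=["terraform validate", "tflint"],
    )
    print(spec.render())
```

A pre-send check such as `if spec.missing_constraints(): ...` turns the weak prompts in the left column into a visible warning rather than a silent default.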

Review checklist for AI-generated infrastructure code:

Pin provider and module versions before trusting syntax.
Run linters and schema validators before reading the code deeply — don't spend human attention on what a machine checks in seconds.
Search for accidental secret logging, wildcard IAM permissions, and public network exposure.
Ask the model for failure modes, then verify them independently.
Treat generated explanations as hypotheses, not documentation.
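The wildcard-permission item on this checklist is easy to mechanize. Below is a minimal sketch that flags `Allow` statements with wildcard actions or resources in an IAM policy document; the function name `wildcard_statements` is an assumption, and a flagged statement is a prompt for human review, not proof of a defect, since some actions legitimately require `Resource: "*"`.

```python
def _as_list(value) -> list:
    """IAM accepts a single string or a list; normalize to a list."""
    if value is None:
        return []
    return [value] if isinstance(value, str) else list(value)


def wildcard_statements(policy: dict) -> list[dict]:
    """Return one finding per Allow statement that uses a wildcard.

    Flags actions equal to '*' or ending in ':*' (for example 's3:*')
    and resources equal to '*'. Deny statements are ignored.
    """
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]

    findings = []
    for index, statement in enumerate(statements):
        if statement.get("Effect") != "Allow":
            continue
        wide_actions = [a for a in _as_list(statement.get("Action"))
                        if a == "*" or a.endswith(":*")]
        wide_resources = [r for r in _as_list(statement.get("Resource"))
                          if r == "*"]
        if wide_actions or wide_resources:
            findings.append({
                "statement": index,
                "actions": wide_actions,
                "resources": wide_resources,
            })
    return findings


# Hypothetical policy of the kind an assistant might generate.
EXAMPLE_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "s3:*", "Resource": "*"},
        {"Effect": "Allow", "Action": ["s3:GetObject"],
         "Resource": "arn:aws:s3:::example-bucket/*"},
    ],
}

if __name__ == "__main__":
    for finding in wildcard_statements(EXAMPLE_POLICY):
        print(finding)
```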

9. Hands-On Practical: Build Your Own Evaluation Harness

Leaderboards measure other people's tasks. Your stack has its own providers, versions, and conventions — so benchmark assistants against it. This harness (~2 hours to stand up) scores any assistant on your representative tasks, mechanically:

#!/usr/bin/env bash
# eval-harness/run.sh: score AI-generated IaC on YOUR stack
# Layout:  tasks/<name>/prompt.md        (what you asked)
#          tasks/<name>/response/        (what the model produced)
#          tasks/<name>/expected.txt     (assertions, optional)
# Needs bash 4+ for the associative array.
set -uo pipefail
declare -A SCORES
root=$PWD

for task in tasks/*/; do
  name=$(basename "$task")
  # skip tasks without a response/ directory instead of scoring the wrong files
  cd "$root/${task}response" 2>/dev/null || { echo "SKIP $name: no response/"; continue; }
  s=0; max=0   # max = points this task could have earned

  # Gate 1: parses at all (catches Case-A hallucinations)
  if ls *.tf &>/dev/null; then
    max=$((max+65))
    terraform init -backend=false -input=false &>/dev/null
    terraform validate &>/dev/null            && s=$((s+25))
    tflint --minimum-failure-severity=error   && s=$((s+15))
    checkov -d . --compact --quiet            && s=$((s+25))
  fi
  if ls *.yaml &>/dev/null; then
    max=$((max+40))
    kubeconform -strict -summary *.yaml       && s=$((s+40))
  fi
  if ls *.py &>/dev/null; then
    max=$((max+40))
    ruff check . &>/dev/null                  && s=$((s+20))
    # Gate 2: the Case-B detector, secrets in output paths (Python files only)
    ! grep -rE --include='*.py' "print.*(Secret|Password|Token)" . && s=$((s+20))
  fi

  # Gate 3: task-specific assertions (e.g. "endpoint_public_access = false"),
  # one literal string per line of expected.txt
  if [[ -f ../expected.txt ]]; then
    while read -r assertion || [[ -n $assertion ]]; do
      [[ -z $assertion ]] && continue
      max=$((max+5))
      grep -rqF --exclude-dir=.terraform -- "$assertion" . && s=$((s+5)) || echo "  MISS: $assertion"
    done < ../expected.txt
  fi

  SCORES[$name]="$s/$max"
done
cd "$root"

echo "== Assistant scorecard =="
for k in "${!SCORES[@]}"; do printf "%-40s %s\n" "$k" "${SCORES[$k]}"; done

Write 5–10 prompt.md tasks that mirror your actual work (your provider versions, your module registry, your naming rules).
Paste each prompt into the assistants you're evaluating; save outputs under response/.
Add expected.txt assertions for the security properties that matter to you (endpoint_public_access = false, metadata_options, no 0.0.0.0/0).
Run the harness. Re-run quarterly — assistants change under you without release notes.

Teams that run this exercise stop arguing about which assistant is "best" in the abstract and start knowing which one fails their Gate 2 least often — which is the only question that matters.

10. FAQ

Why do assistants hallucinate Terraform arguments?
The corpus blends incompatible generations of HCL, community modules, and provider versions; the model statistically merges them into fluent, invalid code. Schema-stable domains (Kubernetes) don't trigger this failure nearly as often.

What is the confidence-competence gap?
Model confidence is flat across task types while competence varies wildly — so fluency can't be used as a correctness signal. Validation must be mechanical.

What are assistants reliably good at?
Stable-schema declarative config, boilerplate, tests, SQL, conversions. Reliability collapses where current-version knowledge or your private context is required.

What's the core of the governance framework?
Three tiers: mechanical validation (seconds), semantic scanning (minutes), and human review focused exclusively on blast radius, intent, and context. Plus a policy boundary on what may enter prompts.

Do better prompts really help?
Measurably. Pinned versions, explicit security requirements, and demanded validation steps collapse the output distribution toward current and secure. §8 has the before/after patterns.

Does this make engineers less necessary?
It shifts the premium from writing to auditing. Generating a plausible module takes seconds; knowing its node group lacks IMDSv2 enforcement still takes an engineer.

11. Conclusion

AI coding assistants represent a genuine paradigm shift in developer productivity — and in the infrastructure domain, an equally genuine shift in where risk concentrates. The failures documented here are not exotic: a hallucinated argument, a logged secret, an insecure default, each delivered with perfect confidence. The remedy is not abstinence but architecture: mechanical validation tiers that catch what machines can catch, human review spent only where judgment is irreplaceable, prompts engineered as constraint systems, and a standing evaluation harness that measures assistants against your stack instead of a leaderboard. Success requires the shift from "writing code" to "auditing code" — placing a higher premium than ever on the engineer's fundamental understanding of the underlying systems.