
Building an AWS Incident Response Playbook: From Detection to Recovery


When a security incident hits your AWS environment, the difference between a contained breach and a catastrophic one is preparation. A well-rehearsed incident response playbook turns panic into procedure. This guide walks through building a comprehensive AWS incident response capability grounded in the NIST framework.

The NIST Framework Applied to AWS

The NIST Computer Security Incident Handling Guide (SP 800-61 Rev. 2) defines four phases: Preparation; Detection and Analysis; Containment, Eradication, and Recovery; and Post-Incident Activity. Each phase maps directly to AWS services and capabilities.

Preparation means having the right tooling deployed before an incident occurs. Detection relies on services like GuardDuty, Security Hub, and CloudTrail. Containment leverages Lambda automation and IAM controls. Recovery uses infrastructure-as-code to rebuild clean environments. Post-incident analysis uses CloudTrail Lake for forensic querying.

Automated Containment with Lambda

Manual containment is too slow. By the time a human reviews an alert and logs into the console, an attacker with compromised credentials can pivot across accounts. Automated containment functions should trigger within seconds of detection.
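One way to get that seconds-level trigger is an EventBridge rule that matches GuardDuty findings and invokes the containment Lambda. The sketch below only builds the event pattern; the function name and the severity threshold of 7.0 are assumptions to tune for your environment.

```python
import json


def guardduty_event_pattern(min_severity=7.0):
    """Build an EventBridge event pattern for GuardDuty findings.

    min_severity is an assumed threshold, not a recommendation.
    """
    return {
        "source": ["aws.guardduty"],
        "detail-type": ["GuardDuty Finding"],
        # Numeric matching keeps low-severity findings from triggering containment
        "detail": {"severity": [{"numeric": [">=", min_severity]}]},
    }


if __name__ == "__main__":
    print(json.dumps(guardduty_event_pattern(), indent=2))
```

The resulting dictionary can be serialized with json.dumps and passed as the EventPattern argument of the boto3 events client's put_rule call, with the containment function added as the rule's target.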

Isolating a Compromised EC2 Instance

import boto3
import json
from datetime import datetime

def isolate_instance(event, context):
    """Automatically isolate a compromised EC2 instance."""
    ec2 = boto3.client('ec2')
    finding = event['detail']
    instance_id = finding['resource']['instanceDetails']['instanceId']
    vpc_id = finding['resource']['instanceDetails']['networkInterfaces'][0]['vpcId']

    # Create a quarantine security group that blocks all traffic
    try:
        sg_response = ec2.create_security_group(
            GroupName=f'quarantine-{instance_id}-{datetime.utcnow().strftime("%Y%m%d%H%M")}',
            Description=f'Quarantine SG for compromised instance {instance_id}',
            VpcId=vpc_id
        )
        quarantine_sg_id = sg_response['GroupId']

        # Revoke the default egress rule to block all outbound traffic
        ec2.revoke_security_group_egress(
            GroupId=quarantine_sg_id,
            IpPermissions=[{
                'IpProtocol': '-1',
                'IpRanges': [{'CidrIp': '0.0.0.0/0'}]
            }]
        )
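    except ec2.exceptions.ClientError:
        # Creation or rule revocation failed; fall back to an existing
        # quarantine group in this VPC (raises IndexError if none exists)
        existing = ec2.describe_security_groups(
            Filters=[
                {'Name': 'vpc-id', 'Values': [vpc_id]},
                {'Name': 'group-name', 'Values': ['quarantine-*']}
            ]
        )
        quarantine_sg_id = existing['SecurityGroups'][0]['GroupId']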

    # Replace all security groups with the quarantine group
    instance = ec2.describe_instances(InstanceIds=[instance_id])
    network_interfaces = instance['Reservations'][0]['Instances'][0]['NetworkInterfaces']

    for eni in network_interfaces:
        ec2.modify_network_interface_attribute(
            NetworkInterfaceId=eni['NetworkInterfaceId'],
            Groups=[quarantine_sg_id]
        )

    # Tag the instance for tracking
    ec2.create_tags(
        Resources=[instance_id],
        Tags=[
            {'Key': 'SecurityStatus', 'Value': 'Quarantined'},
            {'Key': 'IncidentDate', 'Value': datetime.utcnow().isoformat()},
            {'Key': 'OriginalSecurityGroups', 'Value': json.dumps(
                [sg['GroupId'] for eni in network_interfaces for sg in eni['Groups']]
            )}
        ]
    )

    return {
        'statusCode': 200,
        'body': f'Instance {instance_id} quarantined with SG {quarantine_sg_id}'
    }
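The handler above indexes straight into the finding, so a finding that does not describe an EC2 instance (an IAM or S3 finding, for example) would fail with a KeyError or IndexError. A minimal sketch of a defensive parser follows; extract_instance_target is a hypothetical helper name, and the field names are the ones the handler already reads.

```python
def extract_instance_target(event):
    """Return (instance_id, vpc_id) from a GuardDuty finding event.

    Raises ValueError when the finding does not describe an EC2
    instance with at least one VPC-attached network interface.
    """
    detail = event.get("detail") or {}
    resource = detail.get("resource") or {}
    details = resource.get("instanceDetails") or {}

    instance_id = details.get("instanceId")
    if not instance_id:
        raise ValueError("finding does not reference an EC2 instance")

    # Take the first interface that reports a VPC ID
    interfaces = details.get("networkInterfaces") or []
    vpc_id = next((ni["vpcId"] for ni in interfaces if ni.get("vpcId")), None)
    if vpc_id is None:
        raise ValueError(f"no VPC found for instance {instance_id}")

    return instance_id, vpc_id
```

Calling this at the top of the handler lets it exit cleanly on findings it cannot act on instead of failing partway through containment.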

Revoking Compromised IAM Credentials

When IAM credentials are compromised, you need to revoke all active sessions immediately, not just delete the access keys:

import boto3
import json
from datetime import datetime

def revoke_iam_sessions(user_name):
    """Revoke all active sessions for a compromised IAM user."""
    iam = boto3.client('iam')

    # Deactivate all access keys
    keys = iam.list_access_keys(UserName=user_name)
    for key in keys['AccessKeyMetadata']:
        iam.update_access_key(
            UserName=user_name,
            AccessKeyId=key['AccessKeyId'],
            Status='Inactive'
        )

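    # Attach an inline policy that denies requests made with temporary
    # credentials issued before now (for example, from GetSessionToken).
    # Long-term access keys carry no aws:TokenIssueTime, which is why
    # they are deactivated separately above.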
    deny_policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Deny",
            "Action": "*",
            "Resource": "*",
            "Condition": {
                "DateLessThan": {
                    "aws:TokenIssueTime": datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ")
                }
            }
        }]
    }

    iam.put_user_policy(
        UserName=user_name,
        PolicyName='RevokeOlderSessions',
        PolicyDocument=json.dumps(deny_policy)
    )

    # Delete any console login profile
    try:
        iam.delete_login_profile(UserName=user_name)
    except iam.exceptions.NoSuchEntityException:
        pass

    return f"All sessions revoked for user {user_name}"
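Credentials an attacker already obtained by assuming a role are governed by that role's policies, so the user-level policy above does not reach them. The same aws:TokenIssueTime condition can be attached to the role instead. This is a sketch under assumptions: build_revoke_sessions_policy and revoke_role_sessions are hypothetical helper names, the policy name is reused from the function above, and the IAM client is passed in by the caller.

```python
import json
from datetime import datetime, timezone


def build_revoke_sessions_policy(cutoff=None):
    """Deny all actions for credentials issued before cutoff.

    cutoff must be a timezone-aware datetime; it defaults to now (UTC).
    """
    cutoff = (cutoff or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Deny",
            "Action": "*",
            "Resource": "*",
            "Condition": {
                "DateLessThan": {
                    "aws:TokenIssueTime": cutoff.strftime("%Y-%m-%dT%H:%M:%SZ")
                }
            },
        }],
    }


def revoke_role_sessions(iam_client, role_name, cutoff=None):
    """Attach the deny policy to a role so older role sessions stop working."""
    iam_client.put_role_policy(
        RoleName=role_name,
        PolicyName="RevokeOlderSessions",
        PolicyDocument=json.dumps(build_revoke_sessions_policy(cutoff)),
    )
```

Sessions issued after the cutoff keep working, so legitimate workloads recover by assuming the role again.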

Evidence Preservation

Before you remediate anything, preserve evidence. You cannot investigate an incident after you have destroyed the artifacts.

Creating Forensic Snapshots

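#!/bin/bash
# Forensic evidence collection script
# Usage: $0 <instance-id> <incident-id>
set -euo pipefail  # stop on the first failed command or unset variable

INSTANCE_ID="${1:?instance ID required}"
INCIDENT_ID="${2:?incident ID required}"
FORENSICS_BUCKET="forensics-evidence-123456789012"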

# Capture EBS snapshots of all attached volumes
VOLUME_IDS=$(aws ec2 describe-volumes \
  --filters "Name=attachment.instance-id,Values=${INSTANCE_ID}" \
  --query 'Volumes[].VolumeId' --output text)

for VOL_ID in $VOLUME_IDS; do
  SNAPSHOT_ID=$(aws ec2 create-snapshot \
    --volume-id "$VOL_ID" \
    --description "Forensic snapshot - Incident ${INCIDENT_ID}" \
    --tag-specifications "ResourceType=snapshot,Tags=[{Key=IncidentId,Value=${INCIDENT_ID}},{Key=Purpose,Value=Forensics},{Key=DoNotDelete,Value=true}]" \
    --query 'SnapshotId' --output text)
  echo "Created snapshot ${SNAPSHOT_ID} for volume ${VOL_ID}"
done

# Export instance console output
aws ec2 get-console-output \
  --instance-id "$INSTANCE_ID" \
  --output json > "/tmp/${INCIDENT_ID}-console-output.json"

aws s3 cp "/tmp/${INCIDENT_ID}-console-output.json" \
  "s3://${FORENSICS_BUCKET}/incidents/${INCIDENT_ID}/console-output.json"

# Capture instance metadata
aws ec2 describe-instances \
  --instance-ids "$INSTANCE_ID" \
  --output json > "/tmp/${INCIDENT_ID}-instance-metadata.json"

aws s3 cp "/tmp/${INCIDENT_ID}-instance-metadata.json" \
  "s3://${FORENSICS_BUCKET}/incidents/${INCIDENT_ID}/instance-metadata.json"

echo "Evidence collection complete for incident ${INCIDENT_ID}"

Store forensic evidence in a dedicated S3 bucket with versioning and Object Lock enabled (Object Lock requires versioning), plus a bucket policy that denies deletion to everyone except the security team. Together these controls help preserve chain-of-custody integrity.
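A bucket policy along those lines might look like the following sketch. The function name and the security role ARN are hypothetical, and the policy covers only object deletion; Object Lock and versioning are configured separately on the bucket.

```python
import json


def forensics_bucket_policy(bucket, security_role_arn):
    """Deny object deletion to every principal except the security role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyDeleteExceptSecurityTeam",
            "Effect": "Deny",
            "Principal": "*",
            "Action": ["s3:DeleteObject", "s3:DeleteObjectVersion"],
            "Resource": f"arn:aws:s3:::{bucket}/*",
            # The deny applies whenever the caller is NOT the security role
            "Condition": {
                "ArnNotLike": {"aws:PrincipalArn": security_role_arn}
            },
        }],
    }


if __name__ == "__main__":
    policy = forensics_bucket_policy(
        "forensics-evidence-123456789012",
        "arn:aws:iam::123456789012:role/SecurityIncidentResponse",  # assumed role
    )
    print(json.dumps(policy, indent=2))
```

The serialized policy can be applied with the boto3 S3 client's put_bucket_policy call. An explicit deny is used because it overrides any allow granted elsewhere.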

CloudTrail Analysis During Incidents

CloudTrail is the single most important service during a security investigation. It records management API calls in your AWS account by default, and data events such as S3 object access once you enable them, providing the raw material to reconstruct an attacker's path.

Querying CloudTrail Lake

CloudTrail Lake lets you run SQL queries against your event data without setting up Athena tables:

# Replace my-event-data-store in both queries with your event data store ID
# Find all actions performed by a compromised access key
aws cloudtrail start-query \
  --query-statement "
    SELECT eventTime, eventName, sourceIPAddress,
           userAgent, requestParameters, errorCode
    FROM my-event-data-store
    WHERE userIdentity.accessKeyId = 'AKIAIOSFODNN7EXAMPLE'
      AND eventTime > '2025-12-01 00:00:00'
    ORDER BY eventTime ASC
  "

# Identify lateral movement - look for AssumeRole calls
aws cloudtrail start-query \
  --query-statement "
    SELECT eventTime, requestParameters, responseElements,
           sourceIPAddress, userIdentity.arn
    FROM my-event-data-store
    WHERE eventName = 'AssumeRole'
      AND sourceIPAddress = '203.0.113.42'
      AND eventTime BETWEEN '2025-12-04 00:00:00' AND '2025-12-05 23:59:59'
    ORDER BY eventTime ASC
  "

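The start-query command returns only a query ID, so results have to be fetched in a second step. Here is a minimal Python sketch of that loop using the boto3 CloudTrail client's start_query, describe_query, and get_query_results calls; run_lake_query is a hypothetical helper name, and the polling interval and timeout are assumed defaults.

```python
import time


def run_lake_query(cloudtrail, statement, poll_seconds=2.0, timeout_seconds=300.0):
    """Run a CloudTrail Lake query and return its rows as a list of dicts."""
    query_id = cloudtrail.start_query(QueryStatement=statement)["QueryId"]

    # Poll until the query reaches a terminal state
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = cloudtrail.describe_query(QueryId=query_id)["QueryStatus"]
        if status == "FINISHED":
            break
        if status in ("FAILED", "CANCELLED", "TIMED_OUT"):
            raise RuntimeError(f"query {query_id} ended with status {status}")
        if time.monotonic() > deadline:
            raise TimeoutError(f"query {query_id} still {status} after timeout")
        time.sleep(poll_seconds)

    # Page through the results; each row is a list of {column: value} dicts
    rows, token = [], None
    while True:
        kwargs = {"QueryId": query_id}
        if token:
            kwargs["NextToken"] = token
        page = cloudtrail.get_query_results(**kwargs)
        for row in page.get("QueryResultRows", []):
            merged = {}
            for cell in row:
                merged.update(cell)
            rows.append(merged)
        token = page.get("NextToken")
        if not token:
            return rows
```

In practice the client would come from boto3.client('cloudtrail'), and the statement would be one of the SQL strings shown above.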
Look for these key indicators during analysis: AssumeRole calls to roles the user does not normally assume, CreateAccessKey or CreateLoginProfile events indicating persistence, PutBucketPolicy or PutRolePolicy changes that widen access, and API calls from IP addresses or user agents that differ from the user's normal pattern.
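These indicators can be turned into a first-pass triage filter. The sketch below assumes events are parsed CloudTrail JSON records and that you maintain your own baselines of known IP addresses and usual role ARNs per user; flag_event and the flag labels are hypothetical names.

```python
PERSISTENCE_EVENTS = {"CreateAccessKey", "CreateLoginProfile"}
ACCESS_WIDENING_EVENTS = {"PutBucketPolicy", "PutRolePolicy"}


def flag_event(event, known_ips, usual_roles):
    """Return the list of indicator flags raised by one CloudTrail event."""
    flags = []
    name = event.get("eventName")

    if name in PERSISTENCE_EVENTS:
        flags.append("persistence")
    if name in ACCESS_WIDENING_EVENTS:
        flags.append("access-widening")
    if name == "AssumeRole":
        # AssumeRole records carry the target role in requestParameters
        role_arn = (event.get("requestParameters") or {}).get("roleArn")
        if role_arn not in usual_roles:
            flags.append("unusual-role")
    if event.get("sourceIPAddress") not in known_ips:
        flags.append("unfamiliar-ip")

    return flags
```

A filter like this only narrows the event list. Flagged events still need human review, since a new IP address or role can be legitimate.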

Runbook Automation with Systems Manager

AWS Systems Manager Automation documents codify your response procedures so they execute consistently every time. Define runbooks for your most common incident types and test them regularly.

Build runbooks for credential compromise, data exfiltration detection, cryptomining containment, and unauthorized resource creation. Each runbook should include automated steps for containment, evidence collection, notification, and initial remediation, with human approval gates before destructive actions like instance termination.
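As an illustration of an approval gate, the sketch below builds a two-step Automation document in which an aws:approve step must pass before an aws:changeInstanceState step terminates the instance. The function name, step names, and message text are assumptions, and a real runbook would add evidence collection and notification steps before these.

```python
import json


def termination_runbook(approver_arns, sns_topic_arn):
    """Build an SSM Automation document with a human approval gate."""
    return {
        "schemaVersion": "0.3",
        "description": "Terminate a quarantined instance after approval",
        "parameters": {"InstanceId": {"type": "String"}},
        "mainSteps": [
            {
                # Pauses the automation until an approver responds
                "name": "approveTermination",
                "action": "aws:approve",
                "inputs": {
                    "NotificationArn": sns_topic_arn,
                    "Message": "Approve termination of {{ InstanceId }}?",
                    "MinRequiredApprovals": 1,
                    "Approvers": approver_arns,
                },
            },
            {
                # Runs only once the approval step has succeeded
                "name": "terminateInstance",
                "action": "aws:changeInstanceState",
                "inputs": {
                    "InstanceIds": ["{{ InstanceId }}"],
                    "DesiredState": "terminated",
                },
            },
        ],
    }


if __name__ == "__main__":
    doc = termination_runbook(
        ["arn:aws:iam::123456789012:role/SecurityLead"],       # assumed approver
        "arn:aws:sns:us-east-1:123456789012:ir-approvals",     # assumed topic
    )
    print(json.dumps(doc, indent=2))
```

The serialized document can be registered with the boto3 SSM client's create_document call using DocumentType="Automation" and DocumentFormat="JSON".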

Post-Incident Activity

The post-incident review is where lasting security improvements happen. Document the timeline, root cause, blast radius, and remediation actions. Critically, update your IAM policies based on what the incident revealed. If an attacker exploited an overpermissive role, that is a signal that your permission model needs tightening.

Securing Incident Response with AccessLens

Most AWS security incidents begin with compromised or overpermissive IAM credentials. An attacker who obtains a set of access keys immediately inherits every permission attached to that identity, and in many organizations, those permissions far exceed what the identity actually needs.

AccessLens strengthens your incident response posture by:

  • Mapping permission blast radius so you can instantly understand what a compromised identity can access across accounts
  • Identifying overpermissive roles before they become the entry point for an attacker
  • Visualizing cross-account trust relationships to reveal lateral movement paths an attacker could exploit
  • Scoring risk continuously to highlight the IAM configurations most likely to lead to incidents
  • Accelerating forensic investigation by providing a pre-built map of who has access to what

The best incident response is the incident that never happens. AccessLens helps you eliminate the IAM misconfigurations that attackers exploit in the first place.

Strengthen your incident response with AccessLens and gain the IAM visibility needed to contain threats before they escalate.
