Building an AWS Incident Response Playbook: From Detection to Recovery
When a security incident hits your AWS environment, the difference between a contained breach and a catastrophic one is preparation. A well-rehearsed incident response playbook turns panic into procedure. This guide walks through building a comprehensive AWS incident response capability grounded in the NIST framework.
The NIST Framework Applied to AWS
The NIST Computer Security Incident Handling Guide (SP 800-61 Rev. 2) defines four phases: Preparation; Detection and Analysis; Containment, Eradication, and Recovery; and Post-Incident Activity. Each phase maps directly to AWS services and capabilities.
Preparation means having the right tooling deployed before an incident occurs. Detection relies on services like GuardDuty, Security Hub, and CloudTrail. Containment leverages Lambda automation and IAM controls. Recovery uses infrastructure-as-code to rebuild clean environments. Post-incident analysis uses CloudTrail Lake for forensic querying.
Automated Containment with Lambda
Manual containment is too slow. By the time a human reviews an alert and logs into the console, an attacker with compromised credentials can pivot across accounts. Automated containment functions should trigger within seconds of detection.
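One way to get that kind of trigger latency is an EventBridge rule that forwards GuardDuty findings straight to the containment function. The following is a minimal sketch under stated assumptions: the rule name, the target ID, the severity threshold of 7, and the `events_client` parameter (expected to be a boto3 `events` client) are illustrative choices, not part of the original playbook.

```python
import json


def guardduty_event_pattern(min_severity=7):
    """Build an EventBridge pattern matching GuardDuty findings at or above min_severity."""
    return {
        "source": ["aws.guardduty"],
        "detail-type": ["GuardDuty Finding"],
        "detail": {"severity": [{"numeric": [">=", min_severity]}]},
    }


def wire_containment_rule(events_client, lambda_arn,
                          rule_name="guardduty-auto-containment"):
    """Create (or update) the rule and point it at the containment Lambda.

    events_client is assumed to be boto3.client("events"); the rule name
    and target ID are hypothetical.
    """
    events_client.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(guardduty_event_pattern()),
        State="ENABLED",
    )
    events_client.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "containment-lambda", "Arn": lambda_arn}],
    )
    return rule_name
```

The Lambda function also needs a resource-based permission that allows events.amazonaws.com to invoke it; that step is omitted here. Taking the client as a parameter keeps the function easy to exercise with a stub before pointing it at a real account.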
Isolating a Compromised EC2 Instance
import json
from datetime import datetime, timezone

import boto3
from botocore.exceptions import ClientError


def get_existing_quarantine_sg(ec2, vpc_id):
    """Return an existing quarantine-* security group in the VPC.

    Illustrative fallback helper; it assumes the naming scheme used below.
    """
    response = ec2.describe_security_groups(
        Filters=[
            {'Name': 'vpc-id', 'Values': [vpc_id]},
            {'Name': 'group-name', 'Values': ['quarantine-*']},
        ]
    )
    if not response['SecurityGroups']:
        raise RuntimeError(f'No quarantine security group found in {vpc_id}')
    return response['SecurityGroups'][0]['GroupId']


def isolate_instance(event, context):
    """Automatically isolate a compromised EC2 instance."""
    ec2 = boto3.client('ec2')
    finding = event['detail']
    instance_details = finding['resource']['instanceDetails']
    instance_id = instance_details['instanceId']
    vpc_id = instance_details['networkInterfaces'][0]['vpcId']
    now = datetime.now(timezone.utc)
    timestamp = now.strftime('%Y%m%d%H%M')

    # Create a quarantine security group that blocks all traffic
    try:
        sg_response = ec2.create_security_group(
            GroupName=f'quarantine-{instance_id}-{timestamp}',
            Description=f'Quarantine SG for compromised instance {instance_id}',
            VpcId=vpc_id
        )
    except ClientError:
        # Use existing quarantine SG if creation fails
        quarantine_sg_id = get_existing_quarantine_sg(ec2, vpc_id)
    else:
        quarantine_sg_id = sg_response['GroupId']
        # A new group has no ingress rules; revoke the default egress rule
        # to block all outbound traffic as well
        ec2.revoke_security_group_egress(
            GroupId=quarantine_sg_id,
            IpPermissions=[{
                'IpProtocol': '-1',
                'IpRanges': [{'CidrIp': '0.0.0.0/0'}]
            }]
        )

    # Record the current security groups before replacing them
    instance = ec2.describe_instances(InstanceIds=[instance_id])
    network_interfaces = instance['Reservations'][0]['Instances'][0]['NetworkInterfaces']
    original_groups = [
        sg['GroupId'] for eni in network_interfaces for sg in eni['Groups']
    ]

    # Replace all security groups with the quarantine group
    for eni in network_interfaces:
        ec2.modify_network_interface_attribute(
            NetworkInterfaceId=eni['NetworkInterfaceId'],
            Groups=[quarantine_sg_id]
        )

    # Tag the instance for tracking (tag values are limited to 256 characters)
    ec2.create_tags(
        Resources=[instance_id],
        Tags=[
            {'Key': 'SecurityStatus', 'Value': 'Quarantined'},
            {'Key': 'IncidentDate', 'Value': now.isoformat()},
            {'Key': 'OriginalSecurityGroups', 'Value': json.dumps(original_groups)}
        ]
    )

    return {
        'statusCode': 200,
        'body': f'Instance {instance_id} quarantined with SG {quarantine_sg_id}'
    }
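Because the function records the previous groups in the OriginalSecurityGroups tag, releasing an instance after the investigation can read that tag back. The sketch below is illustrative only: `release_instance` and `original_groups_from_tags` are hypothetical names, `ec2` is assumed to be a boto3 EC2 client, and the tag is assumed to hold a flat JSON list, so on a multi-interface instance every interface receives the same combined set of groups.

```python
import json


def original_groups_from_tags(tags):
    """Return the security group IDs stored in the OriginalSecurityGroups tag."""
    for tag in tags:
        if tag["Key"] == "OriginalSecurityGroups":
            # De-duplicate while preserving order; the tag holds a flat list
            # gathered across every network interface.
            return list(dict.fromkeys(json.loads(tag["Value"])))
    raise ValueError("Instance has no OriginalSecurityGroups tag")


def release_instance(ec2, instance_id):
    """Reattach the recorded security groups to each interface of the instance.

    ec2 is assumed to be boto3.client("ec2"). Per-interface assignments are
    not recoverable from the flat tag, so this is an approximation.
    """
    reservations = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"]
    instance = reservations[0]["Instances"][0]
    group_ids = original_groups_from_tags(instance.get("Tags", []))
    for eni in instance["NetworkInterfaces"]:
        ec2.modify_network_interface_attribute(
            NetworkInterfaceId=eni["NetworkInterfaceId"],
            Groups=group_ids,
        )
    return group_ids
```

If exact restoration matters, storing a mapping from interface ID to group IDs in the evidence bucket is a safer design than relying on a single tag.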
Revoking Compromised IAM Credentials
When IAM credentials are compromised, deactivating the access keys is not enough. Temporary credentials already issued from them remain valid until they expire, so you also need to invalidate those sessions immediately:
import json
from datetime import datetime, timezone

import boto3


def revoke_iam_sessions(user_name):
    """Revoke all active sessions for a compromised IAM user."""
    iam = boto3.client('iam')

    # Deactivate all access keys
    keys = iam.list_access_keys(UserName=user_name)
    for key in keys['AccessKeyMetadata']:
        iam.update_access_key(
            UserName=user_name,
            AccessKeyId=key['AccessKeyId'],
            Status='Inactive'
        )

    # Attach an inline policy that denies every action to temporary
    # credentials issued before now. This is faster than waiting for
    # existing tokens to expire. Requests signed with long-term access keys
    # carry no aws:TokenIssueTime, which is why the keys are deactivated above.
    cutoff = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    deny_policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Deny",
            "Action": "*",
            "Resource": "*",
            "Condition": {
                "DateLessThan": {
                    "aws:TokenIssueTime": cutoff
                }
            }
        }]
    }
    iam.put_user_policy(
        UserName=user_name,
        PolicyName='RevokeOlderSessions',
        PolicyDocument=json.dumps(deny_policy)
    )

    # Delete any console login profile
    try:
        iam.delete_login_profile(UserName=user_name)
    except iam.exceptions.NoSuchEntityException:
        pass

    return f"All sessions revoked for user {user_name}"
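If the attacker used the stolen credentials to assume a role, the same aws:TokenIssueTime condition can be attached to that role so sessions issued before a cutoff stop working. The sketch below is an illustration under assumptions: `build_revoke_policy` and `revoke_role_sessions` are hypothetical names, `iam` is assumed to be a boto3 IAM client, and `cutoff` is assumed to be a timezone-aware datetime.

```python
import json
from datetime import datetime, timezone


def build_revoke_policy(cutoff=None):
    """Deny all actions for temporary credentials issued before cutoff (UTC)."""
    cutoff = (cutoff or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Deny",
            "Action": "*",
            "Resource": "*",
            "Condition": {
                "DateLessThan": {
                    "aws:TokenIssueTime": cutoff.strftime("%Y-%m-%dT%H:%M:%SZ")
                }
            },
        }],
    }


def revoke_role_sessions(iam, role_name, cutoff=None):
    """Attach the deny policy inline to a role; iam is assumed to be boto3.client("iam")."""
    policy = build_revoke_policy(cutoff)
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName="RevokeOlderSessions",
        PolicyDocument=json.dumps(policy),
    )
    return policy
```

Sessions created after the cutoff keep working, so this only helps once the path the attacker used to assume the role has been closed as well.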
Evidence Preservation
Before you remediate anything, preserve evidence. You cannot investigate an incident after you have destroyed the artifacts.
Creating Forensic Snapshots
#!/bin/bash
# Forensic evidence collection script
# Usage: <script> <instance-id> <incident-id>
set -euo pipefail

INSTANCE_ID="${1:?usage: $0 <instance-id> <incident-id>}"
INCIDENT_ID="${2:?usage: $0 <instance-id> <incident-id>}"
FORENSICS_BUCKET="forensics-evidence-123456789012"

# Capture EBS snapshots of all attached volumes
VOLUME_IDS=$(aws ec2 describe-volumes \
    --filters "Name=attachment.instance-id,Values=${INSTANCE_ID}" \
    --query 'Volumes[].VolumeId' --output text)

for VOL_ID in $VOLUME_IDS; do
    SNAPSHOT_ID=$(aws ec2 create-snapshot \
        --volume-id "$VOL_ID" \
        --description "Forensic snapshot - Incident ${INCIDENT_ID}" \
        --tag-specifications "ResourceType=snapshot,Tags=[{Key=IncidentId,Value=${INCIDENT_ID}},{Key=Purpose,Value=Forensics},{Key=DoNotDelete,Value=true}]" \
        --query 'SnapshotId' --output text)
    echo "Created snapshot ${SNAPSHOT_ID} for volume ${VOL_ID}"
done

# Export instance console output
aws ec2 get-console-output \
    --instance-id "$INSTANCE_ID" \
    --output json > "/tmp/${INCIDENT_ID}-console-output.json"
aws s3 cp "/tmp/${INCIDENT_ID}-console-output.json" \
    "s3://${FORENSICS_BUCKET}/incidents/${INCIDENT_ID}/console-output.json"

# Capture instance metadata
aws ec2 describe-instances \
    --instance-ids "$INSTANCE_ID" \
    --output json > "/tmp/${INCIDENT_ID}-instance-metadata.json"
aws s3 cp "/tmp/${INCIDENT_ID}-instance-metadata.json" \
    "s3://${FORENSICS_BUCKET}/incidents/${INCIDENT_ID}/instance-metadata.json"

echo "Evidence collection complete for incident ${INCIDENT_ID}"
Store forensic evidence in a dedicated S3 bucket with versioning and Object Lock enabled (Object Lock requires versioning), plus a bucket policy that denies deletion to everyone outside the security team. Together these controls help preserve chain-of-custody integrity.
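The bucket controls described above could be applied along these lines. This is a sketch, not a vetted configuration: the function names, the 365-day retention period, the GOVERNANCE retention mode, and the single security role ARN are all assumptions, and `s3` is assumed to be a boto3 S3 client.

```python
import json


def evidence_bucket_policy(bucket, security_role_arn):
    """Deny object and version deletion to every principal except the security role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyDeleteOutsideSecurityTeam",
            "Effect": "Deny",
            "Principal": "*",
            "Action": ["s3:DeleteObject", "s3:DeleteObjectVersion"],
            "Resource": f"arn:aws:s3:::{bucket}/*",
            "Condition": {
                "ArnNotEquals": {"aws:PrincipalArn": security_role_arn}
            },
        }],
    }


def harden_evidence_bucket(s3, bucket, security_role_arn, retention_days=365):
    """Enable versioning, default Object Lock retention, and the deny-delete policy."""
    # Versioning must be enabled before Object Lock can be configured.
    s3.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )
    s3.put_object_lock_configuration(
        Bucket=bucket,
        ObjectLockConfiguration={
            "ObjectLockEnabled": "Enabled",
            "Rule": {
                "DefaultRetention": {"Mode": "GOVERNANCE", "Days": retention_days}
            },
        },
    )
    s3.put_bucket_policy(
        Bucket=bucket,
        Policy=json.dumps(evidence_bucket_policy(bucket, security_role_arn)),
    )
```

GOVERNANCE mode still lets specially permissioned principals override retention; COMPLIANCE mode does not, at the cost of being irreversible until the retention period ends. Which one fits depends on your legal and operational requirements.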
CloudTrail Analysis During Incidents
CloudTrail is the single most important service during a security investigation. It records management API activity in your AWS account by default, and data events such as S3 object access or Lambda invocations once you enable them, providing the raw material to reconstruct an attacker's path.
Querying CloudTrail Lake
CloudTrail Lake lets you run SQL queries against your event data without setting up Athena tables:
# CloudTrail Lake expects the event data store ID (the final segment of
# its ARN) in the FROM clause; set it here before running the queries.
EVENT_DATA_STORE_ID="<your-event-data-store-id>"

# Find all actions performed by a compromised access key.
# start-query returns a QueryId; fetch the rows with
# `aws cloudtrail get-query-results --query-id <QueryId>`.
aws cloudtrail start-query \
    --query-statement "
        SELECT eventTime, eventName, sourceIPAddress,
               userAgent, requestParameters, errorCode
        FROM ${EVENT_DATA_STORE_ID}
        WHERE userIdentity.accessKeyId = 'AKIAIOSFODNN7EXAMPLE'
          AND eventTime > '2025-12-01 00:00:00'
        ORDER BY eventTime ASC
    "

# Identify lateral movement - look for AssumeRole calls
aws cloudtrail start-query \
    --query-statement "
        SELECT eventTime, requestParameters, responseElements,
               sourceIPAddress, userIdentity.arn
        FROM ${EVENT_DATA_STORE_ID}
        WHERE eventName = 'AssumeRole'
          AND sourceIPAddress = '203.0.113.42'
          AND eventTime BETWEEN '2025-12-04 00:00:00' AND '2025-12-05 23:59:59'
        ORDER BY eventTime ASC
    "
Look for these key indicators during analysis: AssumeRole calls to roles the user does not normally assume, CreateAccessKey or CreateLoginProfile events indicating persistence, PutBucketPolicy or PutRolePolicy changes that widen access, and API calls from IP addresses or user agents that differ from the user's normal pattern.
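The persistence and access-widening indicators listed above can be swept for in a single query, and the results have to be polled for because CloudTrail Lake queries run asynchronously. The sketch below is a minimal illustration: `indicator_query` and `run_lake_query` are hypothetical helpers, `cloudtrail` is assumed to be a boto3 CloudTrail client, and the event data store ID and time window are caller-supplied values that the code does not sanitize.

```python
import time

INDICATOR_EVENTS = (
    "CreateAccessKey", "CreateLoginProfile", "PutBucketPolicy", "PutRolePolicy",
)


def indicator_query(event_data_store_id, start, end, event_names=INDICATOR_EVENTS):
    """Build a CloudTrail Lake SQL statement for the given event names."""
    for name in event_names:
        if not name.isalnum():  # keep quotes and other SQL syntax out of the list
            raise ValueError(f"Unexpected event name: {name!r}")
    names = ", ".join(f"'{name}'" for name in event_names)
    return (
        "SELECT eventTime, eventName, sourceIPAddress, userAgent, userIdentity.arn "
        f"FROM {event_data_store_id} "
        f"WHERE eventName IN ({names}) "
        f"AND eventTime BETWEEN '{start}' AND '{end}' "
        "ORDER BY eventTime ASC"
    )


def run_lake_query(cloudtrail, statement, poll_seconds=2.0, max_polls=150):
    """Start a query, wait for it to finish, and return rows as flat dicts."""
    query_id = cloudtrail.start_query(QueryStatement=statement)["QueryId"]
    for _ in range(max_polls):
        page = cloudtrail.get_query_results(QueryId=query_id)
        status = page["QueryStatus"]
        if status == "FINISHED":
            break
        if status in ("FAILED", "CANCELLED", "TIMED_OUT"):
            raise RuntimeError(f"Query {query_id} ended with status {status}")
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"Query {query_id} did not finish in time")

    rows = []
    while True:
        # Each result row is a list of single-entry {column: value} dicts.
        for row in page.get("QueryResultRows", []):
            merged = {}
            for cell in row:
                merged.update(cell)
            rows.append(merged)
        token = page.get("NextToken")
        if not token:
            return rows
        page = cloudtrail.get_query_results(QueryId=query_id, NextToken=token)
```

A typical call would pass the statement from `indicator_query` to `run_lake_query` and then compare each row's source IP address and user agent against the identity's normal pattern.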
Runbook Automation with Systems Manager
AWS Systems Manager Automation documents codify your response procedures so they execute consistently every time. Define runbooks for your most common incident types and test them regularly.
Build runbooks for credential compromise, data exfiltration detection, cryptomining containment, and unauthorized resource creation. Each runbook should include automated steps for containment, evidence collection, notification, and initial remediation, with human approval gates before destructive actions like instance termination.
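An approval gate in front of a destructive step might look like the following Automation document, built here as a Python dictionary. Treat it as a sketch: the step names, the tag value, and the `termination_runbook` helper are hypothetical, the approver and SNS topic ARNs are supplied by the caller, and the document has not been validated against Systems Manager.

```python
import json


def termination_runbook(approver_arns, notification_topic_arn):
    """Build an Automation document that tags, waits for approval, then terminates."""
    return {
        "schemaVersion": "0.3",
        "description": "Tag a compromised instance and terminate it after approval.",
        "parameters": {
            "InstanceId": {"type": "String", "description": "Instance to act on"},
        },
        "mainSteps": [
            {
                "name": "TagAsUnderInvestigation",
                "action": "aws:createTags",
                "inputs": {
                    "ResourceType": "EC2",
                    "ResourceIds": ["{{ InstanceId }}"],
                    "Tags": [{"Key": "SecurityStatus", "Value": "UnderInvestigation"}],
                },
            },
            {
                # Human approval gate before the destructive step
                "name": "ApproveTermination",
                "action": "aws:approve",
                "inputs": {
                    "Approvers": approver_arns,
                    "NotificationArn": notification_topic_arn,
                    "Message": "Approve termination of {{ InstanceId }}?",
                    "MinRequiredApprovals": 1,
                },
            },
            {
                "name": "TerminateInstance",
                "action": "aws:changeInstanceState",
                "inputs": {
                    "InstanceIds": ["{{ InstanceId }}"],
                    "DesiredState": "terminated",
                },
            },
        ],
    }


def runbook_json(approver_arns, notification_topic_arn):
    """Serialize the runbook for registration as an Automation document."""
    return json.dumps(termination_runbook(approver_arns, notification_topic_arn), indent=2)
```

The resulting JSON could be registered with the SSM `create_document` call using the Automation document type. Evidence collection and notification steps would sit between the tagging and approval steps in a fuller runbook.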
Post-Incident Activity
The post-incident review is where lasting security improvements happen. Document the timeline, root cause, blast radius, and remediation actions. Critically, update your IAM policies based on what the incident revealed. If an attacker exploited an overpermissive role, that is a signal that your permission model needs tightening.
Securing Incident Response with AccessLens
Most AWS security incidents begin with compromised or overpermissive IAM credentials. An attacker who obtains a set of access keys immediately inherits every permission attached to that identity, and in many organizations, those permissions far exceed what the identity actually needs.
AccessLens strengthens your incident response posture by:
- Mapping permission blast radius so you can instantly understand what a compromised identity can access across accounts
- Identifying overpermissive roles before they become the entry point for an attacker
- Visualizing cross-account trust relationships to reveal lateral movement paths an attacker could exploit
- Scoring risk continuously to highlight the IAM configurations most likely to lead to incidents
- Accelerating forensic investigation by providing a pre-built map of who has access to what
The best incident response is the incident that never happens. AccessLens helps you eliminate the IAM misconfigurations that attackers exploit in the first place.
Strengthen your incident response with AccessLens and gain the IAM visibility needed to contain threats before they escalate.