
Amazon Macie Data Protection: Automated Sensitive Data Discovery and Classification


Amazon Macie uses machine learning and pattern matching to discover and protect sensitive data stored in Amazon S3. As organizations accumulate terabytes of unstructured data across hundreds of buckets, manual classification becomes impossible. Macie automates the process, identifying PII, financial data, credentials, and custom data patterns at scale. This guide covers practical deployment strategies, custom identifiers, Security Hub integration, and cost optimization techniques.

S3 Bucket Inventory and Security Posture

Automated Bucket Assessment

When you enable Macie, it inventories the S3 buckets in the account for the current Region and evaluates their security configuration. Macie is a regional service, so enable it in every Region where you store data. This inventory surfaces buckets with public access, missing encryption, or overly permissive policies before you even run a discovery job.

import boto3
import json
from datetime import datetime

class MacieSecurityPosture:
    def __init__(self):
        self.macie = boto3.client('macie2')
        self.s3 = boto3.client('s3')

    def assess_bucket_security(self):
        """Retrieve Macie bucket inventory and flag risky configurations."""
        paginator = self.macie.get_paginator('describe_buckets')
        risky_buckets = []

        for page in paginator.paginate():
            for bucket in page['buckets']:
                risk_factors = []

                if bucket.get('publicAccess', {}).get('effectivePermission') == 'PUBLIC':
                    risk_factors.append('PUBLIC_ACCESS')

                encryption = bucket.get('serverSideEncryption', {})
                if encryption.get('type') == 'NONE':
                    risk_factors.append('NO_ENCRYPTION')

                if not bucket.get('versioning', False):
                    risk_factors.append('NO_VERSIONING')

                if bucket.get('sharedAccess') in ['EXTERNAL', 'UNKNOWN']:
                    risk_factors.append('EXTERNAL_SHARED_ACCESS')

                if risk_factors:
                    risky_buckets.append({
                        'bucket_name': bucket['bucketName'],
                        'region': bucket['region'],
                        'risk_factors': risk_factors,
                        'object_count': bucket.get('objectCount', 0),
                        'size_bytes': bucket.get('sizeInBytes', 0)
                    })

        return sorted(risky_buckets, key=lambda b: len(b['risk_factors']), reverse=True)

    def get_unclassified_summary(self):
        """Summarize how much data Macie can and cannot classify in this account."""
        stats = self.macie.get_bucket_statistics()
        return {
            'total_buckets': stats['bucketCount'],
            'classifiable_objects': stats['classifiableObjectCount'],
            'classifiable_size_gb': round(stats['classifiableSizeInBytes'] / (1024**3), 2),
            # Objects Macie cannot analyze, broken down by fileType and storageClass
            'unclassifiable_objects': stats.get('unclassifiableObjectCount', {})
        }

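This inventory runs continuously once Macie is enabled. Outside the 30-day free trial it is billed per bucket evaluated rather than per gigabyte scanned, so its cost does not grow with data volume. Use it as the foundation for prioritizing which buckets need active discovery jobs.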
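One way to turn the inventory into a job scope is to rank the flagged buckets and stop at a size budget. The sketch below assumes records in the shape returned by assess_bucket_security(); the function name prioritize_for_discovery, the size budget, and the sample bucket names are all hypothetical.

```python
from typing import Dict, List


def prioritize_for_discovery(risky_buckets: List[Dict], max_total_gb: float) -> List[str]:
    """Pick bucket names for a discovery job, riskiest first, within a size budget."""
    # Most risk factors first; among equals, smaller (cheaper to scan) buckets first.
    ranked = sorted(
        risky_buckets,
        key=lambda b: (-len(b["risk_factors"]), b["size_bytes"]),
    )
    selected: List[str] = []
    used_gb = 0.0
    for bucket in ranked:
        size_gb = bucket["size_bytes"] / (1024 ** 3)
        if used_gb + size_gb > max_total_gb:
            continue  # does not fit in the remaining budget; try smaller buckets
        selected.append(bucket["bucket_name"])
        used_gb += size_gb
    return selected


# Hypothetical records for illustration only
sample = [
    {"bucket_name": "crm-exports", "risk_factors": ["PUBLIC_ACCESS", "NO_ENCRYPTION"],
     "size_bytes": 40 * 1024 ** 3},
    {"bucket_name": "app-logs", "risk_factors": ["NO_VERSIONING"],
     "size_bytes": 500 * 1024 ** 3},
    {"bucket_name": "user-uploads", "risk_factors": ["EXTERNAL_SHARED_ACCESS"],
     "size_bytes": 25 * 1024 ** 3},
]
print(prioritize_for_discovery(sample, max_total_gb=100))
```

Counting risk factors is a crude ranking. Weighting PUBLIC_ACCESS more heavily than NO_VERSIONING would likely be a better fit for most environments.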

Custom Data Identifiers

Building Organization-Specific Classifiers

Macie's built-in managed data identifiers cover common PII and financial patterns, but most organizations have proprietary data formats that require custom identifiers. Custom identifiers use regex patterns with optional keyword proximity requirements to reduce false positives.

{
  "CustomDataIdentifiers": [
    {
      "name": "InternalProjectCode",
      "description": "Matches internal project codes like PROJ-2026-XXXX",
      "regex": "PROJ-\\d{4}-[A-Z0-9]{4,8}",
      "keywords": ["project", "initiative", "program"],
      "maximumMatchDistance": 50
    },
    {
      "name": "InternalCustomerAccountId",
      "description": "Matches customer account identifiers in CRM exports",
      "regex": "CUST-[A-Z]{2}\\d{6,10}",
      "keywords": ["customer", "account", "client"],
      "maximumMatchDistance": 30
    },
    {
      "name": "AWSAccessKeyInConfig",
      "description": "Detects AWS access keys near config-like context",
      "regex": "AKIA[0-9A-Z]{16}",
      "keywords": ["aws_access_key", "AccessKeyId", "credential"],
      "maximumMatchDistance": 100
    }
  ]
}

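The maximumMatchDistance parameter is key to reducing noise. It sets the maximum number of characters allowed between a keyword and the text that matches the regex; a match with no keyword inside that window is ignored. Macie accepts values from 1 to 300 and defaults to 50. Setting the distance too high produces false positives; setting it too low misses legitimate matches. Start with 50 characters and adjust based on findings.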

Sensitive Data Discovery Jobs

Cost-Effective Job Configuration

Macie charges per gigabyte of data scanned, so running full scans across every bucket monthly can become expensive. Use sampling, scoping, and scheduling to keep costs predictable.

def create_optimized_discovery_job(self, bucket_names, sample_percentage=20):
    """Create a scheduled Macie job scoped to high-risk buckets with sampling."""
    # Look up the account ID once; Macie needs it to locate the buckets.
    account_id = boto3.client('sts').get_caller_identity()['Account']

    job_response = self.macie.create_classification_job(
        name=f"sensitive-data-scan-{datetime.now().strftime('%Y%m%d')}",
        jobType='SCHEDULED',
        # Run on the first day of every month
        scheduleFrequency={
            'monthlySchedule': {'dayOfMonth': 1}
        },
        s3JobDefinition={
            'bucketDefinitions': [{
                'accountId': account_id,
                'buckets': bucket_names
            }],
            # Only scan objects under prefixes likely to hold sensitive data
            'scoping': {
                'includes': {
                    'and': [
                        {
                            'simpleScopeTerm': {
                                'comparator': 'STARTS_WITH',
                                'key': 'OBJECT_KEY',
                                'values': ['exports/', 'uploads/', 'data/']
                            }
                        }
                    ]
                }
            }
        },
        # Percentage of eligible objects to analyze on each run
        samplingPercentage=sample_percentage,
        # Helper that returns the IDs of the custom data identifiers to apply
        customDataIdentifierIds=self.get_custom_identifier_ids(),
        tags={'CostCenter': 'Security', 'Schedule': 'Monthly'}
    )

    return job_response['jobId']
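The job code calls self.get_custom_identifier_ids() without defining it. One possible version is sketched below as a standalone function that takes the macie2 client as an argument; as a method on the class it would use self.macie instead. It relies on the list_custom_data_identifiers paginator, whose pages carry the identifiers under an items key.

```python
from typing import List


def get_custom_identifier_ids(macie) -> List[str]:
    """Return the IDs of all custom data identifiers in the current account and Region."""
    ids: List[str] = []
    paginator = macie.get_paginator("list_custom_data_identifiers")
    for page in paginator.paginate():
        # Each item is a summary with id, name, arn, description and createdAt.
        ids.extend(item["id"] for item in page.get("items", []))
    return ids


# Usage (requires AWS credentials and Macie enabled):
#   import boto3
#   print(get_custom_identifier_ids(boto3.client("macie2")))
```

Returning every identifier is the simplest choice. If some identifiers are experimental, filtering by name prefix before passing the list to the job would keep noisy patterns out of scheduled scans.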

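As a rule of thumb, a 20% sampling rate across targeted prefixes catches 90% or more of recurring sensitive data patterns at a fraction of the cost of full scans. Sampling is random, so it can still miss sensitive data that appears in only a handful of objects. Run full scans quarterly and sampled scans monthly for an effective balance.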
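The arithmetic behind that cadence can be sketched as follows. This assumes four full scans per year with sampled scans in the other eight months, a flat per-GB price, and that sampling a percentage of objects scans roughly the same percentage of bytes. The price of 1.0 is a placeholder, not a real Macie rate (actual pricing is tiered and varies by Region), and both function names are hypothetical.

```python
from typing import Dict


def estimate_scan_gb(total_gb: float, sample_percentage: float) -> float:
    """Approximate data volume analyzed by one sampled run."""
    return total_gb * sample_percentage / 100


def yearly_plan_cost(total_gb: float, sample_percentage: float, price_per_gb: float) -> Dict[str, float]:
    """Compare quarterly-full plus monthly-sampled scanning with monthly full scans."""
    full_gb = 4 * total_gb                                          # four quarterly full scans
    sampled_gb = 8 * estimate_scan_gb(total_gb, sample_percentage)  # remaining eight months
    scanned_gb = full_gb + sampled_gb
    return {
        "scanned_gb": scanned_gb,
        "cost": scanned_gb * price_per_gb,
        "cost_all_full": 12 * total_gb * price_per_gb,
    }


# 1,000 GB in scope, 20% sampling, placeholder price of 1.0 per GB
plan = yearly_plan_cost(total_gb=1000, sample_percentage=20, price_per_gb=1.0)
print(plan)
```

With these inputs the mixed plan analyzes 5,600 GB per year against 12,000 GB for monthly full scans, which is a bit under half the volume.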

Security Hub Integration and Automated Remediation

EventBridge Rules for Critical Findings

Macie publishes findings to both Security Hub and EventBridge. Use EventBridge rules to trigger automated remediation when Macie discovers exposed sensitive data.

#!/bin/bash
# Create EventBridge rule for high-severity Macie findings
# (Macie severity levels are Low, Medium, and High)

aws events put-rule \
  --name "macie-critical-findings" \
  --event-pattern '{
    "source": ["aws.macie"],
    "detail-type": ["Macie Finding"],
    "detail": {
      "severity": {
        "description": ["High"]
      },
      "type": [{
        "prefix": "SensitiveData"
      }]
    }
  }' \
  --state ENABLED \
  --description "Route critical Macie findings to remediation"

# Create target to invoke remediation Lambda
aws events put-targets \
  --rule "macie-critical-findings" \
  --targets '[{
    "Id": "macie-remediation-lambda",
    "Arn": "arn:aws:lambda:us-east-1:123456789012:function:macie-auto-remediate",
    "InputTransformer": {
      "InputPathsMap": {
        "bucket": "$.detail.resourcesAffected.s3Bucket.name",
        "finding_type": "$.detail.type",
        "severity": "$.detail.severity.description"
      },
      "InputTemplate": "{\"bucket\": <bucket>, \"finding_type\": <finding_type>, \"severity\": <severity>}"
    }
  }]'

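# Allow EventBridge to invoke the remediation Lambda
aws lambda add-permission \
  --function-name "macie-auto-remediate" \
  --statement-id "macie-critical-findings-invoke" \
  --action "lambda:InvokeFunction" \
  --principal "events.amazonaws.com" \
  --source-arn "arn:aws:events:us-east-1:123456789012:rule/macie-critical-findings"

echo "EventBridge rule configured for Macie critical findings"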

For the most critical findings, such as credentials or PII in public buckets, the remediation Lambda should immediately block public access and notify the security team. Less critical findings can be routed to a ticketing system for human review.
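A minimal sketch of such a remediation Lambda follows, assuming the event arrives in the shape produced by the input transformer above (bucket, finding_type, severity). The set of finding types treated as urgent, the ALERT_TOPIC_ARN environment variable, and the use of a single SNS topic for both alerts and ticket requests are assumptions for illustration.

```python
import json
import os

BLOCK_ALL_PUBLIC_ACCESS = {
    "BlockPublicAcls": True,
    "IgnorePublicAcls": True,
    "BlockPublicPolicy": True,
    "RestrictPublicBuckets": True,
}

# Finding types treated as urgent in this sketch (an assumption; tune to your policy)
URGENT_TYPES = (
    "SensitiveData:S3Object/Credentials",
    "SensitiveData:S3Object/Personal",
    "SensitiveData:S3Object/Multiple",
)


def choose_action(finding_type: str, severity: str) -> str:
    """Decide between immediate lockdown and human review."""
    if severity == "High" and finding_type in URGENT_TYPES:
        return "BLOCK_AND_NOTIFY"
    return "OPEN_TICKET"


def remediate(event: dict, s3, sns, topic_arn: str) -> str:
    """Apply the chosen action using the supplied S3 and SNS clients."""
    action = choose_action(event["finding_type"], event["severity"])
    if action == "BLOCK_AND_NOTIFY":
        # Turn on all four Block Public Access settings for the bucket
        s3.put_public_access_block(
            Bucket=event["bucket"],
            PublicAccessBlockConfiguration=BLOCK_ALL_PUBLIC_ACCESS,
        )
    # Both paths publish to SNS; a ticketing integration could subscribe to the topic
    sns.publish(
        TopicArn=topic_arn,
        Subject=f"Macie {action}: {event['bucket']}"[:100],
        Message=json.dumps({**event, "action": action}),
    )
    return action


def lambda_handler(event, context):
    import boto3  # imported lazily so the logic above can be tested without AWS
    return remediate(event, boto3.client("s3"), boto3.client("sns"),
                     os.environ["ALERT_TOPIC_ARN"])
```

The transformed event does not say whether the bucket is actually public, so this sketch blocks public access on any urgent finding. Blocking public access on a bucket that intentionally serves public content will break it, so an allow list of exempt buckets is worth adding before enabling this in production.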

Cost Monitoring

Track Macie spending on the Usage page of the Macie console or through the GetUsageStatistics and GetUsageTotals APIs. Macie publishes estimated usage costs there, which lets you forecast spending before the bill arrives. Set a billing alarm (for example, a CloudWatch alarm on estimated charges or an AWS Budgets alert) at 80% of your monthly budget to avoid surprises. Review the Macie usage page monthly to identify buckets where scanning costs are disproportionate to the data's sensitivity, and exclude low-risk buckets from future jobs.
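The 80% check can also be scripted. The sketch below assumes the response shape of the macie2 get_usage_totals call (a usageTotals list whose entries carry type and a string estimatedCost); the function names and the budget figure are hypothetical.

```python
from typing import Dict


def summarize_usage(usage_totals_response: Dict, monthly_budget: float,
                    alert_ratio: float = 0.8) -> Dict:
    """Total the estimated costs and flag when they reach the alert threshold."""
    # estimatedCost arrives as a string, so convert before summing
    costs = {
        entry["type"]: float(entry["estimatedCost"])
        for entry in usage_totals_response.get("usageTotals", [])
    }
    total = sum(costs.values())
    return {
        "by_type": costs,
        "total": total,
        "over_threshold": total >= monthly_budget * alert_ratio,
    }


def check_macie_budget(macie, monthly_budget: float) -> Dict:
    """Fetch month-to-date usage from Macie and compare it with the budget."""
    response = macie.get_usage_totals(timeRange="MONTH_TO_DATE")
    return summarize_usage(response, monthly_budget)


# Usage (requires AWS credentials and Macie enabled):
#   import boto3
#   print(check_macie_budget(boto3.client("macie2"), monthly_budget=500.0))
```

Running this from a scheduled Lambda and publishing to a notification topic when over_threshold is true would give an alert specific to Macie, separate from account-wide billing alarms.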

Securing Data Classification with AccessLens

Macie reveals what sensitive data exists in your S3 buckets, but understanding who can access that data requires deep IAM analysis. A bucket might contain PII that Macie correctly flags, but the real risk depends on which IAM principals have s3:GetObject permissions, whether cross-account roles can reach the bucket, and whether overly broad policies grant unintended access.

AccessLens complements Macie by providing:

  • IAM policy analysis that maps which principals can access buckets containing sensitive data flagged by Macie
  • Cross-account access visualization showing trust relationships that could expose classified data to external accounts
  • Permission drift detection that alerts when IAM changes expand access to sensitive S3 resources
  • Risk scoring that combines Macie data classification with IAM exposure analysis for a complete picture of data risk
  • Continuous monitoring that tracks access patterns to sensitive data buckets over time

Knowing that a bucket contains sensitive data is only half the equation. Understanding who can reach it completes the picture.

Get complete data access visibility with AccessLens and pair your Macie data classification with the IAM analysis needed to truly protect sensitive data in AWS.
