
Business baseline based alerting with Transilience MDR skills

Venkat Pothamsetty | June 29, 2026 | 16 min read

Baseline-driven MDR is easiest to understand through alerts. This is a companion post to the announcement of Transilience MDR Rainer, focused on what the baseline framework produces in practice.

The alerting framework is the same for every customer:

  • Build a business-as-usual model from normalized CloudTrail observations.
  • Compare new events against account, actor, event, source, region, and business-family tuples.
  • Keep critical override detections alertable even if similar activity appeared before.
  • Explain why the event differs from baseline.
  • Require the right evidence for triage.
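
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Transilience implementation: the `Observation` fields mirror the tuple dimensions named in the list, while the class name, the `CRITICAL_OVERRIDES` set, the reason strings, and all sample values are hypothetical.

```python
from dataclasses import dataclass


# Hypothetical record; the fields mirror the tuple dimensions listed above.
@dataclass(frozen=True)
class Observation:
    account: str
    actor: str
    event: str
    source: str
    region: str
    family: str


# Illustrative override set. The post names CreateAccessKey as a critical
# override; a real deployment would carry a longer, reviewed list.
CRITICAL_OVERRIDES = frozenset({"CreateAccessKey"})


def build_baseline(observations):
    """Business-as-usual model: the set of tuples seen in the baseline window."""
    return set(observations)


def evaluate(obs, baseline):
    """Return the reasons an observation is alertable (empty list = normal)."""
    reasons = []
    if obs.event in CRITICAL_OVERRIDES:
        reasons.append("critical_override")  # alertable even if seen before
    if obs not in baseline:
        reasons.append("new_tuple")  # differs from business as usual
    return reasons


# Made-up sample data for illustration only.
routine = Observation("app-prod", "deploy-role", "CreateRole",
                      "cloudformation.amazonaws.com", "us-east-1", "ci_cd")
key_rotation = Observation("tools-prod", "rotation-job", "CreateAccessKey",
                           "10.0.0.5", "us-east-1", "identity")
baseline = build_baseline([routine, key_rotation])

print(evaluate(routine, baseline))       # known tuple, no override
print(evaluate(key_rotation, baseline))  # known tuple, but the override still fires
```

The returned reasons double as the "why it differs" explanation, and they are what an analyst would pair with the required evidence.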

The interesting part is that the alerts do not look the same across customers. They should not. Two companies can both run AWS, both use SSO, both rely on KMS, and both generate CloudTrail events, but their operating models can be completely different.

This post uses two anonymized customer examples with dummy names:

  • Customer Atlas: a large, automation-heavy cloud estate.
  • Customer Beacon: a smaller, service-heavy environment dominated by inventory, KMS, SSM, monitoring, and data-platform activity.

The examples below are representative of the real baseline patterns. Customer names, account IDs, user names, domains, exact IPs, and customer-specific role names have been replaced.

Customer Atlas: Automation-Heavy Cloud Operations

Customer Atlas had a broad AWS footprint: 42,258 CloudTrail observations across 78 accounts in the baseline window. The normal operating model was dominated by deployment and identity activity.

| Baseline family | Share | Interpretation |
|---|---|---|
| CI/CD and deployment automation | 46.1% | Infrastructure-as-code and service automation routinely created roles, attached policies, launched instances, and terminated instances. |
| Human, SSO, and console access | 17.6% | Console and SSO usage existed, but it was not the main source of control-plane activity. |
| Identity and privilege management | 15.2% | IAM writes were common, mostly through automation and service roles. |
| Network and perimeter changes | 6.7% | Security group changes were frequent and often tied to known deployment controllers. |
| Storage, data, and KMS | 6.4% | KMS grants and storage control-plane changes were routine but sensitive. |
| Security visibility | 2.7% | CloudTrail and Config changes were rare enough to keep as strong overrides. |
| Database operations | 0.6% | Database changes were low-volume and high-signal. |

For Atlas, the framework had to avoid noisy rules like "alert on every CreateRole." Role creation was normal. The better question was: did this role creation happen through the right automation, in the right account, from the right source, in the right region?

Alert 1: Static Access Key Creation by a Human-Looking Actor

  • Customer: Atlas
  • Alert: CreateAccessKey outside designated rotation
  • Severity: Critical
  • Business risk: credential_or_privilege_risk
  • Account: shared-tools-prod
  • Actor: named-admin-session
  • Source: new unmanaged public IP
  • Region: us-east-1
  • Event: CreateAccessKey
  • Count: 1

Why it differed from baseline

Atlas had plenty of IAM activity, but that activity was normally produced by deployment automation, SSO service roles, or account-management workflows. Static access-key creation by a human-looking actor was not part of the normal tuple for this account and source.

The alert fired for two reasons:

  • CreateAccessKey is a critical override. It remains alertable even if access-key creation appeared somewhere in history.
  • The actor/source/event tuple was new for the account.

Why the framework worked

A naive "IAM activity is common here" suppression would have hidden this event. A naive "all IAM changes are critical" rule would have buried the team in deployment noise. The baseline framework split the difference: deployment IAM writes could be treated as normal only when the exact tuple matched; static credential creation stayed critical.

Evidence required

  • Raw CloudTrail event.
  • Target IAM user.
  • Caller identity and session issuer.
  • Source ownership.
  • Rotation job, ticket, or emergency approval.

Alert 2: Deployment Role Creates IAM Resources in a New Account

  • Customer: Atlas
  • Alert: Deployment IAM write in a new account tuple
  • Severity: High
  • Business risk: credential_or_privilege_risk
  • Account: newly-onboarded-app-prod
  • Actor: stack-management-exec-role
  • Source: cloudformation.amazonaws.com
  • Region: us-east-1
  • Events: CreateRole, AttachRolePolicy, PutRolePolicy
  • Count: multiple events in one deployment burst

Why it differed from baseline

This is the sort of alert that makes baseline-driven detection useful. CreateRole, AttachRolePolicy, and PutRolePolicy were all normal in Atlas. The dominant baseline family was CI/CD and deployment automation. But this exact account/actor/event/source tuple was new.

That does not automatically mean compromise. It could be a legitimate onboarding of a new account into StackSets or infrastructure-as-code. But it is exactly the kind of change MDR should see.

Why the framework worked

The alert did not say "CloudFormation is bad." It said:

  • this CloudFormation-style actor normally performs these events elsewhere,
  • the events are privilege-bearing,
  • the account tuple is new,
  • and the baseline has not yet approved this path.

That gives the analyst a focused question: was this account intentionally onboarded into the deployment framework?
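
One way to produce that focused question is to report which dimension of the tuple drifted. The sketch below is an assumption about how such an explanation could be computed, not the product's logic: `drifted_dimensions`, the dictionary keys, and the `existing-app-prod` baseline row are hypothetical.

```python
DIMENSIONS = ("account", "actor", "source", "region")


def drifted_dimensions(event, baseline):
    """Return the dimensions whose value was never seen with this event name."""
    same_event = [b for b in baseline if b["event"] == event["event"]]
    return [
        dim for dim in DIMENSIONS
        if event[dim] not in {b[dim] for b in same_event}
    ]


# Hypothetical baseline row: the deployment role creating roles elsewhere.
baseline = [
    {"account": "existing-app-prod", "actor": "stack-management-exec-role",
     "event": "CreateRole", "source": "cloudformation.amazonaws.com",
     "region": "us-east-1"},
]

# The Alert 2 pattern: same actor, event, source, and region; new account.
new_event = {"account": "newly-onboarded-app-prod",
             "actor": "stack-management-exec-role",
             "event": "CreateRole",
             "source": "cloudformation.amazonaws.com",
             "region": "us-east-1"}

print(drifted_dimensions(new_event, baseline))
```

Each dimension is checked independently here, so this helper explains a drift rather than replacing the exact-tuple comparison: a tuple can be new even when every individual value has been seen before.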

Evidence required

  • Change request for account onboarding.
  • Stack or StackSet operation metadata.
  • Policies attached.
  • Target roles created.
  • Confirmation from platform engineering.

Alert 3: Route Table Change by an SSO Administrator

  • Customer: Atlas
  • Alert: Network exposure change outside normal automation
  • Severity: High
  • Business risk: internet_exposure
  • Account: content-platform-prod
  • Actor: sso-admin-role
  • Source: corporate-egress-new
  • Region: ap-southeast-1
  • Event: CreateRoute
  • Count: 1

Why it differed from baseline

Atlas had frequent network changes, but the normal network-change pattern was tied to known automation: container load-balancer controllers, deployment roles, or service-managed infrastructure. A route change from an SSO administrator was different.

The event was alertable because:

  • route-table writes can alter exposure,
  • the actor was outside the normal network automation class,
  • the source was new for the actor/event tuple,
  • and the event carried internet-exposure risk.

Why the framework worked

The same baseline that suppressed routine controller-driven security group churn still caught a human-driven route change. That is the desired behavior. High-volume network operations do not become a blanket allowlist.

Evidence required

  • Route table diff.
  • Destination CIDR and target gateway/attachment.
  • Change ticket.
  • Actor owner confirmation.
  • Exposure assessment.

Alert 4: KMS Grant Creation by an Unknown or Non-Standard Actor

  • Customer: Atlas
  • Alert: KMS grant change by unusual actor
  • Severity: High
  • Business risk: credential_or_privilege_risk
  • Account: regional-workload-prod
  • Actor: unknown-or-nonstandard-actor
  • Source: ec2-frontend-api.amazonaws.com
  • Region: eu-west-2
  • Event: CreateGrant
  • Count: repeated over several days

Why it differed from baseline

KMS CreateGrant was not rare in Atlas. The baseline showed thousands of KMS grant operations. That is normal in encrypted cloud environments.

The anomaly was not the API name. The anomaly was the actor and source context. The event looked like deployment or service activity, but the actor did not match the normal deployment-role pattern for that account.

Why the framework worked

A static KMS rule would either fire constantly or be suppressed globally. The baseline rule fired only when the KMS actor/source tuple drifted from the approved model.

Evidence required

  • KMS key ARN.
  • Grantee principal.
  • Grant operations.
  • Calling service.
  • Workload owner attestation.

Alert 5: Destructive Database Control-Plane Action

  • Customer: Atlas
  • Alert: DeleteDBInstance outside expected database workflow
  • Severity: High to Critical
  • Business risk: data_loss_or_outage
  • Account: application-prod
  • Actor: database-service-or-admin-role
  • Source: rds.application-autoscaling.amazonaws.com or corporate-egress
  • Region: workload-region
  • Event: DeleteDBInstance
  • Count: 1

Why it differed from baseline

Database control-plane activity was low-volume in Atlas. That made destructive database events high-signal even when the actor was a service role.

The framework treated the event as alertable because:

  • database deletion can create outage or data-loss risk,
  • the exact actor/event/source tuple was rare or new,
  • and destructive database actions require evidence even when they are planned.

Why the framework worked

The event did not disappear under "service role" logic. Service roles are normal only inside the expected account, event, source, and workload context.

Evidence required

  • DB identifier.
  • Final snapshot setting.
  • Backup status.
  • Decommission ticket.
  • Workload owner approval.

Customer Beacon: Service-Heavy and Inventory-Heavy Operations

Customer Beacon had a smaller AWS footprint by account count, but a much higher event volume: 357,233 deduplicated CloudTrail events across 6 accounts in the broader baseline, with a 60,000-observation MDR run producing 1,683 candidate deviations.

Beacon's baseline looked nothing like Atlas.

| Baseline category | Observations | Interpretation |
|---|---|---|
| Compute and network control plane | 84,216 | Mostly EC2, VPC, load balancer, and inventory-style control-plane activity. |
| Identity and session | 63,667 | Heavy STS role assumption and service-to-service access. |
| Governance and monitoring | 52,505 | Trusted Advisor, Service Quotas, Config, and monitoring checks. |
| Systems management | 46,821 | SSM managed-instance inventory and status updates. |
| Encryption and KMS | 33,493 | KMS decrypt and data-key operations in the normal service path. |
| Logging and observability | 14,062 | CloudWatch Logs and application/service telemetry. |
| Database | 11,670 | Mostly database API and data-platform control-plane activity. |
| Data lake / ETL | 6,034 | Glue, Redshift Data API, and related service workflows. |
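
To compare Beacon's raw counts with Atlas's percentage shares, the counts can be converted to shares. The short calculation below assumes the table rows are drawn from the same 357,233-event population quoted above; the post does not state this explicitly, so treat the resulting percentages as illustrative.

```python
# Counts copied from the Beacon baseline table.
beacon_counts = {
    "Compute and network control plane": 84_216,
    "Identity and session": 63_667,
    "Governance and monitoring": 52_505,
    "Systems management": 46_821,
    "Encryption and KMS": 33_493,
    "Logging and observability": 14_062,
    "Database": 11_670,
    "Data lake / ETL": 6_034,
}
TOTAL_EVENTS = 357_233        # deduplicated events in the broader baseline
RUN_OBSERVATIONS = 60_000     # observations in the MDR run
CANDIDATE_DEVIATIONS = 1_683  # candidate deviations from that run

# Assumption: each category is a share of the full deduplicated total.
shares = {name: 100 * count / TOTAL_EVENTS
          for name, count in beacon_counts.items()}
listed = sum(beacon_counts.values())
unlisted = TOTAL_EVENTS - listed  # events outside the eight listed categories
deviation_rate = 100 * CANDIDATE_DEVIATIONS / RUN_OBSERVATIONS

for name, share in shares.items():
    print(f"{name}: {share:.1f}%")
print(f"Listed: {listed}, outside listed categories: {unlisted}")
print(f"Candidate deviation rate: {deviation_rate:.2f}%")
```

Under that assumption the eight listed categories cover 312,468 of the 357,233 events, and the MDR run flags roughly 2.8% of its observations as candidate deviations.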

Beacon did not show the same high-volume classic CI/CD signature as Atlas. Routine behavior was dominated by STS, KMS, SSM, monitoring, CloudWatch Logs, data-platform services, and low-volume SAML access.

That changed the alert interpretation. A new deployment service could be more interesting at Beacon than at Atlas. A read-heavy Direct Connect or Transit Gateway discovery pattern might be a medium-priority candidate, not a critical incident. A new KMS grant writer or destructive snapshot event still deserved escalation.

Alert 1: New Monitoring or Optimization Service Assumes a Role

  • Customer: Beacon
  • Alert: New service role assumption tuple
  • Severity: Medium
  • Business risk: operational_change_risk
  • Account: app-prod
  • Actor: AWS service principal
  • Source: compute-optimizer.amazonaws.com
  • Region: us-east-1
  • Event: AssumeRole
  • Count: burst of role assumptions

Why it differed from baseline

Beacon had a lot of AssumeRole activity. In the broader baseline, AssumeRole was the top event. The alert did not fire because role assumption was suspicious by itself.

It fired because the service/account/role tuple was new. Beacon's baseline had known service roles for data, monitoring, and operations. This service principal was not part of the approved model for that account.

Why the framework worked

In Atlas, a new deployment automation role might be expected during account onboarding. In Beacon, a new service integration was more significant because the baseline was stable and service-heavy. The same tuple-drift logic worked, but the triage question changed: "Was this AWS service intentionally enabled?"

Evidence required

  • Service enablement record.
  • Role trust policy.
  • Permissions granted.
  • Account owner confirmation.
  • First-seen timestamp.

Alert 2: Transit and Direct Connect Inventory From a New Source

  • Customer: Beacon
  • Alert: Network inventory drift from new collector source
  • Severity: Medium
  • Business risk: network_visibility_or_exposure_context
  • Account: network-shared-services
  • Actor: inventory-role
  • Source: new-private-collector-address
  • Region: us-east-1
  • Events: DescribeTransitGatewayAttachments, DescribeDirectConnectGatewayAttachments, DescribeVirtualInterfaces
  • Count: repeated read-only calls

Why it differed from baseline

Beacon had heavy network and infrastructure inventory. Read-only Describe* events were common. The drift was the source and tuple: the collector was querying network topology from a source not present in the approved baseline.

This is not the same as a route creation or security group write. It is a medium-priority candidate because the event class is read-only, but the network domain is sensitive.
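
That reasoning can be written as a small severity rule. The sketch is hedged: the `Describe`/`Get`/`List` prefix test is a rough stand-in (CloudTrail records also carry a `readOnly` field, which is the more reliable signal), and the "Low" tier and the non-sensitive write tier are assumptions the post does not state.

```python
READ_PREFIXES = ("Describe", "Get", "List")


def is_read_only(event_name):
    """Rough heuristic based on API naming; prefer CloudTrail's readOnly field."""
    return event_name.startswith(READ_PREFIXES)


def drift_severity(event_name, sensitive_domain):
    """Severity for a tuple that drifted from baseline.

    Read-only drift in a sensitive domain stays Medium; writes in a sensitive
    domain are High. The other two tiers are illustrative assumptions.
    """
    if is_read_only(event_name):
        return "Medium" if sensitive_domain else "Low"
    return "High" if sensitive_domain else "Medium"


print(drift_severity("DescribeTransitGatewayAttachments", sensitive_domain=True))
print(drift_severity("CreateRoute", sensitive_domain=True))
```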

Why the framework worked

The framework did not inflate read-only discovery into a critical exposure event. It still produced a useful alert because new network collectors can indicate:

  • a new monitoring tool,
  • a moved collector,
  • a compromised role performing reconnaissance,
  • or an untracked infrastructure scanner.

Evidence required

  • Collector ownership.
  • Source host or NAT mapping.
  • Change record for monitoring migration.
  • Role session details.
  • Confirmation that no network write APIs followed.

Alert 3: KMS CreateGrant by a Non-Standard Actor

  • Customer: Beacon
  • Alert: KMS grant creation outside normal service tuple
  • Severity: High
  • Business risk: data_access_storage
  • Account: application-prod
  • Actor: unknown-assumed-role
  • Source: rds.amazonaws.com
  • Region: us-east-1
  • Event: CreateGrant
  • Count: 2

Why it differed from baseline

Beacon had heavy KMS usage. Decrypt and GenerateDataKey were normal because encrypted storage, applications, backup, and service credentials depended on KMS.

But KMS grant creation is different from KMS decrypt. CreateGrant changes who can use a key. In this case, the event came from a tuple that was not normal for the account.

Why the framework worked

The framework understood that KMS volume alone should not suppress KMS administration. Read/use operations and grant/key changes are different risk classes.
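
A minimal sketch of that separation: count baseline volume per risk class rather than per service, so heavy Decrypt traffic never makes grant changes look routine. The class names and the exact membership of each set are assumptions; only Decrypt, GenerateDataKey, and CreateGrant are named in the post.

```python
from collections import Counter

# Key-use operations named in the post.
KMS_USE = frozenset({"Decrypt", "GenerateDataKey"})
# Administrative operations. CreateGrant is from the post; the other KMS API
# names are added here as illustrative members of the same class.
KMS_ADMINISTRATION = frozenset(
    {"CreateGrant", "RevokeGrant", "PutKeyPolicy", "ScheduleKeyDeletion"}
)


def risk_class(event_name):
    """Map a KMS event name to a risk class (hypothetical class labels)."""
    if event_name in KMS_ADMINISTRATION:
        return "kms_administration"
    if event_name in KMS_USE:
        return "kms_use"
    return "other"


def volume_by_class(event_names):
    """Baseline volume counted per risk class, not per service."""
    return Counter(risk_class(name) for name in event_names)


# Made-up sample: heavy key use, a single grant change.
sample = ["Decrypt"] * 5 + ["GenerateDataKey"] * 3 + ["CreateGrant"]
volumes = volume_by_class(sample)
print(volumes["kms_use"], volumes["kms_administration"])
```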

Evidence required

  • KMS key ARN.
  • Grantee principal.
  • Grant constraints and allowed operations.
  • RDS or workload context.
  • Baseline comparison for the actor/source tuple.

Alert 4: CloudWatch Log Stream Creation From an Application Service

  • Customer: Beacon
  • Alert: New logging tuple for application service
  • Severity: Medium
  • Business risk: observability_or_operational_change
  • Account: app-prod
  • Actor: application-service-role
  • Source: elasticbeanstalk.amazonaws.com
  • Region: us-east-1
  • Event: CreateLogStream
  • Count: high-volume burst

Why it differed from baseline

Beacon had a logging-heavy baseline. CreateLogStream was not inherently suspicious. But the burst came from a service/account/source tuple that was rare in the approved model.

This is a good example of a candidate deviation rather than an incident. It might be a new application deployment, a new environment, or a restarted service. It could also indicate an unexpected workload creating telemetry in production.

Why the framework worked

The alert was medium severity, not critical. It created a review path without pretending that log stream creation was malicious. The framework captured operational drift with enough context for the analyst to decide whether to promote it into the baseline.
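
Promotion into the baseline could look like the sketch below. It is an assumption about the workflow, not a description of the product: the `promote` function, the approval-record fields, and the rule that override events are never promoted are illustrative, the last one following from the principle that critical overrides stay alertable.

```python
from datetime import datetime, timezone

CRITICAL_OVERRIDES = frozenset({"CreateAccessKey"})  # illustrative only


def promote(baseline, candidate, approver):
    """Record an analyst-approved tuple so the same drift stops alerting.

    `baseline` maps (account, actor, event, source, region) to an approval
    record. Override events are refused: they stay alertable by design.
    """
    event_name = candidate[2]
    if event_name in CRITICAL_OVERRIDES:
        return False
    baseline[candidate] = {
        "approved_by": approver,
        "approved_at": datetime.now(timezone.utc).isoformat(),
    }
    return True


baseline = {}
log_stream_tuple = ("app-prod", "application-service-role", "CreateLogStream",
                    "elasticbeanstalk.amazonaws.com", "us-east-1")
print(promote(baseline, log_stream_tuple, approver="service-owner"))
```

Keeping the approver and timestamp next to the tuple means a later review can tell an observed-and-approved pattern from one that was merely observed.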

Evidence required

  • Application deployment record.
  • Environment name.
  • Log group and stream names.
  • Service owner confirmation.
  • Whether a new workload was created.

Alert 5: Backup or Snapshot Deletion in a Service-Heavy Account

  • Customer: Beacon
  • Alert: Backup or database snapshot deletion tuple drift
  • Severity: High
  • Business risk: data_loss_or_outage
  • Account: app-prod
  • Actor: backup-or-service-role
  • Source: backup.amazonaws.com or private-workload-source
  • Region: us-east-1
  • Events: BackupDeleted, DeleteDBSnapshot, DeleteSnapshot
  • Count: low-volume destructive actions

Why it differed from baseline

Beacon had normal backup and database activity, but destructive backup/snapshot events are not treated as ordinary inventory. The baseline captured that the event was rare and tied it to a high business-risk category.

Why the framework worked

The framework did not let service-heavy operations dilute destructive action. Even if backup automation is normal, deletion of recovery points or snapshots requires confirmation.

Evidence required

  • Backup vault or snapshot identifier.
  • Retention policy.
  • Lifecycle policy.
  • Decommission or retention exception ticket.
  • Restore-point coverage after deletion.

Alert 6: SAML Login or Data-Platform Query From a New Source

  • Customer: Beacon
  • Alert: Data-platform access tuple drift
  • Severity: Medium
  • Business risk: operational_change_risk
  • Account: data-platform-prod
  • Actor: federated-data-operator
  • Source: new-workstation-or-egress
  • Region: us-east-1
  • Events: AssumeRoleWithSAML, ExecuteStatement, GetStatementResult
  • Count: low-volume interactive session

Why it differed from baseline

Beacon had low-volume human access compared with service traffic. SAML-backed access was normal only for known users, accounts, and source patterns. Data-platform APIs were also normal, but usually through known service sessions and sources.

This alert joined those facts: a human or interactive session touched the data platform from a source not present in the baseline.
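
The join can be sketched as a grouping of events by actor and source. Everything here is hypothetical scaffolding: the function name, the dictionary keys, and the simplification that the baseline is a flat set of known sources rather than a per-actor model.

```python
SAML_LOGIN = "AssumeRoleWithSAML"
DATA_PLATFORM_APIS = frozenset({"ExecuteStatement", "GetStatementResult"})


def new_source_data_sessions(events, baseline_sources):
    """Find (actor, source) pairs that logged in via SAML and then touched
    data-platform APIs from a source missing from the baseline."""
    seen = {}
    for e in events:
        seen.setdefault((e["actor"], e["source"]), set()).add(e["event"])
    return sorted(
        key for key, names in seen.items()
        if key[1] not in baseline_sources
        and SAML_LOGIN in names
        and names & DATA_PLATFORM_APIS
    )


# Made-up sample session matching the alert above.
events = [
    {"actor": "federated-data-operator", "source": "new-workstation-or-egress",
     "event": "AssumeRoleWithSAML"},
    {"actor": "federated-data-operator", "source": "new-workstation-or-egress",
     "event": "ExecuteStatement"},
    {"actor": "etl-service", "source": "known-service-source",
     "event": "ExecuteStatement"},
]
print(new_source_data_sessions(events, baseline_sources={"known-service-source"}))
```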

Why the framework worked

The framework did not need a malware signature or impossible-travel model. The source and actor tuple was enough to make the session reviewable.

Evidence required

  • IdP authentication record.
  • MFA status.
  • Source ownership.
  • SQL or statement metadata if available.
  • Data-platform owner confirmation.

Same Framework, Different Customer Meaning

The most important lesson is that the same event name can mean different things for Atlas and Beacon.

| Event | Atlas interpretation | Beacon interpretation |
|---|---|---|
| CreateRole | Common under deployment automation; alert when the account/actor/source tuple is new or human-driven. | More unusual because classic CI/CD was not dominant; new deployment service or role deserves closer review. |
| AttachRolePolicy | Frequent but sensitive; normal only for approved automation tuples. | Less central to the baseline; likely higher signal if observed from a new actor. |
| CreateGrant | Common due to KMS-heavy workloads; alert on actor/source drift. | KMS use is common, but grant creation still stands out from decrypt/data-key usage. |
| CreateRoute | High-signal when performed by SSO or non-network automation. | Would be high-signal too, but most network candidates were read-only inventory drift. |
| CreateLogStream | Usually lower interest unless tied to new deployment or suspicious workload. | Meaningful operational drift because logging and service telemetry are core to the baseline. |
| AssumeRole | Expected in deployment and cross-account operations; tuple matters. | Extremely common, so only new service/account/source combinations are interesting. |
| DeleteDBInstance / snapshot deletion | Rare and high-risk; require change and backup evidence. | Rare and high-risk despite service-heavy operations; require retention and recovery evidence. |

This is where baseline-driven MDR earns its keep. It does not ask analysts to memorize every customer's cloud architecture. It turns each customer's observed operating model into detection context.

What the Alerting Framework Got Right

For Atlas, the framework handled automation without going blind:

  • It recognized that IAM writes were common.
  • It allowed known deployment tuples to be treated differently from human or unknown actors.
  • It still elevated CreateAccessKey, database deletion, route changes, and bucket policy changes.
  • It generated reviewable alerts when deployment automation appeared in a new account or region.

For Beacon, the framework handled service-heavy noise without flattening everything into "normal":

  • It treated high-volume STS, KMS, SSM, monitoring, and inventory as the background operating model.
  • It separated read-only network discovery from network writes.
  • It flagged new collector/source tuples without labeling them as confirmed incidents.
  • It escalated KMS grant changes, backup deletion, snapshot deletion, and unusual data-platform access.

The framework worked because it made alerting relative to customer behavior, while keeping critical outcomes absolute.

A Good Baseline Alert Should Say Three Things

Every baseline-driven alert should answer:

What happened?

The event, actor, account, source, region, resource, and business family.

Why is this different?

The specific baseline comparison: new actor/event pair, new source for a known actor, new account, new region, new service, or override class.

What evidence closes it?

The change ticket, owner attestation, raw CloudTrail event, resource diff, policy document, snapshot state, MFA record, or service-enablement record.

Without those three parts, the alert is just another anomaly notification. With them, it becomes an MDR investigation starting point.
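
As a data structure, the three parts might look like the sketch below. The `BaselineAlert` class and its field names are hypothetical; the sample values are taken from Atlas Alert 1.

```python
from dataclasses import dataclass, field


@dataclass
class BaselineAlert:
    what_happened: dict = field(default_factory=dict)     # event, actor, account, ...
    why_different: list = field(default_factory=list)     # baseline comparisons
    closing_evidence: list = field(default_factory=list)  # what closes the alert

    def missing_parts(self):
        """Names of the three parts that are still empty."""
        return [
            name
            for name in ("what_happened", "why_different", "closing_evidence")
            if not getattr(self, name)
        ]


alert = BaselineAlert(
    what_happened={"event": "CreateAccessKey", "actor": "named-admin-session",
                   "account": "shared-tools-prod", "region": "us-east-1"},
    why_different=["critical override", "new actor/source/event tuple"],
    closing_evidence=["raw CloudTrail event", "rotation job, ticket, or emergency approval"],
)
print(alert.missing_parts())
```

A check like `missing_parts` could gate delivery, so that an alert lacking any of the three parts is held back rather than sent as a bare anomaly notification.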
