Business baseline-based alerting with Transilience MDR skills

Baseline-driven MDR is easiest to understand through alerts. This is a companion post to the announcement of Transilience MDR Rainer, focused on what the baseline framework produces in practice.

The alerting framework is the same for every customer:

Build a business-as-usual model from normalized CloudTrail observations.
Compare new events against account, actor, event, source, region, and business-family tuples.
Keep critical override detections alertable even if similar activity appeared before.
Explain why the event differs from baseline.
Require the right evidence for triage.
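
The first four steps above can be sketched in a few lines. This is a minimal illustration, not the Transilience implementation: the `Observation` fields mirror the tuple dimensions listed above, while `CRITICAL_OVERRIDES`, `build_baseline`, and `evaluate` are hypothetical names, and exact-tuple matching is an assumed simplification.

```python
from collections import Counter
from typing import NamedTuple


class Observation(NamedTuple):
    """One normalized CloudTrail observation (field names are assumptions)."""
    account: str
    actor: str
    event: str
    source: str
    region: str
    family: str


# Assumed override list; this post names CreateAccessKey as a critical override.
CRITICAL_OVERRIDES = {"CreateAccessKey"}


def build_baseline(observations):
    """Count each full tuple seen during the baseline window."""
    return Counter(observations)


def evaluate(obs, baseline):
    """Return (alertable, reasons) for a new observation."""
    reasons = []
    if obs.event in CRITICAL_OVERRIDES:
        reasons.append(f"{obs.event} is a critical override")
    if baseline[obs] == 0:  # Counter returns 0 for unseen tuples
        reasons.append("tuple not present in baseline")
    return bool(reasons), reasons


known = Observation("app-prod", "deploy-role", "CreateRole",
                    "cloudformation.amazonaws.com", "us-east-1", "cicd")
baseline = build_baseline([known])
print(evaluate(known, baseline))
print(evaluate(known._replace(region="eu-west-2"), baseline))
```

In this sketch the reasons list doubles as the explanation of why the event differs from baseline. Evidence requirements are a separate concern and are not modeled here.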

The interesting part is that the alerts do not look the same across customers. They should not. Two companies can both run AWS, both use SSO, both rely on KMS, and both generate CloudTrail events, but their operating models can be completely different.

This post uses two anonymized customer examples with dummy names:

Customer Atlas: a large, automation-heavy cloud estate.
Customer Beacon: a smaller, service-heavy environment dominated by inventory, KMS, SSM, monitoring, and data-platform activity.

The examples below are representative of the real baseline patterns. Customer names, account IDs, user names, domains, exact IPs, and customer-specific role names have been replaced.

Customer Atlas: Automation-Heavy Cloud Operations

Customer Atlas had a broad AWS footprint: 42,258 CloudTrail observations across 78 accounts in the baseline window. The normal operating model was dominated by deployment and identity activity.

Baseline family	Share	Interpretation
CI/CD and deployment automation	46.1%	Infrastructure-as-code and service automation routinely created roles, attached policies, launched instances, and terminated instances.
Human, SSO, and console access	17.6%	Console and SSO usage existed, but it was not the main source of control-plane activity.
Identity and privilege management	15.2%	IAM writes were common, mostly through automation and service roles.
Network and perimeter changes	6.7%	Security group changes were frequent and often tied to known deployment controllers.
Storage, data, and KMS	6.4%	KMS grants and storage control-plane changes were routine but sensitive.
Security visibility	2.7%	CloudTrail and Config changes were rare enough to keep as strong overrides.
Database operations	0.6%	Database changes were low-volume and high-signal.

For Atlas, the framework had to avoid noisy rules like "alert on every CreateRole." Role creation was normal. The better question was: did this role creation happen through the right automation, in the right account, from the right source, in the right region?

Alert 1: Static Access Key Creation by a Human-Looking Actor

Customer: Atlas
Alert: CreateAccessKey outside designated rotation
Severity: Critical
Business risk: credential_or_privilege_risk
Account: shared-tools-prod
Actor: named-admin-session
Source: new unmanaged public IP
Region: us-east-1
Event: CreateAccessKey
Count: 1

Why it differed from baseline

Atlas had plenty of IAM activity, but that activity was normally produced by deployment automation, SSO service roles, or account-management workflows. Static access-key creation by a human-looking actor was not part of the normal tuple for this account and source.

The alert fired for two reasons:

CreateAccessKey is a critical override. It remains alertable even if access-key creation appeared somewhere in history.
The actor/source/event tuple was new for the account.

Why the framework worked

A naive "IAM activity is common here" suppression would have hidden this event. A naive "all IAM changes are critical" rule would have buried the team in deployment noise. The baseline framework split the difference: deployment IAM writes could be treated as normal only when the exact tuple matched; static credential creation stayed critical.

Evidence required

Raw CloudTrail event.
Target IAM user.
Caller identity and session issuer.
Source ownership.
Rotation job, ticket, or emergency approval.
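
One way to make an evidence list like this enforceable is to treat it as a checklist the alert must satisfy before triage. The sketch below assumes that approach; the snake_case keys and the function names are hypothetical labels for the five items above, not part of any Transilience API.

```python
# Required evidence for the CreateAccessKey alert, taken from the list above.
# The snake_case keys are hypothetical labels chosen for this sketch.
REQUIRED_EVIDENCE = {
    "CreateAccessKey": [
        "raw_cloudtrail_event",
        "target_iam_user",
        "caller_identity_and_session_issuer",
        "source_ownership",
        "rotation_job_ticket_or_emergency_approval",
    ],
}


def missing_evidence(alert_event, collected):
    """Return required evidence items that have not been collected yet.

    Events without a checklist have no requirements in this sketch.
    """
    required = REQUIRED_EVIDENCE.get(alert_event, [])
    return [item for item in required if item not in collected]


def ready_for_triage(alert_event, collected):
    """An alert is triage-ready only when nothing required is missing."""
    return not missing_evidence(alert_event, collected)


collected = {"raw_cloudtrail_event", "target_iam_user"}
print(missing_evidence("CreateAccessKey", collected))
```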

Alert 2: Deployment Role Creates IAM Resources in a New Account

Customer: Atlas
Alert: Deployment IAM write in a new account tuple
Severity: High
Business risk: credential_or_privilege_risk
Account: newly-onboarded-app-prod
Actor: stack-management-exec-role
Source: cloudformation.amazonaws.com
Region: us-east-1
Events: CreateRole, AttachRolePolicy, PutRolePolicy
Count: multiple events in one deployment burst

Why it differed from baseline

This is the sort of alert that makes baseline-driven detection useful. CreateRole, AttachRolePolicy, and PutRolePolicy were all normal in Atlas. The dominant baseline family was CI/CD and deployment automation. But this exact account/actor/event/source tuple was new.

That does not automatically mean compromise. It could be a legitimate onboarding of a new account into StackSets or infrastructure-as-code. But it is exactly the kind of change MDR should see.

Why the framework worked

The alert did not say "CloudFormation is bad." It said:

this CloudFormation-style actor normally performs these events elsewhere,
the events are privilege-bearing,
the account tuple is new,
and the baseline has not yet approved this path.

That gives the analyst a focused question: was this account intentionally onboarded into the deployment framework?
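
The explanation above can be derived mechanically by checking which parts of the tuple are familiar and which are new. The following is a rough sketch under that assumption; `Obs`, `explain_drift`, and the baseline account name `platform-prod` are hypothetical, and privilege classification is left out.

```python
from typing import NamedTuple


class Obs(NamedTuple):
    account: str
    actor: str
    event: str
    source: str


def explain_drift(obs, baseline):
    """Return human-readable reasons why obs differs from the baseline set."""
    if obs in baseline:
        return []
    reasons = []
    # Same actor and event, but recorded in a different account.
    if any((b.actor, b.event) == (obs.actor, obs.event) and b.account != obs.account
           for b in baseline):
        reasons.append("actor normally performs this event elsewhere")
    # The actor has never been seen in this account at all.
    if not any((b.account, b.actor) == (obs.account, obs.actor) for b in baseline):
        reasons.append("account/actor pairing is new")
    reasons.append("full tuple is not in the approved baseline")
    return reasons


baseline = {Obs("platform-prod", "stack-management-exec-role",
                "CreateRole", "cloudformation.amazonaws.com")}
new = Obs("newly-onboarded-app-prod", "stack-management-exec-role",
          "CreateRole", "cloudformation.amazonaws.com")
for reason in explain_drift(new, baseline):
    print(reason)
```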

Evidence required

Change request for account onboarding.
Stack or StackSet operation metadata.
Policies attached.
Target roles created.
Confirmation from platform engineering.

Alert 3: Route Table Change by an SSO Administrator

Customer: Atlas
Alert: Network exposure change outside normal automation
Severity: High
Business risk: internet_exposure
Account: content-platform-prod
Actor: sso-admin-role
Source: corporate-egress-new
Region: ap-southeast-1
Event: CreateRoute
Count: 1

Why it differed from baseline

Atlas had frequent network changes, but the normal network-change pattern was tied to known automation: container load-balancer controllers, deployment roles, or service-managed infrastructure. A route change from an SSO administrator was different.

The event was alertable because:

route-table writes can alter exposure,
the actor was outside the normal network automation class,
the source was new for the actor/event tuple,
and the event carried internet-exposure risk.

Why the framework worked

The same baseline that suppressed routine controller-driven security group churn still caught a human-driven route change. That is the desired behavior. High-volume network operations do not become a blanket allowlist.
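
A small sketch of why volume does not turn into a blanket allowlist: suppression is keyed on the actor class and the event together, not on the event family. The role names other than `sso-admin-role`, the class labels, and the routine set are all assumptions made for illustration.

```python
# Hypothetical mapping from actor names to actor classes.
ACTOR_CLASS = {
    "lb-controller-role": "network_automation",
    "deployment-role": "network_automation",
    "sso-admin-role": "human_sso",
}

# Assumed baseline: (actor class, event) pairs treated as routine.
ROUTINE_NETWORK = {
    ("network_automation", "AuthorizeSecurityGroupIngress"),
    ("network_automation", "RevokeSecurityGroupIngress"),
}


def network_change_is_alertable(actor, event):
    """Alert unless this actor class routinely performs this exact event."""
    actor_class = ACTOR_CLASS.get(actor, "unknown")
    return (actor_class, event) not in ROUTINE_NETWORK


print(network_change_is_alertable("lb-controller-role", "AuthorizeSecurityGroupIngress"))
print(network_change_is_alertable("sso-admin-role", "CreateRoute"))
```

Note that an unmapped actor falls into the `unknown` class and is therefore always alertable in this sketch.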

Evidence required

Route table diff.
Destination CIDR and target gateway/attachment.
Change ticket.
Actor owner confirmation.
Exposure assessment.

Alert 4: KMS Grant Creation by an Unknown or Non-Standard Actor

Customer: Atlas
Alert: KMS grant change by unusual actor
Severity: High
Business risk: credential_or_privilege_risk
Account: regional-workload-prod
Actor: unknown-or-nonstandard-actor
Source: ec2-frontend-api.amazonaws.com
Region: eu-west-2
Event: CreateGrant
Count: repeated over several days

Why it differed from baseline

KMS CreateGrant was not rare in Atlas. The baseline showed thousands of KMS grant operations. That is normal in encrypted cloud environments.

The anomaly was not the API name. The anomaly was the actor and source context. The event looked like deployment or service activity, but the actor did not match the normal deployment-role pattern for that account.

Why the framework worked

A static KMS rule would either fire constantly or be suppressed globally. The baseline rule fired only when the KMS actor/source tuple drifted from the approved model.

Evidence required

KMS key ARN.
Grantee principal.
Grant operations.
Calling service.
Workload owner attestation.

Alert 5: Destructive Database Control-Plane Action

Customer: Atlas
Alert: DeleteDBInstance outside expected database workflow
Severity: High to Critical
Business risk: data_loss_or_outage
Account: application-prod
Actor: database-service-or-admin-role
Source: rds.application-autoscaling.amazonaws.com or corporate-egress
Region: workload-region
Event: DeleteDBInstance
Count: 1

Why it differed from baseline

Database control-plane activity was low-volume in Atlas. That made destructive database events high-signal even when the actor was a service role.

The framework treated the event as alertable because:

database deletion can create outage or data-loss risk,
the exact actor/event/source tuple was rare or new,
and destructive database actions require evidence even when they are planned.

Why the framework worked

The event did not disappear under "service role" logic. Service roles are normal only inside the expected account, event, source, and workload context.

Evidence required

DB identifier.
Final snapshot setting.
Backup status.
Decommission ticket.
Workload owner approval.

Customer Beacon: Service-Heavy and Inventory-Heavy Operations

Customer Beacon had a smaller AWS footprint by account count: 357,233 deduplicated CloudTrail events across 6 accounts in the broader baseline, with a 60,000-observation MDR run producing 1,683 candidate deviations.

Beacon's baseline looked nothing like Atlas.

Baseline category	Observations	Interpretation
Compute and network control plane	84,216	Mostly EC2, VPC, load balancer, and inventory-style control-plane activity.
Identity and session	63,667	Heavy STS role assumption and service-to-service access.
Governance and monitoring	52,505	Trusted Advisor, Service Quotas, Config, and monitoring checks.
Systems management	46,821	SSM managed-instance inventory and status updates.
Encryption and KMS	33,493	KMS decrypt and data-key operations in the normal service path.
Logging and observability	14,062	CloudWatch Logs and application/service telemetry.
Database	11,670	Mostly database API and data-platform control-plane activity.
Data lake / ETL	6,034	Glue, Redshift Data API, and related service workflows.
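
The Atlas table reports shares while the Beacon table reports raw observation counts. To compare the two, the Beacon counts can be converted to shares. The eight listed categories sum to 312,468 observations, which is less than the 357,233 total, so the shares below are relative to the listed categories only. The variable names are illustrative.

```python
# Observation counts copied from the Beacon baseline table above.
beacon_observations = {
    "Compute and network control plane": 84_216,
    "Identity and session": 63_667,
    "Governance and monitoring": 52_505,
    "Systems management": 46_821,
    "Encryption and KMS": 33_493,
    "Logging and observability": 14_062,
    "Database": 11_670,
    "Data lake / ETL": 6_034,
}

# Total across the listed categories only, not the full deduplicated event count.
listed_total = sum(beacon_observations.values())

# Percentage share of each category within the listed categories.
shares = {
    category: 100 * count / listed_total
    for category, count in beacon_observations.items()
}

for category, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{category}: {share:.1f}%")
```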

Beacon did not show the same high-volume classic CI/CD signature as Atlas. Routine behavior was dominated by STS, KMS, SSM, monitoring, CloudWatch Logs, data-platform services, and low-volume SAML access.

That changed the alert interpretation. A new deployment service could be more interesting at Beacon than at Atlas. A read-heavy Direct Connect or Transit Gateway discovery pattern might be a medium-priority candidate, not a critical incident. A new KMS grant writer or destructive snapshot event still deserved escalation.

Alert 1: New Monitoring or Optimization Service Assumes a Role

Customer: Beacon
Alert: New service role assumption tuple
Severity: Medium
Business risk: operational_change_risk
Account: app-prod
Actor: AWS service principal
Source: compute-optimizer.amazonaws.com
Region: us-east-1
Event: AssumeRole
Count: burst of role assumptions

Why it differed from baseline

Beacon had a lot of AssumeRole activity. In the broader baseline, AssumeRole was the top event. The alert did not fire because role assumption was suspicious by itself.

It fired because the service/account/role tuple was new. Beacon's baseline had known service roles for data, monitoring, and operations. This service principal was not part of the approved model for that account.

Why the framework worked

In Atlas, a new deployment automation role might be expected during account onboarding. In Beacon, a new service integration was more significant because the baseline was stable and service-heavy. The same tuple-drift logic worked, but the triage question changed: "Was this AWS service intentionally enabled?"

Evidence required

Service enablement record.
Role trust policy.
Permissions granted.
Account owner confirmation.
First-seen timestamp.

Alert 2: Transit and Direct Connect Inventory From a New Source

Customer: Beacon
Alert: Network inventory drift from new collector source
Severity: Medium
Business risk: network_visibility_or_exposure_context
Account: network-shared-services
Actor: inventory-role
Source: new-private-collector-address
Region: us-east-1
Events: DescribeTransitGatewayAttachments, DescribeDirectConnectGatewayAttachments, DescribeVirtualInterfaces
Count: repeated read-only calls

Why it differed from baseline

Beacon had heavy network and infrastructure inventory. Read-only Describe* events were common. The drift was the source and tuple: the collector was querying network topology from a source not present in the approved baseline.

This is not the same as a route creation or security group write. It is a medium-priority candidate because the event class is read-only, but the network domain is sensitive.

Why the framework worked

The framework did not inflate read-only discovery into a critical exposure event. It still produced a useful alert because new network collectors can indicate:

a new monitoring tool,
a moved collector,
a compromised role performing reconnaissance,
or an untracked infrastructure scanner.
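
The severity reasoning here can be sketched as a small matrix of event class against domain sensitivity. The prefix heuristic, the `SENSITIVE_DOMAINS` set, and the Low tier are assumptions for illustration; a real pipeline would more likely rely on the `readOnly` field in the CloudTrail record than on event-name prefixes.

```python
# Assumed heuristic: treat these event-name prefixes as read-only.
READ_ONLY_PREFIXES = ("Describe", "Get", "List")

# Assumed set of domains where even read-only drift deserves review.
SENSITIVE_DOMAINS = {"network", "kms", "iam"}


def is_read_only(event):
    """True when the event name starts with a read-only prefix."""
    return event.startswith(READ_ONLY_PREFIXES)


def drift_severity(event, domain):
    """Severity for a tuple-drift candidate, before any critical override."""
    sensitive = domain in SENSITIVE_DOMAINS
    if is_read_only(event):
        return "Medium" if sensitive else "Low"
    return "High" if sensitive else "Medium"


print(drift_severity("DescribeTransitGatewayAttachments", "network"))
print(drift_severity("CreateRoute", "network"))
```

Under these assumptions, read-only topology discovery lands at Medium while a route write in the same domain lands at High, which matches the severities used in this post.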

Evidence required

Collector ownership.
Source host or NAT mapping.
Change record for monitoring migration.
Role session details.
Confirmation that no network write APIs followed.

Alert 3: KMS CreateGrant by a Non-Standard Actor

Customer: Beacon
Alert: KMS grant creation outside normal service tuple
Severity: High
Business risk: data_access_storage
Account: application-prod
Actor: unknown-assumed-role
Source: rds.amazonaws.com
Region: us-east-1
Event: CreateGrant
Count: 2

Why it differed from baseline

Beacon had heavy KMS usage. Decrypt and GenerateDataKey were normal because encrypted storage, applications, backup, and service credentials depended on KMS.

But KMS grant creation is different from KMS decrypt. CreateGrant changes who can use a key. In this case, the event came from a tuple that was not normal for the account.

Why the framework worked

The framework understood that KMS volume alone should not suppress KMS administration. Read/use operations and grant/key changes are different risk classes.
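
As a sketch of that split, KMS events can be grouped into a use class and an administration class, with a stricter match required for administration. The grouping, the function names, and the idea of matching use operations on the actor alone are assumptions made for this example, not the framework's actual rule.

```python
# KMS operations that use a key.
KMS_USE = {"Decrypt", "Encrypt", "GenerateDataKey"}

# KMS operations that change who can use a key or whether it exists.
KMS_ADMIN = {"CreateGrant", "RevokeGrant", "PutKeyPolicy", "ScheduleKeyDeletion"}


def kms_risk_class(event):
    """Classify a KMS event as 'administration', 'use', or 'other'."""
    if event in KMS_ADMIN:
        return "administration"
    if event in KMS_USE:
        return "use"
    return "other"


def kms_alertable(event, actor, source, known_use_actors, approved_admin_tuples):
    """Use operations match on the actor; everything else needs an exact tuple."""
    if kms_risk_class(event) == "use":
        return actor not in known_use_actors
    # Administration and unclassified KMS events need an exact approved tuple.
    return (actor, event, source) not in approved_admin_tuples


known_use_actors = {"app-role"}
approved_admin_tuples = set()
print(kms_alertable("Decrypt", "app-role", "rds.amazonaws.com",
                    known_use_actors, approved_admin_tuples))
print(kms_alertable("CreateGrant", "app-role", "rds.amazonaws.com",
                    known_use_actors, approved_admin_tuples))
```

The second call is alertable even though the same actor decrypts constantly, because heavy use never approves an administration tuple.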

Evidence required

KMS key ARN.
Grantee principal.
Grant constraints and allowed operations.
RDS or workload context.
Baseline comparison for the actor/source tuple.

Alert 4: CloudWatch Log Stream Creation From an Application Service

Customer: Beacon
Alert: New logging tuple for application service
Severity: Medium
Business risk: observability_or_operational_change
Account: app-prod
Actor: application-service-role
Source: elasticbeanstalk.amazonaws.com
Region: us-east-1
Event: CreateLogStream
Count: high-volume burst

Why it differed from baseline

Beacon had a logging-heavy baseline. CreateLogStream was not inherently suspicious. But the burst came from a service/account/source tuple that was rare in the approved model.

This is a good example of a candidate deviation rather than an incident. It might be a new application deployment, a new environment, or a restarted service. It could also indicate an unexpected workload creating telemetry in production.

Why the framework worked

The alert was medium severity, not critical. It created a review path without pretending that log stream creation was malicious. The framework captured operational drift with enough context for the analyst to decide whether to promote it into the baseline.
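
The review path can be pictured as a queue of candidate tuples that an analyst either promotes into the approved baseline or dismisses. This is a hypothetical sketch: `CandidateQueue` and its methods are invented for illustration, and a real workflow would also record who approved the tuple and why.

```python
from dataclasses import dataclass, field


@dataclass
class CandidateQueue:
    """Tracks approved tuples and candidates awaiting analyst review."""
    approved: set = field(default_factory=set)
    pending: list = field(default_factory=list)

    def observe(self, tuple_key):
        """Return True if the tuple is an unapproved candidate, queuing it once."""
        if tuple_key in self.approved:
            return False
        if tuple_key not in self.pending:
            self.pending.append(tuple_key)
        return True

    def resolve(self, tuple_key, promote):
        """Close a pending candidate; promote=True adds it to the baseline.

        Raises ValueError if the tuple is not pending.
        """
        self.pending.remove(tuple_key)
        if promote:
            self.approved.add(tuple_key)


queue = CandidateQueue()
key = ("app-prod", "application-service-role", "CreateLogStream",
       "elasticbeanstalk.amazonaws.com")
print(queue.observe(key))          # candidate on first sight
queue.resolve(key, promote=True)   # analyst confirms a new deployment
print(queue.observe(key))          # now part of the approved baseline
```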

Evidence required

Application deployment record.
Environment name.
Log group and stream names.
Service owner confirmation.
Whether a new workload was created.

Alert 5: Backup or Snapshot Deletion in a Service-Heavy Account

Customer: Beacon
Alert: Backup or database snapshot deletion tuple drift
Severity: High
Business risk: data_loss_or_outage
Account: app-prod
Actor: backup-or-service-role
Source: backup.amazonaws.com or private-workload-source
Region: us-east-1
Events: BackupDeleted, DeleteDBSnapshot, DeleteSnapshot
Count: low-volume destructive actions

Why it differed from baseline

Beacon had normal backup and database activity, but destructive backup/snapshot events are not treated as ordinary inventory. The baseline captured that the event was rare and tied it to a high business-risk category.

Why the framework worked

The framework did not let service-heavy operations dilute destructive action. Even if backup automation is normal, deletion of recovery points or snapshots requires confirmation.

Evidence required

Backup vault or snapshot identifier.
Retention policy.
Lifecycle policy.
Decommission or retention exception ticket.
Restore-point coverage after deletion.

Alert 6: SAML-Backed Data-Platform Access From a New Source

Customer: Beacon
Alert: Data-platform access tuple drift
Severity: Medium
Business risk: operational_change_risk
Account: data-platform-prod
Actor: federated-data-operator
Source: new-workstation-or-egress
Region: us-east-1
Events: AssumeRoleWithSAML, ExecuteStatement, GetStatementResult
Count: low-volume interactive session

Why it differed from baseline

Beacon had low-volume human access compared with service traffic. SAML-backed access was normal only for known users, accounts, and source patterns. Data-platform APIs were also normal, but usually through known service sessions and sources.

This alert joined those facts: a human or interactive session touched the data platform from a source not present in the baseline.
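
Joining those facts could look like the sketch below. It assumes events have already been grouped under a hypothetical `session` key (in raw CloudTrail that linkage has to be reconstructed from the issued credentials), and the `DATA_PLATFORM_EVENTS` set and function name are illustrative.

```python
# Data-platform events named in this alert.
DATA_PLATFORM_EVENTS = {"ExecuteStatement", "GetStatementResult"}


def reviewable_sessions(events, known_sources):
    """Return sessions that are federated, touch the data platform,
    and include at least one source missing from the baseline."""
    by_session = {}
    for e in events:
        by_session.setdefault(e["session"], []).append(e)

    flagged = []
    for session, session_events in by_session.items():
        names = {e["event"] for e in session_events}
        sources = {e["source"] for e in session_events}
        federated = "AssumeRoleWithSAML" in names
        touches_data = bool(names & DATA_PLATFORM_EVENTS)
        new_source = bool(sources - known_sources)
        if federated and touches_data and new_source:
            flagged.append(session)
    return flagged


events = [
    {"session": "s1", "event": "AssumeRoleWithSAML", "source": "new-workstation-or-egress"},
    {"session": "s1", "event": "ExecuteStatement", "source": "new-workstation-or-egress"},
    {"session": "s2", "event": "AssumeRoleWithSAML", "source": "corporate-egress"},
    {"session": "s2", "event": "ExecuteStatement", "source": "corporate-egress"},
]
print(reviewable_sessions(events, known_sources={"corporate-egress"}))
```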

Why the framework worked

The framework did not need a malware signature or impossible-travel model. The source and actor tuple was enough to make the session reviewable.

Evidence required

IdP authentication record.
MFA status.
Source ownership.
SQL or statement metadata if available.
Data-platform owner confirmation.

Same Framework, Different Customer Meaning

The most important lesson is that the same event name can mean different things for Atlas and Beacon.

Event	Atlas interpretation	Beacon interpretation
CreateRole	Common under deployment automation; alert when the account/actor/source tuple is new or human-driven.	More unusual because classic CI/CD was not dominant; new deployment service or role deserves closer review.
AttachRolePolicy	Frequent but sensitive; normal only for approved automation tuples.	Less central to the baseline; likely higher signal if observed from a new actor.
CreateGrant	Common due to KMS-heavy workloads; alert on actor/source drift.	KMS use is common, but grant creation still stands out from decrypt/data-key usage.
CreateRoute	High-signal when performed by SSO or non-network automation.	Would be high-signal too, but most network candidates were read-only inventory drift.
CreateLogStream	Usually lower interest unless tied to new deployment or suspicious workload.	Meaningful operational drift because logging and service telemetry are core to the baseline.
AssumeRole	Expected in deployment and cross-account operations; tuple matters.	Extremely common, so only new service/account/source combinations are interesting.
DeleteDBInstance / snapshot deletion	Rare and high-risk; require change and backup evidence.	Rare and high-risk despite service-heavy operations; require retention and recovery evidence.

This is where baseline-driven MDR earns its keep. It does not ask analysts to memorize every customer's cloud architecture. It turns each customer's observed operating model into detection context.

What the Alerting Framework Got Right

For Atlas, the framework handled automation without going blind:

It recognized that IAM writes were common.
It allowed known deployment tuples to be treated differently from human or unknown actors.
It still elevated CreateAccessKey, database deletion, route changes, and bucket policy changes.
It generated reviewable alerts when deployment automation appeared in a new account or region.

For Beacon, the framework handled service-heavy noise without flattening everything into "normal":

It treated high-volume STS, KMS, SSM, monitoring, and inventory as the background operating model.
It separated read-only network discovery from network writes.
It flagged new collector/source tuples without labeling them as confirmed incidents.
It escalated KMS grant changes, backup deletion, snapshot deletion, and unusual data-platform access.

The framework worked because it made alerting relative to customer behavior, while keeping critical outcomes absolute.
