LLMs and the Baseline Problem in Cloud MDR

Cloud detection has a baseline problem.

For anyone who has operated cloud environments for a long time, this is obvious from daily work but surprisingly hard to encode in rules. There is no single normal pattern in cloud. There is not even one normal pattern inside a single enterprise.

One application team deploys with Terraform from a build host. Another uses CloudFormation StackSets. Another still has Jenkins or TeamCity roles starting instances, creating grants, or attaching policies. Some administrative work comes through SSO. Some comes from service roles. Some comes from bastions. Some comes from containers started near a database for a one-off operational workflow. Some activity that looks dangerous in isolation, such as creating roles, deleting roles, updating security groups, creating KMS grants, or even deleting database instances, may be part of how a customer has historically operated.

That does not make it safe. It means the detection decision cannot stop at the event name.

This is the problem we have been working on at Transilience AI, and it is the reason we will open source the first pieces of our cloud MDR detection work in the next couple of days.

The central idea is simple: cloud alerts should be judged against how the business actually operates, not only against a generic heuristic about whether an API call is scary.

Why cloud has no universal normal

Traditional detection logic often assumes there is a stable baseline:

This event is rare.
This IP is unusual.
This actor performed a sensitive action.
This count crossed a threshold.
This source country is different.

Those signals are useful. They are not enough.

In cloud, business operations create high-variance control-plane activity. A mature customer environment often has multiple generations of operating models layered on top of each other:

acquired business units with their own AWS accounts,
old IAM users next to newer SSO roles,
Jenkins, TeamCity, Terraform, CloudFormation, CDK, and StackSets all coexisting,
security tools changing Config, SecurityHub, GuardDuty, or IAM resources,
EKS controllers mutating security groups as part of service exposure,
bastion hosts or build hosts acting as operational entry points,
service-linked roles performing lifecycle work,
periodic start/stop automation,
backup, patching, and diagnostic workflows,
account-specific practices that nobody would design from scratch but that are real.

If a detection rule says eventName = AttachRolePolicy, it will fire constantly in environments where StackSets and deployment automation legitimately manage IAM roles. If a rule says CreateGrant is suspicious, it will collide with Auto Scaling, RDS, EKS, Lambda, and encryption-heavy application workflows. If a rule says DeleteDBInstance is always malicious, it will miss the distinction between a known devops build role refreshing a non-production database and an SSO admin deleting a database from a new source.

The same event can be normal, suspicious, or critical depending on the tuple around it:

account,
actor,
event name,
source IP or AWS service source,
region,
target resource,
business workload,
actor archetype,
historical recurrence,
residual security impact.

The unit of normality is not the event. The unit of normality is the business operation.
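A minimal Python sketch of what tuple-scoped recurrence can look like. The `BaselineKey` type, its field names, and the sample values are assumptions made for illustration, not the repository's schema. The same event name ends up with very different history depending on the rest of the tuple.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class BaselineKey:
    """Hypothetical tuple key; field names are illustrative only."""
    account: str
    actor: str
    event_name: str
    source: str  # source IP or AWS service source
    region: str


def build_counts(keys):
    """Count how often each full tuple appeared in the baseline window."""
    return Counter(keys)


# Twelve identical historical observations from a build role (sample data).
known = BaselineKey("000000000002", "build-server-role", "DeleteDBInstance",
                    "203.0.113.10", "us-east-1")
counts = build_counts([known] * 12)

# Same account and event name, but a different actor and source.
new_actor = BaselineKey("000000000002", "sso-admin", "DeleteDBInstance",
                        "198.51.100.25", "us-east-1")

print(counts[known])      # 12
print(counts[new_actor])  # 0
```

Keying on the whole tuple rather than the event name is what lets a recurring build-role deletion and a first-seen SSO-admin deletion be told apart.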

What we learned from real customer baselines

In one 90-day CloudTrail baseline, correlated with AWS inventory, we analyzed 42,258 CloudTrail observations across 78 accounts. The top normal events included:

AttachRolePolicy
CreateRole
ConsoleLogin
CreateGrant
DeleteRole
RunInstances
TerminateInstances
PutRolePolicy
AuthorizeSecurityGroupIngress
RevokeSecurityGroupIngress
StartInstances
StopInstances

Viewed as raw event names, that list is uncomfortable. It contains IAM privilege changes, console logins, KMS grants, compute lifecycle changes, and network exposure changes. A generic cloud detection pack would produce a large number of alerts.

But once we grouped by business operation, the picture became clearer.

Normal actors included CloudFormation StackSets execution roles, Auto Scaling service roles, SSO service roles, SecurityHub service roles, EKS load balancer controller roles, SSM QuickSetup host-management roles, AWS Backup roles, TeamCity, Jenkins/build roles, and known devops roles. Normal source services included cloudformation.amazonaws.com, ssm.amazonaws.com, autoscaling.amazonaws.com, sso.amazonaws.com, securityhub.amazonaws.com, and backup.amazonaws.com.

That does not mean those events should be suppressed. It means they need interpretation.

For example:

Role and policy churn was normal when it mapped to StackSets, CloudFormation, CDK, deployment, or control-plane automation already observed in the same account.
EC2 lifecycle events were normal where compute, Auto Scaling, or deployment footprints existed, but only when source roles and regions stayed stable.
KMS CreateGrant was normal for repeated service or deployment actors, but unexpected actors still mattered.
S3 control-plane activity was expected in storage-heavy accounts, but bucket policy changes, public-access changes, and bucket deletion remained high risk.
IAM users and access keys existed in some accounts, so identity events were plausible, but CreateAccessKey still required strict actor and source baselining.

This is the key distinction: business-normal does not mean security-benign.

Business-normal is not the same as acceptable risk

One of the most important lessons from customer baseline work was separating two questions:

Does this match how the customer normally operates?
Is the residual security risk acceptable?

Those are different questions.

A bastion-host workflow that rotates Bitbucket pipeline keys every month may be business-normal. The cadence, user agent, source IP, target user, and event sequence may all match a sanctioned process. But it still leaves a design risk: a bastion role has permission to create long-lived AWS access keys. If that bastion is compromised, an attacker can mint CI/CD credentials.

A named SSO administrator updating account password policies across many accounts may be business-normal hardening. But if the change is hand-run from CloudShell instead of expressed as infrastructure as code, there is still change-management risk.

A build role deleting databases may be normal in a non-production refresh pipeline. But a human SSO administrator deleting a database from a new source is a different operation, even when the event name is identical.

A root login with MFA may be legitimate break-glass. It still requires a ticket, approval, source validation, and post-login review.

This distinction changes alert handling. Instead of asking, "Is DeleteDBInstance bad?", we ask:

Is this DeleteDBInstance normal for this account?
Is this actor the expected actor?
Is this source IP or AWS service source expected?
Is this region expected?
Is this the same workload lane?
Is this a destructive event that remains alertable even when normal?
What verification is required before closing it?

The output is not just a binary verdict. It is a triage position: normal and low risk, normal but high residual risk, abnormal and high risk, or unknown and needs evidence.
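The four positions can be sketched as a small Python enum plus a mapping function. The names are hypothetical, and one branch is an assumption on our part: the text does not say where an abnormal event with no elevated residual risk belongs, so this sketch sends it to "unknown and needs evidence".

```python
from enum import Enum


class TriagePosition(Enum):
    NORMAL_LOW_RISK = "normal and low risk"
    NORMAL_HIGH_RESIDUAL_RISK = "normal but high residual risk"
    ABNORMAL_HIGH_RISK = "abnormal and high risk"
    UNKNOWN_NEEDS_EVIDENCE = "unknown and needs evidence"


def triage_position(matches_baseline, high_residual_risk):
    """Map two separate answers onto one triage position.

    matches_baseline: True, False, or None when there is not enough
    evidence to say how the customer normally operates.
    high_residual_risk: True when the event stays reviewable even if normal.
    """
    if matches_baseline is None:
        return TriagePosition.UNKNOWN_NEEDS_EVIDENCE
    if matches_baseline:
        if high_residual_risk:
            return TriagePosition.NORMAL_HIGH_RESIDUAL_RISK
        return TriagePosition.NORMAL_LOW_RISK
    if high_residual_risk:
        return TriagePosition.ABNORMAL_HIGH_RISK
    # Assumption: abnormal but not obviously risky still needs evidence.
    return TriagePosition.UNKNOWN_NEEDS_EVIDENCE


print(triage_position(True, True).value)   # normal but high residual risk
print(triage_position(False, True).value)  # abnormal and high risk
```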

Why heuristic rules break down

Cloud SIEM rules often begin with a useful heuristic:

eventName = CreateAccessKey

or:

eventName in (StopLogging, DeleteTrail, UpdateTrail)

or:

AuthorizeSecurityGroupIngress from 0.0.0.0/0

These are good starting points. The problem is that the response decision quickly needs context that is not in the event itself.

In the 90-day baseline described earlier, many high-risk-looking events were expected under specific lanes:

StackSets and CloudFormation roles created and deleted IAM roles.
EKS load balancer controller roles changed security group rules.
Auto Scaling roles launched and terminated instances and created grants.
SSM QuickSetup roles created and updated operational roles.
SecurityHub roles deleted or modified Config rules as part of security-service behavior.
Bastion and build-server roles performed operational maintenance from owned EC2 public IPs.

At the same time, several things remained alert-worthy:

CreateAccessKey by a new SSO admin/account/source tuple.
DeleteDBInstance by an SSO admin where the normal lane was a build or devops role.
Bucket deletion involving audit-like or CloudTrail-named buckets.
Root console login, even with MFA.
CloudTrail, Config, or logging-control changes.
Public exposure changes outside known controller/deployment lanes.
IAM privilege creation by personal IAM users with static keys from unmanaged networks.

This is why a cloud MDR system needs both baseline memory and security judgment. A baseline without security judgment becomes suppression. Security judgment without baseline memory becomes alert fatigue.
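As a rough Python illustration of holding both halves in one check, the sketch below pairs a lane table (baseline memory) with an always-review set (security judgment). The archetype labels and both tables are hypothetical, loosely modeled on the lanes listed above.

```python
# Hypothetical lane table: event name -> actor archetypes observed
# performing it in the baseline. Labels are illustrative, not a schema.
EXPECTED_LANES = {
    "CreateRole": {"stacksets", "cloudformation", "ssm_quicksetup"},
    "DeleteRole": {"stacksets", "cloudformation"},
    "AuthorizeSecurityGroupIngress": {"eks_lb_controller"},
    "RunInstances": {"autoscaling"},
    "DeleteDBInstance": {"build_role"},
}

# Events that stay visible even when the lane matches (security judgment).
ALWAYS_REVIEW = {"CreateAccessKey", "DeleteDBInstance", "StopLogging",
                 "DeleteTrail", "UpdateTrail", "DeleteBucket"}


def assess(event_name, actor_archetype):
    """Return (in_lane, needs_review) for one event."""
    in_lane = actor_archetype in EXPECTED_LANES.get(event_name, set())
    needs_review = event_name in ALWAYS_REVIEW or not in_lane
    return in_lane, needs_review


print(assess("CreateRole", "stacksets"))         # (True, False)
print(assess("DeleteDBInstance", "build_role"))  # (True, True)
print(assess("DeleteDBInstance", "sso_admin"))   # (False, True)
```

Removing `ALWAYS_REVIEW` turns the check into pure suppression, and removing `EXPECTED_LANES` sends every event to review, which is the trade-off the paragraph above describes.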

What LLMs change

Before LLMs, encoding this kind of operational context was hard. You could write rules. You could enrich with asset inventory. You could add allowlists. You could tune source IPs. But representing business intent was difficult.

LLMs make a different workflow possible.

An LLM can inspect a cluster of events and reason over the surrounding evidence:

"This looks like CloudFormation StackSets creating expected roles across member accounts."
"This source IP belongs to an owned EC2 bastion, but the action is credential minting, so keep the residual risk high."
"This is a repeated EKS load balancer controller pattern, so the security group mutation is probably deployment-driven."
"This DeleteDBInstance event is normal for a build role in this account, but not for this SSO actor and source."
"This password policy sweep is likely organization-wide hardening, but it should move into an IaC-controlled process."

The LLM is not replacing detection logic. It is making the baseline interpretable.
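One way to picture this, as a hedged Python sketch: the model is handed structured evidence (the observation, the prior count for its tuple, and the override flag) rather than a bare event name. The function name, evidence keys, and prompt wording are all assumptions, and no model API is called here.

```python
import json


def build_triage_prompt(observation, prior_count, critical_override):
    """Assemble evidence for a model; no model API is called here."""
    evidence = {
        "observation": observation,
        "prior_observations_for_tuple": prior_count,
        "critical_override": critical_override,
    }
    return (
        "You are assisting a cloud MDR analyst.\n"
        "Using only the evidence below, state (1) whether the event matches "
        "a known business operation and (2) what residual risk remains.\n"
        "If the evidence is insufficient, say so.\n\n"
        + json.dumps(evidence, indent=2, sort_keys=True)
    )


prompt = build_triage_prompt(
    {"account": "000000000001", "actor": "build-user",
     "event": "CreateAccessKey", "source": "198.51.100.25",
     "region": "us-east-1"},
    prior_count=0,
    critical_override=True,
)
print(prompt)
```

Keeping the baseline count and override flag in the prompt means the explanation can be checked against the same evidence a human analyst would review.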

The right architecture is not "ask an LLM if an event is bad." The right architecture is:

Normalize cloud events into a stable schema.
Build tuple-scoped baselines over accounts, actors, event names, source IPs, regions, resources, and business families.
Correlate with inventory so account purpose, workload type, public IP ownership, resource footprint, and security-service coverage are available.
Preserve critical overrides for identity, logging, destructive, public exposure, root, KMS, and audit-integrity events.
Use an LLM to explain whether the event matches a known business operation and what residual risk remains.
Produce a decision artifact that a human can verify.

This is the model we have been turning into reusable skills.
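The first step of that architecture, normalization, can be sketched in a few lines of Python. The input keys (`eventName`, `awsRegion`, `sourceIPAddress`, `userAgent`, `recipientAccountId`, `userIdentity`) are standard CloudTrail record fields. The output field names and the `normalize` function are assumptions for this sketch, and the sample record is invented.

```python
import json


def normalize(record):
    """Flatten one CloudTrail record into a small, stable observation.

    The output field names are an assumption made for this sketch.
    """
    identity = record.get("userIdentity", {})
    # Service-initiated calls may carry invokedBy instead of an ARN.
    actor = identity.get("arn") or identity.get("invokedBy") or "unknown"
    return {
        "account": record.get("recipientAccountId") or identity.get("accountId"),
        "actor": actor,
        "actor_type": identity.get("type", "unknown"),
        "event": record.get("eventName"),
        "source": record.get("sourceIPAddress"),
        "region": record.get("awsRegion"),
        "user_agent": record.get("userAgent"),
    }


raw = json.loads("""
{
  "eventName": "CreateAccessKey",
  "awsRegion": "us-east-1",
  "sourceIPAddress": "198.51.100.25",
  "userAgent": "aws-cli",
  "recipientAccountId": "000000000001",
  "userIdentity": {
    "type": "IAMUser",
    "arn": "arn:aws:iam::000000000001:user/build-user",
    "accountId": "000000000001"
  }
}
""")
obs = normalize(raw)
print(obs["actor_type"], obs["event"], obs["source"])
```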

The shape of a useful cloud baseline

A useful cloud baseline is not a list of common events.

It needs to answer:

Which accounts exist, and what business workloads do they support?
Which actors are deployment automation, service roles, SSO admins, support roles, bastions, build servers, or unknown users?
Which source IPs are owned cloud assets versus unmanaged networks?
Which event families are expected for each account and actor?
Which regions are normal?
Which resources are expected to be touched?
Which event types are business-normal but still require verification?
Which patterns are first-seen, rare, or outside the actor/source lane?
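The source-ownership question is one of the easier ones to make concrete. The sketch below uses Python's standard `ipaddress` module. The `classify_source` function, its labels, and the idea of an `owned_networks` inventory export are assumptions for illustration.

```python
import ipaddress


def classify_source(source, owned_networks):
    """Label a CloudTrail source as an AWS service, owned asset, or unmanaged.

    owned_networks is a hypothetical inventory export: CIDR strings for
    elastic IPs, NAT gateways, and office ranges the customer controls.
    """
    if source.endswith(".amazonaws.com"):
        return "aws_service"
    try:
        ip = ipaddress.ip_address(source)
    except ValueError:
        return "unknown"
    for cidr in owned_networks:
        if ip in ipaddress.ip_network(cidr):
            return "owned"
    return "unmanaged"


owned = ["203.0.113.10/32", "192.0.2.0/24"]
print(classify_source("cloudformation.amazonaws.com", owned))  # aws_service
print(classify_source("203.0.113.10", owned))                  # owned
print(classify_source("198.51.100.25", owned))                 # unmanaged
```

An "owned" label is context, not clearance: a bastion's elastic IP is owned, yet credential minting from it can still carry high residual risk.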

In a real customer environment, that meant modeling account-level operations such as:

A data-platform account, where build roles, EKS controller roles, Auto Scaling, database lifecycle work, and SSO administrators all existed in the same operating surface.
An application-platform account, where Jenkins/build-server roles, EKS controller roles, SSM host-management roles, and serverless/storage-heavy workloads drove legitimate change.
A market-data account, where TeamCity and scheduled shutdown roles explained some lifecycle activity, but bucket deletion and key creation still required audit-integrity review.
A file-platform account, where Terraform-style activity could look coherent from a naming perspective while still being unacceptable when performed by a personal IAM user with a long-lived static key from unmanaged networks.

The baseline did not say, "these accounts are safe." It said, "this is how these accounts appear to operate, and here is where the operation exceeds acceptable risk."

From alerts to business-context findings

The most useful output of this approach is not an alert count. It is a finding that separates evidence, normality, and risk.

For example:

Event: CreateAccessKey
Account: application-platform / 000000000001
Actor: AWSReservedSSO_AWS-Admin_...
Source: 163.116.214.64

Baseline position:
This account has IAM activity in the baseline, but this account/actor/event/source/region tuple is new.

Security position:
Long-lived credential creation remains high risk even when performed by an administrator.

Action:
Verify approved legacy integration or rotation. If not approved, revoke the key and review subsequent usage.

Or:

Event: DeleteDBInstance
Account: analytics-platform / 000000000002
Actor: mhuo3-devops-frostlor-build-server-role
Source: 3.109.205.201

Baseline position:
This exact account/actor/source/region/event tuple appeared repeatedly in the 90-day baseline.

Security position:
Destructive database activity remains reviewable, but this instance is more likely business-normal than suspicious.

Action:
Verify the deployment or refresh record. Do not treat it the same as a new SSO-admin database deletion path.

This is more useful than a static severity. It tells the analyst how to think.

What we are open sourcing

The transilience-mdr-rainer repository is intended to capture the skills and workflows we have learned while operating cloud MDR for real customer environments.

The first set of skills is focused on the baseline problem:

collecting reproducible CloudTrail evidence,
normalizing CloudTrail events into stable observations,
building business-as-usual baselines,
comparing recent activity against tuple-scoped normality,
preserving critical security overrides,
packaging raw evidence for reproduction,
converting baseline drift into explainable triage findings,
generating detection specifications that can move into SIEM or SOAR systems.

The repository is structured as a skills-first AWS MDR project:

transilience-mdr-rainer/
├── .claude-plugin/
│   ├── plugin.json
│   └── marketplace.json
├── projects/
│   └── aws-mdr/
│       ├── .claude/skills/
│       │   ├── lookup-collector/
│       │   ├── normalize-observations/
│       │   ├── raw-evidence-pack/
│       │   ├── business-baseline/
│       │   ├── detection-specs/
│       │   ├── business-triage/
│       │   └── report-packager/
│       ├── examples/
│       └── outputs/
└── scripts/

Each skill is intentionally narrow. The workflow is composable: collect, normalize, preserve raw evidence, build the baseline, generate detections, triage with business context, then package the report.
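The collection stage, for instance, might look like the following Python sketch. It assumes a boto3 CloudTrail client for a single region is passed in. CloudTrail's LookupEvents accepts one lookup attribute per call, so the sketch issues one query per event name and records each query in a ledger so the collection can be reproduced. The `collect` function and the ledger layout are assumptions, not the repository's implementation.

```python
def collect(client, event_names, start, end):
    """Run one LookupEvents query per event name and keep a call ledger.

    client is expected to behave like a boto3 CloudTrail client for a single
    region; the ledger layout is an assumption made for this sketch.
    """
    events, ledger = [], []
    paginator = client.get_paginator("lookup_events")
    for name in event_names:
        returned = 0
        pages = paginator.paginate(
            LookupAttributes=[{"AttributeKey": "EventName",
                               "AttributeValue": name}],
            StartTime=start,
            EndTime=end,
        )
        for page in pages:
            batch = page.get("Events", [])
            events.extend(batch)
            returned += len(batch)
        # One ledger row per query, so the window and result size are on record.
        ledger.append({"event_name": name, "start": start.isoformat(),
                       "end": end.isoformat(), "returned": returned})
    return events, ledger


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   from datetime import datetime, timedelta, timezone
#   client = boto3.client("cloudtrail", region_name="us-east-1")
#   end = datetime.now(timezone.utc)
#   events, ledger = collect(client, ["CreateAccessKey"],
#                            end - timedelta(days=7), end)
```

Passing the client in, rather than creating it inside the function, keeps the sketch testable with a stub and leaves account and role selection to the caller.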

What this looks like in Markdown

The repository is designed to produce reviewable Markdown artifacts. The point is not to hide the decision inside a script or a SIEM query. The point is to create a record that an analyst, engineer, or customer can read and challenge.

A normalized observation can be represented in Markdown like this:

### Observation

| Field | Value |
| --- | --- |
| Account | `000000000001` |
| Region | `us-east-1` |
| Event | `CreateAccessKey` |
| Actor | `build-user` |
| Actor type | `iam_user` |
| Source | `198.51.100.25` |
| User agent | `aws-cli` |
| Business risk | `credential_or_privilege_risk` |
| Pattern family | `identity_and_privilege_management` |

The business baseline can then be summarized as a Markdown table:

### Business Baseline Tuple

| Account | Actor | Event | Source | Region | Prior observations | Baseline position |
| --- | --- | --- | --- | --- | ---: | --- |
| `000000000001` | `deployment-role` | `CreateRole` | `cloudformation.amazonaws.com` | `us-east-1` | 150 | Expected deployment lane |
| `000000000001` | `build-user` | `CreateAccessKey` | `198.51.100.25` | `us-east-1` | 0 | First-seen credential lane |

The important part is what the baseline stores. It is not just "CreateAccessKey happened." It records the account, actor, event, source, region, business family, recurrence, and security override:

### Baseline Decision

- Account: `000000000001`
- Actor: `build-user`
- Event: `CreateAccessKey`
- Source: `198.51.100.25`
- Region: `us-east-1`
- Business family: Identity and privilege management
- Baseline status: first-seen tuple
- Critical override: yes
- Decision: alertable
- Why:
  - New account/actor/event/source/region tuple.
  - Long-lived credential creation remains high risk.

That is the baseline problem in Markdown: the event name is only one column in the decision.
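A hedged Python sketch of how recurrence and the override could combine into that decision. The `CRITICAL_EVENTS` set, the `rare_below` threshold, and the status labels are illustrative assumptions, not tuned values.

```python
# Hypothetical override list; a real one would be maintained per event family.
CRITICAL_EVENTS = {"CreateAccessKey", "StopLogging", "DeleteTrail",
                   "UpdateTrail", "DeleteBucket", "DeleteDBInstance"}


def baseline_decision(event_name, prior_count, rare_below=3):
    """Combine recurrence and override into one decision record.

    rare_below is an arbitrary illustrative threshold, not a tuned value.
    """
    if prior_count == 0:
        status = "first-seen tuple"
    elif prior_count < rare_below:
        status = "rare tuple"
    else:
        status = "recurring tuple"
    override = event_name in CRITICAL_EVENTS
    # Recurrence is context, not approval: an override keeps the event alertable.
    alertable = override or status != "recurring tuple"
    return {
        "baseline_status": status,
        "critical_override": override,
        "decision": "alertable" if alertable else "baseline-normal",
    }


print(baseline_decision("CreateAccessKey", 0)["decision"])    # alertable
print(baseline_decision("CreateRole", 150)["decision"])       # baseline-normal
print(baseline_decision("DeleteDBInstance", 40)["decision"])  # alertable
```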

Skill examples as Markdown

The project skills can be described as Markdown contracts. Each skill has a clear input, a decision it is responsible for, and an output artifact.

### Skill: lookup-collector

Purpose: collect reproducible CloudTrail management events.

Inputs:
- Authorized AWS accounts and roles.
- Event names such as `CreateAccessKey`, `DeleteDBInstance`, `DeleteBucket`,
  `StopLogging`, `DeleteTrail`, `UpdateTrail`, and `AuthorizeSecurityGroupIngress`.
- Time window and regions.

Outputs:
- Raw CloudTrail events.
- Lookup call ledger.
- Reproduction notes.
- Collection caveats, including partial or capped queries.

### Skill: normalize-observations

Purpose: convert raw CloudTrail-shaped records into one stable observation view.

Inputs:
- LookupEvents records.
- CloudTrail JSON or JSONL.
- Evidence-pack wrappers.
- SIEM decoded payloads.

Outputs:
- One observation per event.
- Stable fields for account, actor, event, source, region, resource,
  actor type, business risk, and pattern family.
- Invalid-record counts and parsing caveats.

### Skill: business-baseline

Purpose: build a business-as-usual model from normalized observations.

Baseline dimensions:
- Account.
- Actor.
- Event name.
- Source IP or AWS service source.
- Region.
- Business family.
- Actor archetype.
- Resource pattern.

Rule:
Use recurrence as context, not approval. Critical identity, logging,
destructive, public exposure, root, KMS, and audit-integrity events remain alertable.

### Skill: detection-specs

Purpose: turn baseline lessons into portable detection logic.

Detection families:
- Root activity.
- Access-key creation outside approved rotation.
- IAM role and policy writes.
- CloudTrail and Config tamper.
- S3 bucket deletion or audit-bucket deletion.
- Public security group or route exposure.
- Destructive database operations.
- KMS key and grant changes.
- First-seen baseline tuple drift.

### Skill: business-triage

Purpose: explain what happened in business terms.

Finding structure:
- What happened.
- Evidence summary.
- Business-as-usual assessment.
- Residual risk.
- Verification steps.
- Remediation actions.
- Alerting implication.

A generated detection spec can also be represented as Markdown before it is implemented in QRadar, Athena, ClickHouse, Sigma, or a custom engine:

### Detection: CreateAccessKey outside approved rotation

Severity: Critical

Business family: Identity and privilege management

Trigger:
- Event name is `CreateAccessKey`.
- Actor is not an approved key-rotation role.
- Account/actor/event/source/region tuple is first-seen or rare.

Required evidence:
- Account ID.
- Actor.
- Source IP or AWS service source.
- User agent.
- Target user.
- Access key ID.
- Prior baseline count for the tuple.

Allowed closure:
- Approved rotation ticket exists.
- Caller is an approved rotation role.
- Target user is in the approved exception list.
- Newly created key is accounted for and monitored.

Analyst guidance:
If the event is not an approved rotation, revoke the key and review subsequent usage.
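Before porting a spec like this to a SIEM, it can help to express the trigger and the evidence check as plain functions. The Python sketch below assumes observation dictionaries with hypothetical field names, a made-up `APPROVED_ROTATION_ROLES` allowlist, and an arbitrary `rare_below` threshold.

```python
APPROVED_ROTATION_ROLES = {"key-rotation-role"}  # hypothetical allowlist

REQUIRED_EVIDENCE = ("account", "actor", "source", "user_agent",
                     "target_user", "access_key_id")


def create_access_key_trigger(obs, prior_count, rare_below=3):
    """Evaluate the three trigger conditions for one observation."""
    return (
        obs["event"] == "CreateAccessKey"
        and obs["actor"] not in APPROVED_ROTATION_ROLES
        and prior_count < rare_below  # first-seen or rare tuple
    )


def missing_evidence(obs):
    """List required evidence fields that are absent or empty."""
    return [field for field in REQUIRED_EVIDENCE if not obs.get(field)]


obs = {"account": "000000000001", "actor": "build-user",
       "event": "CreateAccessKey", "source": "198.51.100.25",
       "user_agent": "aws-cli"}
print(create_access_key_trigger(obs, prior_count=0))  # True
print(missing_evidence(obs))  # ['target_user', 'access_key_id']
```

Separating the trigger from the evidence check keeps a firing detection from being closed before the target user and key ID have been collected.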

The output finding should read like an analyst decision, not a raw alert:

### Finding: Bastion CreateAccessKey

Business-as-usual assessment:
Likely sanctioned key rotation. The cadence, user agent, source EC2 elastic IP,
target pipeline user, and create/update/delete sequence match the known monthly
rotation lane.

Residual risk:
High design risk. A bastion role can mint long-lived AWS keys. If the bastion is
compromised, the attacker can create CI/CD credentials.

Action:
Move the pipeline to federation, remove key-minting permissions from bastion
roles, and alert on future `CreateAccessKey` unless the caller and target match
the approved exception.

The goal is not to publish one perfect rule pack. Cloud does not work that way.

The goal is to publish a way of thinking and a set of reusable skills that help teams answer:

What is normal for this customer?
What is normal only for this account, actor, source, region, and resource pattern?
What is abnormal?
What is normal but still dangerous?
What should be verified before closure?
What should become an exception, a control improvement, or a detection?

Closing thought

Cloud detection should not be a fight between generic rules and endless allowlists.

Generic rules are necessary, but they are incomplete. Allowlists reduce noise, but they often erase risk. The better path is business-context baselining: understand how the customer actually operates, keep security-critical events visible, and use LLMs to explain the intent and residual risk behind the activity.

There is no universal normal in cloud.

There is only normal for this business, in this account, by this actor, from this source, against this resource, for this purpose.

That is the baseline problem. And solving it is the next step for effective cloud MDR.