Approaching the Agentic SOC

An agentic SOC routes detections through an LLM-based agent that triages them, decides which need a human, and handles the rest. The detection layer feeding the agent is what determines whether the deployment pays back, not the agent itself.

A SOC (security operations center) is the team responsible for monitoring security alerts and responding to incidents. SOCs typically operate on an L1, L2, and L3 tiered model. L1 (first-tier) analysts handle initial triage of high-volume alerts. L2 (second-tier) analysts handle escalations and deeper investigation. L3 (third-tier) analysts handle threat hunting and the most complex incidents. Current agentic SOC implementations are typically aimed at replacing L1 work.

In current AI terminology, an agent is an LLM (large language model, the kind of AI underlying tools like ChatGPT and Claude) combined with a set of tools it can call (functions, APIs, scripts) and an orchestration loop that lets it iterate. The orchestrator executes the LLM's chosen tool calls, returns the results to the LLM, and lets the LLM decide the next step until it reaches a final answer or hits a configured limit. What makes the workflow agentic rather than just "prompting an LLM" is that the sequence of steps is decided at runtime, not scripted in advance. The detections fed in come from systems like a SIEM (security information and event management platform that aggregates logs and runs detection rules) and an EDR (endpoint detection and response sensor that monitors process and file activity on workstations and servers).
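
The loop described above can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: `run_agent`, `Step`, `fake_llm`, and `recent_logins` are hypothetical names, and the scripted `fake_llm` function stands in for a real LLM call so the example runs offline.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Step:
    """One decision from the model: call a tool, or finish with a verdict."""
    tool: Optional[str] = None
    args: dict = field(default_factory=dict)
    verdict: Optional[str] = None


def run_agent(decide: Callable[[list], Step],
              tools: dict,
              alert: dict,
              max_steps: int = 5) -> dict:
    """Ask the model for the next step, run the chosen tool, feed the result back."""
    history = [{"role": "alert", "content": alert}]
    for _ in range(max_steps):
        step = decide(history)
        if step.verdict is not None:
            return {"verdict": step.verdict, "history": history}
        if step.tool not in tools:
            # Unknown tool: report the error back to the model instead of crashing.
            result = f"error: unknown tool {step.tool!r}"
        else:
            result = tools[step.tool](**step.args)
        history.append({"role": "tool", "tool": step.tool,
                        "args": step.args, "result": result})
    # Configured limit reached without a verdict: hand off to a human.
    return {"verdict": "escalate_to_human", "history": history}


def fake_llm(history: list) -> Step:
    """Scripted stand-in for the LLM (assumption: a real model would decide this)."""
    tool_results = [h for h in history if h["role"] == "tool"]
    if not tool_results:
        user = history[0]["content"]["user"]
        return Step(tool="recent_logins", args={"user": user})
    failed = tool_results[-1]["result"]["failed"]
    return Step(verdict="escalate_to_human" if failed > 10 else "close_benign")


def recent_logins(user: str) -> dict:
    """Hypothetical SIEM lookup; returns canned data here."""
    return {"user": user, "failed": 42, "succeeded": 1}


outcome = run_agent(fake_llm, {"recent_logins": recent_logins},
                    {"id": "A-1", "user": "jdoe"})
print(outcome["verdict"])
```

The `max_steps` cap plays the role of the configured limit mentioned above. In this sketch, hitting it routes the alert to a human rather than dropping it.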

How detections get gated before reaching an analyst varies by environment. Common mechanisms include pre-filters that deterministically drop obvious noise before the LLM is invoked (running the LLM on every event is impractical at SOC-scale volumes because of LLM API throughput limits and cost), agent scoring that routes by confidence, risk weighting based on asset criticality and user privilege, clustering of similar alerts, and per-alert-class routing rules. Architectures vary further. Some are fully managed platforms with predefined tool integrations. Some are frameworks built against an existing stack. Multi-agent designs split work among specialist agents for triage, threat hunting, and IR (incident response). Single-agent designs have one LLM handle whatever an investigation requires.
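
As a rough sketch of how three of these gating mechanisms might compose (deterministic pre-filter, risk weighting, confidence routing), consider the following. All names, rule identifiers, and numeric margins are illustrative assumptions, and `fake_score` stands in for the agent's confidence output.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable

# Hypothetical rule names treated as obvious noise; real lists are environment-specific.
NOISE_RULES = {"heartbeat", "approved_vuln_scanner"}

# Illustrative margins: riskier assets need more confidence before the agent may close.
CRITICALITY_MARGIN = {"low": 0.0, "medium": 0.05, "high": 0.2}


@dataclass
class Alert:
    rule: str
    asset_criticality: str  # "low", "medium", or "high"
    privileged_user: bool


def prefilter(alert: Alert) -> bool:
    """Deterministic check that runs before the LLM; True means keep the alert."""
    return alert.rule not in NOISE_RULES


def route(alert: Alert, benign_confidence: float, base_threshold: float = 0.9) -> str:
    """Route by confidence, raising the bar for critical assets and privileged users."""
    required = base_threshold + CRITICALITY_MARGIN[alert.asset_criticality]
    if alert.privileged_user:
        required += 0.05
    # A required value above 1.0 can never be met, so such alerts always reach a human.
    return "agent_close" if benign_confidence >= required else "human_review"


def triage(alert: Alert, score: Callable[[Alert], float]) -> str:
    """Apply the pre-filter first so the expensive scorer only sees surviving alerts."""
    if not prefilter(alert):
        return "dropped_by_prefilter"
    return route(alert, score(alert))


calls = []


def fake_score(alert: Alert) -> float:
    """Stand-in for the agent's confidence that the alert is benign."""
    calls.append(alert.rule)
    return 0.97


print(triage(Alert("heartbeat", "low", False), fake_score))
print(triage(Alert("odd_parent_process", "low", False), fake_score))
print(triage(Alert("odd_parent_process", "high", True), fake_score))
```

Passing the scorer as a callable makes the ordering explicit: an alert dropped by the pre-filter never triggers an LLM call.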

Alert classes are categories of detection that share similar characteristics, often aligned with the tactics in MITRE ATT&CK (a widely used framework of adversary tactics and techniques). Examples include credential access (brute force, credential stuffing), lateral movement (pass-the-hash, remote service abuse), privilege escalation (token theft, sudo abuse), persistence (autostart modifications, scheduled tasks), defense evasion (log clearing, security tool tampering), and data exfiltration (large outbound transfers, anomalous cloud uploads). Grouping detections into classes lets the agent apply class-specific investigation playbooks, autonomy levels, and latency budgets rather than treating every alert the same.
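
One way to make per-class playbooks, autonomy levels, and latency budgets concrete is a small policy table keyed by alert class. The sketch below is an assumption about shape only: the class names, playbook step names, and numbers are placeholders, not recommendations.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import IntEnum


class Autonomy(IntEnum):
    READ_ONLY = 1    # enrichment only
    RECOMMEND = 2    # recommendation-only triage
    SUPERVISED = 3   # closure with analyst approval
    AUTONOMOUS = 4   # autonomous closure or containment


@dataclass(frozen=True)
class ClassPolicy:
    playbook: tuple          # ordered investigation steps (hypothetical step names)
    autonomy: Autonomy       # highest level the agent may operate at for this class
    latency_budget_s: int    # maximum seconds from detection to action


# Illustrative values only.
POLICIES = {
    "credential_access": ClassPolicy(
        playbook=("pull_auth_logs", "check_hr_status", "check_lateral_movement"),
        autonomy=Autonomy.SUPERVISED,
        latency_budget_s=300,
    ),
    "data_exfiltration": ClassPolicy(
        playbook=("pull_network_flows", "identify_destination", "check_data_sensitivity"),
        autonomy=Autonomy.RECOMMEND,
        latency_budget_s=60,
    ),
}

# Unknown classes fall back to the most restrictive policy.
DEFAULT_POLICY = ClassPolicy(playbook=("manual_triage",),
                             autonomy=Autonomy.READ_ONLY,
                             latency_budget_s=900)


def policy_for(alert_class: str) -> ClassPolicy:
    return POLICIES.get(alert_class, DEFAULT_POLICY)


def may_act(alert_class: str, needed: Autonomy) -> bool:
    """True if the class's configured autonomy covers the level an action needs."""
    return policy_for(alert_class).autonomy >= needed


print(may_act("credential_access", Autonomy.RECOMMEND))
print(may_act("credential_access", Autonomy.AUTONOMOUS))
```

Defaulting unknown classes to read-only keeps a newly added detection from inheriting autonomy it has not earned.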

LLMs handle IOC matches (file hashes, malicious domains, known-bad IPs) trivially since these are direct lookups against threat intelligence. Behavioral detections are how unknown threats and zero-days get caught, and their quality is what makes the agent effective and economical to run.

Audit trails are central to an agentic SOC, not an afterthought. The program needs to prove which detections fired and what happened to each, what tool calls and actions the agent took, and which alerts it closed as benign and on what basis. The harder question, which logging alone cannot answer, is whether any attacks slipped through as silent closures, since a closure that buried an attack looks identical to a closure that buried noise. That gap is why structured manual review and active false negative discovery sit alongside comprehensive logging. In regulated industries, the lack of audit infrastructure alone can stop a program from deploying. In any industry, it determines whether the program can be trusted with autonomous decisions over time.
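
The first of those proofs, that every fired detection ended up somewhere, reduces to a set reconciliation. The sketch below assumes alert IDs are available from both the detection layer and the agent's state store; the function name and state labels are hypothetical.

```python
from __future__ import annotations

# Assumed labels for the terminal states an alert may end in.
TERMINAL_STATES = {"agent_verdict", "human_escalation", "suppression_rule", "error_fallback"}


def reconcile(fired_ids: set, states: dict) -> dict:
    """Compare alerts the detection layer fired against recorded terminal states."""
    accounted = {alert_id for alert_id, state in states.items()
                 if state in TERMINAL_STATES}
    return {
        # Fired, but no terminal state on record: dropped or still in flight.
        "unaccounted": fired_ids - accounted,
        # On record, but absent from the detection layer's output.
        "unexpected": set(states) - fired_ids,
    }


fired = {"A-1", "A-2", "A-3", "A-4"}
states = {"A-1": "agent_verdict", "A-2": "human_escalation", "A-3": "in_progress"}
report = reconcile(fired, states)
print(sorted(report["unaccounted"]))
```

A check like this catches silent drops only. It says nothing about whether a recorded closure was correct, which is the gap that manual review and false negative discovery are meant to cover.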

1. Detection engineering talent. The people who write behavioral detection rules.
Importance: Critical
Why: Behavioral detections catch unknown threats and zero-days, but their quality varies enormously by who writes them. Strong talent produces high-fidelity rules that keep LLM costs low. Weak talent produces noise that drives costs up from day one and compounds over time.
How: Invest upfront in one excellent detection engineer over several mediocre ones. The cost of strong talent is small compared to the compounding LLM bills from running on poor detections.

2. Detection quality before the agent. Production-grade behavioral detections with aggregation, sequencing, lineage, identity correlation, and context fields built in.
Importance: Critical
Why: Behavioral detections catch unknown threats and zero-days that IOCs cannot. Without context built in, the LLM has to infer intent from raw telemetry, which is where hallucination starts and token costs spiral.
How: Build behavioral detections with explained logic baked in (why anomalous, expected behavior, false positive considerations, context fields). Mature the detection engineering function before deployment.

3. Tool selection and telemetry depth. EDR and SIEM capabilities feeding behavioral detections.
Importance: High
Why: Tools picked just to check the audit box cost far more in the long run. Underpowered telemetry and weak query languages prevent building the sophisticated behavioral detections that would otherwise keep agent costs low.
How: Prioritize EDR telemetry depth (event breadth, granularity, kernel visibility) and SIEM query capability (aggregations, sequencing, joins, statistical functions). Splunk and LogScale both support sophisticated behavioral detection.

4. Environment context and asset criticality. Ongoing knowledge of asset criticality, vendor IPs, user roles, and business context the agent uses to interpret alerts.
Importance: High
Why: Without environment context, the agent applies generic logic to specific situations. A login from Singapore could be a CEO traveling or a compromised account.
How: Feed the agent live asset criticality data, vendor allowlists, user-role mappings, and business-context tags from CMDB and identity systems. Keep the feed continuous, not a one-time setup.

5. SOC workflow consistency. Standardized severity definitions, closure reasons, escalation paths, and ticket ownership across the existing SOC.
Importance: High
Why: The agent inherits whatever workflow inconsistencies exist. If closure reasons vary by analyst today, the agent's output will look inconsistent for a reason that has nothing to do with the agent.
How: Audit the existing SOC workflow before deployment. Standardize severity definitions, closure reason codes, escalation paths, and ticket ownership. Build agent prompts and actions on top of the cleaned-up workflow.

6. Use-case selection for first deployment. The alert classes the agent runs on first.
Importance: High
Why: Starting with high-impact, hard-to-investigate alert classes overweights early risk and obscures whether failures are model issues or use-case mismatch. Narrow first deployments let quality and confidence build before scope expands.
How: Begin with narrow, repeatable, evidence-rich classes where incorrect closure has limited impact (phishing triage, IOC enrichment, low-severity repeated alerts). Defer high-stakes classes such as ransomware containment, insider threat, and privileged account abuse until the program has measured performance to justify expansion.

7. Autonomy maturity ladder. Staged progression of agent independence over time.
Importance: High
Why: Going straight to autonomous closure on day one inverts the learning curve. Each maturity level surfaces failures the previous level cannot, and skipping levels means discovering those failures in production.
How: Stage the rollout across levels:
  • Read-only enrichment
  • Recommendation-only triage
  • Supervised closure (analyst approves)
  • Autonomous closure or containment
Advance per alert class only after measured performance at the current level holds.

8. Privilege boundaries per alert class. The agent's permitted actions per alert class.
Importance: High
Why: Different alert classes have different impact and reversibility. A blanket default ignores that mismatch.
How: Define permitted actions per alert class across the autonomy spectrum of read-only investigation, autonomous closure, and containment. Tie autonomy to action impact and reversibility, with stricter controls on actions that cannot be reversed.

9. Investigation completeness criteria. The definition of "thorough" per alert class.
Importance: High
Why: Without a checklist, completeness varies run to run. Missed escalations cannot be diagnosed as coverage gaps versus quality issues.
How: Define per-alert-class investigation checklists. E.g., for a credential access detection the agent should pull related auth logs, cross-reference HR records, and check for follow-on lateral movement before reaching a verdict.

10. Standard evidence package format. The structured output the agent produces for each investigation.
Importance: High
Why: Without a defined output format, the same investigation produces different shapes of evidence run to run, and reviewers waste time parsing freeform output instead of validating the decision. A consistent package also makes downstream automation possible.
How: Define and enforce a standard package with required fields:
  • Summary
  • Timeline of relevant events
  • Entities involved
  • Telemetry queried, with citations
  • Confidence and uncertainty
  • Recommended action
  • Reason for verdict
Validate the package by schema. Reject investigations that do not produce it.

11. Gating threshold tuning. The confidence threshold flipping alerts from agent-close to human-review.
Importance: High
Why: If the agent closes too readily, real positives close silently. If it escalates too readily, the queue fills with ambiguous cases.
How: Set initial thresholds conservatively. Calibrate against labeled outcomes once a meaningful sample exists. Revisit as alert mix and model behavior change.

12. Latency budget for time-sensitive alerts. The maximum time from detection to action for high-impact alert classes.
Importance: High
Why: LLM tool-calling loops can take 30+ seconds per investigation. For active ransomware, lateral movement, or data exfiltration in progress, slow response means containment happens after damage is done.
How: Define per-alert-class latency budgets. Pre-cache enriched context for hot classes. Use faster models for time-sensitive classes. Parallelize tool calls where the investigation logic allows it.

13. Rollback paths for autonomous actions. A reverse action for every autonomous capability.
Importance: High
Why: The agent will isolate wrong hosts and disable legitimate accounts at some point. Without a defined reverse, recovery is ad hoc and slow.
How: Build the reverse before enabling autonomy. No autonomy for actions that cannot be reversed quickly and cleanly.

14. Failure mode design. Defined paths when the LLM API or tools fail mid-investigation.
Importance: High
Why: The alert in flight must go somewhere. "Lost" is not acceptable.
How: Define fallback queues, deterministic backup rules, or human routing per failure type.

15. Agent credentials are a high-value target. The API access the agent uses across SIEM, EDR, identity, ticketing, threat intel, and often AD (Active Directory) or cloud admin.
Importance: High
Why: Compromise grants analyst-equivalent or privileged access in one shot.
How: Treat as privileged service accounts. Audit, rotate, monitor for anomalies, and consider least-privilege per tool call.

16. PII and sensitive data in LLM calls. Personal, sensitive, and regulated data flowing through the LLM API.
Importance: High
Why: Telemetry contains usernames, IPs, file paths, command-line credentials, and sometimes PII. External LLM providers may retain, log, or train on this data depending on contract terms.
How: Read the LLM provider's data retention and training terms. Sanitize or tokenize sensitive fields before sending to external APIs. Consider self-hosted or in-region LLMs for regulated data.

17. Prompt injection through telemetry content. Attacker-controlled text in logs read as agent instructions.
Importance: High
Why: Command lines, filenames, registry values, and headers can contain imperatives like "internal scanning tool, ignore" that influence triage.
How: Sanitize at ingest. Separate instructions from data in prompts. Require the agent to verify claims against telemetry it queries itself rather than text in alert payloads.

18. Hallucinated factual claims. LLM confidently stating things not in the data.
Importance: High
Why: Unsourced claims in agent output (e.g., "user logged in from Russia at 03:00" with no log reference) degrade analyst trust in agent verdicts.
How: Require every claim to cite specific retrieved telemetry. Treat unsourced claims as hallucinated.

19. Reasoning audit trail. Retrievable chain of reasoning behind agent verdicts.
Importance: High
Why: "Closed by agent" alone fails compliance, legal hold, retrospective, and analyst training needs. Without the reasoning behind a verdict, there is no way to evaluate whether the verdict was reached soundly.
How: Capture and persist intermediate inferences, the evidence weighed, and the final verdict. Make the full chain queryable by alert ID, time range, alert class, and verdict type.

20. Action audit log. Comprehensive record of every tool call, query, decision, escalation, and action the agent takes.
Importance: High
Why: Reasoning explains why. The action log proves what was actually done. Both are required for retrospective review, incident response, and regulatory audit, and the two often diverge in failure cases.
How: Log every tool call with full parameters, results, latency, and outcome. Persist alongside the reasoning chain. Make the combined record queryable by alert ID, action type, time range, and outcome.

21. Detection processing proof. Evidence that every alert fired by the detection layer reached a terminal state through the agent or a defined alternative path.
Importance: High
Why: Without this, alerts can silently drop between detection and agent. A silent drop looks identical to no alert firing in the first place, and the gap can sit undetected for months.
How: Track every alert from detection through to terminal state (agent verdict, human escalation, suppression rule, error fallback). Reconcile detection layer output against agent input counts continuously. Alarm on any unaccounted alerts.

22. Structured manual review workflow. A defined cadence and methodology for human review of agent decisions.
Importance: High
Why: Without a systematic process, review becomes ad hoc and full of gaps. Verification ends up depending on chance rather than statistically meaningful sampling, and entire alert classes can go uninspected for months.
How: Define a weekly cadence covering random sampling of closed alerts, full review of high-severity decisions, and class-specific deep dives on high-impact alert classes. Track findings in a log that drives detection tuning, prompt updates, and process changes.

23. False negative discovery program. Structured methods for surfacing alerts the agent incorrectly closed as benign.
Importance: High
Why: A missed escalation looks identical to a correctly closed false positive until an incident reveals it. Without an active discovery program, false negatives accumulate undetected and the program cannot defend its claimed effectiveness.
How: Sample agent-closed alerts weekly for human review. Inject synthetic adversary scenarios that should escalate. Reconcile every incident post-mortem against the agent's prior decisions on related alerts.

24. Cost ceiling discipline. Caps on LLM token spending per investigation, per alert, and aggregate.
Importance: High
Why: A bad day can produce a multi-thousand-dollar bill in one shift. Billing dashboards lag.
How: Set per-investigation and aggregate caps. Rate-limit tool calls. Alarm on cost trajectory via leading indicators.

25. Regulatory and audit requirements. Documented controls over automated security decision-making.
Importance: High
Why: Auditors require access to every LLM decision and its inputs. Without prepared answers, audits stall the program.
How: Document before audits. Have answers for verifying true negatives, retention windows, false negative review cadence, and prompt injection flagging.

26. Ground truth labeling. Analyst review of agent decisions to produce performance labels.
Importance: High
Why: Labels (true positive, false positive, missed escalation, correct close) come from ongoing analyst time, not a one-time setup.
How: Budget continuous labeling as part of operating cost. Without it, performance cannot be measured rigorously.

27. Drift detection. Continuous measurement of agent accuracy against ground truth.
Importance: High
Why: Model updates, prompt changes, and shifting detection coverage cause accuracy drift that can compound unnoticed for months without continuous measurement.
How: Continuously measure decisions against labeled ground truth. Point-in-time evaluation misses drift.

28. Model version pinning. The specific LLM version, prompt template, and tool schema the agent runs.
Importance: High
Why: Providers update models silently. A workflow that worked yesterday can break tomorrow with no code change, distinct from gradual drift in being sudden and vendor-driven.
How: Pin the model to a specific version string. Test new versions in staging against a regression set before production cutover. Version prompts and tool schemas like code.

29. Telemetry coverage gaps inherited by the agent. Detection blind spots that become agent blind spots.
Importance: High
Why: Gaps in collection (container-internal visibility, destinations hidden by encrypted SNI/Server Name Indication, etc.) become agent blind spots while verdicts can still look complete.
How: Map coverage gaps. Communicate to stakeholders. Plan telemetry expansion before blind spots are exploited.

30. Output structure reliability. Reliable structured output for downstream consumers.
Importance: Medium
Why: Agents commonly prepend prose, rearrange fields, or omit fields. Pipelines assuming clean output break.
How: Use JSON mode, schema validation, and retries on format failure. Build pipelines to handle imperfection gracefully.

31. Cross-agent context consistency. Shared memory architecture across multiple agents.
Importance: Medium
Why: Without it, multiple agents on the same incident make conflicting decisions, repeat tool calls, and surface contradictory verdicts.
How: Decide the shared memory architecture before scaling to multiple agents.

32. Pre-production validation for agent changes. A staging environment where prompt, tool, and detection changes get validated before reaching production.
Importance: Medium
Why: Prompt edits and tool changes do not always behave as expected. Without a staging pipeline, breaking changes reach production undetected.
How: Maintain a staging environment with replayable historical alerts. Validate every change against a regression test set of labeled outcomes before promotion.

33. Skill atrophy in human analysts. Loss of L1 pattern recognition as the agent handles routine work.
Importance: Medium
Why: Analysts stop seeing L1 work day-to-day. New hires lack the foundational reps that build evaluation skills.
How: Preserve some manual case load deliberately, or build training paths that do not depend on real-world reps.
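
Items 10, 18, and 30 above (a required evidence package, cited claims, and tolerance for malformed output) can be combined into one validation step. The following is a minimal standard-library sketch under stated assumptions: the key names are hypothetical spellings of item 10's required fields, and a production setup might use a dedicated schema validator instead.

```python
from __future__ import annotations

import json

# Hypothetical key names for the required fields listed in item 10.
REQUIRED_FIELDS = {
    "summary": str,
    "timeline": list,
    "entities": list,
    "telemetry_queried": list,
    "confidence": (int, float),
    "recommended_action": str,
    "reason_for_verdict": str,
}


def validate_package(raw: str) -> list:
    """Return a list of problems; an empty list means the package is accepted."""
    try:
        package = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(package, dict):
        return ["top-level value must be a JSON object"]

    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in package:
            problems.append(f"missing field: {name}")
        elif not isinstance(package[name], expected):
            problems.append(f"wrong type for field: {name}")
    if problems:
        return problems

    # Each timeline entry must cite a query the agent actually ran.
    known = {q["id"] for q in package["telemetry_queried"]
             if isinstance(q, dict) and isinstance(q.get("id"), str)}
    for event in package["timeline"]:
        cite = event.get("cites") if isinstance(event, dict) else None
        if not isinstance(cite, str) or cite not in known:
            problems.append(f"unsourced timeline entry: {event!r}")
    return problems


example = json.dumps({
    "summary": "Repeated failed logins followed by one success for jdoe.",
    "timeline": [{"event": "42 failed logins in 5 minutes", "cites": "q1"}],
    "entities": ["user:jdoe", "host:ws-017"],
    "telemetry_queried": [{"id": "q1", "source": "siem", "query": "auth failures for jdoe"}],
    "confidence": 0.7,
    "recommended_action": "escalate",
    "reason_for_verdict": "Pattern is consistent with credential stuffing.",
})
print(validate_package(example))                         # accepted: no problems
print(validate_package("Sure, here is the result: {}"))  # rejected: prose before the JSON
```

Returning a list of problems rather than raising lets the caller decide whether to retry the model, route to a human, or log the rejection.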

Auditable artifacts and retention

The table below covers the categories of agentic-SOC-specific data that should be captured, retained, and made retrievable. Existing enterprise programs for change management, vendor risk, segregation of duties, training, and audit data integrity should be extended to cover the new artifact types rather than rebuilt. LLM-specific audit data is not yet codified in retention regulations in most jurisdictions, so the working default is to apply the organization's existing security log retention policy. PCI DSS v4.0.1 Requirement 10.5.1 specifies at least 12 months of audit log history with the most recent three months immediately available. HIPAA 45 CFR § 164.316(b)(2)(i) requires retention of documentation required by the Security Rule for 6 years from creation or last effective date, which covers the policies and procedures themselves rather than audit logs directly. Specific windows depend on industry, jurisdiction, compliance frameworks in scope, and any active legal holds.

Detection inputs
Artifacts to capture:
  • Detection rule (ID, version, query at fire time, author, last-modified timestamp)
  • Raw alert payload
  • Enriched alert payload after context joins
  • Asset context at fire time (criticality, owner, environment)
  • Identity context at fire time (user, role, department, employment status)
  • Threat intel referenced
  • Prompt template version and content
  • System prompt version and content
  • Tool schemas presented to the LLM
  • Model identifier and version string
  • Model parameters (temperature, top_p, max tokens, etc.)
Retention guidance: Match the organization's security log retention policy. Extend if retrospective accuracy reconstruction is needed.
Why: A verdict cannot be audited without the inputs that produced it. Reconstructing what the agent saw at decision time requires every field to be retrievable later, including the exact version of the detection rule and the exact context joined in.

LLM reasoning
Artifacts to capture:
  • Full LLM request payload per call
  • Full LLM response payload per call
  • Tokens consumed (input and output) per call
  • Cost per call
  • Latency per call
  • Number of iterations to reach verdict
  • Intermediate inferences
  • Evidence weighed
  • Citations to specific retrieved telemetry
  • Confidence scores
  • Final verdict
  • Rationale for verdict
Retention guidance: Match the organization's security log retention policy.
Why: Demonstrating why a decision was made. Required for regulatory audit, legal hold, post-incident analysis, and analyst training. Citations are particularly important since they distinguish reasoned verdicts from hallucinated ones.

Agent actions
Artifacts to capture:
  • Every tool call with full parameters and full results
  • Latency per call
  • Retry attempts and reasons
  • SIEM and EDR queries executed (query string, time range, result count)
  • External APIs called
  • Containment actions executed
  • Tickets created or modified
  • Notifications sent
  • Data sources accessed
  • Access tokens or API keys used (identifier only, never the secret)
Retention guidance: Match the organization's security log retention policy.
Why: Proof of what was actually done, separate from what the agent reasoned. Required for chain of custody, incident response, and demonstrating compliance with action authorization policies. Reasoning and action logs often diverge in failure cases, which makes both required.

Outcomes
Artifacts to capture:
  • Final disposition (true positive, false positive, escalation, containment, etc.)
  • Human override details (reviewer identity, timestamp, original verdict, new verdict, reason)
  • Correlation to related incidents and tickets
  • Final state of any containment or remediation
  • Time to resolution
  • Downstream impact if the alert was part of a larger incident
Retention guidance: Match the organization's incident records retention policy.
Why: Linking agent decisions to actual outcomes. Required for accuracy measurement, post-mortem analysis, and demonstrating that overrides happen when appropriate.

Performance metrics
Artifacts to capture:
  • End-to-end latency per investigation
  • Token consumption per investigation
  • Cost per investigation
  • Tool call count per investigation
  • Error count per investigation
  • Retry count per investigation
  • Queue wait time before processing started
Retention guidance: Shorter. 90 days hot is typical, with cold archives for trend analysis.
Why: Identifying cost overruns, performance degradation, and operational drift before they become critical. Trends matter more than individual data points, so aggregation can take over after a cooldown period.

Operational events
Artifacts to capture:
  • System errors and exceptions
  • LLM API failures (timeouts, rate limits, model unavailable, content filter refusals)
  • Tool call failures
  • Rate limit hits
  • Fallback path invocations and the alerts they handled
  • Failover events
  • Scheduled maintenance windows
  • Deployment events
Retention guidance: Shorter. 90 days hot is typical, with cold archives for trend analysis.
Why: Reconstructing operational failures, identifying systemic issues, and demonstrating that alerts handled during outages did reach a defined alternative path rather than being lost.

Aggregated statistics
Artifacts to capture:
  • Alert volume by class over time
  • Verdict distribution by class (TP, FP, FN, TN counts and rates)
  • Mean times to triage, escalation, and containment by class
  • Cost trends by class and overall
  • Accuracy by alert class over time
  • Drift indicators
  • Manual review findings rolled up
  • False negative discovery rate
Retention guidance: Indefinite. Storage cost is low and trend data is more valuable the longer the window.
Why: Demonstrating program effectiveness over time, supporting trend analysis, surfacing degradation early, justifying continued investment, and answering board-level and audit questions about overall posture.

Manual review records
Artifacts to capture:
  • Reviewer identity
  • Review timestamp
  • Alerts sampled for review
  • Sampling methodology used
  • Agreement or disagreement with agent verdict for each
  • Reasons for disagreement
  • Action items raised
  • Follow-up status
  • Updates made to detections, prompts, or processes as a result
Retention guidance: Match the organization's security log retention policy.
Why: Demonstrating that human verification is happening systematically, not ad hoc, and that findings feed into program improvement. Often the artifact auditors ask for first in SOC 2 and similar reviews.

False negative discoveries
Artifacts to capture:
  • Original agent verdict
  • Actual finding when discovered
  • Time elapsed between agent decision and discovery
  • Discovery source (sampling, red team injection, post-mortem, external report)
  • Root cause analysis
  • Remediation action
  • Prompt or detection updates made
  • Verification that the fix would have caught the original miss
Retention guidance: Match the organization's incident records retention policy.
Why: Tracking what was discovered, why it was missed, and whether systemic issues exist. Often the most scrutinized category in audits since it directly addresses whether the program can be trusted with autonomous decisions.

Configuration changes
Artifacts to capture:
  • Prompt template versions with diffs
  • Tool schema versions
  • Model version changes
  • Model parameter changes
  • Gating threshold changes
  • Autonomy boundary changes
  • Alert class definition changes
  • Reviewer and approver identity
  • Approval timestamp
  • Rollback information
Retention guidance: Match the organization's change management retention policy.
Why: Linking behavior changes to specific configuration changes. Required for change management compliance and for diagnosing accuracy or behavior shifts that follow a config change.

Audit data access logs
Artifacts to capture:
  • Who accessed audit data
  • When the access occurred
  • What was accessed
  • What was exported
  • What was modified (audit records should generally be append-only, so modifications need to be flagged)
Retention guidance: Match the organization's security log retention policy.
Why: Audit data itself must be auditable. Prevents tampering, supports regulatory requirements around access to sensitive records, and provides chain-of-custody for any record produced for legal proceedings.
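
One common way to make append-only audit records tamper-evident is to chain each entry to the hash of the previous one. The source does not prescribe this technique; the sketch below is an assumption-laden, in-memory illustration (the `AuditLog` class and record fields are hypothetical), and a real deployment would likely pair it with write-once storage and access controls.

```python
from __future__ import annotations

import hashlib
import json


def _digest(prev_hash: str, record: dict) -> str:
    """Hash the previous entry's hash together with a canonical form of the record."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


class AuditLog:
    """In-memory append-only log where each entry commits to the one before it."""

    GENESIS = "0" * 64

    def __init__(self) -> None:
        self._entries = []

    def append(self, record: dict) -> str:
        prev = self._entries[-1]["hash"] if self._entries else self.GENESIS
        entry_hash = _digest(prev, record)
        self._entries.append({"record": record, "prev": prev, "hash": entry_hash})
        return entry_hash

    def verify(self):
        """Return the index of the first entry that fails verification, or None if intact."""
        prev = self.GENESIS
        for index, entry in enumerate(self._entries):
            if entry["prev"] != prev or _digest(prev, entry["record"]) != entry["hash"]:
                return index
            prev = entry["hash"]
        return None


log = AuditLog()
log.append({"alert_id": "A-1", "actor": "agent", "action": "close", "verdict": "benign"})
log.append({"alert_id": "A-1", "actor": "analyst:jdoe", "action": "read_audit_record"})
print(log.verify())  # None while the chain is intact
```

Because every hash depends on all earlier entries, editing or removing an old record changes the result of `verify()`, which flags the modification rather than preventing it.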

An agentic SOC built on weak detections and underpowered tooling runs into the same problems a non-agentic SOC has, plus a compounding LLM bill that rarely shows up in the original budget projection. The useful pre-deployment question is whether the detection foundation, operational guardrails, and ongoing measurement are in place, not whether the agent can handle the easy cases.
