MITRE ATT&CK Mapping Accuracy: How We Measure It

MITRE ATT&CK matrix visualization

ATT&CK technique mapping is often presented as a binary operation: a behavior either maps to T1059.001 (PowerShell) or it does not. In practice, the confidence of any given mapping varies significantly depending on available signal, and treating low-confidence mappings with the same weight as high-confidence ones produces misleading coverage claims that don't survive contact with real adversary behavior.

The Confidence Problem in ATT&CK Mapping

The MITRE ATT&CK framework documents 196 techniques and 411 sub-techniques across 14 tactics as of ATT&CK v13. Each technique has a defined set of observable behaviors — process events, network connections, registry modifications, file system changes — that serve as detection opportunities. The challenge is that many techniques share overlapping behavioral signatures, and the same observable event can legitimately map to multiple techniques at different confidence levels.

Consider a process event where cmd.exe spawns net.exe with arguments querying domain group membership. This behavior plausibly maps to T1069.002 (Permission Groups Discovery — Domain Groups), T1087.002 (Account Discovery — Domain Account), and, in some contexts, T1482 (Domain Trust Discovery). A detection rule that fires on this event cannot definitively declare which technique is occurring without additional context: what preceded the event, what network connections exist, what user account is executing the command.

Most commercial ATT&CK mapping tools output a single technique tag per event. This collapses the confidence distribution into a point estimate and discards the uncertainty information that would help an analyst determine how much weight to place on the mapping. ThreatPulsar's approach is different: we output a confidence-scored list of candidate techniques, ordered by posterior probability given the observable evidence.
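A minimal sketch of what a confidence-scored candidate list might look like in code. The class and field names here are illustrative, not ThreatPulsar's actual data model, and the confidence values are invented for the cmd.exe/net.exe example above:

```python
from dataclasses import dataclass

@dataclass
class CandidateTechnique:
    technique_id: str   # ATT&CK technique or sub-technique ID
    name: str
    confidence: float   # calibrated posterior probability in [0, 1]

def rank_candidates(candidates):
    """Order candidate techniques by descending confidence."""
    return sorted(candidates, key=lambda c: c.confidence, reverse=True)

# The cmd.exe -> net.exe domain-group query discussed above,
# with invented confidence values for illustration:
candidates = [
    CandidateTechnique("T1087.002", "Account Discovery: Domain Account", 0.44),
    CandidateTechnique("T1069.002", "Permission Groups Discovery: Domain Groups", 0.71),
    CandidateTechnique("T1482", "Domain Trust Discovery", 0.18),
]
ranked = rank_candidates(candidates)
```

The point of the list structure is that downstream consumers (SOAR playbooks, analyst UIs) see the full distribution rather than only the top-ranked technique.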

The Scoring Model

ThreatPulsar's ATT&CK mapping confidence score is computed using a weighted combination of four signal dimensions:

Behavioral specificity (weight: 0.40): How distinctive is the observed behavior as a signature for this technique versus alternative techniques? A process injecting into lsass.exe is highly specific to T1003.001 (LSASS Memory). A command prompt executing a script is far less specific, mapping to a broad range of execution techniques. Specificity scores are assigned per technique-behavior pair based on empirical analysis of the ATT&CK dataset and associated detection analytics.

Kill chain position coherence (weight: 0.25): Does this technique fit the expected position in the attack sequence given prior enriched events? If ThreatPulsar has already observed Initial Access and Execution techniques in the current alert cluster, mapping a new event to Lateral Movement is more coherent than mapping it to Reconnaissance. The coherence score is computed against the current session's tactic progression graph.

Threat actor association (weight: 0.20): Does any known threat actor cluster currently assessed as active in this sector use this technique in a documented campaign? When a threat actor profile — sourced from MITRE ATT&CK's Groups dataset and augmented with commercial threat intelligence feeds — assigns this technique a documented usage frequency above a threshold, the confidence score receives a positive adjustment. This dimension is weighted lower than behavioral specificity because threat actor attribution at the enrichment stage carries its own uncertainty.

Environmental baseline deviation (weight: 0.15): Is this behavior anomalous relative to the organization's historical baseline? A behavioral event that occurs hundreds of times daily in the environment and maps to a technique carries less diagnostic weight than the same event occurring for the first time. The baseline deviation score is computed against 30-day rolling activity profiles per endpoint, user account, and network segment.
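The weighted combination of the four dimensions can be sketched as a simple weighted sum. The dimension keys and example scores below are illustrative assumptions; the production model need not be a plain linear combination:

```python
# Weights for the four signal dimensions described above.
WEIGHTS = {
    "behavioral_specificity": 0.40,
    "kill_chain_coherence": 0.25,
    "threat_actor_association": 0.20,
    "baseline_deviation": 0.15,
}

def raw_confidence(signals: dict) -> float:
    """Weighted sum of per-dimension scores, each in [0, 1].

    Missing dimensions contribute 0, so absent context lowers,
    never raises, the raw score.
    """
    return sum(WEIGHTS[k] * signals.get(k, 0.0) for k in WEIGHTS)

# A highly specific behavior observed in a coherent kill-chain position:
score = raw_confidence({
    "behavioral_specificity": 0.95,
    "kill_chain_coherence": 0.80,
    "threat_actor_association": 0.50,
    "baseline_deviation": 0.90,
})
```

Note that this is the raw score; as the next section describes, raw scores still need calibration before they can be read as probabilities.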

Calibration Against Ground Truth

Confidence scores are useful only if they are calibrated — meaning that events scored at 70% confidence should, empirically, map to the correct technique approximately 70% of the time. Miscalibrated scores undermine the entire purpose of the scoring system: if 70% confidence actually means 40% accuracy, analysts will either over-trust or over-discount the output.

ThreatPulsar calibrates its scoring model against a labeled dataset of 47,000 technique-behavior mappings assembled from public ATT&CK threat reports, red team exercise documentation, and incident response case notes where ground truth technique assignment was independently verified. The calibration process uses isotonic regression to adjust raw model scores toward empirical accuracy, and the calibration curve is re-evaluated quarterly as new data is added to the ground truth dataset.
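Isotonic regression is available off the shelf (for example, scikit-learn's IsotonicRegression). The pure-Python pool-adjacent-violators sketch below shows the core idea, assuming binary correctness labels already sorted by ascending raw score:

```python
def pav_calibrate(labels):
    """Pool-adjacent-violators: fit a monotone non-decreasing map from
    raw-score rank to empirical accuracy.

    labels: 1 (mapping was correct) / 0 (incorrect), ordered by
    ascending raw model score.
    """
    if not labels:
        return []
    # Each block holds [running mean, weight (number of points pooled)].
    merged = [[float(labels[0]), 1.0]]
    for y in labels[1:]:
        merged.append([float(y), 1.0])
        # Pool adjacent blocks while monotonicity is violated.
        while len(merged) > 1 and merged[-2][0] > merged[-1][0]:
            m2, w2 = merged.pop()
            m1, w1 = merged.pop()
            w = w1 + w2
            merged.append([(m1 * w1 + m2 * w2) / w, w])
    # Expand block means back to per-point calibrated values.
    out = []
    for mean, weight in merged:
        out.extend([mean] * int(weight))
    return out
```

The output is the calibrated accuracy estimate for each point: wherever a higher raw score preceded a lower empirical outcome, the two are pooled into a shared average, which is what forces the calibration curve to be monotone.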

Current calibration data (as of Q3 2025): events scored above 80% confidence map to the correct ATT&CK technique in 84% of cases. Events scored between 60% and 80% confidence map correctly in 63% of cases. Events scored below 60% are flagged as "candidate match — review required" in the ThreatPulsar interface and are not used to automatically populate SOAR playbook technique fields.
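Per-bucket accuracy figures like these are straightforward to measure once mappings are labeled against ground truth. A toy sketch, with hypothetical event records and invented values:

```python
def bucket_accuracy(events, low, high):
    """Empirical accuracy of mappings whose confidence falls in [low, high)."""
    in_bucket = [e for e in events if low <= e["confidence"] < high]
    if not in_bucket:
        return None
    return sum(e["correct"] for e in in_bucket) / len(in_bucket)

# Toy labeled events: model confidence vs. verified technique assignment.
events = [
    {"confidence": 0.91, "correct": 1},
    {"confidence": 0.85, "correct": 1},
    {"confidence": 0.82, "correct": 0},
    {"confidence": 0.70, "correct": 1},
    {"confidence": 0.65, "correct": 0},
]
high_band = bucket_accuracy(events, 0.80, 1.01)   # the "above 80%" band
mid_band = bucket_accuracy(events, 0.60, 0.80)    # the "60-80%" band
```

Comparing each band's empirical accuracy to its nominal confidence range is exactly the calibration check described above.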

Where Automated Mapping Falls Short

There are four categories of ATT&CK techniques where automated mapping performs below the overall system average, and detection engineers should apply additional scrutiny to automated assignments in these categories:

Living-off-the-land techniques (T1218 and sub-techniques): LOLBin abuse detection is difficult to map accurately because the same signed Windows binary can be used for both legitimate administrative tasks and malicious execution. Confidence scores for T1218 sub-techniques (mshta, regsvr32, rundll32) average 0.61 in our calibration set, compared to a system-wide average of 0.73. These mappings should be treated as candidate hypotheses requiring analyst review rather than confirmed technique tags.

Techniques requiring cross-host correlation (T1021): Lateral movement over remote services often requires correlating events across two hosts: the source and the destination. When enrichment context covers only the local host, the mapping confidence for T1021 sub-techniques is necessarily lower because the outbound connection alone is not sufficient to distinguish legitimate remote access from adversarial lateral movement.

Cloud techniques (T1078.004, T1098): Cloud account manipulation and identity-based attacks are underrepresented in the ground truth dataset because detailed cloud incident reports with verified technique assignments are published less frequently than endpoint-focused incident reports. Mapping confidence for cloud techniques should be treated with appropriate uncertainty until the dataset grows.

Multi-stage payload delivery: When an attack sequence involves multiple stages — downloader, dropper, payload — the intermediate stages often produce behavioral events that could map to multiple techniques depending on which stage the analyst is looking at. Without full kill chain context, intermediate-stage technique assignments carry higher uncertainty than terminal-stage assignments.

Using Confidence Scores in Practice

The practical value of confidence-scored ATT&CK mappings becomes apparent when integrating with SIEM correlation rules and SOAR playbooks. A high-confidence technique assignment (above 80%) can trigger automated containment responses without analyst review — for example, automatically isolating an endpoint when T1003.001 (LSASS Memory) is observed with 85%+ confidence. A low-confidence assignment (below 60%) should queue for analyst review rather than triggering automated response, because the cost of acting on a false mapping in automation is higher than the cost of a brief analyst queue delay.
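The routing logic this implies can be sketched as a simple threshold function. The action names are hypothetical, and a real playbook integration would carry far more context than a bare score:

```python
def route_mapping(technique_id: str, confidence: float) -> str:
    """Route a technique assignment by calibrated confidence.

    Thresholds mirror the 80% / 60% bands described above:
    high-confidence mappings may drive automated response,
    mid-band mappings tag the event without acting on it, and
    low-confidence mappings queue for analyst review.
    """
    if confidence >= 0.80:
        return "automated_response"   # e.g. endpoint isolation playbook
    if confidence >= 0.60:
        return "auto_tag"             # populate technique field, no action
    return "analyst_review"           # candidate match -- review required
```

The asymmetry is deliberate: the cost of a wrong automated containment is high, so only the calibrated high-confidence band is allowed to act without a human in the loop.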

Detection engineers building Sigma rules and YARA signatures from ATT&CK mappings should filter to high-confidence technique assignments when building new detection logic. Rules generated from low-confidence mappings inherit the uncertainty of the underlying assignment and tend to produce broader, noisier detection signatures that require more tuning effort post-deployment.

For more on building detection signatures from enriched context, see our article on generating YARA rules from enriched IOC clusters.

Coverage Claims Are Relative to Detection Depth

A final note on how to interpret ATT&CK coverage metrics: "coverage" means different things depending on what you are counting. A SOC can claim coverage of T1059.001 (PowerShell) simply by having a detection rule that fires on any PowerShell execution. Whether that rule produces actionable, contextualized alerts that correctly identify malicious use is a separate question. ThreatPulsar reports ATT&CK coverage at three levels: rule presence, contextual enrichment availability, and automated response capability. The distinction matters because coverage at level one with no contextual enrichment is largely cosmetic — it satisfies compliance questionnaire checkboxes while contributing little to actual detection effectiveness.

The 94% figure ThreatPulsar cites for TTP mapping refers to the percentage of observed technique behaviors in our customer environments that receive a technique assignment with confidence above 60%. It does not mean 94% of all ATT&CK techniques are covered — the realistic coverage figure for any SOC environment is heavily dependent on log source availability and data collection configuration. What the 94% figure means is that when a behavior is observed and ingested, it almost never goes uncategorized.
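Read this way, the metric is a categorization rate rather than a coverage rate. A minimal sketch of the computation, with invented confidence values:

```python
def categorization_rate(confidences, threshold=0.60):
    """Fraction of observed behaviors that received a technique
    assignment with confidence above the threshold -- the sense in
    which the figure in the text is computed."""
    if not confidences:
        return 0.0
    return sum(c > threshold for c in confidences) / len(confidences)

# One confidence value per observed, ingested behavior (illustrative):
rate = categorization_rate([0.91, 0.72, 0.55, 0.88, 0.64])
```

The denominator is observed, ingested behaviors, not the full ATT&CK technique catalog, which is why the figure says nothing about log source gaps.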

Conclusion

ATT&CK mapping is more useful when it carries calibrated confidence information than when it produces single-point technique tags. The investment in building and maintaining a calibrated scoring model is justified by the downstream benefits: SOAR playbooks that automate with higher precision, detection rules with empirically grounded signal, and analysts who can triage based on technique-level summaries rather than raw event logs.

The accuracy floor for automated mapping is not a product limitation — it is a reflection of inherent ambiguity in behavioral signals at the technique level. Acknowledging that ambiguity and making it legible to analysts is more useful than hiding it behind a deterministic tag.
