
Why thresholds matter more than accuracy scores.


Discussions of AI performance typically begin with accuracy.


A model is described as “95% accurate” or as outperforming a benchmark. These statements are useful. They provide a high-level indication of model capability and allow for comparison across approaches.  


However, once an AI system is deployed into an operational environment, accuracy is no longer the primary determinant of outcomes. Consider pre-AI rule-based systems: highly accurate fraud detection rules still failed on edge cases, with significant downstream impacts. The lesson carries over from those comparatively controlled, binary environments: strong aggregate performance does not guarantee good outcomes at the margins.


What matters in practice are the thresholds that govern how the model is used.


Accuracy is an aggregate measure. It describes performance across a dataset, typically averaged over a wide range of scenarios. By contrast, real-world decisions are made individually, often under conditions of uncertainty, and frequently at the margins where model performance is less consistent.


Thresholds define how that uncertainty is handled.


They determine the point at which a model output is considered sufficient to trigger an action, when a case should be escalated for human review, and when a process should be halted. In effect, thresholds translate model outputs into operational decisions.
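To make this concrete, the sketch below (in Python) shows how a single confidence score might be routed into one of three operational paths. The threshold values and function names are illustrative assumptions, not taken from any particular system.

```python
# Illustrative only: threshold values and names are hypothetical,
# not drawn from any specific deployed system.

ACT_THRESHOLD = 0.90      # at or above this, act automatically
REVIEW_THRESHOLD = 0.60   # between the two, escalate for human review
                          # below REVIEW_THRESHOLD, halt the process

def route_decision(confidence: float) -> str:
    """Translate a model confidence score into an operational decision."""
    if confidence >= ACT_THRESHOLD:
        return "act"        # output is sufficient to trigger an action
    if confidence >= REVIEW_THRESHOLD:
        return "escalate"   # uncertain case: send to a human reviewer
    return "halt"           # insufficient confidence: stop the process

# The same model output leads to different operational behaviour
# depending entirely on where these lines are drawn.
print(route_decision(0.93))  # -> "act"
print(route_decision(0.72))  # -> "escalate"
print(route_decision(0.41))  # -> "halt"
```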


This distinction is often underappreciated.


Organisations invest significant effort in improving model performance, refining data, tuning parameters, and optimising metrics. However, a well-performing model can still produce poor or even harmful outcomes if the thresholds governing its use are not appropriately defined.


This is because accuracy can obscure variation. It does not fully reflect how a model performs across different subgroups, edge cases, or higher-risk scenarios. Thresholds, on the other hand, directly influence whether those cases are acted upon, escalated, or constrained. In short, well-defined thresholds enable prevention rather than a reactive response after the event.
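The following minimal example, using invented data, illustrates how a headline accuracy figure can mask weak performance in exactly the subgroup where the stakes are highest.

```python
# Hypothetical illustration: overall accuracy can hide weak performance
# in a specific subgroup. The records below are invented for the example.
from collections import defaultdict

records = [
    # (subgroup, model_was_correct)
    ("routine", True), ("routine", True), ("routine", True),
    ("routine", True), ("routine", True), ("routine", True),
    ("routine", True), ("routine", True),
    ("high_risk", True), ("high_risk", False),
]

overall = sum(correct for _, correct in records) / len(records)
print(f"overall accuracy: {overall:.0%}")  # 90%

by_group = defaultdict(list)
for group, correct in records:
    by_group[group].append(correct)

for group, results in by_group.items():
    print(f"{group}: {sum(results) / len(results):.0%}")
# routine: 100%, high_risk: 50% -- the aggregate figure obscures
# precisely the cases where thresholds and escalation matter most.
```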


As a result, thresholds sit at the centre of decision-making and accountability.


Each threshold reflects a set of choices about acceptable risk, required certainty, and the role of human oversight. These choices are not purely technical. They are shaped by organisational objectives, regulatory expectations, and the potential impact on individuals and outcomes.


In many organisations, these decisions are not made explicitly.


They may be embedded within model configurations, inherited from initial deployments, or adjusted informally over time. Without clear visibility and ownership, thresholds can become implicit assumptions rather than governed decisions.  This presents a challenge: translating established, often legacy, systems and human governance processes, developed over years, into an AI-enabled operating model.


This becomes problematic when scrutiny is applied.


Stakeholders, whether internal audit, regulators, or senior leadership, are less concerned with overall model accuracy and more focused on how specific decisions were made. In particular, they will seek to understand why a system was permitted to act at a given level of confidence, and whether that decision was appropriate in context.


This has led to a gradual shift in more mature organisations.


Rather than focusing solely on whether a model is performing well, there is increasing emphasis on whether decisions are being made appropriately. This requires thresholds to be defined deliberately, aligned to context, and reviewed as conditions change.


Conditions will change.  Data evolves, models drift, and organisational risk appetite is not static. Policy and regulatory expectations also continue to develop. A threshold that was appropriate at one point in time may become misaligned if it is not actively managed.
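As a sketch of what active management might involve, the check below compares live performance at a threshold against the level agreed when it was set, and flags the threshold for review when it drifts out of tolerance. The function name, metric, and tolerance are assumptions for illustration, not a prescribed control.

```python
# Sketch, assuming a periodic review job exists: compare recent
# performance at the current threshold against the agreed target,
# and flag for review when it drifts. Names and values are illustrative.

def needs_threshold_review(recent_precision: float,
                           target_precision: float,
                           tolerance: float = 0.02) -> bool:
    """Flag a threshold for review when live performance drifts
    beyond tolerance of what was agreed when the value was set."""
    return abs(recent_precision - target_precision) > tolerance

# A threshold agreed at 95% precision, now delivering 89% on recent data:
if needs_threshold_review(recent_precision=0.89, target_precision=0.95):
    print("Threshold misaligned with current conditions: escalate to owner.")
```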


This highlights a limitation in many traditional governance approaches.


Governance is often expressed through principles, policies, and documentation. While necessary, these mechanisms do not in themselves ensure that decision rules are applied consistently in live systems. Without operational visibility and control, organisations may find it difficult to demonstrate how decisions are governed in practice.


The challenge, therefore, is not one of intent but of implementation.


Effective AI governance requires the ability to define, monitor, and evidence the thresholds that drive real-world decisions. This includes understanding how thresholds are set, who is responsible for them, how they relate to risk, and how they perform over time.
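One way to picture this is as a data structure rather than a bare number. The sketch below is a hypothetical schema, not a prescribed one: each threshold carries its value alongside an owner, a risk rationale, and a review date, instead of living as an anonymous parameter in a configuration file.

```python
# A minimal sketch of what "governed" might look like in code: every field
# here is an illustrative assumption, not a required or standard schema.
from dataclasses import dataclass
from datetime import date

@dataclass
class GovernedThreshold:
    name: str            # e.g. "auto-approve confidence floor"
    value: float         # the operational threshold itself
    owner: str           # accountable individual or role
    risk_rationale: str  # why this value is acceptable in context
    set_on: date         # when the value was agreed
    review_by: date      # when it must next be reviewed

auto_approve = GovernedThreshold(
    name="auto-approve confidence floor",
    value=0.90,
    owner="Head of Claims Operations",
    risk_rationale="False approvals above 0.90 fall within agreed appetite.",
    set_on=date(2024, 1, 15),
    review_by=date(2024, 7, 15),
)
```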


This is precisely the context in which platforms such as RAITracker are designed to operate.


RAITracker focuses on embedding governance within operational workflows. Thresholds are made explicit, linked to risk and outcomes, and assigned clear ownership. Changes are recorded, rationale is captured, and performance is monitored continuously.


In this model, thresholds are not hidden parameters but governed decision rules.


This provides a more robust foundation for assurance. When questions arise about how a system has behaved, organisations are able to provide clear, evidence-based explanations of how decisions were made, rather than reconstructing them retrospectively.
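As an illustration of what such evidence might look like at the level of an individual decision, the sketch below records each automated action together with the confidence score and the version of the threshold in force at the time, so an explanation can be produced later rather than reconstructed. All field and function names are hypothetical.

```python
# Sketch of per-decision evidence capture. Field names are illustrative
# assumptions, not the schema of any particular platform.
import json
from datetime import datetime, timezone

def log_decision(case_id: str, confidence: float,
                 action: str, threshold_version: str) -> str:
    """Serialise the context of a single decision for later audit."""
    return json.dumps({
        "case_id": case_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "confidence": confidence,
        "action": action,
        "threshold_version": threshold_version,
    })

print(log_decision("case-0412", 0.93, "act", "auto-approve-v3"))
```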


Accuracy remains an important metric. It provides a necessary view of model performance.


However, it is thresholds that determine how that performance is translated into action.


For organisations seeking to operationalise responsible AI, the critical question is therefore not simply how accurate a model is, but how and where the line is drawn for decision-making and how that is governed over time.

 
 
 
