DBMT Articles for our Clients

How You Know AI is Working

Here is a conversation I have more than any other right now. An executive tells me their organization has embraced AI. Hundreds of people are using it. Tools have been deployed. There is genuine enthusiasm. And then, after a pause: "But we can't really point to what it's done for us."

AI Impact Governance is not a compliance framework or an IT policy. It is the discipline of ensuring that AI use — at every stage of maturity, from an individual experimenting with ChatGPT to a fully autonomous workflow running without human intervention — consistently produces outcomes worth producing, at a cost worth paying.

Consistency and quality are not the same thing. I've seen teams spend months refining their workflows within a tool — genuinely improving them, by the metrics the tool surfaces — before anyone asked whether the outputs were actually performing better against the business outcomes that matter. The tool was being governed. The outcomes weren't.

Mature AI Impact Governance at this stage is an operational discipline: continuously measuring workflow performance against quality and cost benchmarks, making deliberate decisions about what to retire and what to invest in further, and maintaining the visibility to see clearly what the systems are doing and what they are costing. This is not glamorous work. It is the work that determines whether your AI investment compounds in value or quietly compounds in waste and losses.

Beyond Hallucination: A Bigger LLM Accuracy Problem Is Impacting You Right Now

Proximal Reasoning is the danger coming from inside the house

By David Bernard, Managing Director, DBMT

When enterprise leaders talk about LLM accuracy problems, they are almost always talking about hallucination. This is understandable — hallucination is vivid, it has a name, and it produces the kind of visible failures that make it into case studies. An LLM invents a legal citation, fabricates a statistic, or confidently describes an event that never happened. The failure is legible, the cause is intuitive, and the remediation strategies — retrieval-augmented generation, grounding against authoritative sources, fact-checking layers — are at least partially understood.

What is not well understood is that hallucination is only one of two distinct LLM accuracy failure modes, and that the second one is both more frequent and less visible. It occurs not when a gap in the model's training data forces it to confabulate, but when the model is asked to perform a task that requires deterministic judgment — a classification, a structured decision, an evaluation against defined rules — and produces an output that looks exactly like correct, reasoned judgment without having performed any reasoning at all. The model has no rule engine, no logical evaluator, no mechanism for applying a criterion consistently and arriving at the same answer on identical inputs. What it has is a pattern-matching architecture sophisticated enough to have learned what deterministic reasoning looks like — and to reproduce its surface characteristics fluently, confidently, and variably. I've been calling this Proximal Reasoning: the production of outputs that occupy the neighborhood of deterministic judgment without being derived from it. It deserves the same level of attention hallucination receives — and in enterprise deployments where AI workflows touch structured decisions at scale, probably considerably more.

What an LLM hallucination actually is

When an LLM gives you a false or misleading statement, the likely cause is hallucination. LLMs are trained on a large body of text to learn language patterns, and the information carried in that text comes along as a byproduct. When a user asks a question, the model cannot separate the language patterns it has absorbed from the information embedded in them, and that information can be flawed. If the question touches contradictory information, or a subject on which no information exists at all, the model's primary objective of responding to the prompt compels it to construct an answer anyway. It has no idle state and no mechanism for saying "I don't have reliable information here" unless it has been specifically trained and prompted to do so. The result is a confabulated output delivered with the same fluency and confidence the model uses when it has solid training data to draw from.

The important implication of this is that hallucination is a conditional failure. It is triggered by a specific circumstance — a knowledge gap or contradiction in the training data — and while those gaps are common enough to matter, they are not present in every interaction. A model with strong, consistent training data on a subject, asked a question well within that domain, is unlikely to hallucinate. The risk is real but it is bounded by the content of the query and the coverage of the training data.

Retrieval-augmented generation addresses hallucination by grounding the model's answers in a curated external knowledge base that you compile and control, rather than relying solely on its parametric training data. This is a meaningful improvement: your sources are likely more authoritative, more current, and more directly relevant to your domain than the broad probabilistic compression of the internet that the model was trained on. But RAG changes what the model is drawing on, not how it draws on it. The model's fundamental inability to recognize the boundaries of its own knowledge remains intact. Wherever your curated sources contain gaps, outdated entries, or internal contradictions (and at any meaningful scale, they will), the same confabulation dynamic that produces hallucination in the base model will produce it in the RAG-augmented one. On top of this, despite valiant efforts, context management within RAG data stores is never perfect enough to eliminate hallucination entirely. You have narrowed and improved the knowledge surface. You are managing the risk, not eliminating it.

Proximal Reasoning — How LLMs introduce risk at each encounter

Proximal Reasoning is, in plain terms, fake reasoning. LLMs are not capable of reasoning — they are trained on enough examples of human reasoning to produce outputs that look indistinguishable from it. When you ask an LLM to perform a structured judgment task — evaluating pages of text against a fixed rubric, applying a defined set of criteria to a complex input, making a classification decision according to explicit rules — you are asking it to do something it is architecturally incapable of doing. It has no rule engine, no logical evaluator, no mechanism for applying a criterion consistently and arriving at the same answer on identical inputs. What it has is a sophisticated pattern-matching architecture trained to recognize what the output of that kind of reasoning looks like — and to produce it. Fluently. Confidently. And variably.

That last word is where the business risk lives. When an organization routes a structured judgment task through an LLM, it is not doing so experimentally — it is doing so because it needs that judgment to be accurate and repeatable. The same input evaluated today and evaluated next month should produce the same result. A compliance check, a content evaluation, a scoring decision — these are not tasks where approximate consistency is acceptable. They are tasks where inconsistency creates legal exposure, operational failure, and erosion of trust in the capability itself. The LLM will produce output that looks like it performed the judgment correctly. It will not perform it the same way twice. And the distinction from hallucination is fundamental rather than a matter of degree.

The property at stake here is determinism: the requirement that a process produce identical output every time it receives identical input, regardless of when it runs, how many times it has run before, or what else is happening in the system around it. A traditional rule-based compliance check is deterministic. Feed it the same transaction under the same rules today, next quarter, or three years from now, and it will return the same judgment every time, because the output is derived from the rule, not from a probabilistic approximation of what rule-application tends to look like. The rule doesn't drift. It doesn't have good days and bad days. It doesn't produce a slightly different interpretation of "exceeds threshold" depending on what other inputs it processed earlier in the session. It applies the criterion and returns the result: the same result, every time, until the rules themselves are deliberately changed. That is what businesses are expecting when they ask an LLM to evaluate content against a rubric or classify an input against defined criteria. It is not automatically what they are getting, and whether they are getting it depends entirely on architectural decisions made before the build began.
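To make the rule-based side of that comparison concrete, here is a minimal sketch of a deterministic threshold check. The rule name, the limit, and the verdict labels are all hypothetical, chosen only for illustration; the point is simply that the verdict is computed from the rule and the input and from nothing else.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class ThresholdRule:
    """A hypothetical compliance rule: flag any amount above a fixed limit."""
    rule_id: str
    limit: Decimal

    def evaluate(self, amount: Decimal) -> str:
        # The verdict depends only on the rule and the input. No sampling,
        # no session state, no dependence on earlier calls.
        return "EXCEEDS_THRESHOLD" if amount > self.limit else "WITHIN_THRESHOLD"


# Hypothetical rule and transaction amount, for illustration only.
rule = ThresholdRule(rule_id="TXN-LIMIT-001", limit=Decimal("10000.00"))

# Evaluate the same input 1,000 times and collect the distinct verdicts.
verdicts = {rule.evaluate(Decimal("10000.01")) for _ in range(1000)}
print(verdicts)  # a set containing the single verdict 'EXCEEDS_THRESHOLD'
```

A thousand evaluations of the same input collapse to one verdict, and they will keep doing so until someone deliberately changes the limit.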

The field has been circling this problem under several names, none of which quite land on the full concept. Non-deterministic outputs describes the symptom accurately but says nothing about the mechanism or when variability constitutes a governance failure versus an acceptable range. Stochastic judgment is technically precise — LLMs are probabilistic generative models, and their outputs are samples from a learned distribution over tokens — but the term belongs to the research literature rather than the enterprise conversations where the damage actually occurs. The closest formulation in the academic literature is epistemia, coined in a 2025 PNAS paper titled "The Simulation of Judgment in LLMs," defined as the illusion of knowledge emerging when plausibility replaces verification. That framing captures something genuinely important, but it is centered on the knowledge dimension — what the model appears to know — rather than the capability dimension — what the model appears to be able to do. Proximal Reasoning is the term I have settled on because it names the mechanism precisely: the output is proximal to correct deterministic judgment, close enough to be mistaken for it, without being derived from it.

Where Proximal Reasoning shows up — and what controls it

Consider two scenarios that enterprise organizations are deploying LLM-based capabilities to handle right now.

In content compliance review, a pharmaceutical or financial services organization needs to evaluate marketing materials against a regulatory rubric — a defined set of criteria that determines whether a piece of content is compliant, requires revision, or must be rejected. The criteria are explicit. The required output is a governed judgment that can be audited, defended to a regulator, and reproduced consistently across reviewers and over time. When this task is routed through an LLM, the model will produce an output that looks exactly like that governed judgment — a structured evaluation, criterion by criterion, with conclusions that read as authoritative. Run the same content through the same model tomorrow, or next week, and the evaluation will be different in ways that cannot be predicted or explained. Not dramatically different — proximal to the first result, in the same neighborhood — but different enough that the two evaluations cannot both be correct, and neither can be defended as the definitive governed judgment the regulatory context requires.

In candidate evaluation, an organization uses an LLM to assess applicant materials against a defined hiring rubric — scoring candidates on explicitly stated criteria to produce a ranked or tiered output. The appeal is obvious: consistency across a high volume of applications, free from the variability of individual human reviewers. The reality is the inverse. The LLM will score the same candidate differently across evaluations in ways that are invisible unless you specifically test for them — and most organizations don't. The output has the appearance of systematic, criteria-based assessment. It does not have the property that makes systematic assessment defensible: a governed, auditable, and substantially consistent output that holds to defined criteria until those criteria are deliberately changed.
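Testing for this kind of variability does not require anything elaborate. The sketch below shows one possible shape for a repeat-run consistency check: score the same input many times and measure how often the runs agree. The scorer here is a seeded random stand-in for an LLM call, not a real model, and the tier labels, weights, and keyword rule are assumptions made purely for illustration.

```python
import random
from collections import Counter
from typing import Callable


def agreement_rate(evaluate: Callable[[str], str], item: str, runs: int = 50) -> float:
    """Share of repeat runs that return the most common verdict for one input."""
    verdicts = Counter(evaluate(item) for _ in range(runs))
    return verdicts.most_common(1)[0][1] / runs


def make_simulated_llm_scorer(seed: int) -> Callable[[str], str]:
    """Stand-in for an LLM call (hypothetical): samples a tier from fixed weights."""
    rng = random.Random(seed)

    def score(_application: str) -> str:
        return rng.choices(["A", "B", "C"], weights=[0.7, 0.2, 0.1])[0]

    return score


def rule_based_scorer(application: str) -> str:
    """Deterministic comparison point: tier by a fixed, illustrative keyword rule."""
    return "A" if "certified" in application.lower() else "B"


application = "Certified analyst, 6 years experience"
simulated_rate = agreement_rate(make_simulated_llm_scorer(seed=7), application)
rule_rate = agreement_rate(rule_based_scorer, application)
print(f"simulated LLM agreement: {simulated_rate:.2f}")
print(f"rule-based agreement:    {rule_rate:.2f}")
```

In a real evaluation the stand-in would be replaced by the actual capability under test. The rule-based scorer agrees with itself on every run; anything below that on the model side is variability the organization should know about before it relies on the scores.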

In both cases the organization may not be getting what it believes it is getting — but whether it is or isn't comes down entirely to one question: was sufficient Harness Engineering built into the capability's architecture before it went into production? If the build included a properly designed Constraint Layer — output schemas that bound the model's evaluative scope, determinism requirements specified for each judgment type, and classification decisions banked as governed retrievable facts rather than left to model inference — the Proximal Reasoning risk is controlled. The capability can deliver the consistent, auditable, reproducible output the use case requires.
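As a rough illustration of two of those controls, the sketch below combines an output schema with a bank of governed decisions. It is not DBMT's implementation. The class names, verdict labels, and the injected classify function are hypothetical, and a production design would add review and approval before a verdict is treated as governed.

```python
import hashlib
import json
from enum import Enum
from typing import Callable, Dict


class Verdict(str, Enum):
    """Output schema: the only three values a verdict is allowed to take."""
    COMPLIANT = "compliant"
    REVISE = "requires_revision"
    REJECT = "rejected"


class DecisionBank:
    """Stores each verdict so repeat inputs are retrieved, not re-inferred."""

    def __init__(self, rubric_version: str, classify: Callable[[str], str]):
        self.rubric_version = rubric_version
        self._classify = classify  # hypothetical model call, injected by the caller
        self._bank: Dict[str, Verdict] = {}

    def _key(self, content: str) -> str:
        # Key on rubric version plus content, so a rubric change forces re-evaluation.
        payload = json.dumps({"rubric": self.rubric_version, "content": content})
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def decide(self, content: str) -> Verdict:
        key = self._key(content)
        if key not in self._bank:
            # Anything outside the schema raises ValueError instead of passing through.
            self._bank[key] = Verdict(self._classify(content))
        return self._bank[key]


# A deliberately inconsistent stand-in classifier, for illustration only.
answers = iter(["compliant", "requires_revision", "free-form essay"])
bank = DecisionBank("rubric-v1", lambda _content: next(answers))

first = bank.decide("Ad copy A")
second = bank.decide("Ad copy A")  # retrieved from the bank, classifier not called
print(first is second, first.value)
```

Banking a verdict makes it reproducible, not correct. The design choice worth noting is that the model is consulted at most once per input and rubric version, and its answer must fit the schema before it is stored.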

If it wasn't built in, the organization is getting Proximal Reasoning — outputs that occupy the neighborhood of deterministic judgment, presented with enough structural confidence that the gap between appearance and reality goes undetected until it surfaces as an audit finding, a legal challenge, or a pattern of inconsistency that no one can explain.

This is observable before the damage occurs. In an internally built capability, the presence or absence of Harness Engineering is visible in the build plan and the architectural specification — if you know what to look for. In a licensed tool, it is a specific set of questions a vendor evaluation should answer before a procurement decision is made. The build either incorporated these controls or it didn't. There is no middle ground, and there is no retrofitting them after deployment without significant redesign. For organizations that want to evaluate either a build or a licensed product against a structured methodology, we have documented the full architectural approach in Controlling for Proximal Reasoning in LLM Systems: A Framework for Harness Engineering.

Why the Proximal Reasoning problem is important across the Enterprise

Here is the comparison that I think changes how seriously organizations should take Proximal Reasoning relative to hallucination.

Hallucination is triggered by a knowledge gap. It is a conditional failure mode — present when the model's training data is inadequate to the task, absent or reduced when it isn't. It is a real problem, it causes real damage, and it deserves the attention it receives. But it does not occur on every interaction, in every workflow, at every LLM touchpoint.

Proximal Reasoning occurs every time an LLM is asked to perform a deterministic task. Not sometimes. Not under specific conditions. Every time — because the architectural gap between probabilistic generation and deterministic rule application is permanent and universal. Every workflow that routes a deterministic judgment through an LLM is experiencing Proximal Reasoning on every execution, whether anyone knows to look for it or not.

In a functional AI application — whether built internally or licensed from a vendor — there are typically multiple points where the LLM touches the workflow to produce an output. Each of those touchpoints is a Proximal Reasoning exposure. And the exposures compound: a workflow with five LLM touchpoints, each performing some form of structured judgment, is accumulating five independent sources of output variability in sequence. The final output of that workflow is the product of five probabilistic approximations of deterministic reasoning, and its reliability degrades with each additional touchpoint.
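A back-of-the-envelope calculation shows how quickly this compounds. The sketch below assumes each touchpoint reproduces its earlier output with the same probability and that touchpoints vary independently. Both are simplifying assumptions, and the 95 percent figure is illustrative rather than measured.

```python
def chain_consistency(per_touchpoint: float, touchpoints: int) -> float:
    """Probability that every touchpoint in a chain reproduces its earlier output,
    assuming touchpoints vary independently (an illustrative simplification)."""
    return per_touchpoint ** touchpoints


# Illustrative figure: each touchpoint repeats itself 95% of the time.
for n in range(1, 6):
    print(f"{n} touchpoint(s): {chain_consistency(0.95, n):.4f}")
```

Under these assumptions, five touchpoints that are each 95 percent consistent yield an end-to-end result that repeats only about 77 percent of the time.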

This applies equally to capabilities you build and capabilities you license. If you licensed the capability, the vendor made the architectural decisions on your behalf — including the decision of how many deterministic tasks to route through the LLM and how much scaffolding to build around them. You may be managing a Proximal Reasoning problem that was designed into the product before you ever saw it.

Proximal Reasoning is more Controllable than Hallucination

This is the point that I think is most underappreciated, and most useful, about the relationship between these two failure modes.

Hallucination is the failure mode that has received nearly all of the attention — and it is the harder one to control. RAG helps, but as established above, external knowledge sources carry their own gaps and contradictions, and context management within RAG data stores is never perfect enough to eliminate hallucination entirely. Better models help at the margins. But the fundamental cause — the model's inability to recognize the boundaries of its own knowledge — is not something that external architecture fully resolves. You are managing the risk, not eliminating it.

Proximal Reasoning, despite receiving almost no attention under any name, is more controllable — provided you know to address it during the build. The reason is that the response to Proximal Reasoning does not require changing what the model knows or how it generates tokens. It requires changing what the model is asked to do. The architecture surrounding the LLM — not the model itself — is where Proximal Reasoning is governed, and that architecture is entirely within the control of the team building or procuring the capability.

The Response: Harness Engineering, Constraint Layer, and Output Quality Standards

The engineering community has begun using the term Harness Engineering to describe the discipline of building external architectural scaffolding around LLMs to govern their behavior in production systems. The concept is sound and the term is apt. DBMT's Constraint Layer is our implementation of Harness Engineering specifically designed for enterprise AI workflow governance — the set of architectural controls built around every LLM touchpoint in a capability to reduce the scope of what the model is asked to judge, replace generative outputs with evaluative confirmation wherever a pre-built artifact can be assessed rather than constructed, and bank classifiable decisions as governed retrievable facts rather than leaving them to model inference at runtime.

Harness Engineering through a Constraint Layer addresses Proximal Reasoning at the source — by removing deterministic judgment tasks from the model's operating envelope wherever possible, and constraining the ones that remain to the narrowest possible scope.

But architecture alone is not sufficient without standards that define what the architecture is required to achieve. Output Quality Standards specify, for every LLM touchpoint in a capability, the determinism requirements that apply to that output type — what level of consistency is required, how variance is measured, what constitutes a governed versus an ungoverned output. Without these standards, Harness Engineering has no target to engineer toward, and the Constraint Layer has no specification to be tested against.
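One way to picture an Output Quality Standard is as a small, testable specification attached to each touchpoint. The sketch below is an assumption about what such a record might look like, not a description of DBMT's format; the field names, the agreement metric, and the thresholds are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class OutputQualityStandard:
    """Hypothetical per-touchpoint standard: how consistent repeat runs must be."""
    touchpoint: str
    output_type: str
    min_agreement: float  # required share of repeat runs matching the most common output


def measured_agreement(outputs: List[str]) -> float:
    """Variance measure used here: share of runs equal to the most common output."""
    return Counter(outputs).most_common(1)[0][1] / len(outputs)


def is_governed(standard: OutputQualityStandard, repeat_outputs: List[str]) -> bool:
    """An output type is governed only if repeat runs meet the stated requirement."""
    return measured_agreement(repeat_outputs) >= standard.min_agreement


# A compliance verdict is required to be fully consistent across repeat runs.
standard = OutputQualityStandard("compliance_verdict", "classification", min_agreement=1.0)

print(is_governed(standard, ["compliant"] * 20))                          # True
print(is_governed(standard, ["compliant"] * 19 + ["requires_revision"]))  # False
```

Written this way, the standard gives the build something concrete to be tested against: a single disagreeing run out of twenty is enough to mark a compliance verdict as ungoverned.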

And standards without actualization are shelfware. The most important and most commonly missing element in enterprise AI governance is not the document that defines the standards — it is the ongoing human function that carries those standards into every build decision and every procurement evaluation, translating architectural requirements into specific engineering choices at the moment those choices are being made, and keeping the standards current as capabilities evolve and the business they serve changes. A standard that exists only as a ratified document will not survive contact with the first sprint that runs behind schedule. The stewardship function is what makes the difference between governance that holds and governance that erodes.

The three work together: Harness Engineering is the discipline, Output Quality Standards define what it must achieve, and an active stewardship function ensures neither becomes a document rather than a practice.

What this means if you are building or buying AI capabilities now

Every AI capability your organization is currently building is making decisions — right now, in the current sprint — about how many deterministic tasks to route through LLM touchpoints, and how much Harness Engineering to build around them. Those decisions determine whether the capability's outputs can be governed, whether its judgments can be reproduced and audited, and whether the humans depending on it can trust its consistency at scale. Proximal Reasoning is not a risk that emerges after deployment. It is designed in or designed out before the first line of code is written.

For organizations licensing AI capabilities from third-party vendors, the same architectural reality applies — with one critical difference. The vendor made the build decisions on your behalf, and those decisions are now embedded in a product you did not design and cannot directly modify. This does not mean the risk cannot be evaluated. It means the evaluation has to happen before the procurement decision, not after it.

A rigorous vendor evaluation for Proximal Reasoning risk goes well beyond asking whether the product performs well on a demo or scores well on a benchmark. It requires a structured assessment of the Harness Engineering actually built into the product — layer by layer, touchpoint by touchpoint. How has the vendor constrained the scope of what the LLM is asked to judge at each point in the workflow? What output schemas are in place? What determinism requirements were specified, and for which output types? What classifiable decisions have been removed from model inference and banked as governed retrievable facts? What evidence exists that these controls were tested against consistency requirements rather than functional requirements alone? These are not hostile questions. They are the questions any organization should be able to answer about a capability it built. A vendor that cannot answer them specifically and with supporting architectural documentation has not built sufficient Harness Engineering into the product — and that is a known, assessable procurement risk.

We have developed a structured evaluation framework for assessing the depth of Harness Engineering in third-party AI products, organized by layer and designed to produce a clear picture of where Proximal Reasoning risk is controlled and where it remains unaddressed. Controlling for Proximal Reasoning in LLM Systems: A Framework for Harness Engineering is available for download for organizations that are currently in a vendor evaluation or approaching one.

Hallucination is a real problem worth managing. But Proximal Reasoning is the accuracy failure mode that is present in every workflow, at every touchpoint, on every execution — and it is the one that most organizations are not currently evaluating for, either in their own builds or in the tools they are procuring, because they don't yet have a name for it.

Naming it is the first step toward addressing it deliberately, rather than discovering its consequences in production and managing them after the fact.
