Epistemic Engine
Right now we are running four language models inside a single controlled loop. Two models generate arguments. One model audits structure and epistemic integrity. One model narrates the evolving arc of the exchange. Only the human operator sees the entire system at once. None of the models have full visibility into the whole.
That asymmetry is intentional.
We are not building a chatbot. We are building a multi-model reasoning engine. It is a layered orchestration architecture that separates generation, evaluation, interpretation, and governance into distinct computational roles. The separation is not cosmetic. It is structural. Without it, synthetic reasoning collapses into a single stream that generates, validates, and concludes in one motion. That design produces coherence quickly. It also produces drift.
Most AI systems compress cognition into one channel. A single model speaks, critiques itself, and resolves tension internally. It may simulate disagreement, but structurally it remains a monologue. What we are building instead is closer to a constrained ecosystem. DeepSeek V3 and Mistral operate adversarially in a generative layer. A separate DeepSeek R1 instance functions as an auditor. A Llama instance acts as narrator. The human layer remains outside the loop, with the ability to intervene but not to automate steering.
The debaters do not see the auditor. The auditor does not see the narrator. The narrator sees only a sliding window of recent turns. The human sees everything. This is not redundancy. It is compartmentalization. We are isolating cognitive functions the way complex systems isolate subsystems. Generation is one function. Monitoring is another. Interpretation is another. Governance is separate still. When these collapse into a single stream, synthetic reasoning becomes performative. It converges too quickly. It stacks citations. It becomes fluent without being grounded.
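The visibility rules above can be sketched as a small access policy. This is a hedged illustration, not the actual implementation: the role names, the window size, and the `view_for` function are all assumptions made for the sake of the example.

```python
# Hypothetical sketch of the compartmentalization described above.
# Role names and NARRATOR_WINDOW are illustrative, not the real config.
from dataclasses import dataclass

@dataclass
class Turn:
    role: str   # "debater_a", "debater_b", "auditor", "narrator"
    text: str

# Which roles each participant is allowed to read.
VISIBILITY = {
    "debater_a": {"debater_a", "debater_b"},  # debaters see only each other
    "debater_b": {"debater_a", "debater_b"},
    "auditor":   {"debater_a", "debater_b"},  # audits generation, unseen by it
    "narrator":  {"debater_a", "debater_b"},  # restricted further by the window
    "human":     {"debater_a", "debater_b", "auditor", "narrator"},  # sees all
}

NARRATOR_WINDOW = 6  # narrator sees only a sliding window of recent turns

def view_for(role: str, transcript: list[Turn]) -> list[Turn]:
    """Return the slice of the transcript a given role may read."""
    visible = [t for t in transcript if t.role in VISIBILITY[role]]
    if role == "narrator":
        return visible[-NARRATOR_WINDOW:]
    return visible
```

The point of the sketch is that isolation is enforced by the orchestration layer, not negotiated by the models themselves.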
In earlier experiments we saw this clearly. Without instrumentation, models engage in what I would call citation escalation: a race toward increasingly specific percentages, journal names, effect sizes, and historical claims that cannot be verified inside the loop. The dialogue looks rigorous. It feels empirical. But the structure becomes brittle because no component is responsible for interrogating epistemic load.
So we built that component.
The first iteration of the auditor measured shape. It tracked convergence scores, move classifications, agreement loops, deference patterns, and stagnation. It could tell us whether a debate had tension. It could not tell us whether the tension meant anything.
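To make the structural layer concrete, here is a minimal sketch of two of the metrics named above. The real auditor is a model; this toy version uses lexical overlap as a stand-in convergence score and a simple threshold as a stagnation flag, both of which are assumptions for illustration.

```python
# Illustrative structural metrics: a crude convergence score (word-set
# overlap between successive turns) and a stagnation flag derived from it.
def convergence(a: str, b: str) -> float:
    """Jaccard overlap of word sets between two turns, in [0, 1]."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)

def structural_audit(turns: list[str], stagnation_threshold: float = 0.6) -> dict:
    """Score pairwise convergence and flag stagnation in the recent turns."""
    scores = [convergence(turns[i], turns[i + 1]) for i in range(len(turns) - 1)]
    return {
        "convergence_scores": scores,
        # Stagnating if the last two transitions are both highly convergent.
        "stagnating": bool(scores) and all(s >= stagnation_threshold for s in scores[-2:]),
    }
```

Metrics this shallow can only describe shape, which is exactly the limitation the next paragraphs address.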
We extended it.
The auditor now produces an epistemic layer alongside structural metrics. It assigns an epistemic risk score. It evaluates claim density. It assesses whether uncertainty signals are present, absent, or merely performative. It flags fabrication risk when models cite specific statistics or studies that cannot be verified within the dialogue context. Most importantly, it extracts and ranks load-bearing claims.
Not every questionable sentence is surfaced. Only the claims that the argument depends on. Each flagged claim receives a weight based on two dimensions: structural load and severity of epistemic concern. A claim that underwrites the entire trajectory of a debate carries more weight than a peripheral illustration. Across audit cycles, these claims are merged into session anchors so that early premises do not disappear simply because the conversation has moved on.
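A sketch of the weighting and anchor-merging logic, under stated assumptions: the field names mirror the two dimensions described above, but the scoring rule (their product) and the merge policy (keep the higher-weight version of a repeated claim) are illustrative choices, not the actual rubric.

```python
# Hedged sketch of claim weighting and session-anchor persistence.
from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    structural_load: float  # 0..1: how much the argument depends on it
    severity: float         # 0..1: seriousness of the epistemic concern

    @property
    def weight(self) -> float:
        # Assumed combination rule: weight rises with both dimensions.
        return self.structural_load * self.severity

def rank_claims(claims: list[Claim], top_k: int = 3) -> list[Claim]:
    """Surface only the highest-weight, load-bearing claims."""
    return sorted(claims, key=lambda c: c.weight, reverse=True)[:top_k]

def merge_anchors(anchors: dict[str, Claim], new: list[Claim]) -> dict[str, Claim]:
    """Fold this cycle's claims into session anchors so early premises persist."""
    for c in new:
        prior = anchors.get(c.text)
        if prior is None or c.weight > prior.weight:
            anchors[c.text] = c
    return anchors
```

The merge step is what keeps a foundational premise in view even after the conversation has moved on.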
This changes the character of the system. The engine does not assume truth. It measures epistemic stress. It does not eliminate drift. It surfaces it. We optimize for adversarial friction rather than for fluency. In debate mode, convergence is undesirable. Agreement loops trigger scrutiny. High claim density without adequate uncertainty raises risk. Epistemic anchors persist across the entire session so that foundational premises must survive repeated evaluation. Structural claims cannot quietly evaporate under the pressure of new rhetoric.
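One of the rules above, that high claim density without adequate uncertainty raises risk, can be shown as a toy scoring function. The marker list and the damping factor are assumptions invented for this sketch; the real auditor evaluates uncertainty with a model, not a word list.

```python
# Minimal illustration of one risk rule: claim density that is not offset
# by hedging language raises the epistemic risk score.
UNCERTAINTY_MARKERS = {"may", "might", "possibly", "uncertain", "estimate", "roughly"}

def epistemic_risk(num_claims: int, num_sentences: int, text: str) -> float:
    """Return a 0..1 risk score: claim density, damped if hedging is present."""
    if num_sentences == 0:
        return 0.0
    density = min(num_claims / num_sentences, 1.0)
    hedged = bool(set(text.lower().split()) & UNCERTAINTY_MARKERS)
    return density * (0.5 if hedged else 1.0)  # assumed damping factor
```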
This architecture is not tied to debate as a genre. The pattern is modular. There is a generation layer, an evaluation layer, an interpretation layer, and a governance layer. The models occupying those roles can change. The intent mode can change. The domain can change. The orchestration pattern remains intact. This is one reason we are building with multiple models, including open source systems. The goal is not dependency on a single proprietary endpoint. The goal is architectural literacy. We are learning how to coordinate heterogeneous models across clearly defined roles. The models are interchangeable. The orchestration is the asset.
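The claim that models are interchangeable while orchestration is the asset can be made concrete as a role registry. The model identifiers below are placeholders for whichever endpoints occupy each role; swapping one changes the registry, not the architecture.

```python
# The four-layer pattern as a role registry. Model names are placeholders.
ROLES = {
    "generation":     ["deepseek-v3", "mistral"],  # adversarial debaters
    "evaluation":     ["deepseek-r1"],             # structural + epistemic auditor
    "interpretation": ["llama"],                   # narrator
    "governance":     ["human"],                   # outside the loop
}

def swap_model(roles: dict, layer: str, old: str, new: str) -> dict:
    """Replace one occupant of a layer without touching the other layers."""
    roles[layer] = [new if m == old else m for m in roles[layer]]
    return roles
```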
There is a parallel effort unfolding inside Discriminology Plus. There, we are formalizing a measurement schema layer. Constructs are encoded. Item-to-construct mappings are defined. Scoring logic and transformation rules are specified. The intelligence layer operates only on structured, validated outputs. Interpretation is bounded by schema.
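A hedged sketch of what such a schema layer could look like. The construct name, item identifiers, scale, and mean-scoring rule are all invented for illustration; the point is that the intelligence layer only ever sees output that has passed validation against the schema.

```python
# Illustrative measurement schema: constructs, item-to-construct mappings,
# a scale bound used for validation, and a simple scoring rule.
SCHEMA = {
    "constructs": {
        "belonging": {"items": ["q1", "q4"], "scale": (1, 5)},  # hypothetical
    },
}

def score(construct: str, responses: dict[str, int]) -> float:
    """Validate responses against the schema, then apply the scoring rule."""
    spec = SCHEMA["constructs"][construct]
    lo, hi = spec["scale"]
    values = []
    for item in spec["items"]:
        v = responses[item]
        if not lo <= v <= hi:
            raise ValueError(f"{item}={v} outside scale {lo}-{hi}")
        values.append(v)
    return sum(values) / len(values)  # simple mean; real transformation rules vary
```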
But interpretation without governance drifts. If models are allowed to reason over structured measurement outputs, simulate trade-offs, or generate scenario analyses, they must operate within epistemically constrained environments. The schema constrains meaning. The reasoning engine constrains epistemic behavior. Together they form a measurement intelligence system that is less fragile than a free-form AI overlay.
There is also a longer horizon.
When we deploy local models on private clusters, we lose the implicit guardrails of centralized API providers. If we build agentic workflows or autonomous systems that plan and execute actions, we amplify the risk of unobserved drift. Before allowing local agents to operate at scale, we need to understand how to observe their reasoning, instrument their claims, and intervene without collapsing autonomy.
The work we are doing here is rehearsal for that future. Separate roles. Measure drift. Preserve anchors. Maintain human override. At the moment, the engine runs four coordinated models. It performs periodic structural and epistemic audits. It extracts and ranks load-bearing claims. It persists anchors across session windows. It exposes reasoning traces to the human layer. It maintains strict role isolation. It does not browse the web. It does not externally fact check. It does not autonomously intervene. It does not claim to produce truth.
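The override discipline described above reduces to a simple pattern: the loop gates on the human layer whenever the audit crosses a risk threshold, and the engine never intervenes on its own. The function names and threshold here are illustrative assumptions.

```python
# Sketch of the human-override gate. Hooks and threshold are hypothetical.
def run_cycle(generate, audit, human_review, risk_threshold: float = 0.7):
    """One orchestration cycle: generate a turn, audit it, gate on the human."""
    turn = generate()
    report = audit(turn)
    if report["risk"] >= risk_threshold:
        # The human decides what happens next; the engine never self-intervenes.
        return human_review(turn, report)
    return turn
```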
It instruments reasoning.
In an environment where synthetic information scales without cost, truth becomes structurally expensive. This engine is an attempt to reintroduce cost into the reasoning process. Not moral cost, but structural cost. Claims must carry weight. Assertions must survive scrutiny. Coherence must withstand friction.
This is not the final system. It is a foundation. The larger architecture it will eventually inhabit is still evolving. But the pattern is becoming clear. Separate roles. Measure epistemic load. Preserve tension. Scale cautiously.