LLM hallucinations are not going away. Frontier model error rates have improved enormously since 2023, but production deployments still need active hallucination detection because the cost of a confidently wrong answer in a customer-facing context is high. Here is what is actually working in 2026, ranked by where each technique earns its spot in the stack.
Why "just check it" fails
The naive approach, asking the model whether its previous answer is correct, sounds appealing and rarely helps. The model that produced the wrong answer brings the same blind spots to the review, so same-model self-evaluation tends to confidently approve its own errors. Useful hallucination detection requires either external grounding or asymmetric sampling (a different model, or several samples compared against each other).
The four practical techniques
1. Grounding verification (highest leverage in RAG systems)
When the model is supposed to answer based on retrieved documents, you can check whether each claim in the answer is supported by the retrieved context. Approaches:
Span-level entailment. For each sentence in the answer, find the supporting span in the context. Score with an NLI model or an LLM-as-judge prompt. Flag unsupported claims.
Citation enforcement. Require the model to cite source IDs alongside every claim. Validate that the cited source actually contains the claim.
Attribution scoring. Run an LLM-as-judge to grade attribution quality, with calibration against human-graded examples.
This works because the ground truth (the retrieved context) is right there. Per recent surveys, grounding-verification approaches consistently outperform self-evaluation in production RAG systems.
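As a rough illustration of citation enforcement combined with a span-level support check, the sketch below assumes answers carry inline markers such as [doc1] and that retrieved sources arrive as a dict keyed by ID. Those conventions and the names overlap_score and unsupported_claims are hypothetical. The word-overlap scorer is only a crude stand-in for an NLI model or LLM-as-judge call, which you would pass in through the scorer parameter.

```python
import re
from typing import Callable, Dict, List

CITATION = re.compile(r"\[(\w+)\]")  # assumed inline citation format: [doc1]


def overlap_score(claim: str, source: str) -> float:
    """Stand-in scorer: share of the claim's words that appear in the source.

    This is NOT entailment. Replace it with an NLI model or LLM judge.
    """
    claim_words = set(re.findall(r"[a-z0-9]+", claim.lower()))
    source_words = set(re.findall(r"[a-z0-9]+", source.lower()))
    if not claim_words:
        return 0.0
    return len(claim_words & source_words) / len(claim_words)


def unsupported_claims(
    answer: str,
    sources: Dict[str, str],
    scorer: Callable[[str, str], float] = overlap_score,
    threshold: float = 0.75,  # illustrative value; tune against graded examples
) -> List[str]:
    """Return sentences with no citation, an unknown citation, or weak support."""
    flagged = []
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    for sentence in sentences:
        cited_ids = CITATION.findall(sentence)
        claim = CITATION.sub("", sentence).strip()
        cited_texts = [sources[i] for i in cited_ids if i in sources]
        # Flag when nothing valid is cited or no cited source supports the claim.
        if not cited_texts or max(scorer(claim, t) for t in cited_texts) < threshold:
            flagged.append(sentence)
    return flagged
```

Word overlap will miss negation, paraphrase, and swapped numbers, so treat the default scorer as a placeholder that keeps the example self-contained rather than as a usable entailment check.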
2. Self-consistency across samples
For open-ended factual queries where the correct answer should be stable, sample the model multiple times at non-zero temperature and check whether the answers agree. Strong agreement is correlated with truthfulness; high variance flags potential hallucination.
This is computationally expensive (3-5x the inference cost) but is one of the few techniques that helps when you don't have ground truth. Best for high-stakes queries where latency tolerance is moderate. Not appropriate for streaming UIs.
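A minimal sketch of the agreement check, assuming short factual answers that can be compared after light normalization. The sample callable is a hypothetical wrapper around your model call at non-zero temperature, and the 0.8 agreement cutoff is an illustrative value, not a recommendation.

```python
from collections import Counter
from typing import Callable, Tuple


def normalize(answer: str) -> str:
    """Lowercase, trim, drop a trailing period, and collapse whitespace."""
    return " ".join(answer.lower().strip().rstrip(".").split())


def self_consistency(sample: Callable[[], str], n: int = 5) -> Tuple[str, float]:
    """Draw n samples and return the most common answer with its vote share."""
    answers = [normalize(sample()) for _ in range(n)]
    top_answer, votes = Counter(answers).most_common(1)[0]
    return top_answer, votes / n


def needs_review(agreement: float, min_agreement: float = 0.8) -> bool:
    """Flag the query when the samples disagree too much."""
    return agreement < min_agreement
```

Exact-match voting only suits short answers such as names, dates, or numbers. Longer free-form answers would need a semantic comparison (for example an entailment or judge step) in place of normalize.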
3. Classifier-based detection
A separate model — sometimes a small fine-tuned classifier, sometimes an LLM-as-judge with a structured rubric — scores the output for hallucination markers:
Numeric or named-entity claims without grounding
Hedging language that conflicts with confident assertions elsewhere in the answer
Self-contradiction within the answer
Output that exceeds what the input could support
Classifiers work as a triage layer — flag suspicious outputs for further checks rather than rejecting all of them.
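The first marker in the list, numeric claims without grounding, is simple enough to approximate with a rule. The sketch below is one possible cheap triage check, not a trained classifier: it flags numbers in the answer that never appear in the supplied context. The function names and the number pattern are assumptions made for illustration.

```python
import re
from typing import Dict, List

# Matches integers, decimals, thousands-separated numbers, and percentages.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*%?")


def ungrounded_numbers(answer: str, context: str) -> List[str]:
    """Return numbers that appear in the answer but nowhere in the context."""
    context_numbers = set(NUMBER.findall(context))
    return [n for n in NUMBER.findall(answer) if n not in context_numbers]


def triage(answer: str, context: str) -> Dict[str, object]:
    """Mark the output suspicious so later layers can take a closer look."""
    missing = ungrounded_numbers(answer, context)
    return {"suspicious": bool(missing), "ungrounded_numbers": missing}
```

Because matching is literal, "12%" in the answer will not match "12 percent" in the context. That is acceptable for a triage layer whose job is to route outputs to further checks, but it would produce too many false positives as a rejection rule.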
4. External fact-checking
For specific high-leverage claim types — dates, prices, names, numerical statistics — run a separate retrieval against an authoritative source and check the claim. Most useful for narrow domains: medical drug doses, legal citations, financial figures. Building it for "general fact" is hard; building it for a specific domain is tractable.
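To show why a narrow domain is tractable, here is a sketch for a single claim type: product prices checked against an authoritative catalog. The SKU format, the catalog dict, and the claim pattern are all hypothetical; a real system would query the system of record and use a more robust claim extractor.

```python
import re
from decimal import Decimal
from typing import Dict, List, Optional, Tuple

# Assumed claim shape: "SKU-123 ... $19.99" with no other digits in between.
PRICE_CLAIM = re.compile(r"(SKU-\d+)\D*?\$(\d+(?:\.\d{2})?)")


def check_prices(
    answer: str, catalog: Dict[str, Decimal]
) -> List[Tuple[str, str, Optional[Decimal]]]:
    """Return (sku, claimed_price, catalog_price) for every claim that fails.

    catalog_price is None when the SKU is missing from the catalog.
    """
    mismatches = []
    for sku, claimed in PRICE_CLAIM.findall(answer):
        actual = catalog.get(sku)
        if actual is None or Decimal(claimed) != actual:
            mismatches.append((sku, claimed, actual))
    return mismatches
```

The same shape (extract a typed claim, look it up, compare exactly) carries over to other narrow claim types. What changes per domain is the extractor and the authoritative source.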
What does NOT reliably work
A few much-discussed techniques that disappoint in production:
Asking the same model "are you sure?" The model's confidence is poorly calibrated; this rarely surfaces real hallucinations.
Token-level log-prob thresholding alone. Hallucinations don't reliably correlate with low log-probs in modern instruction-tuned models. Useful as a feature in a classifier, not as a standalone signal.
Generic "fact-check this with search." Search is itself unreliable; without source ranking and grounding verification, you are layering hallucinations.
Single-pass LLM-as-judge with a vague rubric. Rubric quality dominates results; without calibration against human grades, the judge reflects the same biases as the model under test.
Stacking techniques
Production hallucination detection in 2026 is rarely one technique — it is a pipeline:
Confidence triage. Cheap classifier or rule-based check flags suspicious outputs.
Grounding verification. For RAG outputs, span-level entailment against retrieved context.
Self-consistency for the hard cases. Multi-sample the outputs that pass triage when the stakes are high.
External fact-check for known claim types. Drug names, dollar amounts, legal cites, etc.
Human review for the residual. Whatever survives the pipeline goes to a human if the stakes warrant it.
Each layer adds latency and cost. The art is choosing where on the curve to sit:
Streaming chat UIs: triage + grounding only; no self-consistency in-line. Run heavier checks asynchronously and surface warnings post-hoc.
Batch document processing: layer in self-consistency and external fact-check; latency is not a constraint.
High-stakes single-answer (medical, legal, financial): full pipeline + human review.
Low-stakes assistive (search summaries, suggestions): triage only; rely on the user to notice.
The choice should follow the harm model. Don't over-engineer for low-stakes outputs; don't under-engineer for high-stakes ones.
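One way to make these trade-offs explicit is to encode them as configuration. The sketch below maps the four deployment profiles above to the layers they run. The profile and layer names are hypothetical, and two details are assumptions: which heavier checks a streaming UI defers, and that batch and high-stakes profiles keep triage and grounding as their first layers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# A check returns True when its layer flags the output as suspicious.
Check = Callable[[str], bool]


@dataclass(frozen=True)
class Profile:
    inline: Tuple[str, ...]          # layers run before the answer is shown
    deferred: Tuple[str, ...] = ()   # heavier layers run asynchronously
    human_review: bool = False       # send the residual to a person


FULL = ("triage", "grounding", "self_consistency", "external_fact_check")

PROFILES: Dict[str, Profile] = {
    "streaming_chat": Profile(
        inline=("triage", "grounding"),
        deferred=("self_consistency", "external_fact_check"),
    ),
    "batch_documents": Profile(inline=FULL),
    "high_stakes": Profile(inline=FULL, human_review=True),
    "low_stakes_assist": Profile(inline=("triage",)),
}


def run_inline(profile_name: str, output: str, checks: Dict[str, Check]) -> List[str]:
    """Run the profile's inline layers in order; return the layers that flagged."""
    profile = PROFILES[profile_name]
    return [name for name in profile.inline if checks[name](output)]
```

Keeping the mapping in data rather than in branching code makes it easy to review against the harm model and to change when a product moves from one risk tier to another.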
Measuring detection quality
Detection systems themselves need evals. Build a graded test set of known-hallucinated outputs and known-correct outputs. Measure:
True-positive rate (caught hallucinations)
False-positive rate (flagged correct answers as hallucinations)
Precision-recall curves at different thresholds
Slice analysis by claim type, domain, and output length
A detection system with 95% recall and 60% precision can be the better choice over one with 80% recall and 90% precision when a missed hallucination costs much more than a false positive. When false positives are the expensive error, the ranking flips.
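The comparison can be made concrete with a small expected-cost calculation. The sketch below computes precision and recall from a graded test set and then prices two operating points. The function names and cost units are illustrative assumptions; the false-alarm count is derived from the identity precision = TP / (TP + FP).

```python
from typing import Sequence, Tuple


def precision_recall(
    is_hallucination: Sequence[bool], flagged: Sequence[bool]
) -> Tuple[float, float]:
    """Compute detector precision and recall from graded examples."""
    pairs = list(zip(is_hallucination, flagged))
    tp = sum(1 for h, f in pairs if h and f)
    fp = sum(1 for h, f in pairs if not h and f)
    fn = sum(1 for h, f in pairs if h and not f)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def expected_cost(
    recall: float,
    precision: float,
    n_hallucinated: int,
    cost_miss: float,
    cost_false_alarm: float,
) -> float:
    """Total cost of missed hallucinations plus false alarms."""
    caught = recall * n_hallucinated
    missed = n_hallucinated - caught
    # From precision = caught / (caught + false_alarms).
    false_alarms = caught * (1 - precision) / precision
    return missed * cost_miss + false_alarms * cost_false_alarm
```

For the two systems in the paragraph above, the high-recall system comes out cheaper once a miss costs more than roughly 3.6 times a false alarm; below that ratio the high-precision system wins. The break-even point does not depend on how many hallucinations are in the test set, since both error counts scale with it.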
What is changing in 2026
A few research directions reshaping the toolkit:
Reasoning-trace inspection. Reasoning models that expose their trace make a new check possible: comparing the trace against the final answer surfaces inconsistencies that inspecting the answer alone misses.
Tool-use validation. When the model uses a tool (search, calculator, API), verifying the tool output against the answer is more tractable than free-text verification.
Cross-model voting. Sampling answers from two different model families and checking agreement is a stronger signal than self-consistency within one family.
Calibrated confidence training. Some 2026 frontier models include explicit calibration objectives, making the raw confidence signal more useful than it was.
A starter stack
For a team starting today on a RAG-based product:
Add citation requirements to your output schema; reject outputs without citations.
Build a span-level entailment check between cited claims and source spans.
Add a triage classifier that flags suspicious outputs for additional review.
Run a periodic self-consistency probe against a sampled subset to detect drift.
Build the eval set; measure your detection precision/recall before you commit to thresholds.
This stack is achievable in a few weeks and catches the majority of practical hallucinations. The remaining tail is a long fight; the techniques above are how you stay in it.
Frequently Asked Questions
Can I rely on the LLM to detect its own hallucinations?
No — self-evaluation by the same model has poor calibration and rarely catches the errors it produced. Use external grounding (citation verification, retrieved-context entailment) or asymmetric sampling (a different model, multi-sample voting).
How much does hallucination detection slow down inference?
Triage classifiers add roughly 50-200ms; grounding verification adds 200-500ms for typical RAG outputs; self-consistency multiplies inference cost 3-5x. Stack only what your latency budget supports; run heavier checks asynchronously where possible.
Do reasoning models hallucinate less?
Reasoning models often have lower hallucination rates on tasks within their training distribution and surface their reasoning trace, which makes inconsistencies easier to spot. They are not immune — out-of-distribution prompts and high-confidence wrong premises still produce hallucinated outputs.