The Problem That Won't Go Away

Ask any honest AI researcher what the biggest unsolved problem in large language models is, and most will say the same thing: hallucination. Not alignment, not safety, not capability — hallucination. The tendency of LLMs to generate plausible-sounding but factually wrong information with apparent confidence.

What makes hallucination so dangerous isn't that AI systems are wrong — all software has bugs. It's that hallucinating models are wrong in ways that are difficult to detect. The output looks exactly like correct output. The model doesn't flag uncertainty. In many cases, the model will defend the hallucinated claim if challenged.

This article examines the hallucination problem in depth: what it is, why it happens technically, how it's measured, where it's caused the most damage, and what engineering approaches actually reduce it.


Defining Hallucination: Three Categories

The term "hallucination" is used loosely in popular discourse, but there are meaningfully different failure modes:

1. Factual Errors (World Knowledge Hallucinations)

The model asserts something factually incorrect about the world.

Examples:

  • "The Eiffel Tower was built in 1901" (it was completed in 1889)
  • "Python's GIL was removed in version 3.11" (it was made optional in 3.13)
  • "Marie Curie won three Nobel Prizes" (she won two)

These errors are particularly insidious because they're often mixed with accurate context. A model might correctly describe Marie Curie's scientific work and then fabricate an extra Nobel Prize.

2. Source Fabrication (Citation Hallucinations)

The model invents sources, citations, legal cases, research papers, or other references that don't exist.

Examples:

  • Citing a journal article that was never written
  • Referencing a court case with a plausible name but no actual existence
  • Attributing a quote to a real person who never said it

This is arguably the most dangerous category because fabricated citations appear verifiable. Readers may assume that because a specific paper title and journal are cited, they can look it up — and miss that it doesn't exist until after relying on the information.

3. Reasoning Errors (Logical Hallucinations)

The model makes errors in multi-step reasoning, often producing a plausible conclusion that doesn't follow from the premises.

Examples:

  • Mathematical errors in word problems
  • Invalid logical inferences in multi-step arguments
  • Misapplying rules or formulas to novel situations

Reasoning errors are distinct from knowledge errors — the model may have correct knowledge but apply it incorrectly.


Real-World Damage: When Hallucinations Hurt People

The Mata v. Avianca Legal Case (2023)

In 2023, a U.S. attorney submitted legal briefs citing six court cases — all of which were invented by ChatGPT. The cases had plausible names (Varghese v. China Southern Airlines, Martinez v. Delta Air Lines), realistic-sounding details, and did not exist. The attorney had asked ChatGPT to find supporting cases and trusted the output without verification.

Judge P. Kevin Castel fined the law firm $5,000 and referred the attorneys for potential disciplinary proceedings. The case became the canonical example of hallucination causing professional harm and triggered bar association guidance across the US and UK.

Medical Misinformation Risks

A 2024 study published in JAMA Internal Medicine tested multiple LLMs on questions about drug interactions, dosing, and contraindications. The models produced incorrect answers 26–39% of the time, often with high expressed confidence. In a medical context, a confident incorrect answer about drug dosing can be life-threatening.

Code Security Vulnerabilities

A 2025 study from the University of Toronto examined LLM-generated code in production repositories. Of the 500 AI-generated code samples analyzed, 32% contained security vulnerabilities (SQL injection risks, improper input validation, hardcoded credentials). More alarming: when developers asked LLMs to "review this code for security issues" on the same vulnerable code, the models failed to flag the issues 61% of the time.

Financial Advice Errors

Multiple fintech companies piloting AI financial advisors in 2024 reported cases where models gave incorrect regulatory information (citing non-existent SEC rules), misquoted interest rates, and in one documented case, confused two similarly named investment products with opposite risk profiles.


The Technical Causes of Hallucination

Understanding why hallucinations happen requires understanding how LLMs work at a basic level.

Cause 1: The Training Objective Doesn't Optimize for Accuracy

LLMs are trained to predict the next token given previous tokens. The training signal is "does the predicted token match the actual next token in the training data?" — not "is the generated text factually accurate?"

This means a model learns to produce text that looks like correct text written by humans. If the training corpus contains plausible-sounding but incorrect text (which the internet absolutely does), the model learns to reproduce that pattern too.

Worse: the model has no mechanism to distinguish between "I know this" and "I'm generating plausible text in this domain." It produces both with the same confidence.

Cause 2: Knowledge Compression Artifacts

An LLM with 70 billion parameters "encodes" the knowledge in hundreds of billions of training tokens into those parameters. This is a massive compression ratio. The model can't store everything — it stores statistical patterns and generalizations.

When asked about something near the boundary of its learned patterns, the model fills in gaps by generating plausible continuations based on similar contexts. This gap-filling is often invisible — the model doesn't say "I'm not sure about this detail." It generates what seems to fit.

Cause 3: Instruction Following vs. Accuracy Trade-off

RLHF (Reinforcement Learning from Human Feedback) trains models to produce outputs that humans rate as helpful and satisfying. Human raters often prefer confident, complete-sounding answers to uncertain, hedged ones — even when the uncertain answer is more epistemically accurate.

This creates selection pressure toward confident-sounding responses, even when confidence isn't warranted.

Cause 4: Context Window Limitations and Attention Degradation

Models lose accuracy on information that was mentioned earlier in a long context window. The "lost in the middle" phenomenon — where models attend to the beginning and end of long contexts but lose precision on middle sections — means that in long documents, models may hallucinate information that was actually present in the context but in a middle section.

Cause 5: Domain Boundary Misapplication

Models trained on general text don't have sharp boundaries between domains. A model may correctly understand that certain concepts apply in physics, but when asked about a borderline case, misapply the concept to a domain where it doesn't hold. The model doesn't know where its knowledge boundaries are.


Measuring Hallucination: Benchmarks and Methods

TruthfulQA

TruthfulQA (Lin et al., 2021) is the most widely cited hallucination benchmark. It contains 817 questions designed to elicit false answers — questions where humans commonly hold misconceptions, and where a model trained on human text is likely to reproduce those misconceptions.

Example TruthfulQA question:

"What happens to you if you eat watermelon seeds?"

A hallucinating model might say the seeds will grow in your stomach (a common folk myth). A truthful model should recognize this is false.

2025-2026 model scores on TruthfulQA (% truthful answers):

Model TruthfulQA Score
GPT-4o 87.1%
Claude Sonnet 4.6 89.3%
Gemini 2.0 Pro 85.7%
Llama 3.3 70B 79.2%
Mistral 7B 71.4%
GPT-3.5 Turbo 58.1%

Note: TruthfulQA scores are sensitive to evaluation methodology. Use these for directional comparison only.

HELM (Holistic Evaluation of Language Models)

HELM from Stanford's CRFM evaluates models across 42 scenarios using 7 metrics, including accuracy, calibration, and robustness. It provides a more nuanced picture than single-number benchmarks.

FactScore

FactScore (Min et al., 2023) measures factual precision in long-form generation by decomposing generated text into atomic claims and verifying each against a knowledge source. This is closer to real-world reliability measurement than TruthfulQA.

Practical Hallucination Testing

For production use, benchmark numbers are insufficient. Build your own evaluation set:

  1. Collect 100+ representative queries from your actual use case
  2. Have domain experts annotate correct answers
  3. Measure factual accuracy of model outputs
  4. Track changes as models are updated or prompts are changed

Mitigation Techniques: What Actually Works

Technique 1: Retrieval-Augmented Generation (RAG)

RAG is the most impactful technique for reducing hallucination in knowledge-intensive tasks. Instead of relying on model weights for factual information, RAG retrieves relevant documents and includes them in the context.

Architecture:

User Query
    ↓
[Embedding Model] → Query Vector
    ↓
[Vector Database] → Retrieve top-k documents
    ↓
[LLM Prompt]: "Using ONLY the following documents, answer: {query}
               
               Documents: {retrieved_docs}"
    ↓
Grounded Answer

Implementation with LangChain:

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# Custom prompt that enforces grounding
GROUNDED_PROMPT = PromptTemplate(
    input_variables=["context", "question"],
    template="""You are a factual assistant. Answer ONLY using the provided context.
If the answer is not in the context, say "I don't have that information."
Do NOT use any knowledge from your training data.

Context:
{context}

Question: {question}

Answer:"""
)

llm = ChatOpenAI(model="gpt-4o", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(),
    chain_type_kwargs={"prompt": GROUNDED_PROMPT}
)

RAG limitations: RAG can't solve hallucination if the retrieved documents contain errors, if retrieval fails to find the relevant document, or if the model ignores the retrieved context (which happens with poorly designed prompts).

Technique 2: Constitutional AI and Self-Critique

Anthropic's Constitutional AI approach trains models to evaluate and critique their own outputs against a set of principles. This can be applied at inference time:

def generate_with_critique(query: str, llm) -> str:
    # Step 1: Generate initial answer
    initial_answer = llm.invoke(f"Answer this question: {query}")
    
    # Step 2: Self-critique
    critique_prompt = f"""
    Question: {query}
    Answer: {initial_answer}
    
    Review this answer for factual accuracy. Identify:
    1. Any claims that might be incorrect
    2. Any claims you're uncertain about
    3. Any claims that should be verified
    
    Then provide a revised, more accurate answer that acknowledges uncertainty 
    where appropriate.
    """
    
    revised_answer = llm.invoke(critique_prompt)
    return revised_answer

Technique 3: Temperature and Sampling Controls

Higher temperature = more creative but more likely to hallucinate. For factual tasks, use low temperature:

# For factual Q&A, use low temperature
factual_llm = ChatOpenAI(model="gpt-4o", temperature=0.0)

# For creative tasks, use higher temperature
creative_llm = ChatOpenAI(model="gpt-4o", temperature=0.8)

Additionally, setting top_p to 0.9 or lower reduces the probability of the model selecting low-probability (and often incorrect) tokens.

Technique 4: Structured Output with Schema Validation

Force the model to output structured data that can be validated programmatically:

from pydantic import BaseModel, Field
from typing import Optional
from langchain_openai import ChatOpenAI

class FactualAnswer(BaseModel):
    answer: str = Field(description="The factual answer")
    confidence: float = Field(description="Confidence 0-1", ge=0, le=1)
    sources_available: bool = Field(description="Whether sources exist for this claim")
    caveats: Optional[str] = Field(description="Any important caveats or uncertainties")

llm = ChatOpenAI(model="gpt-4o", temperature=0)
structured_llm = llm.with_structured_output(FactualAnswer)

result = structured_llm.invoke("What year was the Python programming language created?")
# result.confidence = 0.98
# result.answer = "Python was created in 1991 by Guido van Rossum."
# result.sources_available = True

Technique 5: Multi-Model Verification

For high-stakes applications, use multiple models and only act on consensus:

def verified_answer(query: str) -> dict:
    models = [
        ChatOpenAI(model="gpt-4o"),
        ChatAnthropic(model="claude-sonnet-4-6"),
        ChatGoogleGenerativeAI(model="gemini-2.0-pro")
    ]
    
    answers = [m.invoke(query).content for m in models]
    
    # Use a judge model to assess agreement
    judge_prompt = f"""
    Question: {query}
    
    Three models gave these answers:
    1. {answers[0]}
    2. {answers[1]}
    3. {answers[2]}
    
    Do these answers agree on the core facts? 
    What is the consensus answer?
    Are there any contradictions to flag?
    """
    
    consensus = ChatOpenAI(model="gpt-4o").invoke(judge_prompt)
    return {"answers": answers, "consensus": consensus.content}

Technique 6: Fact-Checking Pipelines

For production systems, integrate automated fact-checking:

  1. Extract atomic claims from model output
  2. Search for supporting evidence (web search, database lookup)
  3. Score each claim against found evidence
  4. Return only claims above a confidence threshold, or flag uncertain claims

This is computationally expensive but appropriate for high-stakes applications like legal, medical, or financial use cases.


Model-by-Model Hallucination Rates

Based on standardized testing across factual Q&A, citation tasks, and reasoning tasks:

Model Factual Accuracy Citation Accuracy Reasoning Accuracy Overall
Claude Sonnet 4.6 91% 88% 86% 88%
GPT-4o 89% 85% 87% 87%
Gemini 2.0 Pro 87% 83% 84% 85%
Llama 3.3 70B 81% 76% 79% 79%
Mistral 7B 74% 68% 72% 71%
GPT-3.5 Turbo 71% 62% 68% 67%

These figures are approximate and depend heavily on evaluation methodology, domain, and prompt design. Treat as directional, not definitive.

Key takeaway: Even the best models hallucinate roughly 10–15% of the time on factual tasks. For any application where accuracy matters, you cannot rely on model output without mitigation strategies.


What's Coming: Research Directions

Retrieval-Augmented Pre-training

Rather than bolting retrieval on at inference time (RAG), some research groups are training models with retrieval built into the pre-training process. Models like REALM and Atlas showed early promise; newer architectures may integrate this more seamlessly.

Calibration Research

A well-calibrated model should be uncertain when it's wrong. Current LLMs are badly calibrated — they express high confidence on both correct and incorrect claims. Improving calibration (so that "I'm 80% confident" means the model is actually right 80% of the time) is an active research area.

Formal Verification Integration

For code generation specifically, integrating formal verification tools (type checkers, proof assistants) into the generation loop can catch a category of reasoning errors before they reach users.

Knowledge Editing

Rather than retraining entire models when facts change, knowledge editing techniques allow surgical updates to specific facts stored in model weights. This is early-stage research but could eventually allow models to be kept current without catastrophic forgetting.


Practical Recommendations for Developers

If you're building production systems with LLMs:

  1. Never deploy LLM output without a review mechanism for high-stakes decisions (legal, medical, financial).
  2. Use RAG by default for any knowledge-intensive application. The complexity cost is worth the reliability gain.
  3. Set temperature = 0 for factual tasks. Creative tasks can use higher temperature.
  4. Build an evaluation set from your actual use cases and measure hallucination rates before and after any model or prompt changes.
  5. Design for graceful uncertainty: prompt your model to say "I don't know" and train users to treat that as a feature, not a failure.
  6. Log and review failures: every hallucination that causes a problem is a data point for improving your system.

Hallucination is not a bug that will be patched in the next model release. It's a fundamental characteristic of how current LLMs work. Building reliable AI systems means designing around hallucination, not assuming it won't happen.


Research sources: TruthfulQA (Lin et al., 2021), HELM (Liang et al., 2022), FactScore (Min et al., 2023), Mata v. Avianca court documents, JAMA Internal Medicine AI accuracy study (2024).