
AI Hallucinations in Theology: Why Your Bible Chatbot Gets It Wrong 5–19% of the Time

By AI Fluency Ministry · April 2026

In June 2023, attorney Steven Schwartz submitted a legal brief to a federal court in New York. It cited six prior cases as precedent. All six were fabricated. ChatGPT invented them. When Schwartz asked the chatbot if the cases were real, it said yes — they “can be found in reputable legal databases.” Judge P. Kevin Castel imposed a $5,000 fine and found “subjective bad faith.”

That was law. A domain with searchable databases, case numbers, and verifiable facts. Now imagine what happens when the domain has no database. No case numbers. No way to verify the output unless you already know the answer.

That domain is theology.

The Hallucination Problem Is Worse Than You Think

AI hallucination means the model generates information that sounds authoritative but is entirely fabricated. Not a glitch. Not a rare edge case. A structural feature of how large language models work.

The numbers across specialized domains are staggering:

Legal AI

69–88% hallucination rate on general legal queries (Stanford 2024). 596+ documented cases of fabricated citations since mid-2023.

Medical AI

Epic's Sepsis Model, deployed across thousands of hospitals, missed 67% of sepsis patients. Of 6,971 alerts it generated, only 12% were correct — 88% were false alarms (JAMA 2021).

Theological AI

On Gloo's FAI-C Benchmark, leading AI models averaged 61/100, with the worst failures on prompts that required Christian interpretation. Their average faith score on Christian-specific prompts: 48/100.

These are not generic chatbots. IBM spent $4 billion building Watson Health. Thomson Reuters built Westlaw's AI specifically for legal research. And in August 2025 alone, three separate federal courts sanctioned attorneys for submitting hallucinated citations from Westlaw Precision — a tool designed to prevent exactly that.

If purpose-built AI in domains with verifiable facts fails at these rates, what is the actual error rate when generic ChatGPT handles the Trinity, the atonement, or the nature of God — topics where there is no database to check against?

The Confidence Trick

Here is what makes AI hallucination uniquely dangerous for theology.

MIT researchers found that AI models use 34% more confident language when hallucinating than when stating verified facts. Phrases like “definitely,” “certainly,” “without a doubt.” The model sounds most authoritative when it is most wrong.

34% more confident when wrong.

The model doesn't hesitate when it fabricates. It doubles down.

And it doesn't just invent whole facts. It invents partial facts — a real author with a wrong title, a real Bible verse with a wrong context, a real doctrine with a subtle distortion. The kind of error that looks entirely right unless you already know the material deeply enough to spot it.

In the Mata v. Avianca case, the fabricated citations included real court names, plausible case numbers, and realistic legal reasoning. Everything looked correct. Nothing was real. And the attorney trusted it because the output felt authoritative.

Now picture a pastor asking AI to explain the hypostatic union, or the relationship between faith and works in James 2, or whether 1 Timothy 2:12 is culturally conditioned. The AI will respond with fluent prose, plausible citations, and confident theological language. Some of it will be accurate. Some of it will be fabricated. And the pastor will have no way to tell the difference without doing the very research they asked AI to do for them.

Theology Is the Highest-Risk Domain

Hallucination rates vary by domain. The pattern is consistent: the more specialized the knowledge, the higher the error rate.

Legal hallucination rates run from 6.4% for top models to 18.7% across all models. Medical runs from 4.3% to 15.6%, and scientific from 3.7% to 16.9%.

Theology has not been formally benchmarked. But the indicators are clear. Gloo's data shows AI models scoring 61/100 on a Christian worldview benchmark — with the worst performance on prompts requiring Christian interpretation. Models “often fail to connect scenarios to Christian values, or provide coherent theological reasoning around concepts like grace, sin or forgiveness.”

The structural reason is simple: LLMs generate statistically probable text. They trend toward the average of their training data. Theology is not average. Orthodox Christian doctrine is specific, nuanced, and often counterintuitive. The resurrection is not a statistically probable claim. The Trinity is not the median position of world religions. The exclusivity of Christ is not the safe, centrist output that alignment systems reward.

So when AI handles theology, it gravitates toward the flattened, noncommittal, spiritually generic middle. And it does it with 34% more confidence than when it states something true.
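
To make the "statistically probable text" point concrete, here is a toy sketch in Python. The word choices and probabilities are invented for illustration (they are not drawn from any real model), but the mechanism is the one described above: the model samples its next words in proportion to how common they were in its training data, so the generic continuation wins almost every time and the specific, counterintuitive claim almost never surfaces.

```python
import random

# Toy illustration, not any real model's code: a language model picks its
# next words by sampling from a probability distribution learned from its
# training data. Specific, minority-position claims carry low probability,
# so ordinary sampling rarely produces them.
next_phrase_probs = {
    "a spiritual teacher": 0.55,        # the "average" of the training data
    "a moral example": 0.30,
    "fully God and fully man": 0.10,    # the specific, orthodox claim
    "a myth": 0.05,
}

def sample(probs):
    """Draw one continuation in proportion to its probability."""
    phrases, weights = zip(*probs.items())
    return random.choices(phrases, weights=weights, k=1)[0]

counts = {phrase: 0 for phrase in next_phrase_probs}
for _ in range(1000):
    counts[sample(next_phrase_probs)] += 1

# The generic continuations dominate; the specific one is rare.
print(counts)
```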

The Detection Problem

Knowledge workers already spend an average of 4.3 hours per week fact-checking AI outputs. In professional domains with verifiable databases, that's difficult enough.

In theology, what does fact-checking look like? It looks like doing the original research yourself. Checking the Greek. Reading the commentary. Tracing the cross-references. Comparing the systematic theology.

Which means theology is the one domain where verifying AI output requires the exact same work as not using AI at all. The “time saved” by AI evaporates the moment you take accuracy seriously.

And most pastors are not taking that step. Sixty-four percent of pastors use AI for sermon preparation. The vast majority are not running verification protocols. They are trusting the output because it sounds right — the same reason Steven Schwartz trusted six fabricated court cases.

What This Means for the Church

For every 100 theological claims an AI makes, somewhere between 5 and 19 are likely fabricated, distorted, or subtly wrong — based on hallucination rates in comparable specialized domains. And those claims will be delivered with more confidence than the accurate ones.

This is not an argument against using AI. It is an argument against using AI without verification. Against using AI as an author instead of a research assistant. Against trusting a tool that is structurally incapable of knowing whether what it says about God is true.

“We who teach will be judged more strictly.”

— James 3:1

If an AI is shaping what your congregation believes about God — and you haven't verified what it said — you are teaching what you have not studied. That is not a technology problem. That is a stewardship problem.

A Different Approach

OpenLumin was built on a single conviction: AI should retrieve evidence, not generate theology.

Every claim is sourced from 15+ scholarly databases — commentaries, original language tools, historical context, cross-references. Every citation is marked as “verified” (from evidence data) or “training-assisted” (flagged for review). You always know what you can trust and what you need to check.

Because the alternative — trusting a model that is 34% more confident when it's wrong — is not a foundation anyone should build a sermon on.
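
As a rough illustration of the verified / training-assisted distinction described above, here is a minimal Python sketch. The class and field names are hypothetical (this is not OpenLumin's actual code or API); it only shows the idea of every claim carrying its provenance, so the reader can see at a glance what was retrieved from evidence and what still needs checking.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical sketch: each claim records where it came from.
@dataclass
class Citation:
    claim: str
    source: Optional[str]   # e.g. a commentary or lexicon entry, if retrieved
    provenance: str         # "verified" or "training-assisted"

    def needs_review(self) -> bool:
        return self.provenance != "verified"

citations = [
    Citation("James 2 addresses faith demonstrated by works",
             "Commentary on James (retrieved)", "verified"),
    Citation("A background claim with no retrieved source",
             None, "training-assisted"),
]

for c in citations:
    flag = "CHECK" if c.needs_review() else "OK"
    print(f"[{flag}] {c.claim}")
```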

AI hallucination rates in specialized domains run 5–19%.
Theology has no verification database.
Your Bible study deserves sourced evidence, not confident guesses.


Sources: Mata v. Avianca, 678 F.Supp.3d 443 (S.D.N.Y. 2023); Wong et al., JAMA Internal Medicine (2021); Stanford Legal AI Hallucination Study (2024); MIT AI Confidence Research (2025); Gloo FAI-C Benchmark (2025); Charlotin AI Legal Hallucination Database (2025); Exponential/AI NEXT Churches Study (2025). This article is part of the AI Fluency Ministry research series.
