OpenAI’s new research paper asks why large language models like GPT-5 and chatbots like ChatGPT still hallucinate, and whether anything can be done to reduce those hallucinations.
In a blog post summarizing the paper, OpenAI defines hallucinations as “plausible but false statements generated by language models,” and acknowledges that, despite improvements, hallucinations “remain a fundamental challenge for all large language models.”
To illustrate the point, the researchers say that when they asked a widely used chatbot for the title of Adam Tauman Kalai’s Ph.D. dissertation, they got three different answers, all of them wrong. (Kalai is one of the paper’s authors.) They then asked about his birthday and received three different dates. Again, all of them were wrong.
Why are chatbots so wrong? The researchers suggest that hallucinations arise in part because pretraining focuses on getting models to correctly predict the next word, with no true-or-false labels attached to the training statements.
“Spelling and parentheses follow consistent patterns, so errors there disappear with scale,” they write. “But arbitrary low-frequency facts, like a pet’s birthday, cannot be predicted from patterns alone and hence lead to hallucinations.”
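That distinction between common patterns and rare facts is easy to see in a toy model. Below is a minimal sketch (my own illustration, with made-up names and probabilities, not code from the paper): a next-token objective rewards fluent continuations equally well whether the completed fact is true or false, because no truth label ever enters the loss.

```python
# Toy illustration of the next-token pretraining objective.
# The loss only measures how well the model predicts each token from its
# context; nothing in it says whether the resulting statement is true.
import math

def next_token_loss(model, tokens):
    """Average cross-entropy of the model's next-token predictions."""
    total = 0.0
    for i in range(1, len(tokens)):
        context, target = tokens[:i], tokens[i]
        p = model(context).get(target, 1e-9)  # probability assigned to the actual next token
        total += -math.log(p)
    return total / (len(tokens) - 1)

MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

def toy_model(context):
    """A toy 'model' that has learned common surface patterns but not rare facts."""
    last = context[-1]
    if last == "Kalai":
        return {"was": 0.9, "is": 0.1}        # frequent pattern: easy to predict
    if last == "was":
        return {"born": 0.8, "a": 0.2}        # frequent pattern: easy to predict
    if last == "born":
        return {m: 1 / 12 for m in MONTHS}    # arbitrary low-frequency fact: every month looks alike
    return {}

# Hypothetical sentences: pretend the first states the real birth month and the
# second a hallucinated one. Both get essentially the same loss, so pretraining
# alone never tells the model which continuation is actually true.
real_fact = ["Kalai", "was", "born", "March"]
fake_fact = ["Kalai", "was", "born", "June"]
print(next_token_loss(toy_model, real_fact), next_token_loss(toy_model, fake_fact))
```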
The paper’s proposed solution, however, focuses less on the initial pretraining process and more on how large language models are evaluated. The researchers argue that current evaluations don’t cause hallucinations themselves, but that they “set the wrong incentives.”
They compare these evaluations to multiple-choice tests where random guessing makes sense: a lucky guess can earn points, while leaving the answer blank guarantees a zero.
“In the same way, when models are graded only on accuracy, the percentage of questions they get exactly right, they are encouraged to guess rather than say ‘I don’t know,’” they say.
The proposed solution, then, is similar to tests like the SAT that use negative scoring for wrong answers, or partial credit for leaving questions blank, to discourage blind guessing. In the same vein, OpenAI says model evaluations need to “penalize confident errors more than you penalize uncertainty, and give partial credit for appropriate expressions of uncertainty.”
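To see why the incentive matters, here is a back-of-the-envelope sketch (my own numbers and penalty value, not a scoring rule taken from the paper) comparing expected scores under accuracy-only grading and under a rubric that docks points for confident errors while giving zero for abstaining.

```python
# Expected-score comparison: should a model guess or say "I don't know"?

def expected_score_guess(p_correct, wrong_penalty):
    """Expected score for answering: +1 if right, -wrong_penalty if wrong."""
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty

ABSTAIN_SCORE = 0.0  # abstaining earns nothing under either rubric

for confidence in (0.9, 0.5, 0.1):
    accuracy_only = expected_score_guess(confidence, wrong_penalty=0.0)
    penalized     = expected_score_guess(confidence, wrong_penalty=2.0)  # assumed penalty
    print(f"confidence={confidence:.1f}  "
          f"accuracy-only: guess={accuracy_only:+.2f} vs abstain={ABSTAIN_SCORE:+.2f}  |  "
          f"penalized: guess={penalized:+.2f} vs abstain={ABSTAIN_SCORE:+.2f}")
```

Under accuracy-only grading, guessing beats abstaining at every confidence level, so a model optimized against the benchmark learns to bluff; with the assumed penalty of 2 points per wrong answer, guessing only pays off when confidence exceeds the break-even point of two-thirds.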
And the researchers argue that it isn’t enough to introduce “a few new uncertainty-aware tests on the side.” Instead, “the widely used, accuracy-based evals need to be updated so that their scoring discourages guessing.”
“If the main scoreboards keep rewarding lucky guesses, models will keep learning to guess,” the researchers say.