Hallucinations: A Fly in the Ointment


AI built on large language models makes stuff up. It just does. These fabrications are generally called “hallucinations.” It’s a real problem, a serious one. You need to understand hallucinations if you’re going to work with AI.

Cambridge Dictionary’s Word of the Year for 2023 was “Hallucinate,” whose definition has been expanded to include “When an artificial intelligence… hallucinates, it produces false information.” (Other additions to the 2023 dictionary include “prompt engineering,” “large language model,” and “GenAI.”)

AI hallucinations, Cambridge notes, “sometimes appear nonsensical. But they can also seem entirely plausible – even while being factually inaccurate or ultimately illogical.” This, sadly, is quite true, and as of July 2024 it remains a dramatic limitation on using generative AI for mission-critical tasks. It’s one of the several great oddities of AI, and it takes people a while to get their heads around it. Remember, generative AI is mostly a next-word prediction engine, not a database of facts. Hence the need for HITLs (Humans-In-The-Loop, as we’re now known) to double-check AI output. And again, it’s remarkable that we can get such extraordinary value from a technology that can produce provably inaccurate output. So it goes.
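To make “next-word prediction engine” concrete, here is a toy sketch in Python. The probability table is invented purely for illustration and has nothing to do with any real model’s internals; the point is that the program keeps choosing a statistically likely next word without ever consulting a store of verified facts.

```python
import random

# Toy illustration only: a made-up table of "which word tends to follow which,"
# nothing like a real model's internals. The loop keeps sampling a likely next
# word; at no point does it consult a database of verified facts.
next_word_probs = {
    ("the", "largest"): {"publisher": 0.6, "bookstore": 0.3, "hallucination": 0.1},
    ("largest", "publisher"): {"is": 0.7, "was": 0.3},
    ("publisher", "is"): {"growing": 0.5, "struggling": 0.5},
}

def predict_next(prev_two):
    """Sample a next word given the last two words, or None if the table runs out."""
    choices = next_word_probs.get(prev_two)
    if choices is None:
        return None
    words = list(choices)
    weights = list(choices.values())
    return random.choices(words, weights=weights)[0]

text = ["the", "largest"]
while True:
    word = predict_next(tuple(text[-2:]))
    if word is None:
        break
    text.append(word)

print(" ".join(text))  # e.g. "the largest publisher is growing" -- plausible, not verified
```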

Gary Marcus, an experienced and well-informed AI critic, compares AI hallucinations to a broken watch, which is right twice a day. “It’s right some of the time,” he says, “but you don’t know which part of the time, and that greatly diminishes its value.”

Ethan Mollick, a keynote speaker at the September 2023 Publishers Weekly conference, notes that people expect 100% accuracy from AI, yet hallucinations, he says, occur at rates similar to the “human rates of error” we tolerate daily.

Andrej Karpathy, a noted AI researcher and a founding member of OpenAI, writes about hallucinations:

“I always struggle a bit when I’m asked about the ‘hallucination problem’ in LLMs. Because, in some sense, hallucination is all LLMs do. They are dream machines.

“We direct their dreams with prompts. The prompts start the dream, and based on the LLM’s hazy recollection of its training documents, most of the time the result goes someplace useful.

“It’s only when the dreams go into deemed factually incorrect territory that we label it a ‘hallucination.’ It looks like a bug, but it’s just the LLM doing what it always does.”

Making stuff up is not the only problem. Chat AI is deeply flawed software.

For many queries, particularly from novices, the responses are mundane, off-target or simply unhelpful. Chat AI has trouble counting: Ask it for a 500-word blog post and you’ll be lucky to get 150.
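If you want to test the word-count claim yourself, here is a rough sketch using the OpenAI Python SDK. The model name and prompt are illustrative assumptions, and the count will vary from run to run.

```python
# A rough check of the word-count claim, using the OpenAI Python SDK.
# The model name and prompt are illustrative; counts will vary run to run.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content":
               "Write a 500-word blog post about independent bookstores."}],
)

post = response.choices[0].message.content
print(f"Requested 500 words; received {len(post.split())}.")
```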

And each of the AI companies, in order to reduce bias and to avoid answering “how-to-build-a-bomb” queries, has erected tight response guardrails around its products: all too often, the response to a question is, essentially, “No, I won’t answer that.” I asked Google Gemini to review a draft of this text and was cautioned that “it’s essential to get the author’s approval before publishing.”

Fact Checking


I argue, mostly to deaf ears, that hallucinations are a technology problem that will find a technology solution. Yes, they’re endemic to LLMs, but they can be circumvented.

Consider this: I asked four Chat AIs to fact-check the following statements:

  • As of 2024, there are 6 big multinational publishers based in New York City. They are known as the Big 6.

  • Ebooks continue to dominate book sales in the United States.

  • Borders and Barnes & Noble are the two largest bookselling chains in the United States.

  • After a sales decline during Covid, U.S. book sales are again growing by double-digits.

All of them spotted the errors in the first three statements. Each became a little confused on the fourth, uncertain of the extent of the Covid sales bump and of subsequent sales patterns. It’s a tiny, non-representative experiment, but these Chat AIs, which are not designed to be fact-based, can do a credible job of catching factual errors that most casual observers would miss.
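Anyone can rerun a version of this experiment. Below is a minimal sketch using the OpenAI Python SDK; the model name and the fact-checker instructions are my own assumptions, and any of the major chat models could be swapped in.

```python
# A minimal sketch of the fact-checking experiment described above,
# using the OpenAI Python SDK. Model name and instructions are assumptions.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

statements = [
    "As of 2024, there are 6 big multinational publishers based in New York City. "
    "They are known as the Big 6.",
    "Ebooks continue to dominate book sales in the United States.",
    "Borders and Barnes & Noble are the two largest bookselling chains in the United States.",
    "After a sales decline during Covid, U.S. book sales are again growing by double-digits.",
]

for statement in statements:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "You are a careful fact-checker. Say whether the statement "
                        "is accurate, and briefly explain why or why not."},
            {"role": "user", "content": statement},
        ],
    )
    print(statement)
    print(response.choices[0].message.content)
    print("-" * 60)
```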