Facepalm: It is no secret that generative AI is prone to hallucinations, but as these tools make their way into medical settings, alarm bells are ringing. Even OpenAI warns against using its transcription tool in high-risk settings.
OpenAI’s AI-powered transcription tool, Whisper, has come under fire for a significant flaw: its tendency to generate fabricated text, known as hallucinations. Despite the company’s claims of “human level robustness and accuracy,” experts interviewed by the Associated Press have identified numerous instances where Whisper invents entire sentences or adds non-existent content to transcriptions.
The issue is particularly concerning given Whisper’s widespread use across various industries. The tool is employed for translating and transcribing interviews, generating text for consumer technologies, and creating video subtitles.
Perhaps most alarming is the rush by medical centers to implement Whisper-based tools for transcribing patient consultations, even though OpenAI has given explicit warnings against using the tool in “high-risk domains.”
Despite those warnings, the medical sector has embraced Whisper-based tools. Nabla, a company with offices in France and the US, has developed a Whisper-based tool used by over 30,000 clinicians and 40 health systems, including the Mankato Clinic in Minnesota and Children’s Hospital Los Angeles.
Martin Raison, Nabla’s chief technology officer, said their tool has been fine-tuned on medical language to transcribe and summarize patient interactions. However, the company erases the original audio for “data safety reasons,” making it impossible to compare the AI-generated transcript against the original recording.
So far, the tool has been used to transcribe an estimated 7 million medical visits, according to Nabla.
Using AI transcription tools in medical settings has also raised privacy concerns. California state lawmaker Rebecca Bauer-Kahan shared her experience refusing to sign a form allowing her child’s doctor to share consultation audio with vendors, including Microsoft Azure. “The release was very specific that for-profit companies would have the right to have this,” she told the Associated Press. “I was like ‘absolutely not.'”
The extent of Whisper’s hallucination issue is not fully known, but researchers and engineers have reported numerous instances of the problem in their work. One University of Michigan researcher found hallucinations in 80 percent of the public meeting transcriptions examined. A machine learning engineer encountered them in roughly half of more than 100 hours of Whisper transcriptions analyzed, while another developer found them in nearly all of the 26,000 transcripts created using the tool.
A study conducted by Professor Allison Koenecke of Cornell University and Assistant Professor Mona Sloane of the University of Virginia examined thousands of short audio snippets, discovering that nearly 40 percent of the hallucinations were deemed harmful or concerning due to potential misinterpretation or misrepresentation of speakers.
Examples of these hallucinations include adding violent content where none existed in the original audio, inventing racial commentary not present in the original speech, and creating non-existent medical treatments.
In one instance, Whisper transformed a simple statement about a boy taking an umbrella into a violent scenario involving a cross and a knife. In another case, the tool added racial descriptors to a neutral statement about people. Whisper also fabricated a fictional medication called “hyperactivated antibiotics” in one of its transcriptions.
Such mistakes could have “really grave consequences,” especially in hospital settings, said Alondra Nelson, who led the White House Office of Science and Technology Policy for the Biden administration until last year. “Nobody wants a misdiagnosis,” said Nelson, a professor at the Institute for Advanced Study in Princeton, New Jersey. “There should be a higher bar.”
Whisper’s influence extends far beyond OpenAI. The tool is integrated into some versions of ChatGPT and is offered as a built-in service on Oracle and Microsoft’s cloud computing platforms. In just one month, a recent version of Whisper was downloaded over 4.2 million times from the open-source AI platform HuggingFace.
Critics say that OpenAI needs to address this flaw immediately. “This seems solvable if the company is willing to prioritize it,” said William Saunders, a former OpenAI engineer who left the company in February over concerns about its direction.
“It’s problematic if you put this out there and people are overconfident about what it can do and integrate it into all these other systems.”