Tech behemoth OpenAI has touted its artificial intelligence-powered transcription tool Whisper as having near “human level robustness and accuracy.”

But Whisper has a major flaw: It is prone to making up chunks of text or even entire sentences, according to interviews with more than a dozen software engineers, developers and academic researchers. Those experts said some of the invented text — known in the industry as hallucinations — can include racial commentary, violent rhetoric and even imagined medical treatments.

Experts said that such fabrications are problematic because Whisper is being used in a slew of industries worldwide to translate and transcribe interviews, generate text in popular consumer technologies and create subtitles for videos.

More concerning, they said, is a rush by medical centers to utilize Whisper-based tools to transcribe patients’ consultations with doctors, despite OpenAI’s warnings that the tool should not be used in “high-risk domains.”

  • ShittyBeatlesFCPres@lemmy.world
    3 days ago

    Why is generative AI even needed for audio transcription? We’ve had decent voice recognition tools for years even on cheap consumer grade stuff.

      • InverseParallax@lemmy.world
        3 days ago

        Because with normal algorithms you have someone to blame.

        AI is a trick to hide when you steer the results the way you want.

    • ayyy@sh.itjust.works
      3 days ago

      No, we really haven’t had on-device voice recognition that meets any definition of “decent”. Anything reasonable phones out to “the cloud” for decent voice recognition.

      • LavenderDay3544@lemmy.world
        2 days ago

        So? I’d rather have my software talk to a server than be downright wrong just so another business can climb onto the AI bandwagon.

        • Szyler@lemmy.world
          23 hours ago

          You can’t do that with personal information like what doctors need transcribed. It has to be local.

    • TheBlackLounge@lemm.ee
      3 days ago

      Whisper really is a lot better when it works, and it’s free. The problem is that it refuses to produce gibberish or give up when it doesn’t work. You’ll always need an editor.

      • The Assman@sh.itjust.works
        3 days ago

        The toaster oven I just invented works much better than a traditional one. It reheats French fries perfectly, you can dehydrate in it, makes succulent roasted chicken, and about 2.5% of the time it burns down your house. You’ll always need to keep an eye on it to make sure that doesn’t happen. Remember though, much better than a traditional one.

        • TheBlackLounge@lemm.ee
          2 days ago

          You need an editor for traditional transcription tools too :) and it’s A LOT more work. They don’t even do punctuation or names.

      • wdx@feddit.org
        3 days ago

        This definition of “better” feels like claiming that a beeper that’s constantly hooked to power is the perfect alarm because it warns you every time someone is trying to break in - while entirely ignoring that it is just constantly blaring.

        • TheBlackLounge@lemm.ee
          2 days ago

          I use it for generating subtitles. It figures out context, it ignores stuttering, it does punctuation etc. It really is just better. With clean audio it transcribes like a human does.

          It does better than other techniques with dirty audio, but when it fails it fails weird, which is the big issue here.