For one month beginning on October 5, I ran an experiment: Every day, I asked ChatGPT 5 (more precisely, its “Extended Thinking” version) to find an error in “Today’s featured article”. In 28 of these 31 featured articles (90%), ChatGPT identified what I considered a valid error, often several. I have so far corrected 35 such errors.

  • Ace@feddit.uk
    8 hours ago

    If you read the post, it’s actually quite a good method. Having an LLM flag potential errors and then reviewing the flags manually as a human is quite productive.

    I’ve done exactly that on a project that relies on user-submitted content. Moderating submissions at even a moderate scale is hard, but having an LLM look through them for me is easy. I can then check anything it flags and moderate manually. Neither recall nor precision is perfect, but both are high enough to be useful, so it’s a low-effort way to find a decent number of the things you’re looking for. In my case I was looking for abusive submissions from untrusted users; in the OP author’s case they were looking for errors. I’m quite sure this method would never find all errors, and as per the article the “errors” it flags aren’t always real errors either. But the reward-to-effort ratio is high on a task that would otherwise be infeasible.
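
    For anyone who wants to try it, the flag-then-review loop could look roughly like the sketch below. This is not my actual code: `keyword_flagger` is a hypothetical stand-in for the LLM call (a plain keyword check so the example runs offline), and the submissions and the `truly_bad` labels are made up for illustration. The point is only that a human reviews the flagged subset, and that precision (how many flags were right) and recall (how many bad items got flagged) can both be imperfect while the queue is still much shorter.

    ```python
    from typing import Callable, Iterable


    def triage(submissions: Iterable[str],
               looks_suspicious: Callable[[str], bool]) -> list[str]:
        """Return only the submissions the flagger marks, for a human to review."""
        return [s for s in submissions if looks_suspicious(s)]


    def precision_recall(flagged: set[str], truly_bad: set[str]) -> tuple[float, float]:
        """Precision: share of flags that were right. Recall: share of bad items flagged."""
        true_positives = len(flagged & truly_bad)
        precision = true_positives / len(flagged) if flagged else 0.0
        recall = true_positives / len(truly_bad) if truly_bad else 0.0
        return precision, recall


    def keyword_flagger(text: str) -> bool:
        """Hypothetical stand-in for an LLM call: flags text containing certain words."""
        return any(word in text.lower() for word in ("idiot", "scam"))


    # Made-up example data; truly_bad is what a human would decide after reading everything.
    submissions = [
        "Great recipe, thanks!",
        "You idiot, this is wrong",
        "Total scam site",
        "Subtle insult here, genius",
        "Is this a scam? Asking honestly",
    ]
    truly_bad = {
        "You idiot, this is wrong",
        "Total scam site",
        "Subtle insult here, genius",
    }

    # The human only reads this queue instead of every submission.
    review_queue = triage(submissions, keyword_flagger)
    precision, recall = precision_recall(set(review_queue), truly_bad)
    print(f"review {len(review_queue)} of {len(submissions)} items")
    print(f"precision={precision:.2f} recall={recall:.2f}")
    ```

    Here the flagger misses the subtle insult and wrongly flags an honest question, so neither number is perfect, but the manual pass still decides what actually gets moderated.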