Yeah, right? I tried it yesterday to build a simple form for me. I told it to look at the structure of other forms for reference, which it did, and somehow it used NONE of the UI components and helpers from the other forms. It was bafflingly bad.
Despite the “official” coding score for GPT-5 being higher, Claude Sonnet still seems to blow it out of the water. That suggests they’re training to the test and the test must not be a very good one. Or they’re lying.
They’d never be lying! Look at these beautiful graphs from their presentation of GPT-5. They’d never!
Source: https://www.theverge.com/news/756444/openai-gpt-5-vibe-graphing-chart-crime
Wut… did GPT-5 evaluate itself?
Now that we have vibe coding and all programmers have been sacked, they’re apparently trying out vibe presenting and vibe graphing. Management, watch out, you’re obviously next!
The problem with the “benchmarks” is Goodhart’s Law: once a measure becomes a target, it ceases to be a good measure.
The AI companies’ obsession with these tests causes them to maniacally train on them, making them better at those tests, but that doesn’t necessarily map to real-world usefulness. Occasionally you’ll see a guy who interviews well but is pretty useless on the job. LLMs are basically that all the time, but at least they’re useful, because they’re cheap and fast enough to be worth it for the super easy bits.