

AI researchers are rapidly embracing AI reviews, with the new Stanford Agentic Reviewer. Surely nothing could possibly go wrong!
Here’s the “tech overview” from their website.
Our agentic reviewer provides rapid feedback to researchers on their work to help them to rapidly iterate and improve their research.
The inspiration for this project was a conversation that one of us had with a student (not from Stanford) that had their research paper rejected 6 times over 3 years. They got a round of feedback roughly every 6 months from the peer review process, and this commentary formed the basis for their next round of revisions. The 6 month iteration cycle was painfully slow, and the noisy reviews — which were more focused on judging a paper’s worth than providing constructive feedback — gave only a weak signal for where to go next.
How is it that whenever people argue for the magical benefits of AI on a task, it always comes down to “well actually, humans suck at the task too! Look, humans make mistakes!”? That seems to be the only way they can justify the fact that AI sucks. At least it spews garbage fast!
(Also, this is a little mean, but if someone’s paper got rejected 6 times in a row, perhaps it’s time to throw in the towel, accept that the project was never that good in the first place, and try better ideas. Not every idea works out, especially in research.)
When modified to output a 1-10 score by training to mimic ICLR 2025 reviews (which are public), we found that the Spearman correlation (higher is better) between one human reviewer and another is 0.41, whereas the correlation between AI and one human reviewer is 0.42. This suggests the agentic reviewer is approaching human-level performance.
Actually, all my concerns are now completely gone. They found that one number is bigger than another number, so I take back all of my counterarguments. I now have full faith that this is going to work out.
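For anyone wondering what that number actually measures: Spearman correlation is just the Pearson correlation of the ranks, so it only asks whether two reviewers order the papers similarly. Here is a minimal sketch of the calculation. The scores below are made up for illustration (they are not ICLR data), the function names are my own, and this is not the Stanford team's code.

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the whole run of values tied with order[i].
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical 1-10 scores from two reviewers on the same six papers.
reviewer_a = [3, 5, 6, 8, 6, 4]
reviewer_b = [5, 3, 6, 6, 8, 5]
print(round(spearman(reviewer_a, reviewer_b), 2))
```

A value of 1 means the two reviewers rank every paper in the same order, and 0 means their rankings are unrelated. Nothing in the calculation checks whether either ranking is any good.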
Reviews are AI generated, and may contain errors.
We had built this for researchers seeking feedback on their work. If you are a reviewer for a conference, we discourage using this in any way that violates the policies of that conference.
Of course, we need the mandatory disclaimers that will definitely be enforced. No reviewer will ever be a lazy bum and use this AI for their actual conference reviews.


Yeah, it’s not like reviewers can just write “This paper is utter trash. Score: 2” unless ML is somehow an even worse field than I previously thought.
They referenced someone who had a paper rejected from conferences six times, which to me is an indication that their idea just isn’t that good. I don’t mean this as a personal attack; everyone has bad ideas. It’s just that at some point, you have to cut your losses with a bad idea and instead use your time to develop better ones.
So I am suspicious that when they say “constructive feedback”, they don’t mean “how do I make this idea good” but instead “what are the magic words that will get my paper accepted into a conference”. ML has become a cutthroat publish-or-perish field, after all. It certainly won’t help that LLMs are effectively trained to glaze the user at all times.