Benchmarks Considered Boring
I think about AI, write about it, and work with it daily. I've also repeatedly mentioned the importance of objective evaluations for your AI tools. So you might assume that I'm constantly checking the various leaderboards and benchmarks:
Who's on top of Chatbot Arena?
Is Opus-4 or o4-mini better at the FrontierMath benchmark?
"Gemini 2.5 Pro just topped SimpleBench! Here's why that's a big deal."
Honestly, I never really cared about them. Sure, having a rough sense of who's coming out with new models demonstrating generally improved capabilities is good. For that, it's enough to keep a very loose finger on the pulse. If at any point a model makes a really big splash, you're guaranteed to hear about it. No need to compulsively refresh the AI Benchmarking Dashboard.
But beyond larger trends, much of the leaderboard news is just noise. OpenAI comes out with a new model and tops the leaderboard. "This is why OpenAI is the leader and Google missed the boat," the pundits proclaim. Next week, Google releases its own and it's all "Oh baby, Google is so back!"
Besides, leaderboard position is a poor basis for an informed choice:
The best model at a benchmark might fare poorly on the specific task you intend to use it for.
Or maybe it's "the best" in output quality, but you can't use it for your purpose due to... cost, deployment considerations, compliance issues, latency, etc.
Instead of obsessing over rankings that collapse a multi-faceted decision into a single number, you'd do best to consider all these tradeoffs holistically and then, combined with a benchmark custom-built for your use case, make an informed decision.
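To make that concrete, here's a minimal sketch of what a use-case-specific benchmark harness could look like. Everything in it is hypothetical: `call_model` is a stand-in for whatever API client you actually use, the model names are placeholders, and the keyword grader is the crudest possible scoring rule. The point is simply that the test cases come from your own workload rather than a public leaderboard.

```python
# Minimal sketch of a custom eval harness (all names below are placeholders).
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    prompt: str
    expected_keyword: str  # crude pass/fail signal; swap in your own grader


def call_model(model_name: str, prompt: str) -> str:
    # Placeholder: replace with a real call to your provider's API client.
    return f"[{model_name} response to: {prompt[:30]}...]"


def run_eval(model_name: str, cases: list[EvalCase],
             grader: Callable[[str, EvalCase], bool]) -> float:
    """Return the fraction of cases the model passes."""
    passed = sum(grader(call_model(model_name, c.prompt), c) for c in cases)
    return passed / len(cases)


def keyword_grader(answer: str, case: EvalCase) -> bool:
    return case.expected_keyword.lower() in answer.lower()


if __name__ == "__main__":
    # Cases drawn from *your* actual tasks, not someone else's benchmark suite.
    cases = [
        EvalCase("Summarize this support ticket: ...", expected_keyword="refund"),
        EvalCase("Extract the invoice total from: ...", expected_keyword="$"),
    ]
    for model in ["model-a", "model-b"]:  # hypothetical model names
        score = run_eval(model, cases, keyword_grader)
        print(f"{model}: {score:.0%} on my tasks")
```

Even a toy harness like this forces the questions that matter: which tasks you actually care about, what "good enough" means for them, and which model clears that bar once cost and latency are factored in.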