A model scores 90% on a reasoning benchmark.

A month later it fails a basic logic task in production. The company updates the benchmark. The score goes up again.

Nobody calls this fraud. It has a gentler name: evaluation drift.

The benchmarks that the AI industry uses to claim progress were designed for a different problem. They were built to measure narrow, well-defined tasks under controlled conditions. They were not built to measure whether a model is useful to a business trying to deploy it in the real world.

That gap has been known since at least 2023. It has not been fixed. It has been managed.

The management strategy is straightforward. When a model underperforms on a benchmark, the benchmark gets updated or retired. When a new model is released, it is evaluated on tests the lab designed, using test sets the lab controls. Independent evaluators exist but lack the compute to run the same tests at the same scale.

The result is a measurement system that tells you very little about what you are actually buying.

This matters more than it sounds.

Enterprise contracts for AI infrastructure are being signed right now based on benchmark performance. Governments are writing AI procurement policies that reference benchmark scores as capability thresholds. Investors are pricing AI companies partly on leaderboard position.

All of that is downstream of a measurement system that the people being measured are also designing.

Stuart Russell, whose work on AI safety predates most of the current debate, has argued for years that the field optimises for the metric rather than the underlying capability. That distinction is not academic. A model that scores well on a benchmark and fails in deployment is not a capable model. It is a well-optimised test-taker.

The practical consequence for builders is specific.

If you are building a product on top of a frontier model, benchmark scores tell you almost nothing about performance on your actual use case. The only evaluation that matters is the one you run on your own data, with your own prompts, against your own definition of success.
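As a rough illustration of what such an evaluation can look like, here is a minimal Python sketch. Everything in it is hypothetical: `EvalCase`, `pass_rate`, the two stand-in models, and the sample prompts are invented for the example, and a real setup would call an actual model API and use cases drawn from your own production data.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One prompt plus your own definition of success for its output."""
    prompt: str
    passes: Callable[[str], bool]


def pass_rate(model: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Return the fraction of cases whose model output passes its check."""
    if not cases:
        raise ValueError("need at least one eval case")
    passed = sum(1 for case in cases if case.passes(model(case.prompt)))
    return passed / len(cases)


# Stand-in "models" (assumptions for the sketch). In practice each of
# these would wrap a call to a real model endpoint.
def model_a(prompt: str) -> str:
    if "refund" in prompt.lower():
        return "Refund approved within 30 days."
    return "I don't know."


def model_b(prompt: str) -> str:
    return "Please contact support."


# Hypothetical cases; replace with prompts and checks from your own use case.
cases = [
    EvalCase("Customer asks: can I get a refund after 10 days?",
             lambda out: "refund" in out.lower()),
    EvalCase("Customer asks: where is my order #123?",
             lambda out: "order" in out.lower()),
]

candidates = {"model_a": model_a, "model_b": model_b}
scores = {name: pass_rate(fn, cases) for name, fn in candidates.items()}

for name, score in scores.items():
    print(f"{name}: {score:.0%} of cases passed")
```

The point of the design is that the success checks live with you, not with the vendor. Swapping in a new model means rerunning the same cases and comparing the numbers, whatever the leaderboard says.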

This is not complicated. It is also not what most builders do. Most builders read the release announcement, check the leaderboard position, and choose the highest-ranked model.

That is the same as hiring someone on the strength of a claimed GPA without ever giving them a work-sample test.

The builders who will have a durable edge in the next 18 months are the ones who build internal evaluation systems early. Not because they distrust the labs. Because trust without verification is not a strategy. It is a dependency.

The benchmark problem is unlikely to be solved by the labs.

Their incentive runs in the other direction. The most credible pressure is coming from enterprise buyers who are starting to write evaluation requirements into procurement contracts, and from European regulators who are building independent assessment frameworks under the AI Act.

Neither of those moves fast. In the meantime, the leaderboard keeps moving and the press release keeps saying "state of the art."

Read past the score.

404 Found covers AI developments from a European insider's perspective, three times a week.
