Two of today's stories — the Hugging Face eval-cost piece and OpenAI's goblins post-mortem — point to the same uncomfortable truth: measuring whether an AI tool is actually doing its job is now one of the hardest problems in the industry. And most mid-market businesses aren't doing it at all.
Here's the pattern I see in the data. A company picks a use case — say, auto-generating customer quotes or summarizing inspection reports. They build or buy a tool, test it manually for a week, decide it's "good enough," and move on. Six months later, nobody's checked whether the outputs are still accurate, whether the model's behavior has drifted, or whether the tool is actually saving time versus creating new review work.
This isn't laziness. It's a resource problem. Rigorous evaluation — the kind that catches a GPT-5 going full goblin before your customers see it — requires defining what "good" looks like for your specific use case, building test sets, running them regularly, and having someone review the results. That's expensive even for well-funded AI labs. For a 60-person manufacturer running a single AI tool, it can feel impossible.
The Hugging Face research quantifies what many teams already feel: eval infrastructure is becoming the bottleneck. Not compute, not model access, not even talent — the ability to measure quality at scale. Their data shows eval costs growing faster than training costs for many organizations.
So what should a mid-market business actually do?
First, accept that "we tried it and it seemed fine" isn't an eval strategy. It's hope.
Second, start simple. You don't need a Stanford-grade benchmark suite. Pick your AI tool's five most common tasks. Create ten examples of each with known-good answers. Run them monthly. Track whether accuracy goes up, down, or sideways. That's a spreadsheet, not a software project.
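To make that concrete, here's a minimal sketch of what the monthly check could look like if you ever outgrow the spreadsheet. Everything in it is illustrative: `run_tool` stands in for however you actually invoke your AI system, and the CSV layout (task, input, expected) is my assumption, not a standard format.

```python
import csv
from datetime import date

def monthly_eval(run_tool, cases_path="eval_cases.csv", log_path="eval_log.csv"):
    """Run a fixed set of known-answer cases through an AI tool and
    append per-task pass rates to a running log. `run_tool` is whatever
    callable wraps your system: input string in, output string out."""
    with open(cases_path, newline="", encoding="utf-8") as f:
        cases = list(csv.DictReader(f))  # columns: task, input, expected

    # Tally passes per task. "Pass" here is a crude containment check;
    # swap in whatever definition of "good" fits your use case.
    tallies = {}
    for case in cases:
        output = run_tool(case["input"])
        passed = case["expected"].strip().lower() in output.lower()
        hits, total = tallies.get(case["task"], (0, 0))
        tallies[case["task"]] = (hits + int(passed), total + 1)

    # Append today's numbers so month-over-month trends stay visible.
    with open(log_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for task, (hits, total) in sorted(tallies.items()):
            writer.writerow([date.today().isoformat(), task, hits, total])
```

Fifty rows in a CSV and a calendar reminder will get you further than the formal eval program that never ships.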
Third, build eval checkpoints into your vendor relationships. If you're paying for an AI-powered tool, ask the vendor: how do you measure output quality? How often? What happens when it degrades? If they can't answer clearly, that tells you something.
The goblins story is entertaining, but the lesson underneath it is serious. OpenAI caught the problem because the drift was dramatic and public. Most drift isn't. It's a chatbot that gets 3% less accurate per month. It's a document summarizer that starts omitting key details. It's subtle, and it compounds.
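To put a number on "compounds," here's the back-of-the-envelope version. The 95% starting point and the 3% monthly loss are hypothetical, chosen only to show the shape of the problem:

```python
# Hypothetical: a tool launches at 95% accuracy and loses 3% of what's
# left every month. No single month looks alarming on its own.
accuracy = 0.95
for month in range(1, 13):
    accuracy *= 0.97
    print(f"month {month:2d}: {accuracy:.1%}")
# By month 12 the tool sits at ~65.9%: a third of its quality gone,
# with no dramatic, goblin-grade failure along the way to catch.
```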
You can't manage what you can't measure. And right now, most companies aren't measuring.