Most AI programs we audit fail for the same handful of reasons. Here is the short list, and the questions to ask before the first dollar is spent.
When a client calls us about an AI program that isn't going well, the first thing we look at is rarely the model. In most cases the model is fine. What's broken is something less interesting and more important: how the work around the model is organized.
We've looked at enough of these now to see the same problems show up in the same order. The team can demo something impressive. They have a sponsor, a budget, and a roadmap. What they cannot tell you is who owns the system the Tuesday morning after it goes live, or what "working" looks like as a number both sides have signed off on. That gap is where most programs quietly stop mattering.
The five questions we ask first
Before we get into architecture, vendors, or fine-tuning, we ask five things. They sound obvious. Most teams cannot answer all five on the spot.
- What does success look like as a number, and who has agreed to that number?
- What workflow does this replace, augment, or sit alongside?
- Who owns the system once we leave the room?
- What is the cost of being wrong, and how often is that acceptable?
- Where is the evaluation data going to come from, and is any of it labeled yet?
If a team can confidently answer at least four of the five, the program is ready for real money. If they can't, it isn't. That isn't a judgment about the team. It just means the project is still a hypothesis, and the right next step is a small discovery sprint rather than a procurement.
The failures nobody talks about
When AI programs fail loudly, people learn from them. The pilot misses an important customer, the demo crashes in front of the board, the executive sponsor leaves. Those failures get postmortems.
The harder failures are the ones that happen quietly. The model is right most of the time, and nobody looks too closely at the cases it gets wrong. The pilot was scoped to make the system look good, not to stress-test it. The team that built the prototype is not the team that has to live with it. Six months in, the program isn't cancelled. It is just no longer where the energy is. We've watched this happen often enough to recognize the early signals: the review meetings stop including the operators, the metrics drift toward usage instead of outcomes, and the team starts shipping features instead of fixing wrong answers.
What working programs share
The AI programs that hold up past their first year share three boring things.
There is an evaluation harness the operators themselves can run, without a data scientist in the loop. There is a clear escalation path for the cases the model gets wrong, with a real person who owns those cases and feeds the fix back into the system. And there is a plan for what happens when the model goes dark for a week, because eventually it will. If any one of those three is missing, you are not running a system. You are running a demo with a longer uptime than usual.
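For readers who want something concrete, here is a minimal sketch of what an operator-runnable harness could look like. It is an illustration, not a description of any client's system: the names (Case, Report, run_eval, toy_model), the made-up example cases, and the 0.9 target are all assumptions. The shape is what matters. Labeled cases go in, one agreed number comes out, and the wrong answers are listed by id so the person who owns them can pick them up.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Case:
    """One labeled example: an input and the answer operators agreed is correct."""
    case_id: str
    prompt: str
    expected: str


@dataclass
class Report:
    accuracy: float
    passed: bool
    escalations: List[str]  # ids of the cases the model got wrong


def run_eval(cases: List[Case], predict: Callable[[str], str], target: float) -> Report:
    """Score `predict` against the labeled cases and compare to the agreed target."""
    if not cases:
        raise ValueError("no labeled cases: the evaluation data question is still open")
    wrong = [
        c.case_id
        for c in cases
        if predict(c.prompt).strip().lower() != c.expected.strip().lower()
    ]
    accuracy = 1 - len(wrong) / len(cases)
    return Report(accuracy=accuracy, passed=accuracy >= target, escalations=wrong)


# Hypothetical stand-in for the real model call.
def toy_model(prompt: str) -> str:
    return "refund" if "money back" in prompt else "other"


# Made-up cases for illustration only.
cases = [
    Case("c1", "I want my money back", "refund"),
    Case("c2", "Where is my parcel?", "shipping"),
    Case("c3", "Please give me my money back now", "refund"),
    Case("c4", "Change my address", "other"),
]

# The target is an assumed figure; in practice it is the number both sides signed off on.
report = run_eval(cases, toy_model, target=0.9)
print(f"accuracy={report.accuracy:.2f} passed={report.passed} escalate={report.escalations}")
```

In this toy run the model misses one case in four, so the harness reports a fail against the assumed target and names c2 as the case to escalate. A real harness would need a richer notion of "correct" than string matching, but the operator-facing contract can stay this small.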
A note on cost
We get asked about token economics on almost every call. Token cost is almost never the constraint that matters. Engineering cost is. The budget that lands you in production is a fraction of the budget that keeps you there. Plan for the cost of year two of operation, not the cost of standing up the first prototype.
