Field Note · 9 min read

Why most AI pilots stall before production

The pilot looked great in the demo, then died in the handoff. After fifty of these, the failure modes are almost always the same six.

Every team we talk to has a pilot in their drawer. A demo that worked, a notebook that impressed someone in a Friday review, a Slack bot a developer built over a weekend. Almost none of it is in production. The reason is rarely the model. It is rarely the prompt. It is almost never the vector database. The reason an AI pilot stalls is the same reason a regular software project stalls - but compressed, because the demo is so convincing that everyone skips the part where they plan a real system.

We have shipped, salvaged, and post-mortemed enough of these to see the pattern. There are six reasons pilots die between the demo and the deployment. Most failed projects hit at least three of them. If you are about to greenlight an AI initiative, read this first.

1. The pilot was scoped to the easy half

Pilots almost always solve the part of the problem that is already clean. A handful of golden documents. A curated dataset. A tidy slice of the inbox someone hand-picked because the model handled it well. The hard half is the long tail: the malformed PDFs, the customers who use the wrong subject line, the rep who copy-pastes from a 2014 template, the contract that has a footnote that reverses the main clause.

Production lives in that long tail. A model that handles 80% of cases and silently mangles 20% is not a 20% problem - it is a 100% trust problem, because nobody downstream knows which case they are looking at. The team that built the pilot saw the 80%. The team that has to use it sees the 20%, every day, until they stop using it.

What to do instead: scope the pilot to the messy half from day one. Pull the worst ten examples your operators have complained about. If the pilot can handle those, you have a product. If it can only handle the golden set, you have a screenshot.

2. There is no owner once the demo ends

A pilot belongs to whoever championed it. A production AI system needs a team that wakes up if it breaks at 2am, a team that owns the prompt the way another team owns a microservice, a team that can say no to the next feature request. Most pilots stall because no one ever signed up for that pager.

The pattern is predictable. A senior engineer or a product manager builds the prototype. It works. Leadership gets excited. The builder moves on to the next shiny thing. The pilot sits in a repo with no oncall rotation, no eval suite, no runbook. Six months later, someone asks why we never shipped it. The honest answer is that nobody owned it past the applause.

What to do instead: name the team that will own the system in production before you write the first line of pilot code. Not the team that builds it. The team that runs it. If no team will sign up, that is your signal that the project is not yet real.

3. The cost model never got drawn

At pilot scale, you do not notice that each request costs a quarter. You run the demo a hundred times, the bill is twenty-five bucks, nobody flinches. At production scale, it is ten thousand requests a day, the bill is two thousand five hundred dollars, and the CFO is in your inbox by Wednesday.

The cost surprise is rarely just tokens. We have seen pilots blow their budget on retries (see our note on the hidden cost of always-on LLM calls), on long context windows that grew without anyone noticing, on tool-using agents that decided to call the same API forty times in a single trace. None of this shows up in the demo because the demo runs once.

What to do instead: draw a per-action cost curve the same week you draw the prompt. Project it to your target volume. Decide what an acceptable unit cost looks like before usage scales, not after. If the unit economics do not work at 10x volume, the project is not ready, no matter how good the output looks.
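
If you want a starting point, the sketch below projects a per-action cost at 1x and 10x volume. Every number in it - the prices, token counts, retry rate, and volumes - is a placeholder; swap in your own before showing it to anyone who holds a budget.

```python
# Back-of-the-envelope per-action cost projection. Every number below is a
# placeholder (assumed prices, token counts, retry rate, volumes): use yours.

PRICE_PER_M_INPUT = 3.00    # $ per million input tokens (assumed)
PRICE_PER_M_OUTPUT = 15.00  # $ per million output tokens (assumed)

def cost_per_action(input_tokens, output_tokens, retry_rate=0.1, calls_per_action=1):
    """Cost of one user-facing action, including retries and repeated model calls."""
    one_call = (input_tokens / 1e6) * PRICE_PER_M_INPUT \
             + (output_tokens / 1e6) * PRICE_PER_M_OUTPUT
    return one_call * calls_per_action * (1 + retry_rate)

if __name__ == "__main__":
    unit = cost_per_action(input_tokens=6_000, output_tokens=800,
                           retry_rate=0.15, calls_per_action=3)
    for daily_volume in (1_000, 10_000):  # 1x and 10x target volume
        daily = unit * daily_volume
        print(f"{daily_volume:>6}/day  ->  ${unit:.3f} per action, "
              f"${daily:,.0f}/day, ${daily * 30:,.0f}/month")
```

Run it at your real target volume before someone in finance does the multiplication for you.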

4. Evaluation never moved from vibes to numbers

Pilots are evaluated by stakeholders looking at outputs and nodding. That is fine for a pilot. It is not a launch criterion. Production needs an eval suite - a fixed set of representative inputs, a defined notion of correctness, and a number that goes up or down when you change the prompt or the model.

Without that suite, every change is a coin flip. You upgrade the model and quietly lose a third of your accuracy on edge cases. Nobody notices for six weeks because nobody is running the eval. By the time the regression surfaces, three more changes have shipped and you cannot tell which one broke things.

Eval suites do not need to be fancy. A hundred examples, graded on a small rubric, run on every prompt change, with the score posted somewhere visible. That is enough to catch most regressions before they ship and to give you a real answer when leadership asks how it is doing.
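
A minimal harness is a few dozen lines. The sketch below assumes a cases.jsonl file of frozen examples and a run_pipeline stand-in for whatever your system actually does; the exact-match grading is the simplest possible rubric and is there to be replaced.

```python
# Minimal eval harness: a frozen set of cases, a grading rule, and one number
# that moves when the prompt or model changes. `run_pipeline` is a stand-in.
import json

def run_pipeline(input_text: str) -> str:
    """Stand-in for the real prompt/model/tool pipeline under test."""
    raise NotImplementedError("call your actual system here")

def grade(expected: str, actual: str) -> bool:
    """Simplest possible rubric: normalized exact match. Replace with your own."""
    return expected.strip().lower() == actual.strip().lower()

def run_eval(path: str = "cases.jsonl") -> float:
    with open(path) as f:
        cases = [json.loads(line) for line in f]          # ~100 frozen examples
    passed = sum(grade(c["expected"], run_pipeline(c["input"])) for c in cases)
    score = passed / len(cases)
    print(f"eval: {passed}/{len(cases)} passed ({score:.0%})")
    return score

if __name__ == "__main__":
    # Fail loudly if the score drops below the pass rate the business agreed to.
    assert run_eval() >= 0.90, "eval regression: below agreed pass rate"
```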

5. The integration was a slide, not a shipped surface

A demo is a script that someone clicks through. Production is an API that another team depends on, a queue that has to drain before the next batch arrives, a UI that has to handle a user who pastes 80,000 characters of HTML into a free-text field. The integration work - auth, rate limits, idempotency, observability, error pages - is rarely in scope for the pilot and almost never on the roadmap.

We have seen pilots stall because the answer to “how does this connect to Salesforce” was a hand-wave. We have seen pilots stall because nobody could decide which team owned the webhook. The model worked. The plumbing was a quarter's worth of engineering work that no one had budgeted for.

What to do instead: include the integration in the pilot scope. Even a thin version. Make the pilot run from a real input source and write to a real output destination, not a notebook cell. The plumbing is where the project lives or dies.
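
Thin can be genuinely thin. The sketch below is one shape it can take: a message handler that is safe to retry, with process and write_result as hypothetical stand-ins for your own model call and output destination.

```python
# Thin-slice integration sketch: read from a real input source, write to a real
# output destination, and make the handler safe to retry. `process` and
# `write_result` are hypothetical stand-ins for your own model call and sink.
import hashlib

_seen: set[str] = set()  # in production: a durable store, not process memory

def idempotency_key(message: dict) -> str:
    """Stable key so a redelivered message is not processed (or billed) twice."""
    return hashlib.sha256(message["id"].encode()).hexdigest()

def handle(message: dict) -> None:
    key = idempotency_key(message)
    if key in _seen:
        return                             # duplicate delivery: skip, no double-write
    result = process(message["body"])      # the model/prompt step from the pilot
    write_result(message["id"], result)    # real destination: DB row, CRM field, reply
    _seen.add(key)                         # mark done only after the write succeeds

def process(body: str) -> str:
    raise NotImplementedError("the pilot's model call goes here")

def write_result(msg_id: str, result: str) -> None:
    raise NotImplementedError("the real output write goes here")
```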

6. Stakeholders signed off on the demo, not the system

The demo answered one question - can the model do the thing - and stakeholders signed off as if it had answered all the others. Can it do the thing reliably. Can it do it cheaply. Can it do it without exposing PII. Can it do it inside our security review. Can our auditors accept it. Can we explain it to a customer if it is wrong.

Each of those questions is a separate workstream. Treating a successful demo as a green light for production is the single most expensive mistake we see. It compresses six months of governance, security, and operations work into a two-week sprint that inevitably misses something, which is why so many pilots end up “90% done” for the better part of a year.

What to do instead: turn the sign-off into a production-readiness checklist. The demo unlocks the next phase, not the launch.

The production-readiness checklist we actually use

Before any AI system we have built moves from pilot to production, it has to clear the same short list:

  • A named oncall team with a documented runbook for the top three failure modes.
  • A frozen eval suite with at least 100 examples and a target pass rate that has been agreed by the business owner.
  • A unit cost projection at 1x and 10x current volume, signed off by the budget holder.
  • A monitoring dashboard with cost per successful action, p95 latency on the user path, and eval pass rate over time.
  • A real integration - a webhook, a queue, an API - not a notebook.

Five items. None of them is glamorous. All of them are why the thing is still running a year later instead of sitting in a repo with the README half-written.
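
The dashboard item is the one teams most often ask where to start. The sketch below computes two of those numbers from structured per-request logs; the field names (success, cost_usd, latency_ms) are assumptions about what your logging already records.

```python
# Minimal sketch of the dashboard numbers: cost per successful action and p95
# latency, computed from structured per-request logs. The log field names
# (success, cost_usd, latency_ms) are assumptions about what you record.
import math

def dashboard_numbers(records: list[dict]) -> dict:
    successes = [r for r in records if r["success"]]
    total_cost = sum(r["cost_usd"] for r in records)        # failures still cost money
    latencies = sorted(r["latency_ms"] for r in records)
    p95 = latencies[max(0, math.ceil(0.95 * len(latencies)) - 1)]
    return {
        "cost_per_successful_action": total_cost / max(len(successes), 1),
        "p95_latency_ms": p95,
        "success_rate": len(successes) / len(records),
    }

if __name__ == "__main__":
    sample = [
        {"success": True,  "cost_usd": 0.22, "latency_ms": 1800},
        {"success": True,  "cost_usd": 0.31, "latency_ms": 2400},
        {"success": False, "cost_usd": 0.58, "latency_ms": 9100},  # retries cost too
    ]
    print(dashboard_numbers(sample))
```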

What to do this week

If you have a pilot stuck somewhere between “works on my machine” and “in front of customers,” pick the one item from the checklist you have not done and do it this week. Most of the time it is the eval suite. Sometimes it is the cost projection. Occasionally it is the awkward conversation about who owns it. The first item you address is usually the one that has been blocking everything else.

AI pilots do not stall because the technology failed. They stall because the unglamorous parts of shipping software did not get budgeted. The teams that get past this stage are the ones that decide, before the demo, what production actually means. Everyone else ends up with a demo.

Related reading: how we choose between agents and workflows when scoping a project, and the hidden cost of always-on LLM calls we keep finding in production systems.