TL;DR
Most AI projects get funded for the build and starved for the version that works in production. The MVP gets you to a demo at 60 to 70 percent usefulness. Getting it to the point where it produces real ROI costs about the same again, and that money goes into feedback loops, fine-tuning, UX rework, and the quiet integration work nobody sees. For mid-market operators staring at a pilot proposal right now, this is the difference between an AI project that lives on a slide deck and one that shows up in the P&L.
The Build Part Got Easy
Here’s something worth saying out loud: the MVP is no longer the hard part.
Models are commoditized, and orchestration is mature. A competent team can ship a demoable AI product in a matter of weeks, for a fraction of what the same thing cost eighteen months ago. An internal copilot. A CX triage agent. A document summarizer for the legal team. A dashboard that answers questions in plain English.
And the demos look great. That is the problem.
The Second X Nobody Puts in the Budget
The MVP solves part of the problem. It does not solve the whole problem at a level the business can trust.
Getting from “demoable” to valuable-enough-to-deploy roughly doubles the original cost. Call it the second X. It is the predictable, structural cost of making an AI product work in production, and almost no proposal on the market discloses it.
The pattern shows up everywhere you look. MIT’s NANDA initiative reported that roughly 95 percent of enterprise generative AI pilots fail to deliver measurable P&L impact. Gartner has been publishing post-pilot stall rates in the same zip code for two years running. Both are independent reads on what happens after the launch email goes out.
Here’s the thing, though. Those pilots stalled because the budget ended at launch. The models were fine.
Iteration Is the Product
The second X is where the product becomes real.
Production-data fine-tuning does the heaviest lifting. You retrain or re-prompt the system against the messy questions real users ask, instead of the clean examples the MVP was built on. UX rework takes a bigger bite than most teams expect, because what works in a demo breaks the moment a frontline employee uses it for the third time. Evals infrastructure tells you whether this week’s version is better than last week’s. The boring name hides how much it matters. Without evals, you are tuning blind.
Some of the budget pays for the work of noticing when the foundation model updates and your product quietly starts behaving differently. That is called drift. Every team eventually meets it, and most meet it in a customer complaint.
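If you want to picture what the evals and drift line items actually buy, here is a minimal sketch, assuming a Python stack. Everything specific in it is a placeholder: call_model stands in for however your product is invoked, golden_set.jsonl is whatever set of real user questions and approved answers your team maintains, and exact-match scoring is the crudest rubric that still produces a number. The shape is what matters: a fixed set of real questions, a score per version, and a comparison against the last version you accepted.

```python
# Minimal eval-and-drift check (illustrative): score a fixed "golden set" of
# real user questions against the current deployment and compare to the last
# accepted run. File names, the scoring rule, and call_model are placeholders.
import json
from pathlib import Path

GOLDEN_SET = Path("golden_set.jsonl")   # real user queries + approved answers
BASELINE = Path("eval_baseline.json")   # score recorded for the last accepted version

def call_model(question: str) -> str:
    # Placeholder: swap in however you actually invoke your product
    # (API call, agent run, retrieval chain). A stub keeps the sketch runnable.
    return "stub answer"

def run_evals() -> float:
    cases = [json.loads(line) for line in GOLDEN_SET.read_text().splitlines() if line.strip()]
    passed = 0
    for case in cases:
        answer = call_model(case["question"])
        # Real rubrics are usually graded (keyword checks, LLM-as-judge,
        # human review queues); exact match keeps the sketch short.
        if answer.strip() == case["expected"].strip():
            passed += 1
    return passed / len(cases)

if __name__ == "__main__":
    score = run_evals()
    baseline = json.loads(BASELINE.read_text())["score"] if BASELINE.exists() else None
    print(f"current: {score:.1%}  baseline: {baseline}")
    if baseline is not None and score < baseline - 0.02:
        # A drop after a model upgrade you never asked for is drift,
        # caught here instead of in a customer complaint.
        raise SystemExit("Regression detected: do not promote this version.")
    BASELINE.write_text(json.dumps({"score": score}))
```

Most teams outgrow exact match quickly, but even this crude version turns "is it better this week?" from an argument into a number, and it catches a silent model upgrade before a customer does.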
Onyx is the cleanest version of this pattern I have seen at Valere. They are a SaaS platform in the licensing space, and they came to us with a brittle Zapier-based agent they had built themselves. The demo worked fine. In production with enterprise licensing customers, it fell apart.
The rebuild ran straight into the second X’s line items: a chat UX that could handle complex interactions, orchestration underneath that could scale, a visual layer enterprise buyers take seriously, and the kind of guardrails that keep a complex licensing workflow from breaking under real use. What Onyx had built was the sales pitch. What they shipped with us was the product.
Read more about how their AI-native chatbot helped them unlock the $100M customer segment here: https://www.valere.io/case-study/onyx-ai/
The 2X Playbook: Where the Second Half of Your AI Budget Goes
If the MVP is the first X, here is what the second X buys. Screenshot this section and bring it to your next vendor meeting.
1. Production-data fine-tuning (15 to 20% of total project cost). Retraining or prompt-engineering against real user queries, instead of the sanitized dataset the MVP was built on. This is where accuracy climbs.
2. UX rework (10 to 15%). Input shaping, output formatting, workflow redesign. The frontline user will tell you what the demo could not.
3. Evals infrastructure (5 to 10%). The system that scores each new version against the last. Without it, every improvement is a guess.
4. Guardrails for domain-specific failure modes (5 to 10%). What happens when the model invents a dosage, a dollar figure, a case citation, or a policy number. Regulated industries spend disproportionately here and should.
5. Integration plumbing (10 to 15%). The connectors, auth layers, data pipes, and permission models that only surface once the tool touches a real system of record.
6. Feedback loop tooling (5 to 10%). Structured capture of what users accept, reject, edit, and quietly work around (a minimal sketch of this capture follows the list). This is the fuel for every future improvement.
7. Governance cadence and ownership (around 5%). A named owner, a weekly review, a documented kill switch, and a line on the roadmap past launch.
8. Version upgrades and regression management (around 5%). When the foundation model updates, your product changes whether you touched it or not. Budget for the work of catching drift and correcting it.
The percentages are directional. Your mix will shift by industry and by how much integration debt you are walking in with. The shape is what matters.
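To make item 6 a little more concrete, here is a minimal sketch of the kind of capture that line item funds, again assuming a Python backend. The field names and the JSONL log are illustrative, not a standard; the point is recording what the user actually did with the answer, not just collecting a thumbs-up.

```python
# Illustrative feedback capture: log what users do with each AI response so
# the signal can feed the golden set and the next round of tuning.
# Field names and the JSONL log are assumptions, not a standard.
import json
import time
from dataclasses import dataclass, asdict
from pathlib import Path

FEEDBACK_LOG = Path("feedback_events.jsonl")

@dataclass
class FeedbackEvent:
    session_id: str
    question: str
    model_answer: str
    action: str              # "accepted" | "rejected" | "edited" | "worked_around"
    final_text: str | None   # what the user actually shipped, if they edited it
    timestamp: float

def record_feedback(event: FeedbackEvent) -> None:
    # Append-only log; most teams eventually route this into a warehouse table.
    with FEEDBACK_LOG.open("a") as f:
        f.write(json.dumps(asdict(event)) + "\n")

# Example: a user rewrote the draft before sending it. The edited text is
# some of the most valuable training signal the product will ever get.
record_feedback(FeedbackEvent(
    session_id="demo-123",
    question="Summarize the renewal terms in this contract.",
    model_answer="The contract renews annually...",
    action="edited",
    final_text="The contract auto-renews on March 1 unless cancelled in writing...",
    timestamp=time.time(),
))
```

The edited cases are the ones worth the most: they feed the golden set in the evals sketch above and the next round of production-data tuning.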
Six Questions to Ask Before Approving an AI Pilot Budget
If you are about to sign a proposal, these six questions will tell you whether you are funding a demo or a product.
- What is the cost of the MVP, and what is the cost of getting it to 90 percent useful?
- Who owns iteration after launch, and how many hours per week?
- What does our evals setup look like on day 31?
- Which failure modes are we specifically designing guardrails for?
- How are we capturing user feedback, and who reads it?
- What is our plan for the first time the foundation model upgrades?
If the vendor cannot answer these, they are selling you the first X and hoping you do not notice the second one.
Where AI Roadmaps Keep Ending Too Early
This is a technology-adoption problem with an AI costume on.
Every meaningful shift has a gap between the demo and the deployed version. CRM in the 2000s ran this same play. The cloud migration, a decade later, ran it again at ten times the scale. The organizations that captured real value were the ones that funded the boring middle, the part between “we bought it” and “it works.” Everyone else is still running pilots.
What makes AI different is that the demo is so good that it hides the gap. A well-crafted MVP looks like a finished product. It is the opening move.
So the real question for a mid-market operator is simple. If your AI proposal ends at launch, what line item is the version that works?
FAQ
- What does it cost to take an AI pilot from MVP to production? Plan for roughly 2X the original build cost, spread across four to six months of iteration. The exact mix depends on how much integration and governance work you are carrying in.
- Why do most enterprise AI pilots stall after launch? Because nobody owns iteration, evals are missing, the feedback loop was never built, and the original team moved on to the next pilot. The model is rarely the problem.
- How should I budget for AI fine-tuning and iteration in year one? Treat the MVP cost as the first half of your real year-one budget. The second half funds production-data tuning, UX rework, evals, guardrails, and integration. If your vendor will not break this out, ask them to.
- What are the hidden costs of building AI solutions in-house? Evals infrastructure, feedback capture, governance cadence, and the engineering time to handle model upgrades. These rarely appear in an initial scope and are almost always where internal projects blow their timeline.
- How long does it take to get an AI MVP to real production accuracy? Four to six months of active iteration is a realistic floor for a scoped use case. Complex or regulated domains run longer.
- Should we extend the MVP team through iteration, or hand off to a different group? Continuity almost always wins. The team that built the MVP knows where the bodies are buried. Handing off to a maintenance group is one of the most reliable ways to lose momentum in the second X.
Key Takeaways
- The MVP is the cheap half. Getting from demo to production value roughly doubles the upfront cost, and most proposals bury that.
- Iteration is the product. Fine-tuning, UX rework, evals, and the feedback tooling that feeds them are where real usefulness lives. They cannot be scoped as polish.
- Budget for 2X, plan for 18 months. If your AI proposal only funds the build, you are funding a demo.
- Feedback loops beat feature lists. What moves an AI product from 70 to 95 percent usefulness is structured signal from real users.
- Accuracy is a product decision. The last 20 percent of accuracy usually comes from UX, guardrails, data work, and feedback tooling before it comes from a bigger model.
- If the vendor will not price the second X, find one who will. Firms that scope only the MVP are optimizing for their margin.
- The gap between pilot and production is organizational. Companies that land AI value treat launch as the halfway point.
Resources and Sources
Independent and academic research (strongest citations):
- MIT NANDA, State of AI in Business 2025. Origin of the ~95% GenAI pilot failure figure.
- MIT Sloan Management Review, AI and Business Strategy.
- Stanford HAI, AI Index Report.
Consulting and analyst research (credible, read with frame in mind):
- McKinsey QuantumBlack, The State of AI.
- BCG, Artificial Intelligence research.
- Deloitte, State of Generative AI in the Enterprise.
- Gartner AI research hub.
Investor and vendor-influenced (use for pattern recognition, not ROI claims):
- a16z, enterprise AI essays. (portfolio-influenced, cite with caveat)
Signal vs. Noise is a newsletter for mid-market operators who need to cut through AI and enterprise-tech hype. Written by Guy Pistone, CEO of Valere.
