If your inbox looks like ours, every vendor is selling AI, every consultant has an AI practice, and every board meeting has an AI mandate. The hype is deafening. The production stats are humbling.
Over the last 18 months we tracked 100+ enterprise GenAI projects across financial services, insurance, healthcare, legal, and Fortune 500 technology. Some we built. Some we advised on. Some we just watched. The outcome distribution is brutal — and it matches independent surveys from Gartner, IBM, and Forrester to within a percentage point.
[Figure: project outcomes by stage: before production, within 12 months, held production]
Total survival rate: 2 in 100. Read that again. Of every 100 enterprise GenAI initiatives that get pilot funding, only 2 are alive and serving real users 18 months later. The other 98 produce a PowerPoint, a demo video, a sunk cost, and a CTO who's now skeptical of the next AI pitch.
But here's what makes the data useful: the 2 that survive are not random. They cluster around seven specific use cases, and the project leads who built them did the same six things differently from the 98 who failed. This article is the field guide.
Will your AI project be in the 9% or the 91%? Answer these 5 questions.
Before you read what other people built, run this on what YOU'RE building. One "no" answer is recoverable. Two or more means you're heading for the 91%: stop and rescope.
Q1: Can you draw the full path from raw user input to user-visible output on a whiteboard in five minutes?
If not, your scope is fuzzy. Fuzzy scope means you'll discover the hard problems 4 months in, after the budget is half spent. The 9% always have a one-pager that fits on a whiteboard. The 91% have a 60-page strategy deck that doesn't survive contact with engineering.
Q2: Can you state success as a single number with a unit?
"Improve productivity" is not a success metric. "Reduce time-to-resolve tier-1 tickets by 30% while holding CSAT above 4.0" is. If you can't quote the number, you can't tell when you're done — so you iterate forever and run out of money. Failed projects all share the same fingerprint: vague success criteria approved by a steering committee.
Q3: Has InfoSec already reviewed and signed off on the data flow?
"We'll handle InfoSec at the end" is the most expensive sentence in enterprise AI. If you take InfoSec to the architecture diagram on day 90 and they say "redo it" — you redo it. If you took them on day 1, you redesigned around their constraints from the start. The cost of being right early is two extra meetings. The cost of being wrong late is your entire project.
Q4: Is there a human in the loop by design, or are you trying to fully automate?
Fully autonomous AI in regulated industries (insurance, healthcare, finance, legal) is a liability bomb your General Counsel will defuse the moment they see it. Every project we've seen survive has a human reviewing or approving the AI's output. The AI accelerates the human. It doesn't replace them. The 91% pitch full automation; the 9% pitch assist.
Q5: Have you priced the cost-to-serve at 10x pilot volume?
Pilot: 10 users, 50 queries/day, $30 in API costs. Production: 5,000 users, 25,000 queries/day, $50,000/month before optimization. Run the math on production economics BEFORE you build. If the CFO kills it at launch, you wasted six months. If the CFO greenlights it at $50K/month and the value is real, you're funded for the next twelve.
5/5 yes: You're rare. Build it.
4/5: Fix the one "no" before you start. Usually it's Q3 (InfoSec) or Q5 (economics).
3/5 or less: Stop. Either rescope the project or run a 2-week strategy sprint before committing engineering budget. The cost of fixing scope problems now is a fraction of the cost of discovering them in month 5.
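For Q5, the projection is simple enough to run before the first design meeting. Here is a minimal sketch, assuming cost scales linearly with query volume. `UsageProfile`, `monthly_cost`, and the $0.02 per-query figure are illustrative placeholders, not measured numbers, so replace the unit cost with what your own pilot logs show.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageProfile:
    users: int
    queries_per_user_per_day: float
    cost_per_query_usd: float  # assumed blended cost; measure this in your pilot


def monthly_cost(profile: UsageProfile, days_per_month: int = 30) -> float:
    """Linear projection: users x queries per user x unit cost x days."""
    daily_queries = profile.users * profile.queries_per_user_per_day
    return daily_queries * profile.cost_per_query_usd * days_per_month


# Volumes follow the Q5 example (50 and 25,000 queries/day).
# The $0.02 unit cost is a placeholder assumption, not a quoted price.
pilot = UsageProfile(users=10, queries_per_user_per_day=5, cost_per_query_usd=0.02)
production = UsageProfile(users=5_000, queries_per_user_per_day=5, cost_per_query_usd=0.02)

for name, profile in (("pilot", pilot), ("production", production)):
    print(f"{name}: ${monthly_cost(profile):,.0f}/month")
```

Linear scaling is the unoptimized baseline. Caching, smaller models for easy queries, and prompt trimming can bend the curve, but the CFO conversation should start from this number.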
The 4 gates that kill 91% of enterprise AI
Before the use cases that survive, the failure pattern. Across the 91 pilots that never deployed, autopsies usually point to one of four gates:
1. The security review you didn't schedule. The pilot was built on someone's personal OpenAI key, in a GitHub account InfoSec doesn't know about, with customer data flowing through an external API that was never signed off. When the project escalates to InfoSec for production deployment, they find SOC 2 violations, GDPR issues, or audit-trail gaps that require rebuilding the entire architecture. The team gets quoted "6-9 more months" and the budget runs out. Most pilots die here, around month 4.
2. The data access architecture that doesn't exist. The pilot used a sample dataset someone exported to a notebook. Production needs live data from Snowflake, SharePoint, Salesforce, the legacy ERP, and the data warehouse the data team is still migrating off — each with its own RBAC, encryption, audit logging, and retention rules. Data engineering quotes 6 months to plumb it all in. By the time the data pipeline is real, the original use case has moved on. Pilot frozen, then quietly killed.
3. The integration tax that compounds. The AI app has to live inside the workflow employees already use — not as a separate tab they have to remember to open. So you integrate with Okta SSO, the company's observability stack (Datadog or Splunk), the change-management system, the helpdesk ticketing tool, the existing notification preferences, and the audit-log pipeline. Each integration adds 2-6 weeks. Your 8-week pilot becomes a 9-month integration project. Sponsor changes jobs. New sponsor kills it.
4. The cost-to-serve curve that breaks the business case. Pilot was 5 internal users on $50/day OpenAI budget. Production at 5,000 users with the same usage pattern is $50,000/month before optimization. The CFO does the math, the business case that worked at pilot volume doesn't survive contact with production volume, and finance vetoes the deployment.
Model dependency risk, a fifth risk beyond the four gates. Almost no team thinks about this. You build the pilot on GPT-4o. Six months later, OpenAI deprecates that snapshot, doubles the price, or changes the safety filters in a way that breaks your prompts. Suddenly your "production" app needs a 4-week migration project on no notice. Survivors design for model-agnosticism from day one, usually via a routing layer (LiteLLM, Portkey, or homegrown) that lets them swap LLMs without touching app code.
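A homegrown routing layer can start very small. The sketch below is one possible shape, not how LiteLLM or Portkey work internally. `ModelRouter` and both stub providers are hypothetical names, and each provider is assumed to be a plain function that would wrap a vendor SDK call in a real app.

```python
from typing import Callable, Dict, List

# A provider is any callable that maps a prompt to a completion string.
# In a real app each one would wrap a vendor SDK call.
Provider = Callable[[str], str]


class ModelRouter:
    """Maps logical task names to an ordered list of providers with fallback."""

    def __init__(self) -> None:
        self._providers: Dict[str, Provider] = {}
        self._routes: Dict[str, List[str]] = {}

    def register(self, name: str, provider: Provider) -> None:
        self._providers[name] = provider

    def set_route(self, task: str, provider_names: List[str]) -> None:
        unknown = [n for n in provider_names if n not in self._providers]
        if unknown:
            raise KeyError(f"unregistered providers: {unknown}")
        self._routes[task] = list(provider_names)

    def complete(self, task: str, prompt: str) -> str:
        errors: List[str] = []
        for name in self._routes[task]:
            try:
                return self._providers[name](prompt)
            except Exception as exc:  # fall through to the next provider
                errors.append(f"{name}: {exc}")
        raise RuntimeError("all providers failed: " + "; ".join(errors))


# Stub providers standing in for real vendor clients (hypothetical).
def retired_model(prompt: str) -> str:
    raise RuntimeError("model snapshot retired")


def backup_model(prompt: str) -> str:
    return f"[backup] {prompt}"


router = ModelRouter()
router.register("primary", retired_model)
router.register("backup", backup_model)
router.set_route("summarize", ["primary", "backup"])

print(router.complete("summarize", "Summarize the Q3 claims report"))
```

App code only ever names the task ("summarize"), so swapping or reordering models is a change to the route table rather than to the application.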
The seven use cases below share something: they're scoped from day one to pass through all four gates and to absorb model-dependency risk. They're not bigger than the gates. They're built to fit.
What the 9% are actually building right now
Seven use cases. They share three properties: (1) they pass the 5-question diagnostic, (2) they survive the 4 gates because they're scoped to fit, (3) we've personally built or watched the build of each one in the last 18 months. The specs, timelines, and gotchas below are real — names and metrics anonymized.
For each use case we list the build economics. But the part worth your time is the Hidden Gotcha — the thing nobody tells you that kills the project six months in. Read those.
1. Claims triage and pre-fill
An LLM reads the incoming first-notice-of-loss (FNOL) — structured fields, photos, and supporting documents — and produces a recommended category, a confidence score, and a draft summary the adjuster reviews. It ships because output is structured (eval is mechanical against historical adjuster decisions), human-in-the-loop is mandatory by regulation, and the ROI math is direct: 30-50% reduction in adjuster time per claim.
Hidden Gotcha: Teams try to fully automate the claim decision because the AI is good enough on paper. State insurance regulators require human review for any claim decision affecting payout, and your General Counsel will kill the project the day they hear "AI deciding." Build for "assist the adjuster" from day one. The productivity story is the same. The legal risk story is 1000x better.
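"Eval is mechanical" here means the model's recommended category is compared against what adjusters actually decided, with a pass bar agreed up front. Here is a minimal sketch. `evaluate`, the keyword classifier, and the five labeled claims are toy stand-ins, and the 0.9 accuracy bar is an assumed threshold rather than a recommended one.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def evaluate(
    classify: Callable[[str], str],
    labeled_claims: Iterable[Tuple[str, str]],
    min_accuracy: float = 0.9,
) -> Dict[str, object]:
    """Score a classifier against historical decisions and apply a pass bar."""
    total = 0
    misses: List[Tuple[str, str, str]] = []
    for text, expected in labeled_claims:
        total += 1
        predicted = classify(text)
        if predicted != expected:
            misses.append((text, expected, predicted))
    accuracy = (total - len(misses)) / total if total else 0.0
    return {"accuracy": accuracy, "passed": accuracy >= min_accuracy, "misses": misses}


# Toy stand-in for the LLM call; a real run would call your model here.
def keyword_classifier(text: str) -> str:
    lowered = text.lower()
    return "auto" if ("rear-ended" in lowered or "windshield" in lowered) else "property"


# Toy examples in place of historical adjuster decisions.
HISTORY = [
    ("Rear-ended at a stop light", "auto"),
    ("Pipe burst and flooded the kitchen", "property"),
    ("Hail damage to the roof", "property"),
    ("Windshield cracked on the highway", "auto"),
    ("Tree fell on a parked car", "auto"),
]

report = evaluate(keyword_classifier, HISTORY)
print(report["accuracy"], report["passed"], report["misses"])
```

The list of misses matters as much as the score: it is what the team reviews before changing a prompt, and the same harness re-runs after every change.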
2. Investment research copilot
Analysts query their firm's research archive, broker reports, internal models, and earnings transcripts in natural language. The system surfaces relevant excerpts and drafts memos with source citations. It ships because the use case is read-only (no trading decisions automated), SEC audit-trail requirements are satisfied automatically, and the productivity gain (queries in minutes vs hours) creates a measurable business case from week one.
Hidden Gotcha: Teams build the chat interface, ship to analysts, and discover adoption rates of 5-15%. Why? No source citations. Analysts are paid to be skeptical. They won't paste AI-generated content into a deck their portfolio manager will see without a link to the source. Citations on every claim are non-negotiable. The teams that add them see 60-85% weekly active rates.
Embedding choice matters more than LLM choice here. For financial documents (earnings calls, 10-Ks, broker reports), Voyage's `voyage-finance-2` outperforms OpenAI's `text-embedding-3-large` by 15-25% on retrieval relevance for this exact domain. The LLM you pick for synthesis matters less than getting the right chunks retrieved in the first place.
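Claims like this are worth checking on your own documents before you commit to an embedding model. One common way is recall@k over a small set of analyst-labeled queries. The sketch below assumes you have already run each candidate model and collected its ranked chunk IDs. The queries, chunk IDs, and model names are toy data, not benchmark results.

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant chunk IDs that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_recall_at_k(
    results: Dict[str, List[str]], labels: Dict[str, Set[str]], k: int
) -> float:
    """Average recall@k over every labeled query."""
    scores = [recall_at_k(results.get(q, []), rel, k) for q, rel in labels.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Toy labels: query -> chunk IDs an analyst marked as relevant.
LABELS = {
    "q1 margin guidance": {"c1", "c4"},
    "q2 buyback size": {"c7"},
}

# Toy ranked results from two hypothetical embedding models.
MODEL_A = {"q1 margin guidance": ["c1", "c9", "c4"], "q2 buyback size": ["c7", "c2", "c3"]}
MODEL_B = {"q1 margin guidance": ["c9", "c1", "c8"], "q2 buyback size": ["c2", "c3", "c7"]}

for name, results in (("model_a", MODEL_A), ("model_b", MODEL_B)):
    print(name, round(mean_recall_at_k(results, LABELS, k=2), 2))
```

A few dozen labeled queries from real analysts is usually enough to see whether a domain-specific model earns its place.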
3. Clinical documentation assistant
A clinician dictates the patient encounter. The system transcribes, structures the note into the EHR-required template (SOAP, H&P, discharge), and pre-populates billing codes for human review. It ships because it solves a desperate executive pain (clinicians spending 2 hours/day on documentation), keeps liability with the clinician via human approval, and forces HIPAA controls from day one because nothing else is acceptable in healthcare.
Hidden Gotcha: Teams skip the BAA conversation and discover at the security review that their LLM provider doesn't have a HIPAA Business Associate Agreement available. Without a BAA from BOTH your cloud provider AND your LLM vendor, the project can't legally launch in healthcare. As of 2025, Anthropic models are available under a BAA through AWS Bedrock, and OpenAI models through Azure with limits on which models are covered; confirm current coverage with each vendor. Get the BAA signed in week 1, not month 4.
4. Contract review and risk flagging
The system reads incoming contracts (MSAs, SOWs, vendor agreements, NDAs), flags clauses against the company's playbook, and produces a risk-scored review with line-level citations. It ships because legal teams want speed but won't accept AI-only decisions — the system flags, lawyers decide. Every flag is justified by a citation back to the playbook, making the AI's reasoning auditable.
Hidden Gotcha: Teams skip ingesting the company's own contract playbook. The system then flags every non-standard clause as risky, including the ones legal routinely accepts ("this vendor always asks for 60-day payment; we always give it"). Lawyers see 200 false flags per contract and stop using the tool by week 2. Build with the playbook from day one. If you don't have a playbook, building one IS the first phase of the project.
5. Customer support deflection
Incoming tickets are auto-classified, enriched with knowledge base context, and either auto-resolved (tier-1) or routed to a specialist with a draft response. It ships because deflection rate is measurable in week one (even 15-25% deflection on tier-1 pays back in 3-6 months), and the human fallback means customer experience never degrades — if the AI's not confident, it routes to a person.
Hidden Gotcha: Teams celebrate when AI auto-handles 40% of tickets, until CSAT drops 15% three months later because the AI was being aggressive on cases it shouldn't touch. The deflection-rate-vs-CSAT trade-off is the most expensive number in this project. Start with the confidence threshold set HIGH (AI only handles cases it's 95%+ confident on, maybe 5-10% of volume). Tune downward as you build empirical CSAT data per confidence band. Customer experience is slow to rebuild: once you lose 0.4 CSAT points, getting them back takes 2 quarters.
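The threshold-and-band approach comes down to two small functions: one that routes a ticket on confidence, and one that reports mean CSAT per confidence band so you can see where it is safe to lower the threshold. This is a sketch under assumptions. `route`, `csat_by_band`, the 5-point band width, and the sample outcomes are illustrative, and it assumes your classifier emits a calibrated confidence between 0 and 1.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def route(confidence: float, threshold: float = 0.95) -> str:
    """Auto-resolve only at or above the threshold; otherwise hand off."""
    return "auto_resolve" if confidence >= threshold else "human"


def csat_by_band(
    outcomes: Iterable[Tuple[float, float]], band_pct: int = 5
) -> Dict[int, float]:
    """Mean CSAT per confidence band, keyed by the band's lower bound in percent."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for confidence, csat in outcomes:
        # The small epsilon guards against float error (0.95 * 100 is 94.999...).
        pct = math.floor(confidence * 100 + 1e-9)
        buckets[(pct // band_pct) * band_pct].append(csat)
    return {band: sum(v) / len(v) for band, v in sorted(buckets.items())}


# Toy (confidence, CSAT) pairs; real data comes from post-ticket surveys.
OUTCOMES = [(0.97, 4.5), (0.96, 4.1), (0.91, 3.2), (0.86, 3.0)]

print(route(0.96), route(0.80))
print(csat_by_band(OUTCOMES))
```

Lower the threshold one band at a time, and only into a band whose measured CSAT holds up against the human-handled baseline.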
6. Sales intelligence and account research
Before a sales call, the rep gets an auto-generated account brief: recent company news, personnel changes, financial signals, prior interactions, recommended talking points. It ships because pre-call prep takes reps 30-60 minutes per meeting, and compressing that to 5 minutes is a high-value, easily measured productivity gain. The output is also read-only (no auto-emails), which sidesteps the deliverability and compliance risk of AI-generated outreach.
Hidden Gotcha: Teams build a beautiful standalone "sales AI portal" and discover weekly active rep usage of under 10%. Why? Reps live in Salesforce. They're not opening a separate tab to get an account brief. Embed the AI brief inside the existing CRM record: auto-populated, refreshed on cadence, surfaced where the rep already is. A Salesforce Lightning component beats a standalone app every time, even if the standalone app is technically better.
7. Internal knowledge search
Employees ask natural-language questions across the company's knowledge — Confluence, SharePoint, Notion, Drive, Slack history, ticket history. Results are permission-aware (you only see what you're allowed to see) and cite sources. It ships because knowledge sprawl is universal pain: new employees spend weeks finding "how we do X here," senior employees waste hours looking up what they should know. Productivity uplift is broad and measurable across the entire workforce.
Hidden Gotcha: Teams dump everything into one giant vector index without enforcing permissions. Two weeks after launch, an intern queries "what's the CEO's compensation plan" and the AI returns the answer because the source document was accessible to the indexing service account but should have been blocked from the intern. Project gets killed within 24 hours. This is the single most catastrophic failure mode in enterprise GenAI. Every chunk in your vector store must carry the source document's ACL. Every query must filter results by the asking user's identity. There is no shortcut. Get this right on day one or don't ship.
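The rule "every chunk carries its ACL, every query filters by identity" can be sketched in a few lines. `Chunk`, `visible_chunks`, and the group names are hypothetical. The sketch filters after retrieval for clarity, whereas a production system would more likely push the same condition into the vector store's metadata filter so restricted chunks never leave the index.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Chunk:
    text: str
    source: str
    allowed_groups: FrozenSet[str]  # copied from the source document's ACL


def visible_chunks(chunks: Iterable[Chunk], user_groups: FrozenSet[str]) -> List[Chunk]:
    """Keep only chunks whose ACL shares at least one group with the user.

    Fails closed: a chunk with an empty ACL is visible to nobody.
    """
    return [c for c in chunks if c.allowed_groups & user_groups]


# Toy retrieval results; a real system would get these from the vector store.
RESULTS = [
    Chunk("PTO policy: 20 days per year", "hr/handbook", frozenset({"all-staff"})),
    Chunk("CEO compensation plan", "board/comp", frozenset({"board", "hr-exec"})),
    Chunk("Unlabeled draft", "misc/draft", frozenset()),
]

intern = frozenset({"all-staff"})
print([c.source for c in visible_chunks(RESULTS, intern)])
```

The fail-closed default is the important design choice: a document that lost its ACL during ingestion should disappear from results, not become visible to everyone.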
Build, buy, configure, hire, or wait — the 5 paths
You've picked your use case. The harder question: who builds it? Most articles give you three options. The real list is five, and the right answer depends on what's specific to your business vs what isn't.
| Path | When it works | Time to production | Cost band |
|---|---|---|---|
| Build in-house from scratch | AI is core to your product for 3+ years; you can absorb 6-12 month hiring cycles + custom infrastructure work | 9-18 months | $1.5M-$5M loaded |
| Configure on an enterprise AI platform | Use case fits patterns the platform handles; you have engineers to configure but don't want to rebuild governance + multi-LLM + observability + audit infra | 2-8 weeks | $100K-$400K/yr platform + your eng time |
| Buy a vertical SaaS | An off-the-shelf vendor has already shipped your exact use case (e.g., Glean for knowledge search) | 4-8 weeks | $50K-$500K/yr, priced per seat or by volume |
| Hire an AI services team | You need production AI fast; lack senior AI engineers; want full code ownership | 4-12 weeks | $50K-$250K per app, fixed-bid |
| Wait 12 months | Use case is generic; your existing SaaS vendors will ship it as a feature | N/A (defer) | $0 now |
The "wait" option is real and underrated. For generic email summarization, basic chat support, simple document Q&A — Microsoft, Google, Salesforce, and dozens of SaaS vendors will ship adequate versions inside their existing products within 12 months. Building custom is a waste.
For use cases specific to your data, your workflow, your compliance environment — most of the seven above — the wait option doesn't apply. Generic vendors won't know your contract playbook, your claims taxonomy, or your internal RBAC graph.
That leaves the middle three options. Most enterprises pick wrong because they only consider two of them.
Most teams over-rotate on "hire an AI services team" or "build in-house" and skip the most leveraged option: configure on an enterprise AI platform. A platform with governance, BYO LLM, multi-LLM routing, audit logging, RBAC, and deployment to your cloud already done — your engineers configure the use case logic on top of it. That's the path Clarista is built for: enterprise AI app builder + governance baked in, deploys to your AWS/Azure/GCP, BYO LLM, SOC 2 / HIPAA / ISO inheritable controls. Your team builds the app logic; we provide the production-AI infrastructure.
For teams that want the platform AND a delivery team to build the first app on it: that's AI Development Services. For teams that want pure outsourced engineering: Hire AI Engineers. Same platform underneath — different engagement model on top.
Six tests separating production-ready from pilot theater
Whether you build, buy, or hire, a production-ready enterprise GenAI app must pass these six tests. If your project can't answer yes to all six, it will join the 91% that never ship.
1. Compliance is built in, not bolted on. SOC 2, HIPAA (if applicable), ISO 27001, GDPR data residency — these are architectural decisions, not features added later. Apps designed without them have to be redesigned to add them.
2. Every LLM call is auditable. Production AI gets audited. Auditors will ask: who triggered this output, what prompt was used, what data was retrieved, what model version, what was returned. If you can't answer per-call, the audit fails.
3. There's a human in the loop where it matters. Fully autonomous AI in regulated industries is a liability bomb. Build for assist (human reviews, edits, approves) not replace.
4. Permissions are respected end-to-end. If a chunk of data is restricted in the source system, it must be restricted in the AI app's retrieval and output. This is the hardest engineering problem in enterprise GenAI and most pilots skip it. Don't.
5. Evaluation is mechanical. "It seems good" is not eval. Define your test set, your acceptance criteria, and your failure modes upfront. Automate the eval loop. Re-run it every time you change anything.
6. The cost-to-serve math survives 10x scale. If your pilot costs $50/day for 10 users, what does it cost at 1,000 users? At 100,000? Run the numbers before you build. If production economics don't work, redesign the architecture before you commit budget.
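Test 2 names the fields an auditor will ask for, which maps directly onto a per-call record. Here is a minimal sketch, assuming an append-only JSON-lines log. `LLMCallRecord`, `to_audit_line`, and the sample values are illustrative, and the SHA-256 digest only helps detect edits if the digests are also stored or chained somewhere the application cannot rewrite.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import List


def _utc_now() -> str:
    return datetime.now(timezone.utc).isoformat()


@dataclass(frozen=True)
class LLMCallRecord:
    user_id: str                  # who triggered this output
    prompt: str                   # what prompt was used
    retrieved_doc_ids: List[str]  # what data was retrieved
    model_version: str            # what model version
    output: str                   # what was returned
    timestamp: str = field(default_factory=_utc_now)


def to_audit_line(record: LLMCallRecord) -> str:
    """Serialize one call as a JSON line with a SHA-256 digest of its content."""
    payload = asdict(record)
    body = json.dumps(payload, sort_keys=True)
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return json.dumps({"record": payload, "sha256": digest}, sort_keys=True)


record = LLMCallRecord(
    user_id="u-123",
    prompt="Summarize claim 42",
    retrieved_doc_ids=["doc-7", "doc-9"],
    model_version="example-model-2025-01",
    output="Water damage, kitchen, estimated low severity.",
)
print(to_audit_line(record))
```

Writing the record at the same point in the code that makes the LLM call, rather than reconstructing it later from scattered logs, is what makes the per-call question answerable.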
Closing — what to do this week
The 91% that never ship aren't failing because the technology doesn't work. They're failing because the project was scoped without the production gates in mind. The 9% that ship were designed to pass through those gates from day one.
Three things to do this week:
1. Score your current AI project on the 5-question diagnostic. If you scored 3/5 or less, stop building. Rescope or run a strategy sprint before you spend more engineering time.
2. Decide your path on the build / configure / buy / hire / wait framework. Most teams over-rotate on "build" or "hire" and miss "configure on a platform" — which is usually the fastest and most leveraged option for the specific-to-your-data use cases above.
3. Pressure-test the architecture. Before any code, get InfoSec on the data flow, get finance on the production cost-to-serve, get legal on the human-in-the-loop design. Two weeks of pressure-testing saves 6 months of rework.
Where Clarista fits: the Clarista platform handles the production-AI infrastructure (auth, governance, multi-LLM, observability, audit, RBAC, your cloud) so your engineers can build the use case logic on top — in weeks instead of quarters. If you need a delivery team alongside the platform, we offer AI Development Services (fixed-bid, 4-8 weeks) and AI Consulting + Pilot (strategy plus working pilot in 4-6 weeks).
The technology works. The pattern of what ships is clear. The gap between knowing and doing is the project's first 14 days.