Over the last 18 months we have helped procurement teams stand up AI projects across contract review, spend classification, and supplier risk. We have stood next to teams when the deployment shipped, watched what users actually did with it, and read the post-mortems six months later. This piece covers three of those deployments. None of them are vendor case studies. None are clean wins. All three teach something about how procurement AI actually behaves once it leaves the slide deck.

Most published procurement AI case studies come from vendor marketing teams. They overstate the result, hide the cost, and never describe what broke. Procurement leaders end up making budget decisions based on a sample of one (their own pilot) plus a stack of vendor PDFs that describe an idealised version of someone else's. This piece is our attempt to add real cases to that picture.

Two ground rules before we start. First, none of the companies below are named. The descriptors are accurate (industry, revenue band, geography), the numbers are real, but the identifiers are stripped. Second, we are not the only consultants in any of these projects. In two of the three we worked alongside internal teams who did the bulk of the configuration work. We get to publish because we asked permission, not because we own the outcome.

Case 1: Contract Review at a $4B Specialty Chemicals Manufacturer

The claim: Contract review is the highest-readiness procurement AI use case in 2026. When teams scope it correctly, results show up inside a quarter.

This team came to us with a familiar problem. Their legal function ran a five-day SLA on tier-2 procurement contracts (anything under $500K with a standard MSA in place). The procurement team was queuing eighty contracts a month behind that SLA. Suppliers were noticing. The CPO had been told twice in board meetings that procurement was slowing down enterprise growth.

They had already tried two things before we arrived. A workflow tool that promised faster routing did nothing because the bottleneck was not routing. It was the lawyer's attention. A junior paralegal hire helped for three months, then she quit and the queue reformed.

What worked was narrower than what they originally scoped. The first pitch was "AI contract review across all procurement contracts." We pushed back. The actual scope shipped was first-pass redlining on tier-2 contracts using a defined clause taxonomy of nineteen clauses (limitation of liability, indemnification, payment terms, IP assignment, audit rights, and so on). The model used was Claude Opus 4.7 via the Anthropic API, with retrieval-augmented prompts pulling from the company's playbook of preferred language for each clause.
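As a rough illustration of what retrieval against a playbook can look like, the sketch below assembles a first-pass review prompt for a single clause. It is a minimal sketch, not the team's implementation: the PLAYBOOK entries, the build_redline_prompt function, and the prompt wording are all hypothetical, and a plain lookup by clause type stands in for whatever retrieval method was actually used. The resulting string would be sent to the model through an API client, which is left out here.

```python
# Hypothetical playbook: clause type -> preferred language.
# Placeholder text only; a real playbook would cover all nineteen clause types.
PLAYBOOK = {
    "limitation_of_liability": "Liability is capped at the fees paid in the prior 12 months.",
    "payment_terms": "Invoices are payable net 60 days from receipt.",
}


def build_redline_prompt(clause_type: str, clause_text: str) -> str:
    """Assemble a first-pass review prompt for a single clause."""
    preferred = PLAYBOOK.get(clause_type)
    if preferred is None:
        # Outside the taxonomy: do not guess, route the clause to a lawyer.
        raise KeyError(f"No playbook entry for clause type: {clause_type}")
    return (
        f"Clause type: {clause_type}\n"
        f"Preferred language from the playbook:\n{preferred}\n\n"
        f"Supplier clause under review:\n{clause_text}\n\n"
        "Compare the supplier clause with the preferred language. "
        "Answer STANDARD if they are materially equivalent. "
        "Otherwise answer NON-STANDARD and propose a redline."
    )


prompt = build_redline_prompt(
    "payment_terms", "Invoices are payable net 30 days from invoice date."
)
print(prompt)
```

Raising an error for clause types outside the taxonomy mirrors the scoping decision in the case: anything the playbook does not cover goes to a lawyer and never gets a guess.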

The pilot took six weeks. The production rollout took another twelve. Inside the production window, the first-pass redline got through 73% of clauses without lawyer intervention. The five-day SLA dropped to same-day on tier-2 contracts. The lawyer kept full review authority on tier-1 (over $500K) and on any contract where the AI flagged a non-standard clause.

What broke is more useful than what worked. The first version flagged too many false positives. Every limitation-of-liability clause came back as "non-standard," including the company's own preferred language. The legal team's confidence collapsed in week three. We spent the next four weeks tuning the prompt and rebuilding the clause taxonomy. The false-positive rate dropped from 34% to 8% by version eight. That four-week stretch is where most procurement AI projects fail. The team's first instinct is to abandon and replace the tool. The right move is to instrument what the tool is getting wrong and fix the prompt.
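Instrumenting the failure can be as simple as logging each reviewed clause with the tool's verdict and the lawyer's verdict, then breaking the false-positive rate out by clause type. The sketch below shows one way to do that. The record layout and function name are hypothetical, the rows are illustrative and not data from the deployment, and the definition used here (flagged clauses as a share of the clauses the lawyer judged standard) is an assumption, since the exact definition behind the 34% and 8% figures is not stated.

```python
from collections import defaultdict

# Each record: (clause_type, tool_flagged_non_standard, lawyer_says_non_standard).
# Illustrative rows only, not data from the deployment.
reviewed = [
    ("limitation_of_liability", True, False),
    ("limitation_of_liability", True, False),
    ("limitation_of_liability", True, True),
    ("limitation_of_liability", False, False),
    ("payment_terms", False, False),
    ("payment_terms", True, True),
    ("payment_terms", False, False),
]


def false_positive_rate_by_clause(records):
    """Return {clause_type: FP / (FP + TN)} over clauses the lawyer judged standard."""
    false_pos = defaultdict(int)
    standard_total = defaultdict(int)
    for clause_type, flagged, truly_non_standard in records:
        if truly_non_standard:
            continue  # only clauses the lawyer judged standard can be false positives
        standard_total[clause_type] += 1
        if flagged:
            false_pos[clause_type] += 1
    return {c: false_pos[c] / n for c, n in standard_total.items()}


rates = false_positive_rate_by_clause(reviewed)
# Worst clause types first, so tuning effort goes where the tool is most wrong.
for clause_type, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{clause_type}: {rate:.0%}")
```

A per-clause breakdown like this is what makes a problem such as the limitation-of-liability over-flagging visible as one fixable prompt issue and not as a general verdict on the tool.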

Case in point: $4B specialty chemicals manufacturer, contract review

The situation: 80 tier-2 contracts queued per month behind a 5-day legal SLA. CPO under pressure from the board on procurement cycle time.

What they did: Scoped down from "all contracts" to tier-2 contracts with a 19-clause taxonomy. Used Claude Opus 4.7 with retrieval against an internal playbook. Six-week pilot, twelve-week production rollout.

The result: 73% of clauses resolved without lawyer intervention. Tier-2 SLA dropped from 5 days to same-day. False-positive rate landed at 8% after four weeks of prompt tuning.

The lesson: The pilot's first version was unusable. The team that ships is the one that instruments the failures and tunes the prompt. The one that abandons after the first bad week never gets the benefit.

The implication for procurement leaders: when you scope a contract review pilot, do not let the vendor sell you horizontal coverage in week one. Pick one contract tier, one clause taxonomy, one team, and one playbook. Budget four weeks for prompt tuning between pilot and production. If you skip that step, production runs at the untuned pilot's accuracy, which is usually 10 to 15 percentage points below what the demo showed.

Case 2: Spend Classification at a $1.2B North American Retailer

The claim: Spend classification is the hardest procurement AI use case to do well. The teams that succeed are the ones that scope down. The teams that fail are the ones that scope up.

This was a deployment we joined late, after the first attempt had already failed. The team had 240,000 transactions a year flowing through their ERP, GL coding was inconsistent across three regions, and the procurement function had no clean view of indirect spend. The CFO had asked the CPO for a category breakdown. The CPO did not have one. That is how the project started.

The first attempt cost $480,000. The team bought an enterprise spend analytics platform from a well-known vendor. The vendor promised 85% classification accuracy out of the box. The implementation took eight months. When the output landed, the classification was 67% accurate against a 500-transaction audit sample. The procurement team rejected it because the errors clustered in the categories they cared most about: marketing services, IT subscriptions, professional services. Those three categories represented 28% of spend.
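A headline accuracy figure hides exactly the clustering described above, so it helps to score an audit sample per category as well as overall. The following is a minimal sketch of that check. The function name, category labels, and rows are hypothetical and illustrative, not the retailer's data or the method used in the original audit.

```python
from collections import Counter


def accuracy_by_category(audit_rows):
    """audit_rows: iterable of (true_category, predicted_category) pairs.

    Returns (overall_accuracy, {true_category: accuracy}).
    Assumes at least one row; an empty sample would divide by zero.
    """
    totals, correct = Counter(), Counter()
    for true_cat, predicted_cat in audit_rows:
        totals[true_cat] += 1
        if predicted_cat == true_cat:
            correct[true_cat] += 1
    overall = sum(correct.values()) / sum(totals.values())
    per_category = {cat: correct[cat] / totals[cat] for cat in totals}
    return overall, per_category


# Illustrative audit rows, not the retailer's data.
sample = [
    ("packaging", "packaging"),
    ("packaging", "packaging"),
    ("packaging", "packaging"),
    ("marketing_services", "professional_services"),
    ("marketing_services", "marketing_services"),
    ("it_subscriptions", "office_supplies"),
]
overall, per_category = accuracy_by_category(sample)
print(f"overall: {overall:.0%}")
# Weakest categories first.
for cat, acc in sorted(per_category.items(), key=lambda kv: kv[1]):
    print(f"{cat}: {acc:.0%}")
```

Sorting the weakest categories to the top puts the question the procurement team actually cared about first: whether the errors sit in the categories that drive decisions.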

We came in after the platform decision had already been written off. The CPO asked a different question than the one she had asked the vendor. Not "how do we classify everything?" but "what part of our spend would actually move if we could classify it better?" The answer was direct materials, where 40% of spend sat in 9,000 transactions. Classifying direct materials better would feed sourcing decisions. Classifying tail spend better would feed nothing the team had time to act on.

The second attempt cost about $7,000 in API spend and roughly 120 hours of analyst time spread over five weeks. Two procurement analysts and one external advisor built a Claude-based classification workflow. The prompt was tuned against a labelled sample of 800 transactions across the company's direct-materials category tree. The accuracy on direct materials hit 91% by week four. The team used the output to renegotiate two strategic supplier agreements that quarter, recovering an estimated $640,000 in unit price savings. That is a real number, drawn from the post-renegotiation purchase orders, not a projection.

What broke is that they never solved indirect spend. The Claude workflow performed worse on indirect, where the description fields were thinner and the supplier names did not map cleanly to categories. After two attempts at tuning, accuracy on indirect stalled at 74%. The team chose to leave it there. Indirect spend is still being categorised through a quarterly manual review by a finance analyst. That is a deliberate decision, not a failure.

Case in point: $1.2B North American retailer, spend classification

The situation: 240,000 transactions per year, no clean category view, $480K already sunk into a failed enterprise platform deployment.

What they did: Scoped from "classify everything" to "classify direct materials only." Built a Claude-based workflow with two analysts. Tuned against 800 labelled samples.

The result: 91% accuracy on direct materials by week four. Used the output to renegotiate two supplier agreements that quarter and recover $640K in unit price savings. Indirect spend left at 74% accuracy.

The lesson: A 91% accurate classification on 40% of spend beats a 67% accurate classification across 100%. Scope to the part that drives a decision. Leave the rest to manual review until the workflow is solid.

The implication: if you are scoping a spend classification project, ask one question before you budget for it. "If this works, what decision does the output feed?" If the answer is "category strategy for direct materials," scope it that way. If the answer is "a dashboard the CFO might look at," do not start. The classification work has to feed a decision that someone is already trying to make. We covered this pattern more deeply in Why Most Procurement AI Projects Fail.

Case 3: Supplier Risk Scoring at a $600M Industrial Distributor

The claim: The hardest part of a procurement AI deployment is rarely the technology. It is whether the output reaches the person whose decision it should change.

This project is the one we think about most because it technically worked and operationally did not. A mid-market industrial distributor had 3,400 active suppliers and a manual risk review process that touched each supplier roughly every 18 months. Between reviews, risk signals went unnoticed. The risk team had asked the procurement function twice for a continuous monitoring system. The CPO had no budget for an enterprise risk platform (the smallest quote was $340K per year).

The build was straightforward. Over 90 days, two engineers and one risk analyst built a Claude-powered supplier scorecard. It combined three inputs. A Dun & Bradstreet API feed gave financial health signals. A news search over the last 30 days surfaced supplier-specific incidents. An internal incident log of quality complaints, late deliveries, and contract disputes filled in the rest. The model produced a one-page summary for each tier-1 supplier (the top 180), updated weekly. Tier-2 and tier-3 suppliers got monthly updates. Total infrastructure cost ran about $2,800 a month.
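To make the shape of that build concrete, here is a minimal sketch of two of its pieces: deciding which supplier scorecards are due for a refresh, and combining the three input sources into one block of text for the summary prompt. All names are hypothetical and none of this is the team's code. The weekly and monthly cadence comes from the description above, treating "monthly" as 30 days is an assumption, and the calls to the data feeds and the model are left out.

```python
from datetime import date, timedelta

# Refresh intervals by supplier tier. The weekly and monthly cadence comes from
# the case description; treating "monthly" as 30 days is a simplifying assumption.
REFRESH_INTERVAL = {
    1: timedelta(days=7),
    2: timedelta(days=30),
    3: timedelta(days=30),
}


def is_refresh_due(tier: int, last_updated: date, today: date) -> bool:
    """Return True when a supplier's scorecard is at least as old as its tier allows."""
    return today - last_updated >= REFRESH_INTERVAL[tier]


def build_scorecard_context(financial: str, news: list[str], incidents: list[str]) -> str:
    """Combine the three input sources into one text block for the summary prompt."""
    news_lines = "\n".join(f"- {item}" for item in news) or "- none in the last 30 days"
    incident_lines = "\n".join(f"- {item}" for item in incidents) or "- none logged"
    return (
        f"Financial health signals:\n{financial}\n\n"
        f"News, last 30 days:\n{news_lines}\n\n"
        f"Internal incidents:\n{incident_lines}"
    )


today = date(2026, 3, 16)
print(is_refresh_due(1, date(2026, 3, 9), today))  # tier-1 scorecard, seven days old
print(is_refresh_due(2, date(2026, 3, 9), today))  # tier-2 scorecard, seven days old
print(build_scorecard_context("No adverse change reported.", [], ["Late delivery in week 9"]))
```

Writing "none" explicitly for an empty source keeps the summary honest about what was checked and found clean versus what was never checked.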

The technical performance was strong. Across the first six months, the scorecard surfaced four of the five supplier issues that the risk team confirmed in a parallel manual review. Two of those four would have caused real disruption (one was a tier-1 chemical supplier headed into Chapter 11 protection that the public news cycle had not yet picked up). The risk team estimated the system had avoided $1.8M in contract overruns and emergency procurement costs over nine months.

The problem is that the procurement category managers did not use it. The dashboard lived in a Notion workspace nobody opened. The risk alerts went into a shared email distribution that auto-archived in most inboxes. Category managers continued working with their suppliers the same way they always had. When we ran a usage audit at month six, fewer than 20% of tier-1 suppliers had been re-evaluated based on a flag the system had raised. The system flagged. Nobody acted.
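A usage audit of this kind can be run from two logs: the flags the system raised and the supplier re-evaluations that followed. The sketch below computes the share of flags followed by a re-evaluation within a fixed window. The Flag record, the four-week window, and the sample rows are assumptions for illustration, and the metric is a simplified stand-in for the audit we ran, which counted re-evaluated tier-1 suppliers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Flag:
    supplier_id: str
    raised_week: int                 # week in which the scorecard raised the flag
    reevaluated_week: Optional[int]  # week a category manager re-evaluated, or None


def action_rate(flags: list[Flag], max_lag_weeks: int = 4) -> float:
    """Share of flags followed by a re-evaluation within max_lag_weeks."""
    if not flags:
        return 0.0
    acted = sum(
        1
        for f in flags
        if f.reevaluated_week is not None
        and 0 <= f.reevaluated_week - f.raised_week <= max_lag_weeks
    )
    return acted / len(flags)


# Illustrative rows, not data from the deployment.
flags = [
    Flag("S-001", raised_week=3, reevaluated_week=5),
    Flag("S-002", raised_week=4, reevaluated_week=None),
    Flag("S-003", raised_week=6, reevaluated_week=None),
    Flag("S-004", raised_week=8, reevaluated_week=20),  # too late to count
    Flag("S-005", raised_week=9, reevaluated_week=None),
]
print(f"Flags acted on: {action_rate(flags):.0%}")
```

Tracking this number from week one, not month six, would have surfaced the adoption gap while the build team was still in place.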

We have seen a version of this pattern in roughly half the procurement AI deployments we have observed. The technology works. The data shows up. The decision does not change. The root cause is usually that the AI output is not embedded into the workflow where the decision happens. In this case, the risk scorecard sat next to the workflow rather than inside it.

The team is now rebuilding the integration. The scorecard data is being pushed into the supplier records inside their ERP, where category managers already spend their time. Early signals suggest re-evaluation rates are climbing. The build was 90 days. The behaviour change is taking longer.

Case in point: $600M industrial distributor, supplier risk

The situation: 3,400 suppliers, manual review every 18 months, no budget for an enterprise risk platform.

What they did: Built a Claude-powered scorecard combining D&B, news search, and internal incident logs. Weekly updates on tier-1 suppliers. 90-day build, $2.8K monthly run cost.

The result: Surfaced 4 of the 5 supplier issues that the risk team confirmed in a parallel manual review. Estimated $1.8M in avoided costs over 9 months. Adoption inside the procurement team: fewer than 20% of tier-1 suppliers re-evaluated in response to a flag.

The lesson: Technical success and operational success are different problems. If the AI output lives next to the workflow instead of inside it, expect 80% of the value to leak out.

The implication: when you scope a procurement AI build, design the integration into the existing decision surface before you design the model. The first question is not "what data do we feed the AI?" It is "where does the recipient already look when they make this decision?" Build the model output into that surface. If the answer is "they look at the supplier record in our ERP," your output goes into the supplier record. If the answer is "they look at a category dashboard during their weekly review," your output goes there. A separate dashboard is rarely the right answer.

Patterns Across All Three Deployments

Three projects is not a sample size. It is anecdote. But three deployments from three different procurement contexts produce a few patterns worth naming.

Scope down to ship. Each of these projects shipped only at a narrow scope, and two of them got there only after the team cut back what was originally pitched. Contract review narrowed from "all contracts" to "tier-2 with a 19-clause taxonomy." Spend classification narrowed from "all transactions" to "direct materials only." Supplier risk started narrow and stayed there. In every case, the team that wanted to ship something useful inside a quarter had to give up the boil-the-ocean version of the plan.

The first version is wrong. All three deployments produced a first output that the user-facing team rejected or ignored. The contract review tool was over-flagging. The classification accuracy was unusable on indirect spend. The risk scorecard surfaced events nobody acted on. In every case the team that ultimately shipped the production version spent four to eight weeks instrumenting the failure mode and fixing it. The teams that fail are the ones that abandon the tool after the first bad output rather than treating it as a tuning problem.

The hardest part is rarely the model. The Claude or GPT layer was not the bottleneck in any of these projects. The bottlenecks were prompt design, clause taxonomies, integration into the existing workflow, and getting users to change their behaviour. Procurement teams that hand the project to an "AI vendor" and expect the model to solve those problems are still going to be running pilots in 2027.

The cost was lower than expected. The contract review deployment ran at roughly $11K per month in API and tooling cost once it was in production. The spend classification project cost $7K to ship the working version. The supplier risk system costs $2.8K per month to run. Compare these to the enterprise platform quote of $340K per year that the third company was considering. The infrastructure is not where the cost lives. The cost lives in the human time it takes to scope correctly, tune iteratively, and embed the output into the right workflow. That is also where most enterprise AI vendors are still asking you to do the work yourself.
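Annualising the run costs makes the comparison with the platform quote easier to read. The arithmetic below uses only figures quoted in this piece and covers infrastructure and tooling only. It leaves out the human time for scoping, tuning, and integration, which is where the real cost sits, so it should not be read as a total cost of ownership.

```python
# Monthly run costs quoted in the text, annualised for comparison.
contract_review_monthly = 11_000
supplier_risk_monthly = 2_800
platform_quote_annual = 340_000  # smallest enterprise risk platform quote (Case 3)

contract_review_annual = contract_review_monthly * 12
supplier_risk_annual = supplier_risk_monthly * 12
ratio = platform_quote_annual / supplier_risk_annual

print(f"Contract review run cost: ${contract_review_annual:,} per year")
print(f"Supplier risk run cost:   ${supplier_risk_annual:,} per year")
print(f"Platform quote is about {ratio:.1f}x the supplier risk run cost")
```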

The takeaway: Procurement AI in 2026 is not a model problem. It is a scoping, prompting, and integration problem. The technology is mature enough that the team's discipline matters more than the vendor selection. The teams that ship inside a quarter are the ones who narrow scope, treat the first output as a starting point, and design the integration before the model.

What This Means If You Are About to Start a Procurement AI Project

If you are evaluating a procurement AI project in the next quarter, there are three questions worth answering before you spend money. First, what is the narrowest scope where this AI could feed a decision your team is already trying to make? Not the most ambitious. The narrowest. Second, who on your team is going to spend four to eight weeks tuning the output between the first build and production-grade quality? If the answer is nobody, the project is going to stall in pilot. Third, where in your existing workflow does the AI's output need to land? If the answer is "we will figure that out later," the adoption rate will land around 20%, which is what happened to one of the three teams above.

None of this is theoretical. The patterns are practitioner observations from the three deployments described here and a dozen others we have worked on or watched. The version of procurement AI that ships in 2026 looks less like "buy a platform" and more like "scope a use case, tune a prompt, embed the output." That is closer to the work an internal team can do with consulting support. It looks less like a transformation programme that an enterprise software vendor leads.

If you want to see how this maps to a full implementation path, our step-by-step guide on implementing AI in procurement walks through the readiness, scoping, tuning, and integration stages. And if you are still building the financial case, our procurement AI ROI guide covers what CFOs actually accept as defensible ROI categories.

We will publish a follow-up in Q3 with three more deployments. The plan is to alternate categories: spend, contract, supplier, sourcing, risk. If your team has a deployment story worth telling (success or failure) and you would consider sharing it anonymised, we would like to hear from you. The point of this piece is not marketing. It is to build a real corpus of how procurement AI actually behaves in production, which is something the industry still does not have enough of.

Further reading on procurement AI implementation from outside our work: Gartner's supply chain research hub covers ongoing benchmarks across procurement AI adoption. McKinsey's operations insights library publishes adoption surveys across categories. For the model layer, Anthropic's release notes are the most current source on Claude capability and pricing changes.

Scoping a procurement AI deployment and want a second opinion before you commit budget?

Talk to Molecule One