Claude Opus 4.7 for Procurement: The Model That Expects More From You | The Handshake
The Handshake · Issue #001 · April 2026

Claude Opus 4.7 for Procurement Teams: A Smarter Model That Expects More From You

We ran 5 real procurement workflows head-to-head against Opus 4.6. Here's what the benchmarks won't tell you.

12 min read
April 2026
Sandeep Karangula, Molecule One

4.7 is the better model. That said, read the small print.

Opus 4.7 dropped yesterday. We ran it the same day: five procurement workflows back to back against 4.6, covering RFP scoring, contract redlining, spend analysis, category strategy, and supplier performance reviews. 4.7 won four of the five, scored 183 out of 200, and beat its predecessor by 15.5 points. For analytical procurement work, it's the right call.

But there's a personality shift in this model that will catch teams off guard if they just drop it into existing workflows. 4.7 is more literal, more analytical, and less willing to infer what you want. Prompts that worked beautifully with 4.6 will produce different outputs. Not worse, just different in ways you need to understand before you flip the switch. We'll walk through all of it.

"4.7 feels like what happens when a model stops trying to be helpful by anticipating what you want, and starts being helpful by actually doing what you asked."

Our take, after five workflows.

Four things that actually changed

Before the numbers, here's a plain-language summary of what Anthropic shipped. Some of these are capability upgrades, some are behavioural shifts. The behavioural ones are the ones most likely to affect your day-to-day.

🔍 Built-in self-verification (new)
4.7 runs an active verification pass before presenting output: revising scores, catching arithmetic errors, flagging inconsistencies. Not a stated disclaimer; actual evidence of revision.

📌 Prompt literalism (behaviour change)
4.7 executes what's in the prompt, nothing more. It won't infer you wanted a formatted doc or RAG table. If you want those things, say so. Prompts built around 4.6's inference will need a rewrite.

📊 Data pre-checking (new)
On data tasks, 4.7 verifies source figures before running analysis. If something doesn't add up, it stops, flags the discrepancy, and asks before proceeding. No more burying a note in a footnote after the fact.

🏛️ Deeper domain knowledge (improved)
4.7 applies specific, current knowledge without being prompted for it: named regulations, sub-category practices, market participants. Less "here are the considerations" and more "here is the answer."

The headline numbers from the release

Before we get into our own testing, here's what Anthropic published. These are their numbers, on their benchmarks.

+13%: coding task resolution (93-task internal benchmark vs Opus 4.6)
3x: more production tasks resolved (Rakuten-SWE-Bench vs Opus 4.6)
70%: CursorBench score (vs 58% for Opus 4.6)
21%: fewer errors on documents (Databricks OfficeQA Pro vs Opus 4.6)
+14%: multi-step workflow accuracy (Notion Agent benchmark, fewer tool errors)
3.75MP: vision input resolution (3x more than prior Claude models)

Source: Anthropic, "Introducing Claude Opus 4.7"

Most of these are coding and agentic benchmarks, which is where Anthropic has focused the release messaging. What we wanted to know is whether the same improvements translate to procurement work specifically: document analysis, contract review, spend data, category strategy. That's what we tested.

One claim that does show up in our results: the 21% fewer errors on document work. We saw this in practice with the self-verification behaviour, which we'll cover in a moment.

How both models scored across 5 use cases

Each use case was scored across four dimensions (accuracy, self-consistency, output quality, instruction-following), each 0-10, for a maximum of 40 points per test. See the Methodology & Scoring Criteria section below for how we scored this.

Use Case | Opus 4.6 | Opus 4.7 | Winner
UC1: RFP Analysis & Supplier Scoring | 36.0/40 | 35.5/40 | 4.6
UC2: Contract Redlining & Risk Extraction | 34.0/40 | 36.0/40 | 4.7
UC3: Spend Analysis & Category Intelligence | 32.5/40 | 37.0/40 | 4.7
UC4: Category Strategy & Sourcing Plan | 34.0/40 | 37.0/40 | 4.7
UC5: Supplier Performance Scorecard (QBR) | 31.0/40 | 37.5/40 | 4.7
Total | 167.5/200 | 183.0/200 | 4.7

4.6 held its ground in UC1 (RFP scoring) largely on presentation quality. It produced richer, more polished formatted outputs without being asked. In the four tests where the quality of analysis was the deciding factor, 4.7 pulled ahead cleanly. Expand each use case below for the full breakdown.

UC1
RFP Analysis & Supplier Scoring
36.0 vs 35.5
4.6 wins

Four supplier responses for an Enterprise Laptop Procurement (500 units, £2.5M, 3-year managed service). We embedded a spec-level trap: one supplier quoted the Dell Latitude 5540 in their executive summary but specified an incompatible processor in the technical section.

Both models caught the obvious issues. Where 4.6 stood out was in what it produced: RAG colour coding, a 4-tab Excel scorecard, longer rationale columns written for a non-expert audience. You could hand it straight to a stakeholder without editing. 4.7 gave sharper, more expert-readable rationales and actively revised two supplier scores during its self-check loop, though the 3-tab Excel was simpler and the overall output needed more formatting work to reach the same polish.

The most telling detail: the prompt didn't ask for a formatted output document. 4.6 produced one anyway. 4.7 gave text. That pattern ran through almost every test.

4.6 output: RAG colour coding, 4-tab Excel (Summary, Section Scores, TCO Pricing, Risk Flags), produced without being asked.
4.7 output: 3-tab scorecard with dimension-level breakdown. Sharper scoring, less visual polish. Self-check revised two scores before presenting.
Practical split: Use 4.7 for the analytical scoring, then use 4.6 or Sonnet to build the stakeholder-facing summary pack.
UC2
Contract Redlining & Risk Extraction
34.0 vs 36.0
4.7 wins

A fictional IT services contract with 9 deviations from standard terms and 4 internal inconsistencies, including a Scotland vs England & Wales governing law mismatch, a 1x vs 2x liability cap, and a 180-day termination notice where the standard is 30 days.

Both models found all 13 issues. The scoring came down to how they communicated them. 4.7 quantified impact: "2x liability cap gap," "6x termination gap." 4.6 described the same issues qualitatively: "liability cap too low," "notice period too long." For a Finance Director making a commercial call, 4.7's framing is more useful; you can take it straight to a decision without recalculating.

4.7 also got the risk severity right where 4.6 didn't. The Scotland/England & Wales governing law issue was rated HIGH by 4.7, MEDIUM by 4.6. The jurisdiction distinction affects litigation venue, applicable precedents, and enforcement. It's a high-stakes flag, and 4.6 undersold it.
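The quantified framing is easy to reproduce. A minimal sketch using the deviation figures from this test contract (180-day notice vs the 30-day standard; a 1x liability cap vs the 2x standard); the gap_multiplier helper is our own illustration, not output from either model:

```python
# Hypothetical sketch: expressing contract deviations as multipliers of the
# standard term, the way 4.7 framed them. Figures come from the test contract
# described in this section; they are synthetic, not real terms.

def gap_multiplier(actual, standard):
    """Express a deviation as a multiple of the standard term."""
    return actual / standard

# Termination notice: 180 days offered vs a 30-day standard
termination_gap = gap_multiplier(180, 30)

# Liability cap: standard is 2x annual fees, supplier offered 1x
liability_gap = gap_multiplier(2, 1)

print(f"Termination notice gap: {termination_gap:.0f}x")  # 6x
print(f"Liability cap gap: {liability_gap:.0f}x")         # 2x
```

The multiplier alone is what turns a qualitative flag ("notice period too long") into something a Finance Director can act on without recalculating.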

4.7 on the liability cap: quantified as a 2x gap against the standard, with a concrete floor figure and a proposed redline inline. 4.6 flagged the same clause as 'too low'.
Worth noting: 4.6 produced a formatted review document without being asked. 4.7 gave plain text with better analysis. The formatting was easy to add; the severity calibration wasn't.
UC3
Spend Analysis & Category Intelligence
32.5 vs 37.0
4.7 wins

A 20-row spend dataset across 12 sub-categories with a deliberate £200,000 arithmetic error baked in; the stated total didn't match the actual sum of the rows. Both models caught it, but handled it very differently.

4.7 flagged the discrepancy before running any analysis, stated it would use the corrected £12.735M baseline, and asked for confirmation before proceeding. 4.6 spotted it too, but buried it as a footnote in the data table and carried on with the analysis on the wrong total. In a live context, a footnote gets missed; a hard stop before analysis starts does not. Every savings figure downstream from 4.6's run was built on an incorrect number.
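The hard-stop behaviour amounts to a reconciliation gate before analysis. A minimal sketch, assuming the stated total overstated the row sum by the £200k error (the article gives the corrected £12.735M baseline and the gap, not the direction, so the stated figure here is an assumption):

```python
# Sketch of a pre-analysis data check: reconcile a stated total against the
# row-level sum before any analysis runs. Figures mirror the test dataset;
# the direction of the £200k error is our assumption.

rows_total = 12_735_000              # sum of the 20 spend rows (corrected baseline)
stated_total = rows_total + 200_000  # stated total carrying the deliberate error

def precheck(stated, actual, tolerance=1):
    """Stop and report if the stated total disagrees with the data."""
    gap = stated - actual
    if abs(gap) > tolerance:
        return (f"STOP: stated total £{stated:,} differs from row sum "
                f"£{actual:,} by £{gap:,}. Confirm baseline before analysis.")
    return "OK: totals reconcile."

print(precheck(stated_total, rows_total))
```

The design point is the return path: a failed check produces a blocking message rather than a footnote, which is exactly the difference between the two models' behaviour here.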

4.7 also ran supplier deduplication (19 unique suppliers across 12 sub-categories; 4.6 didn't surface this) and correctly classified Waste Management as a Bottleneck category rather than Leverage, picking up the supply-side risk that makes waste a harder category to switch than the textbook answer suggests.

4.6 executive summary: the £200k discrepancy appears as a final sentence, after the analysis has already been framed around the wrong total.
4.7 opening: data quality flag is line one. The model stops, states the discrepancy, and confirms the corrected baseline before any analysis runs.
Data pre-check matters most here. The arithmetic trap was deliberate. In real spend data, errors like this appear all the time. A model that stops and verifies before it calculates is meaningfully safer than one that proceeds and flags quietly.
UC4
Category Strategy & Sourcing Plan
34.0 vs 37.0
4.7 wins

A 12-month Marketing Services category strategy for a European retail chain: €20M spend, 5 markets, 12 product launches, 40+ fragmented agencies. Complex enough that the quality of analysis would show clearly.

Both covered all required sections. The difference was in what sat inside them. 4.6's risk flags were category-level: "product launch disruption." 4.7 named the chain: "agency failure mid-campaign → financial distress / talent attrition / key person dependency." One is something to worry about; the other is something you can build monitoring around.

The sustainability section was the starkest contrast. 4.6 covered FSC certification and Scope 1/2 disclosure. 4.7 cited the UK CMA Green Claims Code, the EU Green Claims Directive 2026, Ad Net Zero framework alignment, FSC stock, vegetable-based inks, modular reusable event builds, and post-event waste diversion reporting. None of it was prompted. It wrote like someone who has managed this category before.

One area where 4.6's answer may be more realistic: agency panel size. 4.6 recommended 8-12 agencies across 5 markets. 4.7 said 2-4 per sub-category. 4.7's number is more commercially aggressive; 4.6's is easier to actually operate across five markets. Worth pressure-testing against your own context.

4.7 sustainability section: FSC-certified stock, vegetable-based inks, Ad Net Zero, greenwashing liability under the EU Green Claims Directive. None of this was in the prompt.
Our team's read on UC4: "4.7 writes like a seasoned analyst: sharp, to the point, deep domain knowledge, actionable outputs."
UC5
Supplier Performance Scorecard (QBR)
31.0 vs 37.5
4.7 wins

A full-year QBR pack for a fictional packaging supplier, covering four quarters of OTIF, quality incidents, invoice disputes, and non-conformance data, with an explicit scoring formula. Largest margin of the five tests.

4.6 made a clean factual error: every improvement action was dated June 2025 in a document dated April 2026. That's the kind of thing a supplier will notice immediately, and it undermines everything else in the pack. 4.7 got the dates right, and its self-check section went further than any other test: it reconciled all KPIs back to the source tables, flagged Q4 OTIF as a borderline vulnerability (+0.1pp above target), marked the Responsiveness score as an estimate rather than verified data, and surfaced a mathematical anomaly in the Q1 figures. That's analyst-level quality control, not a formality.
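The borderline-pass flag is the interesting part of that reconciliation. A hypothetical sketch of the check, assuming an OTIF target of 95.0% and a 0.5pp borderline threshold (both illustrative; the article only states the +0.1pp margin):

```python
# Sketch of a KPI reconciliation pass: compare each KPI to its target and flag
# results that pass only by a narrow margin, as 4.7 did with Q4 OTIF.
# Target, achieved value, and borderline threshold are all illustrative.

def reconcile_kpi(name, value, target, borderline_pp=0.5):
    margin = value - target
    if margin < 0:
        return f"{name}: MISS ({margin:+.1f}pp vs target)"
    if margin <= borderline_pp:
        return f"{name}: PASS but borderline ({margin:+.1f}pp above target)"
    return f"{name}: PASS ({margin:+.1f}pp above target)"

print(reconcile_kpi("Q4 OTIF", 95.1, 95.0))
# -> Q4 OTIF: PASS but borderline (+0.1pp above target)
```

A plain pass/fail check would have reported Q4 OTIF as green; the three-way split is what turns a scorecard into an early-warning tool.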

One genuine point for 4.6: it chose PowerPoint for the QBR output, which is the right format for a supplier meeting. 4.7 chose Word + Excel, which is more rigorous but not what you'd take into a room. 4.6 also did something we liked: it asked for format confirmation before generating the pack. The user skipped it, which contributed to the errors, but that consultative check is good practice in real workflows.

For 4.7 users: The model won't ask for format confirmation; it'll make assumptions and execute. Build your format requirements into the prompt from the start.

4.7 does what you say. Not what you mean.

This showed up in every single test. Opus 4.6 would look at a prompt, infer the likely intent, and deliver a polished version of what it thought you wanted, cover page and all. Opus 4.7 reads what's in the prompt and executes that, nothing more. If you didn't ask for a Word document, you don't get one.

That's not a regression. It's a deliberate shift toward predictability. But it means prompts that relied on 4.6's inference (and plenty do, even ones never written with that in mind) will need updating.

⚠ Before you switch

Go through your highest-frequency prompts and add explicit output format instructions. What you had to imply with 4.6, you now need to state with 4.7. It's a one-time prompt rewrite, and once it's done the behaviour is actually more consistent.

The flip side: the energy 4.6 spent inferring your intent has clearly gone somewhere more useful. 4.7 thinks harder before it writes, and it shows.

The self-check is real this time

4.7 has a built-in verification loop that runs before it presents output. We watched it work in three of the five tests: it revised two supplier scores mid-check in UC1, stopped to verify the spend total before analysis in UC3, and produced a dedicated reconciliation section in UC5 that caught a borderline OTIF result we hadn't specifically flagged.

This is different from 4.6's approach. 4.6 included self-check sections in some tests, but they were mostly declarative: the model stated it had checked without showing any evidence of what it had actually reviewed. 4.7's checks are visible and they produce changes.

✓ Prompt simplification

If you've been adding "verify your calculations before presenting" or "run a consistency check" to your prompts, you can take those out now. 4.7 runs this automatically. Depending on how many of your prompts carry that instruction, removing it saves tokens and keeps things cleaner.

4.6 executive summary: the £200k discrepancy appears as a closing note, after the analysis is already framed around the wrong total.
4.7 opening: data quality flag is line one. The model stops before any calculation runs.

Deeper domain knowledge, applied without prompting

The clearest upgrade in 4.7 is domain depth. Across multiple tests it applied specific, current knowledge (regulations, market participants, sub-category practices) that 4.6 either missed or covered at a much higher level. And in no case were we prompting for this detail. It just appeared.

Risks that tell you something

Across UC2 and UC4, 4.6 produced flags like "product launch disruption" and "governing law risk." Technically correct. Analytically thin. 4.7 produced: "agency failure mid-campaign → financial distress / talent attrition / key person dependency." You can write an early warning indicator around that. You can't do much with "product launch disruption."

4.7 on the liability cap: the same clause where 4.6 wrote 'too low'. 4.7 calculated the multiplier, stated the floor, and included the proposed redline.

Sustainability as a category expert, not a checklist

In the Marketing Services category strategy, 4.7 cited the UK CMA Green Claims Code, the EU Green Claims Directive 2026, FSC stock, vegetable-based inks, Ad Net Zero, modular reusable event builds, and post-event waste diversion reports. 4.6 hit FSC and Scope 1/2. Neither was prompted for this level of detail. 4.7 wrote like someone who has actually managed this category before.

Numbers over labels

4.7 sustainability section: FSC stock, vegetable-based inks, Ad Net Zero, EU Green Claims Directive liability. None of this was prompted.

In the contract test, 4.7 translated every risk into a business impact figure. A liability cap issue became a "2x liability cap gap." A termination clause became a "6x termination gap." 4.6 named the same issues without quantifying them. The first version gives a Finance Director something to work with; the second gives them homework.

Sharp and strategic, but not always what you need

4.7 writes in a consulting register: tight, executive-ready, minimal padding. Its executive summaries are strong: the kind of thing you could paste into a board paper without editing. It uses bullets where they add value rather than as a default structure. It doesn't recap its methodology at the start of every section.

4.6 is more narrative and more accessible. Its longer rationales are useful when the audience needs to follow the logic rather than just read the conclusions. For stakeholder communications, non-expert audiences, or anything that needs warmth alongside analysis, 4.6 still reads better.

Where to use each voice

4.7 for exec summaries, strategy briefs, category plans, risk assessments, anything going to a decision-maker who wants conclusions fast.

4.6 or Sonnet for creative writing, supplier relationship content, stakeholder communications, or any document written for a general audience. 4.7's consulting tone can feel cold when connection matters more than precision.


Where to switch, where to stay

Switch to 4.7
Strategy & market research
Category strategy, competitive landscape, sourcing design, market intelligence. This is where 4.7's domain depth and consulting-register writing pay off most.
Switch to 4.7
Spend analysis & data work
Category spend, savings identification, supplier deduplication, Kraljic mapping. The data pre-check behaviour alone makes 4.7 the safer choice here.
Switch to 4.7
Contract redlining & supplier reviews
Risk extraction, severity calibration, QBR analysis, KPI reconciliation. 4.7's self-verification and quantified risk framing are directly useful in these workflows.
Split approach
RFP evaluation
Use 4.7 for the scoring and analytical evaluation. Use 4.6 or Sonnet to build the stakeholder-facing summary pack. Each model doing what it does best.
Stay on 4.6 / Sonnet
Creative & narrative writing
Procurement storytelling, change communications, supplier relationship content. 4.7's consulting register works against you here; 4.6 reads warmer and more human.
Stay on 4.6 / Sonnet
Presentation decks & general audiences
QBR packs, board slides, documents for non-expert readers. 4.6's format inference and narrative style are better suited to content designed to be presented or read aloud.

Will it cost more to run?

We didn't run API-level token counts in this evaluation; testing was done through the Cowork and Claude.ai interfaces. So treat this section as informed estimation. Pricing for both models is identical at $5/$25 per MTok input/output, so the question is purely about volume.

Anthropic's guidance suggests 4.7 at effort: xhigh could use up to 30% more tokens than 4.6 at effort: high, due to extended thinking and the self-verification pass. Our estimate is that real-world overhead sits closer to 15-20% initially, for a few reasons.

Estimated token overhead, relative to a 4.6 baseline:
- 4.7, worst case: +30%
- 4.7, our estimate: +15-20%
- 4.7, with mature prompts: roughly back to 4.6 levels

4.7's outputs are tighter: shorter prose, less methodology recap, fewer filler paragraphs. It may think longer but it writes less. The built-in self-check also removes a reprompt loop: if you were previously running a separate "check your work" prompt, that's gone now. And 4.7 tends to get to the right answer in one shot, which matters. Cheaper-per-token models that need three rounds of refinement often end up costing more overall, plus the time you spend on those rounds.
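Because per-token pricing is identical, the scenarios above translate into a simple budget projection. A back-of-envelope sketch at the published $5/$25 per MTok rates; the monthly volumes (10 MTok in, 2 MTok out) are purely illustrative:

```python
# Rough monthly cost projection under the overhead scenarios discussed above.
# Pricing is the published $5/$25 per MTok; volumes are illustrative only.

INPUT_PER_MTOK, OUTPUT_PER_MTOK = 5.00, 25.00

def monthly_cost(input_mtok, output_mtok, overhead=0.0):
    """Cost in USD for a month's token volume, scaled by an overhead factor."""
    base = input_mtok * INPUT_PER_MTOK + output_mtok * OUTPUT_PER_MTOK
    return base * (1 + overhead)

baseline = monthly_cost(10, 2)         # 4.6 baseline: $100
worst    = monthly_cost(10, 2, 0.30)   # 4.7 worst case, +30%
expected = monthly_cost(10, 2, 0.175)  # 4.7 estimate, midpoint of 15-20%

print(f"4.6: ${baseline:.2f}  4.7 est: ${expected:.2f}  4.7 worst: ${worst:.2f}")
```

Even the worst case is a bounded premium on a known baseline, which is why the one-shot accuracy argument in the next paragraph matters more than the raw overhead figure.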

As teams learn to write proper prompts for 4.7 (explicit format instructions, tighter specs) we'd expect the overhead to normalise toward 4.6 levels. The extra cost right now is largely the cost of adapting.

📋 What we didn't test

PDF and table reading: 4.7 is supposed to be meaningfully better at complex PDF structures and multi-table documents (3x the image resolution, per Anthropic) but we didn't get to test this. It's on the list for the next issue.

API-level token counting: The token analysis above is qualitative. A proper API-level comparison with matched prompts is in progress.


A strong step forward, and a bridge to whatever comes next

4.7 is the better model for procurement analytics. The self-verification, the domain depth, the risk calibration: these are real improvements that show up in outputs you'd actually use.

It also feels like a model in transition. The prompt-literalism and the formatting restraint suggest something that is being pushed in a deliberate direction and hasn't fully settled there yet. Our read is that the next major release will smooth some of these edges. For now, 4.7 is what you want for anything analytical. Hold on 4.6 for writing and presentation work, update your prompts before you flip the switch, and don't mistake the formatting plainness for a lack of capability. The analysis underneath is genuinely better.

If you're deciding which AI tools to deploy across your procurement function — not just which model to use, but which workflows to start with and how to build team adoption — that's exactly what we work through with clients. See our 12 AI use cases in procurement that actually work, or start with our AI Readiness Assessment to understand where your team stands today. And if you're evaluating the consulting vs. software question, this article lays out an honest framework.

4.6 vs 4.7 at a glance

Dimension | Opus 4.6 | Opus 4.7
Output format | Infers: produces polished docs unprompted | Literal: delivers exactly what the prompt specifies
Self-verification | Declarative: states it checked; rarely shows revision | Active: revises scores, flags errors, shows its work
Domain knowledge | Category-level: solid but surface | Deep: named regulations, sub-category specifics
Risk framing | Category-level: what to worry about | Mechanism: causal chains, what to monitor and do
Writing register | Narrative, accessible, good for general audiences | Consulting-style, tight, exec-ready
Data handling | Flags issues quietly; proceeds on stated data | Pre-checks: stops and verifies before analysis starts
Execution mode | Consultative: asks for confirmation on complex tasks | Autonomous: executes and delivers on its own assumptions
Prompt requirement | Infers context; tolerant of loose instructions | Explicit: format instructions must be stated
Methodology & Scoring Criteria
How the evaluation was designed and scored

Approach

Five self-contained prompts were developed using synthetic procurement data (fictional suppliers, spend figures, and contract terms created specifically for this evaluation). Each prompt was run identically in two separate sessions: Claude Opus 4.6 at effort: high and Claude Opus 4.7 at effort: xhigh. Outputs were scored independently across four dimensions for a maximum of 40 points per use case and 200 points overall.
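The scoring arithmetic can be sketched directly. The per-use-case and overall totals below are the published ones; the dimension-level split in the example is illustrative, since the article only reports use-case totals:

```python
# Sketch of the scoring scheme: four 0-10 dimensions per use case, five use
# cases, so 40 points per test and 200 overall. Per-use-case totals are the
# article's published scores; the example dimension split is illustrative.

DIMENSIONS = ("accuracy", "self_consistency", "output_quality", "instruction_following")

def use_case_score(scores):
    """Sum a use case's four dimension scores (each 0-10, max 40)."""
    assert set(scores) == set(DIMENSIONS) and all(0 <= v <= 10 for v in scores.values())
    return sum(scores.values())

# Published per-use-case totals (UC1..UC5)
opus_46 = [36.0, 34.0, 32.5, 34.0, 31.0]
opus_47 = [35.5, 36.0, 37.0, 37.0, 37.5]

print(sum(opus_46), sum(opus_47))  # 167.5 183.0
```

The totals reconcile with the headline figures: 167.5/200 for 4.6, 183.0/200 for 4.7, a 15.5-point margin.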

Where possible, output files were read programmatically and compared against source data. Arithmetic was verified independently in Python where discrepancies were suspected.

Scoring Dimensions

Dimension | What was assessed | Max
Accuracy / Completeness | Coverage of all required elements; correct facts, calculations, and references; detection of embedded data traps | 10
Self-consistency | Internal agreement between numbers, scores, and narrative; evidence of self-check execution | 10
Output quality | Usability without further editing; actionability for the intended audience (CPO, CMO, Finance Director) | 10
Instruction-following | All numbered steps completed; format requirements met; no required elements missed | 10

Caveats

Skill contamination (UC4 and UC5): Both use cases were run in a Cowork environment with active skills: category-strategy-builder for UC4 and supplier-scorecard-engine for UC5. Some document structure and section content in both outputs was influenced by skill prompting. Instruction-following scores were adjusted upward accordingly. Analytical content was scored independently and is unaffected.

Synthetic data: All supplier names, spend figures, contract terms, and performance data are fictional. Results reflect model behaviour on these scenarios and should not be read as real supplier performance.

No API token counts: Token counts were not captured via the API. Relative verbosity was noted qualitatively. The token section of this article is an estimate based on observed output length and Anthropic's published guidance.

Single-run results: Each use case was run once per model. LLM outputs have inherent variability; a different run on the same prompt could produce a different score within a reasonable range. The patterns we observed were consistent enough across five tests to draw conclusions, but treat individual use case scores as directional rather than definitive.

Molecule One builds AI-native procurement tools for mid-market and enterprise buyers. We help procurement teams deploy and get measurable value from AI in the workflows you run every day, not just in theory.

The Handshake is our series where we review new AI model launches. Every issue covers what just dropped, what actually changed, and what procurement teams should do about it.

Issue #001 · April 2026 · sk@moleculeone.ai · moleculeone.ai
© 2026 Molecule One · The Handshake, Issue #001 · Claude Opus 4.6 (effort: high) vs Claude Opus 4.7 (effort: xhigh)
Evaluation based on synthetic procurement scenarios. All data is fictional.