AI Prompt Engineering for Procurement: The Data [2026]
The Handshake · Companion Piece · May 2026

We Tested AI Prompt Engineering on Five Procurement Tasks. One Model Improved More Than Twice as Much.

A prompt engineering experiment across five procurement tasks using GPT-5.5 and Claude Opus 4.7. Same rubric, same evaluator. The only variable: how we structured the question.

Sandeep Karangula, Molecule One

Does AI Prompt Engineering Actually Matter in Procurement?

Prompt engineering matters in procurement AI. We always knew that. It's why we insist on building standard prompt libraries for procurement teams when we deploy AI for our customers, because we've seen first-hand how much the quality of the question shapes the quality of the answer. But until now, we'd never actually run a controlled experiment to measure the difference between a generic prompt and a model-optimised one.

Most procurement teams don't have prompt engineers. They type questions into ChatGPT or Claude the same way they'd email a colleague: "Review this contract and flag the risks." It works. The output is useful. But is there measurable value in spending an extra ten minutes structuring your prompt differently?

So we ran an experiment. Not with toy examples, but with the kind of procurement work that ends up in steering committee packs and supplier negotiations.

"With the generic prompt, GPT thought for 3 minutes. With the optimised prompt it is almost 6 minutes thinking time."

Sandeep Karangula, during the RFP Analysis test

We took five procurement tasks (RFP evaluation, contract redlining, spend analysis, category strategy, and supplier scorecards) and ran each one through Claude Opus 4.7 and GPT-5.5. Twice. First with a generic, copy-paste prompt. Then with a prompt specifically structured for each model's strengths. Same tasks. Same scoring rubric. Same evaluator. The only variable was how we wrote the question.

The Setup

Mode A: Control (Identical Prompt)
The exact same vendor-neutral prompt submitted to both models simultaneously. No model-specific formatting, no special tags, no architecture-aware structuring. The same text, copy-pasted.

Mode B: Treatment (Model-Optimised Prompt)
Prompts restructured for each model's architecture. For Opus: structured formatting with hierarchical sections, self-check instructions, and explicit verification steps. For GPT: outcome-first framing, numbered deliverables, and explicit citation requests.

The optimised prompts asked for the same deliverables. The difference was how they were structured. For Opus, we used hierarchical sections with clear role framing, data separation, task specification, output format requirements, and a dedicated self-check instruction. For GPT, we led with the desired outcome, numbered the deliverables explicitly, and asked for external citations.

Methodology

Scoring rubric: Four dimensions, 0–10 each, for a maximum of 40 per output (200 per model per mode across the five use cases): Accuracy & Completeness, Self-Consistency, Output Quality, and Instruction-Following.
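To make the rubric concrete, here is a minimal sketch of how one scored output could be recorded. The four dimension names come from the rubric; the `RubricScore` class and its validation are illustrative assumptions, not the tooling used in the evaluation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricScore:
    """One evaluated output: four dimensions, each scored 0-10."""

    accuracy_completeness: float
    self_consistency: float
    output_quality: float
    instruction_following: float

    def __post_init__(self) -> None:
        # Reject any dimension outside the rubric's 0-10 range.
        for name, value in vars(self).items():
            if not 0 <= value <= 10:
                raise ValueError(f"{name} must be between 0 and 10, got {value}")

    @property
    def total(self) -> float:
        """Sum of the four dimensions (maximum 40)."""
        return sum(vars(self).values())
```

Five such totals per model per mode add up to the /200 figures reported in the results.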

Controls: Same five use cases. Same evaluator. Same browser-based interfaces (claude.ai and chatgpt.com). Same day, same session, same source documents. Scoring completed before moving to the next test, with no retroactive adjustment.

The Headline Numbers

+5.0: GPT improvement (generic → optimised)
+2.0: Opus improvement (generic → optimised)
2.5×: more improvement for GPT than for Opus

Both models got better when we improved the prompts. But GPT improved by +5.0 points (2.8%) while Opus improved by +2.0 points (1.1%). GPT responded two-and-a-half times more strongly to prompt optimisation.

The competitive gap also shifted. With the generic prompt, Opus led by 9.5 points (185.5 vs 176.0 out of 200). With the optimised prompt, that gap narrowed to 6.5 points (187.5 vs 181.0).

Before & After: Full Results

| Use Case | Opus Generic | Opus Optimised | Opus Δ | GPT Generic | GPT Optimised | GPT Δ |
| --- | --- | --- | --- | --- | --- | --- |
| RFP Analysis | 37.0 | 39.0 | +2.0 | 34.0 | 35.0 | +1.0 |
| Contract Redlining | 36.0 | 36.5 | +0.5 | 33.5 | 36.0 | +2.5 |
| Spend Analysis | 35.5 | 37.5 | +2.0 | 36.5 | 36.5 | +0.0 |
| Category Strategy | 38.5 | 38.0 | −0.5 | 36.0 | 37.0 | +1.0 |
| Supplier Scorecard | 38.5 | 36.5 | −2.0 | 36.0 | 36.5 | +0.5 |
| TOTAL (/200) | 185.5 | 187.5 | +2.0 | 176.0 | 181.0 | +5.0 |
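As a quick arithmetic check, the totals, deltas, and headline ratios can be recomputed from the per-use-case scores in the table above. This is a minimal sketch: the scores are copied from the table, while the `SCORES` layout and the `totals` helper are illustrative names rather than part of the original evaluation.

```python
# Per-use-case scores (/40) copied from the results table: (generic, optimised).
SCORES = {
    "RFP Analysis":       {"opus": (37.0, 39.0), "gpt": (34.0, 35.0)},
    "Contract Redlining": {"opus": (36.0, 36.5), "gpt": (33.5, 36.0)},
    "Spend Analysis":     {"opus": (35.5, 37.5), "gpt": (36.5, 36.5)},
    "Category Strategy":  {"opus": (38.5, 38.0), "gpt": (36.0, 37.0)},
    "Supplier Scorecard": {"opus": (38.5, 36.5), "gpt": (36.0, 36.5)},
}


def totals(model: str) -> tuple[float, float]:
    """Return (generic_total, optimised_total) out of 200 for one model."""
    generic = sum(row[model][0] for row in SCORES.values())
    optimised = sum(row[model][1] for row in SCORES.values())
    return generic, optimised


opus_generic, opus_optimised = totals("opus")   # 185.5, 187.5
gpt_generic, gpt_optimised = totals("gpt")      # 176.0, 181.0

opus_gain = opus_optimised - opus_generic       # 2.0
gpt_gain = gpt_optimised - gpt_generic          # 5.0

print(f"Opus gain: {opus_gain:+.1f} ({opus_gain / opus_generic:.1%})")  # +2.0 (1.1%)
print(f"GPT gain:  {gpt_gain:+.1f} ({gpt_gain / gpt_generic:.1%})")     # +5.0 (2.8%)
print(f"GPT gain relative to Opus gain: {gpt_gain / opus_gain}x")       # 2.5x
print(f"Gap, generic prompt:   {opus_generic - gpt_generic}")           # 9.5
print(f"Gap, optimised prompt: {opus_optimised - gpt_optimised}")       # 6.5
```

Note that the percentages are relative to each model's own generic-prompt total, not to the 200-point maximum.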

See the full model comparison for detailed use-case breakdowns by scoring dimension.

How Claude Opus 4.7 Responded to Prompt Engineering

Opus's improvements were specific and narrow. The structured formatting with self-check instructions triggered verification behaviours that didn't appear with the generic prompt:

Supplier Scorecard: the inclusion-exclusion catch. Before producing any analysis of a supplier's delivery performance, Opus applied set theory to verify the source data: "For Q1: 128 on-time + 134 in-full out of 142 orders. By inclusion-exclusion, the minimum possible OTIF count is 128 + 134 − 142 = 120, but the data says 116. That's mathematically impossible." This only appeared in the structured prompt run. It's the single most technically impressive moment of the entire evaluation.

RFP Analysis: self-verification with documented revisions. Rather than asserting "I checked my work," Opus with the optimised prompt documented three specific score revisions it made during its analysis. The verification instruction converted a generic quality signal into an audit trail.

Contract Redlining: quantified financial exposure. The optimised prompt's output opened with an executive summary quantifying the commercial impact of contract deviations: £800k total exposure, £395k termination cost, £66.7k working-capital impact from payment terms. The generic prompt flagged the same risks but as observations rather than numbers.
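The inclusion-exclusion check quoted in the Supplier Scorecard example is easy to reproduce. The sketch below is not Opus's output; it is one way to express the same bound, with `otif_bounds` and `otif_is_feasible` as hypothetical helper names. The Q1 figures are the ones quoted above.

```python
def otif_bounds(on_time: int, in_full: int, total_orders: int) -> tuple[int, int]:
    """Feasible range for the number of orders that are both on time and in full.

    Lower bound (inclusion-exclusion): |A and B| >= |A| + |B| - N.
    Upper bound: |A and B| <= min(|A|, |B|).
    """
    lower = max(0, on_time + in_full - total_orders)
    upper = min(on_time, in_full)
    return lower, upper


def otif_is_feasible(otif: int, on_time: int, in_full: int, total_orders: int) -> bool:
    """True if the reported OTIF count is consistent with its two components."""
    lower, upper = otif_bounds(on_time, in_full, total_orders)
    return lower <= otif <= upper


# Q1: 128 on-time, 134 in-full, 142 orders, reported OTIF of 116.
print(otif_bounds(128, 134, 142))            # (120, 128)
print(otif_is_feasible(116, 128, 134, 142))  # False
```

A reported OTIF of 116 falls below the minimum of 120, which is why the source data cannot be correct as stated.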

But Opus Also Lost Something

In the RFP Analysis test with the generic prompt, Opus produced a Total Cost of Ownership table that the prompt never asked for. It was a valuable unrequested addition that showed genuine procurement domain reasoning. With the optimised prompt, that table disappeared. The structured format specified "Score table → Rationale → Risk flags → Recommendation." Opus followed the structure precisely and didn't add anything extra.

Trade-Off

Strict formatting can suppress good instincts. The TCO table was genuinely useful. A CFO would want it. But the structured prompt told Opus exactly what sections to produce, and Opus complied. More structure gives you more control, but less room for the model to add insight you didn't know to ask for.


How GPT-5.5 Responded to Prompt Engineering

GPT's improvements were larger and more visible. The outcome-first prompt with explicit deliverable numbering changed how the model structured its output:

Contract Redlining: from prose to tables. The generic prompt produced long paragraph-form analysis with headers and blockquotes. The optimised prompt adopted a 4-column table format (Clause / Deviation / Business Impact / Required Revision) and opened with "Dear Nexus team," a supplier-ready letter you could send directly to the counterparty. This was GPT's single biggest score movement across all tests (+2.5 points).

Category Strategy: citations nearly doubled. The generic prompt included 9 external citations with market context. The optimised prompt jumped to 16, adding named supplier profiles (WPP, Publicis, Omnicom/IPG, Dentsu, Havas), EU regulatory timelines (EUDR, PPWR), and specific market data. The explicit citation request pulled out research that GPT clearly had access to but didn't volunteer without being asked.

Spend Analysis: external data as standard. The optimised prompt brought in NLW rates, BCIS construction forecasts, and ONS indices. These are specific, named UK data sources that gave the spend analysis an evidence base beyond the dataset itself.

"GPT outperforms itself and gets closer to Opus formatting with the optimised prompt. It uses tables like Opus does. Interestingly, with the optimised prompt, Opus doesn't use tables as much as it did with the generic one."

Sandeep Karangula, during the Contract Redlining evaluation

GPT vs Claude Format Convergence: The Models Swapped Styles

This one caught us off guard.

With the generic prompt, the models had distinct formatting personalities. Opus defaulted to tables, structured sections, and concise analytical formats. GPT defaulted to prose, longer explanations, and narrative flow. When we optimised the prompts, both models moved toward each other's default style.

Generic Prompt Defaults

Opus: Tables, structured sections, separated "additional analysis" from prompted output, concise advisory tone

GPT: Prose-heavy, headers with blockquotes, everything in one stream, longer explanations

Optimised Prompt Outputs

Opus: Shifted toward narrative executive-report style, fewer tables, more detailed explanation

GPT: Adopted tables, structured layouts, produced supplier-ready formats, added "Dear Nexus team" letter framing

Hypothesis

Prompt optimisation may push models toward a common "optimal" output format for procurement work. Both models, when given architecture-appropriate instructions, converged on a middle ground: structured but explained, tabular but contextualised, concise but complete. The "optimal" procurement output may be less about which model you use and more about how well you communicate what you need.

Five Procurement AI Tasks: What Prompt Engineering Changed

The effect varied significantly across tasks. Here is what happened in each one.

01 RFP Analysis & Evaluation
Opus +2.0 GPT +1.0

Both models scored well with the generic prompt. They ranked four laptop procurement suppliers identically and reached the same conclusion on which to shortlist. The optimised prompt pushed Opus toward its strongest self-verification (documenting three specific score revisions) but cost it the unrequested TCO table. GPT's thinking time doubled from ~3 to ~6 minutes, and the extra reasoning produced cleaner formatting with dual scales.

Key finding: GPT's doubled thinking time with the optimised prompt suggests that prompt structure affects model reasoning effort, not just output formatting. In this run, the model spent more thinking time when the prompt was structured for outcome-first delivery.

02 Contract Redlining
Opus +0.5 GPT +2.5

GPT's biggest single improvement. The generic prompt produced prose-heavy analysis that was accurate but hard to action. The optimised prompt adopted a 4-column table (Clause / Deviation / Impact / Revision) and opened as a supplier-ready letter: "Dear Nexus team, We have reviewed the draft IT Managed Services Agreement against Hartwell Retail Group plc's standard contracting positions." Meanwhile, Opus shifted from tables toward narrative with quantified financial exposure (£800k, £395k, £66.7k).

Key finding: This is where format convergence was most visible. GPT moved toward Opus's tabular style while Opus moved toward GPT's narrative style. The gap narrowed from 2.5 points (generic prompt) to 0.5 points (optimised prompt).

03 Spend Analysis & Savings
Opus +2.0 GPT +0.0

Spend Analysis with the generic prompt was GPT's only outright win across all 10 tests. Its clean Kraljic table and external citations (NLW rates, BCIS forecasts) edged out Opus's more analytical but less visual approach. With the optimised prompt, Opus surged: it added a data verification table, mid-point savings estimates with phasing, and a formal calculation verification section triggered by the self-check instruction. GPT's score stayed flat because it was already near its ceiling for this task.

Key finding: Opus's verification instruction produced a structured data integrity table (7 checks with findings and actions) that didn't appear with the generic prompt. Self-verification is a capability that can be explicitly triggered by prompt structure. It doesn't happen automatically.

04 Category Strategy
Opus −0.5 GPT +1.0

One of two use cases where Opus scored lower with the optimised prompt (the other was Supplier Scorecard). The generic prompt already produced a near-perfect output (38.5/40), a 14-page .docx with title page, executive summary, Kraljic mapping, sourcing roadmap, and signature lines. The optimised prompt produced a slightly different document (119 paragraphs vs 96, more aggressive savings targets) but didn't improve. GPT, meanwhile, went from 9 to 16 external citations, adding named supplier profiles and EU regulatory timelines (EUDR, PPWR).

Key finding: When a model is already performing at 96%+ on a task, prompt optimisation has limited room to help, and can even slightly hurt by constraining useful default behaviours. GPT's 10.0 on Accuracy/Completeness with the optimised prompt was its only perfect dimension score in the entire evaluation.

05 Supplier Scorecard / QBR
Opus −2.0 GPT +0.5

Opus with the generic prompt produced the strongest individual output of the entire evaluation: a .md file with OTIF set-theory verification, a transparent deduction-based quality scoring model, NC root cause analysis, and a supplier-facing executive summary. The optimised prompt changed the quality severity weights (Critical ×2.0 became ×2.5, Major ×0.5 became ×0.7) and didn't produce a file. GPT caught something Opus missed: NC-003's production date fell in Q1 despite being reported in Q2.

Key finding: Opus's optimised prompt quality weights (harsher multipliers) produced Q1 quality of 0.10/10 vs the generic prompt's already-low figure. The structured prompt triggered a recalibration that was arguably too aggressive. Meanwhile, GPT's NC-003 timing catch showed that prompt optimisation can surface different analytical strengths.

What We Didn't Expect

1. The TCO Table That Vanished

In the RFP Analysis test with the generic prompt, Opus inferred that a proper RFP evaluation needs total cost of ownership analysis and built a full TCO normalisation table the prompt never asked for. It was exactly what a procurement director would want. With the optimised prompt, the structured format specified four output sections. Opus followed the structure faithfully and the TCO table disappeared. The prompt was too prescriptive.

2. The Thinking Time That Doubled

GPT's visible "thinking" time in the RFP Analysis test went from approximately 3 minutes (generic prompt) to approximately 6 minutes (optimised prompt). The outcome-first prompt didn't add complexity to the task. It restructured how the task was framed. The model's reasoning infrastructure responded to the framing change by investing more computation. For single queries this is fine. For batch workflows processing 50 supplier responses, doubling processing time matters.
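To put the batch-workflow point in numbers, here is a back-of-the-envelope sketch. It assumes responses are processed one at a time and that the approximate 3- and 6-minute thinking times observed in the RFP Analysis test apply to every response; both are simplifying assumptions, and `batch_thinking_hours` is a hypothetical helper.

```python
def batch_thinking_hours(responses: int, minutes_each: float) -> float:
    """Total sequential thinking time in hours for a batch of responses."""
    return responses * minutes_each / 60


generic_hours = batch_thinking_hours(50, 3)     # 2.5
optimised_hours = batch_thinking_hours(50, 6)   # 5.0

print(f"Generic prompt:   {generic_hours} h")
print(f"Optimised prompt: {optimised_hours} h")
```

Under those assumptions, a 50-response batch goes from roughly two and a half hours of thinking time to roughly five.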

3. The File That Stopped Appearing

Opus produced a .md file in the Supplier Scorecard test with the generic prompt but not with the optimised prompt. It produced files in Category Strategy (both prompts), Contract Redlining (optimised only), and Spend Analysis (optimised only). No clear pattern emerged for when the structured prompt triggered file generation and when it didn't. GPT never produced a file with either prompt across any use case. At least that was consistent.

4. The Format Swap

We covered this above, but it bears repeating: the models didn't just get better with optimised prompts. They also got more similar. GPT adopted Opus's table-first approach. Opus adopted GPT's narrative-explanation approach. If you're choosing a model for its "style," that style is partially a function of how you prompt it.

5. Copy-Paste Contamination

In the Category Strategy test, GPT's optimised-prompt response included phrasing that echoed the prompt itself. More detailed, more structured prompts create more surface area for the model to inadvertently recycle prompt language into its output. When your prompt contains specific phrases like "preferred supplier panel with 10–15 pre-qualified partners," the model may parrot that framing back rather than generating its own analysis.
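One way to spot this kind of echo is to look for word sequences shared between the prompt and the response. The sketch below is a simple n-gram overlap check, not something used in the evaluation; the function names and the two example sentences are hypothetical, built around the phrase quoted above.

```python
import re


def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """All runs of n consecutive lowercase word tokens in the text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def echoed_phrases(prompt: str, response: str, n: int = 5) -> list[str]:
    """Return the n-word phrases that appear in both the prompt and the response."""
    shared = word_ngrams(prompt, n) & word_ngrams(response, n)
    return sorted(" ".join(gram) for gram in shared)


# Hypothetical prompt and response containing the phrase from the article.
example_prompt = "Consider a preferred supplier panel with 10–15 pre-qualified partners."
example_response = "We recommend a preferred supplier panel with 10–15 pre-qualified partners by Q3."

for phrase in echoed_phrases(example_prompt, example_response):
    print(phrase)
```

A long run of overlapping phrases is a cue to check whether the model analysed the option or simply repeated it. Shorter values of `n` catch more echoes but also more coincidental matches.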

How to Prompt AI for Procurement: Practical Takeaways

For GPT Users

Investing 10 minutes in prompt structure is worth it. Your model benefits more than Opus does. GPT gained 2.5× more from optimised prompts than Opus did, meaning prompt engineering has higher ROI if you're on the OpenAI platform. Specifically: lead with the outcome you want, number your deliverables explicitly, and ask for external citations by name (e.g., "cite industry benchmarks with sources"). GPT has access to research it won't volunteer unless you ask. For a step-by-step guide on implementing these techniques, see our practical guide to implementing AI in procurement.

For Claude Users

Focus on verification instructions rather than output format. Opus's biggest improvements came from self-check sections, not format prescriptions. Asking Opus to "verify your calculations and document any revisions" triggers audit-trail behaviour. But avoid over-specifying the output structure. Opus's best unrequested additions (the TCO table, the set-theory data verification) appeared when it had format flexibility.

For Both

Match prompt intensity to task ceiling. If a task is already scoring 95%+ with a generic prompt (like Opus on category strategy), optimisation may not help and can hurt. Save your structured prompts for tasks where the model's default output needs improvement, especially format-dependent tasks like contract redlining or supplier QBR packs. Not sure where to start? Our AI Readiness Assessment identifies which procurement tasks will benefit most.
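The ceiling rule can be written as a simple threshold check. This is a sketch under stated assumptions: the 95% threshold comes from the paragraph above, the scores are Opus's generic-prompt results from the results table, and `near_ceiling` is a hypothetical helper name.

```python
def near_ceiling(score: float, max_score: float = 40.0, threshold: float = 0.95) -> bool:
    """True if a generic-prompt score is already at or above the threshold share of the maximum."""
    return score / max_score >= threshold


# Opus generic-prompt scores (/40) from the results table.
OPUS_GENERIC = {
    "RFP Analysis": 37.0,
    "Contract Redlining": 36.0,
    "Spend Analysis": 35.5,
    "Category Strategy": 38.5,
    "Supplier Scorecard": 38.5,
}

for task, score in OPUS_GENERIC.items():
    advice = "near ceiling" if near_ceiling(score) else "room to improve"
    print(f"{task}: {advice}")
```

On these scores, only Category Strategy and Supplier Scorecard (both 38.5/40) cross the 95% line, and these are the two tasks where Opus scored lower with the optimised prompt.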

Quick-Reference: What Worked

| Technique | Works For | Evidence |
| --- | --- | --- |
| Self-check / verification instruction | Claude Opus | Triggered OTIF set-theory catch (Supplier Scorecard), documented score revisions (RFP Analysis), data verification table (Spend Analysis) |
| Outcome-first framing | GPT | Doubled thinking time (RFP Analysis), improved format quality across all use cases |
| Explicit citation requests | GPT | Citations went from 9 to 16 in Category Strategy; added NLW/BCIS/ONS data in Spend Analysis |
| Hierarchical section structure | Claude Opus | Precise section adherence; consistent format across all use cases |
| Numbered deliverables | GPT | Adopted structured table formats, produced supplier-ready outputs (Contract Redlining) |
| Strict output format specification | Use with caution | TCO table disappeared when format was over-prescribed (RFP Analysis, Opus) |

Free Download

Prompting Best Practices for Procurement

We put together the prompting best practices from this experiment as two printable two-page guides: one for Claude, one for ChatGPT. Each covers the techniques that produced measurable improvement, with example prompts and task-by-task quick reference.


Best AI Model for Procurement: Diminishing Returns and Promptability

There's a pattern worth calling out. Opus started higher (185.5/200 with the generic prompt, or 92.8%) and improved less (+2.0). GPT started lower (176.0/200, or 88.0%) and improved more (+5.0). Two possible explanations:

Ceiling effect: Opus's defaults are already strong for structured analytical procurement work. There's less room for prompt-driven improvement when the baseline is already 92%+.

Promptability: GPT may be more responsive to how you talk to it, more elastic in its output format, and more willing to shift behaviour based on prompt structure. The gap narrowing from 9.5 to 6.5 points is consistent with this reading: GPT is more "promptable."
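One rough way to look at the two explanations side by side is to express each model's gain as a share of the headroom it had left. This is an illustrative calculation, not part of the original scoring; `headroom_gain` is a hypothetical helper, and the totals are the /200 figures reported above.

```python
def headroom_gain(baseline: float, optimised: float, max_score: float = 200.0) -> float:
    """Share of the remaining headroom that the optimised prompt recovered."""
    headroom = max_score - baseline
    if headroom <= 0:
        raise ValueError("baseline is already at the maximum score")
    return (optimised - baseline) / headroom


print(f"Opus: {headroom_gain(185.5, 187.5):.1%}")  # 13.8%
print(f"GPT:  {headroom_gain(176.0, 181.0):.1%}")  # 20.8%
```

With one run per cell, this is a descriptive ratio rather than evidence that settles the question either way.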

For procurement teams choosing between platforms, this is worth thinking about. If you're willing to invest in prompt templates, GPT's ceiling may be closer to Opus's than the defaults suggest. If you want strong output from day one without prompt engineering, Opus's defaults give you a higher floor. We've explored this trade-off in more depth in our guide on AI procurement consulting vs. software.

Limitations

Here is what this experiment can and can't tell you:

These two models only. This review applies specifically to Claude Opus 4.7 and GPT-5.5. Other models, including future versions of these same models, could respond to prompt optimisation very differently. Don't assume the ratios or techniques transfer without testing.

Single run. We ran each prompt style once per model per use case. LLM outputs vary between runs. A rigorous study would run each test multiple times and average results. Our data shows one data point per cell, not a distribution.
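For anyone repeating the experiment with multiple runs, the aggregation step is small. The sketch below uses Python's standard `statistics` module; `summarise_runs` is a hypothetical helper, and the scores in the usage line are placeholders, not results from this evaluation.

```python
from statistics import mean, stdev


def summarise_runs(scores: list[float]) -> dict[str, float]:
    """Mean and sample standard deviation for repeated runs of one cell."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return {"runs": len(scores), "mean": mean(scores), "stdev": stdev(scores)}


# Placeholder scores for one model, one prompt style, one use case.
print(summarise_runs([36.0, 37.0, 38.0]))
```

Comparing the spread within a cell to the difference between cells shows whether a 0.5-point or 2.0-point delta is larger than run-to-run noise.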

Browser-based. Both models were tested via their web interfaces (claude.ai, chatgpt.com), not via API. Opus could only use "adaptive" compute because the "High" setting wasn't available in browser. With higher compute, Opus's improvement might have been larger.

Five use cases. Procurement has hundreds of task types. Our five are representative of analytical work but don't cover negotiation simulation, market intelligence research, or operational procurement workflows. For a broader view of where AI delivers in procurement, see our 12 AI use cases in procurement that actually work.

Prompt optimisation is a spectrum. Our "optimised" prompts represent one approach to prompt engineering. Different structuring choices, or the same techniques applied differently, might produce different results.

The Bottom Line

Prompt engineering is measurable. Across five procurement tasks, restructuring how we asked the question improved output quality by 1 to 3 percent, with GPT benefiting two-and-a-half times more than Opus. Ten minutes of prompt structuring gets you better formats, more citations, and explicit verification. The trade-off is that over-structuring can suppress valuable model instincts. For any procurement team using AI regularly, building a small library of model-specific prompt templates is one of the highest-ROI investments you can make.

Full Methodology & Scoring Rubric

Scoring Dimensions (0–10 each)

| Dimension | What It Measures |
| --- | --- |
| Accuracy & Completeness | Factual correctness, coverage of all prompt requirements, depth of analysis relative to source data |
| Self-Consistency | Internal coherence: do the numbers, conclusions, and recommendations align with each other throughout? |
| Output Quality | Readability, actionability, professional format. Could you send this to a stakeholder without rework? |
| Instruction-Following | Did the model do what was asked? Did it cover all sections, produce requested deliverables, stay within scope? |

Test Environment

Interfaces: claude.ai (Opus 4.7, adaptive compute only) and chatgpt.com (GPT-5.5).

Simultaneous submission: Mode A prompts submitted to both models at approximately the same time.

No API: Deliberate. Browser UX, file generation, and follow-up behaviour are part of the evaluation. Desktop tooling excluded.

Evaluator: Senior procurement professional with category management and supplier management experience.

Mode A Prompt Design

Vendor-neutral, conversational. Written as you'd write to a knowledgeable colleague: "Please analyse these four RFP responses and score them on Technical (30%), Commercial (25%), Compliance (25%), and Delivery Risk (20%). Rank them and provide a recommendation." No formatting instructions, no output structure specification, no verification requests.
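The weighting in the example prompt implies a simple weighted total per supplier. The sketch below shows that arithmetic; the weights are the ones quoted in the prompt, while the 0-10 criterion scale, the `weighted_score` helper, and the supplier scores in the usage line are assumptions for illustration.

```python
# Criterion weights (percent) quoted in the Mode A prompt.
WEIGHTS = {"Technical": 30, "Commercial": 25, "Compliance": 25, "Delivery Risk": 20}


def weighted_score(criterion_scores: dict[str, float]) -> float:
    """Weighted total on the same scale as the inputs (0-10 assumed here)."""
    missing = WEIGHTS.keys() - criterion_scores.keys()
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    return sum(WEIGHTS[name] * criterion_scores[name] for name in WEIGHTS) / 100


# Hypothetical supplier scores, for illustration only.
print(weighted_score({"Technical": 8, "Commercial": 6, "Compliance": 10, "Delivery Risk": 4}))  # 7.2
```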

Mode B Prompt Design: Opus

Hierarchical structure with clear sections: role framing (you are a senior procurement analyst), data section (the source material), task specification (numbered deliverables), output format (exact section order), and a self-check instruction asking the model to verify calculations and flag any inconsistencies before finalising. Structure used clear separators and hierarchy.
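The structure described above can be sketched as a small template builder. The five sections and their order follow the description; the `=== TITLE ===` separator style, the function name, and the exact instruction wording are assumptions, since the original prompts are not reproduced here.

```python
def build_hierarchical_prompt(
    source_material: str,
    deliverables: list[str],
    section_order: list[str],
) -> str:
    """Assemble a sectioned prompt: role, data, task, output format, self-check."""
    numbered = "\n".join(f"{i}. {item}" for i, item in enumerate(deliverables, start=1))
    sections = [
        ("ROLE", "You are a senior procurement analyst."),
        ("DATA", source_material),
        ("TASK", numbered),
        ("OUTPUT FORMAT", "Use exactly this section order: " + " → ".join(section_order)),
        ("SELF-CHECK", "Before finalising, verify your calculations and flag any inconsistencies."),
    ]
    # The separator style is illustrative; any clear, consistent delimiter would do.
    return "\n\n".join(f"=== {title} ===\n{body}" for title, body in sections)


prompt = build_hierarchical_prompt(
    source_material="<paste the four RFP responses here>",
    deliverables=["Score each supplier", "Rank the suppliers", "Recommend a shortlist"],
    section_order=["Score table", "Rationale", "Risk flags", "Recommendation"],
)
print(prompt)
```

Given the TCO table result, it may be worth loosening the OUTPUT FORMAT section, for example by allowing additional analysis after the required sections.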

Mode B Prompt Design: GPT

Outcome-first: opened with what success looks like ("A complete evaluation pack ready for the sourcing committee"). Numbered deliverables (1. Scoring table, 2. Supplier rationale, 3. Risk flags, 4. Recommendation). Explicit citation request ("support recommendations with external benchmarks or industry data where available"). Conversational tone maintained but with structural clarity.
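The outcome-first pattern can be sketched the same way. The outcome statement, the four deliverables, and the citation request are quoted from the description above; the function name, the labels, and the overall layout are assumptions rather than the exact prompt used.

```python
def build_outcome_first_prompt(
    outcome: str,
    deliverables: list[str],
    source_material: str,
) -> str:
    """Assemble a prompt that leads with the outcome, then numbered deliverables."""
    numbered = "\n".join(f"{i}. {item}" for i, item in enumerate(deliverables, start=1))
    return (
        f"Outcome: {outcome}\n\n"
        f"Deliverables:\n{numbered}\n\n"
        "Support recommendations with external benchmarks or industry data where available.\n\n"
        f"Source material:\n{source_material}"
    )


gpt_prompt = build_outcome_first_prompt(
    outcome="A complete evaluation pack ready for the sourcing committee",
    deliverables=["Scoring table", "Supplier rationale", "Risk flags", "Recommendation"],
    source_material="<paste the four RFP responses here>",
)
print(gpt_prompt)
```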

Controls

Same source documents (synthetic procurement data created for this evaluation). Same scoring rubric applied to all outputs. Scoring completed immediately after reviewing each output, with no retroactive adjustment. Both models tested in the same session on the same day. No cherry-picking of runs. First output used for scoring.

What's Next

We build model-specific prompt libraries like these for procurement teams. If you're deploying AI across procurement and want structured prompts that get measurably better output from day one, we can help. Our AI training for procurement teams includes hands-on prompt engineering workshops tailored to your use cases.

Not sure where to start? Take our AI Readiness Assessment or speak to our consulting team.

Molecule One is a procurement consultancy that uses AI to make category management, sourcing, and supplier development faster and sharper. We test these tools on real work so our clients don't have to.

The Handshake is our series testing new AI models on procurement workflows. This is a companion piece to Issue #2: GPT-5.5 vs Claude Opus 4.7. Read the full model comparison for detailed use-case breakdowns.
© 2026 Molecule One. All rights reserved.