We Tested AI Prompt Engineering on Five Procurement Tasks. One Model Improved Twice as Much.
A prompt engineering experiment across five procurement tasks using GPT-5.5 and Claude Opus 4.7. Same rubric, same evaluator. The only variable: how we structured the question.
Does AI Prompt Engineering Actually Matter in Procurement?
Prompt engineering matters in procurement AI. We always knew that. It's why we insist on building standard prompt libraries for procurement teams when we deploy AI for our customers, because we've seen first-hand how much the quality of the question shapes the quality of the answer. But until now, we'd never actually run a controlled experiment to measure the difference between a generic prompt and a model-optimised one.
Most procurement teams don't have prompt engineers. They type questions into ChatGPT or Claude the same way they'd email a colleague: "Review this contract and flag the risks." It works. The output is useful. But is there measurable value in spending an extra ten minutes structuring your prompt differently?
So we ran an experiment. Not with toy examples, but with the kind of procurement work that ends up in steering committee packs and supplier negotiations.
"With the generic prompt, GPT thought for 3 minutes. With the optimised prompt it is almost 6 minutes thinking time."
Sandeep Karangula, during the RFP Analysis test
We took five procurement tasks (RFP evaluation, contract redlining, spend analysis, category strategy, and supplier scorecards) and ran each one through Claude Opus 4.7 and GPT-5.5. Twice. First with a generic, copy-paste prompt. Then with a prompt specifically structured for each model's strengths. Same tasks. Same scoring rubric. Same evaluator. The only variable was how we wrote the question.
The Setup
The optimised prompts asked for the same deliverables. The difference was how they were structured. For Opus, we used hierarchical sections with clear role framing, data separation, task specification, output format requirements, and a dedicated self-check instruction. For GPT, we led with the desired outcome, numbered the deliverables explicitly, and asked for external citations.
Scoring rubric: Four dimensions, 0–10 each, for a maximum of 40 per output: Accuracy & Completeness, Self-Consistency, Output Quality, and Instruction-Following.
Controls: Same five use cases. Same evaluator. Same browser-based interfaces (claude.ai and chatgpt.com). Same day, same session, same source documents. Scoring completed before moving to the next test, with no retroactive adjustment.
The Headline Numbers
Both models got better when we improved the prompts. But GPT gained 5.0 points (2.8% over its generic-prompt total) while Opus gained 2.0 points (1.1% over its own). GPT responded two-and-a-half times more strongly to prompt optimisation.
The competitive gap also shifted. With the generic prompt, Opus led by 9.5 points (185.5 vs 176.0 out of 200). With the optimised prompt, that gap narrowed to 6.5 points (187.5 vs 181.0).
Before & After: Full Results
| Use Case | Opus Generic | Opus Optimised | Opus Δ | GPT Generic | GPT Optimised | GPT Δ |
|---|---|---|---|---|---|---|
| RFP Analysis | 37.0 | 39.0 | +2.0 | 34.0 | 35.0 | +1.0 |
| Contract Redlining | 36.0 | 36.5 | +0.5 | 33.5 | 36.0 | +2.5 |
| Spend Analysis | 35.5 | 37.5 | +2.0 | 36.5 | 36.5 | +0.0 |
| Category Strategy | 38.5 | 38.0 | −0.5 | 36.0 | 37.0 | +1.0 |
| Supplier Scorecard | 38.5 | 36.5 | −2.0 | 36.0 | 36.5 | +0.5 |
| TOTAL (/200) | 185.5 | 187.5 | +2.0 | 176.0 | 181.0 | +5.0 |
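The totals, percentage gains, and gap figures quoted above can be reproduced from the table with a few lines of Python. This is only a sketch for checking the arithmetic; the `SCORES` layout and the `summarise` helper are our own naming, not part of the evaluation tooling.

```python
# (generic, optimised) score per use case, each out of 40, copied from the table.
SCORES = {
    "Opus": {
        "RFP Analysis": (37.0, 39.0),
        "Contract Redlining": (36.0, 36.5),
        "Spend Analysis": (35.5, 37.5),
        "Category Strategy": (38.5, 38.0),
        "Supplier Scorecard": (38.5, 36.5),
    },
    "GPT": {
        "RFP Analysis": (34.0, 35.0),
        "Contract Redlining": (33.5, 36.0),
        "Spend Analysis": (36.5, 36.5),
        "Category Strategy": (36.0, 37.0),
        "Supplier Scorecard": (36.0, 36.5),
    },
}


def summarise(model: str) -> dict:
    """Total the five use cases and express the gain relative to the generic total."""
    generic = sum(g for g, _ in SCORES[model].values())
    optimised = sum(o for _, o in SCORES[model].values())
    delta = optimised - generic
    return {
        "generic": generic,
        "optimised": optimised,
        "delta": delta,
        "pct_gain": round(100 * delta / generic, 1),
    }


opus = summarise("Opus")
gpt = summarise("GPT")
gain_ratio = gpt["delta"] / opus["delta"]
gap_generic = opus["generic"] - gpt["generic"]
gap_optimised = opus["optimised"] - gpt["optimised"]

print(opus)   # totals 185.5 -> 187.5, +2.0, 1.1%
print(gpt)    # totals 176.0 -> 181.0, +5.0, 2.8%
print(gain_ratio, gap_generic, gap_optimised)  # 2.5 9.5 6.5
```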
See the full model comparison for detailed use-case breakdowns by scoring dimension.
How Claude Opus 4.7 Responded to Prompt Engineering
Opus's improvements were specific and narrow. The structured formatting with self-check instructions triggered verification behaviours that didn't appear with the generic prompt:
Supplier Scorecard: the inclusion-exclusion catch. Before producing any analysis of a supplier's delivery performance, Opus applied set theory to verify the source data: "For Q1: 128 on-time + 134 in-full out of 142 orders. By inclusion-exclusion, the minimum possible OTIF count is 128 + 134 − 142 = 120, but the data says 116. That's mathematically impossible." This only appeared in the structured prompt run. It's the single most technically impressive moment of the entire evaluation.
RFP Analysis: self-verification with documented revisions. Rather than asserting "I checked my work," Opus with the optimised prompt documented three specific score revisions it made during its analysis. The verification instruction converted a generic quality signal into an audit trail.
Contract Redlining: quantified financial exposure. The optimised prompt's output opened with an executive summary quantifying the commercial impact of contract deviations: £800k total exposure, £395k termination cost, £66.7k working-capital impact from payment terms. The generic prompt flagged the same risks but as observations rather than numbers.
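The inclusion-exclusion check from the Supplier Scorecard example is easy to reproduce. The sketch below is our own illustration of the bound, not Opus's output: the number of orders that are both on time and in full can be no lower than on-time plus in-full minus total orders, and no higher than the smaller of the two counts. Function names are hypothetical.

```python
def otif_bounds(total_orders: int, on_time: int, in_full: int) -> tuple[int, int]:
    """Return the (minimum, maximum) feasible count of on-time-in-full orders."""
    lower = max(0, on_time + in_full - total_orders)  # inclusion-exclusion floor
    upper = min(on_time, in_full)                     # cannot exceed either count
    return lower, upper


def otif_is_feasible(total_orders: int, on_time: int, in_full: int, reported: int) -> bool:
    """Check whether a reported OTIF count is mathematically possible."""
    lower, upper = otif_bounds(total_orders, on_time, in_full)
    return lower <= reported <= upper


# Q1 figures quoted above: 142 orders, 128 on time, 134 in full, 116 reported OTIF.
q1_bounds = otif_bounds(142, 128, 134)
q1_ok = otif_is_feasible(142, 128, 134, reported=116)

print(q1_bounds)  # (120, 128)
print(q1_ok)      # False: 116 is below the minimum of 120
```

A check like this is cheap to run on scorecard data before it ever reaches a model.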
But Opus Also Lost Something
In the RFP Analysis test with the generic prompt, Opus produced a Total Cost of Ownership table that the prompt never asked for. It was a valuable unrequested addition that showed genuine procurement domain reasoning. With the optimised prompt, that table disappeared. The structured format specified "Score table → Rationale → Risk flags → Recommendation." Opus followed the structure precisely and didn't add anything extra.
Strict formatting can suppress good instincts. The TCO table was genuinely useful. A CFO would want it. But the structured prompt told Opus exactly what sections to produce, and Opus complied. More structure gives you more control, but less room for the model to add insight you didn't know to ask for.
How GPT-5.5 Responded to Prompt Engineering
GPT's improvements were larger and more visible. The outcome-first prompt with explicit deliverable numbering changed how the model structured its output:
Contract Redlining: from prose to tables. The generic prompt produced long paragraph-form analysis with headers and blockquotes. The optimised prompt adopted a 4-column table format (Clause / Deviation / Business Impact / Required Revision) and opened with "Dear Nexus team," a supplier-ready letter you could send directly to the counterparty. This was GPT's single biggest score movement across all tests (+2.5 points).
Category Strategy: citations doubled. The generic prompt included 9 external citations with market context. The optimised prompt jumped to 16, adding named supplier profiles (WPP, Publicis, Omnicom/IPG, Dentsu, Havas), EU regulatory timelines (EUDR, PPWR), and specific market data. The explicit citation request pulled out research that GPT clearly had access to but didn't volunteer without being asked.
Spend Analysis: external data as standard. The optimised prompt brought in NLW rates, BCIS construction forecasts, and ONS indices. These are specific, named UK data sources that gave the spend analysis an evidence base beyond the dataset itself.
"GPT outperforms itself and gets closer to Opus formatting with the optimised prompt. It uses tables like Opus does. Interestingly, with the optimised prompt, Opus doesn't use tables as much as it did with the generic one."
Sandeep Karangula, during the Contract Redlining evaluation
GPT vs Claude Format Convergence: The Models Swapped Styles
This one caught us off guard.
With the generic prompt, the models had distinct formatting personalities. Opus defaulted to tables, structured sections, and concise analytical formats. GPT defaulted to prose, longer explanations, and narrative flow. When we optimised the prompts, both models moved toward each other's default style.
With the generic prompt:
Opus: Tables, structured sections, separated "additional analysis" from prompted output, concise advisory tone
GPT: Prose-heavy, headers with blockquotes, everything in one stream, longer explanations
With the optimised prompt:
Opus: Shifted toward narrative executive-report style, fewer tables, more detailed explanation
GPT: Adopted tables, structured layouts, produced supplier-ready formats, added "Dear Nexus team" letter framing
Prompt optimisation may push models toward a common "optimal" output format for procurement work. Both models, when given architecture-appropriate instructions, converged on a middle ground: structured but explained, tabular but contextualised, concise but complete. The "optimal" procurement output may be less about which model you use and more about how well you communicate what you need.
Five Procurement AI Tasks: What Prompt Engineering Changed
The effect varied significantly across tasks. Here is what happened in each one.
RFP Analysis: both models scored well with the generic prompt. They ranked four laptop procurement suppliers identically and reached the same conclusion on which to shortlist. The optimised prompt pushed Opus toward its strongest self-verification (documenting three specific score revisions) but cost it the unrequested TCO table. GPT's thinking time doubled from ~3 to ~6 minutes, and the extra reasoning produced cleaner formatting with dual scales.
Contract Redlining: GPT's biggest single improvement. The generic prompt produced prose-heavy analysis that was accurate but hard to action. The optimised prompt adopted a 4-column table (Clause / Deviation / Impact / Revision) and opened as a supplier-ready letter: "Dear Nexus team, We have reviewed the draft IT Managed Services Agreement against Hartwell Retail Group plc's standard contracting positions." Meanwhile, Opus shifted from tables toward narrative with quantified financial exposure (£800k, £395k, £66.7k).
Spend Analysis: with the generic prompt, this was GPT's only outright win across all 10 tests. Its clean Kraljic table and external citations (NLW rates, BCIS forecasts) edged out Opus's more analytical but less visual approach. With the optimised prompt, Opus surged: it added a data verification table, mid-point savings estimates with phasing, and a formal calculation verification section triggered by the self-check instruction. GPT's score stayed flat because it was already near its ceiling for this task.
Category Strategy: one of two use cases where Opus scored lower with the optimised prompt (the other was Supplier Scorecard). The generic prompt already produced a near-perfect output (38.5/40), a 14-page .docx with title page, executive summary, Kraljic mapping, sourcing roadmap, and signature lines. The optimised prompt produced a slightly different document (119 paragraphs vs 96, more aggressive savings targets) but didn't improve. GPT, meanwhile, went from 9 to 16 external citations, adding named supplier profiles and EU regulatory timelines (EUDR, PPWR).
Supplier Scorecard: Opus with the generic prompt produced the strongest individual output of the entire evaluation: a .md file with OTIF set-theory verification, a transparent deduction-based quality scoring model, NC root cause analysis, and a supplier-facing executive summary. The optimised prompt changed the quality severity weights (Critical ×2.0 became ×2.5, Major ×0.5 became ×0.7) and didn't produce a file. GPT caught something Opus missed: NC-003's production date fell in Q1 despite being reported in Q2.
What We Didn't Expect
1. The TCO Table That Vanished
In the RFP Analysis test with the generic prompt, Opus inferred that a proper RFP evaluation needs total cost of ownership analysis and built a full TCO normalisation table the prompt never asked for. It was exactly what a procurement director would want. With the optimised prompt, the structured format specified four output sections. Opus followed the structure faithfully and the TCO table disappeared. The prompt was too prescriptive.
2. The Thinking Time That Doubled
GPT's visible "thinking" time in the RFP Analysis test went from approximately 3 minutes (generic prompt) to approximately 6 minutes (optimised prompt). The outcome-first prompt didn't add complexity to the task. It restructured how the task was framed, and the model appears to have responded by spending more compute on reasoning, although a single run per prompt can't rule out ordinary run-to-run variation. For single queries this is fine. For batch workflows processing 50 supplier responses, doubling processing time matters.
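To put the batch concern in numbers, here is a back-of-envelope sketch. It assumes every response takes the thinking time we observed in our single run and that responses are processed one after another; both are simplifying assumptions, and real batches will vary.

```python
def batch_minutes(n_items: int, minutes_per_item: float) -> float:
    """Total wall-clock minutes for a sequential batch (assumes constant time per item)."""
    return n_items * minutes_per_item


N_RESPONSES = 50  # the batch size mentioned above

generic_total = batch_minutes(N_RESPONSES, 3)    # ~3 minutes per response (generic prompt)
optimised_total = batch_minutes(N_RESPONSES, 6)  # ~6 minutes per response (optimised prompt)

print(f"Generic:   {generic_total / 60:.1f} hours")    # Generic:   2.5 hours
print(f"Optimised: {optimised_total / 60:.1f} hours")  # Optimised: 5.0 hours
```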
3. The File That Stopped Appearing
Opus produced a .md file in the Supplier Scorecard test with the generic prompt but not with the optimised prompt. It produced files in Category Strategy (both prompts), Contract Redlining (optimised only), and Spend Analysis (optimised only). No clear pattern emerged for when the structured prompt triggered file generation and when it didn't. GPT never produced a file with either prompt across any use case. At least that was consistent.
4. The Format Swap
We covered this above, but it bears repeating: the models didn't just get better with optimised prompts. They also got more similar. GPT adopted Opus's table-first approach. Opus adopted GPT's narrative-explanation approach. If you're choosing a model for its "style," that style is partially a function of how you prompt it.
5. Copy-Paste Contamination
In the Category Strategy test, GPT's optimised-prompt response included phrasing that echoed the prompt itself. More detailed, more structured prompts create more surface area for the model to inadvertently recycle prompt language into its output. When your prompt contains specific phrases like "preferred supplier panel with 10–15 pre-qualified partners," the model may parrot that framing back rather than generating its own analysis.
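One rough way to spot this kind of echo is to look for word sequences that appear in both the prompt and the response. The sketch below is an illustration only: the function names are ours, the 4-word window is an arbitrary threshold, and the two example sentences are made up around the phrase quoted above.

```python
import re


def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Lower-case the text, split it into alphanumeric words, and return all n-word runs."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def echoed_phrases(prompt: str, response: str, n: int = 4) -> list[str]:
    """List the n-word sequences that occur in both the prompt and the response."""
    shared = ngrams(prompt, n) & ngrams(response, n)
    return sorted(" ".join(gram) for gram in shared)


# Hypothetical prompt and response built around the phrase from the example above.
prompt = "Recommend a preferred supplier panel with 10–15 pre-qualified partners."
response = "We propose a preferred supplier panel with 10–15 pre-qualified partners by Q3."

echoes = echoed_phrases(prompt, response)
for phrase in echoes:
    print(phrase)
```

A long list of shared sequences does not prove the model skipped the analysis, but it is a useful prompt to reread that section with a critical eye.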
How to Prompt AI for Procurement: Practical Takeaways
If you use GPT: investing 10 minutes in prompt structure is worth it. GPT gained 2.5× more from optimised prompts than Opus did, so prompt engineering has higher ROI if you're on the OpenAI platform. Specifically: lead with the outcome you want, number your deliverables explicitly, and ask for external citations by name (e.g., "cite industry benchmarks with sources"). GPT has access to research it won't volunteer unless you ask. For a step-by-step guide on implementing these techniques, see our practical guide to implementing AI in procurement.
If you use Claude Opus: focus on verification instructions rather than output format. Opus's biggest improvements came from self-check sections, not format prescriptions. Asking Opus to "verify your calculations and document any revisions" triggers audit-trail behaviour. But avoid over-specifying the output structure. Opus's best unrequested additions (the TCO table, the set-theory data verification) appeared when it had format flexibility.
For either model: match prompt intensity to task ceiling. If a task is already scoring 95%+ with a generic prompt (like Opus on category strategy), optimisation may not help and can hurt. Save your structured prompts for tasks where the model's default output needs improvement, especially format-dependent tasks like contract redlining or supplier QBR packs. Not sure where to start? Our AI Readiness Assessment identifies which procurement tasks will benefit most.
Quick-Reference: What Worked
| Technique | Works For | Evidence |
|---|---|---|
| Self-check / verification instruction | Claude Opus | Triggered OTIF set-theory catch (Supplier Scorecard), documented score revisions (RFP Analysis), data verification table (Spend Analysis) |
| Outcome-first framing | GPT | Doubled thinking time (RFP Analysis), improved format quality across all use cases |
| Explicit citation requests | GPT | Citations went from 9 to 16 in Category Strategy; added NLW/BCIS/ONS data in Spend Analysis |
| Hierarchical section structure | Claude Opus | Precise section adherence; consistent format across all use cases |
| Numbered deliverables | GPT | Adopted structured table formats, produced supplier-ready outputs (Contract Redlining) |
| Strict output format specification | Use with caution | TCO table disappeared when format was over-prescribed (RFP Analysis, Opus) |
Prompting Best Practices for Procurement
We put together the prompting best practices from this experiment as two printable two-page guides: one for Claude, one for ChatGPT. Each covers the techniques that produced measurable improvement, with example prompts and task-by-task quick reference.
Best AI Model for Procurement: Diminishing Returns and Promptability
There's a pattern worth calling out. Opus started higher (185.5/200 with the generic prompt, or 92.8%) and improved less (+2.0). GPT started lower (176.0/200, or 88.0%) and improved more (+5.0). Two possible explanations:
Ceiling effect: Opus's defaults are already strong for structured analytical procurement work. There's less room for prompt-driven improvement when the baseline is already 92%+.
Promptability: GPT may be more responsive to how you talk to it, more elastic in its output format, and more willing to shift behaviour based on prompt structure. The gap narrowing from 9.5 to 6.5 points supports this: GPT is more "promptable."
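One way to look at the two explanations together is to ask how much of its remaining headroom each model captured. The sketch below uses the totals reported above; "headroom captured" is an illustrative metric we are introducing here, not part of the original scoring, and with one run per cell it should be read as indicative only.

```python
MAX_TOTAL = 200.0  # five use cases x 40 points


def headroom_captured(generic: float, optimised: float, max_total: float = MAX_TOTAL) -> float:
    """Share of the gap between the generic score and the maximum that optimisation closed."""
    headroom = max_total - generic
    return (optimised - generic) / headroom


opus_share = headroom_captured(185.5, 187.5)  # 2.0 gained out of 14.5 available
gpt_share = headroom_captured(176.0, 181.0)   # 5.0 gained out of 24.0 available

print(f"Opus: {opus_share:.1%}, GPT: {gpt_share:.1%}")  # Opus: 13.8%, GPT: 20.8%
```

On these numbers GPT still captured a larger share after adjusting for its lower starting point, which is at least compatible with the promptability reading.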
For procurement teams choosing between platforms, this is worth thinking about. If you're willing to invest in prompt templates, GPT's ceiling may be closer to Opus's than the defaults suggest. If you want strong output from day one without prompt engineering, Opus's defaults give you a higher floor. We've explored this trade-off in more depth in our guide on AI procurement consulting vs. software.
Limitations
Here is what this experiment can and can't tell you:
These two models only. This review applies specifically to Claude Opus 4.7 and GPT-5.5. Other models, including future versions of these same models, could respond to prompt optimisation very differently. Don't assume the ratios or techniques transfer without testing.
Single run. We ran each prompt style once per model per use case. LLM outputs vary between runs. A rigorous study would run each test multiple times and average results. Our data shows one data point per cell, not a distribution.
Browser-based. Both models were tested via their web interfaces (claude.ai, chatgpt.com), not via API. Opus could only use "adaptive" compute because the "High" setting wasn't available in browser. With higher compute, Opus's improvement might have been larger.
Five use cases. Procurement has hundreds of task types. Our five are representative of analytical work but don't cover negotiation simulation, market intelligence research, or operational procurement workflows. For a broader view of where AI delivers in procurement, see our 12 AI use cases in procurement that actually work.
Prompt optimisation is a spectrum. Our "optimised" prompts represent one approach to prompt engineering. Different structuring choices, or the same techniques applied differently, might produce different results.
The Bottom Line
Prompt engineering is measurable. Across five procurement tasks, restructuring how we asked the question improved output quality by 1 to 3 percent, with GPT benefiting two-and-a-half times more than Opus. Ten minutes of prompt structuring gets you better formats, more citations, and explicit verification. The trade-off is that over-structuring can suppress valuable model instincts. For any procurement team using AI regularly, building a small library of model-specific prompt templates is one of the highest-ROI investments you can make.
Scoring Dimensions (0–10 each)
| Dimension | What It Measures |
|---|---|
| Accuracy & Completeness | Factual correctness, coverage of all prompt requirements, depth of analysis relative to source data |
| Self-Consistency | Internal coherence: do the numbers, conclusions, and recommendations align with each other throughout? |
| Output Quality | Readability, actionability, professional format. Could you send this to a stakeholder without rework? |
| Instruction-Following | Did the model do what was asked? Did it cover all sections, produce requested deliverables, stay within scope? |
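For teams that want to record scores against the same rubric, a small data structure is enough. This is a minimal sketch with hypothetical names; the example values are illustrative and are not scores from the evaluation.

```python
from dataclasses import dataclass

MAX_PER_DIMENSION = 10
MAX_PER_OUTPUT = 4 * MAX_PER_DIMENSION          # 40 points per scored output
MAX_PER_MODEL_PER_PROMPT = 5 * MAX_PER_OUTPUT   # 200 points across five use cases


@dataclass(frozen=True)
class RubricScore:
    """One scored output: four dimensions, each on a 0-10 scale."""
    accuracy_completeness: float
    self_consistency: float
    output_quality: float
    instruction_following: float

    def __post_init__(self) -> None:
        # Reject any dimension outside the rubric's 0-10 range.
        for name, value in vars(self).items():
            if not 0 <= value <= MAX_PER_DIMENSION:
                raise ValueError(f"{name} must be between 0 and {MAX_PER_DIMENSION}, got {value}")

    @property
    def total(self) -> float:
        """Sum of the four dimensions, out of 40."""
        return sum(vars(self).values())


# Illustrative values only, not taken from the evaluation.
example = RubricScore(9, 8.5, 9, 9.5)
print(example.total, "/", MAX_PER_OUTPUT)  # 36.0 / 40
```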
Test Environment
Interfaces: claude.ai (Opus 4.7, adaptive compute only) and chatgpt.com (GPT-5.5).
Simultaneous submission: Mode A (generic) prompts submitted to both models at approximately the same time.
No API: Deliberate. Browser UX, file generation, and follow-up behaviour are part of the evaluation. Desktop tooling excluded.
Evaluator: Senior procurement professional with category management and supplier management experience.
Mode A Prompt Design
Vendor-neutral, conversational. Written as you'd write to a knowledgeable colleague: "Please analyse these four RFP responses and score them on Technical (30%), Commercial (25%), Compliance (25%), and Delivery Risk (20%). Rank them and provide a recommendation." No formatting instructions, no output structure specification, no verification requests.
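The weighting in that example prompt is a standard weighted sum, which is worth checking by hand whenever a model returns a ranking. The sketch below uses the four weights from the prompt; the supplier names and their 0–10 criterion scores are hypothetical.

```python
import math

# Criterion weights from the example prompt above.
WEIGHTS = {"Technical": 0.30, "Commercial": 0.25, "Compliance": 0.25, "Delivery Risk": 0.20}
assert math.isclose(sum(WEIGHTS.values()), 1.0), "weights should sum to 100%"


def weighted_score(scores: dict[str, float]) -> float:
    """Combine per-criterion scores into a single weighted score."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the weighted criteria")
    return sum(WEIGHTS[criterion] * scores[criterion] for criterion in WEIGHTS)


def rank_suppliers(suppliers: dict[str, dict[str, float]]) -> list[str]:
    """Return supplier names ordered from highest to lowest weighted score."""
    return sorted(suppliers, key=lambda name: weighted_score(suppliers[name]), reverse=True)


# Hypothetical suppliers scored 0-10 on each criterion.
SUPPLIERS = {
    "Supplier A": {"Technical": 8, "Commercial": 6, "Compliance": 9, "Delivery Risk": 7},
    "Supplier B": {"Technical": 7, "Commercial": 9, "Compliance": 8, "Delivery Risk": 8},
}

for name in rank_suppliers(SUPPLIERS):
    print(f"{name}: {weighted_score(SUPPLIERS[name]):.2f}")
```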
Mode B Prompt Design: Opus
Hierarchical structure with clear sections: role framing (you are a senior procurement analyst), data section (the source material), task specification (numbered deliverables), output format (exact section order), and a self-check instruction asking the model to verify calculations and flag any inconsistencies before finalising. Structure used clear separators and hierarchy.
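As a rough illustration of that layout, the sketch below assembles a prompt from the five sections in the order described. It is not the prompt used in the experiment: the function name, the `=== NAME ===` separators, and the deliverable wording are all assumptions made for the example.

```python
def build_structured_prompt(
    role: str,
    source_data: str,
    deliverables: list[str],
    output_sections: list[str],
    self_check: str,
) -> str:
    """Assemble a hierarchical prompt: role, data, task, output format, self-check."""
    task_lines = "\n".join(f"{i}. {item}" for i, item in enumerate(deliverables, start=1))
    format_line = " → ".join(output_sections)
    sections = [
        ("ROLE", role),
        ("DATA", source_data),
        ("TASK", task_lines),
        ("OUTPUT FORMAT", f"Use this section order: {format_line}"),
        ("SELF-CHECK", self_check),
    ]
    # The "=== NAME ===" separators are an arbitrary choice for this sketch.
    return "\n\n".join(f"=== {name} ===\n{body}" for name, body in sections)


structured_prompt = build_structured_prompt(
    role="You are a senior procurement analyst.",
    source_data="[paste the four RFP responses here]",
    deliverables=[
        "Score each supplier against the weighted criteria",
        "Rank the suppliers",
        "Recommend a shortlist",
    ],
    output_sections=["Score table", "Rationale", "Risk flags", "Recommendation"],
    self_check="Before finalising, verify your calculations and flag any inconsistencies.",
)
print(structured_prompt)
```

Given what happened to the TCO table, it may be worth loosening the output-format section (for example, allowing an "additional analysis" section) while keeping the self-check.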
Mode B Prompt Design: GPT
Outcome-first: opened with what success looks like ("A complete evaluation pack ready for the sourcing committee"). Numbered deliverables (1. Scoring table, 2. Supplier rationale, 3. Risk flags, 4. Recommendation). Explicit citation request ("support recommendations with external benchmarks or industry data where available"). Conversational tone maintained but with structural clarity.
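The outcome-first pattern can be sketched the same way. Again, this is an illustration rather than the prompt used in the experiment: the function name, the connecting phrases ("Outcome I need", "Please produce"), and the placeholder source material are assumptions, while the outcome, deliverables, and citation wording come from the description above.

```python
def build_outcome_first_prompt(
    outcome: str,
    deliverables: list[str],
    citation_request: str,
    source_data: str,
) -> str:
    """Assemble a prompt that leads with the outcome, then numbered deliverables and a citation ask."""
    numbered = "\n".join(f"{i}. {item}" for i, item in enumerate(deliverables, start=1))
    return (
        f"Outcome I need: {outcome}\n\n"
        f"Please produce:\n{numbered}\n\n"
        f"{citation_request}\n\n"
        f"Source material:\n{source_data}"
    )


outcome_prompt = build_outcome_first_prompt(
    outcome="A complete evaluation pack ready for the sourcing committee.",
    deliverables=["Scoring table", "Supplier rationale", "Risk flags", "Recommendation"],
    citation_request=(
        "Support recommendations with external benchmarks or industry data where available."
    ),
    source_data="[paste the four RFP responses here]",
)
print(outcome_prompt)
```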
Controls
Same source documents (synthetic procurement data created for this evaluation). Same scoring rubric applied to all outputs. Scoring completed immediately after reviewing each output, with no retroactive adjustment. Both models tested in the same session on the same day. No cherry-picking of runs. First output used for scoring.
We build model-specific prompt libraries like these for procurement teams. If you're deploying AI across procurement and want structured prompts that get measurably better output from day one, we can help. Our AI training for procurement teams includes hands-on prompt engineering workshops tailored to your use cases.
Not sure where to start? Take our AI Readiness Assessment or speak to our consulting team.
Molecule One is a procurement consultancy that uses AI to make category management, sourcing, and supplier development faster and sharper. We test these tools on real work so our clients don't have to.