Anthropic's Project Deal: when both sides of a sale are AI, your data does the negotiating
April 26, 2026 · 10 min read

Anthropic ran a marketplace where Claude agents bought and sold real goods on behalf of employees. The headline isn't that it worked — it's that buyers using weaker models lost money and didn't notice. Here's what merchants should take from it.

ai-commerce · ai-agents · strategy

Most of the AI commerce conversation so far has been about one side of the deal: how does a brand show up when a shopper asks ChatGPT for a recommendation? The unspoken assumption is that the shopper is human, the brand is human, and the AI is a referral channel sitting between them.

Anthropic just ran an experiment that breaks that assumption.

In December 2025, Anthropic stood up an internal classifieds marketplace called Project Deal. 69 employees got $100 budgets and listed personal belongings for sale to each other. Crucially, no human ever negotiated. Each participant was represented by a Claude agent that listed items, found matches, made offers, fielded counteroffers, and closed deals — all in natural language, in Slack, with no prebaked negotiation protocol. 186 deals closed, totaling more than $4,000 in real money for real goods.

TechCrunch covered the experiment on April 25. The headline finding most outlets ran with — that the experiment “worked” — is the least interesting thing about it. The interesting findings are the ones that should change how brands think about their product data.

Four runs, two model tiers, one deliberate mismatch

Anthropic ran four parallel marketplaces simultaneously, all populated by the same employees:

  • Runs A and D — every agent powered by Claude Opus 4.5 (the frontier model)
  • Runs B and C — 50/50 mix of Opus 4.5 and Haiku 4.5 (the smaller, faster, cheaper model)
  • One of the four runs was “real” — its deals produced actual exchanges of goods and money. Participants didn’t know which run that was.

The mixed runs are the load-bearing part of the design. They let Anthropic compare Opus-versus-Opus negotiations against Opus-versus-Haiku negotiations in the same marketplace, with the same listings and the same humans behind them.

The results, per Anthropic’s writeup:

  • Opus completed about two more deals per participant than Haiku.
  • Opus extracted $3.64 more per item on average.
  • A broken bike sold for $38 through a Haiku agent and $65 through an Opus agent. A lab-grown ruby sold for $35 vs. $65. Same item, same marketplace, different agents.
  • Opus sellers averaged a $2.68 price premium. Opus buyers paid $2.45 less.
  • Aggressive negotiation prompts had no statistically significant effect. Model capability mattered. Instructions didn’t.

That last finding is the one to hold onto. The humans weren’t able to instruct their way out of an agent-quality gap. The model did what the model was capable of, and the prompt was decoration.

The finding nobody noticed

Here’s the part of the experiment that stops being a curiosity and starts being a policy problem.

Participants rated marketplace fairness at roughly 4.06 out of 7 when their agent was Opus, and 4.05 out of 7 when their agent was Haiku — statistically indistinguishable. In one of Anthropic’s surveys, 11 of 28 participants actually ranked their Haiku outcomes higher than their Opus outcomes despite the measurable gap.

Anthropic’s own framing, in their writeup: the risk is “agent quality gaps” where “people on the losing end might not realize they’re worse off.”

In an agent-mediated market, the loser doesn’t see the loss. The deal closes. The price feels reasonable. The other side felt reasonable too. The $30 left on the table is invisible to everyone except the model that left it there.

Why this is a merchant problem, not just a research problem

Project Deal was a peer-to-peer marketplace with no structured catalog. Each agent had to figure out, from a human-written description, what a thing was and what it was worth. That’s the agent commerce setting where structured data matters most — and where most merchants today have the least leverage.

In real ecommerce, the geometry is different. The merchant controls one side of the conversation: the catalog data the buyer agent ingests. The merchant doesn’t control which model the buyer brings. They might be on Claude Opus. They might be on a much smaller model running locally. They might be on a free-tier assistant whose first instinct is to confabulate.

Project Deal proves that the gap between a frontier-tier and a commodity-tier agent shows up in dollars, not in user experience. Your buyer’s agent makes worse decisions on their behalf, the buyer rates the experience as fine, and you — the merchant — eat the consequences silently:

  • The smaller agent misreads your title and confidently recommends a competitor.
  • The smaller agent picks the wrong variant (“size 9” when the shopper said “size 9 wide”).
  • The smaller agent hallucinates a feature you don’t have, and when the product arrives without it, you get the return.
  • The smaller agent quotes a stale price or stale stock level, the shopper places the order, and your support team eats the gap.

In every one of those scenarios, the buyer doesn’t know they got a bad outcome. The merchant pays for it.

The agent-on-agent flow, drawn out

Shopper intent
  → buyer agent (capability: unknown)
  → reads merchant data: title, attributes, description, identifier, price + stock
  → match decision, negotiated against the seller-side agent or storefront API
  → match → order placed
  → no match, or wrong variant → silent loss
The merchant has zero control over the buyer agent. Its capability is unknown and will change run to run, query to query, customer to customer. Everything on the merchant's side — title, attributes, description, identifier, price, stock — is the merchant's leverage. It's the only leverage.

The implication isn’t subtle: catalog data has to be unambiguous enough that even the weakest plausible agent gets it right. Optimizing for the frontier model is the wrong target. Frontier models will forgive thin titles, ambiguous variants, and flowery descriptions. Smaller models won’t, and a meaningful share of the buyer agents in the next two years are going to be smaller models.

What “unambiguous enough for a weak agent” actually looks like

The patterns we see in catalogs that survive multi-tier agent matching are boring on purpose:

  • Titles that lead with the disambiguators a smaller model can grab. Brand → product type → the two or three attributes that decide a match in your category → variant. Not adjectives. Not adverbs. A weaker model reading “Acme Pro Runner — Women’s — Wide (D) — 9 — Mesh — Neutral cushion” doesn’t have to reason about what kind of shoe this is. A weaker model reading “Acme’s Best-Selling Comfort Runner” has to reason about everything, and reasoning is exactly where smaller models fail.
  • Every relevant attribute populated as a structured field, not buried in prose. Material, size system, age group, gender, intended use, GTIN, MPN. These are the fields a smaller model can read directly. A field a smaller model has to extract from prose is a field a smaller model gets wrong.
  • Each variant individually addressable. A parent product with twenty variants invites the buyer agent to guess which one is in stock, which one is the right size, which one matches the shopper’s price. Weaker agents guess wrong.
  • Identifier hygiene that lets the buyer agent verify the product is real. GTIN/MPN absence is a confidence-killer for any agent, and frontier models forgive it more than smaller ones do.
  • Q&A pairs and usage scenarios populated. Project Deal showed agents reasoning about what something is for, not just what it is. “Marathon training” and “occasional treadmill” are different products to a buyer agent, even if they have identical specs.
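The title pattern in the first bullet is mechanical enough to generate rather than write by hand. A sketch, with an invented product record and invented field names, that emits brand, product type, deciding attributes, and variant details in that order:

```python
def agent_readable_title(p: dict) -> str:
    """Disambiguators first: brand and product type, then the two or
    three attributes that decide a match in the category, then variant
    details. No adjectives anywhere for a weak model to reason past."""
    parts = [f'{p["brand"]} {p["product_type"]}',
             *p["deciding_attributes"],
             *p["variant_attributes"]]
    return " — ".join(parts)

shoe = {
    "brand": "Acme",
    "product_type": "Pro Runner",
    "deciding_attributes": ["Women's", "Wide (D)", "9"],  # category-specific
    "variant_attributes": ["Mesh", "Neutral cushion"],
}
print(agent_readable_title(shoe))
# Acme Pro Runner — Women's — Wide (D) — 9 — Mesh — Neutral cushion
```

Because the title is derived from structured fields, it can never drift out of sync with them — the failure mode where the title promises one variant and the attributes describe another.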

None of this is new advice. What Project Deal changes is the urgency. The ceiling on how badly bad data gets misread isn’t set by the smartest model in the market — it’s set by the dumbest one that touches your catalog.

The second-order finding

There’s one more line from Anthropic’s writeup worth sitting with. Inside the experiment, agents that talked to other agents started developing odd behaviors: confabulating backstories, role-playing, spinning small narrative excuses for offers and counteroffers. These were emergent. Nobody trained them. Nobody asked for them.

In a real merchant context, this is what “optimizing for agent attention” eventually looks like. If buyer agents respond to certain framings, sellers will frame their data that way. If they respond to certain narrative hooks, sellers will add them. The loop already exists in human-facing copy — that’s why product descriptions are stuffed with “perfect for” and “ideal for” — and it’ll be louder, not quieter, when the audience is a model.

The honest reading of Project Deal is that the brands who win the agent-on-agent era are the ones with the most accurate, machine-readable, attribute-rich catalogs first — and the discipline to keep them that way as the temptation to game agent attention increases. Accuracy is the moat. Embellishment is the trap.

What to do this week

Concretely:

  1. Pull a sample of your top 100 products and read them as a smaller model would. Strip out marketing language. Ask: from the structured fields alone, can a model decide whether this product matches a specific buyer query? If the answer requires reading the description, that’s a brittle match.
  2. Audit attribute coverage against your category, not against the spec minimum. The optional fields in feed specs (material, intended use, audience, energy/efficiency class) are the ones smaller models lean on hardest, because they’re the ones least obscured by prose.
  3. Test your products in the cheap models, not just the expensive ones. Run real shopping queries through Claude Haiku, GPT-5 mini, and the free tiers of consumer assistants. The gap between how the frontier and commodity models describe your product is your buyer’s risk surface.
  4. Watch the protocol layer. Project Deal didn’t use ACP or UCP — it was Slack-native. But the structural insight (agent-on-agent, no human in the loop, merchant catalog as the only stable signal) generalizes to every agentic protocol shipping in 2026.
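Steps 1 and 2 reduce to a short audit script. A sketch in which the per-category field list and the sample record are placeholders for your own feed:

```python
# Coverage is judged against what the category needs, not against the
# feed spec's required-field minimum. Both the field list and the
# sample product below are hypothetical.
CATEGORY_FIELDS = {
    "footwear": ["gtin", "brand", "size", "size_system", "width",
                 "gender", "material", "intended_use"],
}

def coverage_report(product: dict) -> dict:
    required = CATEGORY_FIELDS[product["category"]]
    missing = [f for f in required if not product.get(f)]
    return {
        "id": product["id"],
        "coverage": 1 - len(missing) / len(required),
        "missing": missing,  # each entry is a field a weak agent would
                             # have to extract from prose, or guess
    }

sample = {"id": "sku-1", "category": "footwear", "gtin": "00012345678905",
          "brand": "Acme", "size": "9", "size_system": "US"}
report = coverage_report(sample)
print(f"{report['id']}: {report['coverage']:.0%} covered,"
      f" missing {report['missing']}")
# sku-1: 50% covered, missing ['width', 'gender', 'material', 'intended_use']
```

Run it over the top 100 products from step 1 and sort ascending by coverage; the bottom of that list is where a commodity-tier agent is most likely to misread you.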

The bottom line

Project Deal was small — 69 people, one week, one company — and Anthropic is upfront that its findings won’t generalize cleanly to the open market. It also wasn’t framed as a product launch. It was framed as a probe.

The probe came back with one finding the agent-commerce industry should not move past: in agent-mediated transactions, capability gaps translate directly into financial outcomes, and the side that loses doesn’t notice.

For brands, that’s a directive. The buyer agent isn’t your enemy and isn’t your customer — it’s a translator of unknown quality, working from your data, on behalf of someone who’ll never know whether the translation was accurate. The job is to make the translation hard to get wrong.

If you want to see how your product data reads to agents across the capability spectrum — frontier, mid-tier, and commodity — get in touch with Lumio. Lumio scores every product on the dimensions that decide whether the agent on the other side gets the deal right.