I keep getting the same call. A team has shipped an LLM feature, the demo went well, and three months in something is off. Customer support gets the same complaint twice a week. The CEO asks if it's the model. The team retunes the prompt, swaps in a larger model, runs the bench, and ships again. Nothing meaningful changes.
Almost every time, the model is fine. The model has always been fine. What's broken is one of three seams that nobody owns.
The model is the smallest part of the system you ship.
§1 — The retrieval seam
The first seam is between your retrieval layer and the rest of your stack. Most teams treat retrieval as a tuning problem — chunk size, embedding model, reranker. It is not a tuning problem. It is a product surface.
Consider a support assistant grounded on a customer's documents. In staging it works because your test corpus is well-shaped: consistent headings, no duplicates, one version of each document. In production the customer uploads sixteen versions of the same contract, three of them named some variant of final_FINAL_v2.docx, and your top-k retrieval returns four of those sixteen.
# the retrieval seam, sketched

      user query
          │
          ▼
    ┌────────────┐    ┌──────────────────┐
    │ retriever  │ ─▶ │ context selector │ ─▶ model
    └────────────┘    └──────────────────┘
          ▲                    ▲
          │                    │
    (your customer's      (you, at 2am)
      actual docs)

The fix is rarely better retrieval. It is exposing retrieval as a surface the user can see and correct. Showing which document was used. Letting them pin or strike one. Treating "sources" as a first-class UI element rather than a footnote.
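To make "pin or strike" concrete, here is a minimal sketch of a context selector that honors the user's corrections and reports which documents it used. Everything in it is an assumption for illustration: the names `Chunk`, `SourcePrefs`, `select_context`, and `sources_used` are hypothetical, and the scores stand in for whatever your retriever returns.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Chunk:
    doc_id: str   # which uploaded document the chunk came from
    text: str
    score: float  # retriever relevance score, higher is better (assumed)


@dataclass
class SourcePrefs:
    pinned: set[str] = field(default_factory=set)  # doc_ids the user pinned
    struck: set[str] = field(default_factory=set)  # doc_ids the user struck


def select_context(candidates: list[Chunk], prefs: SourcePrefs, k: int = 4) -> list[Chunk]:
    """Pick up to k chunks: drop struck docs, put pinned docs first, then by score."""
    allowed = [c for c in candidates if c.doc_id not in prefs.struck]
    # Sort key: pinned chunks first (False sorts before True), then descending score.
    allowed.sort(key=lambda c: (c.doc_id not in prefs.pinned, -c.score))
    return allowed[:k]


def sources_used(selected: list[Chunk]) -> list[str]:
    """Ordered, de-duplicated doc_ids to render next to the answer."""
    return list(dict.fromkeys(c.doc_id for c in selected))


# Hypothetical candidates, in the spirit of the sixteen-contracts example.
candidates = [
    Chunk("contract_v3.docx", "...", 0.91),
    Chunk("contract_final_FINAL_v2.docx", "...", 0.90),
    Chunk("contract_v1.docx", "...", 0.88),
    Chunk("pricing.pdf", "...", 0.40),
]
prefs = SourcePrefs(
    pinned={"contract_final_FINAL_v2.docx"},
    struck={"contract_v1.docx"},
)
picked = select_context(candidates, prefs, k=2)
print(sources_used(picked))
# ['contract_final_FINAL_v2.docx', 'contract_v3.docx']
```

The ranking logic is the least interesting part. What matters is that `prefs` is written by the user, not by you, and that the list from `sources_used` is rendered in the product rather than buried in a log.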
§2 — The eval seam
The second seam is between your eval set and reality. Eval sets rot. They rot because the world moves, because your customers change, and most of all because your team gets better at gaming them.
- Half your eval set will be obsolete in nine months.
- The other half is wrong in a way you haven't noticed.
- The team that built it has moved on to other projects.
The discipline I keep coming back to is continuous re-grounding: every quarter, you sample 200 real production traces, hand-label them, and treat that as the new north-star bench. Old benches stay around for regression — but they stop being the score that ships features.
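The quarterly draw can be sketched in a few lines. This is an illustration under stated assumptions, not a prescribed pipeline: `sample_for_labeling` and `assign_roles` are hypothetical names, trace ids are assumed to be strings, and quarter labels like "2025Q3" are assumed to sort chronologically.

```python
from __future__ import annotations

import random


def sample_for_labeling(trace_ids: list[str], n: int = 200, seed: int = 0) -> list[str]:
    """Draw a uniform random sample of production trace ids for hand-labeling.

    A fixed seed makes the draw reproducible, so every labeler sees the same set.
    """
    rng = random.Random(seed)
    unique = sorted(set(trace_ids))  # de-duplicate and fix the order before sampling
    return rng.sample(unique, min(n, len(unique)))


def assign_roles(quarters: list[str]) -> dict[str, str]:
    """Newest quarter's bench gates shipping; older benches are regression-only."""
    ordered = sorted(quarters)  # assumes labels that sort chronologically
    return {q: ("north_star" if q == ordered[-1] else "regression") for q in ordered}


# Hypothetical usage: 5,000 production traces, one labeled batch per quarter.
traces = [f"trace-{i}" for i in range(5000)]
batch = sample_for_labeling(traces, n=200, seed=2025)
roles = assign_roles(["2025Q1", "2025Q2", "2025Q3"])
print(len(batch), roles["2025Q3"], roles["2025Q1"])
# 200 north_star regression
```

Uniform sampling is the simplest choice that works. If a handful of customers dominate your traffic, you may prefer to stratify by customer, but that is a separate decision from the one this essay argues for.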
§3 — The human escape hatch
The third seam is the one nobody wants to design. What happens when the model is wrong, and the user knows it's wrong, and there is no one to escalate to?
I've seen this kill three otherwise good products. The interaction model assumed correctness. There was no "this is wrong" button, or there was, and clicking it dropped the feedback into a queue nobody read. By month four, users had learned to bypass the AI surface entirely.
Build the escape hatch on day one. Wire it to a person, not a queue. More on this in essay 020.
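One way to read "a person, not a queue" in code: every report carries a named owner, and the system refuses to accept a report when nobody is on the rota. The sketch below is a hedged illustration. `WrongAnswerReport`, `NoOwnerError`, `route_report`, and the rota policy are all hypothetical, and delivery (page, email, chat) is deliberately left out.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class WrongAnswerReport:
    trace_id: str
    user_note: str
    owner: str  # a named person, never a shared inbox
    received_at: datetime


class NoOwnerError(RuntimeError):
    """Raised when the rota is empty: fail loudly instead of filing silently."""


def route_report(trace_id: str, user_note: str, rota: list[str]) -> WrongAnswerReport:
    """Attach a named human owner to a "this is wrong" report."""
    if not rota:
        raise NoOwnerError("no human owner configured for 'this is wrong' reports")
    # Hypothetical policy: spread reports deterministically across the rota.
    owner = rota[sum(trace_id.encode()) % len(rota)]
    return WrongAnswerReport(trace_id, user_note, owner, datetime.now(timezone.utc))


# Hypothetical usage with a two-person rota.
rota = ["asha", "ben"]
report = route_report("t-1", "cited the wrong contract", rota)
print(report.owner in rota)
# True
```

The design choice worth copying is the exception. An empty rota is a deployment error you want to hear about on day one, not a state in which feedback quietly accumulates.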
§4 — What this means for hiring
If your AI feature lives or dies on three seams that aren't the model, your hiring should reflect that. A team of pure ML researchers will build the best model in the world and ship a bad product. A team of product engineers with a competent ML generalist will ship a product that improves quarter over quarter.
§5 — Notes & sources
Drawn from advisory engagements across six companies (2023–2026). Names redacted; patterns retained. Thanks to M. Iyer, D. Park, and the staff engineer at a bank-I-cannot-name for the long whiteboard sessions that produced §1.