Lesson 1: Latency Is UX

Papers measure accuracy. Production measures latency. These are different problems with different solutions, and conflating them is how teams ship systems that work on benchmarks and fail in user sessions.

A 200ms retrieval that's 90% accurate is better than a 2000ms retrieval that's 95% accurate. The 5-point accuracy gap can be closed with better design. The 1800ms latency gap breaks the interaction model.
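One way to make this concrete is a latency budget: give the slower, more accurate retriever a fixed deadline and serve the faster result if it misses. The sketch below is a minimal illustration, not a production design. The names `retrieve_with_budget`, `fast_retriever`, and `slow_retriever` are hypothetical, and the two retrievers are stand-ins that simulate latency with `time.sleep`.

```python
import concurrent.futures
import time


def fast_retriever(query):
    # Hypothetical stand-in for a quick, less accurate retriever.
    return ["fast:" + query]


def slow_retriever(query):
    # Hypothetical stand-in for a slower, more accurate retriever.
    time.sleep(0.5)
    return ["slow:" + query]


def retrieve_with_budget(fast, slow, query, budget_s):
    """Return (results, source). Use `slow` only if it finishes within budget_s seconds."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)
    try:
        # Start both at once so the fast answer is ready if the slow one misses.
        slow_future = pool.submit(slow, query)
        fast_future = pool.submit(fast, query)
        try:
            return slow_future.result(timeout=budget_s), "slow"
        except concurrent.futures.TimeoutError:
            # Budget exceeded: serve the fast result instead of making the user wait.
            return fast_future.result(), "fast"
    finally:
        # Do not block the caller on the abandoned slow call.
        pool.shutdown(wait=False)
```

With a 50ms budget, `retrieve_with_budget(fast_retriever, slow_retriever, "q", 0.05)` falls back to the fast retriever. The budget itself is a product decision; this sketch only shows where it would be enforced.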

Lesson 2: The Second Failure Is the One That Matters

First failures are expected. Systems are tested for them. The second failure, the one that happens when your recovery path itself fails, is where systems die.

Build for second failures. Test your fallbacks. Assume your fallbacks will also fail.
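A minimal sketch of what that can look like: a fallback chain that assumes every handler may fail, ends in a static last resort that cannot raise, and records each failure along the way. The names `answer_with_fallbacks` and `broken_handler` are hypothetical, chosen only for illustration.

```python
def broken_handler(query):
    # Hypothetical handler that always fails, used to simulate an outage.
    raise RuntimeError("backend unavailable")


def answer_with_fallbacks(query, handlers, last_resort):
    """Try each (name, handler) pair in order.

    Returns (answer, errors). If every handler raises, the static
    last_resort value is returned, so the caller always gets an answer.
    """
    errors = []
    for name, handler in handlers:
        try:
            return handler(query), errors
        except Exception as exc:
            # Record the failure and move on to the next fallback.
            errors.append((name, repr(exc)))
    return last_resort, errors
```

The test worth writing is the one where the fallback fails too:

```python
answer, errors = answer_with_fallbacks(
    "q",
    [("primary", broken_handler), ("cache", broken_handler)],
    last_resort="Sorry, something went wrong. Please try again.",
)
assert answer.startswith("Sorry")
assert [name for name, _ in errors] == ["primary", "cache"]
```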

Lesson 3: Users Will Tell You What's Wrong If You Let Them

Every production AI system I've shipped has been improved more by user signals than by eval sets. Not because evals are bad, but because users encounter the actual distribution of inputs.

Build a feedback surface on day one. Not a star rating. A specific, low-friction signal: "this was wrong" + a text box.
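As a rough sketch, the backend for that signal can be very small: one record per "this was wrong" click, appended as a JSON line. The `Feedback` fields and the `record_wrong` function below are assumptions made for illustration, not a prescribed schema.

```python
import json
import time
from dataclasses import asdict, dataclass


@dataclass
class Feedback:
    # Hypothetical schema: enough to reproduce the failing interaction later.
    request_id: str
    query: str
    response: str
    comment: str
    timestamp: float


def record_wrong(sink, request_id, query, response, comment=""):
    """Append one 'this was wrong' report to `sink` as a JSON line."""
    fb = Feedback(request_id, query, response, comment.strip(), time.time())
    sink.write(json.dumps(asdict(fb)) + "\n")
    return fb
```

Here `sink` is any writable text stream, such as a file opened in append mode. The comment is optional so that a single click with an empty text box still counts as a signal.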