The Decision Point
When building a RAG ingestion pipeline, there's a moment where you choose: do you validate at the document level, or at the chunk level?
The answer changes everything downstream.
Documents Are Not Chunks
A document is a unit of meaning with metadata — an owner, a version, a source. A chunk is an artifact of processing.
Treating them as equivalent leads to:
- Deduplication failures (same document, multiple chunk versions)
- Metadata loss (who owns this piece of text?)
- Retrieval confusion (which version did this come from?)
The Boundary Rule
Validate at the document boundary. Enforce contracts on the chunk. Separate these concerns and your ingestion pipeline becomes auditable, recoverable, and debuggable.
A boundary you can name is a boundary you can test.
This sounds obvious. It wasn't, until I'd built three systems that got it wrong.