Why Off-the-Shelf Chunkers Silently Break Internal RAG
Default chunkers optimize for token budget—not organizational knowledge. That's a fundamental mismatch.
Your RAG system looks fine in the demo. It answers questions about your internal docs. Stakeholders are happy.
Six weeks later, the coding assistant suggests deprecated patterns. The compliance bot misses exceptions. The onboarding agent gives contradictory answers about the same policy.
You tune prompts. Swap rerankers. Increase context windows. Nothing helps—because the problem isn't downstream. It's the chunker.
The uncomfortable truth
Generic text splitters were designed for Wikipedia and blog posts. They optimize for token budget and overlap—not for policy, intent, or semantic atomicity. When your internal docs prescribe how things should be done, embedding random slices destroys meaning.
What Off-the-Shelf Chunkers Actually Optimize For
Popular frameworks ship with reasonable defaults. Those defaults make three assumptions:
- Text is descriptive. The chunker assumes you're indexing content that describes things—encyclopedia entries, blog posts, documentation that explains concepts. It has no concept of prescriptive content: rules, policies, constraints, requirements.
- Any chunk is retrievable in isolation. The splitter assumes that if you hit a token limit, you can cut anywhere and both halves remain useful. For public corpora, this is often true. For internal standards, it's catastrophic.
- Overlap handles context. The theory: if you overlap chunks by 10-20%, you preserve continuity. The reality: overlap preserves text continuity, not semantic continuity. A policy and its exception can land in different chunks even with aggressive overlap.
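To see what these assumptions do in practice, here is a minimal sketch of a default-style splitter: break on sentence boundaries, then pack sentences into a fixed character budget. The budget is illustrative; real defaults count tokens, but the behavior is the same.

```python
import re

# Minimal stand-in for a default splitter: break on sentence boundaries,
# then pack sentences into chunks up to a fixed character budget.
def naive_split(text: str, budget: int = 80) -> list[str]:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > budget:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

policy = (
    "All database columns must use snake_case naming. "
    "Exception: Legacy tables in the billing_v1 schema may retain their "
    "original camelCase naming until the Q3 migration is complete."
)

for i, chunk in enumerate(naive_split(policy), start=1):
    print(f"Chunk {i}: {chunk}")
# Chunk 1: the rule. Chunk 2: the exception. Nothing in the output records
# that the second chunk qualifies the first.
```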
Why This Matters for Internal Codebases
Internal documentation isn't Wikipedia. It's normative: it tells people what they should do, what's required, and what exceptions exist.
Consider what's actually in your internal knowledge base:
| Content Type | What It Encodes | What Generic Chunking Destroys |
|---|---|---|
| Architecture standards | Required patterns + exceptions | Rule/exception relationships |
| Security policies | Constraints + approval paths | Constraint completeness |
| API documentation | Endpoint + auth + schema + examples | Endpoint coherence |
| Config schemas | Keys + defaults + constraints | Key/value relationships |
| Migration guides | Before/after + breaking changes | Migration context |
| Anti-patterns | What NOT to do + correct alternative | Problem/solution pairing |
Generic chunkers have no awareness of these structures. They see text. They count tokens. They cut.
The Failure Modes (With Examples)
Failure Mode 1: Splitting rules from exceptions
Original policy:
"All database columns must use snake_case naming.
Exception: Legacy tables in the `billing_v1` schema
may retain their original camelCase naming until
the Q3 migration is complete."
What the chunker produces:
Chunk 1: "All database columns must use snake_case naming."
Chunk 2: "Exception: Legacy tables in the billing_v1 schema..."
Query: "What naming convention should I use for billing tables?"
Retrieved: Chunk 1
Answer: "Use snake_case."
Correct answer: "It depends—check if it's in billing_v1."
Failure Mode 2: Breaking config file coherence
Original config:
database:
  host: ${DB_HOST}
  port: 5432
  pool_size: 20          # Required for production
  ssl_mode: verify-full  # Do NOT change
What the chunker produces:
Chunk 1: "database:\n host: ${DB_HOST}\n port: 5432"
Chunk 2: "pool_size: 20 # Required for production\n ssl_mode..."
Query: "What database config do I need for production?"
Retrieved: Chunk 1 (higher similarity to "database config")
Answer: Missing pool_size and ssl_mode—both critical.
Failure Mode 3: Separating anti-patterns from solutions
Original anti-pattern doc:
"DON'T: Use SELECT * in production queries.
This causes schema drift issues when columns are added.
DO: Explicitly list required columns.
SELECT id, name, email FROM users WHERE..."
What the chunker produces:
Chunk 1: "DON'T: Use SELECT * in production queries..."
Chunk 2: "DO: Explicitly list required columns..."
Query: "How should I write SQL queries?"
Retrieved: Either chunk, but not both
Answer: Missing either the anti-pattern or the solution.
Failure Mode 4: Losing temporal context
Original deprecation notice:
"The v1 API is deprecated as of 2024-01-15.
All new integrations must use v2.
Existing v1 integrations have until 2024-06-01 to migrate."
What the chunker produces:
Chunk 1: "The v1 API is deprecated as of 2024-01-15."
Chunk 2: "All new integrations must use v2. Existing..."
Query: "Can I use the v1 API?"
Retrieved: Chunk 1
Answer: "The v1 API is deprecated."
Correct answer: "It depends on the current date and on whether the integration is new or existing."
Why Tuning Parameters Doesn't Fix This
The instinct is to tune: smaller chunks, bigger chunks, more overlap, less overlap. None of it works, because this isn't a tuning problem. It's a modeling problem.
The core issue: Generic chunkers optimize for token distribution. Your internal docs encode semantic relationships that don't map to token boundaries.
- Smaller chunks → more fragmentation, more context loss
- Bigger chunks → fewer chunks retrieved, key details buried
- More overlap → duplicated text, wasted context window
- Less overlap → harder boundaries, same fragmentation
You can't tune your way out of a fundamental mismatch between the chunker's assumptions and your content's structure.
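A tiny illustration of why (the policy and the settings are made up): a fixed-size splitter places its boundaries by character arithmetic alone, so no parameter setting ever takes the rule/exception boundary into account.

```python
policy = (
    "All database columns must use snake_case naming. "
    "Exception: Legacy tables in the billing_v1 schema may retain their "
    "original camelCase naming until the Q3 migration is complete."
)
# The one boundary that actually matters semantically:
exception_at = policy.index("Exception:")

for chunk_size, overlap in [(64, 0), (64, 16), (128, 32), (256, 64)]:
    step = chunk_size - overlap
    starts = list(range(0, len(policy), step))
    print(f"size={chunk_size:>3} overlap={overlap:>2} -> chunks start at {starts}")
print(f"the exception begins at character {exception_at}")
# Every setting places chunk boundaries by arithmetic on character counts;
# none of them is derived from where the exception begins. Whether the rule
# and its exception co-occur in a chunk is an accident of document length.
```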
What to Do Instead
The fix isn't better parameters. It's treating chunking as knowledge modeling—not text processing.
1. Define semantic boundaries
What are the natural units of knowledge in your corpus? Policies. API endpoints. Config schemas. Anti-patterns. Each of these is a semantic unit that should stay together.
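One way to make those units explicit is a small schema that your extractors and tests can share. The type names here are illustrative, not prescriptive:

```python
from dataclasses import dataclass, field
from enum import Enum

# Illustrative unit types; use whatever taxonomy matches your corpus.
class UnitType(Enum):
    POLICY = "policy"                # rule + exceptions + effective dates
    API_ENDPOINT = "api_endpoint"    # endpoint + auth + schema + example
    CONFIG_SCHEMA = "config_schema"  # keys + defaults + constraints
    ANTI_PATTERN = "anti_pattern"    # what not to do + correct alternative
    MIGRATION = "migration"          # before/after + breaking changes

@dataclass
class SemanticUnit:
    unit_type: UnitType
    text: str                        # the full unit, never split mid-rule
    source_path: str                 # where it came from
    metadata: dict = field(default_factory=dict)
```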
2. Build type-aware extractors
Instead of one generic splitter, build extractors for each content type. An API doc extractor keeps endpoint + auth + schema + example together. A policy extractor keeps rule + exception + effective date together.
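A sketch of a policy extractor, assuming a hypothetical format where each policy is a paragraph optionally followed by `Exception:` paragraphs; adapt the parsing to however your docs are actually written:

```python
import re

# Hypothetical format: each policy is a blank-line-separated block; a block
# that starts with "Exception:" belongs to the policy block before it.
def extract_policy_units(doc_text: str) -> list[dict]:
    blocks = [b.strip() for b in re.split(r"\n\s*\n", doc_text) if b.strip()]
    units = []
    for block in blocks:
        if block.startswith("Exception:") and units:
            # Keep the exception in the same unit as its rule.
            units[-1]["text"] += "\n" + block
            units[-1]["has_exception"] = True
        else:
            units.append({"type": "policy", "text": block, "has_exception": False})
    return units

doc = """All database columns must use snake_case naming.

Exception: Legacy tables in the billing_v1 schema may retain their
original camelCase naming until the Q3 migration is complete.

All services must emit structured JSON logs."""

for unit in extract_policy_units(doc):
    print(unit["has_exception"], "|", unit["text"].splitlines()[0])
```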
3. Preserve relationships explicitly
When you must split, preserve parent-child relationships in metadata. Tag chunks with `parent_chunk_id`, `relationship_type`, `retrieve_with_parent`.
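When a unit is genuinely too large to embed whole, here is a sketch of what that metadata and a retrieval-time expansion step could look like (the chunk records and field names are illustrative):

```python
# Illustrative chunk records: when a long policy is split, the child carries
# a pointer back to its parent and a flag telling retrieval to fetch both.
chunks = {
    "policy-042": {
        "text": "All database columns must use snake_case naming.",
        "parent_chunk_id": None,
        "relationship_type": "rule",
        "retrieve_with_parent": False,
    },
    "policy-042-ex1": {
        "text": "Exception: billing_v1 tables keep camelCase until the Q3 migration.",
        "parent_chunk_id": "policy-042",
        "relationship_type": "exception",
        "retrieve_with_parent": True,
    },
}

def expand_hits(hit_ids: list[str]) -> list[str]:
    """Add required parents, and exceptions that qualify a hit, to the result set."""
    result = list(hit_ids)
    for hit_id in hit_ids:
        chunk = chunks[hit_id]
        if chunk["retrieve_with_parent"] and chunk["parent_chunk_id"]:
            result.append(chunk["parent_chunk_id"])
        # If the hit is a parent, pull in children that qualify it.
        result += [
            cid for cid, c in chunks.items()
            if c["parent_chunk_id"] == hit_id and c["relationship_type"] == "exception"
        ]
    return list(dict.fromkeys(result))  # de-duplicate, keep order

print(expand_hits(["policy-042"]))      # -> ['policy-042', 'policy-042-ex1']
print(expand_hits(["policy-042-ex1"]))  # -> ['policy-042-ex1', 'policy-042']
```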
4. Add organizational metadata
Embed the context that chunkers can't infer: `doc_type`, `status`, `effective_date`, `team`, `deprecated`. Let retrieval filter before similarity.
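A sketch of filtering on that metadata before similarity ranking; the chunk records, field names, and the toy similarity function are all placeholders for whatever your stack provides:

```python
# Each chunk carries organizational metadata that the splitter could never
# infer from text alone. Field names are illustrative.
chunks = [
    {"text": "Use the v2 /orders endpoint...", "doc_type": "api",
     "status": "current", "team": "payments", "deprecated": False},
    {"text": "The v1 /orders endpoint accepts...", "doc_type": "api",
     "status": "superseded", "team": "payments", "deprecated": True},
]

def retrieve(query: str, filters: dict, similarity, top_k: int = 5) -> list[dict]:
    # 1. Hard metadata filter first: deprecated or out-of-scope chunks
    #    never compete on similarity at all.
    candidates = [
        c for c in chunks
        if all(c.get(key) == value for key, value in filters.items())
    ]
    # 2. Only then rank by similarity (swap in your embedding model's scoring).
    return sorted(candidates, key=lambda c: similarity(query, c["text"]),
                  reverse=True)[:top_k]

# Toy similarity: word overlap, just to make the sketch runnable.
def word_overlap(query: str, text: str) -> float:
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / max(len(q), 1)

hits = retrieve("orders endpoint", {"deprecated": False}, word_overlap)
print([h["text"] for h in hits])  # only the non-deprecated v2 doc survives
```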
5. Test chunking, not just retrieval
Write tests that verify: given this source document, does chunking produce the expected semantic units? Given this query, does retrieval return complete context?
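A sketch of such tests in pytest style, assuming an extractor like the `extract_policy_units` sketch above (the `my_chunkers` module is hypothetical):

```python
# test_chunking.py -- illustrative tests; adapt to your extractor's API.
from my_chunkers import extract_policy_units  # hypothetical module

POLICY_DOC = """All database columns must use snake_case naming.

Exception: Legacy tables in the billing_v1 schema may retain their
original camelCase naming until the Q3 migration is complete."""

def test_rule_and_exception_stay_in_one_unit():
    units = extract_policy_units(POLICY_DOC)
    assert len(units) == 1, "rule and its exception must not be split apart"
    assert "snake_case" in units[0]["text"]
    assert "billing_v1" in units[0]["text"]

def test_unit_carries_exception_flag():
    units = extract_policy_units(POLICY_DOC)
    assert units[0]["has_exception"] is True
```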
A Diagnostic Checklist
Before you blame your embeddings or reranker, check these:
- Audit 10 failure cases. For each, trace back: what chunk was retrieved? What was missing?
- Check rule/exception separation. Take 5 policies with exceptions. Are they in the same chunk?
- Check config coherence. Take 3 config files. Are related keys in the same chunk?
- Check anti-pattern completeness. Do anti-pattern docs retrieve with their solutions?
- Check temporal context. Do deprecation notices include effective dates and migration paths?
If any of these fail, no amount of downstream tuning will fix your system. The problem is upstream.
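Here is a sketch of the first checklist item as a script, assuming you can call your retriever directly; the `search` callable, the field names, and the failure cases are placeholders:

```python
# audit_chunking.py -- trace failure cases back to the retrieved chunks.
# `search` is a placeholder for your retriever; `expected_phrases` are the
# passages a correct answer depends on, taken from the source documents.
FAILURE_CASES = [
    {
        "query": "What naming convention should I use for billing tables?",
        "expected_phrases": ["snake_case", "billing_v1", "camelCase"],
    },
    {
        "query": "Can I use the v1 API?",
        "expected_phrases": ["deprecated as of 2024-01-15", "until 2024-06-01"],
    },
]

def audit(search, top_k: int = 5) -> None:
    for case in FAILURE_CASES:
        hits = search(case["query"], top_k=top_k)   # -> list of chunk texts
        retrieved_text = "\n".join(hits)
        missing = [p for p in case["expected_phrases"] if p not in retrieved_text]
        status = "OK " if not missing else "MISS"
        print(f"[{status}] {case['query']}")
        for phrase in missing:
            print(f"       missing: {phrase!r}")

# Usage: audit(search=my_retriever.search)  # hypothetical retriever handle
```

If most failures show up as missing phrases rather than wrong ranking, the fix belongs in the chunker, not the retriever.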
The takeaway
Off-the-shelf chunkers work for public corpora because public content is descriptive. Internal content is prescriptive—rules, policies, constraints, exceptions. These require semantic chunking, not token splitting. Overlap and size tuning cannot fix a modeling problem.
What's one policy document in your system where you know the chunking is wrong? Start the audit there.