Why Off-the-Shelf Chunkers Silently Break Internal RAG
Default chunkers optimize for token budget—not organizational knowledge. That's a fundamental mismatch.
Your RAG system looks fine in the demo. It answers questions about your internal docs. Stakeholders are happy.
Six weeks later, the coding assistant suggests deprecated patterns. The compliance bot misses exceptions. The onboarding agent gives contradictory answers about the same policy.
You tune prompts. Swap rerankers. Increase context windows. Nothing helps—because the problem isn't downstream. It's the chunker.
The uncomfortable truth
Generic text splitters were designed for Wikipedia and blog posts. They optimize for token budget and overlap—not for policy, intent, or semantic atomicity. When your internal docs prescribe how things should be done, embedding random slices destroys meaning.
What Off-the-Shelf Chunkers Actually Optimize For
Popular frameworks ship with reasonable defaults. Those defaults make three assumptions:
- Text is descriptive. The chunker assumes you're indexing content that describes things—encyclopedia entries, blog posts, documentation that explains concepts. It has no concept of prescriptive content: rules, policies, constraints, requirements.
- Any chunk is retrievable in isolation. The splitter assumes that if you hit a token limit, you can cut anywhere and both halves remain useful. For public corpora, this is often true. For internal standards, it's catastrophic.
- Overlap handles context. The theory: if you overlap chunks by 10-20%, you preserve continuity. The reality: overlap preserves text continuity, not semantic continuity. A policy and its exception can land in different chunks even with aggressive overlap.
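To see what these assumptions do in practice, here is a minimal sketch of a default-style splitter: break on sentence boundaries, then pack sentences into a fixed character budget. The budget is illustrative; real defaults count tokens, but the behavior is the same.

```python
import re

# Minimal stand-in for a default splitter: break on sentence boundaries,
# then pack sentences into chunks up to a fixed character budget.
def naive_split(text: str, budget: int = 80) -> list[str]:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > budget:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

policy = (
    "All database columns must use snake_case naming. "
    "Exception: Legacy tables in the billing_v1 schema may retain their "
    "original camelCase naming until the Q3 migration is complete."
)

for i, chunk in enumerate(naive_split(policy), start=1):
    print(f"Chunk {i}: {chunk}")
# Chunk 1: the rule. Chunk 2: the exception. Nothing in the output records
# that the second chunk qualifies the first.
```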
Why This Matters for Internal Codebases
Internal documentation isn't Wikipedia. It's normative: it tells people what they should do, what's required, and what exceptions exist.
Consider what's actually in your internal knowledge base:
| Content Type | What It Encodes | What Generic Chunking Destroys |
|---|---|---|
| Architecture standards | Required patterns + exceptions | Rule/exception relationships |
| Security policies | Constraints + approval paths | Constraint completeness |
| API documentation | Endpoint + auth + schema + examples | Endpoint coherence |
| Config schemas | Keys + defaults + constraints | Key/value relationships |
| Migration guides | Before/after + breaking changes | Migration context |
| Anti-patterns | What NOT to do + correct alternative | Problem/solution pairing |
Generic chunkers have no awareness of these structures. They see text. They count tokens. They cut.
The Failure Modes (With Examples)
Failure Mode 1: Splitting rules from exceptions
Original policy:
"All database columns must use snake_case naming.
Exception: Legacy tables in the `billing_v1` schema
may retain their original camelCase naming until
the Q3 migration is complete."
What the chunker produces:
Chunk 1: "All database columns must use snake_case naming."
Chunk 2: "Exception: Legacy tables in the billing_v1 schema..."
Query: "What naming convention should I use for billing tables?"
Retrieved: Chunk 1
Answer: "Use snake_case."
Correct answer: "It depends—check if it's in billing_v1."
Failure Mode 2: Breaking config file coherence
Original config:
database:
  host: ${DB_HOST}
  port: 5432
  pool_size: 20          # Required for production
  ssl_mode: verify-full  # Do NOT change
What the chunker produces:
Chunk 1: "database:\n host: ${DB_HOST}\n port: 5432"
Chunk 2: "pool_size: 20 # Required for production\n ssl_mode..."
Query: "What database config do I need for production?"
Retrieved: Chunk 1 (higher similarity to "database config")
Answer: Missing pool_size and ssl_mode—both critical.
Failure Mode 3: Separating anti-patterns from solutions
Original anti-pattern doc:
"DON'T: Use SELECT * in production queries.
This causes schema drift issues when columns are added.
DO: Explicitly list required columns.
SELECT id, name, email FROM users WHERE..."
What the chunker produces:
Chunk 1: "DON'T: Use SELECT * in production queries..."
Chunk 2: "DO: Explicitly list required columns..."
Query: "How should I write SQL queries?"
Retrieved: Either chunk, but not both
Answer: Missing either the anti-pattern or the solution.
Failure Mode 4: Losing temporal context
Original deprecation notice:
"The v1 API is deprecated as of 2024-01-15.
All new integrations must use v2.
Existing v1 integrations have until 2024-06-01 to migrate."
What the chunker produces:
Chunk 1: "The v1 API is deprecated as of 2024-01-15."
Chunk 2: "All new integrations must use v2. Existing..."
Query: "Can I use the v1 API?"
Retrieved: Chunk 1
Answer: "The v1 API is deprecated."
Correct answer: "It depends on the current date and on whether the integration is new or existing."
Why Tuning Parameters Doesn't Fix This
The instinct is to tune: smaller chunks, bigger chunks, more overlap, less overlap. None of it works, because this isn't a tuning problem. It's a modeling problem.
The core issue: Generic chunkers optimize for token distribution. Your internal docs encode semantic relationships that don't map to token boundaries.
- Smaller chunks → more fragmentation, more context loss
- Bigger chunks → fewer chunks retrieved, key details buried
- More overlap → duplicated text, wasted context window
- Less overlap → harder boundaries, same fragmentation
You can't tune your way out of a fundamental mismatch between the chunker's assumptions and your content's structure.
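A tiny illustration of why (the policy and the settings are made up): a fixed-size splitter places its boundaries by character arithmetic alone, so no parameter setting ever takes the rule/exception boundary into account.

```python
policy = (
    "All database columns must use snake_case naming. "
    "Exception: Legacy tables in the billing_v1 schema may retain their "
    "original camelCase naming until the Q3 migration is complete."
)
# The one boundary that actually matters semantically:
exception_at = policy.index("Exception:")

for chunk_size, overlap in [(64, 0), (64, 16), (128, 32), (256, 64)]:
    step = chunk_size - overlap
    starts = list(range(0, len(policy), step))
    print(f"size={chunk_size:>3} overlap={overlap:>2} -> chunks start at {starts}")
print(f"the exception begins at character {exception_at}")
# Every setting places chunk boundaries by arithmetic on character counts;
# none of them is derived from where the exception begins. Whether the rule
# and its exception co-occur in a chunk is an accident of document length.
```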
What to Do Instead
The fix isn't better parameters. It's treating chunking as knowledge modeling—not text processing.
1. Define semantic boundaries
What are the natural units of knowledge in your corpus? Policies. API endpoints. Config schemas. Anti-patterns. Each of these is a semantic unit that should stay together.
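One way to make those units explicit is a small schema that your extractors and tests can share. The type names here are illustrative, not prescriptive:

```python
from dataclasses import dataclass, field
from enum import Enum

# Illustrative unit types; use whatever taxonomy matches your corpus.
class UnitType(Enum):
    POLICY = "policy"                # rule + exceptions + effective dates
    API_ENDPOINT = "api_endpoint"    # endpoint + auth + schema + example
    CONFIG_SCHEMA = "config_schema"  # keys + defaults + constraints
    ANTI_PATTERN = "anti_pattern"    # what not to do + correct alternative
    MIGRATION = "migration"          # before/after + breaking changes

@dataclass
class SemanticUnit:
    unit_type: UnitType
    text: str                        # the full unit, never split mid-rule
    source_path: str                 # where it came from
    metadata: dict = field(default_factory=dict)
```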
2. Build type-aware extractors
Instead of one generic splitter, build extractors for each content type. An API doc extractor keeps endpoint + auth + schema + example together. A policy extractor keeps rule + exception + effective date together.
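A sketch of a policy extractor, assuming a hypothetical format where each policy is a paragraph optionally followed by `Exception:` paragraphs; adapt the parsing to however your docs are actually written:

```python
import re

# Hypothetical format: each policy is a blank-line-separated block; a block
# that starts with "Exception:" belongs to the policy block before it.
def extract_policy_units(doc_text: str) -> list[dict]:
    blocks = [b.strip() for b in re.split(r"\n\s*\n", doc_text) if b.strip()]
    units = []
    for block in blocks:
        if block.startswith("Exception:") and units:
            # Keep the exception in the same unit as its rule.
            units[-1]["text"] += "\n" + block
            units[-1]["has_exception"] = True
        else:
            units.append({"type": "policy", "text": block, "has_exception": False})
    return units

doc = """All database columns must use snake_case naming.

Exception: Legacy tables in the billing_v1 schema may retain their
original camelCase naming until the Q3 migration is complete.

All services must emit structured JSON logs."""

for unit in extract_policy_units(doc):
    print(unit["has_exception"], "|", unit["text"].splitlines()[0])
```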
3. Preserve relationships explicitly
When you must split, preserve parent-child relationships in metadata. Tag chunks with `parent_chunk_id`, `relationship_type`, `retrieve_with_parent`.
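When a unit is genuinely too large to embed whole, here is a sketch of what that metadata and a retrieval-time expansion step could look like (the chunk records and field names are illustrative):

```python
# Illustrative chunk records: when a long policy is split, the child carries
# a pointer back to its parent and a flag telling retrieval to fetch both.
chunks = {
    "policy-042": {
        "text": "All database columns must use snake_case naming.",
        "parent_chunk_id": None,
        "relationship_type": "rule",
        "retrieve_with_parent": False,
    },
    "policy-042-ex1": {
        "text": "Exception: billing_v1 tables keep camelCase until the Q3 migration.",
        "parent_chunk_id": "policy-042",
        "relationship_type": "exception",
        "retrieve_with_parent": True,
    },
}

def expand_hits(hit_ids: list[str]) -> list[str]:
    """Add required parents, and exceptions that qualify a hit, to the result set."""
    result = list(hit_ids)
    for hit_id in hit_ids:
        chunk = chunks[hit_id]
        if chunk["retrieve_with_parent"] and chunk["parent_chunk_id"]:
            result.append(chunk["parent_chunk_id"])
        # If the hit is a parent, pull in children that qualify it.
        result += [
            cid for cid, c in chunks.items()
            if c["parent_chunk_id"] == hit_id and c["relationship_type"] == "exception"
        ]
    return list(dict.fromkeys(result))  # de-duplicate, keep order

print(expand_hits(["policy-042"]))      # -> ['policy-042', 'policy-042-ex1']
print(expand_hits(["policy-042-ex1"]))  # -> ['policy-042-ex1', 'policy-042']
```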
4. Add organizational metadata
Embed the context that chunkers can't infer: `doc_type`, `status`, `effective_date`, `team`, `deprecated`. Let retrieval filter before similarity.
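A sketch of filtering on that metadata before similarity ranking; the chunk records, field names, and the toy similarity function are all placeholders for whatever your stack provides:

```python
# Each chunk carries organizational metadata that the splitter could never
# infer from text alone. Field names are illustrative.
chunks = [
    {"text": "Use the v2 /orders endpoint...", "doc_type": "api",
     "status": "current", "team": "payments", "deprecated": False},
    {"text": "The v1 /orders endpoint accepts...", "doc_type": "api",
     "status": "superseded", "team": "payments", "deprecated": True},
]

def retrieve(query: str, filters: dict, similarity, top_k: int = 5) -> list[dict]:
    # 1. Hard metadata filter first: deprecated or out-of-scope chunks
    #    never compete on similarity at all.
    candidates = [
        c for c in chunks
        if all(c.get(key) == value for key, value in filters.items())
    ]
    # 2. Only then rank by similarity (swap in your embedding model's scoring).
    return sorted(candidates, key=lambda c: similarity(query, c["text"]),
                  reverse=True)[:top_k]

# Toy similarity: word overlap, just to make the sketch runnable.
def word_overlap(query: str, text: str) -> float:
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / max(len(q), 1)

hits = retrieve("orders endpoint", {"deprecated": False}, word_overlap)
print([h["text"] for h in hits])  # only the non-deprecated v2 doc survives
```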
5. Test chunking, not just retrieval
Write tests that verify: given this source document, does chunking produce the expected semantic units? Given this query, does retrieval return complete context?
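A sketch of such tests in pytest style, assuming an extractor like the `extract_policy_units` sketch above (the `my_chunkers` module is hypothetical):

```python
# test_chunking.py -- illustrative tests; adapt to your extractor's API.
from my_chunkers import extract_policy_units  # hypothetical module

POLICY_DOC = """All database columns must use snake_case naming.

Exception: Legacy tables in the billing_v1 schema may retain their
original camelCase naming until the Q3 migration is complete."""

def test_rule_and_exception_stay_in_one_unit():
    units = extract_policy_units(POLICY_DOC)
    assert len(units) == 1, "rule and its exception must not be split apart"
    assert "snake_case" in units[0]["text"]
    assert "billing_v1" in units[0]["text"]

def test_unit_carries_exception_flag():
    units = extract_policy_units(POLICY_DOC)
    assert units[0]["has_exception"] is True
```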
A Diagnostic Checklist
Before you blame your embeddings or reranker, check these:
- Audit 10 failure cases. For each, trace back: what chunk was retrieved? What was missing?
- Check rule/exception separation. Take 5 policies with exceptions. Are they in the same chunk?
- Check config coherence. Take 3 config files. Are related keys in the same chunk?
- Check anti-pattern completeness. Do anti-pattern docs retrieve with their solutions?
- Check temporal context. Do deprecation notices include effective dates and migration paths?
If any of these fail, no amount of downstream tuning will fix your system. The problem is upstream.
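Here is a sketch of the first checklist item as a script, assuming you can call your retriever directly; the `search` callable, the field names, and the failure cases are placeholders:

```python
# audit_chunking.py -- trace failure cases back to the retrieved chunks.
# `search` is a placeholder for your retriever; `expected_phrases` are the
# passages a correct answer depends on, taken from the source documents.
FAILURE_CASES = [
    {
        "query": "What naming convention should I use for billing tables?",
        "expected_phrases": ["snake_case", "billing_v1", "camelCase"],
    },
    {
        "query": "Can I use the v1 API?",
        "expected_phrases": ["deprecated as of 2024-01-15", "until 2024-06-01"],
    },
]

def audit(search, top_k: int = 5) -> None:
    for case in FAILURE_CASES:
        hits = search(case["query"], top_k=top_k)   # -> list of chunk texts
        retrieved_text = "\n".join(hits)
        missing = [p for p in case["expected_phrases"] if p not in retrieved_text]
        status = "OK " if not missing else "MISS"
        print(f"[{status}] {case['query']}")
        for phrase in missing:
            print(f"       missing: {phrase!r}")

# Usage: audit(search=my_retriever.search)  # hypothetical retriever handle
```

If most failures show up as missing phrases rather than wrong ranking, the fix belongs in the chunker, not the retriever.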
The takeaway
Off-the-shelf chunkers work for public corpora because public content is descriptive. Internal content is prescriptive—rules, policies, constraints, exceptions. These require semantic chunking, not token splitting. Overlap and size tuning cannot fix a modeling problem.
What's one policy document in your system where you know the chunking is wrong? Start the audit there.