Chunking Is Knowledge Modeling, Not Preprocessing
Why "split by tokens" fails for internal RAG—and how artifact-based chunking changes everything.
Chunking isn't a preprocessing step. It's how your RAG system thinks.
Teams treat chunking as infrastructure: run a splitter, pick a token limit, move on. But for internal codebases, that approach guarantees retrieval failures. Every internal repo encodes meaning differently, and no universal chunker can serve them all.
The shift
Move from "split by tokens" to "extract artifacts." Define boundaries, dependencies, and rules per artifact type. Encode organizational intent into chunk metadata so retrieval filters before similarity.
Why This Is Breaking Now
Two years of production RAG deployments have taught us something painful: the same chunking strategy that works for Wikipedia fails catastrophically for internal documentation.
Consider what's actually in your codebase:
- API services need endpoint + schema + examples together
- Infrastructure repos encode meaning in YAML comments and implied defaults
- SDK usage lives in method signatures + usage patterns
- Config files must be retrieved as a whole—or not at all
- Migration rules require before/after context plus constraints
A token-based splitter treats all of this identically. It doesn't know an endpoint from an anti-pattern. It doesn't know which YAML key is required and which is a default. It just counts tokens and cuts.
The Artifact Model
The fix is to stop thinking about "chunks" and start thinking about artifacts—first-class knowledge units with explicit types, boundaries, and metadata.
An artifact is:
- A complete unit of meaning that can be retrieved and understood in isolation
- Typed—so retrieval can filter by what kind of knowledge is needed
- Metadata-rich—encoding organizational context that embeddings can't infer
- Boundary-aware—knowing what must stay together vs. what can be split
Token-based chunking:
Input: 50,000 tokens of documentation
Output: 100 chunks of ~500 tokens each
Lost: Structure, relationships, meaning
Artifact-based chunking:
Input: 50,000 tokens of documentation
Output: 47 API endpoints, 12 config schemas,
8 migration rules, 15 anti-patterns
Preserved: Type, boundaries, metadata, relationships
Artifact Types for Internal Codebases
Different artifact types need different extraction strategies. Here's a practical taxonomy:
API Endpoints
An endpoint artifact includes: route, method, parameters, request/response schema, authentication requirements, and example usage. These must stay together.
{
"artifact_type": "api_endpoint",
"route": "/api/v2/users/{id}",
"method": "PATCH",
"auth": "bearer_token",
"team": "identity",
"valid_since": "2024-06-01",
"content": "... full endpoint documentation ..."
}
Config Schemas
Configuration files are whole-document artifacts. Splitting a YAML file by tokens destroys meaning—defaults, required fields, and comments lose their relationship to the keys they describe.
{
"artifact_type": "config_schema",
"file_path": "config/database.yaml",
"scope": "production",
"retrieve_whole": true,
"content": "... entire file with comments preserved ..."
}
Anti-Patterns
Anti-patterns are high-value artifacts: they encode what not to do, plus the correct alternative. Separating the "don't do this" from the "do this instead" creates wrong answers.
{
"artifact_type": "anti_pattern",
"category": "database_access",
"severity": "critical",
"wrong_approach": "... what not to do ...",
"correct_approach": "... what to do instead ...",
"rationale": "... why this matters ..."
}
Migration Rules
Migrations have before/after states, constraints, and sequencing. A migration rule artifact preserves all of this context.
{
"artifact_type": "migration_rule",
"from_version": "v2",
"to_version": "v3",
"breaking": true,
"requires": ["database_backup", "feature_flag"],
"content": "... full migration documentation ..."
}
SDK Patterns
SDK usage patterns combine method signatures with initialization, error handling, and real-world examples. Token splitting fragments these across multiple chunks, making retrieval unreliable.
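Following the shape of the examples above, an SDK pattern artifact might look like this (the sdk and covers fields are illustrative, not a fixed schema):

{
  "artifact_type": "sdk_pattern",
  "sdk": "payments-client",
  "language": "python",
  "covers": ["initialization", "error_handling", "retries"],
  "content": "... signature, setup, and a working usage example together ..."
}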
Metadata That Enables Filtering
The power of artifact-based chunking comes from metadata. Embeddings match similarity. Metadata filters before similarity—ensuring you only retrieve relevant artifact types.
Essential metadata fields:
| Field | Purpose | Example |
|---|---|---|
| artifact_type | Filter by knowledge category | api_endpoint, config_schema, anti_pattern |
| scope | Filter by environment/context | production, staging, internal |
| team | Filter by ownership | platform, payments, identity |
| valid_since | Filter by temporal validity | 2024-06-01 |
| status | Filter by lifecycle state | approved, deprecated, draft |
| retrieve_whole | Flag for atomic retrieval | true / false |
When a developer asks "How do I configure the database connection for production?", filter to artifact_type=config_schema and scope=production before running similarity search. The embedding only needs to match within the relevant subset.
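Here is a minimal sketch of that filter-then-rank flow, assuming each artifact carries a metadata dict and a precomputed embedding (embed_query and the field names are illustrative, not any specific vector store's API):

import numpy as np

def retrieve(query_embedding, artifacts, filters, top_k=5):
    # Hard-filter on exact metadata matches first...
    candidates = [
        a for a in artifacts
        if all(a["metadata"].get(key) == value for key, value in filters.items())
    ]
    # ...then rank only the surviving subset by cosine similarity.
    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
    candidates.sort(key=lambda a: cosine(query_embedding, a["embedding"]), reverse=True)
    return candidates[:top_k]

# retrieve(embed_query("configure the database connection for production"),
#          artifacts, filters={"artifact_type": "config_schema", "scope": "production"})

Most vector stores expose the same idea as a metadata filter argument on the query; the point is that the filter runs before similarity, not after.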
Building an Artifact Extractor
Artifact extraction is not generic text splitting. It's domain-specific parsing that understands your codebase's structure.
1. Define your artifact taxonomy
Enumerate the artifact types that exist in your codebase. Start with 5-10 high-value types. You can add more as you learn what queries fail.
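One way to pin the taxonomy down is in code, as a minimal sketch (the type names are illustrative; mirror whatever your repos actually contain):

from enum import Enum

class ArtifactType(str, Enum):
    API_ENDPOINT = "api_endpoint"
    CONFIG_SCHEMA = "config_schema"
    ANTI_PATTERN = "anti_pattern"
    MIGRATION_RULE = "migration_rule"
    SDK_PATTERN = "sdk_pattern"

Keeping the taxonomy in code gives extractors and retrieval filters a single shared vocabulary to validate against.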
2. Write extractors per type
Each artifact type needs a custom extractor. For OpenAPI specs, parse the YAML and extract each endpoint as an artifact. For markdown docs, parse headings and extract sections with their context.
# Endpoint extractor: one artifact per (path, method) in an OpenAPI spec
import yaml

HTTP_METHODS = {"get", "post", "put", "patch", "delete", "head", "options"}

def render_endpoint_doc(path, method, details):
    # Minimal rendering; swap in your own documentation template.
    return f"{method.upper()} {path}\n{details.get('summary', '')}\n{yaml.safe_dump(details)}"

def extract_endpoints(openapi_spec):
    artifacts = []
    for path, path_item in openapi_spec["paths"].items():
        for method, details in path_item.items():
            if method.lower() not in HTTP_METHODS:
                continue  # skip non-operation keys like "parameters" or "summary"
            artifacts.append({
                "artifact_type": "api_endpoint",
                "route": path,
                "method": method.upper(),
                "content": render_endpoint_doc(path, method, details),
                "team": details.get("x-team"),
                "auth": details.get("security", []),
            })
    return artifacts
3. Version control your extractors
Extractors are code. They should live in source control, be reviewed, and evolve with your documentation. When docs change structure, extractors need updating.
4. Test extraction, not just retrieval
Write tests that validate extraction: given this source file, does the extractor produce the expected artifacts with correct metadata?
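A minimal sketch with pytest, assuming the extract_endpoints function above lives in a hypothetical extractors module:

import yaml
from extractors import extract_endpoints  # hypothetical module path

FIXTURE = """
paths:
  /api/v2/users/{id}:
    patch:
      x-team: identity
      summary: Update a user
"""

def test_patch_endpoint_extracted_with_metadata():
    artifacts = extract_endpoints(yaml.safe_load(FIXTURE))
    assert len(artifacts) == 1
    artifact = artifacts[0]
    assert artifact["artifact_type"] == "api_endpoint"
    assert artifact["route"] == "/api/v2/users/{id}"
    assert artifact["method"] == "PATCH"
    assert artifact["team"] == "identity"

Fixtures like this double as documentation of what each extractor is supposed to produce, and they catch breakage when the source docs change structure.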
Why Engineers Must Own Chunking
Chunking is too important to delegate to a generic library. Here's why engineers—not just ML teams—need to own this:
- Engineers understand the domain. They know which config keys are required, which API endpoints are deprecated, which patterns are anti-patterns.
- Engineers maintain the docs. When docs change, extraction logic may need to change. This is a code review, not a re-run.
- Engineers debug failures. When RAG returns wrong answers, the fix is often in extraction—not in prompts or models.
The pattern: Chunking logic should be reviewed like code, tested like code, and owned by the team that owns the documentation.
Implementation Checklist
- Audit your sources. What document types exist? What structure does each have?
- Define artifact taxonomy. 5-10 artifact types that cover your high-value knowledge.
- Build per-type extractors. Custom parsing for each artifact type—not generic splitting.
- Design metadata schema. artifact_type, scope, team, valid_since, status, retrieve_whole.
- Test extraction. Given source X, does extractor produce artifacts Y with metadata Z?
- Version control extractors. Treat extraction logic as production code.
- Test retrieval by type. Given query Q with filter F, does retrieval return correct artifacts?
- Iterate on failures. When RAG fails, trace back to extraction—not just prompts.
The Bigger Picture
Token-based chunking made sense when RAG was "answer questions about this PDF." For internal codebases, it's the wrong abstraction.
Your codebase is a knowledge graph of interconnected artifacts: endpoints that reference schemas, configs that imply constraints, patterns that depend on conventions. Retrieval needs to respect those relationships.
The takeaway
Chunking isn't preprocessing—it's knowledge modeling. Define your artifact types. Build extractors that preserve meaning. Encode organizational context in metadata. Let retrieval filter before similarity.
What's one artifact type in your codebase that's currently being destroyed by token splitting? Start there.