Chunking Is Knowledge Modeling, Not Preprocessing

Why "split by tokens" fails for internal RAG—and how artifact-based chunking changes everything.

Chunking isn't a preprocessing step. It's how your RAG system thinks.

Teams treat chunking as infrastructure—run a splitter, pick a token limit, move on. But for internal codebases, that approach guarantees retrieval failures. Every internal repo encodes meaning differently, and no universal chunker can serve them all.

The shift

Move from "split by tokens" to "extract artifacts." Define boundaries, dependencies, and rules per artifact type. Encode organizational intent into chunk metadata so retrieval filters before similarity.


Why This Is Breaking Now

Two years of production RAG deployments have taught us something painful: the same chunking strategy that works for Wikipedia fails catastrophically for internal documentation.

Consider what's actually in your codebase:

  • API services need endpoint + schema + examples together
  • Infrastructure repos encode meaning in YAML comments and implied defaults
  • SDK usage lives in method signatures + usage patterns
  • Config files must be retrieved as a whole—or not at all
  • Migration rules require before/after context plus constraints

A token-based splitter treats all of this identically. It doesn't know an endpoint from an anti-pattern. It doesn't know which YAML key is required and which is a default. It just counts tokens and cuts.
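
To see the failure concretely, here's a minimal sketch of a naive fixed-size splitter applied to a config fragment (both the splitter and the YAML are illustrative):

# Illustrative only: fixed-size splitting of a YAML fragment
config = """\
# Required in production; the default below is only safe for local dev
max_connections: 10
pool_timeout_seconds: 30
"""

def naive_chunk(text, size=60):
    # Cut every `size` characters, ignoring structure entirely
    return [text[i:i + size] for i in range(0, len(text), size)]

for chunk in naive_chunk(config):
    print(repr(chunk))
# The explanatory comment is cut mid-sentence: most of it lands in a
# different chunk from the keys it describes, so retrieval gets one
# without the other.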


The Artifact Model

The fix is to stop thinking about "chunks" and start thinking about artifacts—first-class knowledge units with explicit types, boundaries, and metadata.

An artifact is:

  • A complete unit of meaning that can be retrieved and understood in isolation
  • Typed—so retrieval can filter by what kind of knowledge is needed
  • Metadata-rich—encoding organizational context that embeddings can't infer
  • Boundary-aware—knowing what must stay together vs. what can be split

Token-based chunking:
  Input: 50,000 tokens of documentation
  Output: 100 chunks of ~500 tokens each
  Lost: Structure, relationships, meaning

Artifact-based chunking:
  Input: 50,000 tokens of documentation
  Output: 47 API endpoints, 12 config schemas,
          8 migration rules, 15 anti-patterns
  Preserved: Type, boundaries, metadata, relationships

Artifact Types for Internal Codebases

Different artifact types need different extraction strategies. Here's a practical taxonomy:

API Endpoints

An endpoint artifact includes: route, method, parameters, request/response schema, authentication requirements, and example usage. These must stay together.

{
  "artifact_type": "api_endpoint",
  "route": "/api/v2/users/{id}",
  "method": "PATCH",
  "auth": "bearer_token",
  "team": "identity",
  "valid_since": "2024-06-01",
  "content": "... full endpoint documentation ..."
}

Config Schemas

Configuration files are whole-document artifacts. Splitting a YAML file by tokens destroys meaning—defaults, required fields, and comments lose their relationship to the keys they describe.

{
  "artifact_type": "config_schema",
  "file_path": "config/database.yaml",
  "scope": "production",
  "retrieve_whole": true,
  "content": "... entire file with comments preserved ..."
}

Anti-Patterns

Anti-patterns are high-value artifacts: they encode what not to do, plus the correct alternative. Separating the "don't do this" from the "do this instead" creates wrong answers.

{
  "artifact_type": "anti_pattern",
  "category": "database_access",
  "severity": "critical",
  "wrong_approach": "... what not to do ...",
  "correct_approach": "... what to do instead ...",
  "rationale": "... why this matters ..."
}

Migration Rules

Migrations have before/after states, constraints, and sequencing. A migration rule artifact preserves all of this context.

{
  "artifact_type": "migration_rule",
  "from_version": "v2",
  "to_version": "v3",
  "breaking": true,
  "requires": ["database_backup", "feature_flag"],
  "content": "... full migration documentation ..."
}

SDK Patterns

SDK usage patterns combine method signatures with initialization, error handling, and real-world examples. Token splitting fragments these across multiple chunks, making retrieval unreliable.
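
Following the shape of the examples above, an SDK-pattern artifact might look like this (the specific fields and the payments-client name are a sketch, not a fixed schema):

{
  "artifact_type": "sdk_pattern",
  "sdk": "payments-client",
  "language": "python",
  "includes": ["initialization", "error_handling", "example_usage"],
  "content": "... signature plus initialization, error handling, and examples ..."
}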


Metadata That Enables Filtering

The power of artifact-based chunking comes from metadata. Embeddings match similarity. Metadata filters before similarity—ensuring you only retrieve relevant artifact types.

Essential metadata fields:

Field            Purpose                          Example
artifact_type    Filter by knowledge category     api_endpoint, config_schema, anti_pattern
scope            Filter by environment/context    production, staging, internal
team             Filter by ownership              platform, payments, identity
valid_since      Filter by temporal validity      2024-06-01
status           Filter by lifecycle state        approved, deprecated, draft
retrieve_whole   Flag for atomic retrieval        true / false

When a developer asks "How do I configure the database connection for production?"—filter to artifact_type=config_schema and scope=production before running similarity search. The embedding only needs to match within the relevant subset.
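
As a sketch of that flow, here's a pure-Python retriever that applies exact metadata filters before cosine similarity (the artifact store, its "embedding" field, and the embed() call are stand-ins for whatever your stack provides):

import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec, artifacts, filters, top_k=3):
    # Step 1: metadata filter -- cheap, exact, and runs first
    candidates = [
        a for a in artifacts
        if all(a.get(key) == value for key, value in filters.items())
    ]
    # Step 2: similarity search, only within the filtered subset
    candidates.sort(key=lambda a: cosine(query_vec, a["embedding"]), reverse=True)
    return candidates[:top_k]

# Example call for the query above:
# retrieve(embed("configure the database connection for production"),
#          artifacts,
#          filters={"artifact_type": "config_schema", "scope": "production"})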


Building an Artifact Extractor

Artifact extraction is not generic text splitting. It's domain-specific parsing that understands your codebase's structure.

1. Define your artifact taxonomy

Enumerate the artifact types that exist in your codebase. Start with 5-10 high-value types. You can add more as you learn what queries fail.
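
If it helps to make the taxonomy enforceable, it can live in code. A minimal sketch, with type names taken from the examples in this post:

# Starting taxonomy; extend it as failing queries reveal new types
ARTIFACT_TYPES = {
    "api_endpoint",
    "config_schema",
    "anti_pattern",
    "migration_rule",
    "sdk_pattern",
}

def validate_artifact(artifact):
    # Reject artifacts whose type is outside the agreed taxonomy
    if artifact["artifact_type"] not in ARTIFACT_TYPES:
        raise ValueError(f"unknown artifact_type: {artifact['artifact_type']}")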

2. Write extractors per type

Each artifact type needs a custom extractor. For OpenAPI specs, parse the YAML and extract each endpoint as an artifact. For markdown docs, parse headings and extract sections with their context.

# Endpoint extractor: one artifact per (path, method) in an OpenAPI spec
HTTP_METHODS = {"get", "post", "put", "patch", "delete", "head", "options"}

def render_endpoint_doc(path, method, details):
    # Minimal renderer: title line plus summary/description when present
    parts = [
        f"{method.upper()} {path}",
        details.get("summary", ""),
        details.get("description", ""),
    ]
    return "\n".join(p for p in parts if p)

def extract_endpoints(openapi_spec):
    artifacts = []
    for path, path_item in openapi_spec["paths"].items():
        for method, details in path_item.items():
            if method not in HTTP_METHODS:
                continue  # skip path-level keys such as "parameters"
            artifacts.append({
                "artifact_type": "api_endpoint",
                "route": path,
                "method": method.upper(),
                "content": render_endpoint_doc(path, method, details),
                "team": details.get("x-team"),
                "auth": details.get("security", []),
            })
    return artifacts

3. Version control your extractors

Extractors are code. They should live in source control, be reviewed, and evolve with your documentation. When docs change structure, extractors need updating.

4. Test extraction, not just retrieval

Write tests that validate extraction: given this source file, does the extractor produce the expected artifacts with correct metadata?
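
A minimal pytest-style sketch, assuming the extract_endpoints function above and an inline spec (both the spec and the expected values are illustrative):

def test_extract_endpoints_metadata():
    spec = {
        "paths": {
            "/api/v2/users/{id}": {
                "patch": {"summary": "Update a user", "x-team": "identity"},
            },
        },
    }
    artifacts = extract_endpoints(spec)
    assert len(artifacts) == 1
    artifact = artifacts[0]
    assert artifact["artifact_type"] == "api_endpoint"
    assert artifact["route"] == "/api/v2/users/{id}"
    assert artifact["method"] == "PATCH"
    assert artifact["team"] == "identity"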


Why Engineers Must Own Chunking

Chunking is too important to delegate to a generic library. Here's why engineers—not just ML teams—need to own this:

  • Engineers understand the domain. They know which config keys are required, which API endpoints are deprecated, which patterns are anti-patterns.
  • Engineers maintain the docs. When docs change, extraction logic may need to change. This is a code review, not a re-run.
  • Engineers debug failures. When RAG returns wrong answers, the fix is often in extraction—not in prompts or models.

The pattern: Chunking logic should be reviewed like code, tested like code, and owned by the team that owns the documentation.


Implementation Checklist

  • Audit your sources. What document types exist? What structure does each have?
  • Define artifact taxonomy. 5-10 artifact types that cover your high-value knowledge.
  • Build per-type extractors. Custom parsing for each artifact type—not generic splitting.
  • Design metadata schema. artifact_type, scope, team, valid_since, status, retrieve_whole.
  • Test extraction. Given source X, does extractor produce artifacts Y with metadata Z?
  • Version control extractors. Treat extraction logic as production code.
  • Test retrieval by type. Given query Q with filter F, does retrieval return correct artifacts?
  • Iterate on failures. When RAG fails, trace back to extraction—not just prompts.

The Bigger Picture

Token-based chunking made sense when RAG was "answer questions about this PDF." For internal codebases, it's the wrong abstraction.

Your codebase is a knowledge graph of interconnected artifacts: endpoints that reference schemas, configs that imply constraints, patterns that depend on conventions. Retrieval needs to respect those relationships.

The takeaway

Chunking isn't preprocessing—it's knowledge modeling. Define your artifact types. Build extractors that preserve meaning. Encode organizational context in metadata. Let retrieval filter before similarity.

What's one artifact type in your codebase that's currently being destroyed by token splitting? Start there.