Why We Still Need Traditional Classification Models in the Age of LLMs
A technical-but-approachable deep dive using real code from a multi-model vision pipeline.
Introduction
We live in the era of Large Language Models (LLMs) — systems capable of answering questions, generating code, summarizing documents, and even reasoning about visual inputs. Because of this, many teams assume traditional machine learning models (like classifiers and detectors) are obsolete.
But here's the truth:
LLMs are generalists; traditional models are specialists. You still need both.
In the same way a highly educated person may know a little about everything but not specialize in veterinary medicine, a general-purpose LLM can describe an image — but it cannot reliably identify 120 dog breeds or determine the health of coral reefs with scientific consistency.
This blog post introduces why classification models still matter, using a real-world example from my vision processing pipeline:
- YOLOv8 (object detection)
- Dog breed classifier
- Cat breed classifier
- Wildlife classifier (iNaturalist)
- Scene classifier (Places365)
- Coral classifier (SigLIP model)
- CLIP for image-level + crop-level embeddings
This approach forms the foundation of my "animal vs. people" processing stack — but the exact same architecture works for people, plants, products, or anything else you care about.
Why LLMs Can't Replace Specialized Models (Yet)
Think of an LLM as a very talented intern: great at adapting, reasoning, and improvising — but not great at highly consistent, domain-specific tasks where precision matters.
Analogies help here:
- YOLO is a police K9 trained to detect a specific scent.
- A dog-breed classifier is a world-class veterinarian specializing in one domain.
- LLMs are charismatic generalists capable of giving "pretty good" answers about everything — but rarely perfect at niche tasks.
That's why modern vision systems increasingly look like orchestration layers, where many narrow models assist a few broad ones. LLMs thrive when fed clean, structured signals — and those signals come from traditional models.
The Architecture: A Multi-Model Vision Worker
Below is an excerpt from an actual production vision worker Lambda responsible for:
- Downloading an image
- Detecting dogs/cats/people
- Refining classification via multiple specialized models
- Generating CLIP embeddings for full images + crops
- Storing everything in vector databases (Qdrant)
- Sending structured metadata to downstream systems
All of it uses open-source models that anyone can download and run.
Step 1: General Detection with YOLOv8
LLMs could analyze pixels, but YOLO has been optimized for this single job for a decade.
import numpy as np

def _detect_entities(img):
    """Run YOLOv8 over a PIL image and return its raw detection results."""
    model = _ensure_yolo_loaded()        # lazily load the YOLOv8 weights once per warm container
    results = model(np.asarray(img))[0]  # ultralytics returns a list of Results; take the first
    ...
YOLO gives us bounding boxes for people, dogs, cats, sheep, goats, cows, horses, deer.
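To make that concrete, here is a minimal sketch of turning those detections into labeled crops for the downstream classifiers. The class subset and the 0.5 confidence threshold are illustrative choices, not the exact values from the production worker:

from ultralytics import YOLO
from PIL import Image

ANIMAL_CLASSES = {"dog", "cat", "sheep", "cow", "horse"}  # illustrative subset

def detect_and_crop(image_path, conf_threshold=0.5):
    model = YOLO("yolov8n.pt")  # small pretrained checkpoint; swap in your own weights
    img = Image.open(image_path).convert("RGB")
    result = model(img)[0]      # one input image -> one Results object

    crops = []
    for box in result.boxes:
        label = result.names[int(box.cls)]  # class index -> human-readable name
        conf = float(box.conf)
        if conf < conf_threshold:
            continue
        x1, y1, x2, y2 = map(int, box.xyxy[0].tolist())
        if label == "person" or label in ANIMAL_CLASSES:
            crops.append({"label": label, "conf": conf, "crop": img.crop((x1, y1, x2, y2))})
    return crops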
But YOLO does not tell us what kind of dog it found, or whether a "goat" is actually a misdetected dog (a common failure mode).
Step 2: Refining Animal Identification
Here is where traditional classification models shine.
🐶 Dog Breed Classification
Model: prithivMLmods/Dog-Breed-120 (SigLIP-based)
breed_info = _classify_dog_breed(crop)
If YOLO says "goat" but the dog classifier says "Golden Retriever" with 0.92 confidence, we override YOLO.
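Here is a sketch of how `_classify_dog_breed` and that override rule can be wired together, assuming the checkpoint loads through the standard Hugging Face image-classification pipeline; the 0.8 override threshold is an illustrative number, not a tuned production value:

from transformers import pipeline

# Assumes the checkpoint ships a standard image-classification config.
_dog_breed = pipeline("image-classification", model="prithivMLmods/Dog-Breed-120")

def _classify_dog_breed(crop):
    top = _dog_breed(crop, top_k=1)[0]  # [{"label": ..., "score": ...}]
    return {"breed": top["label"], "confidence": top["score"]}

def refine_label(yolo_label, crop, override_threshold=0.8):
    """If the breed classifier is confident the crop is a dog, trust it over YOLO."""
    breed_info = _classify_dog_breed(crop)
    if breed_info["confidence"] >= override_threshold:
        return {"type": "dog", **breed_info}
    return {"type": yolo_label}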
🐱 Cat Breed Classification
Model: a custom classifier, or microsoft/resnet-50 fine-tuned on cat breeds
breed_info = _classify_cat_breed(crop)
🐾 Wildlife Species (Non-Dog/Cat)
Model: bryanzhou008/vit-base-patch16-224-in21k-finetuned-inaturalist
This model can recognize deer, sheep, goats, horses, and hundreds of other species.
Specialized knowledge like this is far outside what LLMs can reliably infer from raw pixels.
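The cat, wildlife, and coral classifiers all follow the same calling pattern, so in practice they can share one small helper. Here is a sketch, again assuming each checkpoint works with the Hugging Face image-classification pipeline:

from functools import lru_cache
from transformers import pipeline

@lru_cache(maxsize=None)
def _get_classifier(model_id):
    # Cache each pipeline so warm Lambda containers reuse the loaded weights.
    return pipeline("image-classification", model=model_id)

def classify_crop(crop, model_id, top_k=3):
    """Top-k labels and scores for a crop from any of the specialist models."""
    preds = _get_classifier(model_id)(crop, top_k=top_k)
    return [{"label": p["label"], "score": round(p["score"], 4)} for p in preds]

# Example usage with the models mentioned above:
# classify_crop(crop, "bryanzhou008/vit-base-patch16-224-in21k-finetuned-inaturalist")
# classify_crop(crop, "prithivMLmods/Coral-Health")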
Step 3: Scene Classification
Model: Places365 (MIT CSAIL)
scene_info = _classify_scene(img)
This gives us context like "living room", "forest path", "dog park", "pasture", "city street", etc.
LLMs could guess from a description, but Places365 was trained explicitly for this.
Often, the combination of scene + entities gives us richer downstream reasoning.
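For reference, here is a rough sketch of how `_classify_scene` can work with the widely used Places365 ResNet-18 release. The checkpoint and category-file paths below are placeholders pointing at your own copies of the published files, and the resize/normalization choices are the usual ImageNet-style defaults:

import torch
from functools import lru_cache
from torchvision import models, transforms

CHECKPOINT_PATH = "resnet18_places365.pth.tar"  # placeholder: the published Places365 checkpoint
CATEGORIES_PATH = "categories_places365.txt"    # placeholder: the published label file

@lru_cache(maxsize=1)
def _load_places365():
    model = models.resnet18(num_classes=365)
    state = torch.load(CHECKPOINT_PATH, map_location="cpu")["state_dict"]
    # The released checkpoint was saved with DataParallel, so strip the "module." prefix.
    model.load_state_dict({k.replace("module.", ""): v for k, v in state.items()})
    model.eval()
    categories = [line.split(" ")[0][3:] for line in open(CATEGORIES_PATH)]  # "/a/airfield 0" -> "airfield"
    return model, categories

_PREPROCESS = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

def _classify_scene(img):
    model, categories = _load_places365()
    with torch.no_grad():
        probs = torch.softmax(model(_PREPROCESS(img).unsqueeze(0)), dim=1)[0]
    idx = int(probs.argmax())
    return {"scene": categories[idx], "confidence": float(probs[idx])}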
Step 4: Coral Health Example (Optional)
Model: prithivMLmods/Coral-Health
Detects: healthy coral, bleached coral.
Again, something too specialized for an LLM alone. This optional module demonstrates how flexible the pipeline is for biological/ecological domains.
Step 5: CLIP Embeddings for Search + Similarity
Once we've detected entities, we compute embeddings.
Full Image Embedding
image_embedding = _compute_clip_embedding(img)
Crop-Level Embeddings
crop_embedding = _compute_clip_embedding(crop)
These 512‑dimensional vectors power reverse image search, similarity lookups, clustering, and retrieval for LLMs (RAG).
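A minimal sketch of how a helper like `_compute_clip_embedding` can produce those vectors with the openai/clip-vit-base-patch32 checkpoint (the variant that outputs 512 dimensions); the production helper may differ in details:

import torch
from transformers import CLIPModel, CLIPProcessor

_clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
_clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def _compute_clip_embedding(img):
    """Return a unit-length 512-d embedding for a PIL image (full frame or crop)."""
    inputs = _clip_processor(images=img, return_tensors="pt")
    with torch.no_grad():
        features = _clip_model.get_image_features(**inputs)    # shape: (1, 512)
    features = features / features.norm(dim=-1, keepdim=True)  # normalize for cosine similarity
    return features[0].tolist()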
LLMs excel once they have structured context — embeddings turn pixels into usable signals.
Why Not Use GPT‑4 Vision for Everything?
Four reasons:
1. Consistency
Given fixed weights and preprocessing, a specialized model returns the same output for the same image every time. LLM outputs can vary from run to run.
2. Traceability
A wildlife classifier with a published dataset is auditable. An LLM hallucinating "this looks like a goat" is not.
3. Latency & Cost
Running YOLO + small classifiers is far cheaper and faster than sending every image to a large vision-capable LLM.
4. Accuracy on Niche Domains
LLMs are "broadly capable" — but rarely state‑of‑the‑art in specialty tasks. Specialization matters.
The Orchestration Pattern (LLMs as the Final Layer)
The real magic happens when you combine:
- Detection → what's in the image
- Classification → specialized understanding
- Embeddings → vector search + memory
- LLMs → reasoning based on structured data
LLMs become much more accurate when fed structured inputs like:
{
  "animals": [
    {
      "type": "dog",
      "breed": "Golden Retriever",
      "confidence": 0.91
    }
  ],
  "scene": "park",
  "has_person": false,
  "brightness": 0.78
}
A payload like this is far easier for an LLM to reason over than raw pixel data.
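To illustrate, here is a sketch of how that payload might be assembled from the pipeline outputs and dropped into a prompt; the field names mirror the example above, and the prompt wording is just a placeholder:

import json

def build_llm_context(detections, scene_info, brightness):
    """Collapse the pipeline outputs into the compact JSON structure shown above."""
    payload = {
        "animals": [
            {"type": d["type"], "breed": d.get("breed"), "confidence": round(d.get("confidence", 0.0), 2)}
            for d in detections if d["type"] != "person"
        ],
        "scene": scene_info["scene"],
        "has_person": any(d["type"] == "person" for d in detections),
        "brightness": round(brightness, 2),
    }
    return json.dumps(payload, indent=2)

# The serialized payload then goes into whatever LLM prompt you like, e.g.:
# prompt = "Write alt text for this photo.\nStructured analysis:\n" + build_llm_context(detections, scene_info, 0.78)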
This stack is not "LLM vs ML models." It's LLMs + ML models working together.
Real-World Example: My Animal-Focused Pipeline
At the end of a single image analysis run, I store:
- full-image embedding
- multiple crop embeddings
- detection metadata
- refined classifications
- scene data
- coral/wildlife notes
- brightness/orientation
Everything goes into Qdrant, enabling fast similarity search, retrieval-augmented generation, and filtering ("show me bright images of dogs in parks").
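As a sketch of what that storage and filtering step can look like with the qdrant-client library; the collection name, payload fields, and filter values are illustrative, not my exact schema:

import uuid
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, FieldCondition, Filter, MatchValue, PointStruct, VectorParams

client = QdrantClient(url="http://localhost:6333")  # or your hosted Qdrant endpoint
COLLECTION = "image_embeddings"

# 512 dimensions to match the CLIP vectors; cosine distance for similarity search.
# Note: recreate_collection drops any existing data; in production you would create the collection once.
client.recreate_collection(
    collection_name=COLLECTION,
    vectors_config=VectorParams(size=512, distance=Distance.COSINE),
)

def store_image(embedding, metadata):
    """Upsert one embedding plus its structured metadata as a single point."""
    client.upsert(
        collection_name=COLLECTION,
        points=[PointStruct(id=str(uuid.uuid4()), vector=embedding, payload=metadata)],
    )

def find_similar_in_parks(query_embedding, limit=10):
    """'Show me images like this one, but only ones whose scene is a park.'"""
    return client.search(
        collection_name=COLLECTION,
        query_vector=query_embedding,
        query_filter=Filter(must=[FieldCondition(key="scene", match=MatchValue(value="park"))]),
        limit=limit,
    )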
This architecture is general-purpose: animals today; people, plants, and products tomorrow.
Why This Matters for Your AI Strategy
If you are building vision systems:
- Don't rely on LLMs alone.
- Don't rely on single-model pipelines.
- Don't assume a bigger model solves every problem.
Instead:
Use the right tool for the job — orchestrate many small models with one large one.
This is the future of practical AI systems.
Coming Next in This Series
This blog is the first in a multi-part series:
- Why We Still Need Traditional Classification Models in the Age of LLMs (this post)
- How to Build a Multi-Model Vision Pipeline
- Building "Animal vs. People" Classification Systems
- How to Store Embeddings in Vector Databases (Qdrant Edition)
- How to Combine LLMs with Structured Vision Data (RAG for Images)