How to Build a Multi-Model Vision Pipeline

The engineering reality of building a production-grade vision system that combines the best of both worlds.

Following up on my previous post, "Why LLMs Still Need Traditional Classifiers", let's dive into the engineering reality of building a production-grade vision system that combines the best of both worlds.

In the age of GPT and Gemini, it's tempting to send every image to a massive multimodal LLM and ask, "What's in this picture?"

But when you're building a system like our pet detector—which needs to process thousands of images, identify specific dog breeds, check for coral reef health, and make everything searchable by semantic meaning—a single giant model is often too slow, too expensive, or surprisingly imprecise at specific tasks.

Instead, we built a Multi-Model Vision Pipeline. It uses a "router" model to detect objects, specialized "expert" models to classify them, and a "generalist" model to understand the vibe.

Here is how we architected it in our actual vision-stack.


The Architecture: Event-Driven & Async

We don't process images synchronously in the API request. That’s a recipe for timeouts. Instead, we use a classic async worker pattern:

  1. Ingestion API: Accepts a lightweight request (Image URL + ID), validates it, and pushes a job to an SQS Queue (sketched just after this list).
  2. Vision Worker: A GPU-accelerated Lambda/Container that pulls jobs, downloads images, and runs the heavy lifting.
  3. Vector Store (Qdrant): Stores the resulting mathematical representations (embeddings) for search.
  4. Callback: Notifies the main application when processing is done.
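
To make step 1 concrete, here is a minimal sketch of the ingestion hand-off, assuming FastAPI and boto3. The route path, the IngestRequest model, and the VISION_QUEUE_URL variable are illustrative names rather than our actual API surface:

import json
import os
import uuid

import boto3
from fastapi import FastAPI
from pydantic import BaseModel, HttpUrl

app = FastAPI()
sqs = boto3.client("sqs")
QUEUE_URL = os.environ["VISION_QUEUE_URL"]  # assumed config, set per environment

class IngestRequest(BaseModel):
    image_id: str
    image_url: HttpUrl

@app.post("/v1/vision/jobs", status_code=202)
def enqueue_job(req: IngestRequest):
    # Validate the lightweight payload, then hand the job to SQS.
    # No image bytes ever touch the API process.
    job = {
        "job_id": str(uuid.uuid4()),
        "image_id": req.image_id,
        "image_url": str(req.image_url),
    }
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(job))
    return {"job_id": job["job_id"], "status": "queued"}

The Vision Worker long-polls the same queue on the other side, so the API can respond immediately no matter how backed up the GPU side is.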

The Brains: A Committee of Experts

The core logic isn't one model; it's a workflow. We use different models for what they are best at.

1. The Traffic Controller: YOLOv8

We start with YOLOv8 (You Only Look Once). It is incredibly fast and efficient at Object Detection. It doesn't know what kind of dog it is, but it knows exactly where the dog is, in pixel coordinates.

In services/vision_api/worker.py, we detect entities first:

from typing import Any, Dict, List

import numpy as np
from PIL import Image as PILImage

def _detect_entities(img: PILImage.Image) -> List[Dict[str, Any]]:
    # Load YOLOv8m (Medium)
    model = _ensure_yolo_loaded()
    results = model(np.asarray(img))[0]

    entities = []
    for box in results.boxes:
        # Extract coordinates, confidence, and class name
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        conf = float(box.conf[0])
        cls_name = results.names[int(box.cls[0])]
        entities.append({
            "type": cls_name,  # e.g., "dog", "person"
            "box": [x1, y1, x2, y2],
            "confidence": conf
        })
    return entities
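
Downstream code only ever sees this list of dicts. For a typical beach photo, the output might look something like this (values invented purely for illustration):

entities = _detect_entities(img)
# [
#   {"type": "dog",    "box": [412.0, 233.5, 780.2, 641.9], "confidence": 0.91},
#   {"type": "person", "box": [80.3, 120.1, 310.7, 655.0],  "confidence": 0.88},
# ]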

2. The Specialists: Breed & Species Classifiers

Once YOLO tells us "There is a dog at [x,y]", we crop that specific region of the image and pass it to a specialist.

We don't ask the Dog Classifier to look at a sunset; we only show it the dog. This cascading logic drastically improves accuracy.

# From _process_job in worker.py

# 1. Get the crop based on YOLO coordinates
crop = img.crop((ix1, iy1, ix2, iy2))

# 2. Route to the correct specialist
if ent_type == "dog":
    # Use a SigLIP model fine-tuned on 120 dog breeds
    breed_info = _classify_dog_breed(crop)
    if breed_info:
        ent["breed"] = breed_info["breed"]

elif ent_type == "cat":
    # Use a ResNet model fine-tuned on cat breeds
    breed_info = _classify_cat_breed(crop)
    
elif ent_type in wildlife_classes:
    # Use an iNaturalist ViT model for wild animals
    species_info = _classify_wildlife_species(crop)
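
Each specialist is just a narrow classifier behind a small helper. As a rough sketch, _classify_dog_breed could be wired up like this using a Hugging Face image-classification pipeline; the model id is a placeholder, not the checkpoint we actually ship:

from functools import lru_cache
from typing import Any, Dict, Optional

from PIL import Image
from transformers import pipeline

# Placeholder model id; swap in your own fine-tuned breed checkpoint.
DOG_BREED_MODEL = "your-org/siglip-dog-breeds-120"

@lru_cache(maxsize=1)
def _ensure_breed_classifier():
    # Loaded lazily so the worker only pays the cost when a dog actually shows up.
    return pipeline("image-classification", model=DOG_BREED_MODEL)

def _classify_dog_breed(crop: Image.Image, min_conf: float = 0.4) -> Optional[Dict[str, Any]]:
    clf = _ensure_breed_classifier()
    top = clf(crop, top_k=1)[0]  # e.g. {"label": "golden_retriever", "score": 0.98}
    if top["score"] < min_conf:
        return None  # too uncertain; the caller keeps the generic "dog" label
    return {"breed": top["label"], "confidence": float(top["score"])}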

We even have a specialized Coral Health model. If the detection logic (or metadata) suggests an underwater scene, we run a classifier specifically trained to spot coral bleaching.

3. The Generalist: OpenCLIP

While classifiers give us discrete labels ("Golden Retriever", "Bleached Coral"), they miss the nuance. A classifier can't tell you if a dog looks "happy" or if the lighting is "moody."

For this, we use OpenCLIP (ViT-B/32). It converts the image into a 512-dimensional vector. This allows us to perform semantic search later (e.g., "Find me photos of dogs running on the beach") even if we never explicitly trained a "running on beach" classifier.

We compute embeddings for both the entire image and the individual crops:

# Image-level embedding
image_embedding = _compute_clip_embedding(img)
image_store.add(photo_id=str(image_id), vector=image_embedding, ...)

# Crop-level embedding (allows searching specifically for the animal, ignoring the background)
crop_embedding = _compute_clip_embedding(crop)
animal_store.add(photo_id=crop_id, vector=crop_embedding, ...)
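
For reference, _compute_clip_embedding can be as small as the sketch below, assuming the open_clip package; the laion2b_s34b_b79k tag is one of the public LAION checkpoints and stands in for whatever weights you actually deploy:

from functools import lru_cache
from typing import List

import open_clip
import torch
from PIL import Image

@lru_cache(maxsize=1)
def _ensure_clip_loaded():
    # ViT-B/32 produces 512-dimensional image embeddings
    model, _, preprocess = open_clip.create_model_and_transforms(
        "ViT-B-32", pretrained="laion2b_s34b_b79k"
    )
    model.eval()
    return model, preprocess

def _compute_clip_embedding(img: Image.Image) -> List[float]:
    model, preprocess = _ensure_clip_loaded()
    with torch.no_grad():
        batch = preprocess(img).unsqueeze(0)   # shape (1, 3, 224, 224)
        features = model.encode_image(batch)   # shape (1, 512)
    # Normalization happens downstream, inside the VectorStore wrapper
    return features[0].tolist()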

Storage: The Vector Store

We use Qdrant to store these embeddings. This essentially turns our images into a database we can query by meaning.

The VectorStore class wraps the complexity of Qdrant, handling the normalization and upsert logic. Importantly, we store the metadata (Breed, Confidence, Source URL) alongside the vector. This allows hybrid filtering: "Show me pictures that look like this (Vector Similarity) BUT only if it is a Golden Retriever (Metadata Filter)."
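
As an illustration, a hybrid query of that shape looks roughly like this with qdrant-client; the collection name and the "breed" payload key are assumptions about the schema rather than our exact field names:

from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue

client = QdrantClient(url="http://localhost:6333")  # assumed local deployment

def find_similar_golden_retrievers(query_vector: list[float], limit: int = 10):
    # Vector similarity, constrained by a metadata filter on the stored payload
    return client.search(
        collection_name="animal_crops",  # assumed collection name
        query_vector=query_vector,
        query_filter=Filter(
            must=[FieldCondition(key="breed", match=MatchValue(value="Golden Retriever"))]
        ),
        limit=limit,
    )

Because the breed, confidence, and source URL ride along as payload, the filter is applied inside Qdrant itself rather than in application code.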


Why This Wins

By chaining models, we get:

  1. Modularity: We can swap out the Dog Classifier for a better one without retraining the Object Detector.
  2. Performance: We don't run the heavy Wildlife classifier on pictures of kitchens.
  3. Explainability: We know why the system thinks it's a Poodle (because the Poodle classifier said so with 98% confidence on the crop), rather than an LLM hallucinating a detail in the background.

Coming Next in This Series

This post is the second in a multi-part series:

  1. Why LLMs Still Need Traditional Classifiers
  2. How to Build a Multi-Model Vision Pipeline (this post)
  3. Building "Animal vs. People" Classification Systems
  4. How to Store Embeddings in Vector Databases (Qdrant Edition)
  5. How to Combine LLMs with Structured Vision Data (RAG for Images)