Object Detection from Scratch: Part 1 - Why This Project Is Worth Building

March 15, 2026 · 18 min read

Why a Magic: The Gathering card detector is a serious engineering project, not a toy demo. Part 1 frames the product problem, the architecture, and the technical journey ahead.


Most machine learning write-ups begin with the model. That is usually the wrong place to start.

The real question is not "how do I train YOLO?" The real question is: what user problem am I trying to solve, and what system must exist around the model for the answer to be useful?

In this project, the problem is surprisingly concrete. A user points a webcam at a Magic: The Gathering card and expects more than a bounding box. They want the card name, the oracle text, the price, and ideally the exact printing. That single interaction forces us to think beyond prediction accuracy and into pipeline design.

The Product Behind the Model

At a high level, the system needs to do four things well:

  1. Find the card and its meaningful regions.
  2. Read the most useful text from the image.
  3. Resolve that text into real card metadata.
  4. Disambiguate printings when text alone is not enough.

That already tells us something important. A single image classifier would not be enough.


The useful output is not a box. It is a structured answer assembled from multiple stages.

Why Object Detection Instead of Classification?

This repo makes the right architectural choice early: it treats the card as a structured document, not a single-label image.

The detector predicts seven classes:

  • card
  • art
  • title
  • description
  • tags
  • mana-cost
  • power

That matters because each region serves a different downstream purpose. The title box feeds OCR. The art box feeds DINOv2 matching. The card boundary helps stabilize the overall scene. The smaller regions provide explainability and future extension points.

```
detect(image) -> [
    {"class": "title", "bbox": [...], "confidence": 0.95},
    {"class": "art", "bbox": [...], "confidence": 0.97},
    {"class": "description", "bbox": [...], "confidence": 0.94},
]
```

If the model only returned "mtg_card", the rest of the product would still be unsolved.
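The routing that makes those classes useful can be sketched in a few lines. Everything here is illustrative: `run_ocr` and `match_art` are hypothetical stand-ins for the OCR and DINOv2 stages, and the detection format mirrors the snippet above:

```python
import numpy as np

def run_ocr(crop):
    """Stub for the OCR stage (hypothetical; the repo uses a real OCR engine)."""
    return "<title text>"

def match_art(crop):
    """Stub for DINOv2 art matching (hypothetical)."""
    return "<printing match>"

def route_detections(image, detections):
    """Crop each predicted region and send it to its downstream consumer."""
    results = {}
    for det in detections:
        x1, y1, x2, y2 = det["bbox"]
        crop = image[y1:y2, x1:x2]  # numpy-style H x W crop
        if det["class"] == "title":
            results["title_text"] = run_ocr(crop)   # feeds Scryfall lookup
        elif det["class"] == "art":
            results["art_match"] = match_art(crop)  # feeds printing disambiguation
        elif det["class"] == "card":
            results["card_bbox"] = det["bbox"]      # stabilizes the overall scene
    return results

image = np.zeros((600, 400, 3), dtype=np.uint8)
dets = [{"class": "title", "bbox": [30, 20, 370, 60], "confidence": 0.95}]
print(route_detections(image, dets))
```

A plain classifier offers no crops to route, which is exactly why the single-label design would leave the rest of the product unsolved.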

The Two Tracks of the System

One of the strongest qualities of the project is that the repo documents the full lifecycle instead of stopping at a notebook.


The offline track optimizes for learning. The online track optimizes for latency, reliability, and user trust. Mixing those concerns is how ML projects become messy.

Reading the Repo as a System

The project structure is unusually clear:

  • scripts/01_setup_dataset.py downloads and prepares the dataset.
  • scripts/03_train.py handles local training.
  • scripts/04_validate.py turns a trained model into metrics and plots.
  • scripts/09_identify_card.py and scripts/10_live_identify.py connect detection to OCR and lookup.
  • web/app.py exposes the pipeline through FastAPI.
  • web/services/ cleanly separates detection, OCR, Scryfall, and image matching concerns.

That is not just organization. It reveals design intent. Every stage is isolated enough to reason about independently, but practical enough to form one working pipeline.

A Model Is a Dependency, Not the Whole Product

This is the mental model I keep returning to when I read this repo:

  • The detector is a dependency.
  • OCR is a dependency.
  • Scryfall is a dependency.
  • DINOv2 is a dependency.
  • The product is the orchestration layer that makes those dependencies feel like one answer.

That shift matters because it changes how we evaluate success. A detector with excellent mAP can still produce a poor user experience if OCR fails on the wrong crop, or if printing disambiguation is weak, or if the web app cannot explain what it found.
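That failure chain is worth making concrete. Below is a hedged sketch of an orchestration layer that degrades gracefully when any dependency fails; all five callables are injected, and their names and signatures are hypothetical:

```python
def identify_card(image, detect, read_title, lookup, match_printing):
    """Orchestrate the pipeline, returning a partial answer when a stage fails.

    detect, read_title, lookup, match_printing are injected dependencies
    (detector, OCR, Scryfall client, DINOv2 matcher) with assumed signatures.
    """
    answer = {"name": None, "printing": None, "notes": []}

    regions = detect(image)
    title = next((r for r in regions if r["class"] == "title"), None)
    if title is None:
        answer["notes"].append("no title region detected")
        return answer

    text = read_title(title)
    if not text:
        answer["notes"].append("OCR produced no text")
        return answer

    card = lookup(text)
    if card is None:
        answer["notes"].append(f"no Scryfall match for {text!r}")
        return answer
    answer["name"] = card["name"]

    art = next((r for r in regions if r["class"] == "art"), None)
    answer["printing"] = match_printing(art, card) if art else None
    return answer
```

The `notes` field is the point: a high-mAP detector feeding a silent failure is a bad product, while an orchestrator that can say *why* it stopped preserves user trust.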

What Success Actually Looks Like

The README states the headline result clearly: roughly 96.7% mAP50 and 77.7% mAP50-95 at the best checkpoint. Those are strong numbers, but they are only meaningful because the pipeline turns them into usable card identification.
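For readers new to these metrics: mAP50 counts a prediction as correct when its box overlaps the ground truth with IoU of at least 0.5, while mAP50-95 averages over IoU thresholds from 0.5 to 0.95, so it rewards much tighter localization. A minimal IoU computation shows why the second number is always the harder one:

```python
def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) form."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# A 100x100 box shifted by 10px: still a hit at the 0.5 threshold,
# but a miss at the strictest thresholds in the 0.5-0.95 sweep.
print(iou((0, 0, 100, 100), (10, 10, 110, 110)))  # ≈ 0.68
```

That gap between thresholds is why the repo's 96.7% mAP50 and 77.7% mAP50-95 are consistent with each other: the detector finds the regions reliably, and localizes them tightly most, but not all, of the time.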

The most compelling success case in the repo is not "the model converged." It is this:

  • a photo goes in,
  • the detector isolates the relevant regions,
  • OCR extracts the title,
  • Scryfall resolves the card,
  • DINOv2 compares the art crop,
  • the user gets the exact printing.

That is a complete engineering story.

What This Series Will Cover

This series follows the project in the same order an engineer would need to understand it: from dataset preparation, through training and validation, to the live identification pipeline.

We will spend time on theory, but only when the implementation demands it. The goal is not to decorate the project with ML vocabulary. The goal is to explain why each design choice exists, what it costs, and where it can fail.

Conclusion

The best thing about this repo is that it treats machine learning like software engineering instead of theater. The model matters, but so do the file formats, scripts, APIs, metrics, failure modes, and product boundaries.

That is why this project is worth writing about in depth. It turns a common ML fantasy into a concrete system you can inspect, reason about, and extend.


Object Detection from Scratch, Part 1 of 2

Arthur Costa

Senior Full-Stack Engineer & Tech Lead

Senior Full-Stack Engineer with 8+ years in React, TypeScript, and Node.js. Expert in performance optimization and leading engineering teams.
