When you glance at a 3D object like a chair, you do not think about it. Your visual cortex processes the scene in milliseconds, segmenting the chair from its surroundings, inferring its depth, and recognising it from an angle at which you have probably never seen that exact chair before. You do all of this despite the lighting being different from yesterday, the chair being partially obscured by a table, and your eye receiving only a flat 2D projection of a 3D object.
This is the fundamental problem computer vision is trying to solve, and it is harder than it sounds.
A camera collapses three dimensions into two. Every photograph is a projection - depth information is discarded. A sphere and a flat circle painted to look like a sphere produce identical images from the right viewpoint. This is called the projection problem and it is one of the core reasons computer vision is difficult.
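The loss of depth can be made concrete with an idealised pinhole camera. The sketch below is a minimal illustration, not a full camera model: the `project` helper and the two sample points are hypothetical, and lens distortion, the principal point and pixel units are all ignored.

```python
def project(point, focal_length=1.0):
    """Project a 3D point (x, y, z) onto the image plane of an ideal pinhole camera."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the camera (z > 0)")
    # Perspective division: depth z only survives as a scale factor.
    return (focal_length * x / z, focal_length * y / z)


near = (1.0, 0.5, 2.0)
far = (2.0, 1.0, 4.0)  # same viewing ray, twice as far from the camera

print(project(near))  # (0.5, 0.25)
print(project(far))   # (0.5, 0.25)
```

Both points land on the same image coordinate, so nothing in that single pixel says which of the two the camera was looking at.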
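Human vision sidesteps this through a combination of mechanisms that took millions of years to evolve, such as the disparity between our two eyes, motion parallax, and learned cues like perspective and shading.
We do not compute depth explicitly - we infer it from dozens of overlapping signals simultaneously, most of which we are not conscious of.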
A monocular camera has none of this by default. It has pixels. Every computer vision system has to reconstruct meaning from those pixels and the challenge is that the same pixel values can correspond to wildly different real-world scenes depending on lighting, distance, occlusion and viewpoint.
Illumination is particularly treacherous. A white wall in shadow can reflect less light than a black wall in direct sunlight, so raw pixel intensity says little about the surface itself. A naive pixel-matching system may conclude that two different surfaces are the same, or fail to match the same surface across two photos taken an hour apart. Human vision compensates for this automatically, an effect known as “colour constancy”: we perceive a white piece of paper as white whether it is under fluorescent office lighting or a sunset. Computer vision systems have to learn this invariance explicitly.
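A rough piece of arithmetic shows how large the effect can be. The sketch below assumes a very simple model in which reflected light is reflectance multiplied by illumination; the reflectance and lux figures are illustrative round numbers chosen for the example, not measurements.

```python
def reflected_light(reflectance, illuminance):
    """Simplified model: light leaving a matte surface = reflectance * incoming light."""
    return reflectance * illuminance


# Illustrative assumptions: white paint reflects ~90% of light, black paint ~5%;
# open shade is taken as ~1,000 lux and direct sunlight as ~100,000 lux.
white_in_shadow = reflected_light(0.90, 1_000)
black_in_sun = reflected_light(0.05, 100_000)

print(white_in_shadow < black_in_sun)  # True: the black wall is the brighter one

# If the illumination were known, dividing it out would recover the surface property.
recovered_white = white_in_shadow / 1_000
recovered_black = black_in_sun / 100_000
```

Under these assumed numbers the black wall sends several times more light to the camera than the white one, which is why intensity alone cannot identify a surface.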
Occlusion is another failure mode. If half a chair is hidden behind a table, a human instantly understands there is a whole chair there. We complete the occluded region from context and prior knowledge. A model that has not seen enough partial chairs during training may classify the visible portion as something else entirely.
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
PyTorch’s torchvision library handles most of the boilerplate for standard vision tasks - dataset loading, common transforms, and pretrained model weights. The transforms pipeline is where you encode your assumptions about invariance. If you horizontally flip training images, you are telling the model that left-right orientation should not affect the label.
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])
The normalisation values here are the ImageNet channel means and standard deviations. If you are using a pretrained backbone, this matters - the model was trained on inputs normalised this way, and feeding it raw [0, 1] tensors will degrade performance.
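To see what the normalisation does to the value range, the per-channel formula `(value - mean) / std` can be applied by hand. This is a dependency-free sketch of the arithmetic only; the `normalise` helper is a hypothetical name, not part of torchvision.

```python
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]


def normalise(pixel):
    """Apply (value - mean) / std to one RGB pixel with values in [0, 1]."""
    return [(v - m) / s for v, m, s in zip(pixel, IMAGENET_MEAN, IMAGENET_STD)]


black = normalise([0.0, 0.0, 0.0])
white = normalise([1.0, 1.0, 1.0])

print([round(v, 2) for v in black])  # [-2.12, -2.04, -1.8]
print([round(v, 2) for v in white])  # [2.25, 2.43, 2.64]
```

A pretrained network therefore expects values spanning roughly -2 to 2.6, centred near zero, rather than the 0 to 1 range that comes out of `ToTensor()`.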
The core building block of most vision models is the convolutional layer. Rather than connecting every pixel to every neuron (which would be computationally catastrophic for a 224x224 image), convolutions apply small learned filters across the spatial dimensions of the image.
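The scale of the difference is easy to work out. The comparison below is a back-of-the-envelope sketch: the dense layer with 1,000 hidden units is a hypothetical baseline chosen for illustration, and the convolution is a 3x3 layer with 32 output channels.

```python
height, width, channels = 224, 224, 3
inputs = height * width * channels          # 150,528 input values

# Hypothetical fully connected layer: every input wired to each of 1,000 units.
hidden_units = 1000
dense_params = inputs * hidden_units + hidden_units   # weights + biases

# 3x3 convolution from 3 input channels to 32 output channels.
kernel = 3
out_channels = 32
conv_params = kernel * kernel * channels * out_channels + out_channels

print(dense_params)  # 150529000
print(conv_params)   # 896
```

The convolution's parameter count does not depend on the image size at all, because the same small filters are reused at every position.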
A filter is a small matrix - typically 3x3 or 5x5 - that slides across the image and computes a dot product at each position. A 3x3 filter trained to detect vertical edges will produce a high activation wherever it encounters a vertical edge in the input, regardless of where in the image that edge appears. This property is called translation equivariance, and it is what makes CNNs efficient for vision: the same feature detector works everywhere in the image.
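The sliding dot product and the equivariance property can be sketched without any framework. The code below is illustrative only: the filter is hand-written rather than learned, the helper names are hypothetical, and no padding is used, so the output is slightly smaller than the input.

```python
def correlate2d(image, kernel):
    """Slide `kernel` over `image` (no padding) and take a dot product at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Hand-written filter that responds to dark-to-bright transitions from left to right.
VERTICAL_EDGE = [[-1, 0, 1],
                 [-1, 0, 1],
                 [-1, 0, 1]]


def image_with_edge(column, height=5, width=8):
    """Image that is dark (0) left of `column` and bright (1) from `column` onwards."""
    return [[1 if x >= column else 0 for x in range(width)] for _ in range(height)]


resp_a = correlate2d(image_with_edge(3), VERTICAL_EDGE)
resp_b = correlate2d(image_with_edge(5), VERTICAL_EDGE)

print(resp_a[0])  # [0, 3, 3, 0, 0, 0]
print(resp_b[0])  # [0, 0, 0, 3, 3, 0]
```

Moving the edge two pixels to the right moves the response two pixels to the right and changes nothing else, which is what equivariance means here.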
class ConvBlock(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=3, padding=1):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size, padding=padding)
        self.bn = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.conv(x)))
Batch normalisation after the convolution stabilises training by normalising each channel's activations using statistics computed over the batch. Without it, deep networks tend to suffer from internal covariate shift: the distribution of activations at each layer keeps changing as the weights update, making it hard for subsequent layers to learn stable representations.
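The normalisation step itself is short. The sketch below applies it to the values of a single channel in plain Python; it assumes training-mode behaviour (statistics taken from the current batch) and leaves out the running averages that are kept for inference. The `batch_norm` helper and the sample activations are hypothetical.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise one channel's activations to zero mean and unit variance,
    then apply the learnable scale (gamma) and shift (beta)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)  # biased variance
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


activations = [2.0, 4.0, 6.0, 8.0]
normalised = batch_norm(activations)

print([round(v, 3) for v in normalised])  # [-1.342, -0.447, 0.447, 1.342]
```

Whatever scale the incoming activations had, the next layer sees values with mean zero and variance close to one, unless `gamma` and `beta` learn otherwise.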
class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            ConvBlock(3, 32),
            nn.MaxPool2d(2, 2),
            ConvBlock(32, 64),
            nn.MaxPool2d(2, 2),
            ConvBlock(64, 128),
            nn.MaxPool2d(2, 2),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d((1, 1)),
            nn.Flatten(),
            nn.Linear(128, 256),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(256, num_classes)
        )

    def forward(self, x):
        return self.classifier(self.features(x))
Each MaxPool2d halves the spatial dimensions, while each ConvBlock increases the number of channels. By the time the tensor reaches the classifier head, the spatial information has been compressed into increasingly abstract feature maps: early layers respond to edges and textures, deeper layers to shapes and object parts. AdaptiveAvgPool2d((1, 1)) collapses the remaining spatial dimensions to a single value per channel regardless of input size, making the classifier head resolution-agnostic.
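The shape bookkeeping can be traced by hand. The helper below is a hypothetical, framework-free sketch that mirrors the layer sequence of SimpleCNN: a 3x3 convolution with padding 1 keeps height and width, each 2x2 pool with stride 2 halves them (rounding down), and the adaptive pool always ends at 1x1.

```python
def trace_shapes(height, width):
    """Return the (channels, height, width) shape after each stage of SimpleCNN."""
    channels = 3
    shapes = [(channels, height, width)]
    for out_channels in (32, 64, 128):
        channels = out_channels                      # ConvBlock: only channels change
        height, width = height // 2, width // 2      # MaxPool2d(2, 2): halve H and W
        shapes.append((channels, height, width))
    shapes.append((channels, 1, 1))                  # AdaptiveAvgPool2d((1, 1))
    return shapes


print(trace_shapes(224, 224))
# [(3, 224, 224), (32, 112, 112), (64, 56, 56), (128, 28, 28), (128, 1, 1)]
print(trace_shapes(160, 96)[-1])
# (128, 1, 1)
```

Both input sizes end at 128 values per image, which is why the `nn.Linear(128, 256)` layer in the head works for either.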
Training a vision model from scratch requires a large dataset and significant compute. For most practical tasks, transfer learning is the better starting point. Take a model pretrained on ImageNet, freeze most of its weights and replace only the final classification head.
import torchvision.models as models

def build_transfer_model(num_classes, freeze_backbone=True):
    model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
    if freeze_backbone:
        # Stop gradients for every pretrained parameter
        for param in model.parameters():
            param.requires_grad = False
    # The new head is created after freezing, so its parameters stay trainable
    model.fc = nn.Sequential(
        nn.Linear(model.fc.in_features, 512),
        nn.ReLU(inplace=True),
        nn.Dropout(0.3),
        nn.Linear(512, num_classes)
    )
    return model
The intuition is that the early layers of any vision model trained on natural images learn general-purpose features - Gabor-like edge detectors, colour blobs, texture patterns - that are useful regardless of the downstream task. Only the later layers encode task-specific abstractions. Freezing the backbone and training only the new head is fast and often achieves strong results with a few hundred images per class.
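Part of the reason this is fast is how few parameters are actually being trained. The arithmetic below is a sketch that counts the weights and biases in the replacement head, assuming the 2048 input features that ResNet-50's final layer takes and a hypothetical five-class task.

```python
def linear_params(in_features, out_features):
    """Parameter count of a fully connected layer: weights plus biases."""
    return in_features * out_features + out_features


def head_params(num_classes, in_features=2048, hidden=512):
    """Trainable parameters in the Linear -> ReLU -> Dropout -> Linear head."""
    return linear_params(in_features, hidden) + linear_params(hidden, num_classes)


print(head_params(5))  # 1051653
```

That is about one million trainable parameters, against a frozen backbone that is more than twenty times larger, so each step updates only a small fraction of the network.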
def train(model, loader, optimiser, criterion, device):
    model.train()
    total_loss, correct = 0, 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimiser.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimiser.step()
        total_loss += loss.item()
        correct += (outputs.argmax(dim=1) == labels).sum().item()
    return total_loss / len(loader), correct / len(loader.dataset)

def evaluate(model, loader, criterion, device):
    model.eval()
    total_loss, correct = 0, 0
    with torch.no_grad():
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            total_loss += criterion(outputs, labels).item()
            correct += (outputs.argmax(dim=1) == labels).sum().item()
    return total_loss / len(loader), correct / len(loader.dataset)
model.eval() switches batch normalisation and dropout to inference mode. Dropout stops randomly zeroing activations and batch norm uses running statistics rather than batch statistics. Forgetting this call is a common source of inconsistent evaluation results.
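The dropout half of that switch can be sketched in a few lines. This is an illustrative plain-Python version of inverted dropout, in which surviving activations are scaled up during training so that nothing needs rescaling at inference; the `dropout` helper and its `rng` parameter are hypothetical, and the batch norm half is not covered.

```python
import random


def dropout(values, p=0.5, training=True, rng=None):
    """Inverted dropout: zero each value with probability p while training,
    scaling survivors by 1 / (1 - p); pass values through unchanged otherwise."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(values)
    rng = rng or random
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]


activations = [1.0, 2.0, 3.0, 4.0]
train_out = dropout(activations, p=0.5, training=True, rng=random.Random(0))
eval_out = dropout(activations, p=0.5, training=False)

print(eval_out)  # [1.0, 2.0, 3.0, 4.0] - deterministic in inference mode
```

In training mode every call produces a different random pattern of zeros, so evaluating without switching modes gives results that change from run to run.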
Classification tells you what is in an image. Detection tells you where. The output is no longer a class label but a set of bounding boxes, each with a class and a confidence score.
A standard approach is to use an anchor-based detector. A grid is overlaid on the feature map, and at each grid cell the model predicts offsets from a set of predefined anchor boxes of varying aspect ratios and scales. This reframes detection as a box-regression problem layered on top of classification.
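How predicted offsets turn an anchor into a box can be sketched as follows. The code assumes the centre-and-size parameterisation commonly used by R-CNN-style detectors (shifts scaled by the anchor size, log-space width and height); the helper names, scales and ratios are hypothetical, and real implementations add details such as clipping and per-coordinate weights.

```python
import math


def make_anchors(cx, cy, scales=(64, 128), ratios=(0.5, 1.0, 2.0)):
    """Anchors (cx, cy, w, h) centred on one grid cell; ratio = height / width."""
    anchors = []
    for scale in scales:
        for ratio in ratios:
            w = scale / math.sqrt(ratio)
            h = scale * math.sqrt(ratio)
            anchors.append((cx, cy, w, h))
    return anchors


def decode_box(anchor, offsets):
    """Apply predicted offsets (dx, dy, dw, dh) to an anchor and
    return the box as corner coordinates (x1, y1, x2, y2)."""
    ax, ay, aw, ah = anchor
    dx, dy, dw, dh = offsets
    cx = ax + dx * aw           # shift the centre, in units of anchor size
    cy = ay + dy * ah
    w = aw * math.exp(dw)       # rescale width and height in log space
    h = ah * math.exp(dh)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


anchors = make_anchors(100, 100)
unchanged = decode_box((100, 100, 50, 50), (0, 0, 0, 0))
print(len(anchors))  # 6
print(unchanged)     # (75.0, 75.0, 125.0, 125.0)
```

All-zero offsets reproduce the anchor itself, so the network only has to learn small corrections relative to a reasonable starting box.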
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.detection import FasterRCNN_ResNet50_FPN_Weights

model = fasterrcnn_resnet50_fpn(weights=FasterRCNN_ResNet50_FPN_Weights.DEFAULT)
model.eval()

def detect(image_tensor, threshold=0.5):
    # image_tensor: float tensor of shape [C, H, W] with values in [0, 1]
    with torch.no_grad():
        predictions = model([image_tensor])[0]
    # Keep only detections whose confidence exceeds the threshold
    mask = predictions['scores'] > threshold
    return {
        'boxes': predictions['boxes'][mask],
        'labels': predictions['labels'][mask],
        'scores': predictions['scores'][mask]
    }
The Faster R-CNN variant loaded here uses a ResNet-50 backbone with a Feature Pyramid Network (FPN), which maintains feature maps at multiple scales simultaneously. This is what allows it to detect both a person in the foreground and a distant car in the background of the same image: small objects are detected from high-resolution early feature maps, large objects from lower-resolution deep ones.
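One way to picture the size-to-level mapping is the heuristic proposed in the FPN paper, which assigns a box to pyramid level k = floor(k0 + log2(sqrt(w * h) / 224)). The sketch below implements that formula with the paper's constants (k0 = 4, levels 2 to 5); the `fpn_level` helper is hypothetical, and whether a given library uses exactly these values is an assumption to check against its documentation.

```python
import math


def fpn_level(box_w, box_h, k0=4, canonical=224, k_min=2, k_max=5):
    """Pick a pyramid level for a box: small boxes map to low (high-resolution)
    levels, large boxes to high (low-resolution) levels."""
    k = math.floor(k0 + math.log2(math.sqrt(box_w * box_h) / canonical))
    return max(k_min, min(k_max, k))


print(fpn_level(32, 32))    # 2 - small, distant object
print(fpn_level(112, 112))  # 3
print(fpn_level(224, 224))  # 4
print(fpn_level(500, 500))  # 5 - large foreground object
```

Lower level numbers correspond to the finer feature maps, so the small box ends up on the highest-resolution level and the large one on the coarsest.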
A well-trained detection model can confidently draw a bounding box around a stop sign even if the stop sign has been defaced with stickers. It has learned the shape and colour, not the meaning. It does not understand that a stop sign is a legal instruction. It cannot reason about whether the object is relevant to the current task.
This is where the analogy to human vision breaks down most sharply. We do not just see, we perceive. We attach significance, intention and context to what we observe. A model that achieves 95% accuracy on a benchmark has learned a very sophisticated pattern-matching function. It has not learned to see in the way that word is usually meant.