Teaching Machines to See Like Us

How AI learns to understand depth without anyone telling it the answers


Close your eyes, then open them and look around. Without thinking about it, you instantly know that lamp is closer than the wall behind it, the floor is flat and extends away from you, and that coffee mug is cylindrical even though you can only see one side.

Now here's the weird part: your eyes only give you flat, 2D images. The back of your eyeball is basically a curved screen, like a camera sensor. No depth information reaches it directly - just colors and brightness at different points. Yet somehow, you perceive a full 3D world.

Human Vision

Think about looking at a photograph. Mathematically speaking, that flat image could represent an infinite number of different 3D scenes. A small toy car close to the camera creates the exact same image as a real car far away.

[Image from Sinha and Adelson, 1993]

When a 3D point P = (X, Y, Z) gets projected onto a 2D image, it lands at pixel coordinates (x, y) through perspective projection:

x = f · X/Z

y = f · Y/Z

where f is the camera's focal length. Notice what happens - only the ratios X/Z and Y/Z make it into the image; Z itself is never recorded. Scale X, Y, and Z by the same factor and (x, y) does not change at all. Given just (x, y), there are infinitely many (X, Y, Z) points, all lying along one ray through the camera, that could have created it.
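The toy-car ambiguity is easy to check numerically. Here is a minimal sketch of the projection equations above; the focal length and the two 3D points are made-up values chosen for illustration.

```python
def project(point, f):
    """Perspective projection of a 3D point (X, Y, Z) to image coordinates (x, y)."""
    X, Y, Z = point
    if Z <= 0:
        raise ValueError("point must be in front of the camera (Z > 0)")
    return (f * X / Z, f * Y / Z)


f = 500.0  # focal length in pixels (illustrative value)

toy_car = (0.25, 0.125, 0.5)   # small offsets, close to the camera
real_car = (5.0, 2.5, 10.0)    # everything 20x larger, 20x farther away

print(project(toy_car, f))
print(project(real_car, f))
```

Both points land on the same image location, so nothing in (x, y) alone can tell them apart.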

One dimension is lost in the projection, so it has to be reconstructed from other cues. This article walks through one way of doing that.

So why aren't you constantly confused about depth? Because you've been learning about 3D space your entire life. As a baby, you moved around, saw the same objects from different angles, reached for things and learned what "close" and "far" actually meant. Over thousands of hours, your brain built up an intuition - when I see this 2D pattern, it usually means that 3D arrangement.

You learned to reverse-engineer reality.

Can Machines Learn the Same Way?

For decades, computer vision researchers tried to teach computers depth estimation. The early successes after deep learning took off around 2012 were impressive, but they all needed someone to tell them the answers - laser scanners measuring exact distances, humans labeling depth in images, expensive equipment visiting every location.

Imagine trying to teach a child about depth by handing them a ruler and saying "measure everything before you look at it." That's not how humans learn. We learn by moving through the world and observing what happens.

The Setup: Just Watch Videos

Researchers asked a deceptively simple question: what if we just showed the AI videos of the world, like a human infant sees, and let it figure out depth on its own?

Videos capture the same physical 3D world from slightly different viewpoints as the camera moves, moments apart in time. If you knew the 3D shape of everything and how the camera moved, you could mathematically predict what the next frame should look like.

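Here's the math: if a pixel p in the current frame has depth D, and the camera moves with transformation T (rotation + translation), the 3D point behind that pixel appears at a new location p' in the other frame:

p' ~ K · T · D · K⁻¹ · p

where K represents the camera's intrinsic parameters and p is written in homogeneous coordinates (x, y, 1). The ~ means "equal up to scale": divide by the last component to get the actual pixel location.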
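Read right to left, the formula is a short pipeline: K⁻¹ · p turns the pixel into a viewing ray, multiplying by D picks the 3D point on that ray, T moves that point into the other camera's frame, and K projects it back to pixels. The sketch below spells this out under some assumptions of mine: K is a simple pinhole model given as (fx, fy, cx, cy) with no skew, T is split into a rotation R and a translation t that map points from the first camera's frame to the second's, and all numbers are invented.

```python
def reproject(pixel, depth, K, R, t):
    """Map a pixel with known depth from one view into another.

    K = (fx, fy, cx, cy); R is a 3x3 rotation (list of rows); t is a 3-vector.
    Computes K (R (D K^-1 p) + t) and divides by the resulting depth.
    """
    fx, fy, cx, cy = K
    u, v = pixel
    # D * K^-1 * p: back-project the pixel to a 3D point in the first camera's frame
    P = [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]
    # T: rigid transform into the second camera's frame
    Q = [sum(R[i][j] * P[j] for j in range(3)) + t[i] for i in range(3)]
    if Q[2] <= 0:
        raise ValueError("point falls behind the second camera")
    # K, then the homogeneous divide
    return (fx * Q[0] / Q[2] + cx, fy * Q[1] / Q[2] + cy)


K = (500.0, 500.0, 320.0, 240.0)
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # no rotation
t = [-0.5, 0.0, 0.0]  # second camera sits 0.5 units to the right, so points shift left

near = reproject((320.0, 240.0), 2.0, K, I3, t)
far = reproject((320.0, 240.0), 20.0, K, I3, t)
print(near, far)
```

The same pixel moves much further when its depth is 2 than when it is 20. That dependence of image motion on depth is exactly what lets video carry a depth signal.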

The insight: flip this around and use prediction accuracy as a way to learn depth. If your depth or camera movement guess is wrong, you won't predict the next frame correctly. But if both are right, the prediction should match reality.

The Architecture: Two Networks Working Together

The system consists of two convolutional neural networks that are trained jointly but serve different purposes.

[Image from Tinghui Zhou et al., 2017]

The Depth Network is an encoder-decoder architecture. The encoder progressively downsamples the input image through convolutional layers, extracting increasingly abstract features. The decoder then upsamples these features back to the original resolution, predicting a depth value for every pixel. Crucially, it predicts depth at multiple scales - producing outputs at 1/2, 1/4, 1/8, and full resolution. This multi-scale prediction isn't just for speed - it's essential for learning. Coarse scales can make large corrections when predictions are initially way off, while fine scales refine the details.
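One way to picture the multi-scale setup: every depth output needs an image at the matching resolution to be compared against. The sketch below builds such a pyramid by averaging 2x2 blocks. The averaging scheme and the function names are my assumptions for illustration; a real implementation may well resize differently.

```python
def downsample_2x(image):
    """Halve the resolution by averaging each 2x2 block.

    image is a list of rows; height and width are assumed to be even.
    """
    h, w = len(image), len(image[0])
    return [
        [(image[r][c] + image[r][c + 1] + image[r + 1][c] + image[r + 1][c + 1]) / 4.0
         for c in range(0, w - 1, 2)]
        for r in range(0, h - 1, 2)
    ]


def build_pyramid(image, levels=4):
    """Return the image at full, 1/2, 1/4 and 1/8 resolution (for levels=4)."""
    pyramid = [image]
    for _ in range(levels - 1):
        pyramid.append(downsample_2x(pyramid[-1]))
    return pyramid


image = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
pyramid = build_pyramid(image)
sizes = [(len(level), len(level[0])) for level in pyramid]
print(sizes)
```

At the coarsest level a single pixel summarizes an 8x8 patch of the original, which is why a one-pixel correction there corresponds to a large move at full resolution.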

The Pose Network takes a different approach. It receives a stack of consecutive frames (typically the target frame plus several neighbors) and processes them through convolutional layers followed by fully connected layers. Its job is to output six numbers representing the camera's ego-motion: three for rotation (roll, pitch, yaw) and three for translation (x, y, z movement). This is a much more compact output than the depth network's pixel-wise predictions.
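Those six numbers are just a compact encoding of the transformation T. As a rough sketch, here is one way to unpack them into a 4x4 matrix. The ordering (tx, ty, tz, rx, ry, rz), radians as the unit, and the composition order R = Rz · Ry · Rx are all assumptions on my part; conventions differ between codebases.

```python
import math


def pose_to_matrix(pose):
    """Convert (tx, ty, tz, rx, ry, rz) into a 4x4 rigid transform (list of rows).

    Assumption: rx, ry, rz are rotations in radians about the x, y and z axes,
    composed as R = Rz @ Ry @ Rx.
    """
    tx, ty, tz, rx, ry, rz = pose
    cx, sx = math.cos(rx), math.sin(rx)
    cy, sy = math.cos(ry), math.sin(ry)
    cz, sz = math.cos(rz), math.sin(rz)
    # Closed form of Rz @ Ry @ Rx
    R = [
        [cz * cy, cz * sy * sx - sz * cx, cz * sy * cx + sz * sx],
        [sz * cy, sz * sy * sx + cz * cx, sz * sy * cx - cz * sx],
        [-sy, cy * sx, cy * cx],
    ]
    return [R[0] + [tx], R[1] + [ty], R[2] + [tz], [0.0, 0.0, 0.0, 1.0]]


# A quarter turn about the z axis plus a small translation
T = pose_to_matrix((1.0, 2.0, 3.0, 0.0, 0.0, math.pi / 2))
for row in T:
    print([round(value, 3) for value in row])
```

Six numbers in, sixteen out, but only six degrees of freedom either way. That is why the Pose Network's output can stay so small.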

Here's what makes the architecture elegant: the depth network learns to be a universal depth predictor - given any single image, estimate its depth. The pose network learns to be a motion estimator - given any sequence of frames, figure out how the camera moved. Neither network ever sees ground truth for what it's trying to predict. They only get feedback on how well they work together to explain the video.

Training

During training, the networks play a game. Take a target frame and predict its depth map using the Depth Network. Take that frame plus nearby frames and predict camera motion using the Pose Network. Now use geometry to project each pixel of the target frame into a neighboring frame, and sample the neighbor's colors at those locations to synthesize what the target frame should look like. Compare your synthesized image with the actual target frame.

The loss function is beautifully simple - just the difference in pixel colors:

Photometric Loss, L = Σ |I_source(p') - I_target(p)|

When the prediction matches reality, both networks learn they're on track. When it doesn't, the gradients flow backward through the geometric warping operation into both networks, adjusting their weights.
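To see why this loss rewards correct depth, here is a deliberately tiny version of the game: a one-dimensional "image", a camera that slides sideways, and a scene that is a flat wall at a single depth. In that setup every pixel shifts by f · tx / D. The texture, the numbers, and the use of a mean instead of a sum are my simplifications, not part of the original method.

```python
import math


def sample_1d(row, x):
    """Linear interpolation into a 1D row of intensities, clamped at the borders."""
    x = min(max(x, 0.0), len(row) - 1.0)
    x0 = int(math.floor(x))
    x1 = min(x0 + 1, len(row) - 1)
    w = x - x0
    return (1 - w) * row[x0] + w * row[x1]


def photometric_loss(target, source, depth, f, tx):
    """Mean L1 difference between the target row and the source row sampled at p'.

    Toy setup: a flat wall at one guessed depth and sideways motion tx,
    so p' = p + f * tx / depth for every pixel.
    """
    shift = f * tx / depth
    errors = [abs(sample_1d(source, p + shift) - target[p]) for p in range(len(target))]
    return sum(errors) / len(errors)


f, tx, true_depth = 8.0, 1.0, 4.0
source = [math.sin(0.7 * i) for i in range(64)]  # arbitrary textured row
target = [sample_1d(source, p + f * tx / true_depth) for p in range(64)]

for guess in (2.0, 4.0, 8.0):
    print(guess, photometric_loss(target, source, guess, f, tx))
```

Only the guess that equals the true depth drives the loss to zero; guesses that are too near or too far both leave a visible mismatch.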

No one tells them the right depth. No one tells them the right camera movement. They figure it out by trying to understand what they see, just like you did as a child.

The Beautiful Part

Once training is done, you can throw away the Pose Network entirely. The Depth Network has learned to estimate depth from a single image, even though it trained on video sequences.

This closely mirrors human development - movement helped you learn about depth, but now you can perceive 3D structure from a still photograph. The motion was training wheels, not a permanent requirement.

Making It Work

One challenge: when you project a pixel to a new location, you get fractional coordinates like (127.3, 45.7). The solution is bilinear interpolation - sampling from the four surrounding pixels, weighted by distance. This makes the warping operation differentiable, allowing gradients to flow smoothly during training.
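A minimal sketch of bilinear interpolation on a grayscale image stored as a list of rows. Clamping out-of-range coordinates to the border is my choice here; other policies (such as ignoring those pixels) are just as plausible.

```python
import math


def bilinear_sample(image, x, y):
    """Sample a grayscale image (list of rows) at fractional (x, y).

    x indexes columns and y indexes rows. Coordinates outside the image
    are clamped to the border.
    """
    h, w = len(image), len(image[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    wx, wy = x - x0, y - y0  # fractional offsets act as the blending weights
    top = (1 - wx) * image[y0][x0] + wx * image[y0][x1]
    bottom = (1 - wx) * image[y1][x0] + wx * image[y1][x1]
    return (1 - wy) * top + wy * bottom


img = [[0.0, 10.0],
       [20.0, 30.0]]
center = bilinear_sample(img, 0.5, 0.5)
print(center)
```

Because the result is a weighted blend, nudging x or y changes the output a little rather than jumping between pixels. That is what gives the gradient something to work with.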

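The multi-scale architecture is critical here. Bilinear sampling only looks at four neighboring pixels, so the gradient only carries information about a pixel's immediate surroundings. When initial depth predictions are poor and the correct match is many pixels away, that local signal is close to useless. At coarse scales each pixel covers a larger area, so large corrections become possible; fine-scale predictions then refine the details. It's like sketching rough shapes before adding detail - a natural learning progression.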

When Reality Gets Messy

The geometric model assumes that nothing moves except the camera, that surfaces look the same from every angle (no reflections), and that every point stays visible from one frame to the next. Real life breaks these assumptions constantly - cars drive by, windows reflect, buildings block each other.

To handle this, the system adds a small third network (in Zhou et al., a branch that shares its encoder with the Pose Network) that predicts an "explainability mask" - essentially a confidence map for each pixel:

L = Σ E(p) · |I_source(p') - I_target(p)|

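where E(p) downweights unreliable pixels. The network learns which regions can't be explained by rigid geometry and discounts them during training. Left alone, it could cheat by setting E(p) to zero everywhere, so a regularization term pushes the mask toward 1 and makes ignoring a pixel costly.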
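A small numerical sketch of the weighted loss. It includes a penalty that pulls E(p) toward 1, written here as a cross-entropy against a constant label of 1; the weight `lam`, the averaging, and the example numbers are illustrative assumptions.

```python
import math


def masked_photometric_loss(errors, mask, lam=0.2):
    """Explainability-weighted L1 loss plus a penalty for ignoring pixels.

    errors: per-pixel |I_source(p') - I_target(p)| values
    mask:   per-pixel E(p), each in (0, 1]
    lam:    weight of the mask penalty (illustrative value)
    """
    if len(errors) != len(mask):
        raise ValueError("errors and mask must have the same length")
    weighted = sum(e * m for e, m in zip(errors, mask))
    penalty = sum(-math.log(m) for m in mask)  # zero when E(p) = 1, grows as E(p) -> 0
    return (weighted + lam * penalty) / len(errors)


errors = [0.01, 0.02, 0.9]  # the third pixel sits on, say, a passing car

baseline = masked_photometric_loss(errors, [1.0, 1.0, 1.0])
skip_bad = masked_photometric_loss(errors, [1.0, 1.0, 0.1])
skip_good = masked_photometric_loss(errors, [0.1, 1.0, 1.0])
print(baseline, skip_bad, skip_good)
```

Downweighting the badly explained pixel lowers the total, while downweighting a well-explained one raises it. Ignoring pixels only pays off where the geometry really fails.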

A smoothness regularization term encourages neighboring pixels to have similar depths:

L_smooth = Σ ( |∂D/∂x| + |∂D/∂y| )

This reflects a prior about the physical world - most surfaces are continuous, not wildly discontinuous.
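On a pixel grid the derivatives become differences between neighbors. A minimal sketch, assuming simple forward differences and a depth map stored as a list of rows:

```python
def smoothness_loss(depth):
    """Sum of absolute forward differences in x and y over a depth map."""
    h, w = len(depth), len(depth[0])
    dx = sum(abs(depth[r][c + 1] - depth[r][c]) for r in range(h) for c in range(w - 1))
    dy = sum(abs(depth[r + 1][c] - depth[r][c]) for r in range(h - 1) for c in range(w))
    return dx + dy


flat = [[2.0, 2.0, 2.0],
        [2.0, 2.0, 2.0]]
spiky = [[2.0, 5.0, 2.0],
         [2.0, 2.0, 2.0]]

print(smoothness_loss(flat), smoothness_loss(spiky))
```

A flat surface costs nothing, while a single outlier is charged once for every neighbor it disagrees with.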

Different from Classical Approaches

If you've heard of "Structure from Motion" (SfM) - the technique where you take multiple photos and reconstruct 3D shape - this might sound familiar. But there's a crucial difference.

Classical SfM reconstructs a specific scene. Give it 50 photos of one statue, and it optimizes that particular statue's 3D structure using hand-crafted optimization algorithms. The learning-based approach instead learns a function. Train on thousands of videos, and the Depth Network can estimate depth for scenes it has never seen before, as long as they resemble what it was trained on. The Pose Network similarly generalizes to new motion patterns.

It's the difference between solving one math problem and learning mathematics.

Struggles

This approach has trouble with flat, textureless areas (blank walls, clear sky) - no visual features means no training signal. Objects extremely close to the camera make the math unstable. Underrepresented camera motions confuse it. Too much independent motion breaks the static scene assumption.

The networks can only learn patterns they're exposed to. Just like you'd struggle with depth in environments completely unlike anything you've experienced, the AI needs diverse training experiences.

Why This Matters Beyond Depth

The deeper idea transcends depth estimation. It's about teaching machines through consistency rather than explicit labels. We didn't tell the networks what depth values should be - we told them: "Your estimates should be consistent with what you see when the viewpoint changes."

This is self-supervised learning - the supervision signal comes from the data's structure itself. This pattern shows up everywhere now: language models predicting the next word, image generators reconstructing from noise, robots learning from actions and outcomes.

It's also how you learned most things. Nobody handed you a manual on depth perception or physics or language. You experienced the world, noticed patterns and consistencies, and built understanding from that structure.