Your Camera Doesn't Know This.

Your ATE Is 59 Meters And You Can't Figure Out Why
What a broken evaluation teaches you about the one thing cameras will never know
The trajectory shape looks perfect. The depth maps look great. The network converged smoothly. But the evaluation prints one number that breaks everything:
59 meters ATE? On a 10-meter path?
You check the code. Clean. You check the data.
Everything is fine.
You check the architecture. Nothing wrong.
You check the training curves. Smooth as you'd want.
So what the hell is wrong?
Nothing, as it turns out.
The previous post covered how self-supervised networks learn depth — two networks, photometric loss, geometry doing the teaching. It works beautifully. But there's something that post never told you. A crack running underneath the whole system that only shows up when you ask the network to tell you something real about the world.
This is that crack.
The network learned depth. Just not your depth.
Here's what the previous post glossed over.
The network learned to predict depth. But which depth?
In what units? At what scale?
The answer is: it doesn't know. And neither do you, until evaluation day.
When the system trains, it learns that this wall is further than this chair. That this road stretches away from the camera. The relative structure — what's closer, what's further, the shape of the scene — that's all real and correct. But the actual numbers? The meters? Those are invented.
The network picked a scale somewhere during initialization and everything it learned is relative to that arbitrary choice.
The depth map looks right.
It just might be describing a world where your living room is the size of a football stadium.
Why the training never caught this
Go back to the warp equation from the previous post:
p' = K · T · D · K⁻¹ · p
The network learns by warping one frame to look like another and measuring the difference. If the warp is right, the loss is low. If not, gradients flow and weights update.
Now do something subtle. Double the depth D. Double the translation T. What happens to p'?
Nothing. The warp is identical. The synthesized frame is identical. The loss is identical.
Triple them. Same result. Multiply by a thousand. Same result.
| Depth | Translation | Loss |
| 1× | 1× | 0.05 |
| 10× | 10× | 0.05 |
| 1000× | 1000× | 0.05 |
The photometric loss cannot see scale. It is structurally, mathematically blind to it. Depth and translation are coupled — they always move together — and as long as they stay proportional, every frame prediction comes out identical.
The network is trying to solve a puzzle where an entire dimension of the answer doesn't change the picture on the box.
What this looks like from the optimizer's perspective
Picture the optimizer as someone dropped into a hilly landscape, tasked with finding the lowest valley. Every step downhill means better predictions. Every step uphill means the loss is getting worse. That's gradient descent.
Now add a scale dimension to this landscape. Stretch it out in one direction, infinitely. And make that entire direction perfectly flat — no slope whatsoever, in either direction, forever.
The optimizer finds the valley in every other direction. It gets rotation right. It gets the direction of motion right. It gets the relative depths right. But in the scale direction, it just wanders out onto that flat plain and stops wherever it happens to be standing.
That's your 59 meters. The optimizer landed somewhere on the plain, decided that was good enough, and learned everything else relative to that arbitrary spot.
You can train for a million epochs. You can use a better architecture. You can throw more data at it. None of it helps. You can't walk downhill when there's no hill.
The geometry behind it
A camera moving through space has 6 degrees of freedom — 3 for rotation, 3 for translation.
Think of it like your head. Nod, shake, tilt — that's rotation, 3 numbers. Move left, right, up, down, forward, back — that's translation, 3 more. Six total to fully describe any motion.
Monocular vision recovers 5 of them cleanly. All 3 rotations. The direction of translation — which way you moved. But the magnitude of translation — how far you actually moved — that one slips through. The 6th degree of freedom. Global scale.
The shape of your trajectory gets drawn perfectly. A right turn is a right turn. A curve is a curve. But the size of that trajectory is a guess. Whether you drove 10 meters or 10 kilometers, the shape looks the same to a monocular camera.
This is why the evaluation went wrong. The shape was correct. The scale was invented.
Why this is a property of geometry, not a flaw in the code
Here's the thing that makes this hard to accept: it's not fixable. Not with this setup.
The moment a 3D world gets projected onto a 2D image, something gets destroyed. The previous post called it "one dimension, lost forever." That dimension carries the absolute scale of everything — how big things actually are, how far the camera actually moved. The projection equation divides it away and it's gone.
Imagine a miniature model city on an architect's table. Perfect scale replica, every building correct, tiny cars on tiny roads. Now imagine the real city. Photograph both from the right angle and distance — you cannot tell which is which. Not because the camera isn't precise enough. Because the information that would tell you was never in the image.
A single camera can reconstruct the shape of the city — which building is taller, which street curves left, how the skyline is arranged. But is this a model you could hold in your hands, or a city you could get lost in? The image doesn't say. It never did.
That's what the network learned. The shape. Not the size.
So where does scale actually come from?

[Triangulation, Image from StackExchange]
If the images don't contain it, you have to bring it in from somewhere else. Every real solution to scale ambiguity does exactly one thing: it injects physical units that projection destroyed.
The most direct way is a second camera. Two lenses at a known distance apart — a fixed, measured baseline. The difference in what each sees can be triangulated into real depth. The same trick your two eyes use constantly without you noticing. Try it now: hold your finger in front of your face and close one eye, then the other, alternating. Watch your finger jump against the background. The size of that jump is your brain's ruler. Your eyes are two cameras with a known baseline, and your brain has been doing this triangulation your entire life.
The other clean solution is an inertial sensor. An IMU measures acceleration in meters per second squared. Physical units. Real ones. You cannot arbitrarily scale acceleration — physics doesn't allow it. If the visual system says the camera moved some distance and the IMU says it accelerated at a specific rate, those two constraints together nail down a specific, metric scale. It's like giving the camera a body — suddenly it can feel how fast it's actually moving through real space.
Beyond hardware, you can anchor scale to things you already know the size of. A standard door. A lane marking. A person. The moment the camera spots something with a known real-world dimension, it has a ruler, and everything else can be measured relative to that.
And sometimes the right answer is to stop asking the question. If you only need to know that the chair is closer than the wall — ordering, not measuring — monocular vision does that beautifully. The limitation only bites when you need a real number in meters.
The reframe that changes everything
For a while it's tempting to treat this as an implementation problem. A better loss function, more data, a smarter architecture, longer training — the answer must be in there somewhere. You tune hyperparameters. You swap backbones. You read papers at 2am convinced someone solved it and you just haven't found the right one yet.
Picture someone meticulously rearranging deck chairs and polishing railings, completely unaware the ship has a hole in it below the waterline.
The constraint isn't in the code. It's in the sensor. A single camera destroys scale when it projects 3D onto 2D, and no computation afterward recovers information that was never captured. This isn't an engineering problem waiting for a clever solution. It's an information problem. The data doesn't contain it. End of story.
The correct move isn't to try harder. It's to ask the right question: do you actually need metric scale? If yes, the answer lives outside the images. Another sensor. Another camera. A known reference. If no, stop using metrics that pretend you have something you don't.
The network was right all along. The shape was correct. The depths were right relative to each other. It just didn't know if it was describing a model city on a table or the real thing sprawling across the horizon.
From a single camera, it never will.
That's scale ambiguity. Not a bug. Geometry.
If you haven't read the previous post on self-supervised monocular depth estimation, that's the foundation this builds on. The two posts are the same coin, front and back.


