The Google Pixel 2 and Pixel 2 XL don’t have dual-camera setups, but they still ship with an Apple-inspired Portrait Mode for depth-of-field effects.
Effects like this traditionally require a dual-camera system to calculate depth. But Google opted for a software and machine-learning approach, detailing the techniques and their advantages and disadvantages in a new blog post.
Google engineers Marc Levoy and Yael Pritch explain that if they could figure out the distance between the smartphone camera and various points in the scene, they could create depth-of-field and bokeh effects.
How the effect is traditionally achieved
The duo explains that the traditional way to figure out distance is to use two smartphone cameras on the back.
“Then, for each patch in the left camera’s image, we look for a matching patch in the right camera’s image. The position in the two images where this match is found gives the depth of that scene feature through a process of triangulation,” reads an excerpt from the post.
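As a rough illustration of that matching-and-triangulation step (not Google's actual pipeline), here is a minimal sketch in Python: a brute-force block matcher finds, for each patch in the left image, the best-matching patch along the same row of the right image, and the resulting disparity is converted to depth. The patch size, search range, focal length and baseline parameters are illustrative assumptions.

```python
import numpy as np

def match_disparity(left, right, patch=9, max_disp=32):
    """Brute-force block matching: for each patch in the left image,
    find the best-matching patch in the right image along the same row."""
    h, w = left.shape
    half = patch // 2
    disparity = np.zeros((h, w), dtype=np.float32)
    for y in range(half, h - half):
        for x in range(half, w - half):
            ref = left[y - half:y + half + 1, x - half:x + half + 1]
            best_d, best_cost = 0, np.inf
            for d in range(0, min(max_disp, x - half) + 1):
                cand = right[y - half:y + half + 1, x - d - half:x - d + half + 1]
                cost = np.sum((ref - cand) ** 2)  # sum of squared differences
                if cost < best_cost:
                    best_cost, best_d = cost, d
            disparity[y, x] = best_d
    return disparity

def depth_from_disparity(disparity, focal_px, baseline_m):
    """Triangulation: depth = focal length (in pixels) * baseline / disparity."""
    return np.where(disparity > 0, focal_px * baseline_m / disparity, np.inf)
```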
But what if you don’t have a dual-camera phone? The duo explains that simple single-camera smartphone apps instead separate the image into two layers. “This separation, sometimes called semantic segmentation, lets you blur the background, but it has no notion of depth, so it can’t tell you how much to blur it. Also, if there is an object in front of the person, i.e. very close to the camera, it won’t be blurred out, even though a real camera would do this.”
So how did they get the Pixel 2’s results? After all, the devices only have one main camera.
The duo says that Portrait Mode on the selfie camera makes use of the aforementioned “semantic segmentation” technique. As for the back? Well, the first step is to capture an HDR+ snap.
“Starting from an HDR+ picture, we next decide which pixels belong to the foreground (typically a person) and which belong to the background. This is a tricky problem, because unlike chroma keying (aka green-screening) in the movie industry, we can’t assume that the background is green (or blue, or any other colour). Instead, we apply machine learning.”
Here’s where it gets even more interesting…
Combining a variety of techniques
“In particular, we have trained a neural network, written in TensorFlow, that looks at the picture, and produces an estimate of which pixels are people and which aren’t,” the post reads. In essence, the network filters the image repeatedly, from “low-level features” such as colours and edges to the likes of hats, sunglasses and ice cream cones, to produce a “mask”.
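The post doesn’t publish the network itself, but a toy TensorFlow sketch of the idea might look like the following: a small fully convolutional network that downsamples the image to pick up higher-level features, then upsamples back to a per-pixel “person” probability mask. The layer counts and sizes here are arbitrary assumptions, not Google’s production architecture.

```python
import tensorflow as tf

def build_toy_segmenter(input_shape=(256, 256, 3)):
    """Toy fully convolutional network: image in, per-pixel person-probability mask out.
    Layer counts and sizes are illustrative, not Google's production model."""
    inputs = tf.keras.Input(shape=input_shape)
    # Encoder: low-level features (edges, colours) towards higher-level ones
    x = tf.keras.layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(inputs)
    x = tf.keras.layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(x)
    x = tf.keras.layers.Conv2D(128, 3, strides=2, padding="same", activation="relu")(x)
    # Decoder: upsample back to the input resolution
    x = tf.keras.layers.Conv2DTranspose(64, 3, strides=2, padding="same", activation="relu")(x)
    x = tf.keras.layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(x)
    x = tf.keras.layers.Conv2DTranspose(16, 3, strides=2, padding="same", activation="relu")(x)
    # Final 1x1 convolution with a sigmoid gives the soft segmentation mask
    mask = tf.keras.layers.Conv2D(1, 1, activation="sigmoid")(x)
    return tf.keras.Model(inputs, mask)

model = build_toy_segmenter()
model.compile(optimizer="adam", loss="binary_crossentropy")
```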
To achieve even better results, Google also introduced a stereo algorithm. True, the phone doesn’t have two cameras on the back, but dual-pixel autofocus effectively splits the single rear camera in half.
“If one imagines splitting the (tiny) lens of the phone’s rear-facing camera into two halves, the view of the world as seen through the left side of the lens and the view through the right side are slightly different. These two viewpoints are less than 1mm apart (roughly the diameter of the lens), but they’re different enough to compute stereo and produce a depth map.”
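To get a feel for how small that signal is, here is a back-of-the-envelope calculation using the same triangulation relation as in the earlier sketch. The focal length and pixel pitch are illustrative assumptions, not the Pixel 2’s exact specifications.

```python
# Disparity in pixels for a split-lens "stereo pair": d = f * B / Z
focal_length_m = 4.5e-3   # assumed focal length (~4.5 mm)
pixel_pitch_m = 1.4e-6    # assumed pixel pitch (~1.4 µm)
baseline_m = 1e-3         # ~1 mm between the two half-lens viewpoints
focal_px = focal_length_m / pixel_pitch_m  # focal length expressed in pixels

for depth_m in (0.5, 2.0, 5.0):
    disparity_px = focal_px * baseline_m / depth_m
    print(f"subject at {depth_m} m -> disparity of roughly {disparity_px:.2f} px")
# With a ~1 mm baseline the disparities are only a few pixels (or less),
# which is why sensor noise matters so much for this depth map.
```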
But aren’t the two halves of the same camera too close together to be truly useful for depth information? The duo says the resulting depth signal is weak, and especially noisy in low light, but they have a solution.
“To reduce this noise and improve depth accuracy we capture a burst of left-side and right-side images, then align and average them before applying our stereo algorithm,” they explain, cautioning that, much like HDR+, ghosting can be a concern.
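A minimal sketch of that denoising step, assuming the frames in the burst have already been aligned (the alignment itself is the hard part and is glossed over here): averaging N aligned frames reduces the noise standard deviation by roughly the square root of N.

```python
import numpy as np

def average_burst(frames):
    """Average an aligned burst of frames to reduce per-pixel noise.
    Assumes the frames are already aligned to each other."""
    stack = np.stack([f.astype(np.float32) for f in frames], axis=0)
    return stack.mean(axis=0)

# Usage: denoise the left-side and right-side bursts before stereo matching.
# left_avg = average_burst(left_burst)
# right_avg = average_burst(right_burst)
# disparity = match_disparity(left_avg, right_avg)  # from the earlier sketch
```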
And for the final background blur effect?
“Actually applying the blur is conceptually the simplest part; each pixel is replaced with a translucent disk of the same colour but varying size.”
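A naive sketch of that idea (not Google’s optimised implementation): scatter each source pixel as a translucent disk whose radius would come from the depth map, then normalise by the accumulated weight. In practice, pixels under the foreground mask would get a radius of zero so the subject stays sharp.

```python
import numpy as np

def scatter_blur(image, radii):
    """Render each source pixel as a translucent disk of the same colour,
    with per-pixel radius (e.g. derived from a depth map), then normalise.
    A simple O(n * r^2) sketch, not an optimised renderer."""
    h, w, _ = image.shape
    accum = np.zeros((h, w, 3), dtype=np.float64)
    weight = np.zeros((h, w), dtype=np.float64)
    for y in range(h):
        for x in range(w):
            r = int(radii[y, x])
            y0, y1 = max(0, y - r), min(h, y + r + 1)
            x0, x1 = max(0, x - r), min(w, x + r + 1)
            yy, xx = np.mgrid[y0:y1, x0:x1]
            disk = (yy - y) ** 2 + (xx - x) ** 2 <= r * r
            alpha = 1.0 / (np.pi * r * r + 1.0)  # bigger disks are more translucent
            accum[y0:y1, x0:x1][disk] += alpha * image[y, x]
            weight[y0:y1, x0:x1][disk] += alpha
    return (accum / weight[..., None]).astype(image.dtype)
```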
What could go wrong?
The company also explains a few limitations or weaknesses of the approach, starting with an HDR+ shot gone awry.
“The portraits produced by the Pixel 2 depend on the underlying HDR+ image, segmentation mask, and depth map; problems in these inputs can produce artifacts in the result. For example, if a feature is overexposed in the HDR+ image (blown out to white), then it’s unlikely the left-half and right-half images will have useful information in them, leading to errors in the depth map.”
The duo also gives an example of the neural network’s weaknesses. “It’s a neural network, which has been trained on nearly a million images, but we bet it has never seen a photograph of a person kissing a crocodile, so it will probably omit the crocodile from the mask, causing it to be blurred out.”
The stereo algorithm (left and right halves of camera taking snaps) might also have trouble with “featureless textures” such as blank walls or patterns like plaid shirts.
Interestingly enough, Google zooms in slightly when using Portrait Mode (1.5x for main camera, 1.2x for selfie camera) because “narrower fields of view encourage you to stand back further, which in turn reduces perspective distortion, leading to better portraits”.
The Google engineers conclude that you should still keep your SLR for optical zoom and that dedicated cameras won’t disappear. Still, smartphone photography has come on in leaps and bounds thanks to computational photography techniques.
“Both of us travel with a big camera and a Pixel 2. At the beginning of our trips we dutifully take out our SLRs, but by the end, it mostly stays in our luggage.”