Deep Photo: Model-Based Photograph Enhancement and Viewing

Johannes Kopf (University of Konstanz), Boris Neubert (University of Konstanz), Billy Chen (Microsoft), Michael Cohen (Microsoft Research), Daniel Cohen-Or (Tel Aviv University), Oliver Deussen (University of Konstanz), Matt Uyttendaele (Microsoft Research), Dani Lischinski (The Hebrew University)

Figure 1: Some of the applications of the Deep Photo system (panels: original, dehazed, relighted, annotated).

Abstract

In this paper, we introduce a novel system for browsing, enhancing, and manipulating casual outdoor photographs by combining them with already existing georeferenced digital terrain and urban models. A simple interactive registration process is used to align a photograph with such a model. Once the photograph and the model have been registered, an abundance of information, such as depth, texture, and GIS data, becomes immediately available to our system. This information, in turn, enables a variety of operations, ranging from dehazing and relighting the photograph, to novel view synthesis, and overlaying with geographic information. We describe the implementation of a number of these applications and discuss possible extensions. Our results show that augmenting photographs with already available 3D models of the world supports a wide variety of new ways for us to experience and interact with our everyday snapshots.

Keywords: image-based modeling, image-based rendering, image completion, dehazing, relighting, photo browsing

1 Introduction

Despite the increasing ubiquity of digital photography, the metaphors we use to browse and interact with our photographs have not changed much. With few exceptions, we still treat them as 2D entities, whether they are displayed on a computer monitor or printed as a hard copy. It is well understood that augmenting a photograph with depth can open the way for a variety of new exciting manipulations. However, inferring the depth information from a single image that was captured with an ordinary camera is still a long-standing unsolved problem in computer vision. Luckily, we are witnessing a great increase in the number and the accuracy of geometric models of the world, including terrain and buildings. By registering photographs to these models, depth becomes available at each pixel. The Deep Photo system described in this paper consists of a number of applications afforded by these newfound depth values, as well as the many other types of information that are typically associated with such models.

Deep Photo is motivated by several recent trends now reaching critical mass. The first trend is that of geo-tagged photos. Many photo sharing web sites now enable users to manually add location information to photos. Some digital cameras, such as the RICOH Caplio 500SE and the Nokia N95, feature a built-in GPS, allowing automatic location tagging. Also, a number of manufacturers offer small GPS units that allow photos to be easily geo-tagged by software that synchronizes the GPS log with the photos. In addition, location tags can be enhanced by digital compasses that are able to measure the orientation (tilt and heading) of the camera. It is expected that, in the future, more cameras will have such functionality, and that most photographs will be geo-tagged.

The second trend is the widespread availability of accurate digital terrain models, as well as detailed urban models.
Thanks to commercial projects, such as Google Earth and Microsoft's Virtual Earth, both the quantity and the quality of such models is rapidly increasing. In the public domain, NASA provides detailed satellite imagery (e.g., Landsat [NASA 2008a]) and elevation models (e.g., Shuttle Radar Topography Mission [NASA 2008b]). Also, a number of cities around the world are creating detailed 3D models of their cityscape (e.g., Berlin 3D).

The combination of geo-tagging and the availability of fairly accurate 3D models allows many photographs to be precisely geo-registered. We envision that in the near future automatic geo-registration will be available as an online service. Thus, although we briefly describe the simple interactive geo-registration technique that we currently employ, the emphasis of this paper is on the applications that it enables, including:

• dehazing (or adding haze to) images,
• approximating changes in lighting,
• novel view synthesis,
• expanding the field of view,
• adding new objects into the image,
• integration of GIS data into the photo browser.

Our goal in this work has been to enable these applications for single outdoor images, taken in a casual manner without requiring any special equipment or any particular setup. Thus, our system is applicable to a large body of existing outdoor photographs, so long as we know the rough location where each photograph was taken. We chose New York City and Yosemite National Park as two of the many locations around the world for which detailed textured models are already available. (For Yosemite, we use elevation data from the Shuttle Radar Topography Mission [NASA 2008b] with Landsat imagery [NASA 2008a]; such data is available for the entire Earth. Models similar to that of NYC are currently available for dozens of cities.) We demonstrate our approach by combining a number of photographs (obtained from flickr™) with these models.

It should be noted that while the models that we use are fairly detailed, they are still a far cry from the degree of accuracy and the level of detail one would need in order to use these models directly to render photographic images. Thus, one of our challenges in this work has been to understand how to best leverage the 3D information afforded by the use of these models, while at the same time preserving the photographic qualities of the original image.

In addition to exploring the applications listed above, this paper also makes a number of specific technical contributions. The two main ones are a new data-driven stable dehazing procedure, and a new model-guided layered depth image completion technique for novel view synthesis.

Before continuing, we should note some of the limitations of Deep Photo in its current form. The examples we show are of outdoor scenes. We count on the available models to describe the distant static geometry of the scene, but we cannot expect to have access to the geometry of nearby (and possibly dynamic) foreground objects, such as people, cars, trees, etc. In our current implementation such foreground objects are matted out before combining the rest of the photograph with a model, and may be composited back onto the photograph at a later stage. So, for some images, the user must spend some time on interactive matting, and the fidelity of some of our manipulations in the foreground may be reduced. That said, we expect the kinds of applications we demonstrate will scale to include any improvements in automatic computer vision algorithms and depth acquisition technologies.
2 Related Work

Our system touches upon quite a few distinct topics in computer vision and computer graphics; thus, a comprehensive review of all related work is not feasible due to space constraints. Below, we attempt to provide some representative references, and discuss in detail only the ones most closely related to our goals and techniques.

Image-based modeling. In recent years, much work has been done on image-based modeling techniques, which create high quality 3D models from photographs. One example is the pioneering Façade system [Debevec et al. 1996], designed for interactive modeling of buildings from collections of photographs. Other systems use panoramic mosaics [Shum et al. 1998], combine images with range data [Stamos and Allen 2000], or merge ground and aerial views [Früh and Zakhor 2003], to name a few.

Any of these approaches may be used to create the kinds of textured 3D models that we use in our system; however, in this work we are not concerned with the creation of such models, but rather with the ways in which their combination with a single photograph may be useful for the casual digital photographer. One might say that rather than attempting to automatically or manually reconstruct the model from a single photo, we exploit the availability of digital terrain and urban models, effectively replacing the difficult 3D reconstruction/modeling process by a much simpler registration process.

Recent research has shown that various challenging tasks, such as image completion and insertion of objects into photographs [Hays and Efros 2007; Lalonde et al. 2007], can greatly benefit from the availability of the enormous amounts of photographs that had already been captured. The philosophy behind our work is somewhat similar: we attempt to leverage the large amount of textured geometric models that have already been created. But unlike image databases, which consist mostly of unrelated items, the geometric models we use are all anchored to the world that surrounds us.

Dehazing. Weather and other atmospheric phenomena, such as haze, greatly reduce the visibility of distant regions in images of outdoor scenes. Removing the effect of haze, or dehazing, is a challenging problem, because the degree of this effect at each pixel depends on the depth of the corresponding scene point.

Some haze removal techniques make use of multiple images; e.g., images taken under different weather conditions [Narasimhan and Nayar 2003a], or with different polarizer orientations [Schechner et al. 2003]. Since we are interested in dehazing single images, taken without any special equipment, such methods are not suitable for our needs. There are several works that attempt to remove the effects of haze, fog, etc., from a single image using some form of depth information. For example, Oakley and Satherley [1998] dehaze aerial imagery using estimated terrain models. However, their method involves estimating a large number of parameters, and the quality of the reported results is unlikely to satisfy today's digital photography enthusiasts. Narasimhan and Nayar [2003b] dehaze single images based on a rough depth approximation provided by the user, or derived from satellite orthophotos. The very latest dehazing methods [Fattal 2008; Tan 2008] are able to dehaze single images by making various assumptions about the colors in the scene.
Our work differs from these previous single image dehazing methods in that it leverages the availability of more accurate 3D models, and uses a novel data-driven dehazing procedure. As a result, our method is capable of effective, stable, high-quality contrast restoration even of extremely distant regions.

Novel view synthesis. It has long been recognized that adding depth information to photographs provides the means to alter the viewpoint. The classic "Tour Into the Picture" system [Horry et al. 1997] demonstrates that fitting a simple mesh to the scene is sometimes enough to enable a compelling 3D navigation experience. Subsequent papers, Kang [1998], Criminisi et al. [2000], Oh et al. [2001], and Zhang et al. [2002], extend this by providing more sophisticated, user-guided 3D modelling techniques. More recently, Hoiem et al. [2005] use machine learning techniques in order to construct a simple "pop-up" 3D model, completely automatically from a single photograph. In these systems, despite the simplicity of the models, the 3D experience can be quite compelling.

In this work, we use already available 3D models in order to add depth to photographs. We present a new model-guided image completion technique that enables us to expand the field of view and to perform high-quality novel view synthesis.

Relighting. A number of sophisticated relighting systems have been proposed by various researchers over the years (e.g., [Yu and Malik 1998; Yu et al. 1999; Loscos et al. 2000; Debevec et al. 2000]). Typically, such systems make use of a highly accurate geometric model, and/or a collection of photographs, often taken under different lighting conditions. Given this input they are often able to predict the appearance of a scene under novel lighting conditions with a very high degree of accuracy and realism. Another alternative is to use a time-lapse video sequence [Sunkavalli et al. 2007].

In our case, we assume the availability of a geometric model, but have just one photograph to work with. Furthermore, although the model might be detailed, it is typically quite far from a perfect match to the photograph. For example, a tree casting a shadow on a nearby building will typically be absent from our model. Thus, we cannot hope to correctly recover the reflectance at each pixel of the photograph, which is necessary in order to perform physically accurate relighting. Therefore, in this work we propose a very simple relighting approximation, which is nevertheless able to produce fairly compelling results.

Photo browsing. Also related is the "Photo Tourism" system [Snavely et al. 2006], which enables browsing and exploring large collections of photographs of a certain location using a 3D interface. But the browsing experience that we provide is very different. Moreover, in contrast to "Photo Tourism", our system requires only a single geo-tagged photograph, making it applicable even to locations without many available photos.

The "Photo Tourism" system also demonstrates the transfer of annotations from one registered photograph to another. In Deep Photo, photographs are registered to a model of the world, making it possible to tap into a much richer source of information.

Working with geo-referenced images. Once a photo is registered to geo-referenced data such as maps and 3D models, a plethora of information becomes available.
For example, Cho [2007] notes that absolute geo-locations can be assigned to individual pixels and that GIS annotations, such as building and street names, may be projected onto the image plane. Deep Photo supports similar labeling, as well as several additional visualizations, but in contrast to Cho's system, it does so dynamically, in the context of an interactive photo browsing application. Furthermore, as discussed earlier, it also enables a variety of other applications.

In addition to enhancing photos, location is also useful in organizing and visualizing photo collections. The system developed by Toyama et al. [2003] enables a user to browse large collections of geo-referenced photos on a 2D map. The map serves as both a visualization device, as well as a way to specify spatial queries, i.e., all photos within a region. In contrast, Deep Photo focuses on enhancing and browsing of a single photograph; the two systems are actually complementary, one focusing on organizing large photo collections, and the other on enhancing and viewing single photographs.

3 Registration and Matting

We assume that the photograph has been captured by a simple pinhole camera, whose parameters consist of position, pose, and focal length (seven parameters in total). To register such a photograph to a 3D geometric model of the scene, it suffices to specify four or more corresponding pairs of points [Gruen and Huang 2001]. Assuming that the rough position from which the photograph was taken is available (either from a geotag, or provided by the user), we are able to render the model from roughly the correct position, let the user specify sufficiently many correspondences, and recover the parameters by solving a nonlinear system of equations [Nister and Stewenius 2007]. The details and user interface of our registration system are described in a technical report [Chen et al. 2008].

For images that depict foreground objects not contained in the model, we ask the user to matte out the foreground. For the applications demonstrated in this paper the matte does not have to be too accurate, so long as it is conservative (i.e., all the foreground pixels are contained). We created mattes with the Soft Scissors system [Wang et al. 2007]. The process took about 1–2 minutes per photo. For every result produced using a matte we show the matte next to the input photograph.
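The full registration procedure and its user interface are described in the technical report [Chen et al. 2008]; the sketch below is only our illustration of the kind of optimization involved, not the authors' implementation. It recovers the seven parameters (position, axis-angle pose, focal length) by nonlinear least-squares minimization of reprojection error over the user-supplied 2D–3D correspondences; the helper names, the axis-angle parameterization, and the use of scipy.optimize.least_squares are our own assumptions.

```python
import numpy as np
from scipy.optimize import least_squares

def rotation_from_axis_angle(w):
    """Rodrigues' formula: axis-angle vector w (3,) -> 3x3 rotation matrix."""
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)
    k = w / theta
    K = np.array([[0, -k[2], k[1]],
                  [k[2], 0, -k[0]],
                  [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def project(params, X):
    """Pinhole projection of world points X (N,3) with
    params = [cx, cy, cz, wx, wy, wz, f]: camera position, axis-angle pose,
    focal length. Principal point at the image center (assumed)."""
    C, w, f = params[0:3], params[3:6], params[6]
    R = rotation_from_axis_angle(w)
    Xc = (X - C) @ R.T                  # world -> camera coordinates
    return f * Xc[:, :2] / Xc[:, 2:3]   # perspective divide

def register_photo(uv, X, init_params):
    """Recover the 7 camera parameters from >= 4 point correspondences.
    uv: (N,2) clicked image points, X: (N,3) corresponding model points,
    init_params: rough guess (e.g., from the geotag and a default focal length)."""
    residuals = lambda p: (project(p, X) - uv).ravel()
    return least_squares(residuals, init_params).x

# Usage (hypothetical data): four correspondences give 8 equations for 7 unknowns.
# params = register_photo(uv, X, np.array([cx0, cy0, cz0, 0.0, 0.0, 0.0, 1000.0]))
```

Four correspondences already over-determine the seven unknowns, and additional pairs simply make the fit more robust to clicking error.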
4 Image Enhancement

Many of the typical images we take are of a spectacular, often well known, landscape or cityscape. Unfortunately, in many cases the lighting conditions or the weather are not optimal when the photographs are taken, and the results may be dull or hazy. Having a sufficiently accurate match between a photograph and a geometric model offers new possibilities for enhancing such photographs. We are able to easily remove haze and unwanted color shifts and to experiment with alternative lighting conditions.

4.1 Dehazing

Atmospheric phenomena, such as haze and fog, can reduce the visibility of distant regions in images of outdoor scenes. Due to atmospheric absorption and scattering, only part of the light reflected from distant objects reaches the camera. Furthermore, this light is mixed with airlight (scattered ambient light between the object and camera). Thus, distant objects in the scene typically appear considerably lighter and featureless, compared to nearby ones.

If the depth at each image pixel is known, in theory it should be easy to remove the effects of haze by fitting an analytical model (e.g., [McCartney 1976; Nayar and Narasimhan 1999]):

  $I_h = I_o\,f(z) + A\,(1 - f(z))$.   (1)

Here $I_h$ is the observed hazy intensity at a pixel, $I_o$ is the original intensity reflected towards the camera from the corresponding scene point, $A$ is the airlight, and $f(z) = \exp(-\beta z)$ is the attenuation in intensity as a function of distance due to outscattering. Thus, after estimating the parameters $A$ and $\beta$, the original intensity may be recovered by inverting the model:

  $I_o = A + (I_h - A)\,\frac{1}{f(z)}$.   (2)

Figure 2: Dehazing (panels: input, model textures, estimated haze curves f(z) plotted as intensity versus depth, final dehazed result). Note the artifacts in the model texture, and the significant deviation of the estimated haze curves from exponential shape.

Figure 3: More dehazing examples (input/dehazed pairs).

As pointed out by Narasimhan and Nayar [2003a], this model assumes single scattering and a homogeneous atmosphere. Thus, it is more suitable for short ranges of distance and might fail to correctly approximate the attenuation of scene points that are more than a few kilometers away. Furthermore, since the exponential attenuation goes quickly down to zero, noise might be severely amplified in the distant areas. Both of these artifacts may be observed in the "Inversion Result" of Figure 4.

While reducing the degree of dehazing [Schechner et al. 2003] and regularization [Schechner and Averbuch 2007; Kaftory et al. 2007] may be used to alleviate these problems, our approach is to estimate stable values for the haze curve f(z) directly from the relationship between the colors in the photograph and those of the model textures. More specifically, we compute a curve f(z) and an airlight A, such that eq. (2) would map averages of colors in the photograph to the corresponding averages of (color-corrected) model texture colors. Note that although our f(z) has the same physical interpretation as in the previous approaches, due to our estimation process it is not subject to the constraints of a physically-based model. Since we estimate a single curve to represent the possibly spatially varying haze, it can also contain non-monotonicities. All of the parameters are estimated completely automatically.

For robustness, we operate on averages of colors over depth ranges. For each value of z, we compute the average model texture color Î_m(z) for all pixels whose depth is in [z − δ, z + δ], as well as the average hazy image color Î_h(z) for the same pixels. In our implementation, the depth interval parameter δ is set to 500 meters for all images we experimented with. The averaging makes our approach less sensitive to model texture artifacts, such as registration and stitching errors, bad pixels, or shadows and clouds contained in the textures.

Before explaining the details of our method, we would like to point out that the model textures typically have a global color bias. For example, Landsat uses seven sensors whose spectral responses differ from the typical RGB camera sensors. Thus, the colors in the resulting textures are only an approximation to ones that would have been captured by a camera (see Figure 2). We correct this color bias by measuring the ratio between the photo and the texture colors in the foreground (in each channel), and using these ratios to correct the colors of the entire texture.
More precisely, we compute a global multiplicative correction vector C as

  $C = \frac{F_h}{\mathrm{lum}(F_h)} \Big/ \frac{F_m}{\mathrm{lum}(F_m)}$,   (3)

where F_h is the average of Î_h(z) with z < z_F, and F_m is a similarly computed average of the model texture. lum(c) denotes the luminance of a color c. We set z_F to 1600 meters for all our images.

Figure 4: Comparison with other dehazing methods (panels: input, Fattal's result, inversion result, our result). The second row shows full-resolution zooms of the region indicated with a red rectangle in the input photo. See the supplementary materials for more comparison images.

Now we are ready to explain how to compute the haze curve f(z). Ignoring for the moment the physical interpretation of A and f(z), note that eq. (2) simply stretches the intensities of the image around A, using the scale coefficient f(z)^{-1}. Our goal is to find A and f(z) that would map the hazy photo colors Î_h(z) to the color-corrected texture colors C·Î_m(z). Substituting Î_h(z) for I_h, and C·Î_m(z) for I_o, in eq. (2) we get

  $f(z) = \frac{\hat{I}_h(z) - A}{C\,\hat{I}_m(z) - A}$.   (4)

Different choices of A will result in different scaling curves f(z). We set A = 1, since this guarantees f(z) ≥ 0. Using A > 1 would result in larger values of f(z), and hence less contrast in the dehazed image, and using A < 1 might be prone to instabilities. Figure 2 shows the f(z) curve estimated as described above.

The recovered haze curve f(z) allows us to effectively restore the contrasts in the photo. However, the colors in the background might undergo a color shift. We compensate for this by adjusting A, while keeping f(z) fixed, such that after the change the dehazing preserves the colors of the photo in the background.

To adjust A, we first compute the average background color B_h of the photo as the average of Î_h(z) with z > z_B, and a similarly computed average of the model texture B_m. We set z_B to 5000 meters for all our images. The color of the background is preserved if the ratio

  $R = \frac{A + (B_h - A)\,f^{-1}}{B_h}, \qquad f = \frac{B_h - 1}{B_m - 1}$,   (5)

has the same value for every color channel. Thus, we rewrite eq. (5) to obtain A as

  $A = B_h\,\frac{R - f^{-1}}{1 - f^{-1}}$,   (6)

and set $R = \max(B_{m,\mathrm{red}}/B_{h,\mathrm{red}},\ B_{m,\mathrm{green}}/B_{h,\mathrm{green}},\ B_{m,\mathrm{blue}}/B_{h,\mathrm{blue}})$. This particular choice of R results in the maximum A that guarantees A ≤ 1. Finally, we use eq. (2) with the recovered f(z) and the adjusted A to dehaze the photograph.

Figures 2 and 3 show various images dehazed with our method. Figure 4 compares our method with other approaches. In this comparison we focused on methods that are applicable in our context of working with a single image only. Fattal's method [2008] dehazes the image nicely up to a certain distance (particularly considering that this method does not require any input in addition to the image itself), but it is unable to effectively dehaze the more distant parts, closer to the horizon. The "Inversion Result" was obtained via eq. (2) with an exponential haze curve. This is how dehazing was performed in a number of papers, e.g., [Schechner et al. 2003; Narasimhan and Nayar 2003a; Narasimhan and Nayar 2003b]. Here, we use our accurate depth map instead of using multiple images or user-provided depth approximations. The airlight color was set to the sky color near the horizon, and the optical depth β was adjusted manually. The result suffers from amplified noise in the distance, and breaks down next to the horizon. In contrast, our result manages to remove more haze than the two other approaches, while preserving the natural colors of the input photo.
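To make the estimation procedure above concrete, here is a compact sketch of eqs. (2)–(6) as we read them. It is an illustration under our own assumptions (array layouts, a BT.601 luminance, linear interpolation of the binned haze curve, non-empty depth bins, and a scene deeper than z_B), not the authors' code.

```python
import numpy as np

def binned_means(img, depth, centers, delta=500.0):
    """Average color over pixels whose depth lies in [c - delta, c + delta].
    Assumes every bin is populated (img: (H,W,3), depth: (H,W) in meters)."""
    return np.array([img[np.abs(depth - c) <= delta].mean(axis=0) for c in centers])

def estimate_dehazing(photo, model_tex, depth, z_F=1600.0, z_B=5000.0, delta=500.0):
    """Estimate the color correction C, haze curve f(z), and adjusted airlight A."""
    centers = np.arange(delta, depth.max(), delta)
    Ih = binned_means(photo, depth, centers, delta)      # hazy averages  I^_h(z)
    Im = binned_means(model_tex, depth, centers, delta)  # model averages I^_m(z)

    # Eq. (3): global color correction from foreground averages.
    lum = lambda c: c @ np.array([0.299, 0.587, 0.114])
    Fh, Fm = Ih[centers < z_F].mean(axis=0), Im[centers < z_F].mean(axis=0)
    C = (Fh / lum(Fh)) / (Fm / lum(Fm))

    # Eq. (4) with A = 1, so that f(z) >= 0.
    f = (Ih - 1.0) / (C * Im - 1.0)                      # (K, 3)

    # Eqs. (5)-(6): background-preserving airlight adjustment (needs depth > z_B).
    Bh, Bm = Ih[centers > z_B].mean(axis=0), Im[centers > z_B].mean(axis=0)
    f_B = (Bh - 1.0) / (Bm - 1.0)
    R = np.max(Bm / Bh)                                  # max over color channels
    A = Bh * (R - 1.0 / f_B) / (1.0 - 1.0 / f_B)
    return centers, f, A

def dehaze(photo, depth, centers, f, A):
    """Eq. (2): invert the haze model per pixel, interpolating f(z) per channel."""
    out = np.empty_like(photo)
    for ch in range(3):
        f_px = np.interp(depth, centers, f[:, ch])
        out[..., ch] = A[ch] + (photo[..., ch] - A[ch]) / f_px
    return np.clip(out, 0.0, 1.0)
```

Note that, as in the text, the haze curve is the one estimated with A = 1, while the adjusted A from eq. (6) is only substituted at dehazing time.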
Note that in practice one might not want to remove the haze completely as we have done, because haze sometimes provides perceptually significant depth cues. Also, dehazing typically amplifies some noise in regions where little or no visible detail remains in the original image. Still, almost every image benefits from some degree of dehazing.

Having obtained a model for the haze in the photograph, we can insert new objects into the scene in a more seamless fashion by applying the model to these objects as well (in accordance with the depth they are supposed to be at). This is done simply by inverting eq. (2):

  $I_h = A + (I_o - A)\,f(z)$.   (7)

This is demonstrated in the companion video.

4.2 Relighting

It is hard to overstate the importance of the role that lighting plays in the creation of an interesting photograph. In particular, in landscape photography, the vast majority of breathtaking photographs are taken during the "golden hour", after sunrise, or before sunset [Reichmann 2001]. Unfortunately, most of our outdoor snapshots are taken under rather boring lighting. With Deep Photo it is possible to modify the lighting of a photograph, approximating what the scene might look like at another time of day.

Figure 5: Relighting results produced with our system (input/relighted pairs).

Figure 6: A comparison between the original photo, its relighted version, and a rendering of the underlying model under the same illumination.

As explained earlier, our goal is to work on single images, augmented with a detailed, yet not completely accurate, geometric model of the scene. This setup does not allow us to correctly recover the reflectance at each pixel. Thus, we use the following simple workflow, which only approximates the appearance of lighting changes in the scene. We begin by dehazing the image, as described in the previous section, and modulate the colors using a lightmap computed for the novel lighting. The original sky is replaced by a new one simulating the desired time of day (we use Vue 6 Infinite [E-on Software 2008] to synthesize the new sky). Finally, we add haze back in using eq. (7), after multiplying the haze curves f(z) by a global color mood transfer coefficient.

The global color mood transfer coefficient L_G is computed for each color channel. Two sky domes are computed, one corresponding to the actual (known or estimated) time of day the photograph was taken, and the other corresponding to the desired sun position. Let I_ref and I_new be the average colors of the two sky domes. The color mood transfer coefficients are then given by L_G = I_new / I_ref.

The lightmap may be computed in a variety of ways. Our current implementation offers the user a set of controls for various aspects of the lighting, including atmosphere parameters, diffuse and ambient colors, etc. We then compute the lightmap with a simple local shading model and scale it by the color mood coefficient:

  $L = L_G \cdot L_S \cdot (L_A + L_D \cdot (n \cdot l))$,   (8)

where $L_S \in [I_{shadow}, 1]$ is the shadow coefficient that indicates the amount of light attenuation due to shadows, L_A is the ambient coefficient, L_D is the diffuse coefficient, n the point normal, and l the direction to the sun. The final result is obtained simply by multiplying the image by L. Note that we do not attempt to remove the existing illumination before applying the new one.
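Below is a minimal sketch of this relighting workflow, assuming per-pixel normals and shadow coefficients rendered from the registered model, externally supplied sky-dome averages I_ref and I_new, and the haze curve from the dehazing sketch above. The function names, default coefficients, and the clamping of n·l to [0, 1] are our own assumptions, not the paper's implementation.

```python
import numpy as np

def relight(dehazed, normals, shadow, sun_dir, I_ref, I_new,
            L_A=0.3, L_D=0.7, I_shadow=0.4):
    """Approximate relighting of a dehazed photo, Eq. (8).
    dehazed: (H,W,3); normals: (H,W,3) unit normals from the registered model;
    shadow:  (H,W), 1 = fully lit, clamped below to I_shadow;
    sun_dir: (3,) unit vector towards the new sun;
    I_ref, I_new: (3,) average sky-dome colors for the original and desired time of day."""
    L_G = I_new / I_ref                                   # global color mood transfer
    n_dot_l = np.clip(normals @ sun_dir, 0.0, 1.0)        # local diffuse term (clamp is ours)
    L_S = np.clip(shadow, I_shadow, 1.0)
    L = L_G[None, None, :] * L_S[..., None] * (L_A + L_D * n_dot_l[..., None])
    return dehazed * L                                    # existing illumination is not removed

def add_haze(relit, depth, centers, f, A, L_G):
    """Eq. (7): put haze back, after scaling the haze curve by the color mood coefficient."""
    out = np.empty_like(relit)
    for ch in range(3):
        f_px = L_G[ch] * np.interp(depth, centers, f[:, ch])
        out[..., ch] = A[ch] + (relit[..., ch] - A[ch]) * f_px
    return np.clip(out, 0.0, 1.0)
```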
Despite these approximations, we found that even this basic procedure yields convincing changes in the lighting (see Figure 5, and the dynamic relighting sequences in the video). Figure 6 demonstrates that relighting a geo-registered photo generates a completely different (and more realistic) effect than simply rendering the underlying geometric model under the desired lighting.

5 Novel View Synthesis

One of the compelling features of Deep Photo is the ability to modify the viewpoint from which the original photograph was taken. Bringing the static photo to life in this manner significantly enhances the photo browsing experience, as shown in the companion video.

Assuming that the photograph has been registered with a sufficiently accurate geometric model of the scene, the challenge in changing the viewpoint is reduced to completing the missing texture in areas that are either occluded, or are simply outside the original view frustum. We use image completion [Efros and Leung 1999; Drori et al. 2003] to fill the missing areas with texture from other parts of the photograph. Our image completion process is similar to texture-by-numbers [Hertzmann et al. 2001], where instead of a hand-painted label map we use a guidance map derived from the textures of the 3D model. In rural areas these are typically aerial images of the terrain, while in urban models these are the texture maps of the buildings.

The texture is synthesized over a cylindrical layered depth image (LDI) [Shade et al. 1998], centered around the original camera position. The LDI stores, for each pixel, the depths and normals of scene points intersected by the corresponding ray from the viewpoint. We use this data structure since it is able to represent both the visible and the occluded parts of the scene (in our examples we used an LDI with four depth layers per pixel). The colors of the frontmost layer in each pixel are taken from the original photograph, provided that they are inside the original view frustum, while the remaining colors are synthesized by our guided texture transfer.

We begin the texture transfer process by computing the guiding value for all of the layers at each pixel. The guiding value is a vector (U, V, D), where U and V are the chrominance values of the corresponding point in the model texture, and D is the distance to the corresponding scene point from the location of the camera. In our experiments, we tried various other features, including terrain normal, slope, height, and combinations thereof. We achieved the best results, however, with the relatively simple feature vector above. Including the distance D in the feature vector biases the synthesis towards generating textures at the correct scale. D is normalized so that distances from 0 to 5000 meters map to [0, 1]. We only include chrominance information in the feature vector (and not luminance) to alleviate problems associated with transient features such as shading and shadows in the model textures.

Texture synthesis is carried out in a multi-resolution manner. The first (coarsest) level is synthesized by growing the texture outwards from the known regions. For each unknown pixel we examine a square neighborhood around it, and exhaustively search for the best matching neighborhood from the known region (using the L2 norm). Since our neighborhoods contain missing pixels, we cannot apply PCA compression and other speed-up structures in a straightforward way.
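As an illustration of this guided matching step, the sketch below computes the (U, V, D) guidance vector for one LDI sample and an L2 neighborhood distance restricted to already-synthesized pixels. The chrominance transform, the normalization by the number of known pixels, and the data layout are our assumptions, not details given in the paper.

```python
import numpy as np

def guidance_vector(model_rgb, dist_m):
    """Guidance feature (U, V, D) for one LDI sample: chrominance of the
    corresponding model-texture point plus normalized distance to the camera."""
    r, g, b = model_rgb
    u = -0.147 * r - 0.289 * g + 0.436 * b        # BT.601-style chrominance (assumed)
    v = 0.615 * r - 0.515 * g - 0.100 * b
    d = np.clip(dist_m / 5000.0, 0.0, 1.0)         # 0..5000 m mapped to [0, 1]
    return np.array([u, v, d])

def neighborhood_distance(patch_a, patch_b, known_mask):
    """L2 distance between two feature neighborhoods over the known pixels only.
    patch_a, patch_b: (k, k, C); known_mask: (k, k) bool."""
    diff = (patch_a - patch_b)[known_mask]
    # Normalizing by the number of known pixels keeps partially-filled
    # neighborhoods comparable to complete ones (our choice).
    return np.sum(diff ** 2) / max(known_mask.sum(), 1)
```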
Even so, the first level is sufficiently coarse and its synthesis is rather fast. To synthesize each subsequent level, we upsample the result of the previous level and perform a small number of k-coherence synthesis passes [Ashikhmin 2001] to refine the result. Here we use a 5×5 look-ahead region and k = 4. The total synthesis time is about 5 minutes per image. The total texture size is typically on the order of 4800×1600 pixels, times four layers.

It should be noted that when working with LDIs the concept of a pixel's neighborhood must be adjusted to account for the existence of multiple depth layers at each pixel. We define the neighborhood in the following way: on each depth layer, a pixel has up to 8 pixels surrounding it. If the neighboring pixel has multiple depth layers, the pixel on the layer with the closest depth value is assigned as the immediate neighbor.

To render images from novel viewpoints, we use a shader to project the LDI onto the geometric model by computing the distance of the model to the camera and using the pixel color from the depth layer closest to this distance. Significant changes in the viewpoint eventually cause texture distortions if one keeps using the texture from the photograph. To alleviate this problem, we blend the photograph's texture into the model's texture as the new virtual camera gets farther away from the original viewpoint. We found this to significantly improve the 3D viewing experience, even for drastic view changes, such as going to a bird's-eye view.

Thus, the texture color T at each terrain point x is given by

  $T(x) = g(x)\,T_{photo}(x) + (1 - g(x))\,T_{model}(x)$,   (9)

where the blending factor g(x) is determined with respect to the current view, according to the following principles: (i) pixels in the original photograph which correspond to surfaces facing the camera are considered more reliable than those on oblique surfaces; and (ii) pixels in the original photograph are also preferred whenever the corresponding scene point is viewed from the same direction in the current view as it was in the original one.

Specifically, let n(x) denote the surface normal, C_0 the original camera position from which the photograph was taken, and C_new the current camera position. Next, let $v_0 = (C_0 - x)/\lVert C_0 - x \rVert$ denote the normalized vector from the scene point to the original camera position, and similarly $v_{new} = (C_{new} - x)/\lVert C_{new} - x \rVert$. Then

  $g(x) = \max\big(n(x) \cdot v_0,\ v_{new} \cdot v_0\big)$.   (10)

In other words, g is defined as the greater among the cosine of the angle between the normal and the original view direction, and the cosine of the angle between the two view directions.

Figure 7: Extending the field of view. The red rectangle indicates the boundaries of the original photograph. The companion video demonstrates changing the viewpoint.

Finally, we also apply re-hazing on the fly. First, we remove haze from the texture completely as described in Section 4.1. Then, we add haze back in, this time using the distances from the current camera position. The results may be seen in Figure 7 and in the video.
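A compact sketch of the view-dependent blending of eqs. (9)–(10) follows; the array shapes, the clamping of g to [0, 1], and the function names are our own assumptions (the on-the-fly re-hazing is not included here).

```python
import numpy as np

def blend_weight(x, n, C0, C_new):
    """Eq. (10): g is the larger of (normal . original view direction) and
    (new view direction . original view direction).
    x: (N,3) terrain points; n: (N,3) unit normals;
    C0, C_new: (3,) original and current camera positions."""
    v0 = C0 - x
    v0 /= np.linalg.norm(v0, axis=1, keepdims=True)
    v_new = C_new - x
    v_new /= np.linalg.norm(v_new, axis=1, keepdims=True)
    g = np.maximum(np.sum(n * v0, axis=1), np.sum(v_new * v0, axis=1))
    return np.clip(g, 0.0, 1.0)                   # clamp to a valid blend factor (our choice)

def blend_textures(T_photo, T_model, g):
    """Eq. (9): per-point blend of the photograph's texture with the model's texture."""
    return g[:, None] * T_photo + (1.0 - g[:, None]) * T_model
```

At the original viewpoint g is close to 1 for front-facing surfaces, so the photograph dominates; as the virtual camera moves away, the model texture gradually takes over.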
6 Information Visualization

Having registered a photograph with a model that has GIS data associated with it allows displaying various information about the scene while browsing the photograph. We have implemented a simple application that demonstrates several types of information visualization. In this application, the photograph is shown side-by-side with a top view of the model, referred to as the map view. The view frustum corresponding to the photograph is displayed in the map view, and is updated dynamically whenever the view is changed (as described in Section 5). Moving the cursor in either of the two views highlights the corresponding location in the other view. In the map view, the user is able to switch between a street map, an orthophoto, a combination thereof, etc. In addition to text labels, it is also possible to superimpose graphical map elements, such as roads, directly onto the photo view. These abilities are demonstrated in Figures 1 and 8 and in the companion video.

Figure 8: Different information visualization modes in our system. (a–b) Coupled map and photo views. As the user moves the mouse over one of the views, the corresponding location is shown in the other view as well. The profile of a horizontal scanline in the map view (a) is shown superimposed over the terrain in the photo view (b). Since the location of the mouse cursor is occluded by a mountain in the photo, its location in the photo view is indicated using semi-transparent arrows. (c) Names of landmarks are automatically superimposed on the photo. (d–e) Coupled photo and map views with superimposed street network. The streets under the mouse cursor are highlighted in both views.

There are various databases with geo-tagged media available on the web. We are able to highlight these locations in both views (photo and map). Of particular interest are geo-tagged Wikipedia articles about various landmarks. We display a small Wikipedia icon at such locations, which, when clicked, opens a browser window with the corresponding article. This is also demonstrated in the companion video.

Another nice visualization feature of our system is the ability to highlight the object under the mouse in the photo view. This can be useful, for example, when viewing nighttime photographs: in an urban scene shot at night, the building under the cursor may be shown using daylight textures from the underlying model.

7 Discussion and Conclusions

We presented Deep Photo, a novel system for editing and browsing outdoor photographs. It leverages the high quality 3D models of the earth that are now becoming widely available. We have demonstrated that once a simple geo-registration of a photo is performed, the models can be used for many interesting photo manipulations, ranging from dehazing, rehazing, and relighting to integrating GIS information.

The applications we show are varied. Haze removal is a challenging problem due to the fact that haze is a function of depth. We have shown that once depth is available in a geo-registered photograph, excellent "haze editing" can be achieved. Similarly, having an underlying geometric model makes it possible to generate convincing relighted photographs, and dynamically change the view. Finally, we demonstrate that the enormous wealth of information available online can now be used to annotate and help browse photographs.

Within our framework we used models obtained from Virtual Earth. The manual registration is done within a minute, and matting out the foreground is also an easy task using state-of-the-art techniques such as Soft Scissors [Wang et al. 2007].

Figure 9: Failure cases: some of the described applications produce artifacts for badly registered (left) and/or insufficiently accurate models (right). In this case the dehazing application generated halos around misaligned depth edges because it used wrong depth values there. The same artifacts can be observed by zooming into the full images in Figures 2 and 3.
All other operations, such as dehazing and relighting, run at interactive speeds; however, computing very detailed shadow maps for the relighting can be time consuming.

As can be expected, there are always some differences and misalignments between the photograph and the model. These may arise due to insufficiently accurate models, and also due to the fact that the photographs were not captured with an ideal pinhole camera. Although they can lead to some artifacts (see Figure 9), we found that in many cases these differences are less problematic than one might fear. However, automatically resolving such differences is certainly a challenging and interesting topic for future work.

We believe that the applications presented here represent just a small fraction of possible geo-photo editing operations. Many of the existing digital photography products could be greatly enhanced with the use of geo information. Operations could encompass noise reduction and image sharpening with 3D model priors, post-capture refocusing, and object recovery in under- or over-exposed areas, as well as illumination transfer between photographs.

GIS databases contain a wealth of information, of which we have used just a small amount. Water, grass, pavement, building materials, etc., can all potentially be automatically labeled and used to improve photo tone adjustment. Labels can be transferred automatically from one image to others. Again, having a single consistent 3D model for our photographs provides much more than just a depth value per pixel.

In this paper we mostly dealt with single images. Most of the applications that we demonstrated become even stronger when combining multiple input photos. A particularly interesting direction might be to combine Deep Photo with the Photo Tourism system. Once a Photo Tour is geo-registered, the coarse 3D information generated by Photo Tourism could be used to enhance online 3D data and vice versa. The information visualization and novel view synthesis applications we demonstrate here could be combined with the Photo Tourism viewer. This idea of fusing multiple images could even be extended to video registered to the models.

Acknowledgements

This research was supported in part by grants from the following funding agencies: the Lion foundation, the GIF foundation, the Israel Science Foundation, and by the DFG Graduiertenkolleg 1042 "Explorative Analysis and Visualization of Large Information Spaces" at the University of Konstanz, Germany.

References

ASHIKHMIN, M. 2001. Synthesizing natural textures. Proceedings of the 2001 Symposium on Interactive 3D Graphics (I3D), 217–226.

CHEN, B., RAMOS, G., OFEK, E., COHEN, M., DRUCKER, S., AND NISTER, D. 2008. Interactive techniques for registering images to digital terrain and building models. Microsoft Research Technical Report MSR-TR-2008-115.

CHO, P. L. 2007. 3D organization of 2D urban imagery. Proceedings of the 36th Applied Imagery Pattern Recognition Workshop, 3–8.

CRIMINISI, A., REID, I. D., AND ZISSERMAN, A. 2000. Single view metrology. International Journal of Computer Vision 40, 2, 123–148.

DEBEVEC, P. E., TAYLOR, C. J., AND MALIK, J. 1996. Modeling and rendering architecture from photographs: A hybrid geometry- and image-based approach. Proceedings of SIGGRAPH '96, 11–20.

DEBEVEC, P., HAWKINS, T., TCHOU, C., DUIKER, H.-P., SAROKIN, W., AND SAGAR, M. 2000. Acquiring the reflectance field of a human face. Proceedings of SIGGRAPH 2000, 145–156.
DRORI, I., COHEN-OR, D., AND YESHURUN, H. 2003. Fragment-based image completion. ACM Transactions on Graphics (Proceedings of SIGGRAPH 2003) 22, 3, 303–312.

E-ON SOFTWARE, 2008. Vue 6 Infinite. http://www.e-onsoftware.com/products/vue/vue_6_infinite.

EFROS, A. A., AND LEUNG, T. K. 1999. Texture synthesis by non-parametric sampling. Proceedings of IEEE International Conference on Computer Vision (ICCV) '99 2, 1033–1038.

FATTAL, R. 2008. Single image dehazing. ACM Transactions on Graphics (Proceedings of SIGGRAPH 2008) 27, 3, 73.

FRÜH, C., AND ZAKHOR, A. 2003. Constructing 3D city models by merging aerial and ground views. IEEE Computer Graphics and Applications 23, 6, 52–61.

GRUEN, A., AND HUANG, T. S. 2001. Calibration and Orientation of Cameras in Computer Vision. Springer-Verlag, Secaucus, NJ, USA.

HAYS, J., AND EFROS, A. A. 2007. Scene completion using millions of photographs. ACM Transactions on Graphics (Proceedings of SIGGRAPH 2007) 26, 3, 4.

HERTZMANN, A., JACOBS, C. E., OLIVER, N., CURLESS, B., AND SALESIN, D. H. 2001. Image analogies. Proceedings of SIGGRAPH 2001, 327–340.

HOIEM, D., EFROS, A. A., AND HEBERT, M. 2005. Automatic photo pop-up. ACM Transactions on Graphics (Proceedings of SIGGRAPH 2005) 24, 3, 577–584.

HORRY, Y., ANJYO, K.-I., AND ARAI, K. 1997. Tour into the picture: using a spidery mesh interface to make animation from a single image. Proceedings of SIGGRAPH '97, 225–232.

KAFTORY, R., SCHECHNER, Y. Y., AND ZEEVI, Y. Y. 2007. Variational distance-dependent image restoration. Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2007, 1–8.

KANG, S. B. 1998. Depth painting for image-based rendering applications. Tech. rep., Compaq Cambridge Research Lab.

LALONDE, J.-F., HOIEM, D., EFROS, A. A., ROTHER, C., WINN, J., AND CRIMINISI, A. 2007. Photo clip art. ACM Transactions on Graphics (Proceedings of SIGGRAPH 2007) 26, 3, 3.

LOSCOS, C., DRETTAKIS, G., AND ROBERT, L. 2000. Interactive virtual relighting of real scenes. IEEE Transactions on Visualization and Computer Graphics 6, 4, 289–305.

MCCARTNEY, E. J. 1976. Optics of the Atmosphere: Scattering by Molecules and Particles. John Wiley and Sons, New York, NY, USA.

NARASIMHAN, S. G., AND NAYAR, S. K. 2003. Contrast restoration of weather degraded images. IEEE Transactions on Pattern Analysis and Machine Intelligence 25, 6, 713–724.

NARASIMHAN, S. G., AND NAYAR, S. K. 2003. Interactive (de)weathering of an image using physical models. IEEE Workshop on Color and Photometric Methods in Computer Vision.

NASA, 2008. The Landsat program. http://landsat.gsfc.nasa.gov/.

NASA, 2008. Shuttle Radar Topography Mission. http://www2.jpl.nasa.gov/srtm/.

NAYAR, S. K., AND NARASIMHAN, S. G. 1999. Vision in bad weather. Proceedings of IEEE International Conference on Computer Vision (ICCV) '99, 820–827.

NISTER, D., AND STEWENIUS, H. 2007. A minimal solution to the generalised 3-point pose problem. Journal of Mathematical Imaging and Vision 27, 1, 67–79.

OAKLEY, J. P., AND SATHERLEY, B. L. 1998. Improving image quality in poor visibility conditions using a physical model for contrast degradation. IEEE Transactions on Image Processing 7, 2, 167–179.

OH, B. M., CHEN, M., DORSEY, J., AND DURAND, F. 2001. Image-based modeling and photo editing. Proceedings of ACM SIGGRAPH 2001, 433–442.

REICHMANN, M., 2001. The art of photography. http://www.luminous-landscape.com/essays/theartof.shtml.

SCHECHNER, Y. Y., AND AVERBUCH, Y. 2007. Regularized image recovery in scattering media. IEEE Transactions on Pattern Analysis and Machine Intelligence 29, 9, 1655–1660.
SCHECHNER, Y. Y., NARASIMHAN, S. G., AND NAYAR, S. K. 2003. Polarization-based vision through haze. Applied Optics 42, 3, 511–525.

SHADE, J., GORTLER, S., HE, L.-W., AND SZELISKI, R. 1998. Layered depth images. Proceedings of SIGGRAPH '98, 231–242.

SHUM, H.-Y., HAN, M., AND SZELISKI, R. 1998. Interactive construction of 3-D models from panoramic mosaics. Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 1998, 427–433.

SNAVELY, N., SEITZ, S. M., AND SZELISKI, R. 2006. Photo tourism: exploring photo collections in 3D. ACM Transactions on Graphics (Proceedings of SIGGRAPH 2006) 25, 3, 835–846.

STAMOS, I., AND ALLEN, P. K. 2000. 3-D model construction using range and image data. Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2000, 531–536.

SUNKAVALLI, K., MATUSIK, W., PFISTER, H., AND RUSINKIEWICZ, S. 2007. Factored time-lapse video. ACM Transactions on Graphics (Proceedings of SIGGRAPH 2007) 26, 3, 101.

TAN, R. T. 2008. Visibility in bad weather from a single image. Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2008, to appear.

TOYAMA, K., LOGAN, R., AND ROSEWAY, A. 2003. Geographic location tags on digital images. Proceedings of the 11th ACM International Conference on Multimedia, 156–166.

WANG, J., AGRAWALA, M., AND COHEN, M. F. 2007. Soft scissors: an interactive tool for realtime high quality matting. ACM Transactions on Graphics (Proceedings of SIGGRAPH 2007) 26, 3.

YU, Y., AND MALIK, J. 1998. Recovering photometric properties of architectural scenes from photographs. Proceedings of SIGGRAPH '98, 207–217.

YU, Y., DEBEVEC, P., MALIK, J., AND HAWKINS, T. 1999. Inverse global illumination: recovering reflectance models of real scenes from photographs. Proceedings of SIGGRAPH '99, 215–224.

ZHANG, L., DUGAS-PHOCION, G., SAMSON, J.-S., AND SEITZ, S. M. 2002. Single-view modelling of free-form scenes. The Journal of Visualization and Computer Animation 13, 4, 225–235.