Hallucinated Neural Radiance Fields in the Wild
Authors: Chen, Xingyu; Zhang, Qi; Li, Xiaoyu; Chen, Yue; Feng, Ying; Wang, Xuan; Wang, Jue
Date: 2021-11-30

Neural Radiance Fields (NeRF) has recently gained popularity for its impressive novel view synthesis ability. This paper studies the problem of hallucinated NeRF: i.e., recovering a realistic NeRF at a different time of day from a group of tourism images. Existing solutions adopt NeRF with a controllable appearance embedding to render novel views under various conditions, but they cannot render view-consistent images with an unseen appearance. To solve this problem, we present an end-to-end framework for constructing a hallucinated NeRF, dubbed Ha-NeRF. Specifically, we propose an appearance hallucination module to handle time-varying appearances and transfer them to novel views. Considering the complex occlusions in tourism images, we introduce an anti-occlusion module to accurately separate the static subjects from transient occluders. Experimental results on synthetic data and real tourism photo collections demonstrate that our method can hallucinate the desired appearances and render occlusion-free images from different views. The project and supplementary materials are available at https://rover-xingyu.github.io/Ha-NeRF/.

In recent years, synthesizing photo-realistic novel views of a scene has become a research hotspot along with the rapid development of neural rendering technologies. Imagine you want to visit the Brandenburg Gate in Berlin and enjoy the landscape at different times and in different weather, but you cannot because of the coronavirus pandemic. For this hallucinated experience to be as engaging as possible, photo-realistic images from different views that can change with the weather, time, and other factors are necessary. To achieve this, Neural Radiance Fields (NeRF) [33] and its follow-up methods [25, 40, 58] have shown a remarkable capacity to recover 3D geometry and appearance, giving the user an immersive feeling of physically being there. However, one significant drawback of NeRF is that it requires a group of images without variable illumination or moving objects, i.e., the radiance of the scene must be constant and visible from each view. Unfortunately, most images of tourist landmarks are internet photos captured at different times and occluded by various objects. Most NeRF-based methods integrate variable appearances and transient occluders into the 3D volume when they occur, which corrupts the real scene in the volume. How to synthesize occlusion-free views from images with variable appearances and occluders remains an open problem. Martin-Brualla et al. [28] attempt to tackle the aforementioned problem with a NeRF in the Wild method (NeRF-W). They optimize an appearance embedding for each input image to address variable appearances and use a transient volume to separate static components from their occluders. Compared to NeRF, NeRF-W takes a step towards recovering a realistic world from tourism images with variable appearances and occluders. However, NeRF-W implements controllable appearance through embeddings optimized on the training samples, so it must re-optimize the embedding when given a new image and cannot hallucinate an appearance from other datasets.
Furthermore, NeRF-W tries to optimize a transient volume for each input image with a transient embedding as input, which is highly ill-posed due to the randomness of transient occluders. This leads to an inaccurate decomposition of the scene and further entangles appearances with occlusion, e.g., causing the transient volume to memorize the sunset glow. To address these limitations, we present a hallucinated NeRF (Ha-NeRF) framework that can hallucinate the realistic radiance field from unconstrained tourist images with variable appearances and occluders, as shown in Fig. 1. For appearance hallucination, we propose a CNN-based appearance encoder and a view-consistent appearance loss to transfer consistent photometric appearance across different views. This design gives our method the flexibility to transfer the appearance of unseen images. For anti-occlusion, we utilize an MLP to learn an image-dependent 2D visibility mask with an anti-occlusion loss that can automatically separate the static components with high accuracy during training. Experiments on several landmarks confirm the superiority of the proposed method in terms of appearance hallucination and anti-occlusion. Our contributions can be summarized as follows: 1. Ha-NeRF is proposed to recover appearance-hallucinated radiance fields from a group of images with variable appearances and occluders. 2. An appearance hallucination module is developed to transfer view-consistent appearance to novel views. 3. An anti-occlusion module is modeled image-dependently to perceive ray visibility. Novel View Synthesis. Rendering photo-realistic images is at the heart of computer vision and has been the focus of decades of research. Traditionally, view synthesis could be considered an image-based warping task combined with geometric structure [49], such as implicit geometry from dense images [4, 10, 15, 24, 29] and explicit geometry [5, 11, 17, 18, 36]. Recent works have used unconstrained photo collections to explicitly infer the light and reflectance of the objects in the scene [22, 48]. Others make use of semantic information to handle transient objects [39]. With the advancement of deep learning, many approaches have applied deep learning techniques to improve the performance of view synthesis. Researchers have combined convolutional neural networks with scene geometry to predict depth or planar homographies for novel view synthesis [7, 19, 26, 35, 56, 61]. Inspired by layered depth images [47], recent works exploit explicit scene representations (e.g., multi-plane images, multi-sphere images) and render novel views using alpha compositing [3, 12, 32, 51, 55, 60]. More recently, researchers have focused on the challenging problem of learning implicit functions (e.g., encoded features, NeRF) to represent scenes for novel view synthesis [33, 44, 45, 58]. Neural Rendering. Neural rendering [53] is closely related and combines ideas from classical computer graphics and deep learning to create algorithms for synthesizing images and reconstructing geometry from real-world observations. Several works present different ways to inject learning components into the rendering pipeline, such as learned latent textures [54], point clouds [1, 9], occupancy fields [31], and signed distance functions [38]. Based on an image translation network, Meshry et al.
[30] learn a neural re-rendering network, conditioned on a learned latent appearance embedding, to re-render point clouds for view synthesis. However, the use of an image translation network leads to temporal artifacts that are visible under camera motion. With the development of volume rendering [27, 33, 50], it has become easier to render realistic and consistent views. Mildenhall et al. [33] propose Neural Radiance Fields (NeRF) and use a multi-layer perceptron (MLP) to represent a radiance field. Many follow-up works extend NeRF to dynamic scenes [6, 25, 40, 58], fast training and rendering [8, 13, 43, 57], and scene editing [2, 28, 34, 59]. Martin-Brualla et al. [28] propose NeRF in the Wild (NeRF-W), which handles variable appearance and occlusion via a static volume and a transient volume respectively, but it fails in some scenes: the transient volume is often used to explain dramatic changes in appearance, such as view-dependent lighting. Besides, while NeRF-W implements controllable appearance, it can hardly hallucinate consistent views for an appearance that has never been seen. Appearance Transfer. A given scene can take on dramatically diverse appearances in different weather conditions and at different times. Garg et al. [14] observe that the dimensionality of scene appearance in tourist images captured at the same position is relatively low, except for outliers like transient objects. One can recover appearance for a photo collection by estimating coherent albedos across the collection [22], isolating surface albedo and scene illumination from the shape recovery [21], retrieving the sun's location through timestamps and geolocation [16], or assuming a fixed view [52]. However, these methods assume simple lighting models that do not apply to nighttime scene appearance. Radenovic et al. [42] recover distinct day and night reconstructions, but are unable to achieve a smooth gradation of appearance from day to night. Park et al. [37] propose an efficient technique to optimize the appearance of a collection of images depicting a common scene. Meshry et al. [30] use a data-driven implicit representation of appearance learned from the input image distribution, while Martin-Brualla et al. [28] extend this data-driven approach to NeRF and optimize an appearance latent code for each view to make the appearance controllable. In contrast, the proposed method learns appearance features that are disentangled from the view, which means it can consistently hallucinate novel views with an unlearnt appearance. We first introduce Neural Radiance Fields (NeRF) [33], which Ha-NeRF extends. NeRF represents a scene using a continuous volumetric function F_θ that is modeled as a multilayer perceptron (MLP). It takes a 3D location x = (x, y, z) and a 2D viewing direction d = (α, β) as input and outputs an emitted color c = (r, g, b) and volume density σ as:

[σ, z] = F_θ1(γ_x(x)), c = F_θ2(z, γ_d(d)), (1)

where θ = (θ_1, θ_2) are the MLP parameters, z is an intermediate feature, and γ_x(·) and γ_d(·) are the positional encoding functions applied to each of the values in x and d, respectively. To render the color of a ray passing through the scene, NeRF approximates the volume rendering integral using numerical quadrature. Let r(t) = o + td be the ray emitted from the camera center o through a given pixel on the image plane. The approximation of the color Ĉ(r) of the pixel is:

Ĉ(r) = Σ_{k=1}^{K} T_k (1 − exp(−σ_k δ_k)) c_k, with T_k = exp(−Σ_{k'=1}^{k−1} σ_{k'} δ_{k'}),

where c_k and σ_k are the color and density at point r(t_k), and δ_k = t_{k+1} − t_k is the distance between two adjacent quadrature points.
Stratified sampling is used to select the quadrature points {t_k}, k = 1, ..., K, between the near and far planes of the camera. Intuitively, the alpha value 1 − exp(−σ_k δ_k) can be interpreted as the probability of the ray terminating at location r(t_k), and T_k corresponds to the accumulated transmittance along the ray from the near plane to r(t_k). To optimize the MLP parameters, NeRF minimizes the sum of squared errors between an image collection and the corresponding rendered output. Each image I_i is registered with its intrinsic and extrinsic camera parameters, which can be estimated using structure-from-motion algorithms. NeRF precomputes the set of camera rays {r_ij} at pixel j of image I_i, with each ray r_ij(t) = o_i + t d_ij passing through the 3D location o_i with direction d_ij. All parameters are optimized by minimizing the following loss:

L = Σ_ij ‖Ĉ(r_ij) − C(r_ij)‖_2^2,

where C(r_ij) is the observed color of ray j in image I_i. Given a photo collection of a scene with varying appearances and transient occluders, we aim to reconstruct a scene whose appearance can be hallucinated from a new shot while handling occlusion. That is, we can modify the appearance of the whole 3D scene according to a new view captured under a different photometric condition. More specifically, taking a photo in the wild as input, we reconstruct an appearance-independent NeRF modulated by an appearance embedding encoded by a convolutional neural network in Sec. 4.1. To address the transient occluders in the photo, we propose an occlusion handling module that separates the static scene automatically in Sec. 4.2. Fig. 2 illustrates the overview of the proposed architecture. Next, we elaborate on each module.

[Figure 3. Illustration of the view-consistent loss. Given an example image I_i, we use a CNN to encode it into an appearance latent vector a_i. We sample camera rays in another view to render the hallucinated image I_i^r conditioned on a_i. We encourage the reconstructed appearance vector a_r encoded from the hallucinated image to be the same as a_i, since it is a global representation across different views.]

To achieve the hallucination of a 3D scene according to a new shot from inputs with varying appearances, the core problems are how to disentangle the scene geometry from appearances and how to transfer a new appearance to the reconstructed scene. NeRF-W [28] uses an optimized appearance embedding to explain the image-dependent appearances in the input. However, this embedding must be optimized during training, so hallucinating the scene from a new shot beyond the training samples requires further optimization, and NeRF-W cannot hallucinate an appearance from other datasets. Therefore, we propose to learn disentangled appearance representations using a convolutional-neural-network-based encoder E_φ, whose parameters φ account for the varying lighting and photometric post-processing in the input. E_φ encodes each image I_i into an appearance latent vector a_i. The radiance c in Eq. 1 is extended to an appearance-dependent radiance c^{a_i}, which introduces a dependency of the emitted color on the appearance latent vector a_i:

c^{a_i} = F_θ2(z, γ_d(d), a_i).

The parameters φ of the appearance encoder E_φ are learned alongside the parameters θ of the radiance field F_θ. This appearance encoder gives our method the flexibility to use the appearance of images beyond the training set.
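To make these components concrete, the sketch below shows, in PyTorch, how an appearance-conditioned radiance field and the quadrature-based compositing described above could be wired together. It is a minimal illustration under assumed layer sizes, embedding dimensions, and helper names (positional_encoding, AppearanceNeRF, and composite are ours, not a published API), and it shortens the 8-layer trunk for brevity.

```python
# Minimal PyTorch sketch (not the authors' code): an appearance-conditioned
# radiance field (Eq. 1 extended with a_i) and the volume-rendering quadrature.
import math
import torch
import torch.nn as nn

def positional_encoding(x, n_freqs):
    """gamma(x): concatenated sin/cos features at exponentially growing frequencies."""
    freqs = (2.0 ** torch.arange(n_freqs, device=x.device)) * math.pi
    angles = x[..., None] * freqs                                  # (..., dim, n_freqs)
    return torch.cat([angles.sin(), angles.cos()], -1).flatten(-2)

class AppearanceNeRF(nn.Module):
    """F_theta1 maps gamma_x(x) to density and a feature z;
    F_theta2 maps (z, gamma_d(d), a_i) to an appearance-dependent color."""
    def __init__(self, pos_dim=60, dir_dim=24, a_dim=48, width=256):
        super().__init__()
        self.trunk = nn.Sequential(                                # shortened; the paper uses 8 layers
            nn.Linear(pos_dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU())
        self.sigma_head = nn.Linear(width, 1)
        self.color_head = nn.Sequential(
            nn.Linear(width + dir_dim + a_dim, 128), nn.ReLU(),
            nn.Linear(128, 3), nn.Sigmoid())

    def forward(self, x_enc, d_enc, a):
        z = self.trunk(x_enc)
        sigma = torch.relu(self.sigma_head(z))                     # (..., N_samples, 1)
        color = self.color_head(torch.cat([z, d_enc, a], -1))      # (..., N_samples, 3)
        return color, sigma

def composite(colors, sigmas, t_vals):
    """Numerical quadrature of the rendering integral: C_hat(r) = sum_k T_k alpha_k c_k."""
    deltas = t_vals[..., 1:] - t_vals[..., :-1]                    # delta_k = t_{k+1} - t_k
    deltas = torch.cat([deltas, 1e10 * torch.ones_like(deltas[..., :1])], -1)
    alphas = 1.0 - torch.exp(-sigmas.squeeze(-1) * deltas)         # ray termination probability
    trans = torch.cumprod(1.0 - alphas + 1e-10, -1)                # running transmittance
    t_k = torch.cat([torch.ones_like(trans[..., :1]), trans[..., :-1]], -1)
    return (t_k * alphas).unsqueeze(-1).mul(colors).sum(-2)        # (..., 3)
```

A full pipeline would additionally draw stratified t_vals per ray and use NeRF's hierarchical coarse-to-fine sampling, which is omitted here for clarity.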
However, disentangling appearance from viewing direction with unpaired images is inherently ill-posed and requires additional constraints. Inspired by recent works [20, 23, 62] that exploit a latent regression loss to encourage an invertible mapping between image space and latent space, we propose a view-consistent loss L_v to achieve the disentanglement of appearance and view: we take an appearance vector a_i from the appearance encoder E_φ and attempt to reconstruct it from different views, which is formulated as:

L_v = ‖E_φ(I_i^r) − a_i‖_2^2,

where I_i^r is a rendered image whose view is randomly generated and whose appearance is conditioned on the image I_i, as shown in Fig. 3. Here we assume that the reconstructed appearance vector E_φ(I_i^r) should be the same as the original appearance vector a_i, since the appearance vector is a global representation across different views. Owing to the view-consistent loss, we can perform view-consistent appearance rendering given the same appearance vector as input. In addition, the view-consistent loss prevents the image geometry content from being encoded into the appearance vector, since rendered images from different views (and hence different content) are encoded to the same vector when the volume is conditioned on that vector. To improve efficiency, we sample a grid of rays and combine them into the image I_i^r instead of rendering a whole image during training [46]. This is based on the assumption that the global appearance vector of an image remains unchanged after sampling with a random grid. Instead of using a 3D transient field to reconstruct transient phenomena that are only observed in an individual image, as in [28], we eliminate the transient phenomena using an image-dependent 2D visibility map. This simplification gives our method a more accurate segmentation between the static scene and transient objects. To model the map, we employ an implicit continuous function F_ψ, which maps a 2D pixel location p = (u, v) and an image-dependent transient embedding τ_i to a visibility probability M:

M = F_ψ(p, τ_i).

We train the visibility map, which indicates the visibility of rays originating from the static scene, to disentangle static and transient phenomena of the images in an unsupervised manner with an occlusion loss L_o:

L_o = Σ_j M_ij ‖Ĉ(r_ij) − C(r_ij)‖_2^2 + λ_o Σ_j (1 − M_ij).

The first term is the reconstruction error that takes pixel visibility into account when comparing rendered and ground-truth colors. Larger values of the visibility probability M increase the importance assigned to a pixel, under the assumption that it belongs to the static phenomena. The first term is balanced by the second, which corresponds to a regularizer with a multiplier λ_o on the invisibility probability, and this discourages the model from turning a blind eye to static phenomena. To achieve Ha-NeRF, we combine the aforementioned constraints and jointly train the parameters (θ, φ, ψ) and the per-image transient embedding to optimize the full objective:

L = L_o + L_v.

[Figure 4. Qualitative results on the constructed datasets ("Brandenburg Gate", "Sacre Coeur", "Trevi Fountain"). Ha-NeRF encodes the appearances and transfers them to novel views photo-realistically (e.g., blue sky and sunshine in "Sacre Coeur", plants in "Brandenburg Gate", light reflection in "Trevi Fountain"). Besides, Ha-NeRF removes transient occlusions to render a consistent 3D scene geometry (e.g., the square and pillars in "Brandenburg Gate").]

Our implementation of NeRF and NeRF-W follows [41].
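As part of the implementation discussion, the following is a rough PyTorch sketch of the two training losses above and the visibility MLP F_ψ. It is an illustrative reconstruction under stated assumptions (network sizes, the λ_o value, and helper names such as VisibilityMLP are ours), not the released code.

```python
# Hedged sketch of the Ha-NeRF training losses; shapes and hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisibilityMLP(nn.Module):
    """F_psi: maps a 2D pixel location p = (u, v) and a per-image transient
    embedding tau_i to the probability M that the ray sees the static scene."""
    def __init__(self, tau_dim=16, width=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 + tau_dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, 1), nn.Sigmoid())

    def forward(self, uv, tau):
        return self.net(torch.cat([uv, tau], -1))                  # (N_rays, 1)

def view_consistent_loss(appearance_encoder, rendered_grid, a_i):
    """L_v: the appearance vector recovered from a hallucinated view should
    match the vector a_i that conditioned the rendering (Fig. 3)."""
    return F.mse_loss(appearance_encoder(rendered_grid), a_i)

def occlusion_loss(pred_rgb, gt_rgb, visibility, lambda_o=0.1):
    """L_o: visibility-weighted reconstruction error plus a regularizer that
    penalizes labeling pixels as invisible (lambda_o is an assumed value)."""
    recon = (visibility * (pred_rgb - gt_rgb) ** 2).sum(-1).mean()
    return recon + lambda_o * (1.0 - visibility).mean()

# Per training batch, the full objective combines both terms:
#   M = visibility_mlp(pixel_uv, tau_i)        # tau_i: per-image transient embedding
#   loss = occlusion_loss(pred_rgb, gt_rgb, M) \
#        + view_consistent_loss(E_phi, rendered_grid, a_i)
```

In this sketch the rendered_grid is the sparse random grid of rays described above, which keeps the view-consistent term inexpensive during training.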
The static neural radiance field F_θ consists of 8 fully-connected layers with 256 channels followed by ReLU activations to generate σ, plus one additional 128-channel fully-connected layer with a sigmoid activation to output the appearance-dependent RGB color c. The appearance encoder E_φ consists of 5 convolution layers followed by an adaptive average pooling layer and a fully-connected layer that produces the appearance vector. To evaluate the performance of Ha-NeRF in the wild, we construct three datasets called "Brandenburg Gate", "Sacre Coeur", and "Trevi Fountain" from the Phototourism dataset, which consists of internet photo collections of cultural landmarks. We downsample all images by a factor of 2 during training. Baselines. We evaluate our proposed method against NeRF, NeRF-W, and two ablations of Ha-NeRF: Ha-NeRF(A) and Ha-NeRF(T). Ha-NeRF(A) (appearance) builds upon our full model by eliminating the visibility network F_ψ, while Ha-NeRF(T) (transient) removes the appearance encoder E_φ from the full model. Ha-NeRF is the complete model of our method.

[Figure 5. Images whose viewing direction is the same as that of the leftmost-column content images, with appearance conditioned on the top-row example appearance images.]

Comparisons. We evaluate our method and the baselines on the task of novel view synthesis. All methods use the same set of input views to train the parameters and embeddings for each scene, except NeRF-W, which uses the left half of each test image to optimize the appearance embedding for the test set, since it cannot hallucinate a new appearance without optimization during training. We present rendered images for visual inspection and report quantitative results based on PSNR, SSIM, and LPIPS. Fig. 4 shows qualitative results for all models and baselines on a subset of scenes. NeRF suffers from ghosting artifacts and global color shifts. NeRF-W produces more accurate 3D reconstructions and is able to model varying photometric effects. However, it still suffers from blur artifacts, like the fog effect around the peristyle of "Brandenburg Gate". This fog effect is the consequence of NeRF-W's attempt to estimate a 3D transient field to reconstruct the transient phenomena, even though the transient objects are only observed in a single image. At the same time, renderings from NeRF-W also tend to exhibit different appearances compared to the ground truth, such as the sunshine and the blue sky in "Sacre Coeur" and the light reflection in "Trevi Fountain". Ha-NeRF(A) has a more consistent appearance, such as the blue sky at the top of "Sacre Coeur". However, it is unable to reconstruct high-frequency details due to occlusion. In contrast, Ha-NeRF(T) is able to reconstruct occluded structures such as the square of "Brandenburg Gate", but is unable to model varying photometric effects. Ha-NeRF has the benefits of both ablations and thereby produces better appearance and anti-occlusion renderings. Quantitative results are summarized in Table 1. Optimizing NeRF on photo collections in the wild leads to particularly poor results that cannot compete with NeRF-W. In contrast, Ha-NeRF achieves competitive PSNR and SSIM compared to NeRF-W while outperforming the others on LPIPS across all datasets. Note that this comparison is unfavorable to our method: to transfer the appearance of the test images, NeRF-W optimizes appearance vectors on a subset of the test images during training, whereas Ha-NeRF does not use any test images during training.
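At test time, Ha-NeRF only needs to encode the appearance of a new example photo with the learned encoder E_φ and condition rendering on the resulting vector; no per-image optimization is required. A hedged sketch of this example-guided transfer, reusing the hypothetical AppearanceNeRF and composite helpers from the earlier snippets, might look as follows.

```python
# Illustrative test-time appearance transfer (assumed components, not a published API).
import torch

@torch.no_grad()
def render_with_example_appearance(model, appearance_encoder, example_image, view_batches):
    """Encode one unseen example photo, then condition every rendered ray on that vector."""
    a = appearance_encoder(example_image.unsqueeze(0))        # (1, a_dim); no optimization step
    frames = []
    for x_enc, d_enc, t_vals in view_batches:                 # encoded sample points per novel view
        a_exp = a.expand(*x_enc.shape[:-1], -1)               # broadcast to every sample point
        colors, sigmas = model(x_enc, d_enc, a_exp)
        frames.append(composite(colors, sigmas, t_vals))      # (N_rays, 3) per view
    return frames
```

Because the appearance vector comes from E_φ rather than a per-image embedding table, the example photo may also come from a completely different dataset, which is exactly the cross-appearance setting discussed below.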
Despite never using test images during training, our method still produces competitive results compared with NeRF-W. Moreover, NeRF-W exhibits view inconsistency: as the camera moves, renderings conditioned on the same appearance embedding show an inconsistent appearance, which is not reflected by the current metrics. We provide a consistency comparison between NeRF-W and Ha-NeRF in the supplemental material.

[Figure 6. Hallucination in the "Trevi Fountain" dataset with high-frequency appearance information, such as sunshine and colored light reflection. The images share the viewing direction of the leftmost-column content images, with appearance conditioned on the top-row example appearance images.]

[Figure 7. Images rendered from a fixed camera position with appearance interpolated between appearance 1 and appearance 2 (NeRF-W and NeRF-W w/ T).]

Appearance Hallucination. By conditioning the color on the latent vector a_i, we can modify the lighting and appearance of a rendering without altering the underlying 3D geometry. Meanwhile, encoding appearance with the encoder E_φ allows our framework to perform example-guided appearance transfer. In Fig. 5, we show rendered images produced by Ha-NeRF using different appearance vectors extracted from example images. We also show the results of NeRF-W, whose appearance vectors are optimized during training. Notice that Ha-NeRF hallucinates realistic images, while NeRF-W suffers from global color shifts compared with the example images. Moreover, Fig. 6 shows that Ha-NeRF can capture high-frequency appearance information and hallucinate the sunshine and colored light reflection of the scene. Ha-NeRF can also interpolate the appearance vectors to obtain other hallucinations. In Fig. 7, we present five images rendered from a fixed camera position, where we interpolate the appearance vectors encoded from the leftmost and rightmost images. Note that the appearance of the rendered images transitions smoothly between the two endpoints with Ha-NeRF. However, the interpolated results of NeRF-W completely ignore the sunset glow. Furthermore, when we add the transient field of NeRF-W during its rendering (NeRF-W w/ T), the sunset glow appears. This reveals that NeRF-W cannot disentangle the variable appearance (sunset glow) from the transient phenomena (people) well. Cross-Appearance Hallucination. We can perform appearance transfer with a user-provided example image from a different dataset. As shown in Fig. 8, we hallucinate a new appearance for "Brandenburg Gate" conditioned on an example image of "Trevi Fountain". We can even transfer appearance from a radically different scene, as shown in Fig. 9, where there is a large domain gap between the appearance images and the scenes. We note that NeRF-W inherently cannot hallucinate an appearance from other datasets, because NeRF-W needs to optimize the appearance vectors on the example images, which must depict the same place. Occlusion Handling. We eliminate transient phenomena using an image-dependent 2D visibility map, while NeRF-W uses a 3D transient field to reconstruct the transient objects. As illustrated in Fig. 10, our occlusion handling method generates an accurate segmentation between the static scene and transient objects, which allows us to render occlusion-free images.
However, NeRF-W inaccurately decomposes the scene (e.g., the board, people, and fence still remain in the renderings of NeRF-W) and further entangles the variable appearance with the transient occlusion in the 3D transient field (e.g., the transient volume memorizes the white cloud of "Brandenburg Gate"). Limitations. Like most NeRF-based approaches, the proposed Ha-NeRF suffers from noisy camera extrinsic parameters. Additionally, the quality of the synthesized images degrades when the input images are motion-blurred or defocused. Specific techniques would have to be developed to handle these issues.

[Figure 10. Anti-occlusion renderings of Ha-NeRF and NeRF-W. "NeRF-W Transient" shows renderings of the 3D transient field of NeRF-W, which tries to reconstruct transient objects that are only observed in an individual image. "Ha-NeRF Visibility" shows our 2D visibility map, learned to disentangle static and transient phenomena of the images and indicating the visibility of rays originating from the static scene.]

NeRF has grown in prominence and has been utilized in various applications, including the recovery of NeRF from tourism images. While NeRF-W works effectively with an appearance embedding optimized on the training data, it can hardly hallucinate novel views consistently at an unlearnt appearance. To overcome this challenging problem, we present Ha-NeRF, which can hallucinate a realistic radiance field under variable appearances and complex occlusions. Specifically, we propose an appearance hallucination module to handle time-varying appearances and transfer them to novel views. Furthermore, we employ an anti-occlusion module to learn an image-dependent 2D visibility mask capable of accurately separating static subjects. Experimental results using synthetic data and tourism photo collections demonstrate that our method can render occlusion-free views and hallucinate the desired appearances. Code and models will be made publicly available to the research community to facilitate reproducible research.

References
Neural point-based graphics
Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields
Immersive light field video with a layered mesh representation
Unstructured lumigraph rendering
Abdelaziz Djelouah, and George Drettakis. A bayesian approach for selective image-based rendering using superpixels
Mvsnerf: Fast generalizable radiance field reconstruction from multi-view stereo
Extreme view synthesis
Differentiable surface rendering via non-differentiable sampling
Neural point cloud rendering via multi-plane projection
Unstructured light fields
Modeling and rendering architecture from photographs: A hybrid geometry- and image-based approach
Deepview: View synthesis with learned gradient descent
Fastnerf: High-fidelity neural rendering at 200fps
The dimensionality of scene appearance
The lumigraph
Reasoning about photo collections using models of outdoor illumination
Instant 3D photography
Deep blending for free-viewpoint image-based rendering
Multimodal unsupervised image-to-image translation
Multi-view inverse rendering under arbitrary illumination and albedo
Coherent intrinsic images from photo collections
Diverse image-to-image translation via disentangled representations
Light field rendering
Neural scene flow fields for space-time view synthesis of dynamic scenes
Geometry-aware deep network for single-image novel view synthesis
Learning dynamic renderable volumes from images
Alexey Dosovitskiy, and Daniel Duckworth. Nerf in the wild: Neural radiance fields for unconstrained photo collections
Plenoptic modeling: An image-based rendering system
Neural rerendering in the wild
Implicit surface representations as layers in neural networks
Local light field fusion: Practical view synthesis with prescriptive sampling guidelines
Nerf: Representing scenes as neural radiance fields for view synthesis
Giraffe: Representing scenes as compositional generative neural feature fields
3d ken burns effect from a single image
A system for acquiring, processing, and rendering panoramic light field stills for virtual reality
Efficient and robust color consistency for community photo collections
Deepsdf: Learning continuous signed distance functions for shape representation
Augmenting crowd-sourced 3d reconstructions using semantic detections
D-nerf: Neural radiance fields for dynamic scenes
Nerf pl: a pytorch-lightning implementation of nerf
From dusk till dawn: Modeling in the dark
Speeding up neural radiance fields with thousands of tiny mlps
Free view synthesis
Stable view synthesis
Graf: Generative radiance fields for 3d-aware image synthesis
Layered depth images
The visual turing test for scene reconstruction
Review of image-based rendering techniques
Scene representation networks: Continuous 3d-structure-aware neural scene representations
Pushing the boundaries of view extrapolation with multiplane images
Factored time-lapse video
State of the art on neural rendering
Deferred neural rendering: Image synthesis using neural textures
Single-view view synthesis with multiplane images
Synsin: End-to-end view synthesis from a single image
Plenoctrees for real-time rendering of neural radiance fields
pixelnerf: Neural radiance fields from one or few images
Editable free-viewpoint video using a layered neural representation
Stereo magnification: learning view synthesis using multiplane images
View synthesis by appearance flow
Multimodal image-to-image translation by enforcing bi-cycle consistency