title: Casual 6-DoF: Free-Viewpoint Panorama Using a Handheld 360° Camera
authors: Chen, Rongsen; Zhang, Fang-Lue; Finnie, Simon; Chalmers, Andrew; Rhee, Taehyun
date: 2022-03-31

Six degrees-of-freedom (6-DoF) video provides telepresence by enabling users to move around in the captured scene with a wide field of regard. Compared to methods requiring sophisticated camera setups, image-based rendering based on photogrammetry can work with images captured from arbitrary poses, which is more suitable for casual users. However, existing image-based rendering methods operate on perspective images. When used to reconstruct 6-DoF views, they often require capturing hundreds of images, making data capture a tedious and time-consuming process. In contrast to traditional perspective images, 360° images capture the entire surrounding view in a single shot, providing a faster capturing process for 6-DoF view reconstruction. This paper presents a novel method to provide 6-DoF experiences over a wide area using an unstructured collection of 360° panoramas captured by a conventional 360° camera. Our method consists of 360° data capturing, novel depth estimation to produce a high-quality spherical depth panorama, and high-fidelity free-viewpoint generation. We compared our method against state-of-the-art methods, using data captured in various environments. Our method shows better visual quality and robustness in the tested scenes.

Recent advancements in Virtual and Mixed Reality (VR/MR) have led to a surge in popularity of 360° panoramic media. They are well suited for VR/MR due to their wide field-of-view (FoV), which provides complete rotational freedom. In its current form, however, 360° media lacks freedom for translational motion, which can break the user's immersion [36], [39]. Recent research on 6 degrees-of-freedom (6-DoF) media has been able to generate motion parallax in accordance with user motion. However, current approaches, such as Facebook's Manifold RED camera [31], Google's Welcome to Light Fields [27], and layered meshes [6], require sophisticated camera setups that additionally rely on professional capturing devices. The motion parallax generated by these methods is constrained to a small area, so they are limited in providing free-viewpoint navigation over moderate to large distances. Image-Based Rendering (IBR) methods [37], [38] based on photogrammetry [18], [35], on the other hand, only require an unstructured set of overlapping images, which is more accessible to conventional (casual) users. These approaches can operate over larger areas due to their flexible camera setup and recovered geometric information. Inside-Out [15] synthesizes high-quality free-viewpoint navigation using unstructured sets of RGBD images to enable 6-DoF. The method was further refined in DeepBlending [14], where the captured depth maps were replaced with a depth estimation. These methods are still, however, limited by their use of perspective cameras, requiring large quantities of images to accurately capture full environments. This process is time-consuming, complex, and unpleasant for the casual user. The spherical 360° camera, in contrast to the perspective camera, has a complete view of the environment in each image (a perspective camera would normally take 5-8 shots to cover the same area).
Using a 360° camera for the capturing process would greatly simplify data capture for IBR-based 6-DoF methods. The use of 360° cameras in view synthesis has been explored recently in OmniPhotos [4]. However, their approach is limited to generating views within a structured circle, restricting the user's range of motion and eliminating all vertical motion. In this paper, we overcome these limitations with the use of casually captured panoramas from a single 360° camera. Our method provides real-time 6-DoF viewing experiences using 360° panoramic images captured by a handheld 360° camera (Fig. 1). Given an unstructured collection of monocular 360° panoramic images, our novel panoramic view synthesis method can synthesize panoramic images from novel viewpoints (points in 3D space where no image was previously captured) at 30 fps. Our method starts with an offline process to recover the orientation and position of each input panorama. We then recover the sparse and dense depth panoramas of the scene. Unlike previous methods [14], [32] that use this information to generate dense 3D geometric models for rendering new viewpoints, we directly synthesize 360° RGB images using the recovered depth from the input panoramas. We present a novel depth interpolation and refinement scheme that ensures high visual quality and fast view synthesis. We tested our method in various indoor and outdoor environments at medium- to large-scale scenes, with casually captured data from a consumer-grade 360° camera. We evaluated our method against current state-of-the-art approaches [14], [15], [25] in each of our environments over short and long ranges of motion.

Fig. 1: Our method provides 6-DoF experiences using 360° panoramic images captured by a handheld 360° camera. (a) The input 360° panoramas captured by a conventional 360° camera, (b) the corresponding depth panoramas estimated by our method, (c) synthesized novel views with comparisons, and (d) their corresponding zoomed-in images.

Our contributions are summarized as follows:
• We present a novel platform with a complete pipeline to enable 6-DoF viewing experiences using a set of panoramas captured by a handheld consumer-grade 360° camera.
• We present a robust approach to reconstruct depth panoramas from a set of RGB 360° panoramic images.
• We developed a novel method to synthesize novel panoramic viewpoints in real time (30 fps), allowing users to walk around within a large-scale captured scene.

6-DoF methods have attracted much attention in recent years due to the need for motion parallax in VR applications. Thatte et al. [40] introduced stacked omnistereo, which uses two camera rigs stacked on top of each other in order to capture 6-DoF content. Welcome to Light Fields [27] used a spinning camera setup to capture high-density images and provide reliable depth estimation. Facebook Manifold RED [31] is a camera system that has been specifically designed to capture 6-DoF video. More recently, Broxton et al. [6] proposed a system that uses deep learning to convert videos captured by a spherical camera rig into a layered mesh, which can be viewed in 6-DoF. These state-of-the-art capture methods are restricted to professional cameras and sophisticated setups. Photogrammetry-based methods such as depth image-based rendering (DIBR) [22], [26], [50] only require a set of overlapping photos that are not necessarily from the same camera, and are thus more suitable for casually captured videos.
Inside-Out [15] is a photogrammetry-based method that enables wide-area, free-viewpoint rendering via tile selection. DeepBlending [14] uses a similar architecture to Inside-Out. One significant improvement of this method is the use of deep learning-based image blending to achieve higher visual fidelity. Recently, Xu et al. [44] further improved this line of work with the ability to reconstruct reflections on reflective surfaces. However, it requires thousands of input images accompanied by input depth from a Kinect, and is therefore not ideal for casual users such as tourists. Recent works such as NeRF [25], [47], Free View Synthesis [32], and Stable View Synthesis [33] use deep learning to render high-fidelity novel views. However, these methods are slow and require significant running time to synthesize each frame. The methods discussed above were built for perspective cameras, meaning direct application to 360° panoramas would perform poorly. In this paper, we present a method for creating high-fidelity 6-DoF scenes with 360° cameras. Huang et al. [17] presented an approach that uses geometry estimated via Structure-from-Motion (SfM) to guide the vertex warping of a 360° video. This technique simulates the feel of a 6-DoF experience. However, the experience is somewhat lacking due to its inability to produce motion parallax. Cho et al. [9] extended this method to use multiple 360° panoramas as input, but it still presents the same limitation. The lack of motion parallax, in either case, can lead to VR discomfort. One of the early 6-DoF panorama applications that provided motion parallax was proposed by Serrano et al. [36]. In their system, they used deep learning to predict the depth map of a given 360° panorama and render a mesh representation of the environment (inherently allowing for motion parallax). They designed a three-layered representation to handle occlusion. However, their method only took input from a single viewpoint, meaning the output quality and range of motion were limited. MatryODShka [1] adapted Multi-Plane Images (MPI) [24], [49] to 360° panoramas, creating a layered representation of Omnidirectional Stereo (ODS) images with deep learning; however, this approach also has limitations on its synthesis quality and range of available motion. In this paper, we present a novel method for performing wide-area free-viewpoint navigation using casually captured 360° panoramas. Recovering depth information is important for enabling accurate 6-DoF images. Previous depth estimation methods for 360° images have attempted to recover the depth map via matched features [34]; this type of method suffers from feature detection errors, often producing an incomplete depth map. Most recent works on 360° images perform depth estimation with a single [42] or a pair of 360° images [43], [51]. These works estimate depth without validating depth consistency across images. The inconsistency across depth images causes severe ghosting artifacts and is not suitable for view synthesis of large scenes, where dozens to hundreds of images are often used for reconstruction. To reconstruct consistent depth across multiple images, the depth needs to be estimated in the spirit of Multi-View Stereo (MVS). Similar to other areas in computer vision, research has attempted to approach MVS using CNNs [45], [46]. However, these methods often perform poorly on images captured in the wild.
Since their networks are trained by comparing the estimated depth with ground-truth depth, they are limited in adapting to newly captured scenes when ground truth is not available. Neural Radiance Fields (NeRF) [25] use a multi-layer perceptron (MLP) to learn a representation of the scene. Although this method requires training for every new scene, the advantage is that the network can be trained by directly comparing the estimated result with the captured RGB images, making it more suitable for general scenes than methods trained with ground-truth depth. However, we observed that NeRF has difficulty converging when trained with 360° panoramas. We demonstrate this limitation in Sect. 8. Since the learning-based methods in our survey and tests were unable to provide acceptable depths, we developed our own depth estimation as described in Sect. 5. An overview of our method is shown in Figure 2. It has an offline pre-processing phase and an online view synthesis phase. During the pre-processing phase, the captured 360° video sequence is first processed by registering camera parameters. Then the sparse and dense panoramic depths of each input panorama are estimated. In contrast to previous methods [14], [15], we estimate the dense depth map via a separate epipolar-based depth estimation instead of performing surface reconstruction on the sparse point cloud built from the sparse depth maps. The estimated sparse and dense depths are then refined to produce high-quality depth panoramas in several steps: sparse and dense depth fusion, depth correction via raymarching, and forward depth projection. The refinement process reduces the noise in the estimated depths and ensures depth consistency across input panoramas. The online phase performs novel 360° view synthesis, which can be done in real time to support interactive VR applications. Given a target position at which to synthesize a novel view panorama, we first project all input panoramas to the target position based on their estimated depths, generating an initial RGBD 360° panorama. We generate dense depth panoramas by interpolating the depth values of the corresponding pixels from the input panoramas and further enhance them with the same depth correction algorithm used in the pre-processing phase. Using the enhanced depth panorama, we are able to synthesize the RGB values of the novel view panorama by retrieving RGB pixels from the neighboring input panoramas and combining them with weighted blending. Our method takes a set of 360° panoramas as input, either discretely captured at different locations or selected from a series of frames in a video. Blurred images harm the performance of camera registration methods (producing inaccurate estimated camera poses), which leads to ghosting artifacts in the view synthesis and degrades the quality of the estimated depth. When sampling frames from a video, we therefore use the variance of the Laplacian operator [2] to reject any blurred frames and substitute their closest sharp frames instead. Similar to other methods based on photogrammetry [14], [15], [32], [33], our method makes no assumptions regarding the camera transforms (translation and rotation). We recover the camera transforms corresponding to the selected frames using COLMAP (an SfM software package) [35].
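The blur-rejection step above can be reproduced with a few lines of OpenCV. The sketch below is a minimal illustration, not the paper's implementation; the threshold value and the substitution window are assumptions that would need tuning per camera and scene.

```python
import cv2

def laplacian_sharpness(image_bgr):
    """Variance of the Laplacian: higher values indicate a sharper frame."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var()

def select_sharp_frames(frames, step=10, blur_threshold=100.0):
    """Sample every `step`-th frame; if it is blurred, substitute the closest
    sharp neighbor. `blur_threshold` is a hypothetical value."""
    selected = []
    for i in range(0, len(frames), step):
        best = i
        # Search outward from the sampled index for a sharp frame.
        for offset in range(0, step // 2 + 1):
            for j in (i - offset, i + offset):
                if 0 <= j < len(frames) and laplacian_sharpness(frames[j]) > blur_threshold:
                    best = j
                    break
            else:
                continue
            break
        selected.append(frames[best])
    return selected
```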
We chose COLMAP over other software such as Meshroom [12] because we found that COLMAP produces more robust registration results, and to ensure a fair comparison with the alternative methods [14], [25] (which also use COLMAP). However, COLMAP is designed for perspective images. In order to use it on 360° panoramas, we first project them into a set of perspective sub-images with overlapping FoV. Every panorama is projected into 8 perspective images, each with a 110° horizontal FoV, ignoring the top and bottom poles of the 360° panorama. We sample perspective images in this way because we find it provides better input for COLMAP than a cubemap projection in terms of feature quality and quantity. Since the 8 perspective images are sampled from the same 360° panorama, they share the same center point, which we use to recover the position of the 360° camera.

Figure: Overview of our depth estimation pipeline. The process starts by generating depth panoramas from the RGB inputs using patch-based multi-view stereo and epipolar geometry. We then apply our depth refinement process to the fused depth panorama. Depth fusion consists of fusing the depth panoramas in a back-to-front fashion, followed by morphological operations.

Recent methods for view synthesis [14], [33] follow a process of first estimating sparse depths from RGB inputs and then densifying them via surface reconstruction [8], [19] to recover dense depth information for a given scene. However, since casually captured image sets from non-professional users often lead to imperfect reconstruction, the reconstructed sparse depth can contain large regions of missing depth, as in the complex outdoor scene shown in Fig. 3 (sparse depth panorama). This causes surface reconstruction errors and, in turn, poorer results for view synthesis. To address this issue, we propose a two-stage approach that estimates the sparse depth and dense depth separately, using a patch-match depth estimation method and an epipolar depth estimation method respectively. We further propose a depth refinement process to improve the quality of the estimated depth map. This refined depth is later used for synthesizing novel views. Sparse depth estimation: Our first step is to estimate a sparse depth panorama for each of the input RGB panoramas using a patch-match-based depth estimation method. We achieve this via COLMAP, as in other similar methods [14], [33]. This approach produces good results for both small and thin objects, but it has limitations in texture-less regions, resulting in a depth panorama that contains a lot of missing information (referred to as a sparse depth panorama). Directly performing view synthesis with such depth would result in scenes with missing geometry. Thus, denser depth information is required for a complete view synthesis result. Dense depth estimation: Previous methods [14] obtain a dense representation of the geometry via surface reconstruction using the sparse depth. However, the sparse depth estimated from a casually captured image set may contain large gaps that are difficult for surface reconstruction to handle. Therefore, in contrast to previous methods, we propose to separately estimate the dense depth panorama using epipolar geometry [48]. Although this type of method often fails to detect thin geometries, it is able to recover more complete depth information in texture-less regions than patch-based methods. We implement the dense depth estimation as follows.
Given a 360° panorama, we first generate sweep volumes directly using the N closest 360° panoramas (we use N = 4 in our experiments); their full FoV helps to avoid the out-of-FoV issue that arises with narrow-FoV videos. We then compare the sweep volume with the given 360° panorama to generate a cost volume. We compute the cost volume using the classical AD-Census [23] method, because it demonstrates good results in texture-less regions. Similar to previous methods [16], [30], we adopt guided image filters to filter the matching cost. Guided image filters smooth the cost volume with respect to the edges of the guiding image, which helps sharpen the edges of the resulting depth images. We then follow the classical winner-takes-all depth selection to produce the dense depth map. Our dense depth panorama is estimated on a per-view basis, meaning that the estimated depth information is based on the parallax present in neighboring panoramas. Each panorama has different neighbors, so the recovered depth panoramas may not be consistent, as shown in the top row of Fig. 4. This causes visual artifacts during view synthesis. We solve the depth inconsistency with an iterative depth refinement process. Our depth refinement process consists of three steps: depth fusion, depth correction via raymarching, and forward depth projection. Depth Fusion: Our depth fusion is a straightforward process that combines the dense depth map and the sparse depth map in a back-to-front fashion. During depth fusion we use the dense depth map as a hole filler to fill in the regions where sparse depth estimation fails to reconstruct. We then apply a morphological operation to the combined depth map to reduce the floating noise that comes with the sparse depth. During the first iteration, we perform depth fusion using the estimated sparse and dense depth panoramas. Similar to Hedman et al. [14], our depth fusion allows us to adopt the advantages of each depth type while avoiding their limitations. Nonetheless, the fused depth can still contain issues such as depth inconsistency inherited from the depth maps estimated using the epipolar approach. Thus, it needs further refinement before it can be used in view synthesis. Depth Correction via Raymarching: Depth consistency is optimized by finding the optimal depth value among neighboring panoramas. Floating geometry artifacts often occur when the estimated depth map has outliers that are too close to the camera. For instance, in the top row of Fig. 4, the area around the pavilion has estimated depth values much smaller than its neighboring area. We first attempt to remove those outlier points by checking whether a depth value is the largest depth agreed upon by all neighboring depth panoramas within a given range. We check this consistency via a raymarching process similar to the one used in Koniaris et al. [20], as shown in Algorithm 1, which returns the corrected depth panorama D_cor. The increase rate r and the number of nearest input panoramas K_rm need to be set manually for different scenes in order to obtain optimal results. In our experiments, we set r and K_rm to 0.005 and 4 respectively by default, and found that these values worked well in all our scenes.
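Algorithm 1 itself is not reproduced in this extraction; the sketch below is our reading of the consistency test described above, not the authors' exact procedure. It marches each candidate depth outward by the rate r until the K_rm neighboring depth panoramas no longer report a farther surface at the reprojected location. The agreement tolerance, step rule, and the `reproject` helper are assumptions.

```python
import numpy as np

def correct_depth_raymarch(depth, neighbors, reproject, r=0.005, k_rm=4,
                           agree_tol=0.05, max_steps=50):
    """Hypothetical depth-consistency check inspired by Algorithm 1.

    depth     : (H, W) depth panorama to correct.
    neighbors : list of nearby depth panoramas (only the first k_rm are used).
    reproject : assumed helper (i, j, d, neighbor_idx) -> depth sampled from
                that neighbor at the reprojected location.
    r         : relative rate by which a disagreeing depth is increased.
    """
    corrected = depth.copy()
    h, w = depth.shape
    for i in range(h):
        for j in range(w):
            d = depth[i, j]
            if d <= 0:
                continue  # missing depth stays missing
            # Push the candidate depth away from the camera until no neighbor
            # reports a noticeably farther surface there (outlier removal).
            for _ in range(max_steps):
                agreed = True
                for n in range(min(k_rm, len(neighbors))):
                    d_n = reproject(i, j, d, n)
                    if d_n > d * (1.0 + agree_tol):
                        agreed = False
                        break
                if agreed:
                    break
                d *= (1.0 + r)
            corrected[i, j] = d
    return corrected
```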
Although our raymarching process is able to remove outlier depths that are too close to the camera, it often erases parts of objects. This is because, due to the limitation of epipolar-based depth estimation, not all neighboring views will have a correct estimate for thin objects; if the sparse depth map also fails to capture them, the raymarching process ends up removing the depth information of those objects. An example is shown in the second row of Fig. 4, where the pillar of the pavilion has been partially erased. In such cases, the refined depth panorama loses the depth information related to the occluding object, causing artifacts where geometric details are missing. We overcome this issue with the following forward depth projection process. Forward Depth Projection: Our depth correction refines the depth map by checking the consistency of depth values across neighboring panoramas within a given range. Since each panorama has different neighbors, the result of the depth correction will be slightly different for each. Although an occluded region in one image might be partially erased during the first depth refinement pass, this information may still exist in one of its neighboring panoramas. Our forward depth projection makes use of this view-dependent depth information by merging a given depth panorama with its neighbors to recover lost occlusion information. Our depth projection is shown in Algorithm 2, where the neighborhood range K_fp is scene dependent. We set K_fp = K_rm + 2 in our experiments. The slightly wider search range obtains information from neighboring panoramas that are less likely to be affected by the same problematic depth panorama during raymarching. For each iteration of depth refinement, we increase the values of K_rm and K_fp slightly (by 2 in our experiments). The slightly increased checking range helps to improve the consistency of the depth over more panoramas; eventually K_rm and K_fp would equal the number of input panoramas. However, in our experiments, we found that around 3 iterations were sufficient for most of our test scenes. We illustrate each step of our depth refinement in Fig. 4, showing the result of each refinement step. We also show a few synthesized views using the depth maps from each depth refinement step to demonstrate the quality improvement. Our next step is to synthesize novel view panoramas using a set of input RGB panoramas and their corresponding depths and transformations obtained in our pre-processing phase. Previous methods [14], [15] rely on mesh reconstruction for each input image, leading to accumulated geometric errors when blending the meshes for novel view synthesis, which limits quality. To avoid this, we approach view synthesis via depth-based image warping with real-time depth correction. During view synthesis, we first generate the depth panorama for the synthesized view, with depth correction, from the input panoramas. We then use this depth panorama to extract RGB pixel values from the inputs and blend them with weights to synthesize the pixel colors of the novel view panorama. Image to 3D coordinates: We first align the orientations of the input panoramas to ensure they face the same direction. Since we use the equirectangular representation of 360° panoramas, as in prior work [40], we first convert the pixels (i, j, d) of the input equirectangular RGBD images into spherical polar coordinates (θ, φ, d), where θ ∈ [−π, π], φ ∈ [−π/2, π/2], and d is the corresponding depth value from the input depth panorama D_k.
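The paper's exact pixel-to-angle mapping is not reproduced in this extraction; below is a minimal sketch of the standard equirectangular parameterization we assume it uses (W × H image, θ over [−π, π], φ over [−π/2, π/2]). The sign and half-pixel conventions are assumptions.

```python
import numpy as np

def pixel_to_spherical(i, j, depth_panorama):
    """Map equirectangular pixel (row i, column j) to (theta, phi, d)."""
    h, w = depth_panorama.shape
    theta = (j + 0.5) / w * 2.0 * np.pi - np.pi   # longitude in [-pi, pi]
    phi = np.pi / 2.0 - (i + 0.5) / h * np.pi     # latitude in [-pi/2, pi/2]
    d = depth_panorama[i, j]
    return theta, phi, d
```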
We then project the converted spherical coordinates of each input panorama into local 3D Cartesian coordinates v using Equation 1. Forming the novel panorama: The next step is to reproject v into the local coordinates of the target novel view panorama. If we define the center of the target novel view panorama as t_new, then the transformation from any input panorama centered at t_k can be calculated as t = t_new − t_k. All 3D points v of the input panoramas are then reprojected to the target position t_new as (v − t). The reprojected points of each input can subsequently be converted into spherical polar coordinates (θ_rep, φ_rep, d_rep) of t_new using Equation 2, and these spherical polar coordinates are converted into pixels of the equirectangular representation of the novel view panorama. We use Equations 1 and 2 to extract the necessary data from the input panoramas and synthesize the novel views in two steps: 1) backwards warping with depth correction, and 2) image blending, described as follows. Direct projection of the input panoramas into a novel view tends to produce many missing pixels. Simple color interpolation, such as a linear or cubic function, results in a blending of foreground and background. To overcome this, we first synthesize a depth panorama at the target novel view and interpolate the depth values to avoid any missing values. In our approach we adapt a morphological closing [21], similar to the one used by Thatte et al. [40], to achieve the desired result. We use the synthesized depth panorama to extract RGB pixel values from each input panorama using Equations 1 and 2 to synthesize the novel view panorama. However, there may be disagreement between the input depth panoramas about the correct depth value at the target position. A depth value that refers to the correct pixel in one input panorama may point to a different location in another input panorama, leading to visual artifacts in the synthesized panorama. To correct this, we apply Algorithm 1 to the synthesized depth panorama, improving the overall visual consistency. We perform backwards warping with the corrected depth panorama in order to obtain the RGB pixels from the input panoramas with reduced visual artifacts. To synthesize the color values of our novel view panorama, we use the synthesized depth panorama to extract the RGB pixel values from the corresponding input panoramas. We extract the pixel values from the closest K neighboring panoramas, where K = 4. If a suitable pixel cannot be found in the closest neighbors, we progress further to find the corresponding pixels across other input panoramas. After extracting corresponding pixel values, their RGB values are blended by our weighting formula. Our weighting formula is based on three components: the correctness of the estimated depth w_d, the distance between the position of the target view and the input panorama w_cam, and the angle between the viewpoints and the relevant point in the scene w_ang. The final blending weight W for each pixel is computed from these three terms. The weight w_d is computed from the difference between the estimated depth and the actual depth value, where d_rep is the depth estimated by reprojection from the position of the novel view depth panorama to the position of the input depth panorama, and d is the depth value in the input depth panorama. If the difference between d and d_rep is large, an occlusion has occurred, and the pixel from this input panorama should contribute less towards our output.
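A minimal sketch of the reprojection in Equations 1 and 2 and of the depth-difference check behind w_d, under the assumption that Equation 1 is the standard spherical-to-Cartesian conversion and that w_d decays with |d − d_rep|; the exact fall-off used in the paper is not reproduced here, and the sigma value is hypothetical.

```python
import numpy as np

def spherical_to_cartesian(theta, phi, d):
    """Assumed form of Equation 1: spherical (theta, phi, d) -> Cartesian v."""
    x = d * np.cos(phi) * np.sin(theta)
    y = d * np.sin(phi)
    z = d * np.cos(phi) * np.cos(theta)
    return np.array([x, y, z])

def cartesian_to_spherical(v):
    """Assumed form of Equation 2: Cartesian point -> (theta, phi, depth)."""
    d = np.linalg.norm(v)
    theta = np.arctan2(v[0], v[2])
    phi = np.arcsin(v[1] / d) if d > 0 else 0.0
    return theta, phi, d

def depth_weight(d_input, d_reprojected, sigma=0.1):
    """Hypothetical w_d: down-weights pixels whose input depth disagrees with
    the reprojected depth (a large |d - d_rep| signals occlusion)."""
    return np.exp(-abs(d_input - d_reprojected) / sigma)

# Reprojecting one input point into the target panorama's frame (example values):
t_k, t_new = np.array([0.0, 0.0, 0.0]), np.array([0.3, 0.0, 0.1])
t = t_new - t_k
v = spherical_to_cartesian(theta=0.5, phi=0.1, d=2.0)   # point seen from t_k
theta_rep, phi_rep, d_rep = cartesian_to_spherical(v - t)
```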
The camera weight w_cam measures the distance between the center of the novel view panorama and the center of the input panorama, defined in terms of the translation t described in Sect. 6.1. The constant s is a scale weight, which we set to 10 in our experiments. Our angle weight w_ang reproduces view-dependent features in the synthesized image by favoring pixels with similar view angles, as considered in [4], [7]; it is calculated from the angle between the vector t and the vector v. Finally, each pixel value of the novel panorama I_n(p) is synthesized by weighted blending of the corresponding pixel values from the K neighboring input panoramas I_k, where q_k represents the corresponding pixel in the k-th input panorama. We illustrate the effect of different weighting formulations in Fig. 7. Our weighted blending successfully synthesizes pixel values of the novel view panorama while reducing blending errors such as ghosting artifacts. We used the Nvidia CUDA library for GPU-based computations such as view synthesis. We render our novel view panorama to a VR HMD using the Oculus SDK and OpenGL libraries. The current implementation runs at 30 fps. We evaluated our method on our own captured datasets for two reasons. First, there are no available benchmark datasets of 360° panoramas for evaluating free-viewpoint synthesis methods. Secondly, one of the key features of our method is its ability to work on casually captured real-world 360° panoramas, meaning we needed a dataset that meets this criterion. We captured 360° videos of different environments using a Ricoh Theta V. Most of our scenes were captured with a handheld camera. We ensured that our dataset was captured in a variety of environments so that we could evaluate the scalability and robustness of our method. For details of the captured scenes please refer to the appendix. We sample every 10th frame from the original video for our dataset, while avoiding the few frames with motion blur caused by the casual capturing setup; this produces roughly 50 cm intervals between input panoramas. We compared our results quantitatively and qualitatively with three closely related state-of-the-art methods: Unstructured Lumigraph Rendering (ULR) [7], Inside-Out [15], and DeepBlending [14]. They are implemented in the system provided by Bonopera et al. [5]. We also performed a qualitative comparison with the recent Neural Radiance Fields (NeRF) [25] and OmniPhotos [4]. Since NeRF's default projection method only supports front-facing and outside-in camera poses, we implemented a new projection method for 360° cameras, so that we can directly use 360° panoramas as input for NeRF. We trained NeRF with 250k iterations on each of the tested scenes. Since the prior works [7], [14], [15] do not support 360° panoramas directly, the inputs to these methods are the same as what we used for COLMAP (Sect. 4), i.e., 8 perspective images per 360° panorama. Fig. 6 shows the motion parallax of the synthesized view in various directions. Fig. 12 shows examples of synthesized panoramas, which provide novel panoramic views away from the captured path. Comparisons of the view synthesis for the indoor scenes are shown in Fig. 8. We noticed that learning-based methods such as DeepBlending efficiently reduce the visual artifacts caused by geometry estimation errors, but often blur the scene.
Both our method and Inside-Out preserve the sharp appearance of the scene, but compared with Inside-Out, our method shows fewer visual artifacts in boundary regions. Comparisons of view synthesis for the outdoor scenes are shown in Fig. 9, where our method outperforms the prior works. The geometry reconstruction of a large outdoor scene is very challenging. DeepBlending and Inside-Out are highly dependent on the quality of the mesh produced by surface reconstruction. The sparse depths produced by COLMAP tend to contain large missing regions and can therefore cause surface reconstruction errors. Furthermore, their depth refinement does not consider depth consistency across views and therefore often produces different errors across input panoramas. When blending multiple inputs for novel view synthesis, these per-view errors accumulate, producing visual artifacts. Our depth estimation and refinement method generates reliable and consistent depth information for synthesizing panoramas at different viewpoints. Our method recovers detailed geometry with reduced errors and improves results by fusing sparse depths. However, one side effect is that it may remove some of the finer details, causing ghosting artifacts in some cases. Even with this side effect, our method consistently produces higher quality results than prior methods in all our test scenes. We have tested our method with scene data used in OmniPhotos [4], and the results are shown in Fig. 11. Similar to MegaParallax [3], OmniPhotos relies on optical flow to guide image warping. The estimated optical flow describes the movement of pixels from one image to another and is thus only valid within the capture circle. When a user moves outside of the capture circle, optical flow can no longer provide adequate guidance for image warping, and the method has to rely on the estimated geometry proxy. However, their geometry proxy only provides rough geometry information; using it to perform image warping produces significant visual artifacts, such as twisted objects. Compared to OmniPhotos, our method and DeepBlending can both synthesize reasonable results outside the captured region. In our test, our method shows better visual quality than DeepBlending. In our test, NeRF struggles to generate novel views using our dataset. We hypothesize that because our test scenes attempt to cover a very large area with 360° panorama images, the scale of the scene exceeds the capacity of NeRF's MLP network, which leads to poor results. We further compared our method with NeRF, and the results are shown in Fig. 10. Previous methods [14], [15] were designed for dense inputs, and we therefore also tested them with half the sampling interval. However, their results did not improve in our experiments. We hypothesize that the distortion of 360° panoramas produced by current 360° camera models is unfavorable for conventional reconstruction methods; increasing the number of sampled frames introduces more noisy features, which do not help improve the quality of the reconstructed mesh. Similar to previous methods, the number of images also impacts the quality of our method. We demonstrate how our method performs compared to the other methods with different numbers of input images in Table 1. Increasing the number of input images improves the result, but also requires more processing time. Nonetheless, as shown in the table, for our method an input of 20 images from a 14-second captured video sequence is sufficient for an acceptable reconstruction result.
We quantitatively evaluated our method by comparing it with the three closely related methods [7], [14], [15]. Our evaluation follows the approach used by Waechter et al. [41]: render novel views at the same viewpoints as the input images, then use the input images as ground truth and compare against them in terms of completeness and visual quality [13]. We tested each method on six scenes: three indoor (PathWay, Hallway, Office) and three outdoor (Bridge, Botanic Garden, Car Park). We did not evaluate the top/bottom views since viewers tend to fixate on the equator when exploring a panorama [4]. We measure the visual quality of the synthesized view at each location quantitatively by comparing the synthesized view with the ground truth image using several metrics: multiscale structural similarity (MS-SSIM), peak signal-to-noise ratio (PSNR), and perceptual similarity (LPIPS). The average scores of our tests are given in Table 2. As shown in the table, our method outperforms the others in all metrics, in both indoor and outdoor scenes. We also performed a set of ablation studies, with results in Table 2. They show that synthesis with only sparse depth estimation performs worst, that dense depth estimation alone improves the results, and that the combination of the two performs better than either alone. The final depth refinement step also shows a clear improvement. Our results were produced on a desktop PC with an Intel Xeon W-2133 CPU at 3.60 GHz, 16 GB of system RAM, and an Nvidia RTX 2080 Ti GPU. The performance of novel panorama synthesis varies with the number of input panoramas. Throughout the experiments, we used the closest four input panoramas to synthesize a novel panorama. The data used for view synthesis is a set of 4K (4094 × 2048) equirectangular images and their corresponding 2K (2048 × 1024) depth panoramas. We are able to synthesize the panoramas at 30 fps while rendering to the Oculus Rift headset. This frame rate increases to around 90 fps when synthesizing a novel panorama at a resolution of 2048 × 1024. The time it takes to process a captured panorama collection depends on the number of images, as shown in Table 1. We report the time and storage required to process a set of 20 images using our method and the prior methods in Table 3. The storage space needed for the final processed dataset of 20 source views is about 47 MB. Depth estimation requires significant space for COLMAP to store its processing data; for instance, the 20-input scene required about 3 GB of storage. However, once the sparse geometry reconstruction is completed, only the camera information and the reconstructed sparse point cloud (sparse depths) are kept, and the rest of the data can be discarded. Our method shows promising results in our tests, outperforming the prior works. However, producing high-fidelity 6-DoF video from a single handheld 360° camera remains challenging, with much room for improvement. Depth Estimation with Learning-based Methods: Our depth estimation could be improved for both better depth accuracy and faster processing. The recent NeRF [25] could provide a clue toward this. However, we observed that naive NeRF has convergence issues when trained with 360° images, which we believe is related to its MLP network. The main idea of NeRF is to use a learning-based method to learn the representation of the scene.
The scene information is still described using techniques similar to classical MVS, such as ray sampling. When performing ray sampling, the image pixels are converted to 3D points, so the format of the image does not affect the result of this process. Furthermore, as 360° images have an FoV advantage over normal perspective images, using 360° images should improve the result. Thus, replacing the MLP network in NeRF with a more effective learning structure, and using the improved NeRF for depth estimation, could be a good next step to improve our pipeline. Reflective and Transparent Surfaces: Our depth estimation has limitations for reflective and transparent surfaces such as glass or water. An example of these limitations is shown in Fig. 12 (Hill scene), where our method produces ghosting artifacts on the glass window. The Multi-Plane Image (MPI) representation has demonstrated its ability to reproduce the behaviour of reflective and transparent surfaces [6], [11], [49]. Adapting similar methods into our pipeline to separate the transparent layers would be an interesting next step. Dynamic Objects: Similar to previous work [10], [14], [15], [25], our method has limitations with dynamic objects. Given that the input of our method is captured with a single camera, moving objects appear at different positions in each image, making it difficult to estimate consistent depth for those objects. Furthermore, the different positions also cause inconsistent projection of the object (the same object is projected to different places). This could be solved by estimating the flow of the dynamic object over time [28], [29] and using the flow information to guide image synthesis. Stitching Artifacts: 360° panoramas are produced by multi-image stitching, which with current methods still contains small distortions or gaps around the image boundaries, causing inconsistencies of object shape in these areas. We show an example of such artifacts in Fig. 13. These inconsistencies cause ghosting artifacts in the view synthesis when performing image blending. Reducing stitching errors in the captured 360° panoramas could reduce such artifacts.

Fig. 13: Examples of the stitching artifact. The red circle shows where the image stitching artifact appears.

Pre-processing: Our method relies on offline pre-processing, including the depth estimation and refinement steps. This is acceptable for a static scene, but limiting when adapting to dynamic scenes. User Study: User studies were highly restricted due to COVID-19 at the time of this work, with a nationwide lockdown restricting any mass gathering. When possible, we plan to perform a user study to measure presence, immersion, VR sickness, and the overall usability of the 6-DoF experiences produced by our method. We present a novel and complete pipeline to synthesize novel view panoramas using a set of 360° images captured by a handheld 360° camera. The result provides 6-DoF experiences: the sensation of walking around a captured large-scale scene with support for a full panoramic view at any location. We have tested our method in various scenes captured by a single handheld 360° camera, including indoor and outdoor scenes and mid- to large-scale scenes. We also compared our results with current state-of-the-art methods, showing better visual quality and robustness. Our method consistently produces high-fidelity results across all test scenes, while previous methods sometimes failed. We also outline the current limitations and potential future work that could lead to improvements.
We believe our method, tested on a consumer-grade 360° camera, can be easily adapted to various applications that require 6-DoF experiences of captured large-scale scenes, particularly benefiting casual users.

References
[1] MatryODShka: Real-time 6-DoF video view synthesis using multi-sphere images
[2] Blur image detection using Laplacian operator and OpenCV
[3] MegaParallax: Casual 360° panoramas with motion parallax
[4] OmniPhotos: Casual 360° VR photography
[5] SIBR: A system for image based rendering
[6] Immersive light field video with a layered mesh representation
[7] Unstructured lumigraph rendering
[8] Delaunay triangulation based surface reconstruction
[9] Novel view synthesis with multiple 360 images for large-scale 6-DoF virtual reality system
[10] Extreme view synthesis
[11] DeepView: View synthesis with learned gradient descent
[12] AliceVision Meshroom: An open-source 3D reconstruction pipeline
[13] Casual 3D photography
[14] Deep blending for free-viewpoint image-based rendering
[15] Scalable inside-out image-based rendering
[16] Fast cost-volume filtering for visual correspondence and beyond
[17] 6-DoF VR videos with a single 360-camera
[18] Multi-view reconstruction preserving weakly-supported surfaces
[19] Poisson surface reconstruction
[20] Compressed animated light fields with real-time view-dependent reconstruction
[21] Soft morphological filters
[22] Correspondence and depth-image based rendering: a hybrid approach for free-viewpoint video
[23] On building an accurate stereo matching system on graphics hardware
[24] Local light field fusion: Practical view synthesis with prescriptive sampling guidelines
[25] NeRF: Representing scenes as neural radiance fields for view synthesis
[26] Depth image-based rendering with advanced texture synthesis for 3-D video
[27] A system for acquiring, processing, and rendering panoramic light field stills for virtual reality
[28] Nerfies: Deformable neural radiance fields
[29] HyperNeRF: A higher-dimensional representation for topologically varying neural radiance fields
[30] Soft 3D reconstruction for view synthesis
[31] An integrated 6DoF video camera and system design
[32] Free view synthesis
[33] Stable view synthesis
[34] Efficient hundreds-baseline stereo by counting interest points for moving omni-directional multi-camera system
[35] Structure-from-motion revisited
[36] Motion parallax for 360° RGBD video
[37] Review of image-based rendering techniques
[38] Image-based rendering
[39] Towards perceptual evaluation of six degrees of freedom virtual reality rendering from stacked omnistereo representation
[40] Stacked omnistereo for virtual reality with six degrees of freedom
[41] Virtual rephotography: Novel view prediction error for 3D reconstruction
[42] BiFuse: Monocular 360° depth estimation via bi-projection fusion
[43] 360SD-Net: 360° stereo depth estimation with learnable cost volume
[44] Scalable image-based indoor scene rendering with reflections
[45] MVSNet: Depth inference for unstructured multi-view stereo
[46] Recurrent MVSNet for high-resolution multi-view stereo depth inference
[47] NeRF++: Analyzing and improving neural radiance fields
[48] Determining the epipolar geometry and its uncertainty: A review
[49] Stereo magnification: Learning view synthesis using multiplane images
[50] Free-viewpoint depth image based rendering
[51] Spherical view synthesis for self-supervised 360° depth estimation