key: cord-0058709-b5ewkn9v authors: Hiremath, Sandesh Athni title: PDE Based Dense Depth Estimation for Stereo Fisheye Image Pair and Uncertainty Quantification date: 2020-08-20 journal: Computational Science and Its Applications - ICCSA 2020 DOI: 10.1007/978-3-030-58808-3_47 sha: 4b8f9b5d74f07a1edb9db069723209af16f71ff8 doc_id: 58709 cord_uid: b5ewkn9v Dense estimation of depth from stereo pinhole image pair is a well known problem and seems to be easily solvable by both classical as well as machine learning algorithms. However, dense depth estimation from stereo fisheye image pair is apparently a more challenging problem. There are two main factors adding to the complexity of the problem: (I) owing to the wider field of view, the fisheye lenses are inherently nonlinear thus making the estimation problem harder. (II) because we want to estimate depth for every pixel coordinate from just two images, i.e. a calibrated stereo pair, the data is statistically insufficient thus greatly complexifying the problem. To alleviate (II) many depth estimation algorithms enlarge their parameter space by incorporating pose parameters and work on image sequence (esp. videos). Here we stick to the stereo camera setting and provide a novel estimation technique and quantify its uncertainty. We use the technique of variational calculus to derive a (noisy) pde for which wellposedness is proved and then solved in the least square sense using proximal algorithms. This setting enables us to invoke the standard machinery of nonlinear least squares for generating the covariance estimates. Most state of the art algorithms that are based on deep learning (nonlinear regression) technique fail to say anything about the uncertainty of the outputs which is of paramount importance in safety critical applications for e.g. arising in automotive domain. Lastly, our methodology has the advantage of being massively parallelizable for hardware implementation thereby offering lesser runtime. 
Tremendous research in image processing algorithms has given rise to a wide variety of applications across domains ranging from photography, entertainment, and surveillance to medicine and robotics. The automotive industry, too, has seen a surge of vision-based applications, especially for semi- or fully-automated driving; these include lane detection, traffic sign recognition, object/obstacle detection, parking assistance, and so on. Consequently, cameras are becoming mandatory sensors for an automobile. Due to the limit on the number of installed cameras and the need to incorporate ever more safety features, wide-angle cameras become an obvious choice [7, 8]. Owing to this necessity, fisheye cameras, whose FOV is larger than 180°, are becoming popular for automotive applications, and many computer vision algorithms are being adapted for wide-angle lenses [1-4, 9, 10, 22]. Stereo image correspondence is a well-known classical computer vision problem which involves finding the location of the objects of the first image in the second image. For pinhole cameras this problem simplifies to finding the shift in pixels, i.e. the disparity between the images. The disparity is then used to calculate the depth of the observed objects, thereby enabling a dense 3D reconstruction. Many well-known algorithms, both classical and deep learning based, exist for this problem [5, 6, 14, 17, 23], and some of them serve as industry standards. However, the same correspondence problem for fisheye images is still an open problem. The main challenge when dealing with fisheye images is the distortion introduced by the lens (see [1]); this distortion is modeled by a nonlinear camera projection model.
A simple workaround is to use the camera projection model to remove the distortion and produce a rectified stereo image pair to which pinhole stereo algorithms can be applied. This approach was followed in [20], which relied heavily on the stable disparity outputs of standard stereo matching algorithms such as [5, 6]. However, there is a growing demand for algorithms that work directly on raw distorted images, mainly driven by automotive, robotics and drone applications. In [16], the authors used conventional stereo matching directly with a nonlinear fisheye projection model, and in [13] a PDE based technique was used to find image correspondences. Analogous to [14], the authors of [19] used variational methods to directly match fisheye images obtained at different poses by locally linearizing the error term. In contrast to these classical algorithms, the authors of [15] adopted deep learning techniques, modifying an existing network for pinhole models and training it on fisheye images. One main aspect lacking in all these methods is that they fail to quantify the uncertainty of their estimates; one of the goals of this paper is to fill this gap by providing a probabilistic modeling of the problem and quantifying the uncertainty of the estimates. The paper is organized as follows: first we formulate the problem and propose a model in Sect. 2; in Sect. 3 we provide a probabilistic interpretation of the model and propose a method for estimating the variance; in Sect. 4 we propose a numerical method for solving the model and provide results. Finally, in Sect. 5 we discuss the results and make some concluding remarks. The problem we want to solve is the following: given a pair of images I_L and I_R obtained from two calibrated fisheye cameras forming a stereo setup, estimate the depth densely, i.e. for every pixel of the image grid.
Mathematically, this can be stated precisely as follows: given two calibrated fisheye images I_L, I_R : D → R, with compact domain D ⊂ R^2 and range R ⊂ R^3, find the mapping λ : D → R_+ that indicates the depth of the objects represented in the image I_L (or I_R). To solve this problem we appeal to the underlying epipolar constraints. According to these, if the baseline of the stereo camera is sufficiently wide, the images I_L and I_R will have significant overlap and can be viewed as shifted versions of each other. However, for fisheye cameras, unlike pinhole cameras, the shift is along a curve instead of along a line. The curve is obtained by projecting onto the right (resp. left) image plane the curve traced on the right (resp. left) camera surface by a light ray incident on the left (resp. right) camera. For fisheye and other omnidirectional lenses, images are distorted more at the edges than in the middle; depending on the coordinate position in the left (resp. right) image, the curves traced on I_R have varying degrees of curvature, which again adds to the complexity. Different models for fisheye cameras have been proposed [1, 11, 21], with varying degrees of accuracy and simplicity. For our work, we use the enhanced unified camera model (EUCM) [12], mainly because of its simplicity and reasonable accuracy. In Figs. 1a-c we illustrate the validity of the EUCM projection model. Firstly, for a given pair of fisheye stereo images of a scene, a set of prominent matching features is selected. The ray passing through a feature point of the left image traces a curve (as a function of depth) on the right image. For correct calibration parameters, the curve passes through the matching feature point in the right image; the point where the curve hits the matching feature represents the true depth value.
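The epipolar curve just described can be traced numerically via the projection model. Below is a minimal Python sketch of the EUCM projection and unprojection, following the standard EUCM formulas with intrinsics (fx, fy, cx, cy) and distortion parameters (α, β); the parameter values in the test are illustrative, not the calibration used here.

```python
import numpy as np

def eucm_project(X, fx, fy, cx, cy, alpha, beta):
    """Project a 3D point X = (x, y, z) to pixel coordinates under EUCM."""
    x, y, z = X
    d = np.sqrt(beta * (x**2 + y**2) + z**2)
    denom = alpha * d + (1.0 - alpha) * z
    return np.array([fx * x / denom + cx, fy * y / denom + cy])

def eucm_unproject(u, v, fx, fy, cx, cy, alpha, beta):
    """Back-project pixel (u, v) to a unit-norm ray direction X(u, v)."""
    mx = (u - cx) / fx
    my = (v - cy) / fy
    r2 = mx**2 + my**2
    mz = (1.0 - beta * alpha**2 * r2) / (
        alpha * np.sqrt(1.0 - (2.0 * alpha - 1.0) * beta * r2) + (1.0 - alpha))
    ray = np.array([mx, my, mz])
    return ray / np.linalg.norm(ray)
```

Sweeping a depth value along the unit ray of a left-image feature and projecting each 3D point into the right image traces exactly the cyan curves of Fig. 1.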
The feature points are marked as black dots in the left image and the curves are marked in cyan on the right image. The curves pass through the feature points (nearly) exactly; thus the correctness of the projection model and the calibration of the involved parameters is validated. As per the problem formulation in Sect. 2.1, the aim is to estimate a positive-valued scalar function λ. To this end, we adopt the framework of the calculus of variations. Before we proceed, let us make some basic assumptions on the functions and their spaces: we assume that I_L, I_R are elements of the Hilbert space L^2(D), and we expect λ to also be an element of L^2(D). In order to invoke the tools of variational calculus, we first need to define a loss functional whose minimizer represents the solution to the problem reasonably well:

E[λ] = ∫_D (I_L(u, v) − I_R(û(λ), v̂(λ)))^2 du dv + μ ∫_D s(λ, u, v)^2 du dv,    (1)

where μ > 0 is a regularization weight and s(u, v) is the magnitude of the normal vector to the tangent plane, at the position (u, v), of the surface generated by the 3D point Z(λ(u, v)). The data fidelity term is based on the brightness constancy assumption and minimizes the error in the reconstruction of the first image after sampling from the second image. The regularizer term is required to ensure structural consistency, solvability of the minimization problem, and candidacy of the solution in the appropriate function space. Intuitively, (u, v) → s(u, v) represents the infinitesimal area of the 3D surface generated by the mapping (u, v) → Z(λ(u, v)). Here Z represents the 3D world point, which can also be seen as a function of the depth map (u, v) → λ(u, v); more precisely, Z(λ) := λX. Note that, since X(u, v) is obtained by unprojecting the image coordinates (using the projection model), it is of unit norm and remains constant for a given set of camera intrinsics. Consequently, Z varies as per the depth map λ(u, v).
Based on this, the area element of the surface spanned by Z can be computed as

s(λ, u, v)^2 = |Z_u × Z_v|^2,    (2)

where Z_u and Z_v are the partial derivatives of Z w.r.t. u and v respectively. Since Z = λX, we have Z_u = λ_u X + λ X_u and Z_v = λ_v X + λ X_v, so that altogether

Z_u × Z_v = λ (λ_u a + λ_v b + λ c),    (3)

s(λ, u, v)^2 = λ^2 |λ_u a + λ_v b + λ c|^2,    (4)

where a := X × X_v, b := X_u × X and c := X_u × X_v are constant vector functions for a given set of camera intrinsics. From the above formula it is clear that the regularizer term s^2 in (1) directly induces a smoothness constraint on the depth map. To estimate the depth map for either one of the stereo fisheye image pair (I_L, I_R), we need to solve the variational problem of minimizing E[λ]. A necessary condition for λ̄ to be the minimizer of E[λ] is that it satisfies the corresponding Euler-Lagrange equation; the partial derivatives of s^2 w.r.t. λ, λ_u and λ_v follow from (4) by the chain rule. Next we make an important simplification, namely we convexify the regularizer term. Note that, from Eq. (4), the mapping λ → s^2(λ, ·, ·) is a quartic (4th-order) polynomial; consequently its first-order derivatives are at most cubic polynomials, which in general are non-convex. To account for this we simply define a new mapping λ → ŝ(λ, ·, ·) := (1/λ) s(λ, ·, ·), i.e. ŝ = |λ_u a + λ_v b + λ c|, and take ŝ as the regularizer term. As per this, we have the modified loss functional

Ê[λ] = ∫_D (I_L(u, v) − I_R(û(λ), v̂(λ)))^2 du dv + μ ∫_D ŝ(λ, u, v)^2 du dv.    (5)

Its corresponding Euler-Lagrange equation (6) can be abstractly written in the form

A λ = f(λ),    (7)

where A is a second-order elliptic operator containing the diffusion tensor K_1 induced by the regularizer, f(λ) collects the data fidelity and lower-order terms, and κ := (κ_1, κ_2, κ_3, κ_4) is a vector of positive model parameters. In order to solve (7) we still need to enforce a boundary condition, for which we make the obvious choice of a Neumann boundary condition. Then the PDE in (7) has a unique solution λ ∈ H^1(D): by Lax-Milgram we have a unique solution of Aλ = g in H^1(D) for g := f(λ) ∈ L^2(D), i.e. the resolvent operator A^{-1} : L^2 → H^1 maps every g ∈ L^2 to a corresponding λ ∈ H^1. Since ‖λ‖_{H^1} ≤ C ‖g‖_{L^2}, both A^{-1} : L^2 → H^1 and A^{-1} : L^2 → L^2 are continuous linear operators.
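For a discrete depth map, the quantities above are straightforward to evaluate. The following is a small sketch (not the paper's implementation) that forms Z = λX on the pixel grid and computes s = |Z_u × Z_v| with finite differences:

```python
import numpy as np

def surface_area_density(lam, X):
    """Infinitesimal surface area s = |Z_u x Z_v| for Z = lam * X,
    where lam is an (H, W) depth map and X an (H, W, 3) field of rays."""
    Z = lam[..., None] * X
    Zu = np.gradient(Z, axis=1)   # partial derivative along u (columns)
    Zv = np.gradient(Z, axis=0)   # partial derivative along v (rows)
    n = np.cross(Zu, Zv)          # normal vector to the tangent plane
    return np.linalg.norm(n, axis=-1)
```

The convexified regularizer ŝ is then simply `surface_area_density(lam, X) / lam`.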
However, since H^1(D) is compactly embedded in L^2(D), the operator A^{-1} : L^2 → L^2 is compact. Now the weak formulation of the original problem is equivalent to finding a fixed point of T λ := A^{-1} f(λ). Following as above, we have ‖Tλ‖_{H^1} ≤ C ‖f(λ)‖_{L^2}; thus T maps bounded sets in L^2 to bounded sets in H^1, and by the compact embedding H^1 ⊂⊂ L^2 we get that T : L^2 → L^2 is a compact operator. To show that T : L^2 → L^2 is also continuous, consider a sequence (λ_m)_{m∈N} such that λ_m → λ in L^2(D); then there exists a subsequence (λ_{m_k})_{k∈N} such that λ_{m_k}(·) → λ(·) a.e. in D. Thus, by continuity of A^{-1}, we get that ‖T λ_{m_k} − T λ‖_{L^2} → 0 as k → ∞. Finally, T : L^2 → L^2 is compact and continuous with T(H^1) ⊂ H^1, so by Schauder's fixed point theorem there exists λ ∈ L^2 such that λ = T λ in L^2(D); moreover, since T(H^1) ⊂ H^1, the fixed point belongs to H^1(D). Due to the boundedness of D ⊂ R^2, I_L, I_R clearly belong to L^2(D). Because of the continuity and boundedness of the EUCM projection model, the coordinate functions û, v̂ are also smooth and bounded. Similarly, the cross product functions a, b, c can be verified to be smooth and bounded. To ensure uniform ellipticity of the diffusion coefficient and edge-aware diffusion, we modify K_1(u, v) using weights based on the image gradients; the modified tensor is given in (8). Since |∂_x I_L| and |∂_y I_L| are uniformly bounded, K_1 is uniformly elliptic if and only if p_1 + p_4 > 2 p_2, i.e. ‖a‖^2 + ‖b‖^2 − 2(a, b) > 0, i.e. ‖a − b‖^2 > 0. If necessary, a sufficiently small constant is added to the diagonal elements of K_1 (especially where a(u, v) = b(u, v) or a = b = 0) in order to make it strictly positive definite. The necessary condition for finding the minimum of the functional (5) is to solve Eq. (7). In order to account for the errors induced in the image formation we incorporate noise into the model, which then takes the form

A λ = f(λ) + W,    (9)

where W ∼ N(0, Σ(λ)) is an L^2(D)-valued Gaussian random variable with covariance operator Σ ∈ L(L^2(D); L^2(D)). The noise term depends on the depth function, which facilitates positivity of the solution.
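The diagonal modification of K_1 described above can be sketched as follows; the (H, W, 2, 2) tensor layout and the eigenvalue floor eps are illustrative assumptions:

```python
import numpy as np

def ellipticity_floor(K, eps=1e-6):
    """Shift the diagonal of each symmetric 2x2 diffusion tensor
    K[i, j] = [[p1, p2], [p2, p4]] so that its smallest eigenvalue is
    at least eps, enforcing uniform ellipticity. K has shape (H, W, 2, 2)."""
    p1, p2, p4 = K[..., 0, 0], K[..., 0, 1], K[..., 1, 1]
    # closed-form smallest eigenvalue of a symmetric 2x2 matrix
    lam_min = 0.5 * (p1 + p4) - np.sqrt(0.25 * (p1 - p4) ** 2 + p2 ** 2)
    shift = np.maximum(eps - lam_min, 0.0)   # nonzero only where needed
    K = K.copy()
    K[..., 0, 0] += shift
    K[..., 1, 1] += shift
    return K
```

Tensors that are already uniformly elliptic (e.g. wherever a(u, v) and b(u, v) differ sufficiently) are left untouched.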
It is important to note here that, even though we have introduced the noise at the level of the Euler-Lagrange equation (7), one may also, equivalently, introduce it at the level of the functional (5); intuitively, this means the data fidelity term acquires an additive noise component. The noisy model (9) is now solved in the L^2 sense, i.e.

argmin_{λ ∈ L^2(D)} ‖A λ − f(λ)‖^2_{L^2(D)}.    (10)

We note here that the solution to a least-squares problem is given by solving the normal equation. This is a nonlinear least-squares problem, which we solve following the techniques provided in the next section. In order to numerically solve (10) we invoke the machinery of convex optimization, especially techniques involving proximal operators. Additionally, the problem can be supplemented by constraints such as positivity, smoothness, shearlet or wavelet sparsity, and so on. To facilitate this we consider the generic minimization formulation

argmin_x (1/2) ‖A x − f‖^2 + ι_S(Q x),    (11)

where ι_S is the indicator function of the constraint set S and Q is a bounded linear operator. We now state a primal-dual algorithm for solving minimization problems of the type (11).

Algorithm 1.
Data: κ > 0, τ_x, τ_p > 0, ϑ ∈ (0, 1], x_0, p_0
1  x̄_0 := x_0
2  x̃_0 := x_0
3  for n = 0, 1, ... do
4      p_{n+1} = p_n + τ_p Q x̄_n − τ_p P_S(p_n / τ_p + Q x̄_n)
5      x̃_{n+1} = x_n − τ_x Q* p_{n+1}
6      x_{n+1} = (A*A + (1/τ_x) I)^{-1} (A*f + (1/τ_x) x̃_{n+1})
7      x̄_{n+1} = x_{n+1} + ϑ(x_{n+1} − x_n)
8  end

Based on Algorithm 1 we can compute the uncertainty of the estimate x_{n+1} in the following way. Line 6 is the normal-equation solution of the stacked least-squares problem argmin_x ‖A x − f‖^2 + (1/τ_x) ‖x − x̃_{n+1}‖^2, i.e. Ã x ≈ f̃ with Ã := [A; (1/√τ_x) I] and f̃ := (f, (1/√τ_x) x̃_{n+1}). Letting f̃ ∼ N(μ_f̃, Σ_f̃), the estimate x_{n+1} is Gaussian with covariance

Σ_{x_{n+1}} = (Ã*Ã)^{-1} Ã* Σ_f̃ Ã (Ã*Ã)^{-1}.    (12)

For the special case when f and x̃_{n+1} are independent of each other, with f ∼ N(μ_1, Σ_1) and x̃_{n+1} ∼ N(μ_2, Σ_2), Σ_f̃ is block diagonal and (12) reduces to Σ_{x_{n+1}} = (A*A + (1/τ_x) I)^{-1} (A* Σ_1 A + (1/τ_x^2) Σ_2) (A*A + (1/τ_x) I)^{-1}. As mentioned above (Sect. 2), we are interested in estimating the depth map corresponding to a raw image obtained by fisheye cameras in a stereo setup. Firstly, the intrinsic camera parameters K_L and K_R are determined and kept fixed.
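Algorithm 1 is easy to sketch in the finite-dimensional case. The snippet below takes Q = I and P_S the projection onto a box (the positivity constraint used later); A is a generic matrix stand-in for the linearized operator, and the step sizes are illustrative:

```python
import numpy as np

def primal_dual_box(A, f, lo, hi, tau_x=0.1, tau_p=0.1, theta=1.0, iters=500):
    """Primal-dual iteration of Algorithm 1 with Q = I and
    P_S = projection onto the box [lo, hi]."""
    n = A.shape[1]
    x = np.zeros(n); x_bar = x.copy(); p = np.zeros(n)
    # precompute the regularized normal-equation system of line 6
    M = A.T @ A + np.eye(n) / tau_x
    Atf = A.T @ f
    for _ in range(iters):
        # line 4: dual ascent with the box projection (via Moreau identity)
        q = p / tau_p + x_bar                       # Q = identity
        p = p + tau_p * x_bar - tau_p * np.clip(q, lo, hi)
        # line 5: primal descent step (Q* = identity)
        x_tilde = x - tau_x * p
        # line 6: proximal step = regularized least-squares solve
        x_new = np.linalg.solve(M, Atf + x_tilde / tau_x)
        # line 7: over-relaxation
        x_bar = x_new + theta * (x_new - x)
        x = x_new
    return x
```

With A = I the iteration solves min_x ι_{[lo,hi]}(x) + (1/2)‖x − f‖^2, whose solution is simply the clipped data, which makes for an easy sanity check.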
Similarly, the relative transformation (R, T) between the two cameras is also determined and kept fixed; R and T transform the coordinate frame of the left camera into that of the right camera, as explained above. The constraints ‖a‖ > 0, ‖b‖ > 0, ‖c‖ > 0 and a ≠ b, mentioned below formula (8), were numerically verified for the parameters in Table 1. This ensures uniform ellipticity of K_1, as a result of which we get existence of a weak solution; consequently, the least-squares solution is equivalent to the weak solution. Moreover, to ensure positivity of the solution, we enforce box constraints, i.e. 0 < λ_min ≤ λ ≤ λ_max with λ_min := 0.5 and λ_max := 100. By setting Q := I and P_S := P_{[λ_min, λ_max]} in Algorithm 1 we get Algorithm 2. For evaluating I_R at non-integer pixel locations (û, v̂), simple bilinear interpolation was used. For solving the resulting matrix equations, algebraic multigrid solvers were used [18]. For the covariance estimate we use formula (12); since we solve the quadratic problem (Eq. (13)) locally, numerical values of Σ_1 are obtained by evaluating (14) at λ^n_m, from which we obtain Σ_{λ^{n+1}_m}. As λ^n_m → λ̄ in L^2(D) we have Σ_1(λ^n_m) → Σ_1(λ̄), and consequently Σ_{λ^n_m} → Σ_λ̄. Since covariance computation involves matrix inversion, it is a relatively expensive operation; to make it practically tractable, we downscale the image during covariance computation. Due to the nonlinearity of the problem, the choice of initial condition is highly important. To generate a good initial estimate one may use any available algorithm, ranging from a naive exhaustive global search, to the SGM method, to an estimate generated by a DNN. We take the output of FisheyeDistanceNet [15] and use the semantic segmentation output to assign max values to the regions of the image that represent sky.
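The bilinear sampling of I_R at non-integer coordinates (û, v̂) can be sketched as follows (border clamping is an assumption; u indexes columns and v rows):

```python
import numpy as np

def bilinear_sample(img, u, v):
    """Sample img at non-integer coordinates (u, v), clamping to the border."""
    H, W = img.shape
    u = np.clip(u, 0.0, W - 1.0); v = np.clip(v, 0.0, H - 1.0)
    u0 = np.floor(u).astype(int); v0 = np.floor(v).astype(int)
    u1 = np.minimum(u0 + 1, W - 1); v1 = np.minimum(v0 + 1, H - 1)
    a = u - u0; b = v - v0                      # fractional offsets
    return ((1 - a) * (1 - b) * img[v0, u0] + a * (1 - b) * img[v0, u1]
            + (1 - a) * b * img[v1, u0] + a * b * img[v1, u1])
```

The same routine works vectorized when u and v are arrays, which is what a per-pixel evaluation of (û, v̂) over the whole image grid requires.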
This is especially important for outdoor scenes where the sky occupies a majority of the scene. This strategy motivates one to supplement an offline algorithm, such as a multi-task DNN (trained to infer semantic and depth values), with an online estimation algorithm. The combination provides a twofold advantage: (i) it provides a more realistic result by improving the initial estimate based on the actual data, and (ii) it improves the safety of the system by rectifying grossly erroneous DNN outputs. We use this philosophy to estimate depths for different outdoor scenes; the results are shown in Fig. 2. Figures 2b-q show the estimated depths. The use of the semantically segmented Figs. 2a-p has enabled easy identification of the sky region, resulting in a much clearer depth output. Figures 3a and 3b depict the convergence of the numerical algorithm for all four scenes considered. Scene3 and Scene4 are taken from a video, for which we used more than two images to obtain image correspondences. Here we have used 6 pairs; the spikes in the convergence plot are indicative of the use of a new neighboring frame as the sampling image. This was done purely to indicate the plausible improvement from incorporating multiple views, which serves to improve the statistical sufficiency of the data. This improvement was validated by comparing the estimate with the ground truth generated by the Valeodyne sensors. In this paper we have presented a framework to estimate depths from two or more fisheye images. We employ a variational technique to estimate the entire depth map at once. A novel surface-area based term is used as a regularizer for the energy functional; as a result we get an anisotropic diffusion term which, we believe, facilitates more realistic surface reconstruction by way of more realistic depth estimates.
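The initialization strategy can be sketched as below, assuming an (H, W) DNN depth prediction, a per-pixel semantic label map, and a hypothetical sky_label id (the actual label ids depend on the segmentation network used):

```python
import numpy as np

def init_depth(dnn_depth, seg_labels, sky_label, lam_min=0.5, lam_max=100.0):
    """Build the initial depth map: clamp the offline DNN prediction to the
    box constraints and overwrite sky pixels with lam_max."""
    lam0 = np.clip(dnn_depth, lam_min, lam_max)
    lam0[seg_labels == sky_label] = lam_max   # sky is effectively at infinity
    return lam0
```

The online PDE iteration then refines this initial map against the actual stereo data.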
The regularizer term also assists in establishing the coercivity and boundedness required by Lax-Milgram, which is essential in proving existence of a unique weak solution in H^1(D). In order to ensure positivity of the solution and to facilitate the use of additional regularizing constraints, we use primal-dual methods to solve (9) in the least-squares sense. The least-squares interpretation enables us to infer the second-order moment, i.e. the covariance operator, of the depth estimate. Since we assume Gaussian noise, the covariance (along with the mean) suffices to characterize the entire distribution of the depth map. It should be noted that the least-squares interpretation is not a necessary step for uncertainty estimation; alternatively, one can view (9) as a stochastic PDE, which also enables one to estimate the uncertainty of the solution. We believe that uncertainty quantification is of paramount importance for many safety-critical applications (e.g. robotics, autonomous driving). Complex applications require the interaction of various modules that work with different types of data; in such cases the covariance of the estimate is crucial information that facilitates fusing heterogeneous data, thereby seamlessly integrating different higher-level modules. Furthermore, an entire range of Bayesian algorithms becomes applicable for image-based inference. Other popular approaches (especially deep learning based techniques) fail to provide any kind of inference about the distribution of the estimate; thus, incorporating these techniques into complex real-time systems puts the entire system at great risk. However, as mentioned above, such offline algorithms, which offer quick initial estimates, can be combined with our online algorithm to improve the overall reliability of the system.
References

[1] Fisheye lenses
[2] Large-scale direct SLAM for omnidirectional cameras
[3] Omnidirectional stereo vision using fisheye lenses
[4] Automatic calibration of fish-eye cameras from automotive video sequences
[5] Efficient large-scale stereo matching
[6] Stereo processing by semi-global matching and mutual information
[7] Wide-angle camera technology for automotive applications: a review
[8] Trends towards automotive electronic vision systems for mitigation of accidents in safety critical situations
[9] 3D visual perception for self-driving cars using a multi-camera system: calibration, mapping, localization, and obstacle detection
[10] Feasible self-calibration of larger field-of-view (FOV) camera sensors for the advanced driver-assistance system (ADAS)
[11] What's the best lens for stereo? An exploration of lens models and their stereoscopic depth performance
[12] An enhanced unified camera model
[13] 3D scene reconstruction from multiple spherical stereo pairs
[14] Accurate real-time disparity estimation with variational methods
[15] FisheyeDistanceNet: self-supervised scale-aware distance estimation using monocular fisheye camera for autonomous driving
[16] Binocular spherical stereo
[17] DTAM: dense tracking and mapping in real-time
[18] PyAMG: algebraic multigrid solvers in Python
[19] Variational fisheye stereo
[20] On the accuracy of dense fisheye stereo
[21] Geometric modelling and calibration of fisheye lens camera systems
[22] Depth estimation using stereo fish-eye lenses
[23] Stereo vision methods: from development to the evaluation of disparity maps