title: Inline Double Layer Depth Estimation with Transparent Materials
authors: Kopf, Christian; Pock, Thomas; Blaschitz, Bernhard; Štolc, Svorad
date: 2021-03-17
journal: Pattern Recognition
DOI: 10.1007/978-3-030-71278-5_30

Abstract: 3D depth computation from stereo data has been one of the most researched topics in computer vision. While state-of-the-art approaches have flourished over time, the reconstruction of transparent materials is still considered an open problem. Based on 3D light field data we propose a method to obtain smooth and consistent double-layer estimates of scenes with transparent materials. Our novel approach robustly combines estimates from models with different layer hypotheses in a cost volume, followed by the minimization of a joint second order TGV energy on two depth layers. Additionally we showcase the results of our approach on objects from common inspection use-cases in an industrial setting and compare our work to related methods.

Reconstructing 3D depth information through multi-view stereo methods is a well researched topic of computer vision and has led to countless variations and approaches over the years. In order to obtain depth from correspondences in different views, the majority of approaches rely on the assumption that scene points reflect light in all directions uniformly, i.e. the Lambertian assumption [11]. Unfortunately this assumption is violated for scene points on transparent materials, a special case that is rarely considered in state-of-the-art approaches. Moreover, little to no surface structure and refractive effects make the task even more difficult. Despite the challenging nature of the problem, computing depth from scenes with transparent materials is a desirable ability for industrial applications.

Publications regarding depth estimation for transparent materials are sparse. Earlier publications explore different routes to tackle the problem, for example visible-light tomography [20] or polarized light [15]. Recent approaches trend towards specialized hardware, such as Time-of-Flight (ToF) measurements [7, 19] from RGB-D cameras and structured light setups [10, 16, 25]. With the increasing popularity of learning-based methods, recent approaches achieve full 3D reconstructions of transparent objects. For example, recent work by Li et al. [14] computes 3D transparent shapes from a small set of mobile phone camera images and a known environment map. Another learning-based publication of recent years is the single-image approach by Chen et al. [5], which produces visually convincing transparent shapes from the learned refractive flow of transparent objects.

Despite the impressive results of these approaches, they are not applicable to our setting. Our goal is to present a method for a more restricted environment. Firstly, we limit our approach to passive (multi-view) data only. We also assume a front-parallel multi-view alignment along exactly one image axis, which is common for stereo data. To allow for applications in an industrial setting, e.g. objects on a conveyor belt, our approach is also required to process scenes subjected to linear motion while guaranteeing reasonably fast runtime. The intention is to allow continuous acquisition with parallel processing of the data.
The scene structure modeled by our approach consists of a transparent object in front of an opaque background. The goal is thus to estimate two depth layers globally across the entire scene, i.e. a front layer for non-opaque surfaces and an opaque back layer for scene points that may lie underneath; see Fig. 1 for an example.

In the context of passive (multi-view) data, a popular method to describe densely sampled scenes is given by the plenoptic function [2], which enables the description of 3D scene geometry in a light field (LF) [6]. A simplified variant of a light field can be interpreted intuitively as a discrete volume in 3D, comprised of a stack of images from different views. The acquisition of such light fields is commonly subject to epipolar constraints, thus enabling depth estimation through light field analysis. For scene points of transparent materials, multiple depth cues superimpose, which results in local multi-orientation structures in the light field data.

In [22, 23] Wanner and Goldlücke show how local structure tensor (ST) analysis [1] in the epipolar domain of light fields can be utilized to solve correspondence problems for Lambertian and non-Lambertian surfaces. The major drawback of this approach is that the depth estimates are bound to their corresponding model hypothesis, which implies that depth estimates are only justified in regions of a scene where the respective hypothesis is valid. Johannsen et al. [8, 9] present a different approach, solving coding problems with a dictionary of pattern orientations in local neighborhoods to compute multi-layer depth estimates based on 4D light fields. Our experiments have shown that conventional stereo matching and line fitting in the epipolar domain give poor results for transparent materials. In comparison, we have found the double orientation structure tensor approach to be well suited for transparent materials, since this model is very sensitive to depth cues from non-Lambertian surfaces.

In this work we adopt the basic idea of depth estimation with structure tensors on narrow-baseline light field data from an inline setup with the aforementioned requirements. We start by reviewing our setup and the light field structure in Sect. 2. Furthermore we give a brief introduction to structure tensor analysis in the context of light fields and in Sect. 4 follow up with our proposed method to combine estimates in a cost volume. After that we explain how we refine our double-layer results through a novel variational method.

Our assumption on the general scene structure is that a transparent object of arbitrary shape is located in front of an opaque second layer, as in Fig. 2. This example shows a clear tape stretched over a coin. The objective is to reconstruct the depth of the tape surface as well as the depth of the coin underneath. Acquisition rays that traverse the transparent object carry composite depth information of at least two surfaces. This leads to some areas for which a single-layer depth hypothesis is valid and some areas for which a double-layer hypothesis is valid. In addition, the problem becomes even more complicated due to opaque structures on the transparent object which might occlude the subjacent background. As the final result our goal is to obtain two depth images of size M × N where the depth estimates among both images coincide in single-layer regions and differ in double-layer regions.
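To make the volume interpretation concrete, the following minimal NumPy sketch (not from the paper; the file names and the imageio loader are placeholder assumptions) stacks P = 9 grayscale views into a discrete light field L(x, y, s) and slices out one EPI:

```python
import numpy as np
import imageio.v3 as iio

# Stack P = 9 grayscale views (hypothetical file names) into a discrete
# light field volume L of shape (M, N, P) = (rows x, columns y, views s).
views = [iio.imread(f"view_{s:02d}.png") for s in range(9)]
L = np.stack(views, axis=-1).astype(np.float64)

# Fixing a row x = x_bar yields an epipolar plane image (EPI); transposing
# gives axes (s, y), so sloped line patterns encode the disparity dy/ds.
# Transparent scene points superimpose two such slopes.
x_bar = L.shape[0] // 2
epi = L[x_bar, :, :].T   # shape (P, N)
```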
The input data is assumed to be 3D light field data, i.e. a mapping of the form

L : (x, y, s) ↦ L(x, y, s),

where x, y denote the spatial image coordinates and s the view index; in the following we use L as a substitute for the whole light field. All data that is used in this paper is listed in Table 1. To evaluate our proposed method we use the publicly available "Maria" dataset from [24] and adapt it to our 3D light field setup by discarding all but the 9 central views in the horizontal direction of movement. In addition we acquired multiple light fields of transparent objects with an inline acquisition system based on the principles of [21]. An illustration and a short explanation of the setup can be found in Fig. 3.

The depth information of a LF scene can be computed through local orientation analysis in epipolar plane images (EPIs). EPIs are images comprised of projections of acquisition rays sharing common epipolar planes. By choosing a fixed value x = x̄ one thus obtains the EPI

E_x̄ : (s, y) ↦ L(x̄, y, s).

An example of such an EPI is depicted in Fig. 2. From this figure it can be seen how different depth layers of the scene impose multi-orientation patterns with certain angles in the epipolar domain of a LF. Note that the angle of these patterns is linked to the depth of the corresponding scene points. To obtain depth estimates we thus apply an orientation analysis through the aforementioned structure tensor approach [22]. Since the orientation analysis is performed locally at each point in the light field (M × N × P) we obtain a result stack with the same dimensions.

Single Orientation Structure Tensor (SOST): The purpose of the single-orientation model [1] is to compute depth estimates for opaque or occluded scene points, i.e. single-layer regions. By finding the local orientation w(ξ) in a local neighborhood Ω through minimization of the least squares energy term

E(ξ) = ∫_Ω (∇E_x̄(p)ᵀ w(ξ))² dp, (7)

we obtain an angular estimate ξ, where p = [s, y]ᵀ is the image pixel of the EPI E_x̄ and ∇E_x̄(p) denotes the gradient vector in the EPI domain. Reformulating Eq. (7) leads to the definition of the single orientation structure tensor

S = ∫_Ω ∇E_x̄(p) ∇E_x̄(p)ᵀ dp.

From the eigenvalue analysis on S, w(ξ) is obtained as the eigenvector corresponding to the smaller eigenvalue. The resulting estimate stack for the SOST model will be denoted by H₁(x, y, s).

Second Order Double Orientation Structure Tensor (SODOST): By extending the single-orientation case to the double-orientation case [1] with an additive composition of patterns in a local neighborhood Ω, the single orientation model can be extended to the double orientation model. In a similar fashion to the SOST model, the optimal solution for a given multi-orientation pattern patch Ω can be computed by minimizing

E(θ, γ) = ∫_Ω (m(θ, γ)ᵀ d(p))² dp,

where m(θ, γ) denotes the MOP vector [1] and d(p) collects the second order derivatives of E_x̄ at p. Because the structure tensor model T = ∫_Ω d(p) d(p)ᵀ dp is comprised of second order derivatives, we will refer to this model as the second order double orientation structure tensor (SODOST) model. Through an eigenvalue analysis with subsequent root solving on this structure tensor we obtain the two orientations u(θ) and v(γ) and the corresponding disparities, where it is assumed that u(θ) denotes the estimate closer to the camera. By performing this local analysis based on this model we thus obtain two further estimate volumes H₂(x, y, s) (front) and H₃(x, y, s) (back).

The quality of the structure tensor estimates greatly depends on the choice of the inner and outer scale for Gaussian smoothing. This is explained in more detail in [22]. For the "Maria" dataset we chose the same parameters as the referenced paper (i.e. σᵢ = σₒ = 0.8) and for all other datasets we chose a slightly larger outer scale σₒ = 1.2.
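To illustrate the single orientation model, the sketch below (a simplified re-implementation under our own conventions, not the authors' code; SciPy's gaussian_filter stands in for the inner and outer Gaussian scales) computes a SOST disparity map on one EPI with axes (s, y):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def sost_disparity(epi, sigma_i=0.8, sigma_o=1.2):
    """Single orientation structure tensor (SOST) disparity on one EPI.

    epi: 2D array with axes (s, y) -- views x spatial coordinate.
    Returns a per-pixel estimate of the EPI line slope dy/ds (disparity).
    """
    # Inner scale: pre-smooth before differentiation.
    epi_s = gaussian_filter(epi.astype(np.float64), sigma_i)
    g_s, g_y = np.gradient(epi_s)                  # gradients along s and y
    # Outer scale: average the tensor entries over the neighborhood Omega.
    J_ss = gaussian_filter(g_s * g_s, sigma_o)
    J_yy = gaussian_filter(g_y * g_y, sigma_o)
    J_sy = gaussian_filter(g_s * g_y, sigma_o)
    # Smaller eigenvalue of the 2x2 tensor [[J_ss, J_sy], [J_sy, J_yy]];
    # its eigenvector (J_sy, lam - J_ss) points along the EPI line direction.
    lam = 0.5 * (J_ss + J_yy) - np.sqrt(0.25 * (J_ss - J_yy) ** 2 + J_sy ** 2)
    eps = 1e-12                                    # guard against division by 0
    return (lam - J_ss) / (J_sy + np.copysign(eps, J_sy))
```

The double orientation model follows the same pattern but builds the tensor T from second order derivatives and solves for two slopes per pixel.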
In our experiments we have observed that the structures in EPIs are a lot more prominent on opaque surfaces compared to transparent scene points. Prior to computing the structure tensor estimates we therefore normalize the contrast of the entire light field to enhance the structure imposed by transparent surfaces. Let I ∈ R^{M×N×P} denote a single channel of the light field; we apply the normalization filtering

Î = (I − k_σ ∗ I) / (c · √(k_σ ∗ (I − k_σ ∗ I)²)) + 1/2,

where ∗ is the volumetric convolution and k_σ denotes a Gaussian kernel. The parameter c is used to scale the normalized output to [0, 1]. In our implementation we used c = 6. We chose k_σ to be a 2D kernel in spatial dimensions with a standard deviation σ = 2.

The major drawback of the plain structure tensor results is the affiliation to either a single-layer or a double-layer hypothesis, each of which is valid only in certain regions of the scene. Our approach combines the results from all models and implicitly rules out incorrect or weakly supported estimates. To achieve this in a robust manner we create a cost volume V ∈ R^{M×N×D} where D is the number of elements in a finite set {d₀, d₁, ..., d_{D−1}} of disparity hypotheses.

As depicted in Fig. 2, the angle of the sloped pattern lines in the epipolar domain of the light field corresponds to different depths and subsequently disparities in a light field. Recall that we compute depth estimates at each point in a local neighborhood of the light field, such that we attain the estimate volumes H₁, H₂ and H₃. Because of the locality of the structure tensor operation, the estimates along the corresponding sloped line in the light field are ideally constant. If the variance along this sloped line is high we want to penalize this with a high cost. To analyze the estimates along any sloped line we thus formulate a ray which intersects the reference view s_ref at the coordinates x, y with an angle α. Since disparities can be translated to angles, we can use the set of disparity hypotheses from above to determine the intersected voxel in each view for each hypothesis ray. The relative shift for a certain view sᵢ with respect to s_ref can therefore be computed with

Δy(sᵢ) = dᵢ · (sᵢ − s_ref).

In theory each ray thus intersects up to P voxels. We define that all these voxels of a single ray form a finite set R_{x,y,dᵢ}. In a geometric perspective we can alter the thickness of this ray by also selecting neighboring voxels in each view, setting its radius of influence and resulting in more or fewer voxels in the set. An illustration of this principle is given in Fig. 4. With this in mind, the cost for each point of the cost volume can be attained through

V(x, y, dᵢ) = (1 / |R_{x,y,dᵢ}|) Σ_{r ∈ R_{x,y,dᵢ}} Σ_{j=1}^{3} ρ(H_j(r), dᵢ),

where |R_{x,y,dᵢ}| denotes the number of elements in the set R_{x,y,dᵢ} and r denotes a point in light field coordinates. For the inner cost metric we chose a stable (truncated) absolute distance from the disparity hypothesis dᵢ,

ρ(h, dᵢ) = min(|h − dᵢ|, τ),

where τ is a hyperparameter. We remark that the structure tensor estimates H_j(r) from each model may be incorrect and noisy in certain regions. The cost measure above provides a combined robust cost over all three structure tensor models in a local neighborhood along a hypothesis ray. The cost volume in Fig. 5 shows the unimodal and bimodal distribution along d for any location x, y on the spatial grid. Naturally the distribution can have more than two modes, but in our testing we have found that either one or two prominent modes are present in the cost volume.
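A minimal sketch of this accumulation step follows (our own simplification, not the reference implementation): shifts are rounded to whole pixels, the ray radius is fixed to a single voxel, and the wrap-around of np.roll at the borders is ignored for brevity.

```python
import numpy as np

def build_cost_volume(H_list, disparities, s_ref, tau=0.5):
    """Accumulate truncated distances rho(H_j(r), d_i) along hypothesis rays.

    H_list: estimate volumes H_1, H_2, H_3, each of shape (P, M, N)
            (views s, rows x, columns y), holding per-pixel disparities.
    disparities: 1D array of hypotheses d_i in pixels per view step.
    Returns the cost volume V of shape (M, N, D).
    """
    P, M, N = H_list[0].shape
    V = np.empty((M, N, len(disparities)))
    for i, d in enumerate(disparities):
        cost = np.zeros((M, N))
        for s in range(P):
            # Relative shift of view s w.r.t. the reference view:
            # Delta y = d * (s - s_ref), rounded to whole pixels here.
            shift = int(round(d * (s - s_ref)))
            for H in H_list:
                est = np.roll(H[s], -shift, axis=1)  # gather along the ray
                cost += np.minimum(np.abs(est - d), tau)
        V[:, :, i] = cost / (P * len(H_list))
    return V
```

Truncating the per-voxel distance at τ keeps single grossly wrong estimates from dominating the average, which is what makes the combination over the three models robust.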
The task now is to find the index d and the corresponding cost for one or two prominent modes on mostly smooth distributions. To isolate the minimal points we use a simple non-maximum suppression algorithm denoted in Algorithm 1 and define that a local minimum is only valid if it is below a certain threshold and no local maximum is within a certain range along d. The algorithm is based upon a local gray value dilation operation with a subsequent comparison to the unchanged cost volume.

Algorithm 1. Implementation of the non-maximum suppression. The operator ⊙ denotes the Hadamard product and δ_B is the 1-0 indicator function on condition B.
1: procedure NMS(V(x, y, d))
2:   V_dil(x, y, d) ← dilate(−V(x, y, d))  ▷ Dilate along d with width = 3
3:   V_NMS(x, y, d) ← δ_{−V(x,y,d) = V_dil(x,y,d)} ⊙ V(x, y, d)
4:   return V_NMS(x, y, d)
5: end procedure

For each x, y, determining the index of the smallest minimum forms an image which is denoted g₁. For areas with a valid second best minimum we likewise determine the indices which form g₂. For single-layer coordinates we use the corresponding entry from g₁. Since the estimates in both images have no clear affiliation to the front layer or the back layer, we sort both images such that

(g₁)ᵢ ≥ (g₂)ᵢ for all i,

where (g_j)ᵢ denotes the i-th element of a flattened version of g_j.

In this section, we propose a joint refinement approach to refine the noisy depth estimates of the foreground surface u and the background surface v. As a regularizer we make use of the Total Generalized Variation (TGV) regularizer [3] of second order, which has been shown to be very suitable for the refinement of depth images [12, 17]. The TGV regularizer of second order is defined as

TGV²_α(u) = min_w α₁ ∫_Ω |∇u − w| dx + α₀ ∫_Ω |∇w| dx,

and the joint refinement energy reads

min_{u,v} TGV²_α(u) + TGV²_α(v) + ‖c₁ ⊙ (u − g₁)‖₁ + ‖c₂ ⊙ (v − g₂)‖₁  s.t. u ≥ v, (16)

where the optimal solution for the foreground u is constrained to be at least as close to the camera as the background v. Here c₁, c₂ ∈ R^{M×N} denote the reciprocal values of the residual cost from the solutions g₁ and g₂ to steer the data fidelity locally, and ⊙ denotes the Hadamard product.

By transforming Eq. (16) into the corresponding saddle-point notation in the general form of

min_x max_y ⟨Kx, y⟩ + h(x) − f*(y),

the problem can adequately be solved through the PDHG approach by Chambolle and Pock [4]. To apply the algorithm to our problem, the proximal operators prox_{τh} and prox_{σf*} have to be determined. Note that both can be computed element-wise. Since the convex conjugate of f = ‖·‖_{2,1} is given by the indicator function on the (2, ∞)-norm ball, the proximal map in dual space is given by the pointwise projection

(prox_{σf*}(y))ᵢ = yᵢ / max(1, ‖yᵢ‖₂).

Unfortunately the operator in primal space is not as straightforward, since we need to pay attention to the constraint u ≥ v. We can formulate the term h(x) from the saddle-point notation in the following way:

h(u, v) = ‖c₁ ⊙ (u − g₁)‖₁ + ‖c₂ ⊙ (v − g₂)‖₁ + δ_{u ≥ v}(u, v),

where the last term is the indicator function of the constraint u ≥ v. As a first step we compute the solutions û and v̂ by the well-known shrinkage operator

û = g₁ + max(0, |ũ − g₁| − τ c₁) ⊙ sign(ũ − g₁),
v̂ = g₂ + max(0, |ṽ − g₂| − τ c₂) ⊙ sign(ṽ − g₂),

where ũ and ṽ denote the arguments of the proximal operator. In case û ≥ v̂ holds element-wise, the pair already satisfies the constraint; otherwise the violating entries are projected back onto the feasible set u ≥ v.
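A compact sketch of both proximal steps is given below (our own illustrative reconstruction, not the authors' implementation; it assumes element-wise shrinkage followed by an orthogonal projection onto {u ≥ v}, which replaces a violating pair by its mean):

```python
import numpy as np

def prox_primal(u_t, v_t, g1, g2, c1, c2, tau):
    """prox_{tau*h}: shrinkage toward g1/g2, then projection onto {u >= v}.

    All arguments are (M, N) arrays; c1, c2 are the local data weights.
    """
    def shrink(x, g, c):
        # Soft shrinkage toward g with element-wise threshold tau * c.
        return g + np.sign(x - g) * np.maximum(np.abs(x - g) - tau * c, 0.0)

    u_hat = shrink(u_t, g1, c1)
    v_hat = shrink(v_t, g2, c2)
    # Orthogonal projection onto the half-space u >= v: where the shrinkage
    # result violates the constraint, replace both values by their mean.
    mean = 0.5 * (u_hat + v_hat)
    viol = u_hat < v_hat
    return np.where(viol, mean, u_hat), np.where(viol, mean, v_hat)

def prox_dual(y):
    """prox_{sigma*f*}: pointwise projection onto the (2, inf)-norm ball.

    y: dual variable with the vector components on the last axis.
    """
    norms = np.maximum(1.0, np.linalg.norm(y, axis=-1, keepdims=True))
    return y / norms
```

Note that applying the projection after the shrinkage mirrors the two-step treatment sketched in the text; it is a practical simplification rather than the exact proximal map of the sum of both terms.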