Capturing Detailed Deformations of Moving Human Bodies
He Chen, Hyojoon Park, Kutay Macit, Ladislav Kavan
2021-02-15

We present a new method to capture detailed human motion, sampling more than 1000 unique points on the body. Our method outputs highly accurate 4D (spatio-temporal) point coordinates and, crucially, automatically assigns a unique label to each of the points. The locations and unique labels of the points are inferred from individual 2D input images only, without relying on temporal tracking or any human body shape or skeletal kinematics models. Therefore, our captured point trajectories contain all of the details from the input images, including motion due to breathing, muscle contractions and flesh deformation, and are well suited to be used as training data to fit advanced models of the human body and its motion. The key idea behind our system is a new type of motion capture suit which contains a special pattern with checkerboard-like corners and two-letter codes. The images from our multi-camera system are processed by a sequence of neural networks which are trained to localize the corners and recognize the codes, while being robust to suit stretching and self-occlusions of the body. Our system relies only on standard RGB or monochrome sensors, fully passive lighting and a passive suit, making our method easy to replicate, deploy and use. Our experiments demonstrate highly accurate captures of a wide variety of human poses, including challenging motions such as yoga, gymnastics, or rolling on the ground.

In most real-world images, the human body is occluded by clothing, making precise body measurements difficult or impossible. A significant amount of previous work focuses on approximate but robust pose estimation in the wild [Cao et al. 2018; Güler et al. 2018]. However, a small muscle twitch or the speed of breathing may contain signals that are critical in certain contexts; e.g., in the context of social interactions, minute body shape motion may reveal important information about the person's emotional state or intent [Joo et al. 2018]. Detailed human body measurements are also highly relevant in orthopedics and rehabilitation [Zhou and Hu 2008], virtual cloth try-on [Giovanni et al. 2012], and building realistic avatars for telepresence and AR/VR [Barmpoutis 2013; Lombardi et al. 2018]. When precise measurements are needed, prior work utilized either 1) reflective markers attached to a motion capture suit or glued to the skin [Park and Hodgins 2006], or 2) colored patterns painted on the skin [Bogo et al. 2017]. The traditional reflective ("mocap") markers present certain limitations. Because all of the markers look alike (Fig. 2a), marker labeling relies strongly on temporal tracking and high frame-rate cameras. However, robust marker labeling is a hard problem [Song and Godøy 2016] which often requires manual corrections, especially for markers that have been occluded for too long. The difficulty of this problem grows with the number of markers [Park and Hodgins 2006], thus sparse marker sets are most common in the industry. Sparse marker sets are sufficient for fitting a low-dimensional skeletal body model, but not for capturing the details of flesh deformation or motion due to breathing. To capture moving bodies with high detail, the DFAUST approach [Bogo et al. 2017]
starts by geometrically registering a template body model to 3D scans [Hirshberg et al. 2012] and then uses colored patterns on the skin to obtain high-accuracy temporal correspondences via optical flow. These colored patterns serve a similar purpose to the checkerboard-like corners on our suit, i.e., they enable precise localization of points on the surface of the body. The key difference of our approach is that our suit also contains unique two-letter codes adjacent to each corner, allowing us to label the corners directly by recognizing the codes. This is not possible with DFAUST's patterns, because they are self-similar, created by applying colored stamps to the skin. Instead, the DFAUST approach relies on the initial geometric registration and temporal tracking, which can suffer from error accumulation and may lead to incorrect local minima in more challenging poses or fast motions. The DFAUST dataset contains a variety of highly detailed human body animations, but is restricted to upright standing-type motions. In contrast, we demonstrate captures of a wider variety of motions, including gymnastics exercises, yoga poses or rolling on the ground, see Fig. 22 and our accompanying video and data. Our new motion capture method was enabled by recent advances in deep learning and high-resolution camera sensors. The key idea is to use a new type of motion capture suit with special fiducial markers, consisting of checkerboard-like corners for precise localization and two-letter codes for unique labeling. Our localization and labeling process is very robust, because it does not rely on temporal tracking or any type of body model; in fact, our approach succeeds even if only a small part of the body is visible in the image, e.g., the zoom-ins in the right part of Fig. 17. A similar advantage exists also in the temporal domain. Because our localization and labeling approach can process each image independently, there are no issues due to occlusions and dis-occlusions which complicate traditional temporal tracking in both marker-based as well as marker-less methods. Even though the automatic localization and labeling are very robust, achieving this functionality is non-trivial, because our methods need to cope with significant stretching of the tight-fitting suit as well as projective distortion. Fortunately, checkerboard-like corners remain checkerboard-like even under significant stretching of the suit. We apply three convolutional neural networks combined with geometric algorithms. The first of our convolutional neural networks (CNNs) is a corner detector which localizes all of the corners in an input image (4000 × 2160). To rectify the distortion of the two-letter codes inside the white squares, we connect candidate four-tuples of corners into quadrilaterals (quads) and apply a homography transformation which rectifies both suit stretching as well as projective distortion. Another CNN called RejectorNet then performs quality control, keeping only legible and upright-oriented codes. The remaining codes are passed to RecogNet which reads the characters in the two-letter code. Because the orientation of our codes is unique (we avoided symmetric symbols such as "O" or "I" as well as ambiguous pairs like "6" and "9"), recognition of the code allows us to uniquely label each of the adjacent corners, see Fig. 2c.
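To summarize this per-image localization and labeling pipeline, the following is a minimal pseudocode-style sketch; all function names (detect_corners, rejector_net, etc.) are hypothetical stand-ins for the components described in Section 3, not our actual implementation.

```python
# Hypothetical sketch of the per-image labeling pipeline described above.
# Function names are placeholders for the components described in Section 3.

def label_corners_in_image(image):
    corners = detect_corners(image)               # CornerdetNet: sub-pixel corner locations
    quads = generate_candidate_quads(corners)     # connect 4-tuples of nearby corners
    labeled = {}
    for quad in quads:
        patch = rectify_with_homography(image, quad)   # undo stretching + perspective
        for oriented in four_rotations(patch):         # upright orientation is unknown
            if not rejector_net(oriented):             # quality control: discard dubious quads
                continue
            code = recog_net(oriented)                 # read the two-letter code, e.g. "U7"
            for k, corner in enumerate(quad):          # corner index 1..4 relative to the code
                labeled[unique_id(code, k + 1)] = corner
    return labeled   # 2D corner positions with unique labels, per image
```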
The labels of our corners establish correspondences both in time and in space, i.e., between individual cameras in our multi-view system, which means that we can easily triangulate the 2D corner locations into 3D points. However, the 3D reconstructed (triangulated) points will inevitably miss observations due to self-occlusions and the limited number of our cameras, see Fig. 1c. To fill (interpolate) these missing observations, we start by fitting the STAR model [Osman et al. 2020] and then refine it for each of our actors using point trajectories from a calisthenics-type motion sequence captured using our method. This refinement ensures that we obtain the best possible low-dimensional model for each of our actors, since high quality is our main objective. We use this refined body model to interpolate the missing corners in the rest pose, resulting in the final mesh without any holes, see Fig. 1d. Our goal was to make each two-letter code as small as possible, so we can recover the highest possible number of points on the body. We created our special capture suits in two sizes, one "medium" (with 1487 corners) and one "small" (with 1119 corners), and we captured three actors: one male and two females (two of the actors used the "medium" suit). We evaluated both the geometric accuracy, through the reprojection error of the 3D reconstruction, and the quality of the temporal correspondences, by computing the optical flow between synthetic (rendered) images and the real images. Results show that 99% of our reconstructed points have a reprojection error of less than 1.01 pixels, and 95% of the evaluated pixels have an optical flow norm of less than 1.2 pixels. In our camera setup, 1 pixel approximately corresponds to 1 mm on a person 2 meters away from the camera. Contributions: (1) We propose a new method to measure 3D marker locations at each frame and automatically infer corresponding marker labels. This is achieved without any priors on human body shapes or kinematics, which means that our data are "raw measurements", immune to any type of modeling or inductive bias. (2) We introduce a novel type of fiducial markers and capture suit, which enables marker localization and unique labeling using only local image patches. Our approach does not utilize temporal tracking, which makes it robust to marker dis-occlusions and also invites parallel processing, because each frame can be processed independently. (3) All results in this paper were obtained using an experimental multi-camera system utilizing 16 commodity RGB cameras and passive lighting. High-end multi-camera systems based on machine vision cameras such as those built by Facebook [Lombardi et al. 2018] or Google [Guo et al. 2019] require significant hardware investments and engineering expertise. In contrast, our system is easy to build from inexpensive off-the-shelf parts. We provide our data as supplemental material and we invite individual researchers, independent studios or makers to replicate our setup and capture new actors and motions. Optical systems based on reflective markers [Menache 2000] are the most widely used approaches to capture the human body. While typically only sparse marker sets are used, [Park and Hodgins 2006, 2008] pushed the resolution of reflective-marker-based systems up to 350 markers to capture detailed skin deformation. However, difficulties in marker labeling [Song and Godøy 2016] complicate further increases of resolution by adding even more markers. Recent work utilizes self-similarity analysis [Aristidou et al. 2018]
and deep learning [Han et al. 2018; Holden 2018] to reduce the expensive manual clean-up in the marker labeling procedure. An alternative to the classical reflective markers is the use of colored cloth, enabling the capture of certain types of garments such as pants [White et al. 2007] or hand tracking using colored gloves [Wang and Popović 2009]. Early work in markerless motion capture [Gavrila and Davis 1996; Bregler et al. 2004] inferred human poses directly from 2D images or videos. [Kehl and Van Gool 2006] integrates multiple image cues such as edges, color information and volumetric reconstruction to achieve higher accuracy. [Brox et al. 2009] tracks a 3D human body in 2D images by combining image segments, optical flow and SIFT features [Lowe 1999]. [De Aguiar et al. 2008] deforms a laser scan of the tracked subject under the constraints of multi-view video to capture spatio-temporally coherent body deformation and the textural surface appearance of the actors. Silhouettes [Liu et al. 2013; Vlasic et al. 2008] or visual hulls [Corazza et al. 2010] can be used to obtain more detailed human body deformations. [Stoll et al. 2011] model the human body as a sum of Gaussians, representing both the shape and appearance of the captured actors. Deep learning enabled estimation of 2D human poses from monocular multi-person images [Newell et al. 2016; Pishchulin et al. 2016; Wei et al. 2016], more recently also with hands and faces [Cao et al. 2017, 2018]. 3D pose or even the dense 3D surface of the human body can also be predicted from a single image [Choutas et al. 2020; Güler et al. 2018; Mehta et al. 2017; Xiang et al. 2019; Xu et al. 2019]. Morphable human models can be learned from multi-person datasets [Robinette et al. 2002]. Models such as SCAPE [Anguelov et al. 2005], SMPL and STAR [Osman et al. 2020] focus on the body, while models such as Adam [Joo et al. 2018] and SMPL-X [Pavlakos et al. 2019] also include the face and the hands. Focusing on high-quality rendering rather than geometry, [Guo et al. 2019; Meka et al. 2020] proposed methods for photo-realistic relighting of moving humans, including clothing and accessories such as backpacks. The idea of a motion capture suit with a special texture is related to fiducial markers used, e.g., in robotics or augmented reality, such as ARTag [Fiala 2005], AprilTag [Olson 2011; Wang and Olson 2016], ArUco [Garrido-Jurado et al. 2014] and many others, but these fiducial markers are typically assumed to be non-deforming. They are also not easy for humans to read, which would complicate their annotation. The localization of our fiducial markers is related to corner detection. Many corner detection methods have been developed to meet different use-case scenarios. Some methods are designed to detect general corner features that occur naturally in images [DeTone et al. 2018; Rosten and Drummond 2006]. Another class of corner detectors focuses on rigid calibration checkerboards [Bennett and Lasenby 2014; Chen et al. 2018; Donné et al. 2016; Hu et al. 2019], which are particularly useful in camera calibration. Because these methods assume the checkerboard pattern to be rigid, they will not work on our checkerboard-like suit, which can undergo significant deformations (Fig. 2b). The code recognition component of our method is related to text recognition. As discussed in [Long et al. 2020], text recognizers generally perform poorly on text with large spatial transformations.
One possible solution is based on generating region proposals [Jaderberg et al. 2014; Ma et al. 2018] to rectify the spatial transformation. High-resolution temporal correspondences can be obtained by registering a template mesh to RGB-D images or 3D scans. The registration can be based solely on geometric information [Allain et al. 2015; Li et al. 2009], or combined with RGB images [Bogo et al. 2015] to reduce tangential sliding. Model-less approaches are also possible [Collet et al. 2015; Dou et al. 2015; Newcombe et al. 2015]. Those methods focus on registering sequential motions frame to frame, with the assumption of small displacements between subsequent frames. Therefore, they can suffer from error accumulation, resulting in drift over time [Casas et al. 2012]. Aligning non-sequential motions is also possible [Huang et al. 2011; Tung and Matsuyama 2010], but it is challenging to establish correspondences between very different poses [Boukhayma et al. 2016; Prada et al. 2016]. Deformation models can be trained from 3D scans [Allen et al. 2003; Anguelov et al. 2005], with non-rigid scan registration being the technical challenge [Hirshberg et al. 2012]. Similarly to our new motion capture suit, the FAUST [Bogo et al. 2014] and DFAUST [Bogo et al. 2017] methods paint high-frequency colored patterns directly onto the skin. We chose to work with a suit because putting it on and off is easy and fast compared to applying colored stamps and washing them off after the capture session. Even though we only experimented with basic tight-fitting suits in this paper, future improvements such as adhesive suits or non-permanent tattoos are possible, see Section 7. Our capture system is significantly simpler and less expensive: we use only 16 standard (RGB) cameras with passive uniform lights, while [Bogo et al. 2014, 2017] used 22 pairs of stereo cameras, 22 RGB cameras and 34 speckle projectors (active light). Perhaps more important are the technical differences between our approach and DFAUST, in particular the fact that our codes are unique, as opposed to the self-repetitive patterns used in FAUST and DFAUST. Rather than creating a dataset, our goal was to create a universal and practical method to enable future research on advanced human body modeling and its applications in areas ranging from graphics to sports medicine. Suit. To create our special motion capture suit, we started by purchasing a tight-fitting unitard, originally intended for dance or performing arts. Fortuitously, one of the manufacturer-provided patterns was precisely the black-and-white checkerboard texture reminiscent of computer vision calibration boards (in fact this provided some of the original inspiration for this project). We purchased two suits, one "medium" and one "small", and augmented them by writing codes into the white squares using a marker pen. The medium suit contains 1487 corners and 625 two-letter codes; the small suit has 1119 corners and 456 codes. For our two-letter codes, we only used symbols whose upright orientation is unique and non-ambiguous, specifically: "1234567ABCDEFGJKLMPQRTUVY". Camera system. Our multi-camera setup contains 16 standard (RGB) cameras arranged into a circle surrounding the capture volume (Fig. 3b). Each camera captures 4000 × 2160 images at 30 FPS in RAW format. The camera shutters are synchronized via genlock with negligible synchronization error, which means that human motion is captured as if "frozen" in time.
Surrounding the capture volume, we positioned 32 softboxes that generate uniform diffuse light. The bright light allows us to use a small aperture and very fast 0.5 shutter speeds, guaranteeing sharp images even with the fastest human motions. The cameras are calibrated by waving a traditional calibration checkerboard in front of them. The intrinsic and extrinsic camera parameters are calibrated using the well-established method [Zhang 2000], for which we use OpenCV's checkerboard corner detector for rigid calibration boards [Bradski 2000]. Next, the camera parameters and the 3D checkerboard corner positions in world coordinates are further refined using bundle adjustment [Triggs et al. 1999]. We use the Levenberg-Marquardt algorithm and the Ceres library [Agarwal and Mierle 2012]. Image processing pipeline. The calibrated cameras generate sequences of images, which are processed by our pipeline outlined in Fig. 4. We start by detecting checkerboard-like corners in the input image with sub-pixel accuracy (Fig. 4a, Section 3.1). Next, we need to uniquely label the detected corners by recognizing the adjacent two-letter codes. Because the codes are written in the white squares surrounded by four corners, we generate candidate quadrilaterals (quads) by connecting four-tuples of corners. Only a few four-tuples of corners correspond to the white squares, but it is okay to generate a quad that does not correspond to a white square, because it will be discarded later; hence we use the term "candidate quads", see Fig. 4b. Since the quads are generated by connecting four corners, we naturally have correspondences between the corners and the quads. The candidate quads are rectified by mapping them into a regular square using homography (Fig. 4c, Section 3.2) to remove suit stretching and perspective distortion, and then passed as input to RejectorNet, which performs quality control and checks whether the quad actually corresponds to a white square with a code (Fig. 4d). The RejectorNet also ensures the correct upright orientation of the code. The images accepted by RejectorNet are then passed to RecogNet, which reads the two-letter code that finally enables us to uniquely label each corner (Fig. 4e). We would like to point out that our method is local by design, i.e., each stage of the pipeline works with small patches of the input image. This gives us several advantages: a) Our method is capable of extracting reliable geometric information about the human body and, crucially, correspondences, even from a small patch of the suit. This makes our method very robust to occlusions or partial views of the human body, e.g., due to zoomed-in cameras. b) By decomposing the suit into small quads and undistorting them using homography, we can counteract much of the projective distortion and suit stretching (see Fig. 4c), simplifying the learning task. c) The CNN quad classifier includes a quality control mechanism, rejecting white squares of dubious quality and further improving the robustness of our method. The corner detector's task is to detect and localize all checkerboard-like corners in the input image. This task is non-trivial because there are corner-like features in the background, the suit stretches along with the skin and there are significant lighting variations. Our corners have two key properties: a) The corners are sparsely and approximately uniformly distributed on the suit; b) The corners are defined locally, i.e., a small image patch is enough to identify and localize a corner.
We divide our input image into a grid of 8 × 8 cells (Fig. 5a) with the assumption that there could be at most one checkerboard-like corner in each 8 × 8 cell, and apply the CornerdetNet CNN (Fig. 5c) to detect and localize a checkerboard-like corner from each cell separately. The design of CornerdetNet is inspired by single shot detectors [Liu et al. 2016; Redmon and Farhadi 2017], which perform prediction and localization simultaneously. The input to CornerdetNet is the 8 × 8 cell where a corner is being sought, including a 6-pixel margin added to each side (Fig. 5b), making the input crop size 20 × 20. These margins allow us to reliably detect even the corners close to the boundaries of the 8 × 8 cell. The 6-pixel margins overlap with adjacent cells (Fig. 5b), but the 8 × 8 cells themselves do not overlap. The CornerdetNet outputs three floating-point numbers. The first one is a logit of a binary classifier predicting whether a corner is present or not, and the other two are normalized coordinates ([0, 1] × [0, 1]) of the corner relative to the 8 × 8 cell. The training loss for CornerdetNet is
$$\mathcal{L} = \mathcal{L}_{\mathrm{CE}}(s^*, s) + \lambda\, s\, \|\mathbf{x}^* - \mathbf{x}\|_2^2,$$
where $\lambda$ balances the prediction loss and the localization loss; we set $\lambda = 200$ when training CornerdetNet. $s^*$ represents the logit of the binary classifier, $\mathbf{x}^*$ represents the predicted corner location, $s$ and $\mathbf{x}$ represent the corresponding ground truth, and $\mathcal{L}_{\mathrm{CE}}(s^*, s)$ is the cross entropy. Corner Clustering and Refinement. When a corner lies exactly on the boundary of two 8 × 8 cells, it can be detected more than once (Fig. 6a). To fix such duplicate detections, we perform a clustering pass: if any two detected corners are too close (< 3 pixels), we discard the one with the lower logit value. Since this might introduce additional localization noise, we generate new crops randomly perturbed around the original corner positions, run localization on each of these crops and average the results in global pixel coordinates, see Fig. 6b. This helps especially when corners are crossing the boundaries of the 8 × 8 cells (Fig. 6a). At this point we have detected typically several hundred corners in each input image. The next step is to read the codes and link them to the corners, which will give us a unique label for each corner. As shown in Fig. 1, the deformations of the suit and its codes can be significant, including not only projective transformations, but also stretching and shearing, because the tight-fitting suit is highly elastic. First we have to detect the white squares with two-letter codes. We know that each such white square is surrounded by four corners. Therefore, we generate quadrilaterals (quads) by connecting four-tuples of corners. In theory we could connect any four-tuple of corners into a quad, but in practice we can immediately discard concave quads (which do not correspond to correct sequences of corners) or quads that would cover too few or too many pixels (which would make it impossible to contain a legible code). We call the resulting quads "candidate quads", because they may, but are not guaranteed to, contain a correct two-letter code. We transform the four corners of each candidate quad to a standardized square using a homography transformation to simplify subsequent processing. The standardized square is a 64 × 64 pixel image with a 20-pixel margin on each side, i.e., 104 × 104 pixels in total.
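The rectification step just described can be sketched with OpenCV as follows; this is a minimal sketch assuming the quad corners are ordered clockwise starting from the code's top-left corner, with the 64 × 64 interior and 20-pixel margins taken from the text.

```python
# Minimal sketch of rectifying a candidate quad into the 104x104 standardized square.
# Assumes quad corners are ordered clockwise, starting at the code's top-left corner.
import cv2
import numpy as np

INTERIOR, MARGIN = 64, 20          # 64x64 white square + 20px margin on each side
SIZE = INTERIOR + 2 * MARGIN       # 104x104 total

def rectify_quad(image, quad_corners):
    src = np.asarray(quad_corners, dtype=np.float32)        # 4x2 pixel coordinates
    dst = np.array([[MARGIN, MARGIN],
                    [MARGIN + INTERIOR, MARGIN],
                    [MARGIN + INTERIOR, MARGIN + INTERIOR],
                    [MARGIN, MARGIN + INTERIOR]], dtype=np.float32)
    H = cv2.getPerspectiveTransform(src, dst)                # homography from 4 point pairs
    return cv2.warpPerspective(image, H, (SIZE, SIZE))       # undistorted 104x104 crop

def four_orientations(patch):
    # the upright orientation of the code is unknown, so all four rotations are tested
    return [np.rot90(patch, k) for k in range(4)]
```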
The 20-pixel margin allows the RejectorNet to detect errors stemming from incorrect corner detections. Since we do not know the correct upright orientation of the two-letter code yet, we generate all four possible orientations, see Fig. 8c,d. Candidate quad generation. It would be wasteful to enumerate all 4-tuples of corners for further processing by neural networks. Therefore, we first apply simple criteria to filter out quads that cannot contain a valid code. We start by iterating over all the corners, and for each corner, we select three other corners within a bounding box. When connecting corners into a quad, we ensure that each quad is convex, clockwise oriented and unique. Additional filtering criteria include geometric criteria and image-based criteria: geometric criteria constrain the area, maximum/minimum edge lengths and maximum/minimum angles of the generated candidate quad; image-based criteria constrain the average intensity and standard deviation of all the pixels in the generated candidate quad. To obtain the range for each criterion, we gather statistics for each of those quantities in the training dataset (Section 4) and create conservative intervals to ensure that we cannot mistakenly reject any valid quad. The candidate quads that pass all of these early rejection filters are transformed using homography and passed on to the quad classifier neural networks for further processing. Fig. 8c shows an example of an invalid quad and Fig. 8d demonstrates a valid one. Quad classifiers. We trained two quad classifiers, RejectorNet and RecogNet. RejectorNet is a binary classifier predicting whether a candidate quad is valid, i.e., whether the four corners are at the correct locations and their order is correct relative to the upright code orientation, see Fig. 8b. Also, the white square surrounded by a valid quad needs to contain a clearly legible code. Invalid quads are discarded, and the valid ones are passed to RecogNet, which reads the codes, such as "U7" in Fig. 7. RecogNet is a multi-class classifier with two heads, one for each character of the two-letter code. The architectures of both networks are shown in Fig. 7. We use standard cross entropy losses to train those classifiers. The training of our CNNs is discussed in Section 6. Why separate RejectorNet and RecogNet? We considered combining the two networks into one, but we found that network training is easier if we treat each problem separately. Specifically, the RejectorNet should perform quality control of a 104 × 104 standardized image, including rejection of errors made by CornerdetNet (Fig. 12b). Because we prefer missing observations to errors, we train RejectorNet to be conservative and reject any inputs of dubious quality. The second network, RecogNet, has to recognize two characters in any image. We can make RecogNet more reliable by training it even on very difficult input images, enhancing the robustness of the entire pipeline. The details of our training process and data augmentation are discussed in Section 4.2. Corner labeling and 3D Reconstruction. At this point, the two-letter codes of the valid quads have been recognized, including their upright orientation. The next step is to uniquely label each corner. We define a labeling function $f(c, k)$ which maps a two-letter code $c$ and a corner index $k \in \{1, 2, 3, 4\}$ (see Fig. 8b) to an integer which represents a unique corner ID. The unique corner IDs are defined for each suit. Many corners have two two-letter codes adjacent to them.
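A minimal sketch of such a labeling function, assuming a per-suit table that lists, for every two-letter code, the unique IDs of its four surrounding corners; the table layout, names and the example IDs below are hypothetical.

```python
# Hypothetical sketch of the per-suit corner labeling function f(code, corner_index).
# SUIT_TABLE would be built once per suit from its known layout of codes and corners;
# the entries shown here are made-up examples.
SUIT_TABLE = {
    # code -> unique corner IDs of its 4 surrounding corners,
    # listed clockwise starting from the code's top-left corner
    "U7": (412, 413, 498, 497),
    "A3": (101, 102, 187, 186),
    # ... one entry per two-letter code on the suit
}

def corner_id(code: str, corner_index: int) -> int:
    """Map a recognized two-letter code and a corner index in {1,2,3,4} to a unique corner ID."""
    return SUIT_TABLE[code][corner_index - 1]

# Corners shared by two codes can receive two labels; agreement of the two labels
# serves as a redundancy check against recognition errors.
```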
If both of the two-letter codes are visible, we can leverage this fact as a redundancy check, detecting potential errors of RecogNet. Given unique corner IDs, we can convert corresponding 2D corners in two or more views into labeled 3D points. Let $\mathcal{C}_i$ be the set of cameras that see corner $i$, let $j \in \mathcal{C}_i$ be a camera that sees corner $i$, let $\mathbf{c}_{ij} \in \mathbb{R}^2$ be corner $i$'s location in the image coordinate system of camera $j$, and let $\pi_j : \mathbb{R}^3 \to \mathbb{R}^2$ be the projection function of camera $j$. We compute the 3D reconstructed corner $\mathbf{p}_i$ by minimizing the reprojection error:
$$\mathbf{p}_i = \arg\min_{\mathbf{p} \in \mathbb{R}^3} \sum_{j \in \mathcal{C}_i} \| \pi_j(\mathbf{p}) - \mathbf{c}_{ij} \|_2^2.$$
This is a non-linear least squares optimization problem; we compute an initial guess of $\mathbf{p}_i$ using the Linear-LS method [Hartley and Sturm 1997] and optimize it using a non-linear least squares solver [Agarwal and Mierle 2012]. Error Filtering. The label consistency check discussed above works only if two adjacent two-letter codes are present in the suit and visible in the images. If this is not the case, a corner can be assigned the wrong label if RejectorNet or RecogNet make a mistake. This kind of labeling error will typically result in nonsensical correspondences with large reprojection errors, which we detect and correct by a RANSAC-type method discussed below. Specifically, for a corner with label $i$, let $\mathcal{C}_i$ be the set of the cameras that claim to see this corner. We assume that outliers in $\mathcal{C}_i$, i.e., the cameras that mislabeled corner $i$, should only be a minority. We iterate over all pairs of cameras $(j, k)$ in $\mathcal{C}_i$ and 3D reconstruct the corner from each pair. Among all of the pairs, we pick the 3D reconstruction that has the lowest reprojection error averaged over all cameras in $\mathcal{C}_i$ and assume this is the correct 3D location $\mathbf{p}_i$. Next, we analyze the reprojection errors of $\mathbf{p}_i$ in all of the cameras in $\mathcal{C}_i$. The reprojection error should be low in cameras with correct labeling, but high if there was a labeling error. We use the 1.5 × (interquartile range) rule [Upton and Cook 1996] to detect the outliers in terms of reprojection errors. We re-compute the triangulation of $\mathbf{p}_i$ after removing the outliers from the cameras $\mathcal{C}_i$. This RANSAC-type outlier filter does not work when there are only two cameras that see one corner. Therefore, we additionally discard reconstructed corners with an average reprojection error larger than 1.5 pixels. These tests are designed to be conservative, because mistakenly discarded points are not a major problem, just missing observations which can be inpainted as discussed in Section 5. A key feature of our approach is that all of our networks are trained only on small image patches, e.g., see Fig. 10c and Fig. 11e. This allows our trained model to generalize to different suits, capture environments, camera configurations and body poses that are not in the training set, because our local fiducial markers exhibit significantly less variability than images of full human poses. This is quite different from deep-learning based methods that perform global pose prediction, looking for the body as a whole. The training of our networks does not require large training sets. We have prepared our training data ourselves, without the use of any external annotation services or existing data sets. Our dataset contains 24 manually annotated images, randomly selected from captures of our three actors. For each image, we apply two types of annotations: corner annotation (Fig. 10a) and quad annotation (Fig. 11a). In the corner annotation, we manually annotate all of our checkerboard-like corners on the suit with sub-pixel accuracy.
In the quad annotation, we manually connect the corners annotated in the previous step into quads. Specifically, we create quads that correspond to valid white squares with two-letter codes in the suit, and the annotators also write down the code of each annotated quad. We ensure the quad vertices are in a clockwise order and start from the top-left corner, defined by the upright orientation of the code (see Fig. 8b). These annotations are then automatically converted into training data for our networks as follows. Corner Detector. We generate the training data for CornerdetNet by sliding a 20 × 20 window with stride 1, as shown in Fig. 10b. Each of the 20 × 20 crops is an input to CornerdetNet, labeled positive if and only if an annotated corner lies inside its center 8 × 8 pixels. For positive samples we also compute the sub-pixel corner coordinates relative to the 8 × 8 cell. Quad Classifiers. We start by generating candidate quads from the annotated corners in the 24 manually annotated images using the algorithm discussed in Section 3.2. Note that the same quad generation algorithm will be used during deployment, i.e., when processing new motion sequences. The quad generator is conservative and creates many quads that do not correspond to valid white squares, see Fig. 11c. However, we know which quads are valid, because all of the valid ones were manually annotated, see Fig. 11a. This allows us to automatically generate both positive and negative examples for a given candidate-quad generator, see Fig. 11d. These 104 × 104 images are used to train RejectorNet. It is important that the quad generator used during deployment is identical to the quad generator used when generating the training data for RejectorNet. The two-letter code annotations of the valid quads are then used to train RecogNet. Data Augmentation. All of the crops generated from annotated images as described in the previous sections are augmented by applying intensity perturbations (contrast, brightness, gamma). In addition, we also apply geometric deformations to each input image. For the corner detector, we also augment the training data by generating random rotations of each image, because checkerboard-like corners are rotation invariant. Different data augmentation approaches need to be applied to RejectorNet and RecogNet. For the RejectorNet, we blur the image using a Gaussian filter and add elastic deformations using thin-plate splines [Wood 2003] to simulate skin deformation. We constrain the elastic deformations to fix the checkerboard-like corners in place, see Fig. 12a, otherwise positive examples could be turned into negative ones. We also use this fact to our advantage: if we displace a checkerboard-like corner of a valid white square, we obtain a new (augmented) negative example, simulating the case when the quad's corners have not been correctly localized, see Fig. 12b. Since RecogNet is required to predict characters from any input image, we can afford to augment our data more aggressively. Specifically, we use much more significant geometric distortions, intensity variations, blurring and additional noise, see Fig. 12c. This aggressive data augmentation has an interesting effect: the performance on the training data becomes worse, since we made the recognition task more difficult. However, we obtain better performance on the test set, which is what matters. This agrees with human intuition: if students are given harder homework (training), they will likely perform better in their first job.
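A minimal sketch of the intensity-perturbation part of this augmentation (contrast, brightness, gamma, plus optional Gaussian blur); the parameter ranges below are illustrative assumptions, not the exact values used in the paper.

```python
# Minimal sketch of photometric augmentation for training crops.
# Parameter ranges are illustrative assumptions, not the paper's exact settings.
import numpy as np
import cv2

def augment_intensity(crop, rng):
    img = crop.astype(np.float32) / 255.0
    # contrast and brightness perturbation
    contrast = rng.uniform(0.7, 1.3)
    brightness = rng.uniform(-0.1, 0.1)
    img = np.clip(contrast * (img - 0.5) + 0.5 + brightness, 0.0, 1.0)
    # gamma perturbation
    gamma = rng.uniform(0.7, 1.4)
    img = img ** gamma
    # optional Gaussian blur (used for RejectorNet-style augmentation)
    if rng.random() < 0.5:
        sigma = rng.uniform(0.5, 1.5)
        img = cv2.GaussianBlur(img, ksize=(0, 0), sigmaX=sigma)
    return (img * 255.0).astype(np.uint8)

# usage: rng = np.random.default_rng(0); augmented = augment_intensity(crop, rng)
```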
Synthetic Data. To further enhance the diversity of our training data, we also generated synthetic data sets by rendering an animated SMPL model. We use synthetic data only for training the RecogNet, because this was the bottleneck in the overall pipeline, see Section 6 for more details. We textured the body mesh with the same checkerboard-like pattern as used in the real suit and applied animations from a public motion capture database [Mahmood et al. 2019]. We randomly generated new two-letter codes, including variations in font types and sizes to emulate the handwriting of the codes. For each animation frame, we rendered images with virtual cameras, simulating our real capture setup by copying the intrinsic and extrinsic parameters from our real cameras. The visibility of corners in the rendered images is determined using ray tracing. To control the quality of quads that will be added to the training set, we check for corner visibility and use a classifier considering the quad's 3D normal direction and quad geometry in the rendered image. Our method for 3D reconstruction of labeled points will inevitably result in missing observations because the human body often occludes itself and is observed only by a limited number of cameras, see Fig. 9. In this section, we propose a method to interpolate (inpaint) the missing corners. Even though we could use any existing multi-person human body model [Osman et al. 2020] for this purpose, we can achieve higher quality, because our pipeline gives us highly accurate measurements of the actor's body and its deformations. Therefore, instead of relying on previous statistical body shape models, we capture example motions of a given actor using our method and use this data to create a more precise refined body model, i.e., a model with parameters refined for a specific person. Our body model has two types of parameters: shape parameters that are invariant in time, and pose parameters that change from frame to frame as the body moves. The shape parameters are only optimized during the model refinement process. After the body model refinement process is done, we fix the shape parameters and only allow the pose parameters to change. However, even after the refinement, the low-dimensional body model will not fit the 3D reconstructed corners exactly (Fig. 14c). We call the remaining residuals "non-articulated displacements", because they correspond to motion that is not well explained by the articulated body model. The non-articulated displacements arise due to breathing, muscle activations, flesh deformation, etc. Therefore, in addition to our refined body model we also interpolate the non-articulated displacements mapped to the rest pose via inverse skinning. The combination of the refined body model with the non-articulated displacement interpolation enables us to achieve high-quality inpainting. Our body model deforms each rest pose vertex using linear blend skinning (LBS),
$$\mathbf{v}_i = S(\tilde{\mathbf{v}}_i, \mathbf{J}, \mathbf{W}, \theta) = \Big( \sum_{j=1}^{m} w_{ij}\, \mathbf{T}_j(\theta, \mathbf{J}) \Big)\, \tilde{\mathbf{v}}_i,$$
where $\mathbf{W} = (w_{ij})$ are the skinning weights, $\mathbf{J}$ the joint locations and $\theta$ the pose parameters. Note that here $\mathbf{v}_i, \tilde{\mathbf{v}}_i \in \mathbb{R}^{4 \times 1}$ are the deformed and rest-pose vertex in homogeneous coordinates and $\mathbf{T}_j(\theta, \mathbf{J}) \in \mathbb{R}^{4 \times 4}$ represents the transformation matrix of joint $j$. In the following we will use homogeneous coordinates interchangeably with their 3D Cartesian counterparts. Initialization. We initialize our body model by registering our corners to the STAR model [Osman et al. 2020]. We start by selecting a frame $t_{\mathrm{init}}$ in a rest-like pose where most corners are visible, and fit the STAR model to our labeled 3D points in $t_{\mathrm{init}}$ using a non-rigid ICP scheme which finds correspondences between our suit corners and the STAR model's mesh.
The non-rigid ICP process is initialized by 10 to 20 hand-picked correspondences between the STAR model and the 3D reconstructed corners. During the ICP procedure, we optimize both pose and shape parameters of the STAR model and iteratively update correspondences by projecting each of our 3D reconstructed points to the closest triangle of the STAR model (the actual closest point is represented using barycentric coordinates). At this stage, we have registered most of our corners to the STAR model, but we still need to add corners that were unobserved in frame $t_{\mathrm{init}}$. We can fit the STAR model to subsequent frames of our training motion using non-rigid ICP initialized by the registered corners instead of hand-picked correspondences. These subsequent frames will reveal corners unobserved in the initial frame, which we register against the STAR mesh by closest-point projection as before. We use the corners registered to the STAR model's rest pose as the initial rest pose shape $\tilde{\mathbf{V}}_0$, and use barycentric interpolation to generate the initial skinning weights $\mathbf{W}_0$. Note that the number of vertices and mesh connectivity of our body model is different from the STAR model's mesh. We use each corner on the suit as a vertex of our model, and the rest pose vertex $\tilde{\mathbf{v}}_i$ corresponds to corner $i$ in our suit. The meshing of our body model is discussed below. We use the STAR model's joints as the initial joint locations $\mathbf{J}_0$. We removed the joints that control the head, neck, toes and palms from the STAR model, resulting in $m = 16$ joints. We call this model our initial body model. Model refinement. After the initialization, we further optimize the shape parameters to obtain our refined body model that more accurately fits a specific actor. Specifically, we optimize the skinning weights $\mathbf{W}$, the joint locations $\mathbf{J}$ and the rest pose vertex positions $\tilde{\mathbf{V}}$. Unlike SMPL or STAR, we do not use pose-corrective blend shapes and instead correct the shape by interpolating non-articulated displacements, discussed in Section 5.2. If $P_t$, $t = 1, 2, \ldots, T$, is the set of 3D points that were reconstructed from frame $t$ and $T$ is the number of frames in the training set, we refine the body model by minimizing an objective $\mathcal{L}$ (Eq. 5) that penalizes the distances between the skinned model vertices and the reconstructed points in each $P_t$, together with regularization terms involving $d_{ij}$, the geodesic distance from corner $i$ to the closest vertex that has a non-zero initial weight for joint $j$ in the STAR model. The two regularization weights were empirically set to 1000 and 1 when our spatial units are millimeters. We optimize $\mathcal{L}$ with an alternating optimization scheme. Starting with the initial LBS model, we first calculate pose parameters for each frame. Then we optimize $\mathbf{W}$, $\mathbf{J}$, and $\tilde{\mathbf{V}}$ one by one, while keeping the other parameters fixed. We iterate this procedure until the error decrease becomes negligible; in our results we needed between 50 and 100 iterations. After the optimization is finished, we mesh the rest pose vertices $\tilde{\mathbf{V}}$. From the unique ID of each corner, we know how they were connected into quads in the suit. We manually add vertices to close the holes which come from areas of the suit such as the zipper and the seams (see Fig. 13a). The result is a quad-dominant mesh (Fig. 13b). After the optimization, the fitting error (Eq. 5) drops from 13.5 mm to 7.1 mm on the test set; further results are reported in Section 6.3. The refined LBS body model is good for representing the articulated skeletal motion of the actor's body, but it does not represent well effects such as breathing or flesh deformation. However, the non-articulated component of the motion that cannot be represented by LBS is relatively small.
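To make the skinning machinery concrete before turning to these residual displacements, here is a minimal NumPy sketch of LBS forward skinning and its per-vertex inverse ("unposing"), assuming the joint transformations have already been computed from the pose; this illustrates the standard LBS construction rather than our exact implementation.

```python
# Minimal sketch of LBS forward skinning and per-vertex inverse skinning ("unposing").
# Assumes joint transformations T (m x 4 x 4) are already computed from the pose;
# W is n x m skinning weights, V_rest is n x 3 rest-pose vertices.
import numpy as np

def skin(V_rest, W, T):
    n = V_rest.shape[0]
    V_h = np.concatenate([V_rest, np.ones((n, 1))], axis=1)   # homogeneous coords, n x 4
    M = np.einsum("nj,jab->nab", W, T)                        # blended 4x4 transform per vertex
    V_posed = np.einsum("nab,nb->na", M, V_h)                 # apply blended transforms
    return V_posed[:, :3]

def unskin(P, W, T):
    # map observed 3D points back to the rest pose by inverting the blended transforms
    n = P.shape[0]
    P_h = np.concatenate([P, np.ones((n, 1))], axis=1)
    M = np.einsum("nj,jab->nab", W, T)
    M_inv = np.linalg.inv(M)                                  # per-vertex 4x4 inverse
    V_rest_h = np.einsum("nab,nb->na", M_inv, P_h)
    return V_rest_h[:, :3]
```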
Therefore, we start by applying inverse skinning transformations (also known as "unposing") to our observed 3D reconstructed points $\mathbf{p}_i$, see Fig. 14d. We denote the inverse skinning of point $i$ at pose $\theta$ as $S^{-1}(\mathbf{p}_i, \mathbf{J}, \mathbf{W}, \theta)$. As can be seen in Fig. 14d, $S^{-1}(\mathbf{p}_i, \mathbf{J}, \mathbf{W}, \theta)$ will not exactly match $\tilde{\mathbf{v}}_i$ due to the non-articulated residuals. Formally, the non-articulated displacements $\Delta\tilde{\mathbf{v}}_i$ are defined as
$$\Delta\tilde{\mathbf{v}}_i = S^{-1}(\mathbf{p}_i, \mathbf{J}, \mathbf{W}, \theta) - \tilde{\mathbf{v}}_i,$$
evaluated for each frame. The key problem of our inpainting consists in interpolating the values of $\Delta\tilde{\mathbf{v}}$ from the observed points to the unobserved ones, in other words, predicting the unobserved non-articulated displacements, see Fig. 14e. The modified rest pose is then mapped back by (forward) skinning to produce the final mesh, see Fig. 14f. Our method for predicting the unobserved non-articulated displacements in the rest pose is based on the assumption of spatio-temporal smoothness. We stack all of the rest pose displacements into a $TN \times 3$ matrix $\mathbf{X}$, where $T$ is the number of frames and $N$ the number of vertices (all vertices, both unobserved and observed ones). We find $\mathbf{X}$ by solving the following constrained optimization problem:
$$\min_{\mathbf{X}} \; \mathcal{L}_{\mathrm{spat}}(\mathbf{X}) + \gamma\, \mathcal{L}_{\mathrm{temp}}(\mathbf{X}) \quad \text{subject to} \quad \mathbf{C}\mathbf{X} = \mathbf{D}, \qquad (8)$$
where $\mathcal{L}_{\mathrm{spat}}$ is a spatial Laplacian term that penalizes non-smooth deformations of the mesh and $\mathcal{L}_{\mathrm{temp}}$ is a temporal Laplacian term that penalizes non-smooth trajectories of the vertices. Both of the terms are positive semi-definite quadratic forms. The parameter $\gamma$ is a weight balancing these two terms, which we empirically set to 100. The sparse selector matrix $\mathbf{C}$ represents the observed points (constraints) and $\mathbf{D}$ their unposed 3D positions for each frame (each frame may have a different set of observed points). Specifically, we define $\mathcal{L}_{\mathrm{spat}}$ as
$$\mathcal{L}_{\mathrm{spat}}(\mathbf{X}) = \sum_{k=1}^{3} \mathbf{X}_k^{\top} \mathbf{L}\, \mathbf{X}_k,$$
where $\mathbf{L}$ is the cotangent-weighted Laplacian of the rest pose (applied to each frame independently) and $\mathbf{X}_k$ is the $k$-th column of $\mathbf{X}$. We found this quadratic deformation energy to be sufficient because our non-articulated displacements in the rest pose are small, though in future work it would be possible to explore non-linear thin shell deformation energies. For $\mathcal{L}_{\mathrm{temp}}$, we use a 1D temporal Laplacian which corresponds to acceleration: $\|\Delta\tilde{\mathbf{v}}_i^{\,t-1} - 2\Delta\tilde{\mathbf{v}}_i^{\,t} + \Delta\tilde{\mathbf{v}}_i^{\,t+1}\|_2^2$, where the superscript denotes the frame. The $\mathcal{L}_{\mathrm{spat}}$ operator is applied to all frames independently, and $\mathcal{L}_{\mathrm{temp}}$ is applied to all vertices independently. However, their weighted combination in Eq. 8 introduces spatio-temporal coupling, allowing one observed point to affect unobserved points through both space and time. The optimization problem Eq. 8 is a convex quadratic problem subject to equality constraints, which we transform to a linear system (KKT matrix) and solve. The only complication is that when processing too many frames, the KKT system can become too large. For example, with $T = 5000$ frames, the KKT matrix becomes approximately $10^7 \times 10^7$. Even though the KKT matrix is sparse, the linear solve becomes costly. To avoid this problem, we observe that smoothing over too many frames is not necessary and introduce a windowing scheme, decomposing longer sequences into 150-frame windows and solving them independently. To avoid any non-smoothness when transitioning from one window to another, the 150-frame windows overlap by 50 frames. After solving the problem in Eq. 8 for each window separately, we smoothly blend the overlapping 50 frames to ensure smoothness when transitioning from one window to the next. In this section we considered only off-line hole filling, where we can infer information from future frames. This approach would not be applicable to real-time settings where future frames are not available. We first discuss the details of our CNN training process and the results.
To prepare the training data, we manually annotated 24 randomly selected images (4000 × 2160) of our actors. Out of the 24, we withheld 4 as a test set. Table 1 shows the total numbers of images used for training our CNNs. As shown in the first row of Table 1, the original training set (without data augmentation) for RecogNet is much smaller compared to CornerdetNet and RejectorNet, because of the limited number of valid quads in each of our annotated images. To improve the classification performance of RecogNet, we used synthetically generated images to complement the real data, as discussed in Section 4.2. The synthetic data contain 214471 crops (104 × 104), which significantly improved the robustness of RecogNet, see Section 6.1. We train our CNNs using TensorFlow [Abadi et al. 2015] on a single NVIDIA Titan RTX; for each of our CNNs, an overnight run is typically enough to converge to good results using the Adam optimizer. After our CNNs have been trained, we run inference on a PC with an i7-9700K CPU and an NVIDIA GTX 1080 GPU. With a 4000 × 2160 input image, an inference pass of CornerdetNet takes 300 ms, generating candidate quads takes 10 ms, RejectorNet takes 1-2 s to classify all of the candidate quads (104 × 104), and RecogNet takes 5 ms to recognize the valid quads. The computational bottleneck is the RejectorNet due to the large number of candidate quads; this could be improved in the future by a more aggressive culling of candidate quads. For each frame, the time for 3D reconstruction is negligible, taking less than 1 ms for all points. Even though we used only one computer and processed our image sequences off-line, we would like to point out that our method for extracting 3D labeled points from multi-view images is embarrassingly parallel, because each frame and even each input crop for our CNNs can be processed independently. Coupling through time is introduced only in the final hole-filling step (Section 5.2). The time for solving the sparse linear system (Eq. 8) for a 150-frame window is about 10 s. We captured motion sequences of three actors, one male and two females. One of the female actors wears the small suit and the other two actors wear the medium suit. For each actor, we captured about 12,000 frames (at 30 FPS) of raw image data consisting of 1) camera calibration, 2) 6000 frames of a calisthenics-type sequence intended for body model refinement (also serving as a warm-up for the actor), 3) the main performance. Each frame consists of 16 images from our multi-camera setup. It took about 300 hours to process all of the 576,000 images (4.6 TB) using one computer. CornerdetNet. There are two parts of CornerdetNet's output: 1) a classification response that predicts whether there is a valid corner in the center 8 × 8 window of the input 20 × 20 image and 2) its sub-pixel coordinates (or arbitrary values if a corner is not present). In Table 2 we summarize the results for both classification and localization errors. The localization error is measured by the distance in pixels between the predicted corner location and the manually annotated corner location. The overall classification accuracy for CornerdetNet is 99.393% on the training set and 99.510% on the test set. The fact that CornerdetNet works better on the test set supports our hypothesis that more aggressive data augmentation results in worse performance on the training set but better performance on the test set.
On the test set, the average localization error is 0.21 pixels and 99% of the corner localizations achieve an error of 0.6361 pixels or less, which is remarkably low. With our camera setup, 1 pixel of error corresponds to approximately 1 mm of 3D error for an actor 2 meters away from the camera. In practice, this means that our 3D reconstructed points are highly accurate, allowing us to capture minute motions such as muscle twitches or flesh jiggling. RejectorNet. The confusion matrices of the trained RejectorNet are reported in Table 3. The overall classification accuracy for RejectorNet is 99.723% on the training set and 99.704% on the test set. From the confusion matrix, we can observe that we have more false positives than false negatives. The reason is that we intentionally annotated the training data conservatively. As shown in Fig. 15a, quads with even slight imperfections were labeled as negative examples. This results in RejectorNet reporting more false positives, but it actually inherits the conservative nature of the annotations; in practice, RejectorNet only rarely accepts a low-quality quad image. RecogNet. We compare the RecogNet trained with and without the synthetic training set in Table 4. Without the synthetic training set, RecogNet had a prediction accuracy of 99.522% on the test set. This accuracy was too low and was the main source of errors in our pipeline. Enhanced with the synthetic training set, the prediction accuracy on the test set increased to 99.919%, which significantly improved our results. Overall performance. In the previous sections we reported the results of each individual CNN. To evaluate our complete corner localization and labeling pipeline (Fig. 4), we use our test set of 4 manually annotated images (4000 × 2160) where we know the ground truth positions and labels of all corners. The 4 images in the test set collectively contain 1702 manually labeled corners. Our 2D pipeline detected 92% (1566 of 1702) of the ground truth corners. The discarded corners corresponded to low-quality quads which were rejected by RejectorNet. Note that we intentionally trained the RejectorNet to be conservative, i.e., to reject all borderline cases (but we do not argue that the same principle should be applied to SIGGRAPH technical paper submissions!). Missing observations do not represent a big problem because they can be fixed by inpainting (Section 5.2); also, we observed that low-quality quads are often associated with inaccurate corner localization, increasing the noise in the 3D reconstruction. The mean corner localization error in our test set is 0.4607 pixels and the maximum localization error is 1.854 pixels. Due to our conservative rejection approach, the final CNN, RecogNet, made zero mistakes on the test set, i.e., all of the 1566 corners were assigned the correct label. Metrics. Evaluating 3D reconstruction accuracy is hard, because we do not have any ground truth measurements of a moving human body. To evaluate the accuracy of our 3D reconstructed corners, we compute their reprojection errors and we compare them to the reprojection errors obtained in our camera calibration process (Section 3.2).
Using $\pi_j$ to denote the projection function of camera $j$, if a reconstructed 3D point $\mathbf{p}_i$ is seen by camera $j$ and $\mathbf{c}_{ij}$ is the pixel location of the corresponding 2D corner in camera $j$, the reprojection error for corner $i$ in camera $j$ is defined as
$$e_{ij} = \| \pi_j(\mathbf{p}_i) - \mathbf{c}_{ij} \|_2.$$
The reprojection error for camera calibration is defined analogously, except that we use 3D calibration boards with perfect, rigid checkerboard corners and a standard OpenCV corner detector. In contrast, our corners are painted on an elastic suit worn by an actor. Quantitative evaluation. We report the histograms of reprojection errors of 3D reconstruction and camera calibration in Fig. 16. The 3D reconstruction reprojection error is computed per camera for all the reconstructed points in a consecutive sequence of 10000 frames. The calibration reprojection error was computed on 448 frames that we use to calibrate the cameras, where we wave a 9 × 12 calibration board in front of our cameras. In Fig. 16, we can see that the two error distributions look very similar, which means that the reprojection errors of our 3D reconstruction results have similar statistics to the reprojection errors in camera calibration. We cannot expect to obtain lower reprojection errors than camera calibration. Table 5 shows the percentiles of all the reprojection errors in the 10000 frames that we use to evaluate the 3D reconstruction. 99% of the reprojection errors are less than 1.009 pixels, which is remarkably accurate given the high resolution of our images (4000 × 2160). Qualitative evaluation. Fig. 17 shows challenging cases where there are significant self-occlusions. We mesh the reconstructed point cloud using the rest pose mesh structure introduced in Section 5.1 by preserving the observed faces in the rest pose mesh (see Fig. 13b). Then we project the reconstructed mesh back to the image using the camera parameters, which gives us the green wireframe in Fig. 17. We can see that the mesh wireframe aligns very closely with the checkerboard pattern on the suit. Another important observation is that even despite large occlusions, our method can still obtain correctly labeled corners as long as the entire two-letter code is visible; see, e.g., the foot and calf in Fig. 17a,b. In Fig. 17c, we can see that the conservative RejectorNet correctly rejects the wrinkled quads in the belly region, since reading the codes would be difficult or impossible. To refine our actor model, we record a 6000-frame training sequence. After the body model refinement we select another 3000 frames, corresponding to motions different from the ones in the training set, as a test set. The fitting error is defined as the distance between the vertices of the deformed body model and the actual 3D reconstructed corners. We compare the fitting errors between the initial model, which is just a remeshing of the STAR model (see Section 5.1), and the refined body model, which was optimized on the training set (see Section 5.2). Fig. 18 shows the distribution of fitting errors per vertex of the initial body model and the refined body model on the training set and the test set. We can see that in both data sets the refined body model is much more accurate. Specifically, the body model refinement reduces the average fitting error from 13.6 mm to 5.2 mm on the training set and from 13.5 mm to 7.1 mm on the test set. Fig. 19 visualizes the fitting errors on the body model before and after body model refinement in one example frame using a heat map.
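A minimal sketch of this per-frame fitting-error computation, reusing the hypothetical skin() function sketched in Section 5; the correspondence bookkeeping is simplified and all names are placeholders rather than our actual implementation.

```python
# Minimal sketch of the per-frame fitting error: average distance between the
# deformed body model's vertices and the observed 3D reconstructed corners.
# Reuses the hypothetical skin() sketch from Section 5; names are placeholders.
import numpy as np

def fitting_error_mm(V_rest, W, T_joints, observed):
    """observed: dict mapping vertex/corner index -> observed 3D point (in mm)."""
    V_posed = skin(V_rest, W, T_joints)                 # deform the model to the current pose
    ids = np.array(sorted(observed.keys()))
    targets = np.array([observed[i] for i in ids])
    dists = np.linalg.norm(V_posed[ids] - targets, axis=1)
    return float(dists.mean())                          # e.g. ~13.5 mm before vs ~7.1 mm after refinement
```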
To quantify the accuracy of the 3D reconstruction of the entire body, we compare renderings of a textured mesh with the original images using optical flow [Bogo et al. 2014 [Bogo et al. , 2017 . First, we need to create a suit-like texture for our body mesh (Fig. 13b) . We create a standard UV parametrization for our mesh and generate the texture from 10 hand picked frames using a differentiable renderer [Ravi et al. 2020] ; though this is just one possible way to generate the texture [Bogo et al. 2017] . We render the textured body mesh with back face culling enabled and overlay it over clean plates (i.e., images of the capture volume without any actor). The virtual camera parameters are set to our calibration of the real cameras. The optical flow is computed from the synthetic images to the undistorted real images using FlowNet2 [Ilg et al. 2017] with the default settings. Because our mesh does not have the hands and the head, we first render a foreground mask of our body mesh (Fig. 20c) . We only evaluate the optical flow on the region covered by the foreground mask to exclude the hands, the head and the background. The foreground mask cannot exclude the hands and the head when they are occluding the body (as in Fig. 20a, b) but, fortunately, the optical flow is robust to missing parts (see Fig. 20d ). We use optical flow to compare the original images with two types of renders: 1) our low-dimensional refined body model (the gray mesh in Fig. 14c , which does not fit the reconstructed corners exactly), 2) our final result after adding non-articulated displacements (Fig. 14f) . Fig. 21a plots the average optical flow norm for each frame for consecutive 2000 frames, including various challenging poses and fast motions. We can see that the result with non-articulated displacements is much more accurate than only our low-dimensional refined body model. This is mainly due to flesh deformation which is not well explained by the refined body model, especially in more extreme poses which correspond to the spikes in the blue curve in Fig. 21a . The red curve corresponds to our final result which exhibits consistently low optical flow errors. We also plot the distribution of the optical flow norm for each pixel in the foreground mask in Fig. 21b . With our final animated mesh, 95% of pixels have optical flow norm less than 1.20 and 99% of pixels have optical flow norm less than 2.46. An obvious limitation of our method is the necessity of wearing a special motion capture suit. A suit can in principle slide over the skin, but we did not observe any significant sliding in our experiments because our suits are tightly fitting. If this became a problem in the future, we could increase adhesion with internal silicone patches as in sportswear, or even apply spirit gum or medical adhesives. The suit needs to be made in various sizes and fit may be a challenge for obese people. The holy grail of full-body capture is to get rid of suits and instead rely only on skin features such as the pores, similarly to facial performance capture. We tried imaging the bare skin, but with our current camera resolution (4000 × 2160) we were unable to get sufficient detail from the skin. We could obtain more detail with narrower fields of view and more cameras to cover the capture volume, but then there are issues with the depth of field and hardware budgets. Additional complications of imaging bare skin are body hair and privacy concerns; our suit certainly has its disadvantages, but mitigates these issues. 
A significant advantage of our suit compared to traditional motion capture suits is that we do not need to attach any markers (reflective spheres, see Fig. 2a). Traditional motion capture markers can impede motion or even fall off, e.g., when the actor is rolling on the ground. An intriguing direction for future work would be to enhance our suit with additional sensors, in particular EMG, IMU or pressure sensors in the feet. In this paper we focused on the body and ignored the motion of the face and the hands. Our actors wear sunglasses because our continuous passive lights are too bright; the perceived brightness could be reduced by lights which strobe in sync with the camera shutters, but this would require significant investments in hardware. In future work, our method could be directly combined with modern methods that capture the motion of the face and the hands [Choutas et al. 2020; Joo et al. 2018; Pavlakos et al. 2019; Xiang et al. 2019]. We note that our current system captures the motion of the feet, but not of the individual toes.

Our current data processing is off-line only. In the future, we believe it should be possible to create a real-time version of our system. This would require machine vision cameras tightly integrated with dedicated GPUs or tensor processors for real-time neural network inference. Each such hardware unit could emit only small amounts of data, namely the corner locations and their labels, avoiding the high bandwidth requirements typically associated with high-resolution video streams. Another avenue for future work is research into different types of fiducial markers that can be printed on the suit. In fact, we made initial experiments with printing on textile and sewing our own suits, which gives us much more flexibility than the handwritten two-letter codes discussed in this paper. We postponed this line of research due to the COVID-19 pandemic. Our pipeline for reconstructing labeled 3D points does not make any assumptions about the human body, which means that we could also apply our method to capturing the motion of clothing or even loose textiles such as a curtain.

We have presented a method for capturing more than 1000 uniquely labeled points on the surface of a moving human body. This technology was enabled by our new type of motion capture suit with checkerboard-like corners and two-letter codes that enable unique labeling of each corner. Our results were obtained with a multi-camera system built from off-the-shelf components at a fraction of the cost of a full-body 3DMD setup, while demonstrating a wider variety of motions than the DFAUST dataset [Bogo et al. 2017], including gymnastics, yoga poses and rolling on the ground. Our method for reconstructing labeled 3D points does not rely on temporal coherence, which makes it very robust to dis-occlusions and also invites parallel processing. We provide our code and data as supplementary materials, and we will release an optimized version of our code as open source.
Ceres solver: Tutorial & reference
An efficient volumetric framework for shape tracking
The space of human body shapes: reconstruction and parameterization from range scans
SCAPE: Shape completion and animation of people
Self-similarity analysis for motion capture cleaning
Tensor body: Real-time reconstruction of the human body and avatar synthesis from RGB-D
ChESS - Quick and robust detection of chessboard features
Detailed full-body reconstructions of moving people from monocular RGB-D sequences
FAUST: Dataset and evaluation for 3D mesh registration
Dynamic FAUST: Registering human bodies in motion
Eigen appearance maps of dynamic shapes
The OpenCV Library
Twist-based acquisition and tracking of animal and human kinematics
Combined region and motion-based 3D tracking of rigid and articulated objects
OpenPose: Realtime multi-person 2D pose estimation using Part Affinity Fields
Realtime multi-person 2D pose estimation using part affinity fields
4D parametric motion graphs for interactive animation
CCDN: Checkerboard corner detection network for robust camera calibration
Monocular expressive body regression through body-driven attention
High-quality streamable free-viewpoint video
Markerless motion capture through visual hull, articulated ICP and subject-specific model generation
Performance capture from sparse multi-view video
SuperPoint: Self-supervised interest point detection and description
MATE: Machine learning for adaptive calibration template detection
3D scanning deformable objects with a single RGBD sensor
ARTag, a fiducial marker system using digital techniques
Automatic generation and detection of highly reliable fiducial markers under occlusion
Tracking of humans in action: A 3-D model-based approach
Virtual try-on using Kinect and HD camera
DensePose: Dense human pose estimation in the wild
The Relightables: Volumetric performance capture of humans with realistic relighting
Online optical marker-based hand tracking with deep labels
Haroon Idrees, Donglai Xiang, Hanbyul Joo, Tomas Simon, and Yaser Sheikh. 2019. Single-Network Whole-Body Pose Estimation
Coregistration: Simultaneous alignment and modeling of articulated 3D shape
Robust solving of optical motion capture data by denoising
Deep ChArUco: Dark ChArUco marker pose estimation
Global temporal registration of multiple non-rigid surface sequences
FlowNet 2.0: Evolution of optical flow estimation with deep networks
Deep features for text spotting
Total capture: A 3D deformation model for tracking faces, hands, and bodies
Markerless tracking of complex human motions from multiple views
Robust single-view geometry and motion reconstruction
SSD: Single shot multibox detector
Markerless motion capture of multiple characters using multiview image segmentation
Deep appearance models for face rendering
Scene text detection and recognition: The deep learning era
SMPL: A skinned multi-person linear model
Object recognition from local scale-invariant features
Arbitrary-oriented scene text detection via rotation proposals
Learning to Dress 3D People in Generative Clothing
Joint-dependent local deformations for hand animation and object grasping
AMASS: Archive of Motion Capture as Surface Shapes
VNect: Real-time 3D human pose estimation with a single RGB camera
Deep Relightable Textures - Volumetric Performance Capture with Neural Rendering
Understanding motion capture for computer animation and video games
DynamicFusion: Reconstruction and tracking of non-rigid scenes in real-time
Stacked hourglass networks for human pose estimation
AprilTag: A robust and flexible visual fiducial system
STAR: A Sparse Trained Articulated Human Body Regressor
Capturing and animating skin deformation in human motion
Data-driven modeling of skin and muscle deformation
Expressive body capture: 3D hands, face, and body from a single image
DeepCut: Joint subset partition and labeling for multi person pose estimation
Dyna: A model of dynamic human shape in motion
Motion graphs for unstructured textured meshes
Efficient Online Multi-Person 2D Pose Tracking with Recurrent Spatio-Temporal Affinity Fields
Accelerating 3D Deep Learning with PyTorch3D
YOLO9000: Better, faster, stronger
Civilian American and European Surface Anthropometry Resource (CAESAR), final report
Machine learning for high-speed corner detection
How fast is your body motion? Determining a sufficient frame rate for an optical motion tracking system using passive markers
Fast articulated motion tracking using a sums of Gaussians body model
Bundle adjustment - a modern synthesis
Dynamic surface matching by geodesic mapping for 3D animation transfer
Understanding statistics
Articulated mesh animation from multi-view silhouettes
AprilTag 2: Efficient and robust fiducial detection
Real-time hand-tracking with a color glove
Convolutional pose machines
Capturing and animating occluded cloth
Thin plate regression splines
Monocular total capture: Posing face, body, and hands in the wild
DenseRaC: Joint 3D pose and shape estimation by dense render-and-compare
A flexible new technique for camera calibration
Human motion tracking for rehabilitation - A survey