key: cord-0556266-7fre3czu
authors: Gordaliza, Pedro M.; Vaquero, Juan José; Muñoz-Barrutia, Arrate
title: Translational Lung Imaging Analysis Through Disentangled Representations
date: 2022-03-03
journal: nan
DOI: nan
sha: c4acafb1318f91b1061cb99a8fc62abce8c0c36e
doc_id: 556266
cord_uid: 7fre3czu

The development of new treatments often requires clinical trials with translational animal models using (pre-)clinical imaging to characterize inter-species pathological processes. Deep Learning (DL) models are commonly used to automate the retrieval of relevant information from the images. Nevertheless, they typically suffer from low generalisability and explainability as a product of their entangled design, resulting in one specific DL model per animal model. Consequently, it is not possible to take advantage of the high capacity of DL to discover statistical relationships from inter-species images. To alleviate this problem, in this work, we present a model capable of extracting disentangled information from images of different animal models and from the mechanisms that generate the images. Our method lies at the intersection between deep generative models, disentanglement and causal representation learning. It is optimized on images of pathological lungs infected by Tuberculosis and is able to: a) from an input slice, infer its position in a volume, the animal model to which it belongs and the damage present, and, furthermore, generate a mask covering the whole lung (with overlap measures similar to the nnU-Net); b) generate realistic lung images by setting the above variables; and c) generate counterfactual images, namely, healthy versions of a damaged input slice.

The longitudinal characterization of animal models is crucial during (pre-)clinical drug trials. To characterize disease progression meaningfully, we need the capacity to extract comparable biomarkers at similar phases of the disease progression. Besides, we need to prove the existence of similar pathophysiological mechanisms modulating common causal factors that give rise to the variability of trial outcomes. In this context, medical imaging techniques enable the extraction of indicators (imaging biomarkers) from in vivo studies (Willmann et al., 2008). For example, the number of Mycobacterium tuberculosis (Mtb.) colonies present in a subject can be inferred from the damaged lung volume in an image of a human, primate, or mouse (Yang et al., 2021). The images contain meaningful information with which to interpret the mentioned physiological process. However, their manual analysis is tedious, and automation is advantageous for processing the vast amount of data produced during the trials. Thus, it is essential to develop Artificial Intelligence (AI) systems that can not only automate the extraction of particular markers for each animal model (e.g., the damaged lung volume) but are also capable of inferring the common agents behind such particular indicators (e.g., bacterial burden).

Although AI, especially Deep Learning (DL), has eased the process (Zhou et al., 2021; Hinton, 2018), some design premises have lessened its inference capabilities. In particular, DL models excel at extracting the statistical dependence between input-output pairs, i.e., (x_i, y_i) ∈ X × Y, from assumed independent and identically distributed (i.i.d.) observational data (Peters et al., 2017). Such success has steered model designs towards an insufficient representation learning strategy (Bengio et al., 2013).
Namely, discovering the statistical dependence between specific data pair samples is prioritized over understanding the physical model that generates the whole data population (e.g., physiological mechanisms). Since the i.i.d. assumption is fragile, well-known distribution shifts (Castro et al., 2020) between the data employed during the training, validation and test phases and real-world data are common. Under this scenario, models tend to learn correlated representations that only hold for specific environments or domains, namely spurious correlations (Arjovsky et al., 2020). Since (as a mantra) correlation does not imply causation, such flaws have ruinous effects (DeGrave et al., 2021; Roberts et al., 2021) on generalisation, transferability and explainability (Scholkopf et al., 2021).

More formally, naive DL models maximize a joint distribution, p(X, Y) or p(X) (self-supervision), characterized by an entangled representation of the input. Namely, if X and Y correlate during training without necessarily deriving from a causal representation (X → Y), p(X, Y) can adopt numerous domain-specific factorization forms (Goyal and Bengio, 2021). This forces the implementation of independent models even for related domains (in our case, lung CT images of TB animal models). Such models are then pooled through post-hoc analysis, losing possible data synergies.

In general, learning strategies mitigate this issue by shrinking the space of p(X, Y) solutions. To this aim, models are enriched by injecting inductive biases (e.g., CNNs assume spatial correlation (Dumoulin and Visin, 2018)) to facilitate the discovery of more meaningful and disentangled representations (Liu et al., 2021). These strategies resemble human cognition, since humans arrange the proper biases to extract a limited number of relevant factors transferable among different environments (Pearl, 2011). AI system design can follow a similar causal perspective; namely, specific biases can be introduced to shrink the solution space. Thus, in this work, we consider two biases: a) the strongly hierarchical nature of the human visual system and b) the data generation process. Such an approach intends to mimic the radiologists' task, as they take into account specific patient factors (i.e., clinical history, sex, age) beyond the image per se. This approach yields more effective disentangled representations of the input (Scholkopf et al., 2021).

In particular, we intend to identify the unique mechanisms that govern the generation of translational lung Computed Tomography (CT) images and their corresponding segmentation masks (Figure 1(a)). We employ three different animal models (mouse, primate and human) infected by Mtb. (Pai et al., 2016). From a simplified radiological point of view, mammals' lungs share texture and shape features. We model these shared attributes as an effect of the same causative factors, e.g., bacterial load (see Appendix A). To prove the benefits of our strategy, we show how, after optimizing the model with a limited number of volumes, our design can:

• Produce a very accurate reconstruction of the input images and generate suitable segmentation masks (Figure 4, Table 2).

• Generate new realistic images of the three different animal models while controlling the lung damage in each, which implies the proper characterization of the disentangled variables (Figure 2).

• Generate counterfactual images of damaged lungs (Schutte et al., 2021; Cohen et al., 2021).
Namely, the model can capture the meaningful representations of an input image and convert it into a healthy version by intervening on the damage variable.

We define a generative model in which the high-dimensional texture and shape features that can be extracted from lung CT images and their corresponding segmentation masks result from the causal Directed Acyclic Graph (DAG) presented in Figure 1(a).

Figure 1: (a) Causal DAG for the generation of lung CT images, x, and their segmentation masks, y, both generated from a hierarchy of latent variables at different resolution scales, K, governed by three factors: the animal model, A, the relative position of the axial slice, S, and the estimated lung damage caused by Mtb., D. (b) Summarized architecture: blue and pink represent the image and mask generation branches; features are concatenated and the p_θ, q_φ parameters combined during training. The encoder is not present for image generation (Section 3.3). Counterfactual images arise from inferring and setting some values at the deepest representation level, z^0 (Section 3.4).

The proposed DAG necessarily simplifies the physical image generation process. All the possible elementary causative factors (e.g., specific scanner, comorbidities, subject age, sex) are reduced to three: the animal model, A, the observed lung axial slice, S, and the lung damage, D. The causative factors are modelled as three groups of independent variables, z^0, under the noise terms, ε_{A,S,D}, which comprise noise and unconsidered variables. The primary variables govern the generative process, which follows a part-whole hierarchy (Hinton, 2021) from low-level representations of the texture and shape features, z^1, to high-dimensional ones, z^K, the observed image, x, and the segmentation mask, y. This part-whole hierarchy resembles the functioning of brain columns (Locatello et al., 2020; Devlin et al., 2018). Variable superscripts, z^k, denote hierarchy levels in the DAG, and the plate notation in the DAG represents this upsampling generation.

The DAG implements two paths diverging after the first hierarchy level (the shared representation path), z^1. The division forces the model, during optimization, to generate disentangled representations of shape, z_L, and texture, z_R. CT images depend on both shape and texture variables (blue path), while the segmentation masks depend only on shape variables (pink path). Then, assuming the independence of the noise terms, the independent causal mechanism (ICM) principle is fulfilled (Scholkopf et al., 2021) and the following disentangled factorization arises:

p_θ(x, y, z) = p_θ(x | z^K_L, z^K_R) p_θ(y | z^K_L) [∏_{k=2}^{K} p_θ(z^k | z^{k-1})] p_θ(z^1 | z^0) p(z^0), with z^k = (z^k_L, z^k_R)    (1)

In the above equation, each conditional distribution is parametrized by depthwise convolutional decoders. The parameters, θ, leverage a high-capacity model (Figure 1(b)), allowing the characterization of the unobservable causes of variation (ε) consistent with the available data (in our case, lung CT images) (Peters et al., 2017; Pawlowski et al., 2020). Once the model is optimized, it is possible to modify the disentangled variables to obtain newly generated images.

The computation of the parameters requires the optimization of the posterior probability, p_θ(z | x, y), which is intractable. To tackle this issue, we adapt the particular factorization in Equation (1). We employ deep Variational Autoencoders (deep VAEs), which offer greater expressiveness than traditional VAEs (Kingma et al., 2016; Child, 2020). Thus, we can generate more detailed images and implement our hierarchical model. In this way, we obtain the best approximate amortized posterior distribution, q_φ(z | x), where φ denotes the parameters of the encoder.
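To make the two-branch design concrete, the following is a minimal, single-scale PyTorch sketch of the idea, not the actual NVAE-based implementation: one encoder amortizes the posterior from the image alone, the latent code splits into shape and texture parts, and two decoders generate the image (from shape and texture) and the mask (from shape only). All layer sizes and module names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TwoBranchVAE(nn.Module):
    """Minimal sketch: shared encoder, shape/texture split, two decoder branches."""

    def __init__(self, z_dim=32):
        super().__init__()
        # The encoder sees only the image x; the mask y provides no amortization signal.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.ReLU(),   # 256 -> 128
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 128 -> 64
            nn.Flatten(),
            nn.Linear(64 * 64 * 64, 2 * 2 * z_dim),                # -> (mu, logvar)
        )
        # Image branch uses shape + texture; mask branch uses shape only.
        self.image_dec = nn.Sequential(nn.Linear(2 * z_dim, 256 * 256), nn.Sigmoid())
        self.mask_dec = nn.Sequential(nn.Linear(z_dim, 256 * 256), nn.Sigmoid())

    def forward(self, x):
        mu, logvar = self.encoder(x).chunk(2, dim=1)          # q_phi(z | x)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparametrization trick
        z_shape, z_texture = z.chunk(2, dim=1)                # disentangled split
        x_hat = self.image_dec(torch.cat([z_shape, z_texture], dim=1))
        y_hat = self.mask_dec(z_shape)                        # mask from shape only
        return (x_hat.view(-1, 1, 256, 256), y_hat.view(-1, 1, 256, 256), mu, logvar)
```

A flat, single-level latent is used here for brevity; the actual model stacks K = 6 latent scales as described above.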
Notice that the distribution is amortized just from x (not from y), so we force the model to extract the meaningful mechanism for generating the segmentation masks solely from the self-supervisory signal of the image (LeCun and Misra, 2021). Indeed, we add a segmentation branch to the architecture (Figure 1(b)), dependent on the main branch.

Specifically, we adopt the Nouveau VAE (NVAE) (Vahdat and Kautz, 2020). This architecture is carefully designed for hierarchical models. Moreover, it has proven effective at approximating posteriors by introducing an inductive bias in the image generation process within a deeply hierarchical architecture. To this aim, the set of z variables at each representation level k is divided into smaller sets, m_k, to obtain a total of M groups of latent variables. Thus, a hierarchical structure is also established within each resolution, with z being the set:

z = {z_1, z_2, ..., z_M}

Its prior and approximate posterior probabilities are given by:

p_θ(z) = ∏_{m=1}^{M} p_θ(z_m | z_{<m}),    q_φ(z | x) = ∏_{m=1}^{M} q_φ(z_m | z_{<m}, x)

Following this formulation, from marginalization of the log of Equation (1) and rearranging terms, we obtain the variational lower bound to optimize (the image and mask likelihood terms correspond to the blue and pink optimization branches of Figure 1(b)):

L(x, y) = E_{q_φ(z|x)}[log p_θ(x | z) + log p_θ(y | z)] − KL(q_φ(z_1 | x) || p_θ(z_1)) − ∑_{m=2}^{M} E_{q_φ(z_{<m}|x)}[KL(q_φ(z_m | z_{<m}, x) || p_θ(z_m | z_{<m}))]

with KL being the Kullback-Leibler divergence and q_φ(z_{<m} | x) the approximate posterior through the hierarchy up to the (m−1)-th group. Since NVAE convergence depends on a reasonable approximation of the KL terms (see (Vahdat and Kautz, 2020)), all prior and posterior probabilities are approximated as Normal distributions. Thus, we can write:

p_θ(z_m | z_{<m}) = N(µ_m(z_{<m}), σ_m(z_{<m})),    q_φ(z_m | z_{<m}, x) = N(µ_m(z_{<m}) + ∆µ_m(z_{<m}, x), σ_m(z_{<m}) · ∆σ_m(z_{<m}, x))

where ∆µ_m and ∆σ_m are the relative shifts predicted by the encoder.

The model is optimized employing small datasets: ten lung CT volumes per animal model (∼2000 slices). The data used for training are axial slices from three Mtb.-infected lung models. The dataset names identify the animal model, A (M for mouse, P for primate, H for human), the data source (GSK, PHE, CLE or RAD) and the phase (the subscript tr denotes training). Note that all datasets include segmentation masks delineated by trained experts. A detailed description of the different datasets is presented in Table 1.

The model is optimized employing six scales, K = 6, with 18 latent variables per scale, each partitioned into m_k groups as m_k = [2, 2, 2, 3, 6, 9]. The three means µ_A, µ_S and µ_D are known during training (µ_A ∈ {−1, 0, 1}, µ_D ∈ (0, 1), µ_S ∈ (0, 1)), fixed at image generation, and inferred for image reconstruction and segmentation mask generation by employing KL(q_φ(z^0) || N(0, 1)). During optimization, µ_D is given by the relative healthy lung volume (extracted by simple thresholding) with respect to the ground truth mask volume.

After optimization, the model can generate realistic images, such as those shown in Figure 2, by choosing the mean values of the z^0_A, z^0_S, z^0_D factors. To illustrate this capacity, in Figure 2 we set a relative slice position of 0.5; the animal model is fixed for each row, and the effect of the lung damage variable is modulated from lower to higher across the columns.

The first column of each row in Figure 3 shows an actual image of a damaged lung corresponding to a given animal model. When no actions are performed, the model infers the disentangled image representation of the causative variables (z^0_A, z^0_S, z^0_D) through the encoder. Subsequently, the image is reconstructed, and a segmentation mask (third column) is generated employing the optimized decoder (Figure 1(b)). The second column shows a healthy counterfactual of the input image, generated by setting to zero the mean value of the inferred damage variable, z^0_D.
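In pseudo-code, the intervention reduces to the following sketch; `model.infer` and `model.decode` are hypothetical helper names standing in for the encoder pass down to z^0 and the generative pass from z^0, not the actual API:

```python
import torch

@torch.no_grad()
def healthy_counterfactual(model, x):
    """Infer the causal factors of a slice, zero out the damage mean, and decode.

    `model.infer` and `model.decode` are assumed helpers: the first runs the
    encoder down to the deepest level z^0, the second runs the decoder from it.
    """
    z_a, z_s, z_d = model.infer(x)   # inferred animal model, slice position, damage
    z_d = torch.zeros_like(z_d)      # intervention: do(damage = 0)
    slice_cf, mask_cf = model.decode(z_a, z_s, z_d)
    return slice_cf, mask_cf
```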
The decoder is fed with the zero-mean z^0_D and the rest of the (unaltered) inferred causal variables to generate the counterfactual version of the slice and its respective mask (fourth column).

Figure 3: For each input slice, the encoder infers the causal variables z^0_A, z^0_S, z^0_D. By setting the damage variable z^0_D to 0, the decoder generates the healthy counterfactual (counterfactual slice) and its respective mask (counterfactual mask).

Pathological lung segmentation is an important task to solve in drug development studies. Unfortunately, it is a complex task due to the difficulty of discriminating between lesions and other neighboring tissues. Moreover, the diversity of the biological data adds further difficulty (Hofmanninger et al., 2020). In this experiment, we retrain the optimized model with counterfactual images to generate the segmentation masks from the test datasets (Section 3.1). We use the approach described previously to generate the counterfactual images (Section 3.4). To learn about the strengths and weaknesses of this generative approach, we compare the results obtained, our_c, with the segmentation masks calculated by our original method, our_nc, and by the state-of-the-art fully supervised method, nnU-Net (Isensee et al., 2021).

Table 2: Mean and standard deviation (SD) of the Dice Similarity Coefficient (DSC) and Hausdorff Distance (HD) between the ground truth masks and the masks obtained with the methods indicated in the rows (nnU-Net; the proposed method before employing counterfactual images, our_nc, and after, our_c) for each test dataset (columns).

Table 2 shows the mean and standard deviation of the Dice Similarity Coefficient (DSC) and Hausdorff Distance (HD) between each segmentation method and the ground truth masks for each test dataset. The results show an improvement in all measures and datasets when employing counterfactual images, yielding results similar to those of the nnU-Net. In most cases, the differences are due to subtle changes or even to small imperfections in the ground truth masks, as shown in Figure 4.

The methodology proposed in this work yields promising results in obtaining the factors that characterize the pathophysiological processes shared between animal models. Nevertheless, the approach suffers from several limitations: the use of isolated axial slices instead of the more informative three-dimensional volumes, and the characterization of disease affectation based simply on the damaged lung volume rather than on the specific manifestations of the disease in each animal model. These limitations will be the object of future work.

To sum up, our model is capable of inferring meaningful disentangled representations. Namely, it generates synthetic slices by setting the values of the modelled factors. Even more relevantly, it produces counterfactual versions of existing slices, attesting to the effective disentanglement. In the future, we will explore strategies to exploit the approach to increase the diversity of existing data, essential for automatic segmentation, or to provide the damage variable as a possible (to-be-validated) inter-species biomarker.

In the case of our model, we rescale the cropped images to a resolution of 256 × 256 pixels and normalize the intensity to (0-1). During training, our model needs an estimation of the healthy lung volume per CT (Sections 2 and 3.2). To this aim, over the cropped image, we apply a threshold to recover just the healthy tissue inside the whole lung mask.
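A minimal NumPy sketch of this relative healthy-volume estimate, assuming a CT volume expressed in Hounsfield Units and a boolean lung mask (the function name is ours, and the default bounds correspond to the human training range; the dataset-specific thresholds are given next):

```python
import numpy as np

def relative_healthy_volume(ct_hu, lung_mask, lo=-900, hi=-200):
    """Fraction of the lung mask whose voxels fall inside the healthy HU range.

    ct_hu: CT volume in Hounsfield Units; lung_mask: boolean ground-truth mask.
    lo/hi: healthy-tissue HU bounds (dataset-specific, see the text below).
    """
    healthy = (ct_hu >= lo) & (ct_hu <= hi) & lung_mask
    return healthy.sum() / lung_mask.sum()
```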
Following the experts' recommendations, we set this threshold from −900 to −200 HU for the human training dataset (H_CLE^tr), from −1000 to −200 HU for the macaque dataset (P_PHE^tr), and from −800 to −300 HU for the mouse model (M_GSK^tr). The healthy volume extracted is divided by the total mask volume to obtain the relative value employed during training.

• Selection of each dataset sample: We use 30 CT volumes per dataset employed during training and testing, and 20 when the dataset is employed just during the test phase, as described in Section 3.1. Except for the M_GSK and H_RAD datasets, the original datasets contain more than 30/20 volumes. To define our specific trimmed samples, we employ the relative healthy volume to classify each CT as low damage (relative healthy volume ≥ 0.85), medium damage (0.85 > relative healthy volume > 0.4) or high damage (relative healthy volume ≤ 0.4). Subsequently, we randomly select the same number of subjects per interval.

• Training details: We employ the two Titan V GPUs for 900 epochs with a total batch size of 8, using the Adamax optimizer with an initial learning rate of 0.01 and a Cosine Annealing scheduler (minimum learning rate: 1e-4). We apply online data augmentation to the normalized images by employing random affine transformations (10° rotation) and adding Gaussian noise (µ = 0, σ = 0.05). For the nnU-Net (Isensee et al., 2021), we use a single Titan V GPU, following the nnU-Net authors' recommendations. After adapting the cropped image name formats to the nnU-Net requirements, we run the nnU-Net plan-and-preprocess function to allow the pipeline to estimate the network configuration and training parameters. The complete list can be found in the following link. Subsequently, we train a 2D configuration in a 5-fold cross-validation for 1000 epochs per fold, employing a batch size of 14, data augmentation (see the linked file for details), the SGD optimizer and a learning rate of 0.01.

• Image generation speed: After loading the trained model (∼20 s), it is possible to generate a batch of 16 images of 256 × 256 pixels in approximately 0.25 s.

Fig. 7: Each column contains instances of each dataset previously employed in Section 3.4. The first three rows depict the original, reconstructed and counterfactual slices, with the profile lines in green, yellow and blue, respectively. The last row plots the HU profiles per voxel. Vertical dashed lines highlight large differences between the real/reconstructed and counterfactual slices.

References:

Invariant Risk Minimization
Immune Diagnosis of Tuberculosis Through Novel Technologies
Representation learning: A review and new perspectives
Causality matters in medical imaging
Very Deep VAEs Generalize Autoregressive Models and Can Outperform Them on Images
COVID-19 Image Data Collection
Gifsplanation via Latent Shift: A Simple Autoencoder Approach to Counterfactual Generation for Chest X-rays
AI for radiographic COVID-19 detection selects shortcuts over signal
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (NAACL HLT 2019)
Overview of ImageCLEFtuberculosis 2019 - Automatic CT-based Report Generation and Tuberculosis Severity Assessment
A guide to convolution arithmetic for deep learning
The immunological life cycle of tuberculosis
Unsupervised CT Lung Image Segmentation of a Mycobacterium Tuberculosis Infection Model
A Multi-Task Self-Normalizing 3D-CNN to Infer Tuberculosis Radiological Manifestations
Inductive Biases for Deep Learning of Higher-Level Cognition
Deep Learning - A Technology With the Potential to Transform Health Care
How to represent part-whole hierarchies in a neural network
Automatic lung segmentation in routine imaging is primarily a data diversity problem, not a methodology problem
nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation
Improving Variational Inference with Inverse Autoregressive Flow (NIPS)
Self-supervised learning: The dark matter of intelligence
A Tutorial on Learning Disentangled Representations in the Imaging Domain
Object-Centric Learning with Slot Attention
Deep Structural Causal Models for Tractable Counterfactual Inference
Causality: Models, Reasoning, and Inference, second edition
Elements of Causal Inference: Foundations and Learning Algorithms
Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans
A preclinical micro-computed tomography database including 3D whole body organ segmentations (Scientific Data)
Toward Causal Representation Learning
Using StyleGAN for Visual Interpretability of Deep Learning Models on Medical Images
NVAE: A Deep Hierarchical Variational Autoencoder
Molecular imaging in drug development
One Size Fits All? Not in In Vivo Modeling of Tuberculosis Chemotherapeutics
A Review of Deep Learning in Medical Imaging: Imaging Traits, Technology Trends, Case Studies With Progress Highlights, and Future Promises

This project has received funding from the Innovative Medicines Initiative 2 Joint Undertaking (JU) under grant agreement No. 853989. The JU receives support from the European Union's Horizon 2020 research and innovation programme, EFPIA, the Global Alliance for TB Drug Development non-profit organisation, the Bill & Melinda Gates Foundation and the University of Dundee. This work was partially funded by Ministerio de Ciencia, Innovación y Universidades, Agencia Estatal de Investigación, under grant PID2019-109820RB-I00, MCIN/AEI/10.13039/501100011033/, co-financed by the European Regional Development Fund (ERDF), "A way of making Europe."

Figure (Appendix A): The TB immunological life cycle (…-Ornelas et al., 2012; Ernst, 2012) and the main tests used to characterise the entire disease spectrum. The inner cycle names the traditional categorical clinical stages of the continuous spectrum of the TB immunological life cycle. Each outer circle represents the capability of each TB assessment test: blank spaces denote lack of sensitivity, bicolour segments represent the binary character of the test, and gradient segments represent the ability to provide a continuous value.

The following list offers further details about the context of the experiments.

• System setup: All experiments were performed on a machine with an Intel Xeon 8153 CPU, 64 GB of RAM and two 12-GB Titan V GPUs. We created a specific Docker image based on Ubuntu 20.04 with Python 3.6.9 and torch 1.6.0 to run our code.

• Preprocessing: To reduce the size of the chest CT images, we crop the images and their respective segmentation masks to the body region. We employ thresholding from −1024 to 600 Hounsfield Units (HU), followed by morphological operations to eliminate small isolated blobs (a sketch of this step is shown below).
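A plausible SciPy sketch of this cropping step, under explicit assumptions: the exact morphological operations are not specified in the text, so the binary opening, its iteration count and the crop margin below are our choices, not the authors':

```python
import numpy as np
from scipy import ndimage

def crop_to_body(ct_hu, margin=5):
    """Threshold the body HU range, remove small blobs, keep the largest
    connected component and return the bounding-box crop of the volume."""
    body = (ct_hu >= -1024) & (ct_hu <= 600)           # body-region HU window
    body = ndimage.binary_opening(body, iterations=2)  # drop small isolated blobs
    labels, n = ndimage.label(body)                    # connected components
    if n > 1:                                          # keep the largest one
        sizes = ndimage.sum(body, labels, range(1, n + 1))
        body = labels == (np.argmax(sizes) + 1)
    bbox = ndimage.find_objects(body.astype(int))[0]   # bounding box of the body
    return ct_hu[tuple(slice(max(s.start - margin, 0), s.stop + margin)
                       for s in bbox)]
```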
Finally, we select the connected component corresponding to the whole body region. Since nnU-Net automatically estimates the rest of the preprocessing operations, these cropped volumes feed the nnU-Net preprocessing pipelines. The details of the nnU-Net experiments are given in the list above.

This appendix shows instances of generated slices, fixing the damage and varying the relative slice position. This experiment extends Section 3.3, in which the axial slices belong to a fixed relative slice position. Since our chest CT volumes are oriented from cephalic to caudal, the model generates axial images of the upper airways (trachea) and the corresponding surrounding tissues per animal model at the lowest slice position, as shown in the first column of Figure 6. Likewise, the second column shows the corresponding generated anatomy for the superior lungs, while the third and fourth columns show the middle and inferior regions, respectively. Finally, the fifth column depicts the generated version at the beginning of the abdominal anatomy.

Fig. 6: Synthetic lung CT images generated by our model. Images are generated with a fixed relative damage, µ_D = 0.5. For each row, the animal model µ_A is fixed to −1, 0 and 1, respectively, while for each column the relative slice position µ_S is increased between 0 and 1.

This appendix extends the qualitative results presented in Section 3.4. That section shows the model's capacity to generate counterfactual images and their respective segmentation masks. Here, we evaluate how realistic the generated images are. To that end, we compare the Hounsfield Units (HU) of real CT slices with two cases: a) the slice reconstructed from the variables inferred by the encoder, without modification of any of these values, and b) the counterfactual image, namely, the one obtained after intervening on the inferred damage value.

We compute the voxel-wise Root Mean Square Error (RMSE) for the reconstructed images per test dataset. Table 3 shows these results, with an average RMSE = 18.73 ± 2.16. Voxel-wise evaluation is not suitable for counterfactual images; prior manual delimitation of comparable regions is needed, which is a priority for our future work.

To illustrate the similarities and differences on the HU scale, in Figure 7 we plot the HU profiles belonging to the damaged regions shown in Figure 3. Respectively, the first three rows contain: 1) the original axial slice from the different test datasets, with the horizontal profile line in green; 2) the reconstructed slice (generated from the µ_A, µ_S and µ_D inferred by our model), with the profile line in yellow; and 3) the counterfactual slice (generated maintaining the inferred µ_A and µ_S and correcting µ_D), with the profile line in blue. The last row shows the HU plot for each profile-specific colour. HU values are similar for the three slices, except in those regions where the counterfactual version replaces the damage with healthy-like tissue. We highlight such changes by framing them with vertical dashed red lines. Besides, it is important to note that the original and reconstructed images present noisier patterns than the counterfactual version, as expected from its blurrier appearance and the thickening of the soft tissue in the mice dataset.
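For completeness, the voxel-wise RMSE used in the comparison above reduces to a few lines (a sketch, assuming the real and reconstructed slices are aligned NumPy arrays expressed in HU; the function name is ours):

```python
import numpy as np

def rmse_hu(real_slice, recon_slice):
    """Voxel-wise root mean square error between two aligned slices in HU."""
    diff = real_slice.astype(np.float64) - recon_slice.astype(np.float64)
    return float(np.sqrt(np.mean(diff ** 2)))
```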