key: cord-0288694-hd33k1db authors: Amodio, Matthew; Shung, Dennis; Burkhardt, Daniel; Wong, Patrick; Simonov, Michael; Yamamoto, Yu; van Dijk, David; Wilson, Francis Perry; Iwasaki, Akiko; Krishnaswamy, Smita title: Generating hard-to-obtain information from easy-to-obtain information: applications in drug discovery and clinical inference date: 2020-08-22 journal: bioRxiv DOI: 10.1101/2020.08.20.259598 sha: 1b77eb89b727da0bc18d1004faf5333b4b4ae11c doc_id: 288694 cord_uid: hd33k1db In many important contexts involving measurements of biological entities, there are distinct categories of information: some information is easy-to-obtain information (EI) and can be gathered on virtually every subject of interest, while other information is hard-to-obtain information (HI) and can only be gathered on some of the biological samples. For example, in the context of drug discovery, measurements like the chemical structure of a drug are EI, while measurements of the transcriptome of a cell population perturbed with the drug are HI. In the clinical context, basic health monitoring is EI because it is already being captured as part of other processes, while cellular measurements like flow cytometry or even ultimate patient outcome are HI. We propose building a model to make probabilistic predictions of HI from EI on the samples that have both kinds of measurements, which will allow us to generalize and predict the HI on a large set of samples from just the EI. To accomplish this, we present a conditional Generative Adversarial Network (cGAN) framework we call the Feature Mapping GAN (FMGAN). By using the EI as conditions to map to the HI, we demonstrate that FMGAN can accurately predict the HI, including its heterogeneity in cases where a single EI condition corresponds to a distribution of HI. We show that FMGAN is flexible in that it can learn rich and complex mappings from EI to HI, and can take into account manifold structure in the EI space where available.
We demonstrate this in a variety of contexts including generating RNA sequencing results on cell lines subjected to drug perturbations using drug chemical structure, and generating clinical outcomes from patient lab measurements. Most notably, we are able to generate synthetic flow cytometry data from clinical variables on a cohort of COVID-19 patients, effectively describing their immune response in great detail and showcasing the power of generating expensive FACS data from ubiquitously available patient monitoring data.

Bigger Picture: Many experiments face a trade-off between gathering easy-to-collect information on many samples or hard-to-collect information on a smaller number of samples, due to costs in terms of both money and time. We demonstrate that a mapping between the easy-to-collect and hard-to-collect information can be trained as a conditional GAN from a subset of samples with both measured. With our conditional GAN model, known as the Feature-Mapping GAN (FMGAN), the results of expensive experiments can be predicted, saving on the costs of actually performing the experiment. This can have major impact in many settings. We study two example settings. First, in pharmaceutical drug discovery, early-phase experiments require casting a wide net to find a few potential leads to follow. In the long term, development pipelines can be re-designed to specifically utilize FMGAN in an optimal way to accelerate the process of drug discovery. FMGAN can also have a major impact in clinical settings, where routinely measured variables like blood pressure or heart rate can be used to predict important health outcomes and thereby inform the best course of treatment.

Figure 1. (a) The measurements on data are separated into "easy-to-collect information" (EI) and "hard-to-collect information" (HI). The easy-to-collect measurements are available on all data, while the hard-to-collect measurements are only available on some data.
(b) With a Conditional GAN, we can learn to model the relationship between these two categories of measurements.

The FMGAN we propose uses a conditional Generative Adversarial Network (cGAN) to generate hard-to-collect information (like sequencing results from a perturbation experiment) from easy-to-collect information (like basic information on the drugs used). Specifically, we propose a cGAN with the easy-to-collect information as conditions and the hard-to-collect information as the data distribution. A cGAN is a generative model that learns to generate points based on a conditional label that is given to the generator G. In the adversarial learning framework, G is guided into generating realistic data during training by another network, the discriminator D, which tries to distinguish between samples from the real data and samples from the generated data. The generator G and discriminator D are trained by alternating optimization of G and D.

A standard GAN learns to map from random stochastic input z ~ N(0, 1) (or a similarly simple distribution) to the data distribution by training G and D in alternating gradient descent with the following objective:

    min_G max_D  E_{x~P(x)}[log D(x)] + E_{z~P_z}[log(1 - D(G(z)))]

The generator in a cGAN receives both the random stochastic input z and a conditional label l, and thus has the following objective:

    min_G max_D  E_{x~P(x|l)}[log D(x|l)] + E_{z~P_z}[log(1 - D(G(z|l)))]

The cGAN was originally used in image generation contexts, where the condition referred to what type of image should be generated (e.g. a dog). The cGAN is useful in this context because the generator G receives a sample from a noise distribution (as in a typical GAN) as well as the condition. Thus, it is able to generate a distribution that is conditioned on the label, as opposed to a single deterministic output conditioned on the label.
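To make the two objectives concrete, here is a minimal sketch of the discriminator and generator losses as functions of the discriminator's scores on real and generated (condition, sample) pairs. The function names are ours, and the real FMGAN uses deep networks over gene-expression vectors; this only illustrates the scalar loss terms above.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Negative of the inner max objective: the discriminator wants
    D(x|l) -> 1 on real pairs and D(G(z|l)|l) -> 0 on generated pairs."""
    term_real = sum(math.log(p) for p in d_real) / len(d_real)
    term_fake = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return -(term_real + term_fake)

def generator_loss(d_fake):
    """The generator minimizes log(1 - D(G(z|l))), i.e. it is rewarded
    when the discriminator scores generated samples as real."""
    return sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
```

A confident discriminator (scores near 1 on real data, near 0 on fakes) has a low discriminator loss; a generator that fools the discriminator (fake scores near 1) has a low generator loss, which is the tension the alternating optimization exploits.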
In the original use case, it can learn to generate a wide variety of images of dogs when given the conditional label for dogs, for example. While many previous methods exist for generating a single output from a single input, there are few alternatives for generating a distribution of outputs from a single input without placing assumptions on the parametric form of the output distribution.

The framework of the FMGAN is summarized in Figure 1. The columns of the data are separated into easy-to-collect information (EI) and hard-to-collect information (HI). In the notation of the GAN, we use the EI as the conditional label l and the HI as the data x. For observations that have both, we train the FMGAN with the generator receiving a label l and a noise point z, while the discriminator receives the label l and both real points x and the generated points G(z|l). Then, after training, the generator can generate points for conditions l without known data x. This allows us to impute HI where we only have EI.

The FMGAN architecture is designed to take advantage of complex relationships between the condition space and the data space. A single underlying entity (e.g. a drug or a patient) has a representation in both spaces. In the EI space, the drug is a point, while in the HI space the drug is represented by a distribution of cells perturbed by it. Despite the difference in structure, the FMGAN is able to leverage regularities in the relationship between the two spaces. This relies on the FMGAN being able to leverage manifold structure inherent within each space (for more discussion of manifold structure, please see the supplementary information).

In some cases, the data modality for the EI is difficult to utilize: for example, the chemical structure of the drug. The chemical structure can be represented as a string sequence called SMILES or a two-dimensional image of the structure diagram.
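Before any network can consume a SMILES string, the string must be turned into a numeric array. A minimal sketch of one common encoding (one-hot over a fixed character vocabulary, padded to a fixed length; the function name, vocabulary, and padding scheme here are our assumptions, not necessarily the paper's):

```python
def smiles_to_onehot(smiles, vocab, max_len):
    """One-hot encode a SMILES string as a max_len x len(vocab) matrix,
    truncating long strings and zero-padding short ones, suitable as
    input to a convolutional embedding network."""
    idx = {ch: i for i, ch in enumerate(vocab)}
    mat = [[0] * len(vocab) for _ in range(max_len)]
    for pos, ch in enumerate(smiles[:max_len]):
        mat[pos][idx[ch]] = 1
    return mat
```

For example, encoding formaldehyde's SMILES "C=O" against a toy vocabulary yields a matrix with one active column per character and all-zero padding rows after position 2.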
Small changes in the chemical structure can have large effects on its function, but may appear to be minor changes to the overall SMILES string or the overall structure diagram image. Thus, we use an embedding neural network, parameterized as a convolutional network, to process these representations into a more regular space where standard distances and directions are meaningful. This parameterization is crucial, as originally the structure is

These replicates produce variable effects, motivating the need for a framework that is capable of modeling such stochasticity.

We design four separate experiments with this dataset: held-out genes, PHATE coordinates, SMILES strings, and structure diagrams. In each experiment, the measurement we choose for evaluation is maximum mean discrepancy (MMD) [12]. We choose this because we require a metric that is a distance between distributions, not merely a distance between points. Taking the mean of a distance between points would not capture the shape of the predicted distribution.

Figure 2. The formation of easy-to-collect (red columns) and hard-to-collect (white columns) data for each experiment with drug perturbation data. (a) In the held-out genes experiment, the easy-to-collect measurements are taken from held-out genes. (b) In the PHATE coordinate experiment, they are the result of running PHATE on the gene matrix. (c) In the SMILES string experiment, the easy-to-collect data is an embedding from processing this representation with a CNN. (d) In the structure diagram experiment, it is the same as in the SMILES string experiment except run on the structure diagrams.

We compare our FMGAN to a baseline not built on the cGAN framework. In developing a baseline, we must compare to a model that takes in a point and outputs an entire distribution. As most existing work yields deterministic output, we create our own stochastic distribution-yielding model to compare to. This model, which we term simply "Baseline", takes a condition and a sample from a random noise distribution as input, just like our FMGAN.
However, unlike our model, which uses adversarial training and a deep neural network, the Baseline is a simpler feed-forward neural network that minimizes the mean-squared error (MSE) between the output of a linear transformation and the real gene profile for that condition. As it is given noise input as well as a condition, it is still able to generate whole distributions as predictions for each condition, rather than a single deterministic output.

To show our cGAN can learn informative mappings from the EI space to the gene expression space, as distinct from the rest of the process, we first choose a means of obtaining EI that is known to be meaningfully connected to the gene expression space. Specifically, we artificially hold out ten genes and use their values as EI, with the GAN tasked with generating the values for all other genes. This experimental design is summarized in Figure 2a. We choose the ten genes algorithmically by selecting one randomly and then greedily adding to the set the gene with the least shared correlation with the others, to ensure the information in their values has as little redundancy as possible: PHGDH, PRCP, CIAPIN1, GNAI1, PLSCR1, SOX4, MAP2K5, BAD, SPP1, and TIAM1. In addition to dividing up the gene space to use these ten genes to predict all of the rest, we also divide up the cell space and train on 80% of the cell data, with the last 20% held out for testing.

We find our cGAN is able to successfully leverage information in the EI space to accurately model the data. We designed our proof of concept deliberately so that the true values are known for each gene expression and drug we ask our network to predict. These values can be compared to the predictions with MMD for a measure of accuracy.
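The greedy low-redundancy gene selection described above can be sketched as follows. This is a minimal pure-Python version; the exact redundancy criterion (here, the maximum absolute Pearson correlation with the already-selected set) is our assumption about how "least shared correlation" is scored.

```python
import math
import random

def pearson(a, b):
    """Pearson correlation between two equal-length lists of values."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb) if sa and sb else 0.0

def greedy_low_correlation_genes(expr, k, start=None, seed=0):
    """expr: dict mapping gene name -> expression values across cells.
    Start from one gene (random if not given), then repeatedly add the
    gene whose maximum absolute correlation with the selected set is
    smallest, so the chosen EI genes carry minimally redundant signal."""
    genes = sorted(expr)
    chosen = [start if start is not None else random.Random(seed).choice(genes)]
    while len(chosen) < k:
        rest = [g for g in genes if g not in chosen]
        best = min(rest, key=lambda g: max(abs(pearson(expr[g], expr[c]))
                                           for c in chosen))
        chosen.append(best)
    return chosen
```

On a toy expression table where gene B is perfectly correlated with gene A but gene C is nearly uncorrelated, starting from A selects C before B, as intended.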
Our cGAN is able to generate predictions with an MMD of 2.847 between them and the validation set (drugs it has never previously seen), showing it very effectively learned to model the dependency structure between the EI space and the HI space, even on newly introduced drugs (Table 1). This is in comparison to the Baseline model, which has a higher (worse) MMD of 2.922. It is noteworthy that the FMGAN outperforms the baseline even in this case, where no processing of the EI needs to take place, as the values are numerically meaningful to begin with.

We can also visualize the embedding spaces learned by the generator to investigate the model. Shown in Figure 3a are the generator's embeddings colored by each of the held-out genes. As we can see, the generator found some of these more informative in learning an EI embedding than others.

Our next experiment formulates the EI space not as individual held-out genes, but instead as a dimensionality-reduced representation of the whole space. We theorize that this approach would be beneficial over the previous held-out-genes experiment if the EI data exhibits manifold structure. If it does, this processing will have made a geometric representation of the EI that corresponds to the HI, and thus the mapping is computationally simpler. Previous work has shown that gene expression profiles often do exhibit this manifold structure [11, 13, 14].

We run the embedding tool PHATE on the gene profiles to calculate two coordinates, which we then use as EI in our FMGAN [11]. Doing so preserves the manifold structure of the data, allowing for a meaningful transformation to the HI space. This process is depicted in Figure 2b.
As usual, we separate cells into an 80%/20% training/testing split for evaluation purposes, after subsampling to ten thousand points for computational feasibility with the dimensionality reduction method, and we report scores on the evaluation points. As shown in Table 1, once again the FMGAN better models the target distribution, as measured by MMD.

Next, we test the full pipeline of FMGAN by using SMILES string embeddings as the EI (summarized in Figure 2c). This is a much more challenging test case, because in the previous cases each point in HI space had a distinct condition, and in the case of the PHATE coordinates, that condition was derived from the data it had to predict. In this case, many different data points have the same condition, and thus the relationship between the EI and the HI is much less direct.

An additional wrinkle arises in this setting: the conditions to the cGAN are learned from a raw data structure, rather than existing a priori in their final numerical form like held-out genes or PHATE coordinates. Since G and D are trained adversarially and each depends on the embedder E, the networks could try to beat each other by manipulating the embeddings into being non-informative for the other network. Thus, we let G and D each learn their own embedder E, removing the incentive to make E non-informative.

As in the previous experiment, we separate the data into an 80%/20% training/testing split for evaluation purposes, but this time split along the drugs, since each condition gives rise to many points in the HI space.

In this section, we investigate further the EI space learned from the SMILES strings by the generator. In the two previous experiments, the conditions given to the FMGAN had information more readily available, either in the form of raw data or even more informative PHATE coordinates.
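Splitting along drugs rather than along cells means every perturbation profile from a given drug lands entirely in either the training set or the test set, so evaluation is always on unseen drugs. A minimal sketch of such a grouped split (the function name and rounding convention are our choices):

```python
import random

def split_by_condition(conditions, test_frac=0.2, seed=0):
    """Split sample indices so that all samples sharing a condition
    (e.g. all cells perturbed by the same drug) land on the same side
    of the train/test split."""
    drugs = sorted(set(conditions))
    rng = random.Random(seed)
    rng.shuffle(drugs)
    n_test = max(1, int(round(test_frac * len(drugs))))
    test_drugs = set(drugs[:n_test])
    train = [i for i, d in enumerate(conditions) if d not in test_drugs]
    test = [i for i, d in enumerate(conditions) if d in test_drugs]
    return train, test
```

The key property is that the sets of drugs appearing in the two partitions are disjoint, unlike a naive row-level 80/20 split, which would leak every drug into both sides.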
The SMILES strings, by contrast, must be informatively processed for the learned conditions to be meaningful.

In this learned EI space, there is one condition coordinate for each drug (while the HI consists of many perturbations from each drug). Shown in Figure 3b is the raw data colored by the value of the gene EIF4G2; one region of the embedding is characterized by much lower expression of this gene.

We compare this to the embedding learned by the generator, which we show in Figure 3c. In this plot, each drug is one point, colored by the mean gene value of all perturbations for that drug and sized by the number of perturbations for that drug. We see that the first two drugs are in the central part of the space, and closer to each other than they are to BRD-K79090631. The drug BRD-K79090631 is off in a different part of the space, along with other drugs low in EIF4G2. This shows that the learned conditions from the generator have indeed identified information about the drugs, taking complex sequential representations and mapping them into a much simpler space.

Figure 4 shows the real data, which is noisy but still shows different densities of mortality in different parts of the space. We also see the FMGAN-generated data next to it: qualitatively, these predictions resemble the raw data to a substantial degree. As a baseline, we can build a linear regression model that tries to predict this response variable as a function of the coordinates.

In a cGAN, the discriminator receives not only real and generated points, but also the labels for each point. As a result, the generator not only learns to generate realistic data, but it also learns to generate realistic data for a given label. After training, the labels, whose meaning is known to us, can be provided to the generator to generate points of a particular type on demand.
Because G is provided both a label and a random sample from Z, the cGAN is able to model not just a mapping from a label to a single point, but instead a mapping from a label to an entire distribution.

Learning a generative model conditioned on the labels allows information sharing across labels, another advantage of the cGAN framework. Since the generator G must share weights across labels, the signal for any particular label l_i is blended with the signal from all other labels l_j, j ≠ i, allowing for learning without massive amounts of data for each label.

A GAN trains a generator neural network G against a second discriminator network D using the following equation:

    min_G max_D  E_{x~P(x)}[log D(x)] + E_{z~P_z}[log(1 - D(G(z)))]

where x is the training data and z is a noise distribution that provides stochasticity to the generator and is chosen to be easy to sample from (typically an isotropic Gaussian).

A row-stochastic matrix P built from pairwise affinities describes a Markovian diffusion process over the intrinsic geometry of the data. Finally, a diffusion map [33] is defined by taking the eigenvalues 1 = λ_1 ≥ λ_2 ≥ ... ≥ λ_N and corresponding eigenvectors {φ_j}, j = 1, ..., N, of P, and mapping each data point into these coordinates.

PHATE uses a representation called potential distance (pdist). Potential distance is an M-divergence between the distribution in row i, P^t_{i,.}, and the distribution in row j, P^t_{j,.}. These are indeed distributions, as P^t is Markovian:

    pdist(i, j) = Σ_k (log P^t(i, k) - log P^t(j, k))^2

The log scaling inherent in potential distance effectively acts as a damping factor, which weights faraway points comparably to nearby points in terms of diffusion probability.

To evaluate the accuracy of the predicted distribution with respect to the true distribution for a given condition, we utilize maximum mean discrepancy (MMD) [36]. The MMD is a distribution distance based on a kernel applied to pairwise distances of each distribution. Specifically, the squared MMD between samples X and Y is calculated as:

    MMD^2(X, Y) = E_{x,x'~X}[k(x, x')] - 2 E_{x~X, y~Y}[k(x, y)] + E_{y,y'~Y}[k(y, y')]
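A minimal pure-Python estimator of this (biased) squared MMD with an RBF kernel, for intuition; the paper does not specify its kernel or bandwidth here, so the choice of `gamma` is our assumption.

```python
import math

def rbf(x, y, gamma):
    """RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def mmd2(X, Y, gamma=1.0):
    """Biased squared-MMD estimate between samples X and Y:
    E[k(x, x')] - 2 E[k(x, y)] + E[k(y, y')]."""
    kxx = sum(rbf(a, b, gamma) for a in X for b in X) / (len(X) ** 2)
    kyy = sum(rbf(a, b, gamma) for a in Y for b in Y) / (len(Y) ** 2)
    kxy = sum(rbf(a, b, gamma) for a in X for b in Y) / (len(X) * len(Y))
    return kxx - 2 * kxy + kyy
```

Two identical samples give a squared MMD of zero, while well-separated samples give a clearly positive value, which is why the statistic can score how closely a generated distribution matches the held-out real distribution.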