key: cord-0272588-817hud9f authors: An, Yang; Drost, Felix; Theis, Fabian; Schubert, Benjamin; Lotfollahi, Mohammad title: Jointly learning T-cell receptor and transcriptomic information to decipher the immune response date: 2021-06-25 journal: bioRxiv DOI: 10.1101/2021.06.24.449733 sha: aa12eb2de509df5061235195a9d9020b1eb2680a doc_id: 272588 cord_uid: 817hud9f T cells play a pivotal role in the adaptive immune system recognizing foreign antigens through their T-cell receptor (TCR). Although the specificity and affinity of the TCR to its cognate antigen determines the functionality, the phenotypic differentiation and thereby also the fate of the T cell remain poorly understood. Therefore, studying the transcriptional changes of T cells in the context of their TCRs is key to deeper insights into T-cell biology. To this end, we developed a multiview Variational Autoencoder (mvTCR) to jointly embed transcriptomic and TCR sequence information at a single-cell level to better capture the phenotypic behavior of T cells. We evaluated mvTCR on two datasets showing a clear separation of the cell state and their functionality, thus, providing a more biologically informative representation than models using each modality individually. T cells are an integral part of the adaptive immune system. They recognize antigen-derived epitopes that are bound to the Major Histocompatibility Complex of other cells and initiate an immune response to free the body from harmful microorganisms and cancer. This recognition is mainly driven by the T-cell receptor (TCR), a surface protein consisting of an αand β-chain, which is expressed with a diversity of up to 10 20 different sequences (Zarnitsyna et al., 2013) . The complexity of cell state and functionality makes the study of T cells a challenging endeavor. Recent advances in single-cell technologies allows to jointly determine TCR sequences and transcriptome information (10x Genomics, 2019; Yost et al., 2019; Fischer et al., 2020) , which pave the way to a deeper understanding of the phenotypic interplay between gene expression and TCR sequence. Previous approaches for embedding T cells used TCR sequences for clustering or similarity assessment of groups with common epitope specificity (Glanville et al., 2017) (Dash et al., 2017) . Due to greater datasets, new methods were proposed utilizing deep learning models such as DeepTCR, which embeds T cells based on their TCR sequence and VDJ gene usage with an Autoencoder (Sidhom et al., 2021) . Recently, two novel methods were proposed that additionally include transcriptomic information by Bayesian Clustering (Zhang et al., 2021) or neighborgraph analysis (Schattgen et al., 2020) . However, none of the existing methods have addressed learning a joint representation guided by both functional information from TCR sequences and transcriptional profiling of cells to analyze such paired data. To address this, we propose mvTCR ( https://github.com/SchubertLab/mvTCR) to learn a joint embedding of T cells by using both TCR sequence and the scRNA-seq data. Our model leverages advances is sequence learning using Transformers (Vaswani et al., 2017) combined with generative multi-view learning (Kingma & Welling, 2013; Lee & van der Schaar, 2021) providing a holistic view to analyze T cells. The model is evaluated on two datasets demonstrating the clustering of cells with homogeneous epitope specificity and transcriptional similarity. The VAE model embeds the transcriptomic data X RN A and the amino acid sequences of the TCR X T CR by its encoding networks E RN A and E T CR to the features H RN A and H T CR of size d f eat , respectively. The extracted feature embeddings are subsequently fused by the mixture model M to obtain a joint latent distribution q(Z joint |X RN A , X T CR ). For downstream tasks, the Gaussian mean of this distri- bution is used to obtain the T cell embedding. We tested two mixture model schemes: concatenation and Product-of-Expert (PoE) (Lee & van der Schaar, 2021) . M 's output feeds into the modality-specific decoders D RN A and D T CR reconstructing the corresponding input data ( Figure 1 ). TCR: The sequences of αand β-chains in X T CR are one-hot-encoded, end-padded to the maximum sequence length, and processed by individual encoder-and decoderarchitectures. Since the two chains follow a similar structure but contribute differently, both networks share the same architecture but not their weights. Due to their great ability to encode sequences, the Transformer model is used for the encoders E T RA and E T RB (Vaswani et al., 2017) , followed by a linear layer to reduce the output size to d f eat /2. The features from both chains are concatenated to form H T CR . The decoders D T RA and D T RB adhere to a reverse structure. First, H joint is upsampled by a linear layer, which is followed by the decoding Transformer. Finally, a softmax layer is used to predict the residue at each sequence position. RNA: Following (Lotfollahi et al., 2019; 2020b) , X RN A is passed through the encoder E RN A built from multiple blocks each consisting of a fully-connected layer, batch normalization, Leaky ReLU activation, and a dropout layer. The last linear layer transforms the output to the same dimensionality d f eat as the concatenated TCR features to avoid domination of one modality over the other. D RN A follows the reverse architecture of E RN A . The Concatenation Module concatenates H T CR and H RN A and passes the joint feature vector through a multilayer perceptron E joint to estimate the joint posterior q(Z joint |H RN A , H T CR ), from which the joint embedding Z joint is sampled via the reparameterization trick (Kingma & Welling, 2013) (Figure 1b) . Finally, Z joint is passed into the shared decoder D joint and its output is fed into the decoders of both modalities D RN A and D T CR . The PoE Module uses separate encoders E RN A 2 and E T CR for each modality to estimate individual marginal latent distribution q(Z RN A |H RN A ) and q(Z T CR |H T CR ) ( Figure 1c ). Here, Z joint is sampled via the reparameterization trick from the closed-form solution of the product of both distributions where N represents the number of modalities and p(Z) a zero-mean, univariant Gaussian prior. After sampling, Z joint is passed to both encoders E T CR and E RN A 2 in addition to the corresponding marginal distributions Z RN A and Z T CR . Thereby, the model is trained to preserve information specific to each modality individually, while simultaneously capturing features shared across them. All encoders and decoders of the mixture module were built from fully-connected blocks similar to the RNA module. The loss function consists of three weighted parts. The model is optimized towards reconstructing its input data by using the Mean Squared Error loss for X RN A and the Cross-Entropy loss for X T CR . Additionally, the latent distributions q(Z|X RN A , X T CR ) or q(Z 1 |X RN A ), q(Z 2 |X T CR ), and q(Z 3 |X RN A , X T CR ) are regularized by the Kullback-Leibler divergence to resemble a zero-mean, univariant Normal distribution. The hyperparameters of the network were selected with Optuna (Akiba et al., 2019) to either optimize the reconstruction loss or a dataset-specific downstream task. For task-specific training, the loss weights were included in this search, while for reconstruction they were fixed such that the losses from both modalities have the same magnitude. To avoid overfitting, cells from rare clonotypes (i.e., cells with the same TCR sequence) were oversampled. We compared our models in various downstream tasks against unimodal baseline models trained on either RNA or TCR information. These baseline models followed the same overall architecture but were trained using only a single modality. First, we analyzed the dataset provided by 10x Genomics consisting of paired single-cell TCR, transcriptome, and epitope specificity information of over 150,000 CD8 + T cells from four donors (10x Genomics, 2019) . To avoid batch effects, the models were trained for donor 1 and 2 separately. Due to limited diversity in epitope specificity, we excluded donor 3 and 4 in our analysis. For model training and downstream tasks, cells with unknown or rare epitope specificity were removed. Second, our models were evaluated on the dataset described in (Fischer et al., 2020) , which shows T cell reactivity towards a mix of SARS-CoV-2 epitopes (Gene Expression Omnibus, GSE171037). The cells were split into training, validation, and test set of approximately 70%, 15%, and 15% for the 10x Genomics dataset and 80%, 10%, 10% for the smaller SARS-CoV-2 dataset, respectively. To avoid data leakage, cells from the same clonotype were assigned to the same subset. To investigate whether cell states and functionality are better captured in the joint than in unimodal embeddings, we investigated the latent spaces of donor 1 of the 10X Genomics dataset (Figure 2 , donor 2: Suppl. Figure 1 , SARS-CoV-2: Suppl. Figure 2) . The embedding of the model trained on TCR data was dominated by multiple smaller clusters representing similar or identical clonotypes. However, it failed to accumulate T cells with the same epitope specificity but expressing dissimilar sequences. The RNA-only model formed larger clusters of cells with the same specificity. However, clear separation of subclusters could be observed in T cells specific to the Cytomegalovirus epitope KLGGALQAK (Figure 2, red) . The joint models embedded T cells with identical cognate epitopes into less fragmented and clearer separated clusters. These findings suggest that the joint models learned to combine complementary information from both modalities, which represent cellular functionality better than separate embedding. This was also reflected in quantitative evaluations measured by the Average Silhouette Width (ASW) and the Normalized Mutual Information (NMI). ASW measures intra-cluster cohesion compared to inter-cluster separation, where higher values indicate clearer clustering. NMI is an external metric, which evaluates how much information is shared between the assigned cluster and ground-truth labels. In particular, NMI is the harmonic mean between homogeneity and completeness, thereby measuring the pureness in clusters as well as the fragmentation of labels between clusters. During analysis, Leiden clustering (Traag et al., 2019) was used on the embeddings reporting the maximum metric resulting from three different clustering resolutions (0.01, 0.1, 1.0; Table 1 ). For the 10x Genomics dataset, epitope specificity was used as an external label to calculate the NMI, while for the SARS-CoV-2 dataset, scores were reported for cell type (CD4 + or CD8 + T cells) and reactivity toward a SARS-CoV-2 Spike epitope mix (CD4 + reactive, CD8 + reactive and unreactive T cells). For each dataset, at least one joint model outperformed the single modality models for each metric. This supports the qualitative findings that joint modeling improves clustering of T cells with higher intra-class coherence and inter-class separation, while representing specificity as well as functionality better. Additionally, T cell embeddings should preserve biological signals such as cell state beyond epitope specificity and cell type. To test this, we performed differential expression analysis (excluding TRA/B genes) to determine the most characteristic marker gene of each Leiden cluster (donor 1: Figure 2 and Suppl. Figure 3 , donor 2: Suppl. Figure 1 , SARS-CoV-2: Suppl. Figure 2) . Here, we showcase how two selected marker genes (IL7R, GZMH) are distributed in the embedding spaces ( Figure 2 ). Note, that these observations also hold true for the top characteristic genes of the remaining clusters as well. The selected genes are expressed in different stages of T cell development. The IL7R (Figure 2 e-f) encodes the IL7 receptor, which interacts with IL17 during T cell development, in particular during V(D)J recombination to form the final TCR-sequence (Schlissel et al., 2000) . High expression of this gene, therefore, indicates a T cell during proliferation. GZMH (Figure 2 i-l) gives rise to a cytotoxic protease that induces cell death to infected host cells (Fellows et al., 2007) and is therefore expressed in the effector development stages. Gene expression patterns were clearly preserved in both joint and the RNAonly models. While in the TCR-only embedding, marker genes were arbitrarily distributed in multiple clusters. The joint model seemed also to cluster T cells hierarchically first by epitope specificity and secondly on gene-level creating subpopulations of cells with similar cell states, thus providing a more informed representation when studying human T cell repertoires. Generally, specificity is mainly attributed to the TCR. However, transcriptome information contains TRA and TRB genes, which determine the set of possible sequences during TCR development. Additionally, previous works (Schattgen et al., 2020) revealed that T cells with a common ances- tor express a similar transcriptional profile. Therefore, the antigen-specificity can be inferred from the gene expression data as well. We used this fact to evaluate whether a joint embedding of both modalities can improve TCR-specificity imputation over unimodal models, thereby substantiating the improved expression of the joint latent space. The ability to perform imputation of missing specificity in the embedding space was evaluated by using the cells from the training data as an atlas set and predicting the binding epitopes with a k-Nearest Neighbor (kNN) classifier for the test set. This is especially challenging since the data split prevents common clonotypes between atlas and query set. The joint models perform on-par with the RNA model on donor 1 of the 10x Genomics dataset, while they outperform both unimodal models for donor 2 (Table 2) . Interestingly, TCR-specificity could be better inferred by gene expression than by TCR sequence in donor 1 and vice versa in donor 2. The joint model however performed on-par or better than the unimodal models in both donors, indicating that our model was able to represent relevant aspects of T cell functionality by capturing and combining complementary information from both modalities in an intelligent manner. In this work, we proposed mvTCR, a VAE to embed T cells based on their transcriptomics and TCR-sequence data jointly. Extensive experiments show mvTCR can synergistically combine this complementary information to enrich the latent representation of single-cell data and captures patterns that would have been otherwise missed when interpreting each level independently. Our model achieves better T cell clustering based on functionality and cell state and reveals subpopulations of different cell states within the epitope specific clustering. We further showed that a joint embedding improves the imputation of unknown epitope specificity. We envision future works will make such representations transferable to new studies (Lotfollahi et al., 2020a) and also improve the interpretability of the latent space (Rybakov et al., 2020) . Overall, mvTCR offers a more holistic view of T cell repertoires at single-cell resolution and therefore has the potential of leading to profound new insights into T-cell biology. The 10x Genomics datasets can be downloaded from https://support.10xgenomics.com/single-cell-vdj/datasets under the section Application Note -A New Way of Exploring Immunity. The SARS-CoV-2 dataset is available at GEO GSE171037. The source code to recreate the results including training, evaluation, and tutorial scripts is accessible at https://github.com/SchubertLab/mvTCR. The reported models can be downloaded from 10.5281/zenodo.5006839. A new way of exploring immunitylinking highly multiplexed antigen recognition to immune repertoire and phenotype Optuna: A next-generation hyperparameter optimization framework Quantifiable predictive features define epitope-specific t cell receptor repertoires Natural killer cell-derived human granzyme h induces an alternative, caspase-independent cell-death program Single-cell rna sequencing reveals in vivo signatures of sars-cov-2-reactive t cells through Identifying specificity groups in the t cell receptor repertoire Auto-encoding variational bayes A variational information bottleneck approach to multi-omics data integration scgen predicts single-cell perturbation responses Query to reference single-cell integration with transfer learning. bioRxiv Conditional out-of-distribution generation for unpaired data using transfer vae Uniform manifold approximation and projection for dimension reduction Learning interpretable latent autoencoder representations with annotations of feature sets. bioRxiv Linking t cell receptor sequence to transcriptional profiles with clonotype neighbor graph analysis (conga) The interleukin 7 receptor is required for t cell receptor γ locus accessibility to the v(d)j recombinase Deeptcr is a deep learning framework for revealing sequence concepts within t-cell repertoires From louvain to leiden: guaranteeing well-connected communities Attention is all you need Clonal replacement of tumorspecific t cells following pd-1 blockade Estimating the diversity, completeness, and cross-reactivity of the t cell repertoire Mapping the functional landscape of t cell receptor repertoires by single-t cell transcriptomics