key: cord-0696141-4dbuc91c authors: Kaden, Marika; Bohnsack, Katrin Sophie; Weber, Mirko; Kudła, Mateusz; Gutowska, Kaja; Blazewicz, Jacek; Villmann, Thomas title: Analysis of SARS-CoV-2 RNA-Sequences by Interpretable Machine Learning Models date: 2020-05-15 journal: bioRxiv DOI: 10.1101/2020.05.15.097741 sha: 7fa2ace8a29fdb60340905bf4ceb038f5289dddf doc_id: 696141 cord_uid: 4dbuc91c

We present an approach to investigate SARS-CoV-2 virus sequences based on alignment-free methods for RNA sequence comparison. In particular, we verify a given clustering result for the GISAID data set, which was obtained by analyzing the molecular differences in coronavirus populations via phylogenetic trees. For this purpose, we use alignment-free dissimilarity measures for sequences and combine them with learning vector quantization classifiers for virus type discriminant analysis and classification. These vector quantizers belong to the class of interpretable machine learning methods, which, on the one hand, provide additional knowledge about the classification decisions, such as discriminant feature correlations, and, on the other hand, can be equipped with a reject option. This option gives the model the property of self-controlled evidence when applied to new data, i.e. the model refuses to make a classification decision if the model evidence for the presented data is not given. After training such a classifier on the GISAID data set, we apply the obtained classifier model to another, unlabeled SARS-CoV-2 virus data set. On the one hand, this allows us to assign new sequences to already known virus types; on the other hand, the rejected sequences allow speculation about new virus types with respect to nucleotide base mutations in the viral sequences.

Author summary The currently emerging global disease COVID-19, caused by the novel SARS-CoV-2 virus, requires all scientific effort to investigate the development of the viral epidemic, the properties of the virus, and its types. Investigations of the virus sequence are of special interest. Frequently, these are based on mathematical/statistical analysis. However, machine learning methods represent a promising alternative if one focuses on interpretable models, i.e. those that do not act as black boxes. To this end, we apply variants of Learning Vector Quantizers to analyze the SARS-CoV-2 sequences. We encode the sequences and compare them in their numerical representations to avoid the computationally costly comparison based on sequence alignments. Our resulting model is interpretable, robust, efficient, and has a self-controlling mechanism regarding its applicability to data. This framework was applied to two data sets concerning SARS-CoV-2. We were able to verify previously published virus type findings for one of the data sets by training our model to accurately identify the virus type of the sequences. For sequences without virus type information (the second data set), our trained model can predict their types. Thereby, we observe a new scattered spreading of the sequences in the data space, which is probably caused by mutations in the viral sequences.

The coronavirus disease 2019 (COVID-19), caused by SARS-CoV-2 viruses, whose origin probably lies in Wuhan (China), is a severe respiratory disease [1]. Currently (May 2020), it is spreading rapidly all over the world [2]. Yet there are several indicators that the molecular characteristics evolve over time [3, 4].
This evolution is mainly driven by mutations, which play an essential role and may be accompanied by mechanisms of stabilization [5, 6]. The analysis of the genomic structure by sequencing is currently a topic of ongoing research to better understand the molecular dynamics [7]. Obviously, changes in the genomic structure may cause new properties and, hence, could increase the difficulty of finding drugs for treatment. For example, changes may lead to behavioral changes, such as the increased binding of the SARS-CoV-2 surface glycoprotein to human ACE2 receptors [8].

Viruses of the family Coronaviridae possess a single-stranded, positive-sense RNA genome ranging from 26 to 32 kilobases in length and frequently are extremely similar [9]. Therefore, the analysis of those sequences to understand the genetic evolution in time and space is very difficult. This problem is magnified by incorrect or inaccurate sequencing [10]. Further, mutations are not equally distributed across the […] in [36, 37].

In the present publication, we investigate whether alignment-free dissimilarities are suitable for the identification of SARS-CoV-2 clusters/classes in combination with interpretable machine learning methods for clustering and classification [38, 39]. We do this for two data sets: the GISAID-data and the NCBI-data. For the first one, virus classes (types) were identified by phylogenetic tree analysis in [12], whereas the second one comes without class information. Although deep neural network approaches provide impressive results in sequence classification [16, 40-42], deep architectures are, at the least, difficult to interpret [43]. Therefore, we focus on applying prototype-based methods using alignment-free dissimilarity measures for sequence comparison. In fact, prototype-based machine learning models for data classification and representation are known to be interpretable and robust [44-46]. Using such methods for the SARS-CoV-2 sequence data, we first verify the classification results for the GISAID-data. In particular, we classify the sequences by a learning vector quantizer, which is proven to be robust and interpretable [45, 47]. Thereafter, we use this model to classify the new data from the NCBI. Moreover, this interpretable classifier provides correlation information regarding the data features contributing to the class discrimination. This additional knowledge allows a further characterization of the virus classes.

Materials and methods

In order to investigate SARS-CoV-2 viruses in terms of sub-type spreading, two virus sequence data sets were considered.

The GISAID Dataset D_G

The first one, abbreviated by D_G, is from the GISAID coronavirus repository (GISAID - Global Initiative on Sharing Avian Influenza Data). As of March 4th, 2020, it consists of 254 coronavirus genomes, isolated from 244 humans, nine Chinese pangolins, and one bat Rhinolophus affinis. After preprocessing, 160 complete human sequences are obtained as described in [12], where these genomes of SARS-CoV-2 have been used to create a phylogenetic network. The resulting network analysis distinguished three types (clusters) of the virus, A, B, and C: A is most similar to the bat virus, whereas B sequences are obtained from A by two mutations: the synonymous mutation T8782C and the non-synonymous mutation C28144T, changing a leucine to a serine.
A further non-synonymous mutation, G26144T, changing a glycine to a valine, leads from B to type C. In this sense, the classes (virus types) implicitly code the evolution of the virus in time.

In our data analysis, we removed two sequences whose accession numbers occur twice in the data record, and another two, which we identified as not human, resulting in 156 final sequences. Additionally, we take the type/class information as the label for the virus genome sequences and, hence, as the reference. A detailed data description as well as a complete list of sequences can be found in [12]. The virus type assignments and additional data (country, collection date), as well as the accession numbers for all 156 sequences in use, are additionally provided in the supplementary material. The complete data information can be found in the supplementary files S12 Data.

The NCBI Dataset D_N

The second data set, comprising 892 complete genomes, was selected from the National Center for Biotechnology Information (NCBI) Viral Genome database [48] and GenBank [49] by April 19th, 2020, see Tab. 1. These are human-based sequences and additionally provide the country information from which the sequences originate, as well as their collection date. For each sequence we also derived a more general assignment to regions based on the country information, with the following values: USA, China, Europe, and Others. The accession numbers and the additional data used in the analysis are included in the supplementary material. We refer to this data set by D_N. Remark: although the SARS-CoV-2 virus is an RNA virus, the sequences provided by the databases are given in DNA-coding. In the following, we take over this convention and do not explicitly refer to it later.

Table 1. Distribution of the NCBI-data D_N regarding regions and month of collection date.

            China   Europe   USA   Others
  Dec '19      16        0     0        0
  Jan '20      44        4    16        9
  Feb '20       2        6    44        7
  Mar '20       1       23   706       10
  Apr '20       0        0     4        0

Again, the complete data information can be found in the supplementary files S12 Data.

Representation of RNA Sequences for Alignment-free Data Analysis

Several approaches have been published to represent sequences adequately for alignment-free comparison. These methods range from chaos game representation to standard unary coding or matrix representations. An overview is given in [27] and [36, 37]. Here we focus only on two of the most promising approaches - Natural Vectors and Bag-of-Words.

Natural Vectors

For a sequence s over the nucleotide alphabet A = {A, C, G, T}, let n_L denote the number of occurrences of the letter L ∈ A and p_L the positions at which L occurs, with mean position $\mu_L^1 = \mathbb{E}[p_L]$. The k-th central moment for k ≥ 2 is given as

$$\mu_L^k = \mathbb{E}\left[\left(p_L - \mu_L^1\right)^k\right].$$

Then, the natural vector of order K for a sequence s is given as

$$\mathrm{NV}_K(s) = \left(n_L, \mu_L^1, \mu_L^2, \ldots, \mu_L^K\right)_{L \in A},$$

whereby we again drop the dependencies on K and s if it is not misleading. Natural vectors are usually compared in terms of the $l_p$-metric, giving the Euclidean distance for p = 2. Kendall statistics, as a kind of correlation measure, was applied in [50]. The NV-description of sequences can also be applied to nucleotide sequences containing ambiguous characters (degenerate bases) [35, 51]. This yields an extended alphabet Ā = A ∪ E. In that case, weights 0 ≤ w_L(s_i) ≤ 1 are introduced for each L ∈ A with $w_L(s_i) = p_{L,s_i}$, where $p_{L,s_i}$ is the probability that the detected ambiguous character s_i ∈ E should be the character L. These weights have to be taken into account during the expectation value calculations [35].

Bag-of-Words

For the Bag-of-Words (BoW) coding, a sequence is represented by a histogram vector counting the nucleotide triplets occurring in it; after normalization, such a histogram can be interpreted as a discrete density. Obviously, comparison of those histogram vectors can be done using the usual Euclidean distance.
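To make the two encodings concrete, the following minimal Python sketch (our own illustration, not the authors' code) computes a second-order natural vector and a normalized triplet histogram for a toy sequence; the handling of ambiguous bases is omitted and the use of overlapping triplet windows is a simplifying assumption.

```python
import numpy as np
from itertools import product

NUCLEOTIDES = "ACGT"

def natural_vector(seq, K=2):
    """Natural vector of order K: for each nucleotide L, the count n_L,
    the mean position mu_L^1 and the central moments mu_L^k up to order K."""
    vec = []
    for L in NUCLEOTIDES:
        pos = np.array([i + 1 for i, c in enumerate(seq) if c == L], dtype=float)
        mu = pos.mean() if len(pos) else 0.0
        vec += [len(pos), mu]
        vec += [((pos - mu) ** k).mean() if len(pos) else 0.0
                for k in range(2, K + 1)]
    return np.array(vec)

def bow_histogram(seq):
    """Normalized Bag-of-Words histogram over the 64 unambiguous nucleotide
    triplets, counted in overlapping windows."""
    index = {"".join(t): i for i, t in enumerate(product(NUCLEOTIDES, repeat=3))}
    h = np.zeros(len(index))
    for i in range(len(seq) - 2):
        if seq[i:i + 3] in index:      # skip windows with ambiguous characters
            h[index[seq[i:i + 3]]] += 1
    return h / max(h.sum(), 1.0)       # normalize to a discrete density

seq = "ATGGCGTACGTTAGCATGCA"
print(natural_vector(seq).shape)       # (12,) for K = 2: (n_L, mu_L, mu_L^2) per base
print(bow_histogram(seq).sum())        # 1.0
```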
However, motivated by the aforementioned density property, an alternative choice is to compare by means of divergence measures [52]. In the investigations presented later, we applied the Kullback-Leibler-divergence [53]

$$D_{KL}(h, m) = \sum_j h_j \cdot \log\left(\frac{h_j}{m_j}\right) = \sum_j h_j \cdot \log(h_j) - \sum_j h_j \cdot \log(m_j) \quad (3)$$

for sequence histograms h and m. Note that the first term in (3) is the negative Shannon entropy of h, whereas the second term $-\sum_j h_j \cdot \log(m_j)$ is the Shannon cross-entropy. Yet, other divergences like Rényi-divergences could be used [54]; we refer to [55] for a general overview regarding divergences in the context of machine learning. The assignment of the nucleotide triplets to the histogram dimensions can be found in the supplementary material S13 Histogram Coding of Nucleotide Triplets.

Median Neural Gas for Data Compression

The Median Neural Gas algorithm (MNG) is a neural data quantization algorithm for data compression based on dissimilarities [56, 57]. It is a stable variant of the k-median centroid method, improved by neighborhood-cooperative learning, where k is the predefined number of representatives [58, 59]. In this context, median approaches only assume a dissimilarity matrix for the data and restrict the centroids to be data points. Thus, after training, MNG provides k data points to serve as representatives of the data. Thereby, the data space is implicitly sampled according to the underlying data density, as a consequence of the so-called magnification property of neural gas quantizers [60, 61]. It should be emphasized that, despite the weak assumption of a given dissimilarity matrix, MNG always delivers exact data objects as representatives. Hence, any averaging for prototype generation, as in standard vector quantizers, is avoided here. This is essential if averaged data objects are meaningless, as for texts, music data, or RNA/DNA sequences, for example.

Affinity Propagation for Clustering with Cluster Number Control

Affinity Propagation (AP), introduced by Frey & Dueck in [62], is an iterative cluster algorithm based on message passing, where the current cluster nodes, denoted in the AP setting as prototypes or exemplars, interact by exchanging real-valued messages. Contrary to methods like c-means or neural maps, where the number c of prototypes has to be chosen beforehand, AP starts by assuming that all N data points are potential exemplars and iteratively reduces the number of valid prototypes (cluster centroids). More precisely, AP realizes an exemplar-dependent probability model, where the given similarities ς(i, k) between data points x_i and x_k (potential exemplars) are identified as log-likelihoods of the probability that the data points assume each other as prototypes. For example, the similarities ς(i, k) could simply be negative dissimilarities, like the negative Euclidean distance. The cost function C_AP(I) minimized by AP is given by

$$C_{AP}(I) = -\sum_{i=1}^{N} \varsigma(i, I(i)) + \sum_{k=1}^{N} \delta_k(I) \quad (4)$$

where $I: \mathbb{N} \to \mathbb{N}$ is the mapping function determining the prototype for each data point in (4), and $\delta_k(I)$ is a penalty function enforcing the validity constraint that if some point i selects k as its exemplar, then k has to be its own exemplar. During the optimization, two kinds of messages are iteratively exchanged between the data points until convergence: the responsibilities r(i, k) and the availabilities a(i, k). The responsibilities r(i, k) reflect the accumulated evidence that point k should serve as the prototype for data point i. The availabilities a(i, k) describe the accumulated evidence for how appropriate data point k is seen as a potential prototype by the points i.
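For illustration, here is a compact numpy sketch of these message updates (our own implementation of the standard Frey & Dueck update rules with damping; the damping factor and iteration count are assumptions, and the exemplar read-out anticipates the rule given in the next paragraph):

```python
import numpy as np

def affinity_propagation(S, damping=0.9, iters=200):
    """Affinity Propagation message passing; S is an (N, N) similarity
    matrix whose diagonal carries the preferences (self-similarities)."""
    N = S.shape[0]
    R = np.zeros((N, N))                       # responsibilities r(i, k)
    A = np.zeros((N, N))                       # availabilities a(i, k)
    rows = np.arange(N)
    for _ in range(iters):
        # r(i,k) <- s(i,k) - max_{k' != k} [a(i,k') + s(i,k')]
        M = A + S
        idx = M.argmax(axis=1)
        first = M[rows, idx]
        M[rows, idx] = -np.inf
        second = M.max(axis=1)
        R_new = S - first[:, None]
        R_new[rows, idx] = S[rows, idx] - second
        R = damping * R + (1.0 - damping) * R_new
        # a(i,k) <- min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
        Rp = np.maximum(R, 0.0)
        np.fill_diagonal(Rp, R.diagonal())     # keep r(k,k) unclipped
        A_new = Rp.sum(axis=0)[None, :] - Rp
        diag = A_new.diagonal().copy()         # a(k,k) has no min(0, .) clip
        A_new = np.minimum(A_new, 0.0)
        np.fill_diagonal(A_new, diag)
        A = damping * A + (1.0 - damping) * A_new
    return (A + R).argmax(axis=1)              # exemplar choice per data point

# toy usage: similarities as negative squared distances, median preference
X = np.random.default_rng(0).normal(size=(30, 2))
S = -((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
np.fill_diagonal(S, np.median(S[~np.eye(len(X), dtype=bool)]))
print(affinity_propagation(S))
```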
Finally, the prototypes (exemplars) are determined by

$$I(i) = \arg\max_k \left( a(i,k) + r(i,k) \right).$$

Hence, a(i, k) and r(i, k) can be taken as log-probability ratios [62]. The iterative alternating calculation of a(i, k) and r(i, k) derives from the max-sum algorithm applied to factor graphs [63], which can further be related to spectral clustering [64]. The number of resulting clusters is implicitly determined by the self-similarities ς(k, k), also denoted as preferences: the larger the self-similarities, the finer the granularity of the clustering [62]. Common choices are the median or the minimum of the similarities between all inputs. Otherwise, the self-similarities can be seen as a control parameter […]

Generalized Learning Vector Quantization

[…] introduced by T. Kohonen [65]. A cost-function-based variant is known as generalized LVQ (GLVQ) [66]; its cost function approximates the classification error [67]. In particular, an LVQ classifier requires training data

$$X = \{(x_i, c(x_i)) \in \mathbb{R}^n \times \mathcal{C},\; i = 1, \ldots, N\},$$

where $\mathcal{C} = \{1, \ldots, C\}$ is the set of available class labels. Further, the model assumes a set of prototypes $W = \{w_k \in \mathbb{R}^n,\; k = 1, \ldots, M\}$ with class labels $c(w_k)$, such that at least one prototype is assigned to each class; hence, we have a partitioning of the prototype set according to the classes. A dissimilarity measure d(x, w) is supposed, which has to be differentiable with respect to the second argument. For a given LVQ configuration, a new data point x is assigned to a class by the mapping

$$x \mapsto c(w_\omega), \quad (5)$$

where

$$\omega = \arg\min_k d(x, w_k) \quad (6)$$

is known as the winner-takes-all rule (WTA) in prototype-based vector quantization. The prototype $w_\omega$ is denoted as the winner of the competition. During learning, the cost-based LVQ minimizes the expected classification error $E_X[E(x, W)]$, where

$$E(x, W) = f(\mu(x, W)) \quad (7)$$

is the local classification error, depending on the choice of the monotonically increasing function f, and the classifier function is

$$\mu(x, W) = \frac{d^+(x) - d^-(x)}{d^+(x) + d^-(x)}, \quad (8)$$

with $d^+(x) = d(x, w^+)$ for the so-called best matching correct prototype $w^+$ (the nearest prototype of the class of x) and $d^-(x) = d(x, w^-)$ for the corresponding best matching incorrect prototype $w^-$. Frequently, the squashing function f is chosen as the sigmoid $f_\theta(z) = 1/(1 + \exp(-z/\theta))$. Learning takes place as stochastic gradient descent learning (SGDL) [68, 69] of $E_X[E(x_k, W)]$ with respect to the prototype set W to obtain an optimum prototype configuration in the data space.

The dissimilarity d(x, w) can be chosen arbitrarily, supposing differentiability with respect to w to ensure SGDL. Frequently, the squared Euclidean distance $d_E(x, w) = \|x - w\|^2$ is applied. If both x and w are assumed to be discrete representations of density functions, divergences like the Kullback-Leibler-divergence $D_{KL}(x, w)$ from (3) come into play instead [70]. The resulting LVQ variant is denoted as divergence-based GLVQ (GDLVQ). Another popular choice is the squared Euclidean mapping distance

$$d_\Omega(x, w) = \left\|\Omega(x - w)\right\|^2 \quad (9)$$

proposed in [71], with the mapping matrix Ω ∈ ℝ^{m×n} and m being the projection dimension, usually chosen as m ≤ n [72]. Here, the data are first mapped linearly by the mapping matrix, and then the Euclidean distance is calculated in the mapping space ℝ^m. The mapping matrix can again be optimized by SGDL to achieve a good separation of the classes in the mapping space. The respective algorithm is known as Generalized Matrix LVQ (GMLVQ) [73]. Note that SGDL for Ω-optimization usually requires a careful regularization technique [74]. After training, the adapted projection matrix Ω provides additional information: the resulting matrix Λ = ΩᵀΩ ∈ ℝ^{n×n} allows an interpretation as a classification correlation matrix, i.e. the matrix entries Λ_ij give the correlations between data features i and j which contribute to the class discrimination [39, 75]. A non-linear mapping could be realized by applying kernel distances instead [76, 77].
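To make the GMLVQ adaptation concrete, here is a minimal numpy sketch of a single SGDL step under the mapping distance (9) and the sigmoid squashing function; the learning rates, the simple normalization of Ω, and all names are our own illustrative choices, not the authors' implementation.

```python
import numpy as np

def gmlvq_step(x, y, W, wc, Omega, lr_w=0.01, lr_o=0.001, theta=1.0):
    """One SGDL step of GMLVQ for a single sample (x, y).
    W: (M, n) prototype matrix, wc: (M,) prototype class labels,
    Omega: (m, n) mapping matrix of the distance d_Omega from (9)."""
    Lam = Omega.T @ Omega                                  # Lambda = Omega^T Omega
    diff = x - W
    d = np.einsum('ij,jk,ik->i', diff, Lam, diff)          # d_Omega(x, w_k), all k
    correct = wc == y
    jp = np.where(correct)[0][np.argmin(d[correct])]       # best matching correct
    jm = np.where(~correct)[0][np.argmin(d[~correct])]     # best matching incorrect
    dp, dm = d[jp], d[jm]
    mu = (dp - dm) / (dp + dm)                             # classifier function (8)
    s = 1.0 / (1.0 + np.exp(-mu / theta))                  # sigmoid f
    f_prime = s * (1.0 - s) / theta                        # its derivative
    gp = f_prime * 2.0 * dm / (dp + dm) ** 2               # dE/dd+
    gm = -f_prime * 2.0 * dp / (dp + dm) ** 2              # dE/dd-
    ep, em = x - W[jp], x - W[jm]
    # prototype updates (dd/dw = -2 Lam (x - w)): attract w+, repel w-
    W[jp] += lr_w * gp * 2.0 * Lam @ ep
    W[jm] += lr_w * gm * 2.0 * Lam @ em
    # mapping matrix update (dd/dOmega = 2 Omega (x - w)(x - w)^T)
    Omega -= lr_o * 2.0 * Omega @ (gp * np.outer(ep, ep) + gm * np.outer(em, em))
    Omega /= np.linalg.norm(Omega)                         # simple normalization
    return W, Omega
```

Iterating gmlvq_step over shuffled training samples and epochs yields the prototype configuration W and the mapping Ω; Λ = ΩᵀΩ is then read off as the classification correlation matrix.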
A trained LVQ model can be applied to newly incoming data of unknown distribution. However, care must be taken to ensure that the model remains applicable and that there is no inconsistency with the new data. Therefore, each LVQ can be equipped with a reject option for the application phase [78, 79]: if the dissimilarity of the best matching prototype to a data point is greater than a given threshold τ, the data point is rejected for classification. This optional tool equips the LVQ with a so-called self-controlled evidence (SCE) [45]. The threshold τ is determined during model training for each prototype individually, e.g. as the 95%-percentile of the dissimilarity values of those data which are assigned to the considered prototype by the WTA rule (6) together with the class assignment (5).

Stochastic Neighbor Embedding for Visualization

The method of Stochastic Neighbor Embedding (SNE) was developed to visualize high-dimensional data in a typically two-dimensional visualization space [80]. For this purpose, each data point x_k in the data space is associated with a visualization vector v_k ∈ ℝ². The objective of the embedding algorithm is to distribute the visualization vectors in such a way that the density of the distances in the high-dimensional data space is preserved as well as possible by the respective density of the distances in the visualization (embedding) space. The quality criterion is the Kullback-Leibler-divergence between them, which is minimized by SGDL with respect to the visualization vectors v_k. Yet, SNE suffers from the fact that the distance densities in the original data space are frequently heavy-tailed [81], which leads to inaccurate visualizations. To overcome this problem, the so-called t-distributed SNE (t-SNE) was developed [82].

Data Processing Workflow

In the following we describe and motivate the steps of data processing and analysis.

[…]

- For all LVQ variants we take only one prototype per class. For GMLVQ, the projection matrix is chosen as Ω ∈ ℝ^{2×n}, i.e. the mapping dimension is m = 2.
- SGDL training with 10-fold cross-validation to determine the best LVQ architecture for the given problem:
  - Training of W using GLVQ for the NV representation.
    * GDLVQ is not applicable for this sequence representation due to mathematical reasons.
  - Training of W using GLVQ for the BoW representation.
  - Training of W and Ω using GMLVQ for the BoW representation.
- Final training of the best LVQ architecture with the optimum training schedule to achieve the best prototype configuration W:
  - If the GMLVQ architecture is selected for final training: training of both W and Ω, and determination of the classification correlation matrix Λ = ΩᵀΩ.
  - Determination of the reject thresholds for each prototype for self-controlled evidence use, based on the 95%-percentile rule (a minimal sketch of this rule follows Tab. 2 below).
- Evaluation of the data rejected by the SCE rule.

Results

[…]

According to the processing workflow, we trained several LVQ classifier variants for the D_G-data. By 10-fold cross-validation, we achieved the averaged accuracies depicted in Tab. 2, together with their respective standard deviations. According to these results, GMLVQ performs best using the BoW-coding of the sequences together with the Euclidean mapping distance d_Ω(x, w) from (9). Thus, we finally trained a GMLVQ model in this configuration.

Table 2. Classification results of trained LVQ-variants for the D_G-dataset obtained by 10-fold cross-validation.
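As a minimal sketch of the SCE reject rule from the workflow above (the per-prototype 95%-percentile thresholding; the function names and the generic dist argument are hypothetical):

```python
import numpy as np

def sce_thresholds(X, winners, W, dist, percentile=95.0):
    """Per-prototype reject thresholds: the 95%-percentile of the
    dissimilarities of the training data assigned to each prototype."""
    tau = np.full(len(W), np.inf)
    for k in range(len(W)):
        assigned = X[winners == k]
        if len(assigned):
            tau[k] = np.percentile([dist(x, W[k]) for x in assigned], percentile)
    return tau

def classify_with_reject(x, W, wc, tau, dist):
    """WTA classification (5)/(6) with the SCE reject option: return the
    winner's class label, or None if the model evidence is not given."""
    d = np.array([dist(x, w) for w in W])
    omega = d.argmin()
    return None if d[omega] > tau[omega] else wc[omega]
```

Here dist would be the dissimilarity used during training, e.g. dist = lambda x, w: np.sum((Omega @ (x - w)) ** 2) for a trained GMLVQ mapping Ω.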
Applying the trained GMLVQ classifier to the D_NB-data set leads to the classification of 37 data points to class A, 95 data points to class B, and 2 data points to class C. […]

The classification analysis of the D_G-data by means of the machine learning model GMLVQ verifies the class determination suggested in [12]. Only 7 data samples are not classified accordingly, due to the model's self-controlled evidence decision. Thereby, the GMLVQ model shows a stable performance in learning (see Tab. 2), which underlines its well-known robustness [47]. Thus, we observe an overall precise agreement supporting the findings in [12]. This agreement, however, is obtained by alignment-free sequence comparisons. More […] smallest deviations in the histograms. Yet, we cannot expect greater deviations, because the sequences differ only in a few characters according to the specific mutations [11, 12]. The AP-centroids differ slightly more than the GMLVQ prototypes, see S6 Fig. This can be attributed to the larger overall scattering of the D_NB-data.

Further, the GMLVQ-prototypes serve as class 'detectors': if encoded sequences are most similar to them with respect to the mapping distance, the sequences are assigned to the respective classes according to the WTA rule (6). However, in general the prototypes are not identical with the mean vectors of the class distributions, as emphasized in [83]. […] It could be that both t-SNE and AP implicitly reflect data densities in the data space. Class densities, however, do not have to coincide with the overall data density. Thus, the Ω-mapping, which is optimized during GMLVQ training for best classification performance, offers the better visualization option and, hence, discloses the class distribution more appropriately.

[…] samples from December/January are available. We observe that class C for the D_G data is mainly represented in January by European samples, which confirms the findings in [12]. Thus, the small number of class C samples in the D_NB-classification may be attributed to this peculiarity in Europe. Further, the GMLVQ, which was trained on the D_G data, rejects a large amount of data from D_NB, particularly from March. We suspect an accumulation of mutations, which could explain the scattering. Accordingly, the GMLVQ is able to detect this behavior by means of the SCE decision rule. We observe from the visualization S11 […] the given virus type information, which was obtained by phylogenetic tree analysis [12].

GMLVQ supposes vectorial data representations and compares the vectors in terms of a well-defined dissimilarity measure. In this application, the GMLVQ training is based on the Bag-of-Words-coded sequences and yields class-specific prototype vectors as well as an optimum class/type-separating dissimilarity measure in the data space of encoded sequences. Compared to phylogenetic trees, which require high computational costs due to the involved sequence alignment process, the GMLVQ approach has lower complexity and allows an easy out-of-training generalization.

[…]

Visualization of GMLVQ classification for the full D_N-data by Ω-mapping. The data as well as the GMLVQ prototypes are depicted. The class coloring is as in S2 […].

[…] combinations, the coding by the histogram dimensions is given in the file 'S13 Histogram Coding.xlsx'.
S14 GMLVQ Mapping for D_N. Virus type assignments for the D_N-sequences obtained by GMLVQ. For each sequence in D_N, the class/type assignment obtained by the GMLVQ model is given, as well as whether a rejection decision was made according to the SCE of the GMLVQ model. Additionally, we provide the sequences from D_G which were rejected by the SCE decision of the GMLVQ model. The respective file is 'S14 GMLVQMapping.xlsx'.

[…] Accordingly, we identify 2-, 4-, and 5-cluster solutions as most recommended […]

References

- The proximal origin of SARS-CoV-2
- Nowcasting and forecasting the potential domestic and international spread of the 2019-nCoV outbreak originating in Wuhan, China: a modelling study
- Evolution and molecular characteristics of SARS-CoV-2 genome
- Geographic and temporal distributions of SARS-CoV-2 mutations
- Simulating the origins of life: The dual role of RNA replicases as an obstacle to evolution
- Rapid, sensitive, full genome sequencing of Severe Acute Respiratory Syndrome Virus Coronavirus 2 (SARS-CoV-2)
- Structure of the SARS-CoV-2 spike receptor-binding domain bound to the ACE2 receptor
- Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding. The Lancet
- Population genomics insights into the recent evolution of SARS-CoV-2
- The genomic variation landscape of globally-circulating clades of SARS-CoV-2 defines a genetic barcoding scheme
- Phylogenetic network analysis of SARS-CoV-2 genomes
- Induction of decision trees
- Computational phylogenetics: An introduction to designing methods for phylogeny estimation
- C4.5: Programs for Machine Learning
- Deep learning on raw DNA sequences for identifying viral genomes in human samples
- Genotyping coronavirus SARS-CoV-2: methods and implications
- Detecting Life Signatures with RNA Sequence Similarity Measures
- Binary codes capable of correcting deletions, insertions, and reversals
- A general method applicable to the search for similarities in the amino acid sequence of two proteins
- Identification of Common Molecular Subsequences
- An improved algorithm for matching biological sequences
- Similarities and Evolutionary Relationships of COVID-19 and Related Viruses
- A measure of the similarity of sets of sequences not requiring sequence alignment
- Average Values of a Dissimilarity Measure Not Requiring Sequence Alignment Are Twice the Averages of Conventional Mismatch Counts Requiring Sequence Alignment for a Computer-Generated Model System
- Information theory applications for biological sequence analysis
- Alignment-free sequence comparison - a review
- A measure of DNA sequence similarity by Fourier Transform with applications on hierarchical clustering
- K-mer content, correlation, and position analysis of genome DNA sequences for the identification of function and evolutionary features
- Three Approaches to the Quantitative Definition of Information. Problems of Information Transmission
- Clustering by Compression
- The Similarity Metric
- A novel method of characterizing sequences: Genome space with biological distance and applications
- Virus classification in 60-dimensional protein space
- Real Time Classification of Viruses in 12 Dimensions
- Alignment-free sequence comparison: benefits, applications, and tools
- Benchmarking of alignment-free sequence comparison methods
- Biomedical data analysis in translational research: Integration of expert knowledge and interpretable models
- Prototype-based models in machine learning
- Deep Learning Architectures for DNA Sequence Classification
- Deep learning models for bacteria taxonomic classification of metagenomic data
- Identification of 12 cancer types through genome deep learning
- Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
- Application of an interpretable classification model on early folding residues during protein folding
- Learning Vector Quantization Methods for Interpretable Classification Learning and Multilayer Networks
- Interpretable classification models for recidivism prediction
- Robustness of Generalized Learning Vector Quantization Models against Adversarial Attacks
- NCBI Viral Genomes Resource
- K2 and K2*: efficient alignment-free sequence similarity measurement based on Kendall statistics
- Nomenclature for incompletely specified bases in nucleic acid sequences: recommendations 1984
- Information Theory, Inference and Learning Algorithms
- On information and sufficiency
- Rényi continuous entropy of DNA sequences
- Divergence based vector quantization
- Batch and Median Neural Gas
- Neural Maps and Topographic Vector Quantization (Studies in Fuzziness and Soft Computing)
- Robust K-Median and K-Means Clustering Algorithms for Incomplete Data
- 'Neural-Gas' Network for Vector Quantization and its Application to Time-Series Prediction
- Magnification Control in Self-Organizing Maps and Neural Gas
- Clustering by Passing Messages Between Data Points
- Probabilistic Reasoning in Intelligent Systems
- Learning Vector Quantization
- Generalized learning vector quantization
- Aspects in Classification Learning - Review of Recent Developments in Learning Vector Quantization. Foundations of Computing and Decision Sciences
- Foundations of Quantization for Probability Distributions
- A stochastic approximation method
- Divergence based classification in Learning Vector Quantization
- Distance learning in discriminative vector quantization
- Limited Rank Matrix Learning, discriminative dimension reduction and visualization
- Adaptive Relevance Matrices in Learning Vector Quantization
- Regularization in Matrix Relevance Learning
- Can Learning Vector Quantization be an Alternative to SVM and Deep Learning?
- Efficient kernelized prototype based classification
- Kernelized Vector Quantization in Gradient-Descent Learning
- Efficient rejection strategies for prototype-based classification
- Classification with reject option
- Stochastic Neighbor Embedding
- Heavy-Tailed Distributions: Properties and Tests
- Visualizing High-Dimensional Data Using t-SNE
- Combining Image Compression and Classification Using Vector Quantization