title: Universally Rank Consistent Ordinal Regression in Neural Networks
authors: Jenkinson, Garrett; Oliver, Gavin R.; Khezeli, Kia; Kalantari, John; Klee, Eric W.
date: 2021-10-14

Despite the pervasiveness of ordinal labels in supervised learning, it remains common practice in deep learning to treat such problems as categorical classification using the categorical cross entropy loss. Recent methods attempting to address this issue while respecting the ordinal structure of the labels have resorted to converting ordinal regression into a series of extended binary classification subtasks. However, the adoption of such methods remains inconsistent due to theoretical and practical limitations. Here we address these limitations by demonstrating that the subtask probabilities form a Markov chain. We show how to straightforwardly modify neural network architectures to exploit this fact and thereby constrain predictions to be universally rank consistent. We furthermore prove that all rank consistent solutions can be represented within this formulation, and derive a loss function producing maximum likelihood parameter estimates. Using diverse benchmarks and the real-world application of a specialized recurrent neural network for COVID-19 prognosis, we demonstrate the practical superiority of this method versus the current state-of-the-art. The method is open sourced as user-friendly PyTorch and TensorFlow packages.

Ordinal regression (sometimes called ordinal classification) is applied to data in which the features of the $n$-th example $x_n \in \mathcal{X}$ correspond to a label $y_n \in \mathcal{Y} := \{r_1, \ldots, r_K\}$ from a set of elements that have a well-defined ranking or ordering $r_1 < r_2 < \cdots < r_K$. However, unlike traditional metric regression, the ranks cannot be assumed to have quantitative differences or distances amongst themselves. For example, while a syntactic statement such as "terrible" < "great" < "best" may be intuitive, it conveys nothing quantitative about the distances between the ranks, i.e., whether the distance between "terrible" and "great" equals that between "great" and "best". The aim in this setting is to build a reliable rule or regression function $h : \mathcal{X} \to \mathcal{Y}$ from the domain of the features $\mathcal{X}$ to the range of the ordinal labels $\mathcal{Y}$. In the published literature for applied problems, it remains commonplace to ignore the ordering of the labels and apply categorical algorithms to such data (Levi & Hassner, 2015; Rothe et al., 2015), which in neural networks often results in application of the categorical cross entropy (CCE) loss. Problematically, a categorical loss assumes all mislabeling by $h$ is equally wrong, whereas it is clear that predicting "great" when the true label is "best" would be preferable to a prediction of "terrible". Although the problematic nature of this practice has been recognized for more than 35 years (Forrest & Andersen, 1986), it remains common to make the implicit or explicit assumption that ordinal data or labels exist on an interval or ratio scale.
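To make this concrete, the following minimal PyTorch snippet (our illustration, not from the paper, with arbitrary probabilities) shows the CCE assigning identical penalties to a one-rank miss and a worst-case miss:

```python
import torch
import torch.nn.functional as F

# Three ranks: 0 = "terrible" < 1 = "great" < 2 = "best"; true label is "best".
target = torch.tensor([2])

# Two equally confident but differently wrong predicted distributions.
close_probs = torch.tensor([[0.1, 0.8, 0.1]])  # argmax "great" (one rank off)
far_probs   = torch.tensor([[0.8, 0.1, 0.1]])  # argmax "terrible" (worst case)

# cross_entropy consumes logits; log-probabilities serve, since softmax inverts them.
print(F.cross_entropy(close_probs.log(), target))  # tensor(2.3026) = -ln(0.1)
print(F.cross_entropy(far_probs.log(), target))    # tensor(2.3026) = -ln(0.1)
```

Both predictions give the true class a probability of 0.1, so the categorical loss cannot distinguish them, even though the first error is far less severe on the ordinal scale.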
Most contemporary algorithms found in the ordinal regression literature (McCullagh, 1980; Herbrich et al., 1999; Crammer & Singer, 2002; Shashua & Levin, 2002; Rajaram et al., 2003; Shen & Joshi, 2005; Chu & Keerthi, 2005; Li & Lin, 2007; Baccianella et al., 2009; Niu et al., 2016; Fernandes & Cardoso, 2018; Cao et al., 2020) can be viewed through the lens of a general framework proposed by Li & Lin (2007), wherein the labels are encoded as binary vectors by an invertible encoder $e : \mathcal{Y} \to \{0,1\}^{K-1}$ and the regression function $h$ becomes a collection of $K-1$ binary classifiers along with the decoder $e^{-1} : \{0,1\}^{K-1} \to \mathcal{Y}$. However, many of these existing algorithms share one major limitation: rank inconsistency among predictions. Briefly, the $K-1$ binary tasks are not independent, and training $K-1$ classifiers that incorrectly treat them as independent will produce conflicting predictions on the binary tasks, impeding both performance and interpretation of the results. Most recently, Cao et al. (2020) attempted to address this problem in deep neural network (DNN) architectures by proposing a final layer that shares weights among $K-1$ binary outputs (differing only in their bias terms). Herein, we identify the theoretical and practical limitations of the CORAL approach of Cao et al. (2020), and address these concerns by implementing a new algorithm, 'Conditionals for Ordinal Regression' (CONDOR). We prove that CONDOR is universally rank consistent and that it is sufficiently expressive to reach any rank consistent solution. In theory, the method is compatible with any combination of binary classification algorithms producing probabilities, but herein we focus on DNN architectures trained by backpropagation. Using our open source PyTorch and TensorFlow packages, CONDOR can be implemented with minor modifications to any categorical DNN model in these frameworks, allowing for increased adoption of ordinal regression in the applied literature.

We begin by introducing the CONDOR framework and its notation, and then provide proofs of the universal rank consistency and the full expressiveness of the framework. The Li & Lin (2007) encoder $e$ converts the ordinal regression label $y$ into $K-1$ binary classification labels $y^{(1)}, \ldots, y^{(K-1)}$ using the indicator variables $y^{(1)} := \mathbb{1}_{y > r_1}, \ldots, y^{(K-1)} := \mathbb{1}_{y > r_{K-1}}$, where the indicator variable $\mathbb{1}_{\text{expr}}$ equals 1 if expr is true and 0 otherwise. The ordinal classification problem then becomes a matter of producing $K-1$ binary classifier subtasks $f_k : \mathcal{X} \to \{0,1\}$, which we assume come from thresholding predicted binary class probabilities $p_k : \mathcal{X} \to [0,1]$ as $f_k(x) = \mathbb{1}_{p_k(x) > 0.5}$. When using parametric techniques such as DNNs, these functions will be parameterized by $\theta \in \Theta$ and denoted notationally as $p_k(x;\theta)$ and $f_k(x;\theta)$. For convenience, we often deal with the rank index $s \in \{1, \ldots, K\}$ associated with the rank $r_s \in \mathcal{Y}$. From the binary classifier subtasks, the rank index $s_i$ for input feature vector $x_i$ can be estimated as

$$s_i = 1 + \sum_{k=1}^{K-1} f_k(x_i; \theta), \quad (1)$$

although multiple methods are possible to produce point estimates for $s_i$ from the probabilities $p_k(x_i;\theta)$, $k = 1, \ldots, K-1$. As shown in Figure 1, the aforementioned binary encoding approach requires that the $p_k$, $k = 1, \ldots, K-1$, be rank-monotonic,

$$p_1(x_i;\theta) \ge \cdots \ge p_{K-1}(x_i;\theta),$$

for all $x_i \in \mathcal{X}$ and $\theta \in \Theta$ to guarantee consistent predictions.

Figure 1: Existing methods can produce rank inconsistent predictions, whereas CONDOR sits atop any DNN architecture and produces universally rank consistent results. This improves performance and interpretability of the ordinal model.
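Since the encoder $e$ and the decoding rule of Equation (1) are purely mechanical, they are easy to implement directly; the sketch below is our own illustration and uses 0-based rank indices rather than the 1-based convention of the text.

```python
import torch

K = 5  # number of ordinal ranks r_1 < ... < r_K

def encode(rank_index: torch.Tensor) -> torch.Tensor:
    """Li & Lin (2007) encoder: y^(k) = 1[y > r_k] for k = 1, ..., K-1.
    `rank_index` holds 0-based indices into (r_1, ..., r_K)."""
    thresholds = torch.arange(K - 1)                  # 0, 1, ..., K-2
    return (rank_index.unsqueeze(-1) > thresholds).float()

def decode(probs: torch.Tensor) -> torch.Tensor:
    """Equation (1) without the +1: counts subtasks with p_k > 0.5,
    returning a 0-based rank index."""
    return (probs > 0.5).sum(dim=-1)

y = torch.tensor([0, 2, 4])
print(encode(y))           # rows: [0,0,0,0], [1,1,0,0], [1,1,1,1]
print(decode(encode(y)))   # tensor([0, 2, 4]): encode then decode round-trips
```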
Rather than directly estimating the marginal probabilities $p_k(x;\theta) = P(y^{(k)} = 1 \mid \mathbf{x} = x; \theta)$, $k = 1, \ldots, K-1$, as in existing approaches based on Li & Lin (2007), we estimate the conditionals

$$q_k(x;\theta) := P(y^{(k)} = 1 \mid y^{(k-1)} = 1, y^{(k-2)} = 1, \ldots, y^{(0)} = 1, \mathbf{x} = x; \theta) \quad (2)$$

for $k = 1, \ldots, K-1$, where we set the boundary condition $y^{(0)} = 1$ with unit probability by convention. Equality (2) follows by construction of the binary labels, since $y^{(k-1)} = 1$ implies $y > r_{k-1} > r_{k-2} > \cdots > r_1$, which by definition means $y^{(k')} = 1$ for all $k' \le k-1$. By the same reasoning, the marginal probability is equivalent to the joint probability

$$p_k(x;\theta) := P(y^{(k)} = 1 \mid \mathbf{x} = x; \theta) = P(y^{(k)} = 1, y^{(k-1)} = 1, \ldots, y^{(0)} = 1 \mid \mathbf{x} = x; \theta),$$

and by the product rule we produce a heterogeneous Markov chain representation of our marginal probabilities,

$$p_k(x;\theta) = \prod_{j=1}^{k} q_j(x;\theta). \quad (3)$$

The identities in Equations (2) and (3) are exact (i.e., not approximations) and thus fully general, and this fact has previously been exploited in the literature for performing ordinal semantic segmentation (Fernandes & Cardoso, 2018). These equations can in theory be applied in any classifier estimating binary probabilities, but we focus on the application to DNNs, where $\theta$ represents the trainable parameters of the network. Namely, we select the final layer of the neural network to have $K-1$ nodes with sigmoid activations representing $q_k(x;\theta)$, $k = 1, \ldots, K-1$. For training, a reasonable heuristic loss akin to the one used in Cao et al. (2020) is the weighted binary cross-entropy (WBCE) of all the subtasks,

$$L_{\mathrm{WBCE}}(\theta) = -\sum_{n=1}^{N} \sum_{k=1}^{K-1} \lambda_k \left[ y_n^{(k)} \ln p_k(x_n;\theta) + \left(1 - y_n^{(k)}\right) \ln\left(1 - p_k(x_n;\theta)\right) \right], \quad (4)$$

where $\lambda_k > 0$ is the importance parameter for task $k$, which we set to one in what follows. However, as we will demonstrate, minimizing the following loss function results in the maximum likelihood (ML) estimate for the neural network parameters:

$$L(\theta) = -\sum_{n=1}^{N} \sum_{k=1}^{K-1} \left[ y_n^{(k)} \ln q_k(x_n;\theta) + \left( y_n^{(k-1)} - y_n^{(k)} \right) \ln\left(1 - q_k(x_n;\theta)\right) \right]. \quad (5)$$

We call this approach Conditionals for Ordinal Regression (CONDOR). Here we substantiate CONDOR's guarantee of preserving rank consistency and its ability to represent any rank consistent solution. We first provide a proof regarding the maximum likelihood loss function.

Theorem 2.1. The parameters $\theta$ that minimize the loss function in Equation (5) are the maximum likelihood estimators.

Proof. Note that by Equations (2) and (3) we have

$$P(\mathbf{y} = y_n \mid \mathbf{x} = x_n; \theta) = \prod_{k=1}^{K-1} q_k(x_n;\theta)^{y_n^{(k)}} \left(1 - q_k(x_n;\theta)\right)^{y_n^{(k-1)} - y_n^{(k)}}.$$

And thus under independent and identically distributed data, we find the negative log likelihood $\Lambda$ of our data to be

$$\Lambda(\theta) = -\sum_{n=1}^{N} \ln p(\mathbf{y} = y_n, \mathbf{x} = x_n \mid \theta) = C + L(\theta),$$

where the constant $C = -\sum_{n=1}^{N} \ln p(\mathbf{x} = x_n)$ is independent of our parameters $\theta$ as it depends only on the (unspecified) distribution of the features, and the penultimate equality made use of the fact that $y_n^{(k)} = 1$ implies $y_n^{(j)} = 1$ for all $j < k$. Thus minimizing Equation (5) with respect to our neural network parameters $\theta$ will minimize the negative log likelihood $\Lambda(\theta) = C + L(\theta)$, and provide the maximum likelihood estimates of our parameters.

Lemma 2.2. CONDOR provides universal rank consistency (i.e., rank consistent estimates for all input data $x \in \mathcal{X}$ and any parameterization $\theta \in \Theta$ of the DNN).

Proof. In neural networks, we can enforce $0 < q_k(x;\theta) < 1$ for all $x$ and any weight parameterization $\theta$ of the DNN by having $K-1$ output nodes with sigmoid activations representing $q_k(x;\theta)$, $k = 1, \ldots, K-1$. Because $0 < q_k(x;\theta) < 1$ for all $x$ and $\theta$, we have by Equation (3)

$$p_{k+1}(x;\theta) = q_{k+1}(x;\theta)\, p_k(x;\theta) < p_k(x;\theta)$$

for all $x$, $\theta$ and $k = 1, \ldots, K-2$. Thus we have rank consistency for all $x$ and any weight parameterization $\theta$ of the DNN.
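For concreteness, the following PyTorch sketch hand-rolls a CONDOR output head implementing Equation (3) and the maximum likelihood loss of Equation (5); the names CondorHead and condor_nll are ours, not the API of the released packages.

```python
import torch
import torch.nn as nn

class CondorHead(nn.Module):
    """Dense layer emitting K-1 logits; their sigmoids are the conditionals
    q_k(x), and Equation (3) converts them to marginals via a cumulative
    product, which is non-increasing in k whenever 0 < q_k < 1."""
    def __init__(self, in_features: int, num_ranks: int):
        super().__init__()
        self.fc = nn.Linear(in_features, num_ranks - 1)

    def forward(self, h: torch.Tensor):
        q = torch.sigmoid(self.fc(h))     # conditionals q_1, ..., q_{K-1}
        p = torch.cumprod(q, dim=-1)      # marginals p_k = prod_{j<=k} q_j
        return q, p

def condor_nll(q: torch.Tensor, levels: torch.Tensor) -> torch.Tensor:
    """Negative log likelihood of Equation (5); `levels` holds the encoded
    binary labels y^(1..K-1) as floats, with the convention y^(0) = 1."""
    eps = 1e-8  # numerical guard for the logarithms
    prev = torch.cat([torch.ones_like(levels[:, :1]), levels[:, :-1]], dim=1)
    # A subtask contributes ln q_k while the chain is alive (y^(k) = 1),
    # ln(1 - q_k) at the first failure, and nothing afterwards.
    ll = levels * torch.log(q + eps) + (prev - levels) * torch.log(1.0 - q + eps)
    return -ll.sum(dim=1).mean()
```

Because each conditional lies strictly inside (0, 1), the cumulative product is strictly decreasing in $k$, which is precisely the universal rank consistency of Lemma 2.2.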
To simplify the presentation, we suppress the notational dependence on $\theta$ in the remainder of the manuscript.

Theorem 2.3. Assume that a neural network can universally approximate any $C^1$ function $g : \mathcal{X} \to \mathbb{R}^{K-1}$, in the sense that it can produce $\hat{g}$ with $\hat{g}(x) = g(x) + O(\epsilon)$ for all $x$ and some $\epsilon > 0$. Then adding a CONDOR output layer to said network can approximate any rank consistent continuous ordinal regressors $p_1^*(x) \ge \cdots \ge p_{K-1}^*(x)$ to within $O(\epsilon)$.

Proof. By rank consistency, for any $x$ we have

$$1 = p_0^*(x) \ge p_1^*(x) \ge \cdots \ge p_{K-1}^*(x) \ge 0,$$

where we have defined the boundary condition $p_0^*(x) = 1$. Then, for $\epsilon > 0$, we define perturbed conditionals $q_k^*(x)$ that are continuous in $x$, bounded away from zero and one, and satisfy

$$q_k^*(x)\, p_{k-1}^*(x) = p_k^*(x) + O(\epsilon) \quad (6)$$

for each $k$ and $x$. Then define the continuous functions $g_k(x) := \sigma^{-1}(q_k^*(x))$, where $\sigma$ denotes the sigmoid, and have the upstream neural network produce the functions $\hat{g}_k(x) = g_k(x) + O(\epsilon)$. After the CONDOR sigmoid activations, the neural network produces at its output nodes

$$q_k(x) = \sigma(\hat{g}_k(x)) = q_k^*(x) + O(\epsilon)$$

for all $x$, since $\sigma$ is Lipschitz. The CONDOR approach then yields $p_k(x) = q_k(x)\, p_{k-1}(x)$, and by iteration it follows that

$$p_k(x) = q_k^*(x)\, p_{k-1}^*(x) + O(\epsilon). \quad (7)$$

Using the defining property of $q_k^*(\cdot)$ in Equation (6), we get

$$p_k(x) = p_k^*(x) + O(\epsilon) \quad (8)$$

for all $x$ and $k = 1, \ldots, K-1$. The last Equality (8) comes from considering separately the cases $p_{k-1}^*(x) = 0$ and $p_{k-1}^*(x) > 0$. In the former we have $p_k^*(x) = 0$ by rank consistency, and Equation (7) reduces to $p_k(x) = 0 + O(\epsilon)$. In the latter, Equation (8) follows directly from substituting Equation (6) into Equation (7).

In the subsequent sections, we demonstrate CONDOR's superior performance compared to the state-of-the-art on several benchmark and real-world data sets. Specifically, we profile the WBCE from Equation (4), the earth mover's distance (EMD) on the rank indices (assuming unit distance between ranks), and the mean absolute error (MAE) on the rank indices (again assuming unit distance between ranks), using Equation (1) for the point estimate. We benchmark CONDOR trained using the maximum likelihood loss in Equation (5) as well as CONDOR trained using the WBCE loss in Equation (4), which will be denoted CONDOR-WBCE in what follows. In these experiments, the only differences among the four methods are the choice of loss function and the final layer of the neural network; all other details of the DNN architecture, the optimization algorithms, hyperparameters and random number seeds are kept equal throughout each experiment. Namely, CORAL and CONDOR-WBCE both use the WBCE over the $K-1$ subtasks as a loss, whereas Categorical uses the CCE and CONDOR uses Equation (5). Likewise, CORAL uses a custom final layer with weight sharing (Cao et al., 2020) among $K-1$ output nodes, CONDOR uses a final dense layer with $K-1$ output nodes which after sigmoid activation represent $q_k(x)$, $k = 1, \ldots, K-1$, and the categorical algorithm uses a dense layer with $K$ nodes and a softmax activation. All results were gathered with three random number seeds and are reported as the mean plus or minus the standard deviation across these seeds.

We first consider the simple task of ordinal classification wherein the labels 0, 1, 2, 3 are the quadrants of the plane, numbered counterclockwise, and the features are generated from a 2D standard normal distribution. We draw 1000 samples and perform a 90/10 train/test split of the dataset. As the upstream network architecture we select two dense layers with ten neurons each and ReLU activations, together with an Adam optimizer run for up to 100 epochs with an early-stopping patience of 10 using a validation split of 0.2. As can be seen in Table 1, the algorithms proposed in CONDOR demonstrate the best performance in WBCE, EMD and MAE.
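A possible construction of this toy dataset is sketched below; the paper does not state which quadrant receives label 0, so the numbering from the positive quadrant is our assumption.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed is illustrative; the paper used three seeds

# 2D standard normal features; label = quadrant of the plane, numbered
# 0..3 counterclockwise (which quadrant gets label 0 is our assumption).
X = rng.standard_normal((1000, 2))
angles = np.arctan2(X[:, 1], X[:, 0])             # in (-pi, pi]
y = (np.floor(angles / (np.pi / 2)) % 4).astype(int)

split = int(0.9 * len(X))                         # 90/10 train/test split
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
```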
Depending on the application, MNIST can be considered a categorical problem or an interval regression problem. If the digits are used for license-plate recognition, then the problem is categorical, since there is no notion of "close" errors. By contrast, if the digits are used for GPS coordinates or postal codes, then the ordering of and distances between numerals become relevant, and categorical classification is no longer the most appropriate framing of the task. It is valid to treat interval regression as an ordinal problem, since this assumes less structure on the dataset, although it is generally recommended to exploit the interval scale. Here we treat MNIST as ordinal data for the purpose of benchmarking our ordinal algorithms, while acknowledging that it should likely be treated as either categorical classification or interval regression as dictated by the specific real-world application setting.

The MNIST data are split into training, validation and test sets of 55K, 5K and 10K images, respectively. We utilize a convolutional neural network with two convolutional layers of 64 and 32 filters respectively and a kernel size of 3, before flattening and passing to the appropriate output layer and loss function for each of our four models (CONDOR, CONDOR-WBCE, CORAL, Categorical). Training is performed with the Adam optimizer, a maximum of 100 epochs and an early-stopping patience of 10. The results in Table 2 indicate that CONDOR demonstrates superior performance in all three metrics.

Next we consider a natural language processing (NLP) dataset consisting of 99,025 (non-duplicate and non-empty) Amazon Pantry text reviews with their corresponding one- to five-star ratings (Ni et al., 2019). We split the data to obtain a test set of 10,000 examples. For the neural network architecture, we use the fixed, pre-trained Google universal sentence encoder (Cer et al., 2018) and append a dense layer with 64 ReLU-activated neurons and a dropout of 0.1, followed by the appropriate output layer and loss function for each model. Training is performed with the Adam optimizer, a maximum of 100 epochs and an early-stopping patience of 10 with a validation split of 0.2. The results in Table 3 demonstrate that CONDOR-WBCE provides the strongest performance in this benchmark across all three performance metrics.

This study adheres to a research protocol approved by the Mayo Clinic Institutional Review Board. Here we extend the results of Sankaranarayanan et al. (2021), progressing from their binary classification predicting mortality to ordinal regression predicting severity of outcomes. Namely, this clinical dataset includes two binary severity outcomes: an indicator variable for mechanical ventilation or extracorporeal membrane oxygenation (ECMO), and an indicator variable for whether patient death occurred. From these, a clear three-point ordinal scale can be constructed whereby a patient is scored a zero when they have no severe outcome, a one when they experienced the severe outcome of ventilation or ECMO, and a two corresponding to death (with or without prior ventilation or ECMO). Sankaranarayanan et al. (2021) identified the GRU-D recurrent neural network architecture as the best performing model for binary mortality prediction, and we extend that approach to the ordinal problem for the CONDOR, CORAL, and categorical algorithms using their corresponding final layers and loss functions. The GRU-D architecture (Che et al., 2018) deals explicitly with the 55-dimensional time series that is missing not at random (MNAR) due to the manner in which clinical data are ordered and recorded in an electronic health record (EHR).
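The scoring rule maps the two binary indicators to the ordinal label directly; the hypothetical helper below encodes it.

```python
import numpy as np

def severity_label(vent_or_ecmo: np.ndarray, death: np.ndarray) -> np.ndarray:
    """Three-point ordinal outcome: 0 = no severe outcome, 1 = mechanical
    ventilation or ECMO, 2 = death (with or without prior ventilation/ECMO).
    Death dominates, matching the scoring rule described above."""
    return np.where(death == 1, 2, np.where(vent_or_ecmo == 1, 1, 0))

print(severity_label(np.array([0, 1, 1, 0]), np.array([0, 0, 1, 1])))  # [0 1 2 2]
```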
The default hyperparameters (dropout of 0.3, L2 regularization of 0.0001, 100 hidden and recurrent neurons, batch size of 256, Adam learning rate of 0.001, no batch normalization, no bidirectional RNN, 50 maximum time steps, 100 epochs with an early-stopping patience of 10 epochs) have previously demonstrated strong performance (Sankaranarayanan et al., 2021) and so are retained here. The dataset is split into a training/validation set of 9,435 patients who tested positive for COVID-19 by PCR test prior to December 15, 2020, and a prospective testing set of 2,372 patients who tested positive on or after that date. For training we use an identical 90/10 training/validation split of the 9,435 patients, which facilitates early stopping with patience. The results in Table 4 demonstrate that CONDOR is superior in EMD while CONDOR-WBCE is superior in the remaining metrics. Furthermore, the CONDOR-WBCE GRU-D model has a prospective test set AUROC of 0.9038 ± 0.0025 for the mortality prediction subtask, which is greater than the 0.901 reported by Sankaranarayanan et al. (2021), wherein the authors trained the algorithm as a binary classifier specifically for mortality prediction. This demonstrates that there is no loss of mortality prediction performance when building a DNN to address the more challenging task of prognostication.

We have demonstrated the ability of the CONDOR approach to overcome limitations present in popular alternative methods and to produce rank consistent results in the classification of data with ordinal labels. Rank consistency is important not only for theoretical soundness, but also in application settings where explainability matters and a rank inconsistent prediction would be unacceptably contradictory and fundamentally unexplainable. Regardless of the loss function being optimized or the parameterization of the neural network, CONDOR provides universal guarantees of rank consistency by Lemma 2.2, which is to say the CONDOR approach is "sufficient" for rank consistency. Our next result leverages the fact that there is a wide variety of universal approximation theorems for neural networks, each with its own technical conditions (e.g., see Chong (2020) for a discussion of various universal approximation proofs and their technical conditions). Namely, Theorem 2.3 states that any upstream neural network satisfying the conditions for universal approximation can be given a CONDOR output layer, creating a universally rank consistent network that can approximate any rank consistent solution. This theorem can be interpreted as CONDOR being "necessary" for rank consistency, insofar as any rank consistent solution can be represented by a CONDOR neural network. Finally, we provide in Theorem 2.1 the maximum likelihood loss function, which should also be more numerically stable than the heuristic WBCE loss, particularly when the number of ordinal classes is large. We therefore suggest the maximum likelihood loss in practice, even though its performance was similar to that of WBCE in our numerical experiments.

In contrast to our theoretical results, note that Theorem 1 of Cao et al. (2020) only provides rank consistency at the global minimum of an optimization problem with the specified loss function. Since neural network training is not guaranteed or expected to reach a globally optimal parameterization, the Cao et al. (2020) approach can in theory produce rank inconsistent solutions, and thus in practice requires post hoc checks of the estimated bias terms to verify rank consistency.
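Such a post hoc check reduces to verifying that the estimated biases are non-increasing, since $\sigma$ is monotone and hence $p_k(x) = \sigma(a(x) + b_k)$ is non-increasing in $k$ for every $x$ exactly when the $b_k$ are; the helper below is a hypothetical sketch of that check.

```python
import torch

def coral_biases_rank_consistent(biases: torch.Tensor) -> bool:
    """Returns True when the CORAL bias terms b_1 >= b_2 >= ... >= b_{K-1},
    the condition under which p_k(x) = sigmoid(a(x) + b_k) is non-increasing
    in k for every input x. CONDOR needs no such check (Lemma 2.2)."""
    return bool((biases[:-1] >= biases[1:]).all())

print(coral_biases_rank_consistent(torch.tensor([2.0, 0.5, -1.0])))  # True
print(coral_biases_rank_consistent(torch.tensor([2.0, 2.5, -1.0])))  # False
```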
Furthermore, Cao et al. (2020) restrict expressiveness in the $K-1$ binary classifier outputs, which must have "parallel slopes" (i.e., differ only by a bias parameter whose impact is completely independent of the feature vector). In Appendix A.1, we formalize these comments with two proofs demonstrating that the CORAL framework (Cao et al., 2020) lacks the theoretical guarantees of CONDOR.

Beyond our mathematical justifications, it is ultimately critical that the method perform well within a wide variety of neural network architectures and ordinal problem settings. Our benchmarking of dense networks, CNNs, attention networks, and exotic RNNs all shows the practical benefit of using the CONDOR algorithm in a diverse set of ordinal applications using a variety of ordinal metrics. Furthermore, beyond ordinal measures of performance, CONDOR remains competitive in the categorical performance measure of accuracy, and in fact provides improved classification in true ordinal problems when compared to categorically optimized neural networks. We attribute this to the ability of the network to exploit "clues" encountered during ordinal training (Appendix A.2).

In addition to the theoretical strengths and performance improvements of the CONDOR method, we note that many applied machine learning papers simply use categorical classification in their problem settings rather than consider current state-of-the-art methods for ordinal regression. Part of this may be educational, as most beginners are taught only binary/categorical classification and continuous regression. However, the authors also believe part of the barrier is programmatic ease of use. We provide production-ready and user-friendly software packages in both PyTorch and TensorFlow, in order to minimize the effort required to convert existing categorical code bases into CONDOR ordinal code bases. In Appendix A.3, we demonstrate the modest code changes required to implement our methodology in an existing categorical code base.

The key requirements for successful supervised learning are algorithms that respect the structure of the problem and access to sufficient amounts of labeled data. Since CONDOR satisfies the first requirement by providing a robust algorithm for ordinal regression, we conclude with the latter by emphasizing the prevalence of available ordinal outcome measurements, using medical applications as a prototypical applied problem domain. Survey research, for instance, frequently utilizes ordinal responses such as the psychometric Likert scale (Likert, 1932), providing a large corpus of existing data with ordinal labels. Furthermore, while labeling outcomes from the electronic health record (EHR) is one of the most time-consuming and expensive aspects of applied machine learning in the medical space, the proliferation of ordinal scales in modern medical practice (see Appendix A.4) means the EHR already contains physician-provided ordinal outcomes from a large variety of settings. The ubiquity of ordinal outcome measurements throughout survey research and medical settings represents a rich, untapped reserve of training data that has yet to be fully explored by ordinal regression machine learning algorithms, and it is our hope that CONDOR's demonstrated capabilities and its ease of use will encourage its adoption and enable broad exploration of underutilized data across these and other domains.

The funding for this research was provided by the Mayo Clinic Center for Individualized Medicine.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. The authors thank Saranya Sankaranarayanan and Jagadheshwar Balan for sharing their preprocessed versions of the COVID-19 data set and their code for GRU-D mortality prediction.

A.1 CORAL PROOFS

In this appendix, we demonstrate formally that the CORAL framework of Cao et al. (2020) has neither the universal rank consistency nor the expressiveness of CONDOR.

Lemma A.1. CORAL is not universally rank consistent.

Proof. In CORAL the last layer shares weights and only has different bias terms $b_1, \ldots, b_{K-1}$, meaning it represents the output probabilities as (Cao et al., 2020)

$$p_k(x) = \sigma(a(x) + b_k), \quad k = 1, \ldots, K-1, \quad (9)$$

for some $a(x)$. This means in the notation of CONDOR that

$$q_k(x) = \frac{p_k(x)}{p_{k-1}(x)} = \frac{\sigma(a(x) + b_k)}{\sigma(a(x) + b_{k-1})} \quad (10)$$

for $k = 2, \ldots, K-1$, and $q_1(x) = p_1(x)$. Note that if the neural network parameters are chosen such that $b_k > b_{k-1}$ for some $k$, then $q_k(x) > 1$ and CORAL is rank inconsistent.

It is quite clear from Equation (10) that the functional form of CORAL is far more restrictive than that of CONDOR, which allows arbitrary functions $q_k : \mathcal{X} \to [0,1]$. For completeness, we prove formally in the next lemma that it is less expressive.

Lemma A.2. CORAL cannot approximate every rank consistent solution with $O(\epsilon)$ error.

Proof. For simplicity, consider a univariate input $x$ and $K = 3$. Then for CORAL we have

$$q_2(x) = \frac{p_2(x)}{p_1(x)} = \frac{1 + \exp(-a(x))\exp(-b_1)}{1 + \exp(-a(x))\exp(-b_2)} = \frac{\exp(a(x)) + \exp(-b_1)}{\exp(a(x)) + \exp(-b_2)},$$

and $q_1(x) = p_1(x) = \sigma(a(x) + b_1)$. Consider an extremely simple CONDOR network with no hidden layers, bias parameters fixed to zero, and only two weights $w_1 = 1$, $w_2 = 2$, producing

$$q_k^*(x) = \frac{1}{1 + \exp(kx)}, \quad k = 1, 2,$$

which is rank consistent by Lemma 2.2. Suppose by way of contradiction that there exist CORAL $a(x)$ and $b_1, b_2$ such that $q_k(x) = q_k^*(x) + O(\epsilon)$ for all $k$ and $x$. Matching $q_1$ forces $a(x) = -x - b_1 + O(\epsilon)$ (Equation (11)), and plugging Equation (11) into the expression for $q_2$ above (Equation (12)), we find after rearrangement an identity whose left hand side has an infinite range depending on $x$ whereas the right hand side is a constant up to an error of order $\epsilon$, which is a contradiction.

A.2 CATEGORICAL ACCURACY

Accuracy is a categorical performance measure wherein there is no increasing penalty for being further from the correct label, and therefore no relative "credit" given for being close to the correct label. One might therefore expect that training a neural network with a categorical loss (i.e., CCE) would result in higher categorical accuracy than if the network were trained with an ordinal method. Yet in Table 5 we find that in the majority of benchmarks the proposed ordinal methods provided higher categorical accuracy than the networks trained using categorical cross entropy to specifically optimize categorical performance. We attribute this remarkable finding to the "clues" provided to the network when the ordinal nature of the problem is exploited during training. For instance, if the network incorrectly guesses rank index 7 when the true rank index is 8, the categorical loss treats this equivalently to a guess of rank index 1; backpropagation does not send any signal indicating that the guess of rank 7 was "close" to the true rank of 8. By contrast, Equation (4) captures the fact that most of the binary subtasks are correctly predicted when a rank of 7 is estimated for a ground truth rank of 8. In problems like MNIST, where the features are not necessarily trending with increasing rank, we can understand how a categorical loss produces a stronger categorical accuracy. But in problems like Amazon star ratings, where the language and sentiment of 4- and 5-star reviews are likely closer in the NLP embedding space than the language and sentiment of 1-star reviews, one can also understand how training with an ordinal loss could provide higher categorical accuracy than a categorical loss that only has the capacity to indicate "correct" versus "incorrect" and never "incorrect but close".
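This partial credit is easy to verify numerically; in the illustration below (ours, with arbitrary confidence values), the extended-binary targets for a true rank of 8 out of 10 are scored with the unweighted loss of Equation (4) against a confident guess of rank 7 and a confident guess of rank 1.

```python
import torch
import torch.nn.functional as F

K = 10
truth = 8                                   # true 1-based rank index
ks = torch.arange(1, K)                     # subtask indices 1..K-1
target = (ks < truth).float().unsqueeze(0)  # y^(k) = 1[y > r_k]

def confident_guess(rank: int) -> torch.Tensor:
    """Extended-binary probabilities for a confident guess of `rank`."""
    return torch.where(ks < rank,
                       torch.tensor(0.95), torch.tensor(0.05)).unsqueeze(0)

# Unweighted Equation (4): the subtask losses reward "incorrect but close".
close = F.binary_cross_entropy(confident_guess(7), target)  # ~0.378
far   = F.binary_cross_entropy(confident_guess(1), target)  # ~2.341
print(close.item() < far.item())  # True: the near miss keeps most subtask credit
```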
A.3 SOFTWARE PACKAGES

CONDOR is open sourced as both TensorFlow and PyTorch repositories that make it simple to modify existing categorical code bases to use CONDOR. See Figure 2 for a hypothetical example in TensorFlow. Both the TensorFlow and PyTorch versions of the GitHub repositories have full mkdocs documentation, docker files, ipynb tutorials, continuous integration testing and pip packaging. The authors hope this reduces the barrier to using proper and cutting-edge ordinal regression in applied problems.
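In the spirit of the paper's Figure 2, the PyTorch sketch below shows how small the conversion is: only the output width and the loss change (here using the hand-rolled condor_nll from our earlier sketch; the released packages wrap equivalent functionality behind their own APIs).

```python
import torch
import torch.nn as nn

K = 5  # number of ordinal ranks (illustrative)

# Before: a categorical model ending in K logits trained with CCE.
categorical_model = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, K),                 # + F.cross_entropy on integer labels
)

# After: the identical backbone; only the head shrinks to K - 1 logits whose
# sigmoids are the CONDOR conditionals, trained with the Equation (5) loss.
condor_model = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, K - 1),             # + condor_nll on encoded binary labels
)

# Inference: marginals via Equation (3), point estimate via Equation (1).
x = torch.randn(4, 128)
p = torch.sigmoid(condor_model(x)).cumprod(dim=-1)
rank_index = (p > 0.5).sum(dim=-1)    # 0-based predicted rank index
```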
A.4 ORDINAL SCALES IN MEDICINE

In medicine, prognostication, treatment, and decision making all require evidence- and consensus-based labeling of patient disease states. Frequently, these categorizations are made ordinal to align with the expected prognosis or severity of disease. Well known to the general public is the use of tumor staging in oncology to characterize neoplasms; here we provide a non-comprehensive sampling of other specialties that are perhaps less well known. For instance, the American Association for the Surgery of Trauma provides 32 ordinal scales (Moore & Moore, 2010) for assessing the severity of trauma to 32 organs on a scale of 1 (minimal) to 6 (lethal). Other examples reflected in our references include the BI-RADS and Lung-RADS reporting systems in radiology, the Bosniak classification of cystic renal masses, the Glasgow Coma Scale in traumatic brain injury, ordinal pain rating scales, the Scottish protocol for grading diabetic retinopathy, and the ACMG/AMP five-tier classification of sequence variants. While these examples are not intended to be a comprehensive review, they hopefully provide some insight into just how prevalent ordinal scales are in modern medicine.

REFERENCES

Evaluation measures for ordinal regression
Rank consistent ordinal regression for neural networks with application to age estimation
A closer look at the approximation capabilities of neural networks
New approaches to support vector ordinal regression
Pranking with ranking
Ordinal image segmentation using deep neural networks
Ordinal scale and statistics in medical research
Understanding Lung-RADS 1.0: a case-based review
Age and gender classification using convolutional neural networks
Ordinal regression by extended binary classification
A technique for the measurement of attitudes
Regression models for ordinal data (with discussion)
BI-RADS update
American Association for the Surgery of Trauma Organ Injury Scaling: 50th anniversary review article of the Journal of Trauma
Justifying recommendations using distantly-labeled reviews and fine-grained aspects
Ordinal regression with multiple output CNN for age estimation
Support vector learning for ordinal regression
Classification approach towards ranking and sorting problems
Differential effects of the Glasgow Coma Scale Score and its components: an analysis of 54,069 patients with traumatic brain injury
Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology
DEX: deep expectation of apparent age from a single image
COVID-19 mortality prediction from deep learning in a large multistate electronic health record and laboratory information system data set: algorithm development and validation
Ranking with large margin principle: two approaches
Ranking and reranking with perceptron
Bosniak Classification of Cystic Renal Masses, Version 2019: an update proposal and needs assessment
Pain: a review of three commonly used pain rating scales
Grading diabetic retinopathy (DR) using the Scottish grading protocol