Cao, Alexander; Klabjan, Diego; Luo, Yuan. Open-Set Recognition of Breast Cancer Treatments. 2022-01-09.

Open-set recognition generalizes a classification task by classifying test samples as one of the known classes from training or "unknown." As novel cancer drug cocktails with improved treatment are continually discovered, predicting cancer treatments can naturally be formulated as an open-set recognition problem. Straightforward implementations of prior work in healthcare open-set learning suffer drawbacks that arise from modeling unknown samples during training. Accordingly, we reframe the problem methodology and apply an existing Gaussian mixture variational autoencoder model, which achieves state-of-the-art results on image datasets, to breast cancer patient data. Not only do we obtain more accurate and robust classification results, with a 24.5% average F1 increase compared to a recent method, but we also reexamine open-set recognition in terms of deployability to a clinical setting.

Most prior work in classification is closed-set, meaning the classes are assumed to be the same for both training and testing. Only relatively recently have classifiers designed for open-set evaluation, where unknown classes appear only in testing, gained attention as a real-world necessity. In particular, open-set image recognition arises from increasingly automated computer vision systems such as those in self-driving cars. It would certainly be impossible to include in model training every object class that could possibly be seen while driving [1]. Given the inherent dynamism of healthcare, one can argue there is an even greater need for open-set classifiers. Some diseases are too rare to include sufficient samples in training [2].
There is also a consistent cycle of identifying novel diseases and developing treatments for them (recently COVID-19, for instance). Both of these circumstances necessitate generalizing medical classification tasks to open-set recognition. In this paper we study an example from a more prevalent circumstance: the discovery of new drug combinations for already existing diseases. Personalized medicine through quantitative, phenotypic profiling shows promise in medical care by guiding drug combination strategies [3]. In cancer treatments, these drug combinations are becoming the standard of care, and many drug combination therapies have been approved or are under clinical trials [4]. The landscape of cancer drug combinations, or "cocktails," evolves with the discovery of novel cocktails offering improved treatment and fewer side effects. Although some guidelines exist for certain cancer types, individual patients' responses to various drug combinations are still not well understood. For instance, which drug combinations would most likely benefit a specific patient is still a critical, open question. In this vein, we formulate predicting cancer cocktail treatments as an open-set classification problem. Our goal is to classify patients by cocktail treatments based on medical and demographic features. In addition, however, sufficiently unique patients unlike those historically associated with known cocktails (i.e., cocktails in the training set) should be classified as "novel." This "novel" class is an indication that different or new cocktails may be more suitable for those patients' treatments. To our knowledge, this is the first application of open-set recognition to cancer treatment predictions. In this paper, we focus on the open-set learning variant of training (and validating) on only the C known classes for (C + 1)-class classification during inference. The (C + 1)-th class aggregates all novel test samples not belonging to the known classes.
To reflect a real-world scenario, we do not have samples from "novel" cocktails during the training and validation phases. For this pilot study, we focus on predicting whether a patient will benefit from a novel cocktail treatment versus known cocktails. Previous healthcare open-set studies rely on the use of fabricated or auxiliary data and standard softmax classifiers [2, 5]. Bypassing this data necessity and instead exploiting reconstruction, we adapt the existing Gaussian Mixture Variational Autoencoder (GMVAE) model [6], which achieves state-of-the-art results in open-set image recognition, to our open-set cancer treatment recognition task. GMVAE is a deep neural network, autoencoder-based model in which the bottleneck latent layer simultaneously performs class-based clustering and learns reconstruction. In this way, the patient data is embedded in a lower-dimensional representation that discriminates between known cocktail classes and, unlike standard approaches, simultaneously captures interdependent patient information. This dual nature of classification and reconstruction leads to a more flexible and amenable latent representation. With the model's embedding in hand, [6] applies an "uncertainty" threshold based on distances to class centroids to more accurately and robustly distinguish between the known and "novel" cocktail classes. We apply these methods to breast cancer patients' electronic health records (EHRs) from Northwestern Memorial Hospital. In doing so, this study takes a step toward implementing a system to help physicians identify cancer patients who may benefit from a novel drug cocktail in a real-time clinical setting. Our paper is organized as follows. In §2, we compare related work along with a comprehensive summary of the benchmark model [7]. In §3, we first provide background on the GMVAE model coupled with the "uncertainty" threshold [6].
In particular, we emphasize the intuition behind dual reconstruction-classification learning and "uncertainty" for open-set recognition. Next, we present the complete experimental design, from data feature engineering to model evaluation. Subsequently, in §4, we conduct open-set recognition experiments on our breast cancer patient dataset. From these experimental results, we stress two findings, which constitute our main contributions. First, GMVAE outperforms a state-of-the-art, solely classification-based, deep open-set classifier both in terms of accuracy and robustness to an increasing number of unknown cocktails. Second, relevant prior methods [2, 5, 7] bypass selecting a single optimal threshold for rejecting unknowns by reporting AUC or ROC metrics or simply assuming a binary known-unknown false positive rate. However, all of these are uninformative for actual model deployment, where a single threshold would be used for decisions. In contrast, GMVAE combined with "uncertainty" showcases an intuitive heuristic for selecting a single, optimal threshold. This process fits a threshold based on the known validation set classification accuracies and is further explained in §4. We emphasize this is a more practical model evaluation comparison. Summary ROC metrics can be useful in comparing different models in a holistic sense. However, in terms of real-world model usage in a clinical setting, it is more apt to compare actual decision accuracies, which are only apparent after choosing a threshold. Finally, in §5 and §6, we end with a discussion on limitations and future work, and conclude. For literature placement, it is important to note that open-set recognition reduces to outlier detection in the case where the number of known classes C = 1 (viewed as a "normal" class). Outlier detection, or the related novelty or anomaly detection, is a longer-studied topic [1, 8, 9].
Such methods are utilized in healthcare to detect outliers in breast cancer survivability predictions [10] and anomalous activity in EHRs [11]. Outlier detection, however, does not generally extend to differentiating between multiple known classes, hence the need for open-set recognition. For instance, in our breast cancer patient experiments there are three and four known cocktail classes. There is an immense body of existing work concerning traditional closed-set classification. Open-set recognition, on the other hand, has only relatively recently received more consideration. Earlier examples of (C + 1)-class classification employ SVMs [12, 13] or sparse representation [14]. Open-set recognition in conjunction with deep neural networks is the current trend [2, 15, 16, 17]. However, these methods are almost exclusively designed solely for image recognition; their network architectures' reliance on image patching, channel activation, spatial pooling, feature map modulation, and pixel reconstruction inhibits usability for non-image-based tasks (such as ours). While image classification benefits from a well-rounded surge in open-set recognition, applications to general healthcare data are wanting. Specifically, in [2], eye diseases are open-set classified using optical coherence tomography (OCT) images, but the method is contingent on a patchGAN-derived model [18] to generate synthetic, "boundary" images that are deliberately difficult to classify with a pretrained, closed-set softmax classifier. These manufactured outliers are then added to the original dataset and used to train a standard (C + 1)-class classifier. The multi-phase training, known complications of training GANs, and assumed image-based data limit the generalization of this work to other healthcare applications.
Furthermore, while the authors in [2] visualize the generated "unknown" class images to verify they are "different yet reasonable," it is not clear how to apply this criterion to the patient demographics or abnormal lab tests that comprise our data. Relatedly, [5] proposes framing medical diagnosis classification in terms of open-set recognition. Their method treats samples from less common conditions as a proxy for the unknown classes and maximizes their cross-entropy during classic softmax training. During inference, a simple threshold is applied to closed-set softmax probabilities to reject unknown samples. A shortcoming of this method is the restrictive assumption that one's dataset can afford a sufficiently large and representative residual subset. Indeed, in [5], the authors have a known training set of 160 diagnosis classes and a counterpart set of another 160 diagnoses (each with at least 10% of a training diagnosis's samples) to model the unknown classes. For our breast cancer patient dataset, there are orders-of-magnitude differences in the number of samples per cocktail, as well as a drug approval timeline. Both reasons render our residual samples inadequate and possibly time-inconsistent for such a procedure. In contrast, GMVAE naturally serves non-image data and entirely circumvents the need for artificial "unknown" or "novel" samples. Accordingly, for a compatible comparison, we benchmark GMVAE against the so-called ii-loss and outlier score method of [7]. This specific benchmark is also fitting because it (i) attains state-of-the-art open-set recognition accuracies on two non-image-based datasets, and (ii) makes use of similar latent space distance-based thresholding to reject "novel" samples. It is worth noting that [7] demonstrates that naive thresholding on closed-set softmax classifiers can lead to significantly poorer open-set recognition.
The ii-loss is still wholly classification-based and by contrast GMVAE has the advantage of dual classification-reconstruction learning. For completeness, we now summarize the ii-loss and outlier score. The authors in [7] argue that open-set recognition is most amenable in a data embedding that clusters samples from the same known class tightly together (low intra-spread) but pushes samples from different known classes far apart from each other (high inter-spread). To directly produce such a neural network mapping z of the data x, they minimize the following loss function:

$$\mathcal{L} = \frac{1}{N} \sum_{i=1}^{C} \sum_{x \in C_i} \|\mu_i - z(x)\|^2 \;-\; \min_{1 \le i < j \le C} \|\mu_i - \mu_j\|^2, \qquad (1)$$

where N is the total number of samples, $|C_i|$ is the number of samples in class $i = 1, \ldots, C$, and $\mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} z(x)$ is the centroid of each class i. The first term in (1) measures intra-spread and so aims to minimize the distance between each latent z and its own centroid. The second term in (1) quantifies inter-spread and seeks to maximize the minimum distance between class centroids. Batch normalization layers prevent this term from diverging to infinity. The neural network projection z(x) has no set architecture and can be composed of any architectural designs. With a trained latent representation z(x) in hand, the outlier score, or squared distance to the nearest centroid, is given by

$$\mathrm{OS}(x) = \min_{1 \le i \le C} \|\mu_i - z(x)\|^2$$

for a test sample x. Consequently, distances to centroids also naturally emit a softmax posterior class probability

$$P(y = i \mid x) = \frac{\exp\!\left(-\|\mu_i - z(x)\|^2\right)}{\sum_{j=1}^{C} \exp\!\left(-\|\mu_j - z(x)\|^2\right)}.$$

Finally, thresholding on the outlier score, the open-set prediction is

$$\hat{y}(x) = \begin{cases} \text{novel} & \text{if } \mathrm{OS}(x) > \tau, \\ \arg\max_i P(y = i \mid x) & \text{otherwise.} \end{cases}$$

We argue that a drawback of this entire procedure is the unsystematic, ad-hoc method of selecting the threshold τ. It is assumed that some percentage, a so-called contamination ratio α, of the training set are outliers. Correspondingly, the threshold τ is set to the (1 − α) percentile of all training outlier scores. In experiments, [7] finds that a 1% contamination ratio is broadly suitable. While this is certainly easily understood for the user, it lacks any guidance from the embedding clustering and simply follows from the early presumption.
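As a concrete illustration, the ii-loss, outlier score, and thresholded open-set prediction summarized above can be sketched in a few lines of numpy. This is our own illustration operating on precomputed embeddings, not the reference implementation of [7]:

```python
import numpy as np

def ii_loss(z, y, num_classes):
    """ii-loss: intra-spread minus inter-spread over a batch of embeddings.

    z: (N, d) array of latent embeddings; y: (N,) integer class labels."""
    centroids = np.stack([z[y == i].mean(axis=0) for i in range(num_classes)])
    # Intra-spread: mean squared distance of each embedding to its own class centroid.
    intra = np.mean(np.sum((z - centroids[y]) ** 2, axis=1))
    # Inter-spread: minimum squared distance between any pair of distinct centroids.
    inter = min(
        np.sum((centroids[i] - centroids[j]) ** 2)
        for i in range(num_classes)
        for j in range(i + 1, num_classes)
    )
    return intra - inter

def outlier_score(z_test, centroids):
    """Squared distance from a test embedding to its nearest class centroid."""
    return np.min(np.sum((centroids - z_test) ** 2, axis=1))

def open_set_predict(z_test, centroids, tau):
    """Predict the nearest known class, or the (C + 1)-th 'novel' class
    (returned as index C) if the outlier score exceeds the threshold tau."""
    d2 = np.sum((centroids - z_test) ** 2, axis=1)
    return int(np.argmin(d2)) if d2.min() <= tau else len(centroids)
```

In practice the loss would be minimized by gradient descent through the network z(x); the sketch only evaluates the quantities on fixed embeddings.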
In §4, we illustrate a more deliberate selection for GMVAE's "uncertainty" threshold τ. In this section we outline the complete methodology used for the open-set recognition of breast cancer treatments experiments. First, we describe the GMVAE model; second, each step of the dataset construction is detailed. Finally, we summarize model training and evaluation procedures. While the detailed derivation and technical mathematics of GMVAE and the "uncertainty" algorithm can be found in [6], we also briefly overview the model in the Appendix. Additionally, in [6], the authors spend a great deal of time justifying the need for a Gaussian mixture embedding per class for images, as well as a procedure for identifying the number of components. However, for our breast cancer dataset, initial assessments indicate that we cannot discern enough patient-encounter heterogeneity to warrant multiple components per drug cocktail treatment class. Accordingly, we utilize K = 1 (one cluster per class) for all experiments and simplify GMVAE to a single Gaussian prior for each class. The bulk of this section focuses on an intuitive understanding of GMVAE and "uncertainty" as it relates to the open-set recognition of breast cancer cocktail treatments. While GMVAE's structure is more intricate than a standard VAE, its essence can still be understood as the encoder-decoder composition.

Figure 1: One-dimensional depiction of GMVAE's class-based and reconstruction bottleneck for our breast cancer patients dataset. Neural network $q_{\phi_z}$ encodes patient encounter features x into learned embedding z. The class numbers (and colors) correspond to drug cocktail treatments prescribed for patients' encounters. Latent variable z is then used to reconstruct the original data.

The principal difference with unsupervised VAEs is that the latent, bottleneck layer cooperatively performs class-based clustering (clinically, this can be thought of as endophenotyping) and learns reconstruction.
We illustrate this duality in Figure 1. Patient encounter features x are projected by neural network $q_{\phi_z}$ to a latent space z of significantly fewer dimensions, hence the bottleneck. In z-space, the latent covering term of GMVAE's ELBO clusters drug cocktail treatment classes together, as depicted with class numbers and colors. However, these class clusters are translated and scaled by the reconstruction term (see Appendix), which promotes patient encounters with similar features being closer together, and vice versa. For example, drug cocktails 1 and 2 may be separated based on neutrophil levels, and this characteristic further discriminates these classes. Conversely, a subset of patients on drug cocktails 3 and 4 may share the same insurance, forcing the class clusters to overlap (shown as the yellow-green striped region). While this overlap may be seen as counterproductive, we believe one should not weight or select features to maximize class separation. The reason is that one does not know a priori which features will best separate the "novel" samples. Therefore, an embedding z that most accurately represents the data features will naturally separate those distinguishing "novel" samples. Finally, samples' representations z are used to reconstruct the features x via network $p_\theta$. Of course, this cooperative, multi-task learning occurs across the entire multi-dimensional z-space. It is important to note here that, because of GMVAE's tendency to overlap classes based on reconstruction, the ii-loss exhibits better discrimination among the known classes. However, this closed-set weakness becomes an open-set strength, as this behavior shrinks high-risk open space between the known clusters. GMVAE's encoder produces a latent space Gaussian (it outputs a mean and diagonal covariance), and the mean $\mu(x; \phi_z)$ is naturally designated as the effective z-space mapping.
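As a minimal illustration of this encoder output, the final layer mapping hidden features to a diagonal Gaussian can be sketched as follows. This is a hypothetical numpy stand-in with made-up weight shapes, not the paper's architecture (which appears in Table 5):

```python
import numpy as np

def encoder_head(h, W_mu, b_mu, W_logvar, b_logvar):
    """Map the last hidden layer h to a diagonal latent-space Gaussian.
    The mean is the effective z-space embedding mu(x; phi_z)."""
    mu = h @ W_mu + b_mu
    var = np.exp(h @ W_logvar + b_logvar)  # diagonal covariance entries, kept positive
    return mu, var

def reparameterize(mu, var, rng):
    """Sample z = mu + sigma * eps, as used by the reconstruction pathway in training."""
    return mu + np.sqrt(var) * rng.standard_normal(mu.shape)
```

At inference time only the mean is kept, so distances to class centroids are measured in a deterministic embedding.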
Similar to the outlier score, [6] then applies a distance measure to carve the "novel" decision boundaries around each known centroid. The "uncertainty" threshold quantity is defined as the ratio between the distance to the nearest centroid and the average distance to all other centroids. For our K = 1 case with test sample x, let us denote $z_c$ as each known class's training latent centroid and $c^* = \arg\min_c \|\mu(x; \phi_z) - z_c\|_2$. Then "uncertainty" U is mathematically expressed as

$$U(x) = \frac{\|\mu(x; \phi_z) - z_{c^*}\|_2}{\frac{1}{C-1} \sum_{c \ne c^*} \|\mu(x; \phi_z) - z_c\|_2}.$$

The key differences are that this threshold captures orientation with respect to known centroids (unlike the rotationally symmetric outlier score) and is scale invariant. We visualize these attributes in Figure 2. For non-trivial open-set recognition, we may assume the "unknown" or "novel" samples are comparable to the known classes. As such, there exists a large risk of incorrectly predicting a "novel" sample as one of the known classes. Thresholding upon U seeks to minimize this risk by penalizing the open space between known centroids more heavily, as perceived in Figure 2. If U = 0 then the test sample's latent embedding is exactly one of the known centroids with no doubt of its classification. However, if U = 1 then the test sample's embedding is equidistant to all known centroids and is unclassifiable among the known classes. In addition, U approaches 1 if the test sample's embedding is sufficiently far from all known centroids. Finally, the metric U is designed to be standardized between 0 and 1, making it more universally applicable as opposed to the outlier score's raw distance. The dataset consists of breast cancer patient records at Northwestern Memorial Hospital spanning from 2000 to 2015. We consider patient encounters as independent samples.
While this removes the longitudinal aspect of the data, it allows a more direct application of existing, timeless open-set recognition methods and also matches the cancer treatment narrative, because physicians can adjust drug cocktail treatments as patients respond differently or side effects flare. The samples' classes are specific drug cocktails and are assembled by simply aggregating the prescribed medications for each patient encounter. We only include drugs principally related to treating cancer (as listed by the National Cancer Institute). For the purposes of our breast cancer-related task, drugs like Acetaminophen are extraneous and therefore excluded. In addition, we ultimately retain only those cocktails with more than 1,000 encounters to maintain reasonably sized classes. Table 1 below summarizes the drug cocktail classes of our dataset. Each cocktail's FDA approval year is set to its latest component drug's FDA approval year. Perhaps unsurprisingly, most encounter-level cocktails are composed of only a single drug. Even so, we refer to these as cocktails for convenience. Phenotypic feature engineering for the patient encounters is relatively straightforward, with few transformations. We enumerate and categorize demographic and physical characteristic features in Table 2, diagnoses (ICD-9 codes) in Table 3, and lab features in Table 4. All demographics except "age at encounter" are the same for a single patient across encounters. Physical characteristics, having little variance, are averaged across encounters for each patient. In addition, we cut off ICD-9 codes with fewer than 1,000 total encounters so that the retained diagnoses are relevant to the dataset as a whole. After physical characteristic averaging, the lab results are the only remaining features with missing values. The initial missing rates for lab results, resulting from a naive encounter-based merge between cocktails and lab results, are as great as 94.9%.
While this is very large, it is consistent with our understanding of patient encounters. Physicians often do not re-order lab tests if there is no reason to expect a change in results. Accordingly, our first-pass data imputation is to carry lab values forward across all encounters (even those where a cancer drug was not prescribed). After this procedure, nearly all of the missing rates for cocktail-prescribed encounters' lab results fall drastically. We list these missing rates for each lab in Table 4. As the final step, we use MICE [19] to impute all outstanding missing lab results. We run five MICE trials and average the results to create the final, holistic dataset. To conduct the open-set recognition experiments, we must regard a subset of the breast cancer drug cocktails as the "novel" class. Novel drug development follows an obvious chronology (hence the FDA approval years in Table 1), and so we designate the more recent cocktails as "novel." For the purposes of considering well-balanced "known" and "novel" class splits (in terms of both the number of classes and samples), we conduct two separate experiments. In the first experiment, we designate cocktails 1, 2, 3, and 4 as "known" and cocktails 5, 6, and 7 as "novel" (cocktail numbers in Table 1). The second experiment has cocktails 1, 2, and 3 as "known" and cocktails 4, 5, 6, and 7 as "novel." More details are given in the respective experimental results subsections. Finally, for a model-ready dataset, the training, validation, and testing sets are created as follows. The training set is composed of 2/3 of each "known" cocktail's samples. The validation set is composed of 1/6 of each "known" cocktail's samples. These two sets have no novel samples. Finally, the test set is composed of 1/6 of each "known" cocktail's samples and a random subset of size 1/6 of each "novel" cocktail's samples. In this way, the class balances of each split reflect the population.
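The split construction just described can be sketched as follows; this is a minimal numpy illustration with function names of our own choosing, not the study's code:

```python
import numpy as np

def split_known(indices, rng):
    """Split one 'known' cocktail's sample indices into 2/3 train, 1/6 validation,
    and 1/6 test, so each split reflects the population class balance."""
    idx = rng.permutation(indices)
    n_train, n_val = (2 * len(idx)) // 3, len(idx) // 6
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

def novel_test_subset(indices, rng):
    """Draw a random 1/6 subset of a 'novel' cocktail's samples (without
    replacement) to append to one test set; redrawing yields further test sets."""
    return rng.choice(indices, size=len(indices) // 6, replace=False)
```

Applying `split_known` per known cocktail and `novel_test_subset` per novel cocktail, repeatedly, produces the resampled test sets used for the minimum-to-maximum evaluation intervals.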
We create 100 test sets by sampling without replacement within each "novel" cocktail. (There are no repeated "novel" samples within a test set, but there are across test sets.) Accordingly, we can present test evaluation minimum-to-maximum intervals. For neural network inputs, the numerical features are normalized and the categorical features are one-hot encoded. We then minimize the loss over the training set (using the Adam optimizer with learning rate 0.001) until the objective, evaluated on the known validation set, plateaus or begins to increase. For the numerical features, a Gaussian models the reconstruction. The latent space dimension of z equals 10 for both GMVAE and the ii-loss benchmark. A table of network architectures for GMVAE is presented in Table 5. The θ network mirrors the $\phi_z$ network. For GMVAE, sigmoid activations follow each hidden layer. The $\phi_z$ network is pretrained on the known classes and the respective weights are then frozen. The z network for the ii-loss benchmark has the same fully connected layers as GMVAE's $\phi_z$ network, except with ReLU activations, dropout, and batch normalization layers. We follow the implementation of [7], where further details can be found. From the experimental results given next, we clearly demonstrate that GMVAE outperforms the state-of-the-art, deep open-set classifier based on the ii-loss and outlier score, both in terms of accuracy and robustness to an increasing number of novel cocktails (and samples). We attribute this to two primary reasons. First, the GMVAE model also considers reconstruction, which captures additional data structure information as well as classifier information. Second, GMVAE is more deliberate in algorithmically selecting an "uncertainty" threshold τ based on the known validation set. Indeed, the outlier score in [7] does not even necessitate a validation set.
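The input preparation described above, normalizing numerical features and one-hot encoding categorical ones, can be sketched with two small helpers (our illustration, not the paper's code):

```python
import numpy as np

def one_hot(column, levels):
    """One-hot encode a categorical column given the list of its possible levels."""
    index = {level: i for i, level in enumerate(levels)}
    encoded = np.zeros((len(column), len(levels)))
    for row, value in enumerate(column):
        encoded[row, index[value]] = 1.0
    return encoded

def standardize(features, mean=None, std=None):
    """Z-score numerical features; statistics are fit on the training split and
    reused unchanged for the validation and test splits."""
    mean = features.mean(axis=0) if mean is None else mean
    std = features.std(axis=0) if std is None else std
    return (features - mean) / np.where(std == 0, 1.0, std), mean, std
```

Fitting the normalization statistics on the training split only, then reusing them, avoids leaking validation and test information into the model inputs.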
For this threshold selection, as well as model comparison, we utilize the macro-averaged F1 score as our accuracy measure to account for class imbalance. For this first experiment, we divide the cocktails according to Table 6. While the "known" and "novel" splits have a similar number of cocktails, here we are considering a scenario in which there are more "known" samples: there are approximately twice as many samples in the "known" cocktails as in the "novel" cocktails. To contextualize this particular split, we can imagine we are in the year 2000. Can we identify whether a patient encounter should receive one of the four "known" cocktails (and which one), or whether a "novel" cocktail should be prescribed? The "novel" class (composed of three cocktails) is an indication that these encounters are opportune for a new, original drug treatment. In reality, we must acknowledge that our study's experimental design is only a proxy for this scenario. We did not restrict samples to this timeline, and so there may be "known" cocktail patient encounters after, say, 2003 that actually could have been prescribed one of the "novel" cocktails. Such is an inherent limitation of retrospective data used for simulation. Given our already small dataset, we lack the samples prior to 2000 to implement this true timeline. However, our study still captures the relevant, timeless task of classifying which patient encounters should be prescribed an existing versus a novel cocktail. As per the authors of [7], the outlier score's threshold τ corresponds to a 1% contamination rate. We now detail our procedure for selecting the "uncertainty" threshold. Plotted in Figure 3 are the known validation F1 scores versus τ for GMVAE's U quantity. Work in [6] deduces (and empirically observes) that a consistently good threshold τ for GMVAE's "uncertainty" is the saturation or plateau point of the known validation F1 curve. This is plotted with the red dashed line in Figure 3.
Intuitively, this can be thought of as enlarging the decision boundary around each class's centroid until classification accuracy shows diminishing returns. Further increasing τ is tantamount to overfitting the known validation samples and risks under-recognizing "novel" samples. Mathematically, we define this saturation point in the following way. Letting $\tilde{\tau}$ denote the threshold at which the initial rise of the known validation F1 curve ends (determined by $\epsilon_1$), the selected threshold is set to the saturation point

$$\tau^* = \min\{\tau : \tau > \tilde{\tau} \text{ and } \mathrm{F1}'(\tau) \le \epsilon_2\}.$$

For this experiment, we use $\epsilon_1 = 1$ and $\epsilon_2 = 0.25$ and approximate $\mathrm{F1}'(\tau)$ using the simple forward-difference scheme. With the selected threshold $\tau^*$ for "uncertainty," we proceed to the testing phase with "novel" cocktails. To study robustness to increasing "novel" samples, as well as accuracy, we incrementally increase the number of "novel" cocktails (according to the order in Table 6) and measure F1 scores. These test F1 scores versus the number of "novel" cocktails are plotted in Figure 4. As previously discussed, GMVAE is not as accurate in the closed-set regime with no "novel" samples because the ii-loss more directly optimizes known-class discrimination. However, GMVAE and "uncertainty" (GMVAE + U) quickly outperform ii-loss and outlier score (ii-loss + OS) with the introduction of "novel" cocktails. In addition, we clearly see that GMVAE's method remains more robust to an increasing number of "novel" cocktails and samples, while the benchmark outlier score's accuracy continuously diminishes. This observation is magnified in the right panel of Figure 4. Averaging over the number of "novel" cocktails, GMVAE leads to a 14.4% increase in the F1 score. Again, we attribute this increased accuracy and robustness to GMVAE's reconstruction learning and "uncertainty's" penalization of interior latent representations relative to the known clusters. From [6], we surmise the latter's effect is substantial for more homogeneous, difficult-to-discriminate samples. This is certainly true of healthcare data.
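To make this procedure concrete, the "uncertainty" quantity, the saturation-point threshold rule, and the resulting (C + 1)-class decision can be sketched in a few lines of numpy. This is our own illustration operating on precomputed centroids and a precomputed validation F1 curve, not the authors' code; the starting threshold `tau_tilde` (set via $\epsilon_1$ in the text) is passed in as given:

```python
import numpy as np

def uncertainty(mu_x, centroids):
    """U: distance to the nearest known centroid divided by the mean distance to
    the remaining centroids (0 at a centroid; approaches 1 if ambiguous or far)."""
    d = np.linalg.norm(centroids - mu_x, axis=1)
    c_star = int(np.argmin(d))
    return d[c_star] / np.delete(d, c_star).mean(), c_star

def saturation_threshold(taus, f1s, tau_tilde, eps2=0.25):
    """tau* = min{tau : tau > tau_tilde and F1'(tau) <= eps2}, where F1'(tau)
    is approximated by forward differences on the known validation F1 curve."""
    slope = np.diff(f1s) / np.diff(taus)
    for i, tau in enumerate(taus[:-1]):
        if tau > tau_tilde and slope[i] <= eps2:
            return tau
    return taus[-1]

def predict(mu_x, centroids, tau_star):
    """(C + 1)-class prediction: the nearest known class, or C for 'novel'."""
    u, c_star = uncertainty(mu_x, centroids)
    return c_star if u <= tau_star else len(centroids)
```

Because only known-class validation samples are available, the rule searches the F1-versus-τ curve for the point where further loosening the boundary stops paying off, then fixes that single τ* for all test decisions.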
While F1 scores paint a broad picture of classification accuracy, the confusion matrices in Table 7 allow a closer inspection of prediction ability on an individual cocktail basis. We clearly see that ii-loss + OS is more accurate in the closed-set regime. In particular, cocktails 3 and 4 are classified very accurately. However, this comes at the expense of severely under-recognizing "novel" cocktails. The ii-loss + OS model rarely predicts "novel," leading to a dramatic decrease in overall open-set classification accuracy. No doubt this is due, in part, to ii-loss solely optimizing known cocktail discrimination with no regard for capturing underlying feature information. Conversely, GMVAE is less accurate in the closed-set regime but more readily recognizes "novel" cocktails. The main point these confusion matrices demonstrate is that there is a tradeoff between accurately classifying the known classes and robustly identifying novel or unknown classes. This is evidenced by GMVAE over-classifying the known cocktails as "novel." Finally, we wish to further address the threshold selection. To alleviate concerns that open-set accuracies are overly sensitive to threshold selection, we plot the test F1 scores versus thresholds in neighborhoods of α and τ* for ii-loss + OS and GMVAE, respectively, in Figure 5. We consider a 10-percentage-point neighborhood so that α ∈ [0, 0.1] and τ* ∈ [0.64, 0.74]. We clearly see that GMVAE's F1 scores are more nearly constant with respect to τ* than ii-loss + OS's F1 is with respect to α. This translates to GMVAE being more robust to threshold-selection error and, relatedly, to its underlying embedding having a stronger ability to distinguish "novel" cocktails. For this second experiment, we divide the cocktails according to Table 8.
While the "known" and "novel" splits again have a similar number of cocktails as in the previous experiment, here we increase the number of "novel" cocktails by one and consider the different scenario in which there are more "novel" samples: there are approximately 1.5 times as many samples in the "novel" cocktails as in the "known" cocktails. The goal of this second experiment is to reiterate GMVAE's success while varying the number of "known" cocktails and having a qualitatively different "known"-to-"novel" samples ratio. Figures 6-8 and Table 9, which parallel those of the first experiment, show just this. Figure 6 again plots the known validation F1 scores versus τ for GMVAE's U quantity, with the saturation point and correspondingly chosen threshold τ* plotted in red. Again, we incrementally increase the number of "novel" cocktails (according to the order in Table 8) and plot the F1 scores in Figure 7. The behavior is qualitatively the same as in the first experiment. The ii-loss + OS is more accurate with just the "known" cocktails, as it directly optimizes class separation. However, GMVAE + U yields much higher open-set classification accuracies for increasing "novel" cocktails and samples. In fact, averaging over the number of "novel" cocktails in Figure 7's right panel, GMVAE + U leads to an average F1 increase of 34.6%. We stress that this increased open-set recognition stems from GMVAE's reconstruction learning and "uncertainty" leading to better discernment of "novel" cocktails. This is made clear in the confusion matrices in Table 9. The ii-loss + OS's z embedding only captures class information, and it is therefore difficult to tease out "novel" cocktail information from it. To again show robustness to threshold selection, we plot the test F1 scores versus thresholds in neighborhoods of α and τ* for ii-loss + OS and GMVAE, respectively, in Figure 8. We consider a 10-percentage-point neighborhood so that α ∈ [0, 0.1] and τ* ∈ [0.51, 0.61].
Here we clearly see that both the ii-loss + OS's and GMVAE's F1 scores are relatively constant with respect to α and τ*, respectively. However, the difference in F1 is stark, with GMVAE dominating.

While the experimental results highlight GMVAE's capability, we wish to stress again the importance of methodically selecting a single threshold for rejecting "novel" samples in testing. Previous open-set experiments [2, 5, 7] sidestep this consideration by comparing high-level metrics like AUC. While this may give an indication of overall behavior, it does little to inform actual model usage in a practical setting. GMVAE's validation F1 curve saturation procedure begins to address this decision-boundary optimization without "novel" samples. However, it is still an ad-hoc heuristic worthy of further development. A satisfying solution to this subproblem is critical for real applications to a current patient's treatment plan.

Additional considerations in our application of open-set recognition to drug treatment predictions are more subtle. While we achieve more accurate results in the experiments above, an F1 score below 0.5 leaves room for improvement. Prior work and our own experience suggest there exists a tradeoff between closed-set classification and open-set recognition. In other words, it is natural to expect a compromise between accurately classifying the "known" classes and robustly identifying "novel" or "unknown" classes. Herein lies the issue: because real healthcare data are generally harder to discriminate than academic image datasets, we start at a severe disadvantage in separating the known cocktails. We visualize this with a t-SNE [20] plot of GMVAE's latent embedding from the first experiment in Figure 9. It indicates a high degree of feature "overlap" among the cocktails, making it difficult to distinguish patient encounters.
This is likely expected given that we utilize only the phenotypic features available in the dataset, but it may also reflect the nature of patients potentially benefiting from multiple cocktails. In that respect, open-set recognition within a multi-label setting is a natural extension we may pursue in future work. Lastly, drug treatment predictions, at least in our cancer context, perhaps necessitate a longitudinal study. One would like to track multiple encounters for a given patient and be able to make breast cancer drug treatment recommendations along the way. With this framework, efficacy feedback from past treatments could provide significant guidance. While this would require a complete overhaul, from data collection to modifying the model to accept variable-length time series, it certainly represents an obvious and exciting expansion. Relatedly, this immediately suggests adding multi-modal data, such as mammogram images and the text of physician notes, in a recurrent neural network model structure.

We formulate breast cancer drug cocktail treatment predictions in terms of open-set recognition, focusing on a methodology conducive to practical clinical implementation, and accordingly apply the GMVAE and "uncertainty" model framework. Together, these achieve more accurate and robust classification results for our patient-encounter healthcare dataset compared to a state-of-the-art benchmark. In doing so, we also resolve obstacles in prior work concerning open-set recognition applications to healthcare. First, we emphasize formally addressing the methodical selection of a specific threshold for rejecting "novel" or "unknown" samples, as we believe it is more meaningful in deployment to compare and use models with a single testing instance. Second, many other works on this subject take advantage of fabricated or auxiliary data to model "novel" or "unknown" samples.
We dismiss the implicit assumption that this step is always feasible and instead call for methods like GMVAE, which learn only from known, available data. Finally, we spotlight the inherent limitations of purely classification-based models in open-set recognition. Whether through reconstruction or otherwise, latent representations must encapsulate structural information about the data to be more effective. To be sure, this particular application to healthcare opens interesting avenues of further research within the expanding scope of open-set recognition. Likewise, we hope this study represents a stride toward these techniques benefiting actual patients' treatments in the future.

References
[1] Recent advances in open set recognition: A survey.
[2] Open-set OCT image recognition with synthetic learning.
[3] Defining phenotypes in asthma: A step towards personalized medicine.
[4] Rational cancer treatment combinations: An urgent clinical need.
[5] Open set medical diagnosis.
[6] Open-set recognition with Gaussian mixture variational autoencoders.
[7] Learning a neural-network-based representation for open set recognition.
[8] Anomaly detection with robust deep autoencoders.
[9] Deep anomaly detection with outlier exposure.
[10] Support vector machine for outlier detection in breast cancer survivability prediction.
[11] Density-based outlier detection for safeguarding electronic patient record systems.
[12] Towards open set recognition.
[13] Multi-class open set recognition using probability of inclusion.
[14] Sparse representation-based open set recognition.
[15] Towards open set deep networks.
[16] Classification-reconstruction learning for open-set recognition.
[17] C2AE: Class conditioned auto-encoder for open-set recognition.
[18] Image-to-image translation with conditional adversarial networks.
[19] MICE: Multivariate imputation by chained equations in R.
[20] Visualizing data using t-SNE.

Acknowledgments
The work of the first author is supported by the Predoctoral Training Program in Biomedical Data Driven Discovery (BD3) at Northwestern University (National Library of Medicine Grant 5T32LM012203). The work of the last author is supported in part by NIH Grant R01LM013337. The authors declare there are no conflicts of interest.

Appendix: Background for GMVAE and "uncertainty" algorithm
In this Appendix, we briefly overview GMVAE, which extends standard unsupervised VAEs by assuming a Gaussian mixture prior for each class. To accommodate this, the basic VAE architecture is modified with additional latent variables. [6] starts with $C$ known classes, with each class $c = 1, 2, \ldots, C$ composed of $K_c$ mixture components. The features $x \in \mathbb{R}^d$ and labels $y \in \mathbb{R}^C$, represented as one-hot vectors, comprise the labeled, known dataset. The vector $K = (K_1, K_2, \ldots, K_C)$ is decided by the user, and so the ELBO dependence on $K$ is made explicit.

GMVAE's decoder model conditions on the class and factors as
$$p_{\beta,\theta}(x, v, w, z \mid y) = p_\theta(x \mid z)\, p_\beta(z \mid w, y, v)\, p(w)\, p(v \mid y),$$
where
$$p_\beta(z \mid w, y, v) = \prod_{c=1}^{C} \prod_{k=1}^{K_c} \mathcal{N}\big(z \mid \mu_{ck}(w; \beta), \operatorname{diag}(\sigma^2_{ck}(w; \beta))\big)^{y_c v_{ck}},$$
$p_\theta(x \mid z)$ has mean $\mu(z; \theta)$, and $\mu_{ck}(\cdot\,; \beta)$, $\sigma^2_{ck}(\cdot\,; \beta)$, and $\mu(\cdot\,; \theta)$ are neural networks parametrized by $\beta$ and $\theta$, respectively. It is common to assume a uniform prior $\pi(y)$.

The encoder process is factorized as
$$q_\phi(v, w, z \mid x, y) = p_\beta(v \mid z, w, y)\, q_{\phi_w}(w \mid x, y)\, q_{\phi_z}(z \mid x),$$
where $\phi = (\phi_z, \phi_w)$. The factors $q_{\phi_w}$ and $q_{\phi_z}$ are parametrized with networks that output the mean and diagonal covariance of Gaussian posteriors:
$$q_{\phi_w}(w \mid x, y) = \mathcal{N}\big(w \mid \mu_{\phi_w}(x, y), \operatorname{diag}(\sigma^2_{\phi_w}(x, y))\big), \qquad q_{\phi_z}(z \mid x) = \mathcal{N}\big(z \mid \mu_{\phi_z}(x), \operatorname{diag}(\sigma^2_{\phi_z}(x))\big).$$
There is a $p_\beta$ factor in the $q_\phi$ factorization because it is derived from the generative factors (see [6]).

GMVAE's objective is to maximize the log-evidence lower bound (ELBO) given by
$$\begin{aligned}
\mathcal{L}_{\mathrm{ELBO}}(K) ={}& \mathbb{E}_{q_{\phi_z}(z \mid x)}\big[\log p_\theta(x \mid z)\big] && \text{(reconstruction)} \\
&- \mathbb{E}_{q_{\phi_w}(w \mid x, y)\, p_\beta(v \mid z, w, y)}\big[\mathrm{KL}\big(q_{\phi_z}(z \mid x) \,\|\, p_\beta(z \mid w, y, v)\big)\big] && \text{(latent covering)} \\
&- \mathrm{KL}\big(q_{\phi_w}(w \mid x, y) \,\|\, p(w)\big) && \text{(w-prior)} \\
&- \mathbb{E}_{q_{\phi_w}(w \mid x, y)\, q_{\phi_z}(z \mid x)}\big[\mathrm{KL}\big(p_\beta(v \mid z, w, y) \,\|\, p(v \mid y)\big)\big] && \text{(component v-prior)}.
\end{aligned}$$
The reconstruction term endeavors to group samples with similar features together in the latent space $z$. Simultaneously, the latent covering term attempts to cluster the latent representations $z$ by class. The w-prior and component v-prior terms aim for the respective posteriors and priors to coincide. This mirrors the standard ELBO with its reconstruction and regularization terms.
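As a concrete illustration of the ELBO's regularization terms, the following minimal Python sketch computes the closed-form KL divergences for the diagonal-Gaussian and categorical factors above and assembles the four-term objective. All names are ours, and the Monte Carlo expectations over w, v, and z are abstracted away as precomputed scalar inputs; this is a pedagogical sketch, not the paper's implementation.

```python
import numpy as np

def kl_diag_gauss(mu0, var0, mu1, var1):
    """KL( N(mu0, diag var0) || N(mu1, diag var1) ) in closed form,
    as used for the latent covering term between q_{phi_z} and p_beta."""
    return float(0.5 * np.sum(np.log(var1 / var0)
                              + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0))

def kl_standard_normal(mu, var):
    """KL against the standard normal N(0, I), i.e., the w-prior term."""
    return kl_diag_gauss(mu, var, np.zeros_like(mu), np.ones_like(var))

def kl_categorical(p, q):
    """KL between categorical distributions, i.e., the component v-prior
    term between p_beta(v|z,w,y) and p(v|y)."""
    p = np.clip(p, 1e-12, 1.0)
    return float(np.sum(p * np.log(p / q)))

def elbo(recon_loglik, kl_latent_cover, kl_w, kl_v):
    """Assemble the four-term ELBO: reconstruction minus the three
    (expected) KL regularizers."""
    return recon_loglik - kl_latent_cover - kl_w - kl_v
```

In training, `recon_loglik` and the two expected KL terms would be single-sample Monte Carlo estimates via the reparametrization trick, while the w-prior KL is available in closed form per sample.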