key: cord-0233293-grqy9ac4 authors: Alaa, Ahmed M.; Breugel, Boris van; Saveliev, Evgeny; Schaar, Mihaela van der title: How Faithful is your Synthetic Data? Sample-level Metrics for Evaluating and Auditing Generative Models date: 2021-02-17 journal: nan DOI: nan sha: d9c381fec7572b273e1214cfcc238d7d92a75790 doc_id: 233293 cord_uid: grqy9ac4 Devising domain- and model-agnostic evaluation metrics for generative models is an important and as yet unresolved problem. Most existing metrics, which were tailored solely to the image synthesis setup, exhibit a limited capacity for diagnosing the different modes of failure of generative models across broader application domains. In this paper, we introduce a 3-dimensional evaluation metric, (α-Precision, β-Recall, Authenticity), that characterizes the fidelity, diversity and generalization performance of any generative model in a domain-agnostic fashion. Our metric unifies statistical divergence measures with precision-recall analysis, enabling sample- and distribution-level diagnoses of model fidelity and diversity. We introduce generalization as an additional, independent dimension (to the fidelity-diversity trade-off) that quantifies the extent to which a model copies training data -- a crucial performance indicator when modeling sensitive data with requirements on privacy. The three metric components correspond to (interpretable) probabilistic quantities, and are estimated via sample-level binary classification. The sample-level nature of our metric inspires a novel use case which we call model auditing, wherein we judge the quality of individual samples generated by a (black-box) model, discarding low-quality samples and hence improving the overall model performance in a post-hoc manner.

Intuitively, it would seem that evaluating the likelihood function of a generative model is all it takes to assess its performance. As it turns out, the problem of evaluating generative models is far more complicated. This is not only because state-of-the-art models, such as Variational Autoencoders (VAE) (Kingma & Welling, 2013) and Generative Adversarial Networks (GANs) (Goodfellow et al., 2014), do not possess tractable likelihood functions, but also because the log-likelihood score itself (or equivalently, statistical divergence) is a flawed measure of performance: it scales badly in high dimensions, and it conflates distinct modes of model failure into a single uninterpretable number (Theis et al., 2015). Absent objective domain-agnostic metrics, previous works focused on crafting domain-specific evaluation scores, e.g., the Inception score (Salimans et al., 2016), with an almost-exclusive emphasis on image data.

Figure 1. Pictorial depiction of the α-Precision, β-Recall and Authenticity metrics. Blue and red spheres correspond to the α- and β-supports of the real and generative distributions, respectively. Blue and red points correspond to real and synthetic data. (a) Synthetic samples falling outside the blue sphere will look unrealistic or noisy. (b) Overfitted models can generate ostensibly high-quality samples that are "unauthentic" because they are copied from the training data. (c) High-quality samples should reside in the blue sphere. (d) Outliers do not count in the β-Recall metric. (Here, α = β = 0.9, α-Precision = 8/9, β-Recall = 4/9, Authenticity = 9/10.)
In this paper, we introduce an alternative approach to evaluating generative models, where instead of assessing the generative distribution by looking at all synthetic samples collectively to compute a likelihood or statistical divergence, we classify each sample individually as being of high or low quality. In this way, our metric comprises interpretable probabilistic quantities, resembling those used to evaluate discriminative models (e.g., accuracy, AUC-ROC, F1 scores, etc.), which describe the rates by which a model makes different kinds of errors. When aggregated over all samples, our sample-level scores reflect the discrepancy between the real and generative distributions in a way similar to statistical divergence (or distance) measures such as the KL divergence, Frechet Inception distance (Heusel et al., 2017), or maximum mean discrepancy (Sutherland et al., 2016). In this sense, our metric enables diagnosing a model's performance on both the sample and distribution levels.

But what exactly does our metric measure? We represent the performance of a generative model as a point in a three-dimensional space; each dimension corresponds to an independent quality of the model. These qualities are: Fidelity, Diversity and Generalization. Fidelity corresponds to the quality of a model's synthetic samples, Diversity is the extent to which these samples cover the full variability of the real samples, and Generalization quantifies the extent to which a model overfits (copies) the training data. We introduce the α-Precision and β-Recall metrics to quantify model Fidelity and Diversity, respectively. Both metrics assume that a fraction 1 − α (or 1 − β) of the real (and synthetic) data are "outliers", and a fraction α (or β) are "typical". α-Precision is the fraction of synthetic samples that resemble the "most typical" α real samples, whereas β-Recall is the fraction of real samples covered by the most typical β synthetic samples. α-Precision and β-Recall are evaluated for all α, β ∈ [0, 1], providing entire precision and recall curves instead of single numbers. To compute both metrics, we embed the (real and synthetic) data into hyperspheres with most samples concentrated around the centers, i.e., the real and generative distributions (P_r and P_g) have spherical supports. Typical samples are located near the centers, whereas outliers are close to the boundaries. To quantify Generalization, we introduce the Authenticity metric, which captures the probability that a synthetic sample is invented by the model rather than copied from the training data. We implement Authenticity as a hypothesis test for data copying based on the observed proximity of synthetic samples to real ones in the embedded feature space. A pictorial illustration of all metrics is shown in Figure 1.

How is our metric different? If one thinks of standard precision and recall metrics as "hard" binary classifiers of real and synthetic samples, our α-Precision and β-Recall can be thought of as soft-boundary classifiers that compare not only the supports of P_r and P_g, but also assess whether both distributions are calibrated. Precision and recall metrics are special cases of α-Precision and β-Recall for α = β = 1. As we show later, our new metric definitions solve many of the drawbacks of standard precision-recall analysis, such as lack of robustness to outliers and failure to detect distributional mismatches (Naeem et al., 2020). They also enable detailed diagnostics of different types of model failure, such as mode collapse and mode invention.
Moreover, optimal values of our metrics are achieved only when P_r and P_g are identical, thereby eliminating the need to augment the evaluation procedure with measures of statistical divergence. While previous works relied on pre-trained embeddings (using ImageNet feature extractors (Deng et al., 2009)), our feature embeddings are model- and domain-agnostic, and are tailored to our metric definitions and the data set at hand. To the best of our knowledge, this is the first work where the feature embedding step is bespoke to the data set at hand and meaningfully integrated into the model evaluation pipeline.

Overfitting is a crucial mode of failure of generative models, especially when modeling sensitive data (e.g., clinical data) for which data copying may violate privacy requirements (Yoon et al., 2020), but it has been overlooked in previous works, which focused exclusively on quantifying the Fidelity-Diversity tradeoff (Brock et al., 2018). As we show in our experiments (Section 5), because our metric accounts for Generalization, it can provide a fuller picture of a generative model's performance. Precisely, we show that some celebrated generative models score highly for Fidelity and Diversity simply because they memorize real samples, rendering them inappropriate for privacy-sensitive applications. A comprehensive survey of prior work, along with a detailed discussion on how our metric relates to existing ones, is provided in the Supplementary material.

Model auditing as a novel use case. In addition to evaluating and comparing models, the sample-level nature of our metrics inspires the new use case of model auditing, wherein we judge individual synthetic samples by their quality, and reject samples that have low Fidelity or are unauthentic. We show that model audits can indeed improve the outputs of a black-box model in a post-hoc fashion without any modifications to the model itself. In Section 5, we demonstrate the utility of model auditing in synthesizing clinical data.

We denote the real and generated data as X_r ∼ P_r and X_g ∼ P_g, respectively, where X_r, X_g ∈ X, with P_r and P_g being the real and generative distributions, and X being the input space. The generative distribution, P_g, is estimated (explicitly or implicitly) using a generative model (e.g., a GAN or a VAE). The real and synthetic data sets are D_real = {X_{r,i}}_{i=1}^{n} and D_synth = {X_{g,j}}_{j=1}^{m}, where X_{r,i} ∼ P_r and X_{g,j} ∼ P_g (i.i.d.). In the rest of the paper, we drop the subscripts i and j unless necessary for clarity. Our goal is to construct a metric E(D_real, D_synth) that measures the quality of D_synth in order to (i) evaluate the performance of the underlying generative model P_g, and (ii) audit the model outputs by discarding (individual) "low-quality" samples, thereby improving the overall quality of D_synth. In order for the metric E to fulfill the evaluation and auditing tasks, it must satisfy the following desiderata: (1) it should be able to disentangle the different modes of failure of P_g through interpretable measures of performance, and (2) it should be sample-wise computable, i.e., we should be able to tell if any given (individual) sample X_g ∼ P_g from the generative model is of low quality. Having outlined the desiderata for our sought-after evaluation metric, we now propose three qualities of synthetic data that the metric E should be able to quantify. Failure to fulfill any of these three qualities corresponds to an independent mode of failure of the model P_g.
These qualities are: 1. Fidelity: the generated samples resemble real samples from P_r. A high-fidelity synthetic data set should contain "realistic" samples, e.g., visually-realistic images. 2. Diversity: the generated samples are diverse enough to cover the variability of real data, i.e., a model should be able to generate a wide variety of good samples. 3. Generalization: the generated samples should not be mere copies of the (real) samples in the training data, i.e., models that overfit to D_real are not truly "generative".

In Section 3, we propose a three-dimensional evaluation metric E that captures all of the qualities above. Our proposed metric can be succinctly described as follows:

E(D_real, D_synth) = (α-Precision, β-Recall, Authenticity), α, β ∈ [0, 1].    (1)

The α-Precision and β-Recall metrics are generalizations of the conventional notions of precision and recall used in binary classification analysis (Flach & Kull, 2015). Precision measures the rate by which the model synthesizes "realistic-looking" samples, whereas recall measures the fraction of real samples that are covered by P_g. The authenticity score measures the fraction of synthetic samples that are invented by the model and not copied from the training data.

Having provided a bird's-eye view of our proposed metric E, we now briefly summarize the steps involved in the evaluation and auditing tasks. Since statistical comparisons of complex data in the raw input space X are difficult, the evaluation pipeline starts by embedding X_r and X_g into a "meaningful" feature space through a representation Φ, dubbed the evaluation embedding, and then computing E on the embedded features (Figure 2(a)). In Section 4, we propose a representation learning approach to construct embeddings tailored to our metric and the data set at hand. The auditing task computes sample-level metrics for each X_{g,j} in D_synth, discarding samples with low scores, which results in a "curated" synthetic data set. When granted direct access to the model P_g, the auditor serves as a rejection sampler that repeatedly draws samples from P_g, only accepting ones with high precision and authenticity (Figure 2(b)).

Let X̃_r = Φ(X_r) and X̃_g = Φ(X_g) be the embedded real and synthetic feature instances. For simplicity, we will use P_r and P_g to refer to the distributions over the raw and the embedded features interchangeably. Let S_r = supp(P_r) and S_g = supp(P_g), where supp(P) is the support of P. Central to our proposed evaluation metrics is a more general notion for the support of a distribution P, which we dub the α-support. We define the α-support as the smallest subset of S = supp(P) supporting a probability mass α, i.e.,

S^α ≜ argmin_{s ⊆ S} { V(s) : P(X ∈ s) ≥ α },    (2)

where V(s) is the volume (Lebesgue measure) of the set s, and α ∈ [0, 1]. One can think of an α-support as dividing the full support of P into "normal" samples concentrated in S^α, and "outliers" residing in the complement S̄^α, where S = S^α ∪ S̄^α. Finally, we define the distance between a data sample X and the training data D_real as the distance between X and the closest sample in D_real, i.e.,

d(X, D_real) ≜ min_{X_{r,i} ∈ D_real} d(X, X_{r,i}),    (3)

where d is a distance metric defined over the input space X. In the following Subsections, we provide formal definitions for the components of the metric E in (1), and then develop an estimator for all components of E in Section 4.
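To make the α-support in (2) concrete, the following worked example (an illustration added here, not taken from the paper) computes it in closed form for a one-dimensional standard Gaussian, where the minimum-volume set of a given mass is a central interval:

```latex
% Worked example (illustrative): the alpha-support of P = N(0, 1).
% For a symmetric, unimodal density, the smallest-volume set carrying mass alpha
% is the central interval around the mode. Writing F for the standard normal CDF,
S^{\alpha} = \left[ -z_{\alpha},\; z_{\alpha} \right],
\qquad z_{\alpha} = F^{-1}\!\left( \tfrac{1+\alpha}{2} \right).
% For alpha = 0.9, z_{\alpha} \approx 1.645: the central 90% of samples are "typical",
% while the 10% of samples in the two tails are treated as outliers.
```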
3.2.1. α-PRECISION AND β-RECALL

α-Precision. The conventional Precision metric is defined as the probability that a generated sample is supported by the real distribution, i.e., P(X̃_g ∈ S_r) (Sajjadi et al., 2018). We propose a more refined measure of sample fidelity, dubbed the α-Precision (denoted as P_α), defined as follows:

P_α ≜ P(X̃_g ∈ S^α_r).    (4)

That is, α-Precision measures the probability that a synthetic sample resides in the α-support of the real distribution.

β-Recall. To assess diversity in synthetic data, we propose the β-Recall metric as a generalization of the conventional Recall metric. Formally, we define the β-Recall as follows:

R_β ≜ P(X̃_r ∈ S^β_g).    (5)

The β-Recall metric measures the fraction of real data that resides in the β-support of the generative distribution.

Interpreting α-Precision & β-Recall. To interpret (4) and (5), we first need to revisit the notion of α-support. From (2), we know that the α-support hosts the most densely packed probability mass α in a distribution, hence S^α_r and S^β_g always concentrate around the modes of P_r and P_g (Figure 3); samples residing outside of S^α_r and S^β_g can be thought of as outliers. In this sense, α-Precision and β-Recall do not count outliers when assessing a model's fidelity and diversity. That is, the α-Precision score deems a synthetic sample to be of high fidelity not only if it looks "realistic", but also if it looks "typical". Similarly, β-Recall counts a real sample as being covered by P_g only if it is not an outlier in P_g. By sweeping the values of α and β from 0 to 1, we obtain a varying definition of which samples are "typical" and which are "outliers": the smaller the values of α and β, the tighter the α- and β-supports become, and the more samples count as outliers. This gives us entire precision and recall curves, P_α vs. α and R_β vs. β, instead of single values as in the standard precision-recall analysis.

Generalizing precision-recall analysis. How do the definitions in (4) and (5) improve on standard precision-recall diagnostics? Unlike these metrics, α-Precision and β-Recall take into account not only the supports of P_r and P_g, but also the actual probability densities of both distributions. Standard precision (and recall) correspond to one point on the P_α (and R_β) curve; they are equal to P_α and R_β evaluated on the full support (i.e., P_1 and R_1). By defining our metrics with respect to the α- and β-supports, we do not treat all samples equally, but rather assign higher importance to samples that land in "denser" regions of S_r and S_g. Hence, P_α and R_β reflect the extent to which P_r and P_g are calibrated, i.e., good P_α and R_β curves are achieved when P_r and P_g share the same modes and not just a common support.

The new P_α and R_β metrics address the major shortcomings of precision and recall. Among these shortcomings are: lack of robustness to outliers, failure to detect matching distributions, and inability to diagnose different types of distributional failure (such as mode collapse, mode invention, or density shifts) (Naeem et al., 2020). Basically, a model P_g will score perfectly on precision and recall (R_1 = P_1 = 1) as long as it nails the support of P_r, even if P_r and P_g place totally different densities on their common support. Figure 3 illustrates how our metrics remedy these shortcomings. While optimal R_1 and P_1 can be achieved by arbitrarily mismatched P_r and P_g, our P_α and R_β curves are optimized only when P_r and P_g are identical, as stated by Theorem 1.

Theorem 1. The α-Precision and β-Recall satisfy the condition P_α/α = R_β/β = 1, ∀α, β, if and only if the generative and real densities are identical, i.e., P_g = P_r.
That is, a model is optimal if and only if its P_α and R_β curves are both straight lines with unit slope. Any model P_g that does not perfectly recover P_r will achieve suboptimal P_α and R_β curves that behave non-linearly with α and β (Figure 3). This is a significant result because it enables us to distill a measure of statistical distance out of P_α and R_β.

Measuring statistical discrepancy with P_α & R_β. While the P_α and R_β curves provide a detailed view on a model's fidelity and diversity performance, it is often more convenient to summarize performance in a single number. To this end, we define the mean absolute deviations of P_α and R_β from their optimal values as:

ΔP_α ≜ ∫_0^1 |P_α − α| dα,   ΔR_β ≜ ∫_0^1 |R_β − β| dβ.    (6)

That is, ΔP_α ∈ [0, 1/2] and ΔR_β ∈ [0, 1/2] quantify the extent to which the α-Precision and β-Recall deviate from their optimal values. We define the integrated P_α and R_β metrics as IP_α = 1 − 2ΔP_α and IR_β = 1 − 2ΔR_β. Both metrics take values in [0, 1], and following Theorem 1, we have IP_α = IR_β = 1 only if P_g and P_r are identical. The IP_α score represents the probability that a synthetic sample is adequately represented by the modes of P_r, whereas IR_β is the probability that a real sample is adequately represented by the modes of P_g. Together, IP_α and IR_β serve as a measure of the discrepancy between the distributions P_r and P_g, eliminating the need to augment our precision-recall analysis with statistical divergence metrics. Moreover, unlike f-divergence measures, the (IP_α, IR_β) metric does not require that P_r and P_g share a common support, and it disentangles fidelity and diversity into separate components.

Figure 3. Interpretation of the P_α and R_β curves. The real distribution is colored in blue, the generative distribution in red. Distributions are collapsed into 1 dimension for simplicity. Here, P_r is a multimodal distribution of cat images, with one mode representing orange tabby cats and another mode for Calico cats; outliers comprise exotic Caracal cats. Shaded areas represent the probability mass covered by the α- and β-supports; these supports concentrate around the modes, but need not be contiguous for multimodal distributions, i.e., we have S^α_r = S^α_{r,1} ∪ S^α_{r,2} and S^β_g = S^β_{g,1} ∪ S^β_{g,2}. (a) Here, the model P_g exhibits mode collapse, where it over-represents orange tabbies. Such a model would achieve a precision score of P_1 = 1 but a suboptimal (concave) P_α curve (panel (d)). Because it does not cover all modes, the model will have both a suboptimal R_1 score and R_β curve. (b) This model perfectly nails the support of P_r, hence it scores optimal standard metrics P_1 = R_1 = 1. However, the model invents a mode by over-representing outliers, where it mostly generates images of the exotic cat breed. Standard metrics imply that model (b) outperforms (a), whereas in reality (a) is more faithful to the real data. P_α and R_β give us a fuller picture of the comparative performances of both models. (c) This model realizes both types of cats but estimates a slightly shifted support and density; intuitively, this is the best of the three models, but it will appear inferior to (b) under P_1 and R_1. By examining the P_α-R_β curves, we see that model (c) has less deviation from optimal performance (the dashed black lines in panel (d)).
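As a worked illustration (a minimal sketch, not the authors' implementation), the snippet below computes the P_α and R_β curves in closed form for one-dimensional Gaussians, where the α- and β-supports reduce to central quantile intervals, and then integrates them into IP_α and IR_β following (6); the trapezoidal rule is an assumption, since the paper does not prescribe a particular numerical scheme.

```python
# Illustrative sketch (not the paper's estimator): P_alpha / R_beta curves and the
# integrated scores IP_alpha / IR_beta for 1-D Gaussians, where the alpha- and
# beta-supports reduce to central quantile intervals.
import numpy as np
from scipy.stats import norm

def curves(mu_g=1.0, sigma_g=1.0, grid=np.linspace(0.0, 1.0, 201)):
    """P_r = N(0, 1), P_g = N(mu_g, sigma_g^2)."""
    p_alpha, r_beta = [], []
    for a in grid:
        z_r = norm.ppf((1 + a) / 2)                      # half-width of S^alpha_r
        p_alpha.append(norm.cdf(z_r, mu_g, sigma_g) - norm.cdf(-z_r, mu_g, sigma_g))
        lo = norm.ppf((1 - a) / 2, mu_g, sigma_g)        # S^beta_g = [lo, hi]
        hi = norm.ppf((1 + a) / 2, mu_g, sigma_g)
        r_beta.append(norm.cdf(hi) - norm.cdf(lo))
    return grid, np.array(p_alpha), np.array(r_beta)

def integrated(levels, curve):
    """IP_alpha (or IR_beta) = 1 - 2 * mean absolute deviation from the diagonal."""
    return 1.0 - 2.0 * np.trapz(np.abs(curve - levels), levels)

grid, P, R = curves(mu_g=1.0)
print(integrated(grid, P), integrated(grid, R))  # both equal 1.0 only when P_g = P_r
```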
Generalization is independent of precision and recall, since a model can achieve perfect fidelity and diversity without truly generating any new samples, simply by resampling the training data. Unlike discriminative models, for which generalization is easily tested via held-out data, evaluating generalization in generative models is not straightforward (Adlam et al., 2019; Meehan et al., 2020). We propose an authenticity score A ∈ [0, 1] to quantify the rate by which a model generates new samples. To pin down a mathematical definition for A, we reformulate P_g as a mixture of densities as follows:

P_g = A · P̃_g + (1 − A) · δ_{g,ε},    (7)

where P̃_g is the generative distribution conditioned on the synthetic samples being non-overfitted (a = 1), and δ_{g,ε} is a noisy distribution over the training data. In particular, we define δ_{g,ε} as δ_{g,ε} = δ_g * N(0, ε²), where δ_g is a discrete distribution that places an unknown probability mass on each training data point in D_real, ε² is an arbitrarily small noise variance, and * is the convolution operator. Essentially, (7) assumes that the model flips a (biased) coin, pulling a training sample with probability 1 − A and adding some noise to it, or innovating a new sample with probability A. A model with A = 1 always innovates, whereas an overfitted model will concentrate P_g around the training data.

With all the metrics in Section 3 being defined on the sample level, we can obtain an estimate Ê = (P̂_α, R̂_β, Â) of the metric E (for given α and β) in a binary classification fashion, by assigning binary scores P̂_{α,j}, Â_j ∈ {0, 1} to each synthetic sample X̃_{g,j} in D_synth, and R̂_{β,i} ∈ {0, 1} to each real sample X̃_{r,i} in D_real, then averaging over all samples, i.e.,

P̂_α = (1/m) Σ_{j=1}^{m} P̂_{α,j},   R̂_β = (1/n) Σ_{i=1}^{n} R̂_{β,i},   Â = (1/m) Σ_{j=1}^{m} Â_j.

To assign binary scores to individual samples, we construct three binary classifiers, f_P, f_R and f_A. We explain the operation of each classifier in what follows.

Precision and Recall classifiers (f_P and f_R). Based on definitions (4) and (5), both classifiers check if a sample exists in an α-support, i.e., f_P(X̃_g) = 1{X̃_g ∈ S^α_r} and f_R(X̃_r) = 1{X̃_r ∈ S^β_g}. Hence, the main difficulty in implementing f_P and f_R is estimating the supports S^α_r and S^β_g; in fact, even if we know the exact distributions P_r and P_g, computing their α- and β-supports is not straightforward, as it involves solving the optimization problem in (2). To address this challenge, we pre-process the real and synthetic data in a way that renders estimation of the α- and β-supports straightforward. The trick is to train the evaluation embedding Φ so as to cast the support of the real data, S_r, into a hypersphere with radius r, and cast the distribution P_r into an isotropic density concentrated around the center c_r of the hypersphere. We achieve this by modeling Φ as a one-class (feed-forward) neural network trained with the following loss function:

min_{r, Φ}  r² + (1/(νn)) Σ_{i=1}^{n} max{0, ‖Φ(X_{r,i}) − c_r‖² − r²}.    (8)

The loss is minimized over the radius r and the parameters of Φ; the output dimension of Φ, c_r and ν are viewed as hyperparameters (see Supplementary material). The loss in (8) is based on the seminal work on one-class SVMs in (Schölkopf et al., 2001), which is commonly applied to outlier detection problems, e.g., (Ruff et al., 2018). In a nutshell, the evaluation embedding squeezes the real data into the minimum-volume hypersphere centered around c_r (as illustrated in Figure 1), hence the real α-support is easily estimated as:

Ŝ^α_r = B(c_r, r_α),   r_α = Q_α{ ‖X̃_{r,i} − c_r‖ : 1 ≤ i ≤ n },

where B(c, r) is a Euclidean ball with center c and radius r, and Q_α is the quantile function.
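A minimal PyTorch sketch of a soft-boundary one-class loss of the form in (8) is given below (following the cited Deep SVDD objective; the layer sizes mirror the tabular configuration reported in the Supplementary material, while the training loop, data and radius handling are illustrative assumptions rather than the authors' code):

```python
# Sketch of a soft-boundary one-class embedding loss in the spirit of Eq. (8)
# (after Ruff et al., 2018). Architecture sizes follow the appendix (tabular data);
# the data and training details are illustrative only.
import torch
import torch.nn as nn

class OneClassEmbedder(nn.Module):
    def __init__(self, d_in, d_hidden=32, d_z=25, n_hidden=3):
        super().__init__()
        layers, d = [], d_in
        for _ in range(n_hidden):
            layers += [nn.Linear(d, d_hidden, bias=False), nn.ReLU()]  # no bias: mitigates hypersphere collapse
            d = d_hidden
        layers += [nn.Linear(d, d_z, bias=False)]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

def soft_boundary_loss(z, center, radius, nu=0.01):
    """r^2 + (1 / (nu * n)) * sum(max(0, ||Phi(x) - c||^2 - r^2))."""
    sq_dist = ((z - center) ** 2).sum(dim=1)
    return radius ** 2 + torch.clamp(sq_dist - radius ** 2, min=0).mean() / nu

# Usage sketch: embed real data, then take the alpha-quantile of distances as r_alpha.
phi = OneClassEmbedder(d_in=10)
center = torch.ones(25)                          # c_r = 1 treated as a fixed hyperparameter
radius = torch.tensor(1.0, requires_grad=True)
opt = torch.optim.AdamW(list(phi.parameters()) + [radius], weight_decay=1e-2)
x_real = torch.randn(256, 10)                    # placeholder data
for _ in range(100):
    opt.zero_grad()
    loss = soft_boundary_loss(phi(x_real), center, radius)
    loss.backward()
    opt.step()
r_alpha = torch.quantile((phi(x_real) - center).norm(dim=1).detach(), q=0.9)  # alpha = 0.9
```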
The set of all α-supports of P_r corresponds to the set of all concentric spheres with center c_r and radii r_α, ∀α ∈ [0, 1]. Thus, the precision classifier assigns a score of 1 to a synthetic sample X̃_g if it resides in the ball Ŝ^α_r, i.e., f_P(X̃_g) = 1{‖X̃_g − c_r‖ ≤ r_α}. Now define c_g = (1/m) Σ_j X̃_{g,j}, and consider a hypersphere given by B(c_g, r_β), where r_β = Q_β{ ‖X̃_{g,j} − c_g‖ : 1 ≤ j ≤ m }. We construct the recall classifier as follows:

f_R(X̃_{r,i}) = 1{ ‖X̃_{r,i} − X̃^β_{g,j*}‖ ≤ NND_k(X̃_{r,i}) },    (9)

where X̃^β_{g,j*} is the synthetic sample in B(c_g, r_β) that is closest to X̃_{r,i}, and NND_k(X̃_{r,i}) is the distance between X̃_{r,i} and its k-nearest neighbor in D_real. Note that, since Φ is trained on real data, it does not necessarily transform the support of synthetic data into a hypersphere: hence, B(c_g, r_β) contains S^β_g but does not coincide with it. Equation (9) is a nonparametric estimate of S^β_g that checks if each real sample i is locally covered by a synthetic sample in B(c_g, r_β). (See the visual illustration of (9) in the Supplementary material.)

Authenticity classifier. We construct the classifier f_A as a binary hypothesis test, whereby we test the hypothesis that the sample X̃_{g,j} is non-memorized. Let H_1: A_j = 1 be the hypothesis that sample j is authentic, and let H_0: A_j = 0 be the null hypothesis. To test this hypothesis, we use the likelihood-ratio test (LRT) statistic (Van Trees, 2004):

Λ(X̃_{g,j}) = P(X̃_{g,j} | A_j = 1) / P(X̃_{g,j} | A_j = 0) = P̃_g(X̃_{g,j}) / δ_{g,ε}(X̃_{g,j}),    (10)

which follows from the decomposition in (7). Since both likelihood functions in (10) are unknown, we need to test the hypothesis H_1: A_j = 1 using an alternative sufficient statistic with a known probability distribution. Let d_{g,j} = d(X̃_{g,j}, D_real) be the distance between synthetic sample j and the training data set, and let i* be the training sample in D_real closest to X̃_{g,j}, i.e., d_{g,j} = d(X̃_{g,j}, X̃_{r,i*}). Let d_{r,i*} be the distance between X̃_{r,i*} and D_real \ {X̃_{r,i*}}, i.e., the training data with sample i* removed. Now consider the (binary) statistic a_j = 1{d_{g,j} ≤ d_{r,i*}}, which indicates whether a given synthetic sample j is closer to the training data than any other training sample. The likelihood ratio for the observations {a_j}_j under hypotheses H_0 and H_1 is

Λ = P({a_j}_j | H_1) / P({a_j}_j | H_0).

Here, we use the fact that if sample j is a memorized copy of i*, and if the noise variance in (7) is arbitrarily small, then a_j = 1 almost surely and P(a_j | A_j = 0) ≈ 1. If j is authentic, then X̃_{g,j} lies in the convex hull of the training data, and hence P(a_j | A_j = 0) → 0 and Λ → ∞ for a large real data set. Thus, the authenticity classifier f_A issues a label A_j = 1 if a_j = 0, and A_j = 0 otherwise. Intuitively, f_A deems sample j unauthentic if it is closer to i* than any other real sample in the training data. Based on the Neyman-Pearson Lemma, the LRT above is the most powerful test for authenticity (Huber & Strassen, 1973).
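The statistic a_j admits a compact nearest-neighbour implementation; the following is an illustrative sketch (scikit-learn is an assumption, not mentioned in the paper) that operates on already-embedded real and synthetic features:

```python
# Sketch of the sample-level authenticity test described above: a synthetic sample is
# flagged unauthentic if it is closer to its nearest training sample than that training
# sample is to any other training sample.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def authenticity_scores(x_real_emb, x_synth_emb):
    """Return A_j in {0, 1} for each embedded synthetic sample."""
    nn_real = NearestNeighbors(n_neighbors=2).fit(x_real_emb)
    # d_{r,i}: distance from each real sample to its closest *other* real sample
    d_real, _ = nn_real.kneighbors(x_real_emb)
    d_real_leave_one_out = d_real[:, 1]            # column 0 is the sample itself (distance 0)
    # d_{g,j} and the index i* of the closest real sample to each synthetic sample
    d_synth, idx = nn_real.kneighbors(x_synth_emb, n_neighbors=1)
    a_j = (d_synth[:, 0] <= d_real_leave_one_out[idx[:, 0]]).astype(int)
    return 1 - a_j                                 # authentic (A_j = 1) iff a_j = 0

# Usage: the mean of authenticity_scores(...) estimates the Authenticity component of E,
# and the per-sample labels can also be used to filter out unauthentic samples (auditing).
```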
In this Section, we showcase the use cases of our metric in various application domains. Experimental details and additional experiments are provided in the Appendix.

In this experiment, we test the ability of our metric E to assess the utility of different generative models in synthesizing COVID-19 patient data that can be shared with researchers without leaking sensitive patient information. We use data from the SIVEP-Gripe database (SIVEP-Gripe, 2020), which comprises records for 99,557 COVID-19 patients in Brazil, including personal information (ethnicity, age and location). We use generative models to synthesize replicas of this data, with the goal of fitting predictive models on the replicas.

Models and baselines. We create 4 synthetic data sets using a GAN, a VAE, a Wasserstein GAN with gradient penalty (WGAN-GP) (Gulrajani et al., 2017), and an ADS-GAN, which is specifically designed to prevent patient identifiability in the generated data (Yoon et al., 2020). To evaluate the synthetic data sets, we use the Frechet Inception Distance (FID) (Heusel et al., 2017), Precision and Recall (P_1 and R_1) (Sajjadi et al., 2018), the Density and Coverage (D and C) metrics (Naeem et al., 2020), Parzen window likelihood estimates (P_W) (Bengio et al., 2013) and the Wasserstein distance (W) as baselines. On each synthetic data set, we fit a predictive logistic regression (binary classification) model to predict patient-level COVID-19 mortality.

Figure 4. Here, we show how the 4 generative models are ranked with respect to each evaluation metric (leftmost is best). For each metric, we select the synthetic data set with the highest score, and then train a predictive model on the selected data set and test its AUC-ROC performance on real data. We consider the ground-truth ranking of the quality of the 4 synthetic data sets to be the ranking of the AUC-ROC scores of the predictive models trained on them.

Predictive modeling on synthetic data sets. In the context of predictive modeling, a generative model is assessed with respect to its usefulness in training predictive models that generalize well on real data. Hence, the "ground-truth" ranking of the quality of the 4 generative models corresponds to the ranking of the AUC-ROC scores achieved by predictive models fit to their respective synthetic data sets and tested on real data (Figure 4(a)). The data synthesized by the ADS-GAN model displayed the best performance, followed by WGAN-GP, VAE, and GAN. To assess the accuracy of the baseline evaluation metrics, we test whether they can recover the ground-truth ranking of the 4 generative models (Figure 4(a)). Our integrated precision and recall metrics IP_α and IR_β both assign the highest scores to ADS-GAN; IP_α exactly nails the right ranking of the generative models. On the other hand, competing metrics such as P_1, C and D seem to over-estimate the quality of VAE and WGAN-GP; if we use these metrics to decide which generative model to use, we will end up with predictive models that perform poorly, i.e., the AUC-ROC of the predictive model fitted to the synthetic data with the best P_1 is 0.55, compared to an AUC-ROC of 0.79 for our IP_α score. These results highlight the importance of accounting for the densities P_g and P_r, and not just their supports, when evaluating a generative model. As we can see in Figure 4(a), metrics that compare distributions, such as P_W and FID, are able to accurately rank the 4 generative models. This is because a shifted generative distribution would result in a "covariate shift" effect in the synthetic data, leading to poor generalization for the fitted predictive model, even if all synthetic samples are realistic and all real samples are covered. Our metrics are able to diagnose this problem because they account for densities as well as supports (Section 3).
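The downstream-utility protocol used above to obtain the ground-truth ranking can be summarized in a few lines; this is a hedged sketch in which the data loading, preprocessing, model settings and synthetic-set names are placeholders rather than the paper's actual pipeline:

```python
# Sketch of the downstream-utility protocol: fit a logistic regression on each synthetic
# data set and test it on held-out real data; rank models by the resulting AUC-ROC.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def utility_score(X_synth, y_synth, X_real_test, y_real_test):
    clf = LogisticRegression(max_iter=1000).fit(X_synth, y_synth)
    return roc_auc_score(y_real_test, clf.predict_proba(X_real_test)[:, 1])

# synthetic_sets = {"GAN": (Xg, yg), "VAE": (...), "WGAN-GP": (...), "ADS-GAN": (...)}
# ranking = sorted(synthetic_sets,
#                  key=lambda k: utility_score(*synthetic_sets[k], X_real_test, y_real_test),
#                  reverse=True)
```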
Another use case for our metric is hyper-parameter optimization for generative models. Here we focus on the best-performing model in our experiment: ADS-GAN. This model has a hyper-parameter λ ∈ R that determines the importance of the privacy-preservation loss function used to regularize the training of ADS-GAN (Yoon et al., 2020): smaller values of λ mean that the model is more prone to overfitting, and hence privacy leakage. Figure 4(b) shows how our precision and authenticity metrics change with the different values of λ: the curve provides an interpretable tradeoff between privacy and utility. For instance, for λ = 2, an Authenticity score of 0.4 means that 60% of Brazilian patients may have their personal information exposed. Increasing λ improves privacy at the expense of precision. By visualizing this tradeoff using our metric, clinical institutions can better understand the risks associated with different modeling choices involved in sharing synthetic data.

Improving synthetic data via model auditing. Even if the whole range of values for the hyper-parameter λ provides unsatisfactory precision and authenticity performance, we can still improve the quality of the ADS-GAN synthetic data in a post-hoc fashion using model auditing. Because our metrics are computable on a sample level, we can discard samples that are unauthentic or imprecise. This not only leads to nearly optimal precision and authenticity for the resulting curated data (Figure 4(c)), but also improves the AUC-ROC of the predictive model from 0.76 to 0.78. This is because removing imprecise samples eliminates noisy data points that would otherwise undermine generalization performance.

In this experiment, we test the ability of our metrics to detect common modes of failure in generative modeling; in particular, we emulate a mode-dropping scenario, where the generative model fails to recognize the distinct modes in a multimodal distribution P_r, and instead recovers a single mode in P_g. To construct this scenario, we fit a conditional GAN (CGAN) model (Wang et al., 2018) on the MNIST data set, and generate 1,000 samples for each of the digits 0-9. (We can think of each digit as a distinct mode in the real distribution P_r.) To apply mode dropping, we first sample 1,000 instances of each digit from the CGAN, then delete individual samples of digits 1 to 9 with probability P_drop, and replace the deleted samples with new samples of the digit 0 to complete a data set of 10,000 instances. The parameter P_drop ∈ [0, 1] determines the severity of mode dropping: for P_drop = 0, the data set has all digits equally represented with 1,000 samples each, and for P_drop = 1, the data set contains 10,000 samples of the digit 0 only, as depicted pictorially in Figure 5(a) (bottom panel).
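For concreteness, the mode-dropping corruption just described can be emulated with a short script; this is a sketch under the stated setup, where `samples_by_digit` and the `sample_zero` helper (drawing a fresh digit-0 sample from the CGAN) are hypothetical stand-ins for the trained model:

```python
# Sketch of the mode-dropping corruption: with probability p_drop, each sample of
# digits 1-9 is replaced by a freshly drawn digit-0 sample.
import numpy as np

def apply_mode_dropping(samples_by_digit, p_drop, sample_zero, seed=0):
    """samples_by_digit[d]: 1,000 CGAN samples of digit d; sample_zero(): draws a fresh
    digit-0 sample from the CGAN (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    corrupted = list(samples_by_digit[0])
    for d in range(1, 10):
        for x in samples_by_digit[d]:
            corrupted.append(sample_zero() if rng.random() < p_drop else x)
    return corrupted  # 10,000 samples in total; p_drop = 1 leaves only the digit 0
```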
We show how the different evaluation metrics respond to varying P_drop from 0 to 1 in Figure 5(a) (top panel). Because mode dropping pushes the generative distribution away from the real one, statistical distance metrics such as W and FID increase as P_drop approaches 1. However, these metrics only reflect a discrepancy between P_r and P_g, and do not disentangle the Fidelity and Diversity components of this discrepancy. On the other hand, standard precision and recall metrics are completely insensitive to mode dropping except for the extreme case when P_drop = 1. This is because both metrics only check the supports of P_r and P_g, so they cannot recognize mode dropping as long as there is a non-zero probability that the model will generate digits 1-9. On the contrary, mode dropping is clearly reflected in our metrics, manifesting as a declining IR_β as P_drop increases. Since mode dropping affects coverage of the digits and not the quality of the images, it only affects IR_β but not IP_α.

Finally, we use our metric to re-evaluate the generative models submitted to the NeurIPS 2020 Hide-and-Seek competition (Jordon et al., 2020). In this competition, participants were required to synthesize intensive care time-series data based on real data from the AmsterdamUMCdb database. A total of 16 submissions were judged based on the accuracy of predictive models fit to the synthetic data (an approach similar to the one in Section 5.1). The submissions followed various modeling choices, including recurrent GANs, autoencoders, differential privacy GANs, etc. Details of all submissions are available online. Surprisingly, the winning submission was a very simplistic model that adds Gaussian noise to the real data to create new samples. To evaluate our metrics on time-series data, we trained a Seq-2-Seq embedding that is augmented with our One-class representations to transform time-series into fixed-length feature vectors. (The architecture for this embedding is provided in the Supplementary material.) In Figure 5(b), we evaluate all submissions with respect to precision, recall and authenticity. As we can see, the winning submission comes out as one of the least authentic models, despite performing competitively in terms of precision and recall. This highlights the detrimental impact of using naïve metrics for evaluating generative models: based on the competition results, clinical institutions seeking to create synthetic data sets may be led to believe that Submission 1 in Figure 5(b) is the right model to use. However, our metrics, which give a fuller picture of the true quality of all submissions, show that this model creates unauthentic samples that are mere noisy copies of real data, which would pose a risk to patient privacy. We hope that our metrics and our pre-trained Seq-2-Seq embeddings can help clinical institutions evaluate the quality of their synthetic time-series data in the future.

In this Section, we provide a comprehensive survey of prior work, along with a detailed discussion on how our metric relates to existing ones. We classify existing metrics for evaluating generative models into two main classes: (1) statistical divergence metrics and (2) precision and recall metrics. Divergence metrics are single-valued measures of the distance between the real and generative distributions, whereas precision-recall metrics classify real and generated samples as to whether they are covered by the generative and real distributions, respectively. In what follows, we list examples of these two types of metrics, highlighting their limitations.

Statistical divergence metrics. The most straightforward approach for evaluating a generative distribution is to compute the model log-likelihood; for density estimation tasks, this has been the de-facto standard for training and evaluating generative models. However, the likelihood function is a model-dependent criterion: this is problematic because the likelihood of many state-of-the-art models is inaccessible. For instance, GANs are implicit likelihood models and hence provide no explicit expression for their achieved log-likelihood (Goodfellow et al., 2014). Other models, such as energy-based models, have a normalization constant in the likelihood expression that is generally difficult to compute, as it requires solving intractable complex integrals (Kingma & Welling, 2013).
Statistical divergence measures are alternative (model-independent) metrics that are related to the log-likelihood, and are commonly used for training and evaluating generative models. Examples include lower bounds on the log-likelihood (Kingma & Welling, 2013), contrastive divergence and noise contrastive estimation (Hinton, 2002; Gutmann & Hyvärinen, 2010), probability flow (Sohl-Dickstein et al., 2011), score matching (Hyvärinen et al., 2009), maximum mean discrepancy (MMD) (Gretton et al., 2012), and the Jensen-Shannon divergence (JSD).

In general, statistical divergence measures suffer from the following limitations. The first limitation is that likelihood-based measures can be inadequate in high-dimensional feature spaces. As has been shown in (Theis et al., 2015), one can construct scenarios with poor likelihood and great samples through a simple lookup table model, and, vice versa, we can think of scenarios with great likelihood and poor samples. This is because, if the model samples white noise 99% of the time and samples high-quality outputs 1% of the time, the log-likelihood will be hardly distinguishable from that of a model that samples high-quality outputs 100% of the time if the data dimension is large. Our metrics solve this problem by measuring the rate of error on a sample level rather than evaluating the overall distribution of samples. Moreover, statistical divergence measures collapse the different modes of failure of the generative distribution into a single number. This hinders our ability to diagnose the different modes of generative model failure, such as mode dropping, mode collapse, poor coverage, etc.

Precision and recall metrics. Precision and recall metrics for evaluating generative models were originally proposed in (Sajjadi et al., 2018). Our metrics differ from these metrics in various ways. First, unlike standard metrics, α-Precision and β-Recall take into account not only the supports of P_r and P_g, but also the actual probability densities of both distributions. Standard precision (and recall) correspond to one point on the P_α (and R_β) curve; they are equal to P_α and R_β evaluated on the full support (i.e., P_1 and R_1). By defining our metrics with respect to the α- and β-supports, we do not treat all samples equally, but rather assign higher importance to samples that land in "denser" regions of S_r and S_g. Hence, P_α and R_β reflect the extent to which P_r and P_g are calibrated, i.e., good P_α and R_β curves are achieved when P_r and P_g share the same modes and not just a common support. While optimal R_1 and P_1 can be achieved by arbitrarily mismatched P_r and P_g, our P_α and R_β curves are optimized only when P_r and P_g are identical, as stated by Theorem 1. The new P_α and R_β metrics address the major shortcomings of precision and recall. Among these shortcomings are: lack of robustness to outliers, failure to detect matching distributions, and inability to diagnose different types of distributional failure (such as mode collapse, mode invention, or density shifts) (Naeem et al., 2020). Basically, a model P_g will score perfectly on precision and recall (R_1 = P_1 = 1) as long as it nails the support of P_r, even if P_r and P_g place totally different densities on their common support. In addition to the above, our metrics estimate the supports of the real and generative distributions using neural networks rather than nearest-neighbor estimates as in (Naeem et al., 2020).
This prevents our estimates from overestimating the supports of the real and generative distributions, and thereby overestimating the coverage or quality of the generated samples.

To prove the statement of the Theorem, we need to prove the two following statements:

(1) P_g = P_r → P_α/α = R_β/β = 1, ∀α, β;
(2) P_α/α = R_β/β = 1, ∀α, β → P_g = P_r.

To prove (1), we start by noting that since we have P_g = P_r, then S^α_g = S^α_r, ∀α ∈ [0, 1]. Thus, we have P_α = P(X̃_g ∈ S^α_r) = P(X̃_g ∈ S^α_g) = α for all α ∈ [0, 1], and similarly we have R_β = P(X̃_r ∈ S^β_g) = P(X̃_r ∈ S^β_r) = β for all β ∈ [0, 1], which concludes condition (1). Now we consider condition (2). We first note that S^α_r ⊆ S^{α'}_r for all α' > α. If P_α = α for all α, then we have P_g(S^α_r) = P_r(S^α_r) = α for all α ∈ [0, 1]. Now assume that α' = α + Δα; then we have P_g(S^{α+Δα}_r \ S^α_r) = P_r(S^{α+Δα}_r \ S^α_r) = Δα. Thus, the probability masses of P_g and P_r are equal on every infinitesimally small region S^{α+Δα}_r \ S^α_r (for Δα → 0) of the α-support of P_r, hence P_g = P_r on all subsets of S^1_r. By applying a similar argument to the recall metric, we also have P_g = P_r on all subsets of S^1_g, and hence P_g = P_r.

Appendix C: Experimental details. In this research we argue for the versatility of our metrics, hence we have included results for tabular (static), time-series and image data (see Table 1). For the tabular data we use (Baqui et al., 2020). Missing values in the time-series (competition) data are imputed using (1) back-fill, (2) forward-fill, and (3) feature median imputation; this preprocessing is chosen to match the competition (Jordon et al., 2020). The competition "hider" submissions were trained on this data set, and the synthetic data were generated from them.

For metric consistency and to avoid tedious architecture optimization for each data modality, we follow previous works (e.g. (Heusel et al., 2017; Sajjadi et al., 2018; Kynkäänniemi et al., 2019; Naeem et al., 2020)) and embed image and time-series data into a static embedding. This is required, since the original space is non-Euclidean and would result in the failure of most metrics. The static embedding is used for computing the baseline metrics, and is used as input for the One-Class embedder. For finding static representations of MNIST, images are upscaled and embedded using InceptionV3 pre-trained on ImageNet without the top layer. This is the same embedder used for computing the Frechet Inception Distance (Heusel et al., 2017). Very similar results were obtained using instead a VGG-16 embedder (Brock et al., 2018; Kynkäänniemi et al., 2019). Preliminary experimentation with random VGG-16 models (Naeem et al., 2020) did not yield stable results for either the baselines or our methods.

The time-series embeddings used throughout this work are based on Unsupervised Learning of Video Representations using LSTMs (Srivastava et al., 2015), specifically the "LSTM Autoencoder Model". A sequence-to-sequence LSTM network is trained, with the target sequence set as the input sequence (reversed for ease of optimization), see Figure 6. The encoder hidden and cell states (h and c vectors) at the end of a sequence are used as the learned representation and are passed to the decoder during training. At inference, these are concatenated to obtain one fixed-length vector per example. The specifics of the LSTM autoencoder used here are as follows. Two LSTM layers are used in each encoder and decoder. The size of the h, c vectors is 70 (280 after concatenation). The model was implemented in PyTorch (Paszke et al., 2017), utilising sequence packing for computational efficiency.
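A simplified PyTorch sketch of such a sequence-to-sequence LSTM autoencoder embedder is shown below; the hidden sizes follow the description above (2 layers, h/c of size 70, 280-dimensional embedding), while the decoding scheme, feature dimension and training details are illustrative assumptions rather than the authors' implementation (which also uses sequence packing):

```python
# Simplified sketch of the seq-to-seq LSTM autoencoder embedder described above.
import torch
import torch.nn as nn

class LSTMAutoencoder(nn.Module):
    def __init__(self, n_features, hidden=70, layers=2):
        super().__init__()
        self.encoder = nn.LSTM(n_features, hidden, num_layers=layers, batch_first=True)
        self.decoder = nn.LSTM(n_features, hidden, num_layers=layers, batch_first=True)
        self.out = nn.Linear(hidden, n_features)

    def embed(self, x):
        _, (h, c) = self.encoder(x)                  # final hidden/cell states
        return torch.cat([h, c], dim=0).permute(1, 0, 2).reshape(x.size(0), -1)  # (batch, 280)

    def forward(self, x):
        _, state = self.encoder(x)
        target = torch.flip(x, dims=[1])             # reconstruct the reversed input sequence
        dec_in = torch.zeros_like(target)            # simple zero-input decoder (assumption)
        dec_out, _ = self.decoder(dec_in, state)
        return self.out(dec_out), target

model = LSTMAutoencoder(n_features=8)                # feature dimension is a placeholder
x = torch.randn(4, 20, 8)                            # (batch, time, features)
recon, target = model(x)
loss = nn.functional.mse_loss(recon, target)
embedding = model.embed(x)                           # fixed-length vector per sequence
```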
All autoencoders were trained to convergence on the original data; the synthetic time-series data were passed through this embedder at inference. The time column (when present in the data) was discarded.

For computing the density and coverage metrics, we set a threshold of 0.95 on the minimum expected coverage, as recommended in the original work (Eq. 9 of (Naeem et al., 2020)). For all datasets, this is achieved for k = 5. For consistency in these comparisons, we use k = 5 for the precision and recall metrics too.

We use Deep SVDD (Ruff et al., 2018) to embed static data into One-Class representations. To mitigate hypersphere collapse (Propositions 2 and 3 of (Ruff et al., 2018)), we do not include a bias term and use ReLU activations for the One-Class embedder. The original data is split into training (80%) and validation (20%) sets, and the One-Class design is fine-tuned to minimise the validation loss. We use the SoftBoundary objective (Eq. 3 of (Ruff et al., 2018)) with ν = 0.01 and center c = 1 for tabular and time-series data and c = 10 · 1 for image data. Let n_h be the number of hidden layers, each with d_h nodes, and let d_z be the dimension of the representation layer. For tabular data, we use n_h = 3, d_h = 32 and d_z = 25; for time-series data, n_h = 2, d_h = 128 and d_z = 32; and for MNIST, n_h = 3, d_h = 128 and d_z = 32. Models are implemented in PyTorch (Paszke et al., 2017) and the AdamW optimizer is used with weight decay 10^{-2}. For the β-Recall metric, note that by definition β-Recall increases monotonically with increasing k. We set k = 1 for maximum interpretability.

We include two toy experiments that highlight the advantages of the proposed metrics compared to previous works. We focus our comparison on the improved precision and recall (Kynkäänniemi et al., 2019) and density and coverage (Naeem et al., 2020) metrics.

Table 2. Metrics on tabular data for different generative models. Row "audited" contains results for data generated by ADS-GAN, but in which samples are rejected if they do not meet the precision or authenticity threshold.

Robustness to outliers. Naeem et al. (2020) showed that the precision and recall metrics as proposed by (Sajjadi et al., 2018; Kynkäänniemi et al., 2019) are not robust to outliers. We replicate these toy experiments to show that the proposed α-Precision and β-Recall metrics are robust to such outliers. Let X, Y ∈ R^d denote original and synthetic samples respectively, with original X ∼ N(0, I) and Y ∼ N(µ, I). We compute all metrics for µ ∈ [−1, 1]. In this setting we conduct three experiments: (1) no outliers; (2) one outlier in the real data at X = 1; (3) one outlier in the synthetic data at Y = 1. We set d = 64 and sample 10,000 points for both the original and synthetic data. The resulting metric scores are shown in the accompanying figure. As can be seen, the precision and recall metrics are not robust to outliers, as just a single outlier has dramatic effects. The IP_α and IR_β are not affected, as the outlier does not belong to the α-support (or β-support) unless α (or β) is large.

Mode resolution. The precision and recall metrics only take into account the supports of the original and synthetic data, but not the actual densities. The density and coverage metrics do take this into account, but here we show that they are not able to capture the densities well enough to distinguish similar distributions. In this experiment we look at mode resolution: how well is the metric able to distinguish a single mode from two modes?
Let the original distribution be a mixture of two Gaussians that are separated by a distance µ and have σ = 1, and let the synthetic data be given by Y ∼ N(0, 1 + µ²). This situation would arise if a synthetic data generator fails to distinguish the two modes, and instead tries to capture the two close-by modes of the original distribution using a single mode. We compute metrics for µ ∈ [0, 5]. As can be seen, neither P&R nor D&C notice that the synthetic data consists of only a single mode, whereas the original data consisted of two. The α-Precision metric is able to capture this discrepancy: for small α, the α-support of the original distribution is centred around the two separated modes, and does not contain the space that separates them (i.e., the mode of the synthetic data).

References

Anonymization through data synthesis using generative adversarial networks (ADS-GAN). IEEE Journal of Biomedical and Health Informatics, 2020.
Investigating under and overfitting in Wasserstein generative adversarial networks.
Ethnic and regional variations in hospital mortality from COVID-19 in Brazil: a cross-sectional observational study.
A note on the Inception score.
Better mixing via deep representations. In International Conference on Machine Learning.
Pros and cons of GAN evaluation measures. Computer Vision and Image Understanding.
Large scale GAN training for high fidelity natural image synthesis.
Eval all, trust a few, do wrong to none: Comparing sentence generation models.
ImageNet: A large-scale hierarchical image database.
Precision-recall-gain curves: PR analysis done right.
Generative adversarial nets.
Optimal kernel choice for large-scale two-sample tests.
A domain agnostic measure for monitoring and evaluating GANs.
Improved training of Wasserstein GANs.
Noise-contrastive estimation: A new estimation principle for unnormalized statistical models.
GANs trained by a two time-scale update rule converge to a Nash equilibrium.
Training products of experts by minimizing contrastive divergence.
Minimax tests and the Neyman-Pearson lemma for capacities. The Annals of Statistics.
Estimation of non-normalized statistical models.
Quantitatively evaluating GANs with divergences proposed for training.
Differentially private bagging: Improved utility and cheaper privacy than subsample-and-aggregate.
Hide-and-seek privacy challenge.
Auto-encoding variational Bayes.
Improved precision and recall metric for assessing generative models.
The MNIST database of handwritten digits.
Are GANs created equal? A large-scale study.
A nonparametric test to detect data-copying in generative models.
Reliable fidelity and diversity metrics for generative models.
Automatic differentiation in PyTorch.
Deep one-class classification.
Assessing generative models via precision and recall.
Improved techniques for training GANs.
Estimating the support of a high-dimensional distribution.
Revisiting precision recall definition for generative modeling.
Ministry of Health. SIVEP-Gripe public dataset.
New method for parameter estimation in probabilistic models: minimum probability flow. Physical Review Letters.