Private Tabular Survey Data Products through Synthetic Microdata Generation

Jingchen Hu, Terrance D. Savitsky, and Matthew R. Williams

January 15, 2021

Abstract: We propose two synthetic microdata approaches to generate private tabular survey data products for public release. We adapt a pseudo posterior mechanism that downweights by-record likelihood contributions with weights $\in [0,1]$, based on their identification disclosure risks, to produce tabular products for survey data. Our method, applied to an observed survey database, achieves an asymptotic global probabilistic differential privacy guarantee. Our two approaches synthesize the observed sample distribution of the outcome and survey weights, jointly, such that both quantities together possess a privacy guarantee. The privacy-protected outcome and survey weights are used to construct tabular cell estimates (where the cell inclusion indicators are treated as known and public) and associated standard errors that correct for survey sampling bias. Through a real data application to the Survey of Doctorate Recipients public use file and simulation studies motivated by the application, we demonstrate that our two microdata synthesis approaches to constructing tabular products provide superior utility preservation as compared to the additive-noise approach of the Laplace Mechanism. Moreover, our approaches allow the release of microdata to the public, enabling additional analyses at no extra privacy cost.

Survey data are collected by government statistical agencies from individuals, households, and business establishments to support research and policy making. For example, the Survey of Doctorate Recipients (SDR) provides demographic, education, and career history information from individuals with a U.S. research doctoral degree in a science, engineering, or health (SEH) field. The SDR is sponsored by the National Center for Science and Engineering Statistics and by the National Institutes of Health. Conducted since 1973, the SDR is a unique source of information about the educational and occupational achievements and career movement of U.S.-trained doctoral scientists and engineers in the United States and abroad. Survey sampling designs typically utilize unequal probabilities for the selection of respondents from the population in order to over-sample important sub-populations or to improve the efficiency of a domain (e.g., underrepresented minority) estimator (e.g., of income). Correlations are also induced by sampling geographic clusters of correlated respondents, which is done for convenience and cost. As a result of unequal inclusion probabilities and the dependence induced by the survey sampling design, the distribution of variables of interest (e.g., income) is expected to differ between the observed sample and the underlying population. Therefore, models and statistics estimated on the observed sample without correction will be biased. At the same time, many government statistical agencies are under legal obligation (such as CIPSEA and Title 13 in the U.S.) to protect the privacy and confidentiality of survey participants. Agencies utilize statistical disclosure control procedures before releasing any statistics derived from survey responses to the public.
More recent disclosure limitation methods include, on the one hand, the addition of noise to statistics to perturb their values (Dwork et al., 2006) and, on the other hand, the release of synthetic data, generated from a model that encodes smoothing, to replace the confidential data (Little, 1993; Rubin, 1993). Both classes of methods induce distortion into the statistics or data targeted for release in order to encode privacy protection. By contrast, survey sampling weights (constructed to be inversely proportional to respondent inclusion probabilities) are used to correct statistics estimated on the sample to the population; the goal is to reduce distortion or bias. Such is the case with the use of the generalized regression estimator (GREG) for producing population statistics from an observed sample (Cassel et al., 1976), where the survey weights incorporate nonresponse adjustments and benchmarking in addition to the inclusion probabilities. There is, therefore, a tension between using survey sampling weights to correct the distortion in the observed sample statistics relative to the population, on the one hand, and injecting distortion into data statistics to induce privacy protection, on the other hand. Shlomo et al. (2019) is one of the early works that investigates privacy protection approaches for survey-weighted frequency tables. There is a strong connection between smoothness and disclosure risk: the more local or less smooth the confidential data distribution, the higher the identification disclosure risks for survey respondents. The distribution of the sampling weights is typically highly skewed, with extremely large values. In the SDR, individuals are sampled at different rates depending on their doctorate field of study and their demographic information to ensure adequate precision for estimating these domains. As a result, individuals in more common fields and demographic groups are assigned relatively low inclusion probabilities, which produce higher-magnitude sampling weight values. This skewed distribution of sampling weights can inadvertently accentuate the disclosure risk for a relatively isolated participant (e.g., someone with an unusual income) by assigning them a large sampling weight. Employing the sampling weights to correct estimates of statistics computed on the observed sample back to the population adds peakedness or roughness to those statistics that, in turn, must be smoothed in a disclosure risk-limiting procedure.

This paper focuses on adapting statistical disclosure control procedures for the production of synthetic data to survey data, for the estimation of tabular statistics for the population. The utility metrics we consider are the tabular cell estimates and associated standard errors constructed from simulated synthetic data with survey weights. In the sequel, we develop two alternatives that correct the survey data distribution to the target population of interest, while simultaneously inducing distortion into the generated synthetic data to reduce the identification disclosure risks for survey respondents. Our focal metric for measuring the relative privacy guarantees of our additive noise and synthetic data methods is differential privacy (DP) (Dwork et al., 2006). Below we provide a formal definition of DP.

Definition 1 (Differential Privacy) Let D ∈ R^{n×k} be a database in input space D. Let M be a randomized mechanism such that M(): R^{n×k} → O. Then M is ε-differentially private if, for all pairs of databases D, D′ ∈ D that differ in a single record and for all measurable output sets O ⊆ O,

Pr[M(D) ∈ O] ≤ e^{ε} × Pr[M(D′) ∈ O].
The privacy parameter ε may be viewed as a budget (Dwork et al., 2006) that may be expended on selective releases of privacy-protected statistics. An example is a mechanism that outputs a randomized statistic, f(q_j(D)), for query q_j, by adding Laplace noise proportional to ∆_{G,j}/ε_j, equipped with a privacy guarantee ε_j. Here ∆_{G,j} denotes the global sensitivity, over the space of databases D, of the query statistic q_j. If there are J such queries, then the owner of the confidential data will set each ε_j such that an overall target guarantee, ε = Σ_{j=1}^{J} ε_j, is achieved. Under this setup, the guarantee ε is viewed as a budget that is allocated to account for the information disclosed in each query.

An alternative to the addition of noise to statistics is the generation of synthetic data to replace the confidential data for release to the public. Synthetic data are produced by estimating a model on the confidential data and then generating replicate data from the model posterior predictive distribution (Little, 1993; Rubin, 1993). A major advantage of synthetic data methods, in contrast with additive noise mechanisms, is that they do not need to account for interactive query data releases that spend a privacy budget for each release mechanism computed on the confidential data. The synthetic data release comes with a certain level of privacy protection. Moreover, these data may be used for any purpose, including producing unlimited tables with cells at any level of granularity, although their utility is often excellent on predetermined, analysis-specific measures but not necessarily on other measures. Nevertheless, from the privacy protection perspective, there is no subsequent privacy "accounting" required after the initial creation of the synthetic data.

Dimitrakakis et al. (2017) employ the Exponential Mechanism of McSherry and Talwar (2007) for generating synthetic data by selecting the model log-likelihood as the utility function, which produces the posterior distribution, ξ(θ | X), as the random mechanism, M(X, θ). They demonstrate a connection between the model-indexed sensitivity, sup_{x,y ∈ X^n: δ(x,y)=1} sup_{θ ∈ Θ} |f_θ(x) − f_θ(y)| ≤ ∆, and the privacy guarantee, ε ≤ 2∆, where f_θ(x) is the model log-likelihood and ∆ denotes a Lipschitz bound. The guarantee applies to all databases x in X^n, the space of databases of size n. The posterior mechanism suffers from a non-finite ∆ = ∞ for most Bayesian probability models used in practice.

Suppose a sample S of n individuals is taken from a population U of size N. The sample is taken under a survey design distribution that assigns indicators ω_i ∈ {0, 1} to each individual in U with probability of selection P(ω_i = 1 | A) = π_i, where A denotes the accumulated information in the population. Often the selection probabilities π_i are related to the response of interest, y_i. The balance of information in the observed sample, {y_i, i ∈ S}, then differs from the balance of information in the population, {y_ℓ, ℓ ∈ U}, a situation commonly known as an informative sampling design (Pfeffermann and Sverchkov, 2009). To account for this imbalance, survey weights w_i = 1/π_i are used to construct estimators on the observed sample that reduce bias; for example, a consistent estimator of the population mean is μ̂ = Σ_{i∈S} w_i y_i / Σ_{i∈S} w_i. When the estimation focus moves beyond simple statistics, consistent estimation for more general models can be based on the exponentiated pseudo likelihood, L_w(θ) = Π_{i∈S} p(y_i | θ)^{w_i}.
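As a concrete illustration of the role of the survey weights, the following minimal Python sketch computes the weighted population-mean estimator μ̂ and a pseudo log-likelihood. The normal model on log(y) and the simulated data are illustrative assumptions for this sketch only, not the SDR data or the models used later in the paper.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)

# Toy observed sample: outcome y_i and survey weights w_i = 1 / pi_i.
y = rng.lognormal(mean=11.5, sigma=0.4, size=500)      # e.g., salaries
pi = np.clip(y / y.max(), 0.02, 1.0)                   # informative selection probabilities
w = 1.0 / pi                                           # survey weights

# Weighted estimator of the population mean: sum(w * y) / sum(w).
mu_hat = np.sum(w * y) / np.sum(w)

# Pseudo log-likelihood for a simple normal model on log(y):
# log L_w(theta) = sum_i w_i * log p(y_i | theta).
def pseudo_loglik(mu, sigma, y, w):
    return np.sum(w * norm.logpdf(np.log(y), loc=mu, scale=sigma))

print(mu_hat, pseudo_loglik(11.5, 0.4, y, w))
```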
Use of this pseudo likelihood in Bayesian probability models, which we utilize in the sequel, provides for consistent estimation of θ for broad classes of both population models (Savitsky and Toth, 2016) and complex survey sampling designs (Williams and Savitsky, 2020a). Godambe and Thompson (1986) and Pfeffermann (1996) discuss use of the pseudo likelihood in frequentist estimation. The use of survey weights increases the influence of individual observations, thus increasing the sensitivity of the output mechanism. For example, a weighted count has a sensitivity of max_i w_i instead of 1 in the unweighted case. Thus, direct use of additive noise and perturbation can lead to a large amount of noise at the expense of utility. For estimation, survey weights mitigate the estimation bias, but the uncertainty distribution (covariance structure) also needs to be estimated and adjusted (Williams and Savitsky, 2020b). The typical assumption for variance estimation (of the same models) is an arbitrary amount of within-cluster dependence, both in the sampling design and in the population generating model (Heeringa et al., 2010; Rao et al., 1992). The de facto approach for variance estimation is based on the approximate sampling independence of the primary sampling units (Heeringa et al., 2010). Variance estimation can take the form of Taylor linearization or replication-based methods, with a variety of implementations available for each (Binder, 1996; Rao et al., 1992). Williams and Savitsky (2020b) propose a hybrid approach made possible by recent advances in algorithmic differentiation (Margossian, 2018). Each of these methods re-uses the data and thus requires additional privacy budget allocation. Trade-offs between efficient estimation of variance (using the full number of clusters or replicates for full precision of the variance estimates) and conserving an ε budget (aggregating clusters and reducing replicates at the cost of reduced precision of the variance estimates) are an open challenge.

In this work, we aim to produce tabular data products with privacy guarantee ε, conditioned on the local survey sample database. Specifically, we propose to synthesize microdata under the pseudo posterior mechanism (Savitsky et al., 2022), coupled with survey weights. In other words, we extend Savitsky et al. (2022) to the generation of synthetic survey microdata. We may view the focus on tabular data, formed from the synthetic survey microdata, as one type of data utility for evaluating our adapted pseudo posterior mechanism. We note that, once generated, synthetic survey microdata may be used for many purposes, each with distinct measures of utility, e.g., weighted regressions with survey weights. Our focus on tabular statistics owes to their being the main data product released by government statistical agencies. Motivated by the SDR application presented in Section 3, which stratifies by field of study and oversamples based on gender and underrepresented minority status, we utilize an informative single-stage stratified sample design with unequal probabilities of selection (with or without replacement). We consider a local survey database y_n = (y_1, · · · , y_n), design information in variables X_n = (x_1, · · · , x_n), and associated survey weights w_n = (w_1, · · · , w_n). We consider a univariate continuous outcome variable y_i, such as salary, and a categorical design information vector x_i, such as field of study and gender, where fields of study are strata.
These design variables are considered public information in our setup. Our goals are to create private tables of counts (cell, marginal, and total) and average salary (cell, marginal, and total) by field and gender from the synthetic microdata containing synthesized y_n and w_n, with an ε privacy guarantee.

The basic setup of our proposed survey microdata synthesizers uses privacy protection weights α under the pseudo posterior mechanism. This mechanism estimates an α−weighted pseudo posterior distribution (without survey weights w_n),

ξ^{α}(θ | y_n) ∝ [ Π_{i=1}^{n} p(y_i | θ)^{α_i} ] × ξ(θ),   (1)

where ξ(θ) is the prior distribution and p(·) denotes the likelihood function; the corresponding randomized mechanism is the pseudo posterior mechanism. Define ∆_α as the α−weighted Lipschitz bound, or sensitivity, over the space of databases and the space of parameters. The weights, (α(y_i)), are formulated locally from the observed database, y_n. We denote the Lipschitz bound computed locally from the observed database as ∆_{α,y_n}. Differential privacy is a guarantee over the space of all databases, y_n ∈ Y^n, of size n. Savitsky et al. (2022) demonstrate that the pseudo posterior mechanism of Equation (1) satisfies a formal guarantee that they label "asymptotic differential privacy" with level ε = 2∆_α, or aDP−ε. Asymptotic differential privacy guarantees that the Lipschitz bounds computed locally on a collection of observed databases, ∆_{α,y_n}, contract onto a global Lipschitz bound, ∆_α, as the number of database observations, n, increases. The contraction of the local Lipschitz bounds onto the global bound occurs at rate O(n^{−1/2}). We may imagine the generation of multiple collections of databases, {y_{n,r}}_{r=1}^{R}, that produce the associated collection of Lipschitz bounds, (∆_{α,y_{n,r}})_{r=1}^{R}. The aDP result guarantees that, for n sufficiently large, the local Lipschitz bounds for that collection of databases contract onto the global Lipschitz bound. In practice, Savitsky et al. (2022) show in a Monte Carlo simulation study that, for sample sizes of a few hundred, ∆_{α,y_n} = ∆_α up to any desired precision.

The detailed steps for calculating α_i ∝ 1/∆_{y_i} ∈ [0, 1] based on the local database (y_n, X_n) are laid out in Algorithm 1 of Savitsky et al. (2020). A quick overview of the algorithm, including the generation of synthetic data, involves the following steps (see the code sketch below):

1. Estimate the unweighted posterior distribution, ξ(θ | y_n, X_n), and obtain parameter draws θ^{(s)}, s ∈ (1, . . . , S), from the model posterior distribution for θ.

2. Compute the record-level Lipschitz bound, ∆_{y_i} = max_{s ∈ (1,...,S)} |log p(y_i | θ^{(s)})|, where θ^{(s)} represents a draw of the posterior distribution from the MCMC estimation algorithm used for the model and ∆_{y_i} is the Lipschitz bound local to database y_n when all weights α are set to 1. The maximum of the absolute value of the log-likelihood for record i over the parameter space represents the disclosure risk for that record. A high-risk data record tends to be isolated from other data records, such as a very high-income individual when the sensitive variable of focus is income for a collection of individuals. When the data value for a record is isolated, its absolute log-likelihood value will be very large, since little probability mass is placed in that tail region of the distribution, and the resulting α_i for that record will be relatively small.

3. Scale each record-indexed weight, α_i, to lie in [0, 1], such that the likelihood contribution for a relatively higher-risk record is downweighted and the data for that record exert less influence on the pseudo posterior distribution. The record-level Lipschitz bounds under the α−weighted pseudo likelihood then satisfy ∆_{α,y_i} ≤ ∆_{y_i}, where the latter inequality derives from α_i ≤ 1 for all i ∈ (1, . . . , n).

4.
Compute the overall Lipschitz bound of the pseudo posterior mechanism local to database y_n, ∆_{α,y_n} = max_{i ∈ (1,...,n)} ∆_{α,y_i}. As contrasted with the posterior mechanism of Dimitrakakis et al. (2017), the pseudo posterior mechanism guarantees ∆_{α,y_n} < ∞ by setting α_i = 0 for any record with a non-finite absolute log-likelihood.

5. With a set of parameter draws for θ^{(ℓ)}, simulate a synthetic dataset y_n^{*,(ℓ)} according to the sampling model (y_i | x_i, θ). If m synthetic datasets are simulated (ℓ = 1, · · · , m), the synthetic data release determines the privacy guarantee, ε_{y_n} = 2∆_{α,y_n} × m.

There is no leakage in the computation of the data-dependent weights, α(y_n), used to formulate the pseudo posterior mechanism, because these privacy protection weights are not released; they serve only to downweight or remove the data contributions of highly risky records. The overall sensitivity and privacy guarantee are released, as is the synthetic microdata generated from the pseudo posterior mechanism (via the model posterior predictive distribution). The lower the weights, (α_i)_{i=1}^{n}, the greater the degree of distortion induced in the distribution of the synthetic data as compared to the confidential data. It bears mention that each weight, α_i, can be scaled and shifted by (c_1, c_2) to α_i^* ∈ [0, 1] (as in α_i^* = c_1 × α_i + c_2) to increase or decrease the weights as an indirect means of achieving a target ε_{y_n} = 2∆_{α^*,y_n} for the generation of synthetic databases.

The pseudo posterior mechanism has the virtue that it may be applied to any data synthesizing model and is readily estimable, while being equipped with an asymptotic DP guarantee. The pseudo posterior mechanism allows us to take a synthesizer model with good utility and modify it to become asymptotically differentially private. For survey data with sampling weights, we could jointly synthesize the response variable y_n together with the survey weights and then use the synthesized weights to extrapolate the synthesized response variable back to the population. Alternatively, we could directly synthesize from the population distribution for the response variable by using the survey weights released with the observed sample to correct back to the population during estimation. The former method synthesizes both the outcome variable y_n and the weights w_n under the distribution of the observed sample and then uses the synthesized weights to correct the data back to the population distribution. The latter method uses the survey weights to synthesize the outcome variable under the population distribution and then discards the weights. In either approach, the design variables X_n are considered public. We note that if any of these design variables were deemed private, we would co-model and synthesize them without loss of generality.

In this work, we propose two microdata synthesizers that incorporate survey weights under the pseudo posterior mechanism framework: (i) a Fully Bayes model for the observed sample that models (y_n, w_n) under a bivariate normal model on the transformed data, using privacy protection weights α in the joint likelihood, where the public X_n are used as predictors and are not synthesized; (ii) a Fully Bayes model for the population that forms the exact likelihood for (y_n, w_n) in the observed sample to model y_n and corrects for population bias (Pfeffermann et al., 1998; Leon-Novelo and Savitsky, 2019), where privacy protection weights α are used and the public X_n are used as predictors and are not synthesized.
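The following minimal Python sketch illustrates steps 1 to 5 above. It assumes posterior draws are already available, uses a simple normal likelihood purely for illustration, and applies one simple choice of scaling of α into [0, 1]; the exact scaling and model are those of Algorithm 1 in Savitsky et al. (2020), not this sketch.

```python
import numpy as np
from scipy.stats import norm

def pseudo_posterior_weights(y, theta_draws, m=3):
    """Sketch of steps 1-5: record-level Lipschitz bounds, alpha weights,
    the overall local Lipschitz bound, and epsilon = 2 * Delta * m."""
    # Step 2: record-level Lipschitz bound with all alpha = 1:
    # Delta_i = max_s |log p(y_i | theta^(s))| over posterior draws.
    loglik = np.stack([norm.logpdf(y, loc=mu, scale=sd)
                       for (mu, sd) in theta_draws])          # S x n
    delta_i = np.max(np.abs(loglik), axis=0)                   # length n

    # Step 3: alpha_i proportional to 1 / Delta_i, scaled into [0, 1];
    # non-finite contributions receive alpha_i = 0 (step 4 guarantee).
    alpha = np.where(np.isfinite(delta_i), 1.0 / delta_i, 0.0)
    alpha = alpha / alpha.max()                                # one simple scaling

    # Record-level bounds under the alpha-weighted pseudo likelihood.
    delta_alpha_i = np.max(np.abs(alpha * loglik), axis=0)

    # Step 4: overall local Lipschitz bound; step 5: privacy guarantee.
    delta_alpha = delta_alpha_i.max()
    epsilon = 2.0 * delta_alpha * m
    return alpha, delta_alpha, epsilon

# Toy usage with illustrative posterior draws of (mu, sigma).
rng = np.random.default_rng(1)
y = rng.normal(0.0, 1.0, size=200)
draws = [(rng.normal(0, 0.05), abs(rng.normal(1, 0.05))) for _ in range(100)]
alpha, delta_alpha, eps = pseudo_posterior_weights(y, draws)
print(delta_alpha, eps)
```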
Both methods are scaled to have an equivalent asymptotic differential privacy guarantee (ε_{y_n} = 2∆_{α,(y_n,X_n,w_n)} × m), where m ≥ 1 is the number of simulated synthetic datasets. Once synthetic microdata are available from each approach, we create survey tables of counts and average outcome by design variables and compare their utility performances, mainly the point estimates and standard error estimates. We utilize a comparison method, the Laplace Mechanism, that adds noise proportional to the sensitivity local to the database. A detailed review of the Laplace Mechanism is available in the Supplementary Materials. We use the same level of privacy guarantee (ε_{y_n} = 2∆_{α,(y_n,X_n,w_n)} × m) as in our two microdata synthesis approaches. At the same level of privacy guarantee, we compare their utility performances on the created survey tables of counts and average salary by design variables. The privacy guarantee is "local" for the Laplace Mechanism in that it applies to the single observed dataset rather than to the collection of all possible datasets of the same type (i.e., global). The pseudo posterior mechanism results, by contrast, are asymptotically (globally) differentially private. The comparisons between the two mechanisms on an observed or local database are useful for assessing differences in utility performance at the same privacy guarantee, even though the pseudo posterior mechanism is equipped with an asymptotic DP guarantee.

The remainder of the paper is organized as follows. In Section 2, we lay out the details of our two proposed microdata synthesis approaches, with a discussion and comparison between the two, as well as how to create survey tables from synthetic microdata under each approach. We also describe details of how to construct the sensitivity of the Laplace Mechanism with survey weights. We present a real data application to a sample of the SDR in Section 3, where we focus on utility comparison among the three methods.

Our first approach is fully Bayesian because it jointly models the outcome y_n and the sampling weights w_n of the observed sample of size n. We label it FBS (Fully Bayes Sample). Since we model the observed sample, not the population, we do not assume the model estimated on the sample is the population generating model. In fact, the distributions of the outcome and weight variables are generally expected to be different in the observed sample than in the underlying population. We retain the smoothed / model-estimated version of the outcome and weights, and we utilize the latter to correct the distribution of the outcome in the sample back to the population. We use a bivariate normal synthesizer for the joint distribution of (ỹ_i, w̃_i) of unit i, where ỹ_i and w̃_i are y_i and w_i after appropriate transformation (e.g., log transformation), with predictors x_i, as in Equation (2):

(ỹ_i, w̃_i)′ | x_i, β_y, β_w, Σ ∼ N_2( (x_i′ β_y, x_i′ β_w)′, Σ ).   (2)

For synthesizing data (y_i, w_i), we then back-transform the generated bivariate normal values of (ỹ_i, w̃_i). We specify independent and identically-distributed multivariate Gaussian priors for the coefficient locations, β_y and β_w, and a uniform prior for the covariance matrix, Σ, over the space of covariance matrices of size 2 × 2 (Stan Development Team, 2016). The details of the prior specification are included in the Supplementary Materials.
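To illustrate the synthesis and back-transformation step under Equation (2), the following Python sketch generates a synthetic pair (y_i^*, w_i^*) from one posterior draw of (β_y, β_w, Σ) under a log transformation of both variables. The parameter values, predictor layout, and function name are illustrative assumptions, not fitted SDR values.

```python
import numpy as np

rng = np.random.default_rng(42)

def synthesize_fbs(X, beta_y, beta_w, Sigma, rng):
    """Draw synthetic (y*, w*) from a bivariate normal synthesizer on the
    log scale, as in Equation (2), then back-transform with exp()."""
    n = X.shape[0]
    mean = np.column_stack([X @ beta_y, X @ beta_w])        # n x 2 mean
    z = rng.multivariate_normal(np.zeros(2), Sigma, size=n)  # bivariate normal noise
    ytilde, wtilde = (mean + z).T
    return np.exp(ytilde), np.exp(wtilde)                    # back-transform

# Toy usage with one illustrative posterior draw.
X = np.column_stack([np.ones(5), rng.integers(0, 2, size=5)])   # intercept + gender
beta_y, beta_w = np.array([11.5, 0.05]), np.array([4.0, -0.10])
Sigma = np.array([[0.16, 0.02], [0.02, 0.25]])
y_star, w_star = synthesize_fbs(X, beta_y, beta_w, Sigma, rng)
print(y_star, w_star)
```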
We fit this unweighted synthesizer for (y_n, X_n, w_n) and calculate the unit-level privacy protection weights α = (α_1, · · · , α_n) using the procedure described in Section 1.4.1, where we use Stan (Stan Development Team, 2016) to provide posterior estimates for the parameters, θ = (β, Σ). For each unit i, we exponentiate its likelihood contribution by α_i, so that we arrive at the α−weighted pseudo posterior distribution of θ = (β, Σ), as in Equation (3):

ξ^{α}(β, Σ | y_n, X_n, w_n) ∝ [ Π_{i=1}^{n} p(ỹ_i, w̃_i | x_i, β, Σ)^{α_i} ] × ξ(β, Σ).   (3)

Once we estimate this α−weighted pseudo posterior distribution of (β, Σ), we simulate m posterior samples, which achieve the local (ε_{y_n} = 2∆_{α,(y_n,X_n,w_n)} × m) privacy guarantee. Given the simulated m posterior samples of (β, Σ), we can generate m synthetic survey datasets following the bivariate normal model in Equation (2), denoted as (Y^*, X, W^*). Each synthetic survey dataset (y_n^{*,(ℓ)}, X_n^{(ℓ)}, w_n^{*,(ℓ)}), ℓ = 1, · · · , m, is used to form survey tables, which are to be released. This table creation process does not cost additional privacy budget since it is post-processing (the confidential data y_n are not further used; Dwork et al., 2006; Nissim et al., 2007). Moreover, since the predictors x_i are not synthesized, this approach produces partially synthetic data, and the survey tables are created using the combining rules of partial synthesis (Reiter and Raghunathan, 2007; Drechsler, 2011). Details of the combining rules are included in Appendix A. In addition, we use smoothed w_n^{*,(ℓ)} from the conditional normal distribution derived from Equation (2),

w̃_i | ỹ_i, x_i ∼ N( x_i′ β_w + ρ (σ_w/σ_y)(ỹ_i − x_i′ β_y), σ_w^2 (1 − ρ^2) ),

where ρ denotes the correlation between ỹ_n and w̃_n, and σ_y and σ_w are the standard deviations of ỹ_n and w̃_n, respectively, to create survey tables from the synthetic survey sample (y_n^{*,(ℓ)}, X_n^{(ℓ)}). The smoothed weights w_n^{*,(ℓ)} provide survey tables with less noise, an appealing feature of modeling w_n jointly with the outcome variable y_n. We can also directly release the synthetic survey samples (Y^*, X, W^*) to the public. Data users will need to know how to incorporate the survey weights w_n^{*,(ℓ)} for unbiased inference with respect to the population. The availability of synthetic survey samples allows data users to perform analyses of their own interest which are only feasible with microdata, such as a weighted regression of income y_n on predictors X_n, whose accuracy needs to be evaluated. Nevertheless, this increases the utility of the synthetic data compared to direct table protection, without additional privacy loss.

Our second approach is also fully Bayesian and jointly models the outcome y_n and the sampling weights w_n. However, the specific joint specification, modeled on the observed sample, is for the generative model of the population and the sample design, rather than for the sample itself. We label it FBP (Fully Bayes Population). To form the exact likelihood for (y_i, w_i) in the observed sample, which corrects for population bias, we follow the fully Bayesian approach proposed by Leon-Novelo and Savitsky (2019) and form the exact likelihood through the inclusion probability, where w_i = 1/π_i; that is, we model (y_i, π_i) in the observed sample. We first assume a linear model for the population outcome, y_i | x_i, β, σ_y, as in Equation (4). Given y_i, the conditional population model for the inclusion probability, π_i | y_i, x_i, is given in Equation (5), where κ_y and κ_x are regression coefficients for y_i and x_i, respectively, with scale σ_π. We specify independent multivariate normal priors for β and (κ_y, κ_x) and half Cauchy priors for σ_y and σ_π. The details of the prior specification are available in the Supplementary Materials.
In practice, we may first transform the outcome (log(y_i)). Leon-Novelo and Savitsky (2019) show that the posterior distribution for the parameters under the observed sample is formed from the exact sample likelihood for (y_i, π_i), given in Equation (6). After fitting this unweighted synthesizer for (y_n, X_n, π_n), we calculate the unit-level privacy protection weights α = (α_1, · · · , α_n). For each unit i, we exponentiate its likelihood contribution by α_i, so that we arrive at the α−weighted pseudo posterior distribution of (β, σ_y, κ_y, κ_x, σ_π), as in Equation (7):

ξ^{α}(β, σ_y, κ_y, κ_x, σ_π | y_n, X_n, π_n) ∝ [ Π_{i=1}^{n} p(y_i, π_i | x_i, β, σ_y, κ_y, κ_x, σ_π)^{α_i} ] × ξ(β, σ_y, κ_y, κ_x, σ_π),   (7)

where p(y_i, π_i | ·) denotes the exact observed-sample likelihood of Equation (6). Once we estimate this α−weighted pseudo posterior distribution of (β, σ_y, κ_y, κ_x, σ_π), we simulate m posterior samples, which achieve the local (ε_{y_n} = 2∆_{α,(y_n,X_n,w_n)} × m) privacy guarantee. Given the simulated m posterior samples of (β, σ_y, κ_y, κ_x, σ_π), we can generate m synthetic survey datasets following the population model in Equations (4) and (5), producing (y_n^{*,(ℓ)}, π_n^{*,(ℓ)}) with w_n^{*,(ℓ)} ∝ 1/π_n^{*,(ℓ)}, for ℓ = 1, · · · , m (the ∝ accounts for normalization). Similar to FBS, each synthetic survey dataset (y_n^{*,(ℓ)}, w_n^{*,(ℓ)}), ℓ = 1, · · · , m, is used to form survey tables. Partial synthesis combining rules are used to create the survey tables because the predictors x_i are not synthesized. As with FBS, the table creation process does not cost additional privacy budget. Moreover, smoothed weights w_n^{*,(ℓ)} are generated from the smoothed π_n^{*,(ℓ)} by using only the κ_y y_i component of the mean in Equation (5). These weights are only needed when combining data across strata; within strata, the individuals in the synthetic population are equally weighted. See Section 2.3 for more details. Alternatively, we can directly release the synthetic survey samples (Y^*, X, W^*) to the public. Data users do not need to know how to incorporate the survey weights w_n^{*,(ℓ)} for unbiased inference with respect to the population, since Y^* is already corrected for survey sampling bias. However, data users do need to aggregate the survey weights w_n^{*,(ℓ)} to account for differences in population sizes across strata, for example when creating tables of counts. As with FBS, the release of the synthetic survey samples (Y^*, X, W^*) increases the utility of the synthetic data compared to direct table protection, without additional privacy loss.

In this section, we discuss and compare the two microdata synthesis approaches, whose key features are summarized and compared in Table 1.

Table 1: Key features of the two microdata synthesis approaches.
                         FBS                 FBP
Modeling w_n             yes                 yes
Bias correction stage    analysis            synthesis/analysis
Synthesis type           partial synthesis   partial synthesis
Synthesized variables    (y_n, w_n)          (y_n, w_n)

Both approaches are fully Bayesian and jointly model the outcome variable and the sampling weights. FBS models the observed sample without correction for population bias; therefore, we use the synthesized sampling weights w_n^* to form survey tables (correcting the bias at the analysis stage). FBP corrects for bias at the modeling stage. However, since FBP does not co-model X_n (X_n are used as predictors), the synthetic weights are still used to generate marginal estimates across different values of X_n. In other words, the partially synthetic FBP does not correct for the difference in the distributions of X_n between the sample and the population. Both approaches implement partial synthesis because the design variables X_n are used as predictors but are not synthesized. We also note that while FBP can incorporate population bias correction directly into the model, it is less flexible than FBS in more complicated settings, for example, when there is more than one outcome variable of mixed types.
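For completeness, the standard combining rules for partially synthetic data (Reiter and Raghunathan, 2007; Drechsler, 2011), which both approaches use to form final table estimates, take the following form. For a scalar estimand Q with point estimate q^{(ℓ)} and variance estimate u^{(ℓ)} computed on the ℓ-th synthetic dataset, ℓ = 1, · · · , m,

q̄_m = (1/m) Σ_{ℓ=1}^{m} q^{(ℓ)},   b_m = (1/(m − 1)) Σ_{ℓ=1}^{m} (q^{(ℓ)} − q̄_m)²,   ū_m = (1/m) Σ_{ℓ=1}^{m} u^{(ℓ)},

and the final point and variance estimates are q̄_m and T_p = ū_m + b_m/m, with inference based on a t distribution with ν_p = (m − 1)(1 + ū_m/(b_m/m))² degrees of freedom.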
An important aspect of discussing and comparing the two approaches is how to create survey tables from each approach once synthetic microdata are obtained. These tables include both point estimates and standard error estimates. FBS requires correction for population bias of the outcome variable with survey weights. Therefore, with the ℓ-th (ℓ = 1, · · · , m) synthetic sample from FBS, (y_n^{*,(ℓ)}, X_n, w_n^{*,(ℓ)}), the estimated count and average salary for the cell defined by field f and gender g are

N̂_{fg}^{(ℓ)} = Σ_{i∈S} w_i^{*,(ℓ)} 1_i(f, g)   and   μ̂_{fg}^{(ℓ)} = Σ_{i∈S} w_i^{*,(ℓ)} y_i^{*,(ℓ)} 1_i(f, g) / Σ_{i∈S} w_i^{*,(ℓ)} 1_i(f, g),

where 1_i(f, g) is an indicator variable for individual i belonging to cell {f, g}. Variance estimates for a cell total count and average salary for each database ℓ are produced via Taylor linearization (Binder, 1996), a standard method for survey samples. For linear estimators such as totals and means, the approach is straightforward:

• For each cluster c of the n_c sampled clusters, define the aggregate residual r_c = Σ_{i∈c} 1_i(f, g) w_i (y_i − μ̂_{fg}), the weighted, cell-centered contribution of its units; note that Σ_{c∈S} r_c = 0 exactly for linear estimators. The variance estimate is then formed from the between-cluster variability of the r_c within each stratum.

For a one-stage stratified design such as the SDR, each respondent is an individual unit (i = c), and variance calculations are performed independently within strata and then aggregated. The m synthetic databases are used to compute a between-database variance, and the combining rules of partial synthesis are used to compute final point and standard error estimates.

FBP incorporates correction for population bias of the outcome variable in the model for given values of X_n. In our example, X_n are categorical. Therefore, we create average salary values μ̂_{fg} by design variables without any weights (w^* = 1). Because we do not synthesize X_n, however, we use the smoothed weights, w_n^{*,(ℓ)}, to construct marginal salary values (e.g., over gender, μ̂_g, and field, μ̂_f). When creating count (or size) estimates N̂_{fg} for the population based on the X_n categories, we create counts with the smoothed weights w_n^{*,(ℓ)}. As with FBS, we use the combining rules of partial synthesis to create final point and standard error estimates. We estimate variances for the tabular data using the same Taylor linearization approach outlined above for FBS.

With co-modeling of the survey weights together with the outcome salary and the use of smoothed weights in table construction, FBS and FBP will produce more accurate point estimates and smaller standard error estimates of counts and average salary values than any mechanism that uses the raw (unsmoothed) sampling weights, such as additive noise mechanisms. In the sequel, we ensure robust Markov chain Monte Carlo mixing for accurate estimation of the posterior distribution under both models, for the SDR application in Section 3 and the simulation studies in Section 4.

As a comparison to our two Bayesian microdata synthesizers, which use smoothing to encode disclosure protection, we include the alternative of adding noise to tabular products produced directly from the confidential survey data. Each product (e.g., cell means, cell counts, and corresponding standard errors) has a different amount of noise added from the Laplace distribution. See the Supplementary Materials for a review of the Laplace Mechanism. We present sensitivity calculations for adding noise according to the Laplace distribution, which are based on the outcome variable and survey weights local to the database (y_n, X_n, w_n). Since neither the weights nor the outcomes are required to be bounded, global additive noise mechanisms do not exist (i.e., they would add noise of infinite or unbounded scale). Let S_{f,g} represent the set of observations in field of study f and gender g.
Consistent with our use of partial synthesis for FBS and FBP, we assume that the unweighted sample sizes n_{f,g} are not sensitive (i.e., publicly available), which is often the case for demographic surveys. The local sensitivity ∆^c_{f,g} for the count of field f and gender g (cell count) is

∆^c_{f,g} = max_{i ∈ S_{f,g}} w_i.

The local sensitivity ∆^a_{f,g} for the average salary of field f and gender g (cell average) is constructed analogously from the local values of w_i and y_i in S_{f,g}. The marginal and total counts and averages are calculated in a similar fashion, once S_f, S_g, and S are defined accordingly. Finally, with the calculated local sensitivity ∆^c_{f,g} for the count of field f and gender g, the noise added to that cell count is sampled from a Laplace distribution with location 0 and scale ∆^c_{f,g}/ε (Dwork et al., 2006), where ε is the privacy budget. A similar process is used when adding noise to the average salary value of field f and gender g with the calculated local sensitivity ∆^a_{f,g}. For a given privacy budget ε, the larger the local sensitivity, ∆^c_{f,g} or ∆^a_{f,g}, the larger the scale of the added noise. It will often be the case that the observed values of y_i and w_i within each field f and gender g cell do not represent the full range of values in the corresponding cell population, so we take the maximum of the cell-specific sensitivities and use that for the cell-level Laplace noise: ∆^{c*} = max_{f,g} ∆^c_{f,g} and ∆^{a*} = max_{f,g} ∆^a_{f,g}. We acknowledge that our implementation of the Laplace Mechanism adds noise to each cell, which will not ensure features such as the counts of male and female adding up to the count of both genders in a given field. Modifications can be made to the query or to the post-processing to enforce these consistency requirements (Li et al., 2010). There are also works on post-processing optimization (Li et al., 2014), which would benefit our synthetic data approaches as well once tables are created. For illustration in this paper, we do not pursue such modifications.

In addition to the counts and averages, we must also add noise to their corresponding variance estimates. We use a replication method (Rao et al., 1992) with a set of R = 10 replicates. We choose the random replication method of Preston (2009), as opposed to the stratified jackknife or balanced repeated replication (BRR), so that we can tune the number of replicates, R, directly. We add noise to each of these 10 replicated point estimates, based on the sensitivity calculations for point estimates above, to achieve a target ε_{vc} = 10 ε_{rep} for a given cell, where "vc" denotes "variance cell".

• For each of the R sets of replicate weights, (w_i)_r, we modify the values for a subset of clusters. The method of Preston (2009) randomly selects half of the clusters in each stratum, setting the weights of the others to 0. Non-zero weights are then doubled such that the weight totals are invariant across strata: Σ_{i∈S} (w_i)_r = Σ_{i∈S} w_i = N̂.

• Estimate the R = 10 sets of replicate cell statistics, N̂^r_{fg} and μ̂^r_{fg}, from the noise-added replicate estimates. Then calculating the between-cluster variance, Var(μ̂_{fg}), is a post-processing step performed on the set of μ̂^r_{fg} and does not incur additional privacy loss.

The FBS and FBP pseudo posterior mechanisms and the Laplace Mechanism all base their local privacy guarantees on the same observed database (y_n, X_n, w_n), and their local ε_{y_n} levels are set to be equal in the analyses in the sequel to compare utility performances. Only the pseudo posterior mechanisms, however, are equipped with an asymptotic global differential privacy guarantee.
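The following minimal Python sketch illustrates the local-sensitivity Laplace noise addition described above for a single cell. The data, the budget split, and the placeholder value for the cell-average sensitivity are illustrative assumptions; the paper's full derivation defines ∆^a_{f,g} from the observed y_i and w_i.

```python
import numpy as np

rng = np.random.default_rng(3)

def laplace_release(value, sensitivity, epsilon, rng):
    """Release value + Laplace(0, sensitivity / epsilon) noise."""
    return value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Toy cell: salaries y and weights w for one (field, gender) cell.
y = rng.lognormal(11.5, 0.4, size=50)
w = rng.uniform(5, 200, size=50)

N_hat = w.sum()                      # weighted cell count
mu_hat = np.sum(w * y) / w.sum()     # weighted cell average

eps = 10.8 / 2                       # illustrative split of the budget
delta_count = w.max()                # local sensitivity of the weighted count
N_private = laplace_release(N_hat, delta_count, eps, rng)

# For the cell average, the local sensitivity (Delta^a) is computed from the
# observed y_i and w_i in the cell; here it is passed in as a placeholder.
delta_avg = 0.05 * mu_hat            # placeholder value, not the paper's formula
mu_private = laplace_release(mu_hat, delta_avg, eps, rng)
print(N_private, mu_private)
```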
The local DP guarantee of the Laplace Mechanism, by contrast, is based on the observed range of the outcome and weight values, which will vary between datasets.

In our real data application, we apply our two microdata synthesis approaches and the Laplace Mechanism to a sample of the SDR. Our sample comes from the 2017 public use file. The sample contains information on salary, field of study (8 levels), and gender (2 levels). For ease of illustration, we subset to the n = 10,355 respondents who are employed and in the 40-44 age group. Our goal is to create survey tables of counts and average salary by field and gender, along with corresponding standard error estimates, all with privacy protection. We fit our two microdata synthesis models presented in Sections 2.1 and 2.2 to generate synthetic survey microdata, from which we create survey tables containing point estimates and standard error estimates. As a comparison method, we add noise to the point estimates and standard error estimates using the Laplace distribution, with the local sensitivities for the Laplace Mechanism outlined in Section 2.4. Note that we scale the maximum Lipschitz bounds in our two microdata synthesis approaches to express an equivalent value, denoted as ∆_{α,(y_n,X_n,w_n)}. Therefore, we use ε_{y_n} = 2∆_{α,(y_n,X_n,w_n)} × m as the total privacy budget for adding Laplace noise under the Laplace Mechanism comparison method, where m denotes the number of replicate synthetic databases generated. In this application, we use m = 3 based on the results of our simulation study that follows this application.

In general, data disseminators who are trying to create a private population estimator from samples using survey weights face a challenge: the large variation in the survey design weights increases the sensitivity of any additive noise privacy mechanism. Our microdata synthesis models, FBS and FBP, tackle this challenge by co-modeling the outcome variable and the survey weights. Therefore, under an informative sampling design, our methods remove variation in the weights that is unrelated to the outcome variable. The smoothed weights would be expected to produce improved utility, defined as preserving the sample-based tabular statistics, over additive noise mechanisms, such as the Laplace Mechanism, that use the raw sampling weights. However, our motivating SDR sample utilizes a nearly non-informative design (the correlation between the outcome variable salary and the survey design weights is about 0.09), and thus is nearly representative of the population. We therefore expect the utility advantage of weight smoothing to be more modest in this application than under an informative design. By contrast, the Laplace and related mechanisms add noise to the confidential data. Under our use of a modeling framework, we are able to selectively downweight record-indexed likelihood contributions to be more precise in encoding privacy, which better preserves distributional characteristics in the synthetic data. If the sampling design is informative, which is the more common setup that we explore in our simulation studies of Section 4, our FBS and FBP are expected to outperform the Laplace Mechanism on both counts and average salary values, due to their co-modeling of salary and weight, which produces weight smoothing for more efficient estimators. We next investigate the model fit performances of FBS and FBP.
Each plot panel in Figure 1 displays the distribution of record-level Lipschitz bounds for one of the two synthesizers, comparing the α−weighted pseudo posterior mechanism with its unweighted counterpart. Figure 1 further reveals that the (weighted) Lipschitz distribution for FBS produces more records expressing relatively high values of ∆_{α,(y_i,X_i,w_i)} that are concentrated around the overall Lipschitz bound, ∆_{α,(y_n,X_n,w_n)}, for the entire database, as compared to FBP. Although only the overall Lipschitz bound (i.e., the maximum) controls the overall privacy guarantee, we observe that FBS avoids overly downweighting records' likelihood contributions through α, which results in higher utility, as we will see in Section 3.3, making it a more efficient synthesizer (see also Savitsky et al., 2020). It bears noting that for real data applications, such as Google's 2020 COVID-19 Mobility Reports and LinkedIn's Audience Engagement API, the privacy budgets used ranged from 8.6 to 79.22 for monthly queries (Bowen and Garfinkel, 2021). The SDR is conducted every other year, and we believe a privacy budget of 10.8 is within an acceptable range for real data applications given current practices. Moreover, we note that releasing ∆_{α,y_n}, such as in Figure 1, is equivalent to releasing ε_{y_n}, which contracts onto the global ε for n sufficiently large (more than a couple of hundred units). In other words, there is no additional leakage from releasing ∆_{α,y_n}.

With both Bayesian methods satisfying an equivalent privacy guarantee, we compare their utility results: the point estimates and standard error estimates of the resulting privacy-protected tables. Our methods are also compared to the noise-added point and standard error estimates from the Laplace Mechanism satisfying the equivalent privacy guarantee of ε_{y_n} = 10.8. We recall that our goal is to create private survey tables of counts and average salary. Certain key features, such as the counts of male and female adding up to the count of both genders in a given field, are naturally maintained by the microdata synthesis approaches, as evident in Tables 2 and 3. Improvements can be made to enforce these consistency requirements for the Laplace Mechanism, such as the methods proposed by Li et al. (2010), which are not pursued in this paper. We also note that FBS has higher modeling flexibility and is more straightforward to implement than FBP. As discussed in Section 3.1, when working with a more informative sampling design, we anticipate that our FBS and FBP will outperform the Laplace Mechanism on both counts and average salary values. By co-modeling the salary variable and the survey weights, FBS and FBP remove variation in the survey weights through smoothing, which in turn produces smaller standard error estimates for the cells. The Laplace Mechanism, on the other hand, suffers greatly from the variation of the survey weights due to increased sensitivity values. To fully explore such scenarios, we next conduct a series of simulation studies in which we utilize an informative sampling design that is more typical of those administered by government statistical agencies.

In our simulation studies, we take a sample from a simulated population under an informative sampling design. We perform estimation on the observed sample using the FBS and FBP approaches for synthesis and create survey tables from the resulting synthetic microdata samples. As in the SDR application, we scale the two approaches to equivalent levels of privacy guarantee to compare their utility performances.
We also add noise under the Laplace Mechanism, at an equivalent privacy guarantee, as a comparison.

We design our simulation study based on the 2017 SDR public use file. We simulate a population of N = 100,000 units containing unit-level information on salary, field, and gender. In our simulated population, the field and gender percentages follow those in the public use file. Given simulated field and gender, each unit's salary value y_i is simulated from a lognormal distribution with a field- and gender-specific mean (obtained from the public use file) and a fixed scale of 0.4. We simulate the inclusion probability π_i for unit i by generating additive noise_i from a normal distribution with mean 0 and the same fixed scale of 0.4 in the construction log(π_i) = log(y_i) + noise_i. We then obtain the survey weight w_i ∝ 1/π_i [footnote 2]. Less noise corresponds to a more informative sampling design, whereas more noise corresponds to a weaker relationship between the outcome and the selection probability. We choose a moderate level of noise corresponding to a moderately informative design, resulting in a −0.57 correlation between the salary and the survey weights in the population. Next, we take a stratified probability proportional to size (PPS) sample of n = 1000 units, where fields are used as strata (and π_i is used as the size variable in Equation (11)). We denote by y_n the outcome variable salary, by X_n the field and gender variables, and by w_n the sampling weights of the realized / observed sample. The correlation between y_n and w_n is −0.58 in the sample.

To investigate the model fit performances of the FBS and FBP pseudo posterior mechanisms on the simulated sample, we examine the distributions of the record-level Lipschitz bounds ∆_{α,(y_i,X_i,w_i)} of FBS and FBP in Figure 3. As before, their corresponding unweighted posterior mechanism Lipschitz bound distributions are included for comparison. Our (weighted) FBS and FBP have an equivalent maximum Lipschitz bound of about 1.8, which means that both approaches provide an (ε_{y_n} = 2 × 1.8 × 3 = 10.8) asymptotic differential privacy guarantee with m = 3 simulated synthetic datasets. The maximum Lipschitz bounds of the two unweighted posterior mechanisms are 8.89 and 9.92, respectively. As in the SDR application, our α−weighted FBS and FBP pseudo posterior mechanisms produce lower overall Lipschitz bounds and provide an (ε_{y_n} = 10.8) privacy guarantee. Moreover, FBS downweights records' likelihood contributions through α to a lesser extent, which is expected to result in higher utility. We proceed to compare their utility results, defined as the point estimates and standard error estimates of the resulting private tables achieved by the three methods (FBS, FBP, Laplace) under an equivalent privacy guarantee. Unlike the SDR application, where we do not know the population truth, in our simulation studies we know the population values of the counts and average salary values of all cells of interest, which involve 27 cells for counts and another 27 cells for average salary values.

[Footnote 2] The actual marginal inclusion probabilities are π*_i = 1 − (1 − π_i / Σ_{ℓ∈h} π_ℓ)^n ≈ n × π_i / Σ_{ℓ∈h} π_ℓ ∝ π_i, and the sample is selected with replacement. This is close to sampling without replacement when using small sampling fractions. The π*_i's are calculated with the inclusionprobabilities() function from the sampling R package, which computes the first-order inclusion probabilities for a probability proportional-to-size sampling design (Tillé and Matei, 2021).
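A minimal Python sketch of the simulation design described above follows. The field and gender composition and the cell-level log means are illustrative placeholders rather than the values taken from the SDR public use file, and the inclusion-probability approximation is a simplified stand-in for the sampling R package calculation.

```python
import numpy as np

rng = np.random.default_rng(2021)
N, n = 100_000, 1_000

# Illustrative placeholders for field/gender composition and cell means.
fields = rng.integers(0, 8, size=N)
genders = rng.integers(0, 2, size=N)
cell_log_mean = 11.3 + 0.05 * fields + 0.04 * genders   # placeholder means

# Population salaries: lognormal with fixed scale 0.4.
y = rng.lognormal(mean=cell_log_mean, sigma=0.4)

# Informative selection: log(pi_i) = log(y_i) + noise_i, noise ~ N(0, 0.4).
pi = np.exp(np.log(y) + rng.normal(0.0, 0.4, size=N))

# Stratified PPS sample (with replacement within each field stratum).
sample_idx = []
for f in range(8):
    stratum = np.where(fields == f)[0]
    p = pi[stratum] / pi[stratum].sum()
    sample_idx.append(rng.choice(stratum, size=n // 8, replace=True, p=p))
sample_idx = np.concatenate(sample_idx)

# Approximate first-order inclusion probabilities and survey weights.
pi_star = np.empty(N)
for f in range(8):
    mask = fields == f
    pi_star[mask] = (n // 8) * pi[mask] / pi[mask].sum()
w = 1.0 / pi_star[sample_idx]
print(np.corrcoef(y[sample_idx], w)[0, 1])   # negative correlation, as in the paper
```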
We therefore calculate the RMSE value of each cell, for each of the three methods, against the cell's population value. We produce the corresponding RMSE value based on the confidential sample and create an RMSE ratio, defined as the cell-specific RMSE value of our methods divided by that of the confidential sample. The smaller the RMSE ratio, the higher the utility. We present the distributions of the 27 RMSE ratios for counts (left) and average salary values (right) in Figure 5. On the counts, FBS and FBP clearly outperform the Laplace Mechanism, with FBS producing the smallest RMSE ratios and therefore the highest utility. The superior performance of FBS and FBP lies in their co-modeling of the outcome salary and the survey weights. While the specific synthesizing model differs between them, both methods take advantage of co-modeling the weights and the outcome, which leads to weight smoothing and results in more stable estimates of domain counts that are tabulated from the weights (see, for example, Beaumont, 2008). This is especially true when an informative sampling design is employed, as in our simulation design. Between the two methods, FBS performs better. As shown in the Supplementary Materials under repeated sampling, FBP overcovers by producing longer confidence intervals, which explains its higher RMSE ratio values. For average salary values, FBS and FBP perform dramatically better than the Laplace Mechanism. These results once again illustrate the main advantage of the co-modeling in FBS and FBP over the Laplace Mechanism: fully utilizing any information in the sampling weights to improve estimation of the outcome variable(s). Between the two methods, FBP shows a more contracted RMSE ratio distribution, indicating higher utility than FBS. The advantage of FBP lies in its incorporation of population bias correction by design and its enhanced weight smoothing in relation to the salary (as shown in Figure 4 and discussed in Section 4.2). By contrast, when implementing the Laplace Mechanism, given a fixed privacy budget ε_{y_n}, the amount of noise added depends solely on the cell sensitivity: the scale of the Laplace noise is proportional to the cell sensitivity, so larger cell sensitivity results in larger added noise (Dwork et al., 2006). When units carry sampling weights, and when the sampling weight distribution has large variability, as is often the case in practice, the cell sensitivity of the counts can be large. Moreover, as in the SDR application in Section 3.3, the Laplace Mechanism does not maintain certain key features, such as the counts of male and female adding up to the count of both genders in a given field, which are naturally maintained by FBS and FBP (results omitted for brevity). In summary, our microdata synthesis approaches outperform the Laplace Mechanism in utility preservation of point estimates and standard error estimates, as illustrated by the RMSE ratio metric. Between the two methods, FBS produces counts with higher utility, while FBP produces average salary values with higher utility. Overall, both methods perform reasonably well across the two sets of tabular statistics. We next illustrate how to tune the utility-risk trade-off of FBS and FBP by investigating the effects of m, the number of simulated synthetic datasets, and of the overall Lipschitz bound. Both our FBS and FBP approaches provide an (ε_{y_n} = 2 × ∆_{α,(y_n,X_n,w_n)} × m)−DP privacy guarantee with m synthetic datasets.
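For concreteness, the privacy accounting for the values of m considered next is simple arithmetic under the ε_{y_n} = 2∆_{α,(y_n,X_n,w_n)} × m result:

```python
delta_alpha = 1.8                      # overall local Lipschitz bound
for m in (1, 3, 5):
    print(m, 2 * delta_alpha * m)      # epsilon_yn = 3.6, 10.8, 18.0
```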
Our results so far are all based on simulating m = 3 synthetic datasets, which, given ∆_{α,(y_n,X_n,w_n)} = 1.8, provide an (ε_{y_n} = 2 × 1.8 × 3 = 10.8) privacy guarantee. In this section, we investigate the utility-risk trade-off of the FBS and FBP approaches from two angles: the effect of various choices of m, the number of simulated synthetic datasets, and the effect of scaling and shifting our weights α to shift ∆_{α,(y_n,X_n,w_n)}, both of which are positively and linearly related to the overall privacy guarantee according to the (ε_{y_n} = 2 × ∆_{α,(y_n,X_n,w_n)} × m) result.

The first investigation, of the effect of m, requires no additional model fits of FBS and FBP. Given the current model fits, which result in equivalent overall Lipschitz bounds of ∆_{α,(y_n,X_n,w_n)} = 1.8, we experiment with different values of m and evaluate the impact on the privacy budget (ε_{y_n} = 2 × 1.8 × m) and on the RMSE ratio utility. We experiment with the set m = {1, 3, 5}, which is associated with ε_{y_n} = {3.6, 10.8, 18} for both methods. Figure 6 presents the resulting RMSE ratio distributions for FBS and FBP under each m, enlarging the corresponding panels of Figure 5. These enlarged panels, without the Laplace Mechanism, clearly demonstrate that FBS performs better on counts while FBP performs better on average salary values. Moreover, in many cases FBS achieves smaller-than-1 RMSE ratios for counts and FBP produces smaller-than-1 RMSE ratios for average salary values, suggesting that the private FBS and FBP tabular data produce more efficient estimators than the non-private, confidential sample. The impact of m on the RMSE ratio utility metric is, for the most part, in accordance with the expected utility-risk trade-off: as m increases from 1 to 3 and then to 5, the RMSE ratio distributions of both methods become slightly more contracted and overall smaller for the counts, indicating slightly improved utility at the price of a higher privacy budget (i.e., higher risk). This also holds for FBP on average salary values, while increasing m has little impact on the FBS results for average salary values. Given the small utility improvement on the estimated salary totals for the cells, increasing m from 3 to 5 does not improve utility enough to justify the required amount of added privacy budget (from 10.8 to 18 in this case). Therefore the (m = 3, ε_{y_n} = 10.8) setup is ideal for a utility-risk trade-off balance in our simulation setting.

For the second investigation, of the effect of the overall Lipschitz bound ∆_{α,(y_n,X_n,w_n)}, we refit the two methods on the selected sample and perform less downweighting overall to achieve a higher overall Lipschitz bound, ∆_{α,(y_n,X_n,w_n)} = 3.4, for both methods. The refits require re-estimation of both models. Although the overall Lipschitz bound ∆_{α,(y_n,X_n,w_n)} increases from 1.8 to 3.4, which results in higher privacy budgets for a given m, we do not see much utility improvement for FBP, and in some cases the utility slightly deteriorates (note that the y-axis scale in the left panel of Figure 7, for counts, goes up to 2.5, while that in Figure 6 goes up to 2). By contrast, the utility of FBS, especially on average salary values, shows notable improvement as ∆_{α,(y_n,X_n,w_n)} increases from 1.8 to 3.4. It is then a decision for data disseminators to strike a balance in the utility-risk trade-off when determining the ideal combination of (m, ε_{y_n}) given their dissemination goals and priorities. Under the asymptotic DP result, the targeted ε corresponds to a global Lipschitz bound that is shared amongst different y_n samples taken from the population.

In summary, based on the SDR application results and the simulation results, our FBS and FBP dramatically outperform the Laplace Mechanism in our setting.
Combined with the modeling flexibility of FBS, we conclude that FBS is the preferred microdata synthesis approach. In particular, FBS is very straightforward to implement and accommodates any desired model for the observed sample. Our experiments on m and ∆_{α,(y_n,X_n,w_n)} for tuning the utility-risk trade-off in our simulation favor the (m = 3, ε = 10.8) combination. Data disseminators are encouraged to tune m and ∆_{α,(y_n,X_n,w_n)} and to make decisions according to their dissemination goals and priorities.

We address the issue of formal privacy for data collected under an informative sampling design. There are three major challenges for inducing formal privacy protection into survey data: (a) The correlation between the weights and the outcome variable(s) affects data utility, since the distribution in the sample is different from that of the underlying population; it also makes privacy protection more complex because the sampling weights also need to be protected. (b) Under the skewed and possibly unbounded outcomes (weights and salary) typically present in survey data, traditional additive noise mechanisms perform poorly, and global privacy guarantees are typically not available because unbounded variables and sampling weights produce unbounded sensitivities. (c) Both standard errors and point estimates must be produced and released in a private manner, and standard error calculations from complex survey data are nonstandard. We develop and apply two modeling approaches to perform partial data synthesis, both of which co-model the outcome variable(s) and the survey weights, jointly: FBS performs modeling under the distribution for the observed sample, while FBP directly incorporates correction for population bias of the outcome variable(s). FBS and FBP are specific implementations of our pseudo posterior mechanism, which comes equipped with an asymptotic differential privacy guarantee. Our SDR application, with a nearly non-informative sampling design, shows that FBS and FBP preserve utility well and compare favorably with the Laplace Mechanism at an equivalent privacy guarantee. Future extensions of this work will include multiple variables / responses (where we would extend the FBS and FBP synthesizers to the modeling of a multivariate response together with the survey weight variable, to privacy protect the joint distribution of these sensitive variables), two-stage sample surveys, and alternative formulations of DP for Bayesian methods, for example, censoring / transforming the log-likelihood to ensure global DP for all sample sizes (Savitsky et al., 2022). As noted earlier, there are more advanced additive noise processes than the one we used (for example, Li et al., 2010, 2014), which may lead to some efficiencies for the additive noise method, but they are complicated to implement and require specialized optimization software unavailable to the authors of this paper.

Supplementary Materials

In this section we describe some of the major classes of mechanisms for data release that can be shown to be differentially private (i.e., that satisfy the property in Definition 1 in the main text). In particular, we describe additive noise, the Exponential Mechanism, and the Bayesian posterior mechanism, and we discuss connections between the three. Variations of these approaches are compared in our analyses. Perhaps the simplest way to protect a data release is to generate random noise and add it to the original data value (record level or tabular). The key insight from Dwork et al. (2006) is that we can calibrate the amount of noise (scale or variability) to meet a specific target value of ε. For smaller ε, more noise (larger scale) is needed.
The "right" amount of noise to add is based on the sensitivity of the mechanism Definition 2 Let D, D ∈ R k×n . Let q define a (non-differentially private) data release (e.g. mean), q() : R k×n → S. Then define the L1-sensitivity of q as The sensitivity of the desired data release mechanism q() tells us how much noise to add to get a differentially private version of the mechanism M(). The most common example is the Laplace Mechanism, which adds noise from a Laplace distribution. The density of a Laplace distribution is the following: Given a deterministic data release q() : R k×n → S, define a Laplace Mechanism M L () = q() + LAP(0, ∆ q / ). Then M L is an -differentially private release mechanism (Dwork et al., 2006) . Dwork et al. (2006) also considers metrics beyond the absolute difference measure (L1) to measure sensitivity and how to create a mechanism that is differentially private. Subsequent works clarified this formulation and named it the Exponential Mechanism and showed how the Laplace Mechanism is a special case (McSherry and Talwar, 2007; Wasserman and Zhou, 2010) . Switching notation, slightly, let θ be the desired output (previously s) which could be a synthetic value, a model parameter, or a tabular summary statistic. The Exponential Mechanism inputs a non-private mechanism for θ and generates θ in such a way that induces a DP guarantee on the overall mechanism. Definition 3 The Exponential Mechanism M E uses a utility function u(D, θ) to generate values of θ from a distribution proportional to, where, ∆ u = sup D,D ∈R k×n :δ(D,D )=1 sup θ∈Θ |u(D, θ) − u(D , θ)| is the sensitivity, defined globally over D ∈ R k×n , δ(D, D ) = #{i : Di = D i} which is the Hamming distance between D, D ∈ R k×n . ξ (θ | γ) is a proper probability measure. Each single draw of θ from the Exponential Mechanism M E (θ|D) satisfies -DP. See McSherry and Talwar (2007) or Wasserman and Zhou (2010) . The main benefit of the Exponential Mechanism over (symmetric) additive noise is that the general utility function u(D, θ) can be constructed in such a way that restrictions to the range of output (for example enforcing positive counts) are possible and the input data can be less restrictive (for example variables without natural bounds, such as revenue). In other words, the perturbation is not symmetric by default and can be applied to more variable types besides categories and counts. A major challenge when using the Exponential Mechanism is actually sampling from the implied distribution in (12). For stability, a base (or prior) probability measure is often used to guarantee that implied distribution is a proper probability measure (ξ (θ | γ) integrates to 1 instead of ∞). Still, generating a θ using an arbitrary utility function u() is a significant challenge (Wasserman and Zhou, 2010; . Wang et al. (2015) show the connection between the Exponential Mechanism and sampling from the posterior distribution of a probability model by setting the utility u as the log-likelihood. The main benefit of using Bayesian methods is the extensive research into computational methods to generate samples, something that presents a significant challenge for the general Exponential Mechanism. Dimitrakakis et al. (2017) provide alternative extensions and proofs for the differential privacy property of samples from the posterior distribution. They also assume the log-likelihood is bounded and suggest truncating the support for θ to achieve this. Savitsky et al. 
Savitsky et al. (2022) extend these results to incorporate individual-level adjustments, where the weights $\alpha_i \propto 1 / \Delta_i$ are related to record-specific sensitivity estimates $\Delta_i$.

C Prior specification of FBS in Section 2.1

We specify independent and identically distributed multivariate Gaussian priors for the coefficient locations $\beta_y$ and $\beta_w$ in Equations (14) and (15), where $K$ is the dimension of the predictors $x_i$ and $\Sigma_\beta$ is a $K \times K$ correlation matrix. We give the Cholesky factor of $\Sigma_\beta$ an LKJ prior with shape parameter 6, and each component of $\sigma_{\beta_y}$ and $\sigma_{\beta_w}$ receives a Student-t prior with 3 degrees of freedom and scale 10. For the covariance parameter $\Sigma$, we specify an LKJ prior as in Equation (16), where $d = 2$ is the dimension of the response vector $[y_i, w_i]$ and $\Omega_\Sigma$ is a $d \times d$ correlation matrix. We give the Cholesky factor of $\Omega_\Sigma$ an LKJ prior with shape parameter 6, and each component of $\sigma_\Sigma$ receives a Student-t prior with 3 degrees of freedom and scale 10. We refer interested readers to Stan Development Team (2016) for the usage and specification of LKJ priors.

D Prior specification of FBP

We specify independent multivariate normal priors for $\beta$ and $(\kappa_y, \kappa_x)$ in Equations (17) and (18), where $I$ is the identity matrix. For $\sigma_y$ and $\sigma_x$, we specify half-Cauchy priors, as in Equation (19), restricted to the positive real line.

E Additional results for weight smoothing effects in Section 4.2

For each of the $r = 1, \cdots, R$ samples, we apply the FBS and FBP approaches and scale them through $(c_1, c_2)$ to achieve equivalent $\epsilon_{y_n} = 2 \Delta_{\alpha,(y_n,X_n,w_n)} \times m$, where we simulate $m = 3$ synthetic datasets from each approach. In fact, from now on we will use $\epsilon$ instead of $\epsilon_{y_n}$ to indicate that we target each $\epsilon_{y_n}$ to achieve the target global $\epsilon$, which mimics the real-world application of our pseudo posterior synthesizers. Through scaling and shifting, we are able to achieve the targeted overall Lipschitz bound $\Delta_{\alpha,(y_n,X_n,w_n)}$ in every simulated database, and therefore the guarantee is global. For utility evaluation, we calculate the effective coverage rate of the nominal 95% confidence interval of the population count and average salary by field and gender; the closer the coverage rate is to 0.95, the better the performance. In addition, we calculate the average ratio of the coefficients of variation (CV), which are the standard errors scaled by the corresponding estimates, and compare the two approaches; the CV measures relative efficiency, and the lower the CV, the better the performance.

The detailed cell-level results for $m = 3$ are available in Tables 4 and 5. Here, we use Figure 10 to display the coverage of the nominal 95% interval on the x-axis against the CV on the y-axis, for cell counts and average salary, respectively. Each point represents one of the 24 cells: $8 \times 2 = 16$ cells for the field-by-gender combinations, and 8 cells for both genders combined within each of the 8 fields. For the count results shown on the left of Figure 10 (see Table 4 for a cell-by-cell comparison), FBP overcovers and produces overly long intervals, which is sometimes desirable for being conservative. By contrast, FBS only slightly undercovers and achieves greater efficiency, with smaller variances and shorter intervals. For the average salary results shown on the right of Figure 10, FBS and FBP perform equally well in terms of the efficiency manifested by the CV, and FBS performs slightly better in terms of coverage rates. It is worth noting that, on average, the length of the confidence intervals of FBP is 0.95 times that of FBS (see Table 5 for a cell-by-cell comparison).
The longer confidence intervals produced by FBS could partially explain its slightly better coverage performance. Nevertheless, these repeated-sampling patterns are broadly consistent with the results for the single replicate of the simulated sample examined in the earlier sections.

We also evaluate the effects of $m$, the number of simulated synthetic datasets, on the utility-risk trade-off under repeated sampling. Figure 11 shows the coverage versus CV results for FBS and FBP with $(m = 1, \epsilon = 3.6)$, and Figure 12 shows results for $(m = 5, \epsilon = 18)$. Simulating only $m = 1$ synthetic dataset creates more serious undercoverage issues for both models, for counts and for average salary values, suggesting too large a utility sacrifice in exchange for the lower privacy budget. When simulating $m = 5$ synthetic datasets, at the price of a higher privacy budget, there is a slight improvement in the coverage performance of both models for average salary values, although FBP still overcovers most of the cells and seriously undercovers the two cells related to Field 7. These results suggest that simulating $m = 3$ synthetic datasets achieves a good utility-risk balance for both models, with reasonably high utility at a reasonable privacy budget of $\epsilon = 10.8$. Further increasing $m$ to $(m = 10, \epsilon = 36)$ produces utility results similar to those of $(m = 5, \epsilon = 18)$, as Figures 12 and 13 illustrate, so a larger $m$ is advisable only for data disseminators who are comfortable with an increased privacy budget.
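As a small illustration of how these two repeated-sampling utility metrics can be computed, here is a hedged Python sketch; the function, array names, and toy numbers are hypothetical and not from the paper.

```python
import numpy as np

def coverage_and_cv(estimates, std_errors, truth, z=1.96):
    """Repeated-sampling utility metrics for one tabular cell:
    - coverage rate of the nominal 95% interval, estimate +/- z * SE,
      against the fixed population value `truth`;
    - mean coefficient of variation (CV), SE divided by the estimate.
    `estimates` and `std_errors` are arrays over the R simulated samples."""
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    lower, upper = estimates - z * std_errors, estimates + z * std_errors
    coverage = np.mean((lower <= truth) & (truth <= upper))
    cv = np.mean(std_errors / estimates)
    return coverage, cv

# Hypothetical example for a single field-by-gender cell across R = 100 samples.
rng = np.random.default_rng(1)
true_count = 5000.0
est = true_count + rng.normal(0.0, 150.0, size=100)   # synthetic-data cell estimates
se = np.full(100, 150.0)                               # their estimated standard errors
print(coverage_and_cv(est, se, true_count))
```

Summaries of this form, computed cell by cell over the $R$ replicates, correspond to the per-cell coverage and CV values plotted in Figures 10 through 13.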
References

A new approach to weighting and inference in sample surveys
Linearization methods for single phase and two-phase samples: a cookbook approach
Philosophy of differential privacy
Some results on generalized difference estimation and generalized regression estimation for finite populations
Differential privacy for bayesian inference through posterior sampling
Synthetic Datasets for Statistical Disclosure Control
Calibrating noise to sensitivity in private data analysis
Parameters of superpopulation and survey population: Their relationships and estimation
Applied Survey Data Analysis
Fully Bayesian estimation under informative sampling
A data- and workload-aware algorithm for range queries under differential privacy
Optimizing linear counting queries under differential privacy
Statistical analysis of masked data
A review of automatic differentiation and its efficient implementation
Mechanism design via differential privacy
Smooth sensitivity and sampling in private data analysis
The use of sampling weights for survey data analysis
Parametric distributions of complex survey data under informative probability sampling
Chapter 39: Inference under informative sampling
Rescaled bootstrap for stratified multistage sampling
Some recent work on resampling methods for complex surveys
The multiple adaptations of multiple imputation
Discussion: Statistical disclosure limitation
Re-weighting of vector-weighted mechanisms for utility maximization under differential privacy
Bayesian estimation under informative sampling
Bayesian pseudo posterior mechanism under asymptotic differential privacy
Confidentiality protection approaches for survey weighted frequency tables
General and specific utility measures for synthetic data
pMSE mechanism: Differentially private synthetic data with maximal distributional similarity
RStan: the R interface to Stan
sampling: Survey sampling
Privacy for free: Posterior sampling and stochastic gradient monte carlo
A statistical framework for differential privacy
Bayesian estimation under informative sampling with unattenuated dependence
Uncertainty estimation for pseudo-Bayesian inference under complex sampling

A Review of combining rules for partial synthesis

For estimand $Q$, let $q^{(\ell)}$ be the point estimator of $Q$ and $u^{(\ell)}$ the variance estimator of $q^{(\ell)}$ in the $\ell$-th synthetic dataset, $\ell = 1, \cdots, m$. The analyst can use $\bar{q}_m = \sum_{\ell=1}^{m} q^{(\ell)} / m$ to estimate $Q$ and $T_m = \bar{u}_m + b_m / m$ to estimate its variance, where $\bar{u}_m = \sum_{\ell=1}^{m} u^{(\ell)} / m$ and $b_m = \sum_{\ell=1}^{m} (q^{(\ell)} - \bar{q}_m)^2 / (m - 1)$. Inferences can be based on $t$ distributions with degrees of freedom $\nu_m = (m - 1)\left(1 + \bar{u}_m / (b_m / m)\right)^2$ when the sample size is large.
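As a quick numerical companion to these combining rules, here is a hedged Python sketch; the function name and the example numbers are illustrative, not from the paper.

```python
import numpy as np
from scipy import stats

def combine_partial_synthesis(q, u, conf_level=0.95):
    """Combining rules for partially synthetic data, as described above:
    `q` and `u` are length-m arrays of point estimates and their variance
    estimates from the m synthetic datasets."""
    q = np.asarray(q, dtype=float)
    u = np.asarray(u, dtype=float)
    m = len(q)
    q_bar = q.mean()                      # combined point estimate
    u_bar = u.mean()                      # within-synthesis variance
    b_m = q.var(ddof=1)                   # between-synthesis variance
    T_m = u_bar + b_m / m                 # total variance of q_bar
    nu = (m - 1) * (1.0 + u_bar / (b_m / m)) ** 2   # t degrees of freedom
    half_width = stats.t.ppf(0.5 + conf_level / 2.0, df=nu) * np.sqrt(T_m)
    return q_bar, T_m, (q_bar - half_width, q_bar + half_width)

# Example with m = 3 synthetic datasets for one tabular cell.
print(combine_partial_synthesis(q=[101.0, 98.5, 103.2], u=[4.0, 3.6, 4.4]))
```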