title: Causal Structural Learning on MPHIA Individual Dataset authors: Bao, Le; Li, Changcheng; Li, Runze; Yang, Songshan date: 2022-02-14 The Population-based HIV Impact Assessment (PHIA) is an ongoing project that conducts nationally representative HIV-focused surveys for measuring national and regional progress toward UNAIDS' 90-90-90 targets, the primary strategy to end the HIV epidemic. We believe the PHIA survey offers a unique opportunity to better understand the key factors that drive the HIV epidemics in the most affected countries in sub-Saharan Africa. In this article, we propose a novel causal structural learning algorithm to discover important covariates and potential causal pathways for the 90-90-90 targets. Existing constraint-based causal structural learning algorithms are quite aggressive in edge removal. The proposed algorithm preserves more information about important features and potential causal pathways. It is applied to the Malawi PHIA (MPHIA) data set and leads to interesting results. For example, it discovers age and condom usage to be important for female HIV awareness; the number of sexual partners to be important for male HIV awareness; and knowing the travel time to HIV care facilities leads to a higher chance of being treated for both females and males. We further compare and validate the proposed algorithm using BIC and Monte Carlo simulations, and show that the proposed algorithm achieves improvement in true positive rates in important feature discovery over existing algorithms.
In 2014, the United Nations Joint Programme on HIV and AIDS (UNAIDS) set the 90-90-90 targets as the primary strategy to end the HIV/AIDS epidemic (Joint United Nations Programme on HIV/AIDS, 2014), which include identifying 90% of people living with HIV through expanded testing, placing 90% of positively identified individuals on antiretroviral therapy, and ensuring that 90% of those on therapy achieve undetectable viral loads by 2020. Considerable progress has been made towards these targets, hereafter Tri90 (Gaolathe et al., 2016; Gisslen et al., 2017). To end the HIV/AIDS epidemic, we shall learn from the past efforts towards Tri90 and identify important features that could guide more targeted and effective health policies. Novel datasets and sophisticated modeling tools are needed to enhance our understanding of Tri90 achievements. The Population-based HIV Impact Assessment (PHIA) survey is a nationally representative HIV-focused survey that started data collection in 2015. It is designed to measure the reach and impact of HIV programs. We believe it offers a unique opportunity to better understand the key factors that drive the HIV epidemics in the most affected countries in sub-Saharan Africa. In this article, we propose a novel causal structural learning algorithm to discover important covariates and potential causal pathways for Tri90. Causal structural learning aims to build a directed acyclic graph (DAG) that shows direct causal relations among variables of interest in a given domain. The resulting DAG helps us to understand the mechanism behind the data. Many classical structural learning algorithms are constraint-based, such as the PC algorithm (Spirtes and Glymour, 1991), the PC-stable algorithm (Colombo and Maathuis, 2014), and the MMPC algorithm (Tsamardinos et al., 2003b). Constraint-based algorithms learn graphical structures by searching for d-separation sets.
D-separation is an important graphical concept for causal structural learning, where the "d" stands for dependence or directed. Roughly speaking, in a directed acyclic graph (DAG), two vertices X and Y are d-separated by a set of vertices Z = {Z_1, · · · , Z_d} if and only if all the paths/information between X and Y are blocked by vertices in Z (see Section 2.1 for a rigorous definition of d-separation). The d-separation relationship in a DAG can be related to conditional independence through the Markov condition and the faithfulness assumption (Spirtes, 2010) (see definitions in Section 2.1). That is to say, under the Markov condition and the faithfulness assumption, vertices X and Y being d-separated by a vertex set Z = {Z_1, · · · , Z_d} is equivalent to the conditional independence of the corresponding variables X and Y given the corresponding set of variables Z = {Z_1, · · · , Z_d}. However, existing constraint-based structural learning algorithms can be quite aggressive in edge removal: if a type II error is made such that two connected covariates X and Y are deemed conditionally independent given some Z, the edge between X and Y is removed mistakenly, and useful information about important features and possible causal pathways gets lost during this edge-removal process. Especially in the case of categorical variables and relatively small sample sizes, the conditional independence tests used by structural learning algorithms can have high type II error rates that lead to many false edge removals and severe information loss. As illustrated in the PHIA data analysis, very few features are connected when the constraint-based structural learning algorithms are used. Detailed results and discussions can be found later in Section 3.
In the literature dealing with unreliable conditional independence tests, most work focuses on their negative effects in the orientation procedure and much less on those in the skeleton learning procedure, with a few exceptions. Bromberg and Margaritis (2009) proposed a method to resolve the inconsistencies in the conditional independence tests in the skeleton learning procedure. Their idea is to deduce a "preference" score on the conditional independence test results from a certain set of axioms, such as Pearl's axioms (Dawid, 1979; Pearl, 1988), which all true conditional independence relationships should follow. These axioms can be seen as integrity constraints that can rule out certain inconsistent test outcomes. However, the computational complexity of the algorithm proposed by Bromberg and Margaritis (2009) increases significantly as the number of vertices increases. Thus, the algorithm cannot be employed on the MPHIA dataset, for which the graph is not sparse enough. In this article, we propose a new causal structural learning algorithm that aims to preserve more information on important features and potential causal pathways. We apply the proposed algorithm to the Malawi PHIA (Ministry of Health, Malawi et al., 2017) data set, which we refer to as MPHIA, and obtain interesting findings related to Tri90 pathways. We further compare and validate the proposed algorithm against some classical structural learning algorithms using information criteria and simulations. The remaining part of the paper is organized as follows. In Section 2.1, we provide definitions for important concepts in causal structural learning, such as d-separation, and also a basic summary of the graphical notations used in the paper. In Section 2.2, we use a simple example to illustrate the problem of aggressive edge removal in existing structural learning algorithms. In Sections 2.3–2.6, we propose a new causal structural learning algorithm to overcome the aggressive edge-removal issue.
In Section 3, we apply the proposed algorithm on the MPHIA data set, compare the results obtained from the proposed algorithm with those of the existing algorithms in Section 3.1, and discuss the discovered Tri90 pathways in detail in Section 3.2. In Section 4, we compare the numerical performance of the proposed algorithm with classical structural learning algorithms in simulation studies. Section 5 provides a summary and discussion for the paper. To save space, technical details, additional numerical results, and the relevant codebooks are provided in the Supplement. In this section, we propose our causal structural learning algorithm. We first provide definitions for important concepts in causal structural learning, such as d-separation, and also a basic summary for the graphical notations used in the paper in Section 2.1. We then discuss the problem of aggressive edge removal of existing structural learning algorithms with a simplified example in Section 2.2. To deal with this problem, we propose our new algorithm and provide a high-level overview of the algorithm in Section 2.3. The proposed algorithm consists of two main steps, the forward step and the maximization step, which are explained in detail in Sections 2.4 and 2.5, respectively. Section 2.6 provides the orientation procedure of our proposed algorithm and also summarizes the proposed method. Suppose G is a directed acyclic graphical (DAG) model which represents the joint probability distribution over the vertex set V with directed edges and no directed loop. Each vertex in the graph represents a variable. We use variables X, Y , etc., to refer to the variables corresponding to the vertices X, Y , etc., and use the edge X → Y or Y ← X to refer to the directed edge from X to Y , where X is a parent vertex of Y , and Y is a child of X. X − Y denotes an undirected edge that could be either X → Y or Y → X. 
We use a path to refer to an acyclic sequence of adjacent vertices and a causal path from X to Y to refer to a path on which all arrows point away from X and into Y. If there is a causal path from X to Y, we say that X is an ancestor of Y and that Y is a descendant of X. Next, we provide the formal definition of d-separation. A collider is a vertex on a path with two incoming arrows. More specifically, a vertex Z is a collider (v-structure) on a path U if and only if the path U contains a subpath X → Z ← Y. For vertices X, Y and a vertex set Z which does not contain X and Y, X is d-connected to Y given Z if and only if there is an acyclic path U between X and Y such that every collider on U is either a member of Z or an ancestor of a member of Z, and no non-collider on U is in Z. X is d-separated from Y given Z if and only if X is not d-connected to Y given Z. As a simple illustration, suppose that Z is the only vertex in Z and the only intermediate vertex on a path U between X and Y. Then X and Y are d-connected given Z when U is the collider path X → Z ← Y, and d-separated given Z when U is a chain (X → Z → Y or X ← Z ← Y) or a fork (X ← Z → Y). Moreover, a set of variables V is causally sufficient if and only if no variable outside V is a direct cause of more than one variable in V. For a causally sufficient set of variables V with probability distribution P(V), the Markov condition assumes that d-separation in the DAG G implies conditional independence in P(V), i.e., if X is d-separated from Y by Z in G, then X is independent of Y conditional on Z in P(V); and the faithfulness assumption assumes that every conditional independence relationship in P(V) is entailed by the d-separation relationship for the causal DAG G, i.e., if X is independent of Y conditional on Z, then X is d-separated from Y by Z. Therefore, d-separation in the graph is equivalent to conditional independence in the distribution under the Markov condition and the faithfulness assumption. See Spirtes (2010) for a more detailed introduction and discussion.
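As a concrete illustration of the definition above, the standard moralization criterion gives an equivalent way to check d-separation programmatically: restrict the DAG to the ancestors of the vertices involved, connect co-parents, drop edge directions, and test ordinary graph separation. The helper below is a hypothetical sketch of ours, not part of the paper's algorithm.

```python
from collections import deque

def d_separated(parents, x, y, z):
    """Check whether x and y are d-separated given the set z in a DAG.

    `parents` maps each vertex to the set of its parents. Uses the
    classical moralized-ancestral-graph criterion.
    """
    # 1. Restrict attention to ancestors of x, y, and z.
    relevant = set()
    stack = [x, y, *z]
    while stack:
        v = stack.pop()
        if v not in relevant:
            relevant.add(v)
            stack.extend(parents.get(v, ()))
    # 2. Moralize: connect co-parents of a common child, drop directions.
    adj = {v: set() for v in relevant}
    for v in relevant:
        ps = [p for p in parents.get(v, ()) if p in relevant]
        for p in ps:
            adj[v].add(p); adj[p].add(v)
        for i in range(len(ps)):
            for j in range(i + 1, len(ps)):
                adj[ps[i]].add(ps[j]); adj[ps[j]].add(ps[i])
    # 3. x and y are d-separated iff no path avoids z in the moral graph.
    blocked, seen, queue = set(z), {x}, deque([x])
    while queue:
        v = queue.popleft()
        if v == y:
            return False
        for w in adj[v]:
            if w not in seen and w not in blocked:
                seen.add(w); queue.append(w)
    return True
```

On the chain X → Z → Y this reports d-separation only when Z is conditioned on, while on the collider X → Z ← Y it reports the opposite, matching the illustration above.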
In this paper, we always assume causal sufficiency, the Markov condition, and the faithfulness assumption. In the MPHIA data set, classical constraint-based structural learning algorithms discover very few important features for Tri90, as shown later in Section 3. In this section, we explain the main cause of the problem and illustrate the motivation of our method with a simplified example from the MPHIA data. A complete analysis of the MPHIA data can be seen in Section 3. Example 1. Suppose our graph contains only three vertices X, Y and Z, where X is Tri90Aware, the indicator variable for awareness of the HIV positive status, which is one of the variables in which we are mainly interested in our Tri90 goal study. We want to check whether Y (AlcoholFrequency, an ordinal variable for alcohol drinking frequency) and Z (WealthQuintile, an ordinal variable for wealthiness) are neighbors of X (Tri90Aware). That is to say, we are interested in whether wealthiness and alcohol drinking frequency have direct causal relationships with the awareness of the HIV positive status. We have the following four (conditional) independence test results among X, Y, and Z at a significance level α = 0.05 in the MPHIA dataset of males who are included in the Tri90 study: X ⊥ Y is rejected, X ⊥ Z is rejected, X ⊥ Y | Z is not rejected, and X ⊥ Z | Y is not rejected. (1) Since both Y (AlcoholFrequency) and Z (WealthQuintile) are marginally dependent on X (Tri90Aware) according to the testing results, they should be connected to X (Tri90Aware) either directly or indirectly, according to the Markov condition. However, based on the faithfulness assumption, neither Y (AlcoholFrequency) nor Z (WealthQuintile) should be a neighbor of X (Tri90Aware), since X (Tri90Aware) is conditionally independent of Y (AlcoholFrequency) given Z (WealthQuintile) and X (Tri90Aware) is conditionally independent of Z (WealthQuintile) given Y (AlcoholFrequency).
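The four test results in (1) come from conditional independence tests on categorical data. The paper does not spell out its exact test at this point, so the following is only a generic sketch of one common choice: a G-test (likelihood-ratio chi-square) computed within each stratum of the conditioning variable(s) and summed across strata.

```python
import math
from collections import Counter

def g_stat_ci(samples, xi, yi, zi):
    """G-statistic for testing X independent of Y given Z on categorical data.

    `samples` is a list of tuples; xi and yi are column indices, and zi is
    a tuple of column indices for the conditioning set. Returns (G, df);
    a large G relative to a chi-square with df degrees of freedom rejects
    conditional independence. Illustrative sketch only.
    """
    strata = {}
    for row in samples:
        z = tuple(row[i] for i in zi)
        strata.setdefault(z, []).append((row[xi], row[yi]))
    G, df = 0.0, 0
    for pairs in strata.values():
        n = len(pairs)
        nx = Counter(x for x, _ in pairs)
        ny = Counter(y for _, y in pairs)
        nxy = Counter(pairs)
        for (x, y), o in nxy.items():
            e = nx[x] * ny[y] / n          # expected count under independence
            G += 2.0 * o * math.log(o / e)
        df += (len(nx) - 1) * (len(ny) - 1)
    return G, df
```

The statistic G is compared against the chi-square quantile with df degrees of freedom; for df = 1 and α = 0.05 the critical value is about 3.84. Note how df grows with the number of strata: with many categorical conditioning variables and limited data, each stratum becomes thin, which is exactly the source of the high type II error rates discussed above.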
Those four testing results lead to contradictory conclusions under the Markov condition and the faithfulness assumption. Existing structural learning algorithms, such as the PC-stable algorithm, remove both the edge between Tri90Aware and AlcoholFrequency (X − Y) and the edge between Tri90Aware and WealthQuintile (X − Z) based on the conditional independence testing results and conclude that X (Tri90Aware) has no neighbors, which might be too strict in detecting edges. Many covariates in the MPHIA data are categorical. When used as the conditioning set, those categorical covariates lead to relatively high type II error rates for the conditional independence tests. In addition, triples (X, Y, Z) with contradictory/inconsistent testing results such as Equation (1) are quite common in the MPHIA data. Such contradiction/inconsistency also leads to aggressive edge removal for the constraint-based causal structural learning algorithms. To solve the false edge-removal issue, we propose a new graphical structural learning algorithm, and we illustrate how the new algorithm successfully finds more edges later in Example 2. Consider a directed acyclic graph (DAG) G with vertex set V and X ∈ V. Suppose N_X = (N_1, N_2, · · · , N_q) is the parents and children set of X. Under the faithfulness assumption and the Markov condition, respectively, we have that X is dependent on N_i given S, for every N_i ∈ N_X and every S ⊆ N_X \ {N_i}, (2) and that for every M_j ∈ M, there exists S ⊆ N_X such that X ⊥ M_j | S, (3) where M = {M_1, M_2, · · · , M_r} = V \ (N_X ∪ {X}) is the set of vertices not connected to X. Furthermore, under the Markov condition and the faithfulness assumption, suppose a variable set N satisfies conditions (2) and (3); then it is easy to verify that N is the set of parents and children of X, and thus N = N_X. For each vertex X, we want to find the best subset of V that fulfills conditions (2) and (3). Our procedure consists of two steps: a forward step and a maximization step. The forward step finds all sets N that satisfy (2) for vertex X and will be illustrated in Algorithm 1 in Section 2.4.
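Conditions (2) and (3) can be checked directly for a candidate set N once a conditional independence oracle is fixed. Below is a brute-force sketch; the `indep` oracle stands in for the paper's α-level test, and the function name is our own.

```python
from itertools import combinations

def satisfies_conditions(N, X, V, indep):
    """Check conditions (2) and (3) for a candidate neighbour set N of X.

    `indep(x, y, S)` returns True when the test of x independent of y
    given the set S is NOT rejected. Condition (2): X depends on each
    member of N given every subset of the remaining members. Condition
    (3): every vertex outside N and {X} is separated from X by some
    subset of N.
    """
    N, rest = set(N), [v for v in V if v != X and v not in N]

    def all_subsets(s):
        return (set(c) for k in range(len(s) + 1) for c in combinations(s, k))

    cond2 = all(not indep(X, t, S) for t in N for S in all_subsets(N - {t}))
    cond3 = all(any(indep(X, m, S) for S in all_subsets(N)) for m in rest)
    return cond2 and cond3
```

On the Example 1 pattern (both marginal tests rejected, both conditional tests not rejected), {Y} and {Z} each satisfy both conditions, while {Y, Z} fails (2) — the ambiguity the maximization step later resolves.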
The maximization step picks the best set that fulfills (3) among those sets found by Algorithm 1 and will be provided in Section 2.5. Before deriving the details of the algorithms, let us come back to Example 1 and see how algorithms based on (2) and (3) can solve the problem of aggressive edge removal present in the existing classical structural learning algorithms. Given the test results in (1), there are two candidate neighborhoods of X (Tri90Aware) that satisfy (2) and (3) for X: N = {Z}, since X ⊥ Z is rejected while X ⊥ Y | Z is not rejected, (4) and N = {Y}, since X ⊥ Y is rejected while X ⊥ Z | Y is not rejected. (5) In the situation of (4), the neighbor of X (Tri90Aware) is Z (WealthQuintile); and in the situation of (5), the neighbor of X (Tri90Aware) is Y (AlcoholFrequency). That is, either AlcoholFrequency (Y) or WealthQuintile (Z) is the neighbor of Tri90Aware (X), but not both of them. So, the proposed algorithm is less aggressive in edge removal than existing classical structural learning algorithms such as the PC-stable algorithm, as discussed in Example 1. Before stating Algorithm 1, we first provide some useful definitions and their properties. Let p be the total number of vertices. For each vertex X, let T = V \ {X} = {T_i, i = 1, 2, · · · , p − 1}, where vertices are considered/added sequentially in the order T_1, T_2, · · · , T_{p−1} to form a candidate neighborhood of X. That is to say, for any already formed non-empty candidate neighborhood S = {T_{s_1}, · · · , T_{s_q}} ⊆ T, where 1 ≤ s_1 < s_2 < · · · < s_q ≤ p − 1, we consider whether an additional vertex from L_X(S) := {T_{s_q+1}, T_{s_q+2}, · · · , T_{p−1}} can be added into the variable set S. We also define L_X(∅) = T, which means that we need to consider all the vertices in T when S starts from the empty set. Furthermore, let C_X(S) be the set of vertices in L_X(S) that can be added into S while still satisfying equation (2). That is to say, C_X(S) := {T ∈ L_X(S) : S ∪ {T} satisfies (2) for X}. (6) Proposition 1 establishes some useful properties of C_X(S), which are used in Algorithm 1 to facilitate the calculation of C_X(S). Proposition 1. C_X(S) has the following properties. 1. Let ∅ denote the empty set. If S = ∅, then C_X(∅) = {T ∈ T : X and T are marginally dependent}. (7) 2.
If S does not satisfy (2), then C_X(S) = ∅. (8) 3. If S_1 ⊆ S, then C_X(S) ⊆ C_X(S_1). (9) 4. Let S = {S_1, S_2, · · · , S_n}, n ≥ 1, and define C*_X(S) := ∩_{i=1}^{n} C_X(S \ {S_i}). Then C_X(S) ⊆ C*_X(S). (10) 5. If S = {S_1, S_2, · · · , S_n}, n ≥ 1, satisfies (2), then C_X(S) = {T ∈ C*_X(S) ∩ L_X(S) : X and T are dependent given S}. (11) The properties of C_X(S) in Proposition 1 can be shown by using its definition, and the proof can be found in Supplement S.1. Proposition 1 establishes a recursive structure for C_X(S). (7) shows that C_X(S) consists of all vertices that are marginally correlated with X when S is an empty set. (10) states that C*_X(S) is an upper-bound (with respect to the partial order of inclusion) for C_X(S) when S is non-empty. Furthermore, (11) tells us that we only need to examine the conditional independence relationship with a conditional set of size |S| to get C_X(S) from C*_X(S) ∩ L_X(S) for non-empty S. (11) is used in Algorithm 1 to reduce the number of conditional independence tests and speed up the finding process. We can show that, using (11), no conditional independence test is repeated in Algorithm 1. Indeed, let s_max = max{s_0, s_1, · · · , s_q}, S_max = T_{s_max}, s^-_max = {s_0, s_1, · · · , s_q} \ {s_max}, and S^-_max = {T_s | s ∈ s^-_max}; then the conditional independence test of X and T_{s_0} given S = {T_{s_1}, T_{s_2}, · · · , T_{s_q}} can only happen when we check whether S_max can be added into the candidate set S^-_max.

Algorithm 1 (Forward step): Find all potential parent and children sets of X that satisfy (2). The algorithm sequentially adds the candidate sets satisfying (2) into the preliminary result set R.
Input: (1) a vertex set V, (2) a vertex X in V, and (3) size α for the conditional independence tests.
Output: the set N_X including all possible N that (approximately) satisfy equation (2).
Initialize S = {∅} and the preliminary result set R = ∅.
while S is not empty do
    Take a set S ∈ S, processing sets of cardinality from low to high.
    if S = ∅ then
        Calculate C_X(∅) from equation (7).
    else
        Calculate C_X(S) from equation (11).
    end if
    Delete S from S.
    if C_X(S) = ∅ then
        Add S into the preliminary result set R.
    else
        Add every S ∪ {C} for C in C_X(S) into S.
    end if
end while
return N_X = {R | R ∈ R, and R is not a proper subset of any other sets in R}.
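A simplified rendering of the forward step follows, with an abstract `indep` predicate standing in for the α-level test and a plain exhaustive check of condition (2) instead of the C*_X(S) pruning of Proposition 1. All names are illustrative; this is a sketch of the idea, not the paper's implementation.

```python
from itertools import combinations

def forward_step(V, X, indep):
    """Sketch of the forward step (cf. Algorithm 1).

    `indep(x, t, S)` returns True when the conditional independence of
    x and t given the frozenset S is NOT rejected. A set S satisfies
    condition (2) when X is dependent on each member given every subset
    of the remaining members. Returns all maximal candidate sets.
    """
    T = [v for v in V if v != X]

    def satisfies_2(S):
        return all(
            not indep(X, t, frozenset(sub))
            for t in S
            for k in range(len(S))
            for sub in combinations(S - {t}, k)
        )

    results, stack, seen = [], [frozenset()], set()
    while stack:
        S = stack.pop()
        if S in seen:
            continue
        seen.add(S)
        # vertices that can still be added while keeping condition (2)
        C = [t for t in T if t not in S and satisfies_2(S | {t})]
        if not C:
            results.append(S)
        else:
            stack.extend(S | {t} for t in C)
    # keep only candidates that are not proper subsets of other candidates
    return [R for R in results if not any(R < Q for Q in results if Q != R)]
```

On an Example-1-style oracle (A always dependent on X; B separated from X only by conditioning on A), this returns the two candidate sets {A} and {B}, mirroring the ambiguity between (4) and (5).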
Remark 1: Note that the final output of Algorithm 1, N_X, only contains the candidate sets which are not proper subsets of other candidate sets, for the following reason. If N_X is a candidate set that satisfies (2) for X, then all proper subsets of N_X satisfy (2) for X, but no proper subset of N_X will satisfy (3). This is because for any N_1 ⊊ N_X, we can take M ∈ N_X \ N_1, and X is dependent on M given any S ⊆ N_1 ⊊ N_X since N_X satisfies (2). Hence N_1 does not satisfy (3). Remark 2: In the calculation of C_X(S) and C*_X(S), the size of the conditional sets in the conditional independence tests is not restricted. However, conditional independence tests with large conditional sets are not accurate, and the number of all possible conditional independence tests also grows exponentially with the size of the conditional sets. Hence we set an upper-bound on the sizes of the conditional sets in Algorithm 1, which is also discussed by Tsamardinos et al. (2006) and is commonly implemented in causal structural learning software such as Scutari (2010). In addition, as shown by (11), if we set the upper-bound of the sizes of conditional sets to be M_CI, then we can approximate C_X(S) by C*_X(S) ∩ L_X(S) when |S| > M_CI. (12) Remark 3: M_CI is a tuning parameter in Algorithm 1. Theoretically, one should choose an M_CI that is no less than the largest degree of the vertices in the graph. However, there are concerns about using a large M_CI, and the true degrees are also unknown in practice. First, a large M_CI corresponds to big conditional sets, which will increase the computing cost. Second, when the sample size is not large enough, a large M_CI leads to less reliable results due to the increased type II errors. Generally, it is recommended to set M_CI = 3 when the true graph is expected to be sparse or moderately sparse (Yan and Zhou, 2020). We also carry out a simulation study on different choices of M_CI; more details and discussions can be seen in Supplement S.5.1.
To choose the best neighbor set from the set of candidate sets N_X, we check how well each candidate N ∈ N_X satisfies equation (3). Define S_N(X, Y) = max_{S ⊆ N} CI(X, Y | S), (13) where CI(X, Y | S) is the p-value of some chosen conditional independence test for X and Y given S. To see why we use the maximum in (13), note that conditional independence implies large p-values, and S_N(X, Y) measures whether any subset of N makes X and Y conditionally independent. The maximum in (13) comes from the fact that (3) only requires one S that makes X and Y conditionally independent, as suggested by a large p-value. The idea of using the conditional independence set with the largest p-value in a causal structural learning algorithm can also be found in Ramsey (2016). Further define Q_X(N) = min_{M_i ∈ V\(N∪{X})} S_N(X, M_i), (14) which measures how well subsets of N "separate" X from the vertices not in N ∪ {X}. Here we are concerned about whether any M_i ∈ V\(N∪{X}) violates the conditional independence between M_i and X given subsets of N, and hence the minimum of p-values across the M_i's is used. Equations (13) and (14) together can be seen as a minimax procedure. If N satisfies Equation (3), then S_N(X, M_i) should be large for any M_i not in the neighborhood of X; hence the candidate with the largest Q_X(N) is likely to be the true neighbor set of X. Furthermore, we have the following useful properties of S_N(X, Y) in Proposition 2, which are used in Algorithm 2 to facilitate the calculation of S_N(X, Y). Proposition 2. S_N(X, Y) has the following properties. 1. Let ∅ be the empty set. Then S_∅(X, Y) = CI(X, Y). (15) 2. If N_1 ⊆ N, then S_{N_1}(X, Y) ≤ S_N(X, Y). (16) 3. Let N = {N_1, N_2, · · · , N_n}, n ≥ 1, and define S*_N(X, Y) := max_i S_{N\{N_i}}(X, Y). Then S_N(X, Y) ≥ S*_N(X, Y), (17) and S_N(X, Y) = max{S*_N(X, Y), CI(X, Y | N)}. (18) The properties of S_N(X, Y) in Proposition 2 can be shown by using its definition, and the proof can be found in Supplement S.2. Similar to Proposition 1, Proposition 2 shows a recursive structure in S_N(X, Y). (15) shows how to calculate S_N(X, Y) for an empty set. (17) states that S*_N(X, Y) is a lower-bound for S_N(X, Y) for a non-empty set N. Furthermore, (18) tells us that we only need to calculate the p-value of the conditional independence test of X and Y given N to obtain S_N(X, Y) from S*_N(X, Y), which is used in Algorithm 2 to reduce the number of conditional independence tests and speed up the searching process.
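The minimax selection in (13) and (14) can be sketched as follows, again with a p-value oracle `ci_pvalue` standing in for the chosen test and brute-force subset enumeration in place of the recursive speed-ups of Proposition 2. Function names are illustrative.

```python
from itertools import chain, combinations

def best_neighbour_set(V, X, candidates, ci_pvalue, m_ci=3):
    """Maximization step sketch: pick the candidate N maximizing Q_X(N).

    S_N(X, Y) is the maximum CI p-value of X and Y over subsets S of N
    with |S| <= m_ci, as in equation (13); Q_X(N) is the minimum of
    S_N(X, M) over vertices M outside N and {X}, as in equation (14).
    `ci_pvalue(x, y, S)` is an abstract p-value oracle.
    """
    def subsets(N):
        return chain.from_iterable(
            combinations(N, k) for k in range(min(len(N), m_ci) + 1))

    def S_N(N, Y):
        return max(ci_pvalue(X, Y, frozenset(S)) for S in subsets(N))

    def Q_X(N):
        outside = [M for M in V if M != X and M not in N]
        return min((S_N(N, M) for M in outside), default=1.0)

    return max(candidates, key=Q_X)
```

On the Example 1 pattern, if conditioning on Z separates X from Y with a much larger p-value than any attempt to separate X from Z, then {Z} wins: the candidate whose "worst separated" outside vertex is best separated.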
It is easy to prove that, using (18), no conditional independence test is repeated in Algorithm 2, because the conditional independence test of X and Y given N only happens when we calculate S_N(X, Y). Remark: Similar to Algorithm 1, we propose to set the upper-bound for the sizes of conditional sets to be M_CI; then S_N(X, Y) can be approximated by the maximum of CI(X, Y | S) over S ⊆ N with |S| ≤ M_CI. (19) Furthermore, from (18), we can approximate S_N(X, Y) by S*_N(X, Y) when |N| > M_CI, and by max{S*_N(X, Y), CI(X, Y | N)} otherwise. (20) Based on (18) and (20), we have Algorithm 2, which selects the N with the largest Q_X(N) to be the neighbor set of X.

Algorithm 2 (Maximization step): Find N ∈ N_X with the largest Q_X(N).
Input: (1) a vertex set V, (2) a vertex X in V, and (3) the candidate neighbor sets N_X of X.
Output: the N ∈ N_X with the largest Q_X(N).
Let r = 0 and N* = ∅.
for each N ∈ N_X do
    Calculate Q_X(N) using (18) and (20).
    if Q_X(N) > r then
        Let r = Q_X(N) and N* = N.
    end if
end for
return N*.

With Algorithms 1 and 2, we can learn the neighbor set of every vertex X in the DAG G and hence the skeleton of the DAG. We can further orient edges according to the conditional independence relationships. The overall structural learning algorithm is summarized in Algorithm 3. The orientation procedure is quite similar to those of classic constraint-based causal structural learning algorithms, such as the PC algorithm (Spirtes and Glymour, 1991). We also incorporate some prior knowledge into the orientations. For example, we know covariates such as AgeGroup cannot be affected by other covariates such as Education, so if there is an edge between AgeGroup and Education, then we orient the edge from AgeGroup to Education. If there are multiple d-separation sets S(X, Y) for a non-adjacent pair (X, Y), we use the d-separation set S(X, Y) with the largest p-value to make the orientation results stable. For Z adjacent to both X and Y in a non-adjacent pair (X, Y), we call X − Z − Y a potential v-structure and make the orientation X → Z ← Y when Z is not in the chosen d-separation set S(X, Y). In Algorithm 3, we resolve conflicts between v-structures by comparing the p-values of the v-structures. Remark 1: Note that multiple orientations may satisfy the inferred d-separation (conditional independence) structure.
Hence, in Algorithm 3, we first use prior knowledge in the orientation process to establish orientations and to enhance the interpretability of the orientation results. For instance, we presume that age may affect education status but not the other way around. The prior knowledge is provided in the sample code of the supplemental material.

Algorithm 3 (Overall structural learning algorithm).
Input: (1) a vertex set V, and (2) size α for conditional independence tests.
Output: CPDAG G.
for every vertex X ∈ V do
    Calculate the neighbor set N_X using Algorithms 1 and 2.
end for
Start from a complete undirected graph G with the vertex set V.
for every pair of vertices (X, Y), remove the edge X − Y if Y ∉ N_X and X ∉ N_Y.
Orient v-structures using prior knowledge and the d-separation sets with the largest p-values.
Form G by recursive orientation according to the following two rules:
1. If X − Y and there is a directed path from X to Y, then orient X − Y as X → Y;
2. If X and Y are not adjacent and there is a Z such that X → Z and Z − Y, then orient Z − Y as Z → Y.

Remark 2: Note that in the d-separation set searching procedure in Algorithm 3, we use the d-separation set S(X, Y) with the largest p-value to stabilize the orientation results. This agrees with equation (13) used in the maximization step for skeleton learning. The approach of using the conditional independence set with the largest p-value in the orientation process has also been used in Ramsey (2016). Sections 2.5 and 2.6 both use the largest p-value among the conditional independence tests, but for different purposes: Section 2.5 concerns skeleton learning, while Section 2.6 concerns orientation, and Ramsey (2016) only considered the orientation phase, not skeleton learning. To make the graph learned by the proposed algorithm more interpretable, we calculate p-values for the significance of the (undirectional) connections of all edges in the graph learned by Algorithm 3. More specifically, for an edge X − Y between the covariates X and Y, we define P(X, Y) to measure the significance of X − Y as the largest p-value among the conditional independence tests of X and Y given subsets of N_X \ {Y} and of N_Y \ {X}, (21) where N_X and N_Y are the neighbors of X and Y, respectively.
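The two recursive orientation rules at the end of Algorithm 3 can be sketched as a fixed-point iteration. This is a simplified stand-alone rendering under our own edge representation, not the paper's implementation.

```python
def apply_orientation_rules(directed, undirected):
    """Apply the two recursive orientation rules until no change.

    `directed` is a set of (parent, child) pairs; `undirected` is a set
    of 2-element edges. Rule 1: orient X - Y as X -> Y if a directed
    path from X to Y already exists. Rule 2: if X and Y are not
    adjacent, X -> Z is directed, and Z - Y is undirected, orient
    Z - Y as Z -> Y.
    """
    directed = set(directed)
    undirected = {frozenset(e) for e in undirected}

    def has_dipath(a, b):
        seen, stack = set(), [a]
        while stack:
            v = stack.pop()
            if v == b:
                return True
            if v in seen:
                continue
            seen.add(v)
            stack.extend(c for (p, c) in directed if p == v)
        return False

    def adjacent(a, b):
        return (frozenset({a, b}) in undirected
                or (a, b) in directed or (b, a) in directed)

    changed = True
    while changed:
        changed = False
        for e in list(undirected):
            x, y = tuple(e)
            for a, b in ((x, y), (y, x)):
                if has_dipath(a, b):                      # rule 1
                    undirected.discard(e); directed.add((a, b))
                    changed = True
                    break
                if any(w != b and not adjacent(w, b)      # rule 2
                       for (w, z) in directed if z == a):
                    undirected.discard(e); directed.add((a, b))
                    changed = True
                    break
    return directed, undirected
```

For example, given the v-structure output A → C with C − D undirected and A, D non-adjacent, rule 2 orients C → D (orienting D → C would create a new v-structure that was not detected); given A → B → C with A − C undirected, rule 1 orients A → C to avoid a directed cycle.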
Note that the measure P(X, Y) is undirectional, i.e., the orientation of the edge X − Y has no effect on P(X, Y) by its definition, and P(X, Y) carries no information on the edge orientation. In the next section, we compare the results obtained from the proposed algorithm with those of the existing algorithms in Section 3.1. We further discuss the potential Tri90 pathways discovered by the proposed algorithm in detail in Section 3.2. We also apply the following existing structural learning algorithms to each dataset: the PC-stable algorithm proposed by Colombo and Maathuis (2014), which is a stable/order-independent variant of the original PC (initials of the first names) algorithm proposed by Spirtes and Glymour (1991); the MMPC (Max-Min Parents and Children) algorithm proposed by Tsamardinos et al. (2003b); the IAMB (Incremental Association Markov Blanket) algorithm proposed by Tsamardinos et al. (2003a); and the GS (Grow-Shrink) algorithm proposed by Margaritis (2003). Notice that we use the PC-stable algorithm instead of the PC algorithm since the PC-stable algorithm is order-independent. Order independence means that a random reordering of the variables does not affect the graphical learning result, which enhances the stability of the algorithm and also makes the results more interpretable. We use the existing graphical learning algorithms implemented in the R package bnlearn (Scutari, 2010). We first summarize the structural learning results by the different graphical learning methods. The results in Table 2 show that the proposed algorithm makes new discoveries regarding the Tri90 goals. More specifically, Table 2 shows that the number of edges (NE) and the number of directed edges (NDE) are both much smaller than the number of vertices (NV) in the graphs learned by the existing algorithms, including the PC-stable, the MMPC, the IAMB, and the GS algorithms.
These existing algorithms lean toward fractured graphs and do not provide much conditional independence information for the orientation. On the contrary, the proposed algorithm produces larger numbers of edges and well-connected graphs and has more potential to infer the directions of edges. Remark: It is important to note that the goal of causal graphical algorithms is not to produce as many edges and directed edges as possible. Later in this section, we use the Bayesian Information Criterion (BIC) to show that the discoveries made by the proposed algorithm provide useful information about the MPHIA data. Furthermore, in Section 3.2, we discuss the discovered pathways in detail; they are reasonable, are confirmed in other HIV Tri90 literature, and can provide useful insight for the three Tri90 goals. Let d be the distance (defined as the length of the shortest path regardless of direction) from a particular 90-90-90 goal (awareness of HIV, ART, or VLS) to a covariate, and let N(d ≤ k) be the number of covariates whose distances to a 90-90-90 goal are smaller than or equal to k. Very few covariates are close to the 90-90-90 goals in the graphs learned by the existing algorithms, with only two covariates whose distance to the 90-90-90 goals is smaller than or equal to three. On the contrary, the proposed algorithm discovers many covariates that are of a small distance to the 90-90-90 goals, including several direct neighbors. The Bayesian information criterion (BIC; Schwarz, 1978) is a classical statistical tool for model selection. We compare our proposed graphical learning algorithm with the aforementioned classical PC-stable, MMPC, IAMB, and GS algorithms using the BIC criterion and summarize the results in Table 3. Let D_k, k = 1, 2, · · · , 6, be the data sets corresponding to the three 90-90-90 goals for each gender, respectively.
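Returning to the distance measure d defined above: it and the counts N(d ≤ k) are plain breadth-first-search quantities on the undirected skeleton of the learned graph. A minimal sketch (names are illustrative):

```python
from collections import deque

def covariates_within(edges, goal, k):
    """Return the covariates within undirected distance k of a goal vertex.

    `edges` is an iterable of (u, v) pairs; edge direction is ignored,
    matching the definition of d as the shortest path regardless of
    direction. Returns the set of vertices with 0 < d <= k, so that
    N(d <= k) is just its size.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    dist = {goal: 0}
    queue = deque([goal])
    while queue:
        v = queue.popleft()
        if dist[v] == k:
            continue          # do not expand beyond radius k
        for w in adj.get(v, ()):
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return {v for v, d in dist.items() if 0 < d <= k}
```

For a path graph goal − A − B − C − D, the covariates within distance 2 are {A, B}; direct neighbors are recovered with k = 1.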
For k = 1, 2, · · · , 6, we use each of the graphical learning algorithms A_i, i = 1, 2, · · · , 5, to learn a DAG G_{i,k} on D_k, where A_i, i = 1, 2, · · · , 5, stand for the PC-stable, MMPC, IAMB, GS, and the proposed algorithm, respectively. The log-likelihood of a DAG G can be decomposed as follows: the sum over j = 1, · · · , p and i = 1, · · · , n of log P(x_{ij} | pa_G(X_j), θ_j), (22) where the θ_j's are the parameters of the model, pa_G(X_j) denotes the parents of X_j in G, n is the sample size, and p is the number of covariates. Note that all the Log-likelihood* values in Table 3 are positive since they are the differences between the log-likelihood and the log-likelihood of the null model (the model with intercept only). Furthermore, in Table 3, the BIC score is calculated as −2 Log-likelihood* + DF log(n), where n is the sample size. Hence lower BIC scores correspond to better models. Our proposed algorithm learns a much larger number of edges in all six graphs compared with the existing algorithms in Table 2, and thus it has much larger degrees of freedom for the log-likelihood defined in Equation (22). Table 3 shows that the proposed algorithm has the largest degrees of freedom and the largest log-likelihood. The much larger log-likelihoods imply that the proposed algorithm discovers a lot more useful information in MPHIA, which is further confirmed by the best (smallest) BIC scores in Table 3. We find that males whose partners are older are more likely to be aware of their HIV status, both marginally and conditioning on their age groups. Comparing the male awareness pathway and the female awareness pathway, we suspect that older females contribute more to HIV awareness.
Figure 1: 90-90-90 Awareness graph in females. Vertices representing the Tri90 goals are the biggest and marked in orange; vertices closer to goals have bigger sizes and darker colors than those farther away from goals. Widths of edges reflect the significance of the non-directional connection (conditional dependence) between vertices. Red and blue edges represent positive and negative relationships with Tri90 goals, respectively.
more to HIV awareness because someone's age group is associated with his/her partner's age. (The codebook can be found in Supplement S.6.2.)

In Figure S.1, the proposed algorithm finds PartnerNumber12Mo (sexual partners in the last 12 months) and WifeNumLiveElsewhere (number of wives/partners who live elsewhere) to be important covariates for male HIV awareness. These discoveries are supported by Peltzer et al. (2009). We also find that ViolenceOK and AlcoholFrequency may be possible "reasons" for HIV awareness in males, where ViolenceOK is a score measuring whether the individual believes it is right for a man to beat his wife/partner under various scenarios. Further investigation shows that among males who are HIV positive: (1) those with a higher violence score are more likely to be unaware of their HIV status; (2) those who never drink are most likely to be aware of their HIV status, while those with a high frequency of alcohol use (more than four drinks a week) are least likely to be aware of their HIV-positive status, and this effect is more pronounced in males who have not worked in the last 12 months. Alcohol use is an important factor associated with sexual risk behavior (Kalichman et al., 2007; Hahn et al., 2011), and the violence score can also be an important factor for sexual behavior. These factors deserve more attention when advocating HIV awareness. The proposed causal structural learning algorithm also finds that WorkLast12Mo has a small distance to HIV awareness, which is supported by Peltzer et al. (2009). PLWHSupportGroup is a result of HIV awareness because males unaware of their status never answer the PLWHSupportGroup question.

Next, we discuss the ART pathways for females and males in one subsection because they are relatively simpler than the HIV awareness pathways. For the same reason, we discuss the VLS pathways for females and males in the same subsection. 2015); Sullivan et al.
(2015) showed that violence and other kinds of abuse are significantly associated with ART initiation. We find that females with smaller violence scores are more likely to be on ART treatment. We find that wealthier males are more likely to initiate ART treatment, which is in agreement with Gebru et al. (2018). On the other hand, males in the "poor" wealth category are likely to have had no partners in the last 12 months and are more likely to have wives or partners who live elsewhere. The proposed algorithm also discovers that females with a known status of SyphilisTestInPreg are more likely to be on ART treatment. The learned female ART pathway reveals that those who received ART treatment had easy access to HIV care (short travel time) and easy access to antenatal care, where they could be offered the syphilis test during pregnancy. Both were indicators of receiving good health care services, which increased the chance of receiving ART treatment. Finally, we find AbnormPenisDischarge to be a potential "reason" for male ART initiation. Further investigation shows that, among HIV-positive males who are aware of their HIV status, those with a missing response on abnormal penis discharge problems are more likely to be on ART treatment. Unfortunately, it is unclear why people did not respond to the AbnormPenisDischarge question.

We find that females who attend support groups more frequently are more likely to have their viral load suppressed. This is supported by Roberts (2000), Tomori et al. (2014), and Rangarajan et al. (2016). Friedman et al. (2015), Hatcher et al. (2015), and Sullivan et al. (2015) showed that violence and other kinds of abuse are significantly associated with viral load suppression (VLS), especially among females. This supports the connection between VLS and ForceSexTimes found in the female VLS pathway. We also find SeekMedicalHelp and WifeNum to be potential "reasons" for VLS among males who are on ART.
Males who seek help from doctors or nurses because of health issues such as abnormal penis discharge and painful urination are more likely to have their viral load unsuppressed, and males with more wives are more likely to have their viral load suppressed.

In Section 3, we saw that the proposed causal structural learning algorithm discovers many more new edges than the existing ones, and we validated the results of the proposed algorithm through the Bayesian information criterion (BIC). However, since the true causal graph for the MPHIA data is unknown, we cannot validate the edge discoveries directly. To get more insight into the Type I and Type II error rates and their trade-off for the proposed algorithm against the existing ones, in this section we carry out simulation studies with settings mimicking the MPHIA data. We check the true positive and true negative rates of the proposed algorithm against the existing algorithms on synthetic data sets that mimic the MPHIA data. More specifically, we use the DAG learned from the MPHIA data as the truth to generate the simulation data. Here we choose the graphs learned by the proposed algorithm as the truth for simulation purposes, since Table 3 shows that the graphs learned by the proposed algorithm fit the MPHIA data better under the BIC criterion than those learned by the other algorithms. That is to say, let G_k be the DAG learned by the proposed algorithm on the 90-90-90 MPHIA data set D_k for k = 1, 2, ..., 6. We then fit the data distribution P_k based on G_k on the data D_k, and randomly generate simulated data sets from the distribution P_k with sample size n. The distribution estimation and random data set generation use the bn.fit (Bayesian network fitting) and rbn (random Bayesian network) functions in the R package bnlearn (Scutari, 2010). We then run the proposed algorithm together with the aforementioned PC-stable, GS, MMPC, and IAMB algorithms on the generated data sets.
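The fit-then-resample step above is done with bnlearn's bn.fit and rbn in R. A minimal Python sketch of a linear-Gaussian analogue is given below; the function names and the toy chain DAG are our own illustration, not part of the paper's pipeline, and a real MPHIA analysis would fit conditional distributions appropriate to categorical survey variables.

```python
import numpy as np

def fit_gaussian_dag(X, parents):
    """Fit each node on its parents by least squares; return (coef, intercept, sd)."""
    n, _ = X.shape
    params = {}
    for j, pa in parents.items():
        A = np.column_stack([X[:, pa], np.ones(n)])       # design: parents + intercept
        beta, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
        resid = X[:, j] - A @ beta
        params[j] = (beta[:-1], beta[-1], resid.std())
    return params

def sample_dag(params, parents, order, n, rng):
    """Draw n samples from the fitted network, visiting nodes in topological order."""
    X = np.zeros((n, len(parents)))
    for j in order:
        coef, intercept, sd = params[j]
        X[:, j] = X[:, parents[j]] @ coef + intercept + rng.normal(0.0, sd, n)
    return X

rng = np.random.default_rng(1)
# toy chain DAG 0 -> 1 -> 2, mimicking "fit P_k on D_k, then resample"
parents = {0: [], 1: [0], 2: [1]}
X = rng.normal(size=(500, 3))
params = fit_gaussian_dag(X, parents)
Xsim = sample_dag(params, parents, order=[0, 1, 2], n=1000, rng=rng)
print(Xsim.shape)
```

This is the parametric-bootstrap idea behind the simulation design: the learned graph plus fitted conditionals define P_k, and fresh data sets are drawn from it.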
Finally, we calculate the true positive and true negative rates of edges, disregarding orientation, for each algorithm. We set the sample size n = 250, 500, 1000 to match the scale of our real data and to check the performance of the proposed algorithm across different sample sizes. We repeat the Monte Carlo simulation 500 times for each setting and summarize the results in Table 5. The left and right panels of Table 5 summarize the empirical true positive and true negative rates, respectively, of the proposed algorithm and the existing algorithms. From the right panel of Table 5, we can see that the proposed algorithm has true negative rates similar to those of the existing algorithms. Furthermore, from the left panel of Table 5, we can see that the proposed algorithm has better true positive rates than the existing algorithms. To save space, additional simulation studies are presented in Supplement S.5. In sum, the proposed algorithm improves true positive rates over the existing classical algorithms while maintaining comparable true negative rates. This shows that our proposed algorithm has great potential to discover important features and potential causal pathways, especially in a domain with many categorical variables.

Carrying out the causal graphical analysis on the MPHIA data set, we obtain interesting results on important covariates and possible causal pathways related to the UNAIDS 90-90-90 goals. For example, the proposed algorithm discovers age and condom usage to be important covariates for female HIV awareness and the number of sexual partners to be important for male HIV awareness, which agrees with the literature, such as Peltzer et al. (2009). The proposed algorithm also discovers similarities as well as differences between female and male pathways. For example, travel time is discovered to be an important covariate for both female and male ART.
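The skeleton-level scoring used in the simulations above (true positive and true negative rates of edges, disregarding orientation) amounts to comparing unordered vertex pairs. A small self-contained sketch with a toy graph of our own:

```python
from itertools import combinations

def edge_rates(true_edges, learned_edges, nodes):
    """True positive / true negative rates of skeleton (undirected) edges."""
    t = {frozenset(e) for e in true_edges}
    l = {frozenset(e) for e in learned_edges}
    all_pairs = {frozenset(p) for p in combinations(nodes, 2)}
    tp = len(t & l)                          # true edges recovered (any direction)
    tn = len((all_pairs - t) - l)            # true non-edges left out
    tpr = tp / len(t) if t else 1.0
    tnr = tn / len(all_pairs - t) if all_pairs - t else 1.0
    return tpr, tnr

nodes = ["A", "B", "C", "D"]
true_edges = [("A", "B"), ("B", "C")]
learned = [("B", "A"), ("C", "D")]           # A-B recovered (direction ignored)
tpr, tnr = edge_rates(true_edges, learned, nodes)
print(tpr, tnr)  # 0.5 0.75
```

Because sparse graphs have far more non-edges than edges, true negative rates tend to be uniformly high, which is why the true positive rate is the more discriminating comparison.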
However, there are also different important covariates for female and male ART, such as attitude towards violence for female ART and partner numbers for male ART. It is also important to pay attention to the assumptions of the proposed causal structural learning algorithm when we interpret the results. One important assumption behind the proposed algorithm is causal sufficiency, which is critical for our algorithm and many other constraint-based causal structural learning algorithms. Although the MPHIA survey provides many related covariates, the causal sufficiency assumption may still not hold perfectly. In this paper, we stratify the MPHIA dataset by sex and learn the graphical models for each sex and each Tri90 goal. It is also possible to learn the DAG with data further stratified by other covariates such as age, but the sample size would then be too small for each stratum to make reliable inferences. The discoveries on causality are important extensions of the existing literature, where only correlation, rather than causation, is established. More studies can be carried out to further validate the potential causal discoveries, and other statistical inference tools, such as mediation analysis, can be applied to deepen the understanding of the causal relationships. The discoveries on causality can help develop better HIV response strategies and related policies.

We first present the proofs of Propositions 1 and 2, and then present some additional details for the six Tri90 datasets, some additional numerical results for Section 3, and parts of the MPHIA codebook.

It is easy to get (7), (8), and (9) directly from the definition of C_X(S). Furthermore, notice that (10) can be easily derived from (9). Hence the only thing that still needs to be proved is (11). Suppose S = {S_1, ..., S_n}, n ≥ 1, and S satisfies (2). For any C ∈ C_X(S), we have C ∈ C*_X(S) ∩ L_X(S), and we also have C ⊥ X | S and S_i ⊥ X | (S_{-i} ∪ {C}), i = 1, ..., n, by the definition of C_X(S).
Hence we have C_X(S) ⊆ {C ∈ C*_X(S) ∩ L_X(S) : C ⊥ X | S, S_i ⊥ X | (S_{-i} ∪ {C}), i = 1, ..., n}. Furthermore, we want to prove that C_X(S) = {C ∈ C*_X(S) ∩ L_X(S) : C ⊥ X | S, S_i ⊥ X | (S_{-i} ∪ {C}), i = 1, ..., n} by contradiction. If the two sets are not equal, then there exists C_0 ∈ C*_X(S) ∩ L_X(S) such that C_0 ⊥ X | S, S_i ⊥ X | (S_{-i} ∪ {C_0}), i = 1, ..., n, and C_0 ∉ C_X(S). From the definition of C_X(S) and C_0 ∈ L_X(S), the only way for C_0 not to be in C_X(S) is for {C_0} ∪ S to violate (2). So there exists N_0 ∈ ({C_0} ∪ S). (a) If N_0 = C_0, then S_0 = S and C_0 ⊥ X | S, which is in contradiction with the conditions on C_0 above. From the construction of N_0 and S_0, we know ({N_0} ∪ S_0) does not satisfy (2); furthermore, any set with ({N_0} ∪ S_0) as a subset does not satisfy (2). So {C_0} ∪ S_{-i_0} or S does not satisfy (2), which is in contradiction with C_0 ∈ C_X(S_{-i_0}) and S satisfying (2). In sum, we finish the proof of (11) and Proposition 1.

It is easy to get (15) and (16) directly from the definition of S_N(X, Y). Furthermore, notice that (17) can be easily derived from (16). Hence the only thing that still needs to be proved is (18). In sum, we have (18), and hence we finish the proof of Proposition 2.

Table S.1: Number of important covariates for the 90-90-90 goals discovered by different graphical learning methods. Aware, ART, and VLS stand for the three 90-90-90 targets of HIV awareness, ART treatment, and viral load suppression, respectively. d is the distance from a particular 90-90-90 goal (awareness of HIV, ART, or VLS) to a covariate, and N(d ≤ k), k = 1, 2, 3, is the number of covariates whose distances to a 90-90-90 goal are smaller than or equal to k.

In this simulation study, we use a simulation setting similar to Section 4 to check the performance of the proposed algorithm with different values of M_CI. More specifically, we use the DAGs learned by the proposed algorithm as the truth to generate the simulation data.
That is to say, let G_k be the DAG learned by the proposed algorithm on the 90-90-90 MPHIA data set D_k for k = 1, 2, ..., 6. We then fit the data distribution P_k based on G_k on the data D_k, and randomly generate M simulated data sets D_k = (D_{k,1}, D_{k,2}, ..., D_{k,M}) from the distribution P_k with sample size n. Here we set n = 500 for a sample size similar to that of the MPHIA datasets. We then apply the proposed algorithm with M_CI = 2, 3, 4, 5, ∞ to the generated data sets. The whole simulation is repeated 500 times, and we summarize the empirical averages of the true positive and true negative rates of edges, disregarding orientation, in Table S.2. The left and right panels of Table S.2 summarize the empirical true positive and true negative rates, respectively, of the proposed algorithm with different values of M_CI. From Table S.2, we can see that there is no significant difference among the true positive and true negative rates of the proposed algorithm with M_CI = 2, 3, 4, 5, ∞, which shows that the proposed algorithm is quite robust to the choice of M_CI. As discussed in other causal structural learning literature, such as Yan and Zhou (2020), we recommend using M_CI = 3 for sparse or moderately sparse graphs. Notice that the appropriate value of M_CI depends on the sample size, the types of covariates, the properties of the true graph (e.g., its degree), etc., as discussed in the remark for Algorithm 1; it is also possible to try different values of M_CI and to use cross-validation, BIC scores, or simulation studies for a more principled choice of M_CI.

In this simulation study, we use a simulation setting similar to Section 4. We use the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC) to take a closer look at the performance of the different structural learning algorithms.
More specifically, we use the DAGs learned by the proposed algorithm as the truth to generate the simulation data, since the BIC criterion shows that the graphs learned by the proposed algorithm fit the MPHIA data better than those learned by the other algorithms. That is to say, let G_k be the DAG learned by the proposed algorithm on the 90-90-90 MPHIA data set D_k for k = 1, 2, ..., 6. We then fit the data distribution P_k based on G_k on the data D_k, and randomly generate M simulated data sets D_k = (D_{k,1}, D_{k,2}, ..., D_{k,M}) from the distribution P_k with the same sample size as the original data set D_k. Applying graphical learning algorithm A_i, i = 1, ..., 5, to the simulated data sets D_k, we obtain M DAGs (G_{k,i,1}, G_{k,i,2}, ..., G_{k,i,M}). We set M = 500, so the whole simulation is repeated 500 times. Here we use the ROC curve and the AUC to measure edge discovery in the 90-90-90 graphs and to better understand the Type I and Type II errors and their trade-off for the different structural learning algorithms. To get the ROC curve, from the M DAGs (G_{k,i,1}, G_{k,i,2}, ..., G_{k,i,M}) learned in the Monte Carlo simulations, we first calculate an average graph Ḡ_{k,i}. Ḡ_{k,i} is an undirected graph with weighted edges, where the weight of an edge X − Y is the empirical frequency of the existence of the edge X − Y in (G_{k,i,1}, G_{k,i,2}, ..., G_{k,i,M}), disregarding the direction of edges. Hence Ḡ_{k,i} reflects the "confidence" in edges for algorithm A_i. Then, for each cut-off value λ, we obtain an undirected graph G_{k,i,λ} by keeping all edges in Ḡ_{k,i} with weights greater than or equal to λ, calculate the true positive rate (TPR) and false positive rate (FPR) of edges, and obtain the AUC score and the ROC curve.
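The λ-thresholding of the average graph described above can be sketched as follows; the edge frequencies and the tiny graph are invented toy numbers, not MPHIA results.

```python
from itertools import combinations

def roc_points(freqs, true_edges, nodes):
    """ROC points from the weighted average graph: each edge weight is its
    empirical frequency across replicates; for every cut-off lambda, keep
    edges with weight >= lambda and score against the true skeleton."""
    pairs = [frozenset(p) for p in combinations(nodes, 2)]
    truth = {frozenset(e) for e in true_edges}
    pts = [(0.0, 0.0)]
    for lam in sorted({freqs.get(p, 0.0) for p in pairs}, reverse=True):
        kept = {p for p in pairs if freqs.get(p, 0.0) >= lam}
        fpr = len(kept - truth) / (len(pairs) - len(truth))
        tpr = len(kept & truth) / len(truth)
        pts.append((fpr, tpr))
    return sorted(pts)

def auc(pts):
    """Trapezoidal area under the ROC points."""
    pts = sorted(pts)
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

nodes = ["A", "B", "C"]
truth = [("A", "B")]
freqs = {frozenset({"A", "B"}): 0.9, frozenset({"A", "C"}): 0.2}
pts = roc_points(freqs, truth, nodes)
print(pts, auc(pts))
```

Sweeping λ from 1 down to 0 traces the curve from the empty graph to the complete graph, so algorithms that assign nonzero frequency to weak-but-true edges gain AUC.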
The AUC scores and the ROC curves for edge discovery in the 90-90-90 graphs of the different structural learning algorithms, calculated from 500 Monte Carlo simulations, are summarized in Table S.3 and Figures S.6 and S.7, respectively. In Table S.3, we can see that the proposed algorithm achieves better (larger) AUC than the other structural learning algorithms across all three 90-90-90 goals and both genders. To understand why the proposed method achieves a better AUC, consider Figures S.6 and S.7. In Figure S.6, we can see that when the false positive rate (FPR) is extremely small, the proposed algorithm has a true positive rate (TPR) similar to the other algorithms, while for larger FPR, the proposed algorithm has a better TPR than the other algorithms. The details of the ROC curves at small FPR can be seen more clearly in Figure S.7: the proposed algorithm achieves a better or comparable TPR for FPR ≥ 0.01 and a much better TPR for FPR ≥ 0.02 across all 90-90-90 goals and genders. The similar TPR of all algorithms at extremely small FPR illustrates that all algorithms perform similarly in discovering the most important relationships in the data. Furthermore, the better TPR of the proposed algorithm at larger FPR shows that while the existing structural learning algorithms cannot discover weaker signals beyond a cut-point, the proposed algorithm is better at picking up relatively weak signals, which is the reason for its better AUC in Table S.3.

In this simulation study, we check the empirical performance of the proposed algorithm on synthetic data sets with continuous variables and different levels of "sparsity" and signal strengths of edges. Let K be the number of vertices and ρ ∈ (0, 1) be the parameter that controls the level of "sparsity" of edges. We generate the simulation data randomly using the following procedure: 1. We first generate a DAG G*.
Generate K(K − 1)/2 random variables E*_{i,j} i.i.d. from Bernoulli(ρ), 1 ≤ i < j ≤ K. For vertices i and j, the edge i → j exists in G* if and only if E*_{i,j} = 1. Further generate K(K − 1)/2 random variables S*_{i,j} i.i.d. from Normal(0, 1), 1 ≤ i < j ≤ K. 2. We then generate a dataset D* of size n according to the DAG G*. For j = 1, ..., K, generate x*_j recursively from the following linear regression models:

x*_j = θ sum_{i<j} E*_{i,j} S*_{i,j} x*_i + ε*_j,  j = 1, ..., K,

where θ controls the strengths of the signals and the ε*_j i.i.d. follow the standard normal distribution. We repeat this step n times to generate an n by K data set D*. 3. After generating the data set D*, we permute the order of the variables to obtain an n by K data set D and the corresponding DAG G. We then run the proposed algorithm together with the aforementioned PC-stable, GS, MMPC, and IAMB algorithms on the n by K data set D, and calculate the true positive and true negative rates of edges, disregarding orientation, for each algorithm. Here we set K = 100 and n = 500 for a number of covariates and sample size similar to our real data, and set ρ = (0.01, 0.02, 0.04) for different levels of "sparsity" of the graph.

The left and right panels of Table S.4 summarize the empirical true positive and true negative rates of the proposed algorithm and the existing algorithms, respectively. From the right panel of Table S.4, we can see that the proposed algorithm has true negative rates similar to the existing algorithms. Furthermore, from the left panel of Table S.4, we can see that the proposed algorithm has better true positive rates than the existing algorithms. In sum, we reach conclusions similar to those of Section 4. Comparing the simulation results in Table S.4 with those in Table 5, notice that the simulation settings in Section 4 are more challenging than those in this section in terms of the true positive rate.
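The generation procedure for the continuous-variable study (Bernoulli(ρ) edge indicators, Normal(0, 1) signs/strengths, recursive linear models, then a variable permutation) can be sketched as below; the function name and return format are our own.

```python
import numpy as np

def simulate_linear_dag(K, rho, theta, n, rng):
    """Upper-triangular DAG with edge probability rho; data generated from
    x_j = theta * sum_{i<j} E_ij * S_ij * x_i + eps_j."""
    E = rng.binomial(1, rho, size=(K, K))        # edge indicators E*_{i,j}
    S = rng.standard_normal((K, K))              # signal strengths S*_{i,j}
    W = theta * np.triu(E * S, k=1)              # weighted adjacency, i -> j only for i < j
    X = np.zeros((n, K))
    for j in range(K):                           # 0..K-1 is a topological order
        X[:, j] = X @ W[:, j] + rng.standard_normal(n)
    perm = rng.permutation(K)                    # hide the topological ordering
    return X[:, perm], (W != 0)[np.ix_(perm, perm)]

rng = np.random.default_rng(0)
X, adj = simulate_linear_dag(K=100, rho=0.02, theta=1.0, n=500, rng=rng)
print(X.shape, adj.sum())  # data matrix and number of true edges
```

The final permutation matters: without it, algorithms that scan variables in order could exploit the topological ordering of the generated columns.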
The gap in improvement arises because there are many categorical variables in the simulation in Section 4, while there are only continuous ones in the simulation in this section. Since conditioning on categorical variables takes more degrees of freedom away from the conditional independence tests than conditioning on continuous ones, the categorical variables in the Section 4 simulation can lead to more Type II errors and more contradictory/inconsistent test results than the simulation in this section. Hence, the improvement in the true positive rates of the proposed algorithm over the existing ones in Table 5 is larger than the improvement in Table S.4. In sum, the proposed algorithm is more beneficial in terms of the true positive rate in the presence of categorical variables.

References
Graphical models, regression graphs, and recursive linear regression in a unified way
Improving the reliability of causal discovery from small data sets using argumentation
Order-independent constraint-based causal structure learning
Journal of the Royal Statistical Society, Series B (Methodological)
Awareness of HIV status, prevention knowledge and condom use among people living with HIV in Mozambique
Effects of syndemics on HIV viral load and medication adherence in the multicentre AIDS cohort study
Botswana's progress toward achieving the 2020 UNAIDS 90-90-90 antiretroviral therapy and virological suppression goals: a population-based survey
Perceived behavioural predictors of late initiation to HIV/AIDS care in Gurage zone public health facilities: a cohort study using health belief model
On the logic of causal models
d-separation: from theorems to algorithms
Sweden, the first country to achieve the Joint United Nations Programme on HIV/AIDS (UNAIDS)/World Health Organization (WHO) 90-90-90 continuum of HIV care targets
Adding fuel to the fire: alcohol's effect on the HIV epidemic in sub-Saharan Africa
Intimate partner violence and engagement in HIV care and
treatment among women
A systematic review of individual and contextual factors affecting ART initiation, adherence, and retention for HIV-infected pregnant and postpartum women
Joint United Nations Programme on HIV/AIDS. 90-90-90: an ambitious treatment target to help end the AIDS epidemic
Joint United Nations Programme on HIV/AIDS. 90-90-90: on the right track towards the global target
Joint United Nations Programme on HIV/AIDS. Ending AIDS: progress towards the 90-90-90 targets. Global AIDS Update
Joint United Nations Programme on HIV/AIDS. 90-90-90: good progress, but the world is off-track for hitting the 2020 targets
Joint United Nations Programme on HIV/AIDS. Seizing the moment: tackling entrenched inequalities to end epidemics
Alcohol use and sexual risks for HIV/AIDS in sub-Saharan Africa: systematic review of empirical findings
Barriers and facilitators to linkage to care and ART initiation in the setting of high ART coverage in Botswana
Learning Bayesian network model structure from data
A tale of two countries: progress towards UNAIDS 90-90-90 targets in Botswana and Australia
Malawi Population-based HIV Impact Assessment (MPHIA) 2015-16: first report
Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference
Determinants of knowledge of HIV status in South Africa: results from a population-based HIV survey
Improving accuracy and scalability of the PC algorithm by maximizing p-value
Factors associated with HIV viral load suppression on antiretroviral therapy in Vietnam
Barriers to and facilitators of HIV-positive patients' adherence to antiretroviral treatment regimens
Estimating the dimension of a model
Learning Bayesian networks with the bnlearn R package
Introduction to causal inference
An algorithm for fast recovery of sparse causal graphs
Substance abuse, violence, and HIV/AIDS (SAVA) syndemic effects on viral suppression among HIV positive women of color
Barriers and facilitators of retention in HIV care and treatment services
in Iringa, Tanzania: the importance of socioeconomic and sociocultural factors
Algorithms for large scale Markov blanket discovery
Time and sample efficient discovery of Markov blankets and direct causal relations
The max-min hill-climbing Bayesian network structure learning algorithm
Searching for recursive causal structures in multivariate quantitative genetics mixed models
Effective and scalable causal partitioning based on low-order conditional independent tests

AbnormPenisDischarge: During the last 12 months, have you had an abnormal discharge from your penis?
AgeGroup: Age groups for population pyramid
AlcoholFrequency: How often do you have a drink containing alcohol?
EasyGetCondom: If you wanted a condom, would it be easy for you to get one?
Education: Level of school respondent ever attended
ForceSexTimes: How many times in your life have you been physically forced to have sex?
PartnerAge: How old is your partner?
PLWHSupportGroup: Have you ever attended a support group for people living with HIV?
PregNum: How many times have you been pregnant including a current pregnancy?
SeekMedicalHelp: Did you see a doctor, clinical officer or nurse because of these problems?
SupportGroupTimes12Mo: In the last 12 months
SyphilisTestInPreg: When you were pregnant, were you offered a test for syphilis?
TranslatorUsed: whether a translator is used or not
TravelTime: At your last HIV care visit, approximately how long did it take you to travel from your home (or workplace) one way?
ViolenceOK?: Do you believe it is right for a man to hit or beat his wife/partner?
WifeNum: Altogether, how many wives or partners do you have?
WifeNumLiveElsewhere: How many wives/partners do you have who live elsewhere?
WifeNumOfHusband: Including yourself, in total, how many wives or live-in partners does your husband or partner have?

AgeGroup: Age groups for population pyramid
ChildNumSince2012: How many children have you given birth to since 2012?
CircumcisedHIVRisk: Relationship of circumcision and risk of HIV?
EasyGetCondom: If you wanted a condom, would it be easy for you to get one?
Education: Level of school respondent ever attended
EthnicGroup: What is your ethnic group?
MarrigeStatus: What is your marital status now: are you married
PLWHSupportGroup: Have you ever attended a support group for people living with HIV?
PregNum: How many times have you been pregnant including a current pregnancy?
SellSexEver: Have you ever sold sex for money?
SyphilisTestInPreg: When you were pregnant, were you offered a test for syphilis?
TravelTime: At your last HIV care visit, approximately how long did it take you to travel from your home (or workplace) one way?
Urban: Urban Area Indicator
ViolenceOK?: Do you believe it is right for a man to hit or beat his wife/partner?
WorkLast12Mo: Have you done any work in the last 12 months for which you received a paycheck

AdditionalPartner: Do you have additional spouse(s)/partner(s) that live with you?
AgeGroup: Age groups for population pyramid
AlcoholFrequency: How often do you have a drink containing alcohol?
BuySexEver: Have you ever paid money for sex?
CircumcisedStatus: Are you circumcised or planning to get circumcised?
EasyGetCondom: If you wanted a condom, would it be easy for you to get one?
PartnerAge: How old is your partner?
RelationToHeadOfHouse: What is your relationship to the head of the household?
TravelTime: At your last HIV care visit, approximately how long did it take you to travel from your home (or workplace) one way?
ViolenceOK?: Do you believe it is right for a man to hit or beat his wife/partner?
WantMoreChild: Would you like to have a/another child?
WealthQuintile: Wealth quintile
WifeNumLiveElsewhere: How many wives/partners do you have who live elsewhere?
WomenCondomHaveSexALot?: Do you believe women who carry condoms have sex with a lot of men?
AntenatalCareLastPreg: Flag if mother who gave birth 3 years preceding survey received antenatal care during last pregnancy
EthnicGroup: What is your ethnic group?
Mother's current and past breast feeding status
PLWHSupportGroup: Have you ever attended a support group for people living with HIV?
PregCurrentStatus: Are you pregnant now?
Urban: Urban Area Indicator
ViolenceOK?: Do you believe it is right for a man to hit or beat his wife/partner?
WifeNumOfHusband: Including yourself, in total, how many wives or live-in partners does your husband or partner have?

AbnormPenisDischarge: During the last 12 months, have you had an abnormal discharge from your penis?
AnalSexEver: Have you ever had anal sex?
CircumcisedStatus: Are you circumcised or planning to get circumcised?
CondomLastPaidSex: Flag if condom was used at last paid sexual intercourse
FirstSexForced: The first time you had vaginal or anal sex
PartnerNumber12Mo: Number of people they had sex with in the last 12 months
PLWHSupportGroup: Have you ever attended a support group for people living with HIV?
RelationToLastSexPartner: Relationship status with their last sex partner in the past 12 months
SeekMedicalHelp: Did you see a doctor, clinical officer or nurse because of these problems?
TravelTime: At your last HIV care visit, approximately how long did it take you to travel from your home (or workplace) one way?
WantMoreChild: Would you like to have a/another child?
WifeNumLiveElsewhere: How many wives/partners do you have who live elsewhere?

AlcoholFrequency: How often do you have a drink containing alcohol?
CondomLastSex: Indicator for used condom at last sexual encounter in the past 12 months
EverWidowed: Have you ever been widowed? That is, did a spouse ever die while you were still married or living with them?
ForceSexTimes: How many times in your life have you been physically forced to have sex?
RelationshipToViolence: Relationship between you and the person who gave physical violence to you
TranslatorUsed: whether a translator is used or not
VistDoctorLast12Mo: Have you seen a doctor, clinical officer or nurse in a health facility in the last 12 months?

AbnormPenisDischarge: During the last 12 months, have you had an abnormal discharge from your penis?
AdditionalPartner: Do you have additional spouse(s)/partner(s) that live with you?
SeekMedicalHelp: Did you see a doctor, clinical officer or nurse because of these problems?
the last 12 months, did a doctor
VerySick3MoInLast12Mo: Has name been very sick for at least 3 months during the past 12 months, that is name was too sick to work or do normal activities?
WifeNum: Altogether