key: cord-0609714-55a0l8it
authors: Charlot, Louis
title: Bayesian hierarchical analysis of a multifaceted program against extreme poverty
date: 2021-09-14
journal: nan
DOI: nan
sha: d133b2b7948d3f226a62d73a7f175398c82d46af
doc_id: 609714
cord_uid: 55a0l8it

The evaluation of a multifaceted program against extreme poverty in different developing countries gave encouraging results, but with important heterogeneity between countries. This master thesis proposes to study this heterogeneity with a Bayesian hierarchical analysis. The analysis we carry out with two different hierarchical models leads to a very low amount of pooling of information between countries, indicating that this observed heterogeneity should be interpreted mostly as true heterogeneity, and not as sampling error. We analyze the first-order behavior of our hierarchical models in order to understand what leads to this very low amount of pooling. We try to give this work a didactic approach, with an introduction to Bayesian analysis and an explanation of the different modeling and computational choices of our analysis.

The idea of the multifaceted program is to provide the poorest households in a village with a productive asset and related training and support, as well as general life-skills coaching, weekly consumption support for some fixed period, access to savings accounts, and health information or services. The different components of this intervention are designed to complement each other in helping households to start a productive self-employment activity and exit extreme poverty. Banerjee et al. (2015) [5] evaluate this intervention with six randomized trials carried out in six different countries (Ethiopia, Ghana, Honduras, India, Pakistan, and Peru), with a total of 10,495 participants. While they find positive intention-to-treat (ITT) effects of the intervention for most outcomes when all sites are pooled together, this is not always the case when they analyze the impact of the intervention for each site separately. In particular, 24 months after the start of the intervention, they find that asset ownership increases significantly in all sites but Honduras. Given this heterogeneity of the results between the different sites, the authors conclude that it would be important to study the significant site-by-site variation in future work. The purpose of this master thesis is to study this heterogeneity using Bayesian statistics. More precisely, we use two different Bayesian hierarchical models to provide an alternative to the original frequentist analysis of [5]. Our first hierarchical model is inspired by the model proposed by Rubin (1981) [8] and analyzed by the Stan Development Team (2020) [9] and Gelman et al. (2013) [10]. It uses directly as input the coefficients and standard errors of site-level regressions, and is adapted to cases where the full data are not available. Our second model is inspired by Gelman and Hill (2006, chapter 13) [11] and Gelman et al. (2013, chapter 15) [10] for the theoretical part, and by the Stan Development Team (2021, Stan User's Guide, chapter 1.13) [9] for the practical implementation. Contrary to the first model, this second model can take as input the full dataset of outcomes, together with individual- and site-level predictors. This brings more information into the model, and possibly improves predictions. For both models, we use recent recommendations of the Stan Development Team (2021) [9] to optimize the related calculations.
In addition, we explain our choices of priors. We implement these hierarchical models with the language Stan (Stan Development Team, 2021) [9], which is based on the No-U-Turn Sampler (NUTS), a recent improvement by Hoffman and Gelman (2014) [12] of the Hamiltonian Monte Carlo (HMC) sampling method. We justify this choice of an HMC sampler over older Markov Chain Monte Carlo (MCMC) samplers (Betancourt, 2017 [13]; Haugh, 2021 [14]; Neal, 2011 [15]). We finally calculate the level of information pooling between sites to which our hierarchical modeling leads.

Our Bayesian hierarchical analysis leads to a very low amount of pooling of information between the different sites. This gives us estimates of the ITT effect of the multifaceted program on asset ownership 24 months after the asset transfer that are very close to the ones of the simple no-pooling site-level regressions of the original approach of Banerjee et al. (2015) [5]. According to our different models, which all lead to similar results, the observed heterogeneity of the site-level estimates should thus be interpreted mainly as true inter-site heterogeneity. Our average pooling of information between sites, around 3%, is much lower than the values obtained by Meager (2019) [3] in her Bayesian hierarchical analysis of a microcredit expansion policy, which range from 30% to 50% depending on the variable observed. To understand whether this difference between our results and Meager's comes from the model or more fundamentally from the data, we apply our first Bayesian hierarchical model to her microcredit data, and find pooling averages of the same order of magnitude as those she found. Therefore, the difference with Meager's results seems to come from the data rather than from the model. Running several simulations, we then try to understand the reason for this difference. We find that our Bayesian hierarchical model tends to have the following first-order behavior. When the site-level estimates of the program impact are close enough to one another relative to their associated standard errors, the model seems to consider that the observed heterogeneity is mostly due to sampling error of the ITT effect measure in each site, and will therefore enact a high amount of pooling of information between sites. This is the case with Meager's analysis. On the contrary, when the site-level estimates of the program impact are far from one another (always using their associated standard errors as the metric), the model seems to consider that the observed heterogeneity is mostly due to true heterogeneity of the ITT effect between sites, and will not enact a high amount of pooling of information between sites. This is the case with our analysis.

The Bayesian hierarchical approach is already used to analyze the results of medical trials (Bautista et al., 2018 [16]). In development economics, Meager (2019) [3] conducts a Bayesian hierarchical analysis of the results of several microcredit evaluations. In this master thesis, we use models that differ from the ones we could see in our limited readings of the medical literature. We also use a different approach from the one proposed by Meager (2019) [3]. Inspired by the work of Haugh (2021) [14], McElreath (2016) [17], and the Stan Development Team (2021) [9], we try to give a didactic presentation of the different choices we made for the modeling, the priors, the sampling methods, and the optimization of calculations. We also decide to use different models in our work in order to compare and analyze their results.
Finally, we try to use recently developed Stan packages in order to present the results as visually as possible.

The remainder of this master thesis is organized as follows. In Section 2, we introduce the multifaceted program and its research context. In Section 3, we present the utility of the Bayesian hierarchical approach for our analysis. Then, in Section 4, we present the hierarchical models we use, and the related optimization and prior choices. In Section 5, we present the results we obtain with the different models. In Section 6, we interpret the results and see what they bring to our knowledge of the multifaceted program. Finally, in Section 7, we present ideas for future research and their related challenges.

In this section, we introduce the multifaceted program and the research context that led to its implementation. More precisely, we first briefly present the multifaceted program and the cash and skills constraints it is intended to release. Given that the multifaceted program shares a similar aim with microcredit, we also give some points of comparison with microcredit. Then, in a second part, we try to summarize the important results of the first field evaluations of the multifaceted program, and mention some open questions about this program.

Banerjee et al. (2018) [7] argue that the multifaceted program has been developed with the idea that there should be complementarities between the program's pieces. The consumption support is intended to help the families during the setting up of their business, to avoid the sale or the consumption of the asset. The training and the visits are there to help them avoid elementary mistakes and stay motivated. The savings accounts are intended to encourage the households to save their earnings, and convert savings into future investments for the business. As underlined by Bandiera et al. (2017) [6], if such a program can permanently transform the lives of the poor, it would establish a causal link between the lack of capital and skills and extreme poverty in developing countries. To take the words of Banerjee (2020) [18], the multifaceted program addresses a "very big question: are those in extreme poverty there because they are intrinsically unproductive, or are they just unlucky and caught in a poverty trap?" More precisely, the multifaceted program should help us understand why people stay poor. As highlighted by Balboni et al. (2021) [19], there are two rival theories trying to answer this question. The equal opportunity theory explains that differences in individual characteristics like talent or motivation make the poor choose low productivity jobs. The poverty traps theory explains on the contrary that access to opportunities depends on initial wealth, and thus poor people have no choice but to work in low productivity jobs. According to the poverty traps theory, through a sufficient asset transfer, training and support, the multifaceted program should permit the very poor households to exit poverty persistently by crossing this initial wealth barrier. A success of the multifaceted program would therefore support the poverty traps theory.

The rationale for the multifaceted program rests on the existence of a cash constraint that prevents the very poor from exiting poverty through successful business creation. The identification of such a cash constraint, preventing high returns and poverty exit for (potential) entrepreneurs, has been the aim of different randomized experiments.
The results of these RCTs seem to imply that such a cash constraint exists, but that it is not always the only one. McKenzie and Woodruff (2008) [20] identified that a release of the cash constraint could lead to high returns, with an RCT conducted on male-owned firms in the retail trade industry in Mexico. This experiment, by providing either cash or equipment to randomly selected enterprises, showed that an exogenous increase of capital generated large increases in profits. This return was very high (more than 70%) for firms that reported being financially constrained, which are the informal firms with less educated owners without entrepreneur parents. However, an RCT conducted by Banerjee et al. (2015b) [21] has shown that releasing this cash constraint does not always lead to high returns. Their experiment showed poor effects of a group-lending microcredit in Hyderabad, which targets women who may not necessarily be entrepreneurs: the demand for credit was lower than expected (informal borrowing declined with the emergence of microcredit and there was no significant difference in the overall borrowed amount) and there was no increase in overall consumption. Finally, business profits increased only for those who already had the most successful businesses before microcredit. These examples show that a cash constraint seems to exist, but that releasing it does not seem sufficient for all population groups to exit poverty. It worked for male-owned firms in the retail trade industry (McKenzie and Woodruff, 2008) [20] but not for women who may not necessarily be entrepreneurs (Banerjee et al., 2015b) [21]. Other constraints seem to exist.

We just saw that the poor seem to face a cash constraint, and we know that the aim of microcredit is precisely to release this cash constraint. So why don't we just implement microcredit instead of the multifaceted program? While microcredit and the multifaceted program are similar in that both release a cash constraint, they differ importantly on other points. Indeed, the multifaceted program also releases a skills constraint, and it does not require people to reimburse what they received. Meager (2019) [3] shows that, while microcredit has no significant effect on the average borrower, it can have positive but highly variable effects on profits for experienced people who already owned a business. This confirms that microcredit is not a sufficient solution for the very poor, who have no business experience. In addition, the general equilibrium effects of microcredit have been questioned by some economists. For instance, Bateman and Chang (2012) [4] claim that microcredit can harm a developing country's economy through several mechanisms. These different mechanisms, which may negatively affect the economy at a large scale, cannot be observed during an RCT evaluation. First, through its high interest rates and short-maturity loans, the microcredit model tends to encourage the development of unsophisticated micro-enterprises (retail and service operations) instead of growth-oriented enterprises with longer-term returns that use more sophisticated technologies. This can be problematic, as technically innovative ideas and institutions can be important actors of development. Second, as argued by Karnani (2007) [22], the microcredit model ignores the importance of scale economies by supporting the development of tiny micro-enterprises at the expense of large, productive and labor-intensive industries.
In addition, by increasing the number of micro-enterprises without substantially increasing demand, microcredit tends to increase competition and decrease prices, and thus income, for the enterprises already on the market. The disappointing impact of microcredit for the very poor and its potentially harmful general equilibrium effects justify the search for another policy, like the multifaceted program.

We saw previously that classical microcredit does not significantly improve the lives of the very poor. Besides releasing the cash constraint more fully for the very poor, multifaceted programs present other advantages compared to microcredit. Indeed, as households do not have to pay back the asset transfer, they will probably take more risks by investing more in their new activity. In addition, the training gives them some "entrepreneurial experience" that was perhaps missing in the microcredit experiments. Two similar forms of the multifaceted program have been analyzed so far. The first form of multifaceted program has been evaluated by Bandiera et al. (2017) [6]. To measure long-term effects of the program, households were surveyed 2, 4 and 7 years after the program implementation. Bandiera et al. (2017) showed that this intervention enabled the poorest women to shift out of agricultural labor by running small businesses. This shift, which persists and strengthens after assistance is withdrawn, leads to earnings 21% higher than those of their counterparts in control villages. There is an increase in self-employment and a decrease in wage employment for treated women, who work more regularly and are more satisfied. These effects are larger 4 years after the start of the program than 2 years after, and are sustained after 7 years, showing that these positive effects seem to last and suggesting a sustainable path out of poverty. A quantile effect analysis shows that the program has a positive effect on earnings and expenditures at all deciles, but that these effects are slightly larger at higher deciles. However, the question of external validity is still open: the authors mention that a similar program in West Bengal gave good results, but that this was not the case in Andhra Pradesh. They explain this failure by the fact that the Government of Andhra Pradesh simultaneously introduced a guaranteed-employment scheme that substantially increased earnings and expenditures for wage laborers.

A second form of multifaceted program (the one we will study) has been evaluated by Banerjee et al. (2015) [5]. The aim of the program is still to help very poor households to start or continue with a self-employment activity through a release of cash and skills constraints. More precisely, it is this time a combination of a productive asset transfer (ranging from raising livestock to petty trade), technical skills training on managing the particular productive asset together with high-frequency home visits, consumption support (regular transfer of food or cash for a few months to about one year), saving support (access to a savings account and in some instances a deposit collection service and/or mandatory savings), and some health education, basic health services, and/or life-skills training. There are some differences from one site to another: for instance, only 4 sites partnered with microfinance institutions able to provide access to savings accounts (which were more or less compulsory depending on the site).
They also find very positive results for their program: consumption, revenue and income of the treated very poor households increase, and these positive effects persist for at least one year after the program ends. This increase, though significant at all tested quantiles, is lower at the bottom of the distribution. Although results vary across countries, the general pattern of positive effects that persist for at least one year after the end of the program is common across all countries, with weaker impacts in Honduras and Peru. In addition, if we consider total consumption as the measure for benefits, all the programs except the Honduras one have benefits greater than their costs (from 133% in Ghana to 433% in India). So, in both cases the multifaceted program seems to address the issue of extreme poverty exit better than what we saw with microcredit. In addition, the multifaceted program has some advantages compared to a classical unconditional cash transfer of the same cost, according to Bandiera et al. (2017) [6]: they showed that this program gives higher earnings on average, and allows households to smooth their consumption more naturally. We can also note that Banerjee et al. (2018) [7] report that a seven-year follow-up conducted in Bangladesh and India showed that impacts persisted in these countries. Even if this follow-up has not been carried out in all the countries, it gives some hope of long-lasting positive effects of the program. Finally, Banerjee et al. (2018) [7] also find that both a savings-only treatment and an asset-only treatment have worse effects than the full multifaceted program. More precisely, the savings-only program has much weaker effects on consumption one year after the end of the program, while there is no evidence of any positive welfare effects for the asset-only treatment. These first results provide some evidence of the complementarity of the asset transfer with the other components of the program.

There are still some open questions about the impact of the multifaceted program. Let us first note that some of the criticisms of microcredit by Bateman and Chang (2012) [4] can also be applied to this multifaceted intervention. Indeed, as the program's ultimate goal is to increase consumption and life standards of the very poor through an occupational shift towards entrepreneurship, we can ask whether this multiplication of tiny micro-enterprises can harm the economy of a country. It is not easy to answer these criticisms, as this program has so far been evaluated only through medium-term, medium-scale RCTs. Nevertheless, as the main goal of this program is to extract the very poor from the poverty trap (no capital, no skills), and not necessarily to improve the economic development of an entire country (which was the initial claim of microcredit), these uncertainties do not really call the multifaceted program into question. As mentioned above, 24 months after the start of the intervention, Banerjee et al. (2015) [5] find that asset ownership increases significantly in all sites but Honduras. Given this heterogeneity of the results between the different sites, the authors conclude that it would be important to study the significant site-by-site variation in future work. The purpose of this master thesis is to study this heterogeneity using Bayesian statistics.

3 Utility of the Bayesian hierarchical approach

The data of Banerjee et al. (2015) [5] is hierarchical by nature: we look at household characteristics, and these households are located in different countries.
With this type of data, three different approaches are possible. The first one is to treat the households of different countries separately, supposing that what we learn from a country does not provide any information about another country: this is the no-pooling approach. The second one is to treat the households of different countries without any distinction, supposing that what we observe in a country tells us exactly what happens in another country: this is the full-pooling approach. We suppose in this case that all experiments might be estimating the same quantity. The third approach, which is the one we are interested in, is to implement partial pooling at a level based on what we observe: this approach is halfway between the two previous ones. In our case, while the full-pooling approach shows a globally significant increase in asset ownership, the no-pooling approach shows that this increase is significant in all sites but Honduras. As highlighted by Gelman et al. (2013) [10] and McElreath (2016) [17], neither of these two extreme approaches (that is, the separate analyses that consider each country separately, and the alternative view of a single common effect that leads to the pooled estimate) is intuitively appealing. Indeed, the full-pooling approach ignores possible variations between sites. In our case, it would imply that we believe that the probability that the true effect of the multifaceted program on asset ownership in Honduras (where no significant effect of the multifaceted program is measured) is lower than in Ghana (where a significant positive effect is measured) is only 50%. On the contrary, considering each country separately would imply that we believe that the large positive and significant effects of the multifaceted program measured in other countries do not provide any hope for the true effect in Honduras being greater than what was measured. We can see that neither of these approaches is fully satisfactory: it would be interesting to have a compromise combining information from all countries without assuming all the true ITT effects to be equal. A hierarchical model, which takes into account the two levels of the data (household level and country level) and which we use in Section 4, provides exactly this compromise.

In this master thesis, we use the Bayesian approach to implement our hierarchical models. Before highlighting the advantages of this approach over the frequentist approach, we give in this section a very quick introduction to Bayesian statistics. As explained by Haugh (2021) [14], Bayesian statistics is based on the following application of Bayes' theorem:

π(θ | y) = p(y | θ) · π(θ) / ∫ p(y | θ) · π(θ) dθ ,

with θ being an unknown random parameter vector, y a vector of observed data with likelihood p(y | θ), π(θ) being the prior distribution of θ, that is the distribution we assume for θ before observing the data, and π(θ | y) being the posterior distribution of θ, that is the updated distribution of θ after observing the data. This formula is at the basis of Bayesian statistics: it provides a rule to update probabilities when new information appears. More precisely, the parameters of interest θ are assumed random, and the observation of new data y allows us to update our prior beliefs about the distribution of these parameters of interest. Contrary to the frequentist approach, where the parameter vector θ is treated as fixed and the dataset y as random, the Bayesian approach conditions on the observed dataset y and treats the parameter vector θ as uncertain.
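To make this updating rule concrete, here is a standard textbook example (not taken from the data of this thesis): a Beta prior on a success probability θ, updated after observing k successes in n Bernoulli trials.

```latex
% Illustrative Beta-Binomial conjugate update (standard example, not from the thesis data).
% Prior: \theta \sim \mathrm{Beta}(a, b); data y: k successes in n trials.
\begin{align*}
\pi(\theta) &\propto \theta^{a-1}(1-\theta)^{b-1}, \\
p(y \mid \theta) &\propto \theta^{k}(1-\theta)^{n-k}, \\
\pi(\theta \mid y) &\propto p(y \mid \theta)\,\pi(\theta)
  \;\propto\; \theta^{a+k-1}(1-\theta)^{b+n-k-1},
\end{align*}
\text{so that } \theta \mid y \sim \mathrm{Beta}(a+k,\; b+n-k).
```

The data simply update the prior "counts" a and b: with few observations the prior dominates, while with a large n the posterior is driven by k and n − k, which is exactly the behavior we expect from a weakly informative prior.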
As highlighted by Haugh (2021) [14], the selection of the prior is an important element of Bayesian modeling. With a small amount of data, the influence of the prior choice on the final posterior can be very large. It can be important in this case to understand the sensitivity of the posterior to the prior choice. An ideal prior should capture the relevant information we have before observing the data, but should be dominated by the data when there is plenty of data. In Section 4, we will justify the different choices of priors that we use for our hierarchical models. There are several advantages of the Bayesian approach over the frequentist one, as highlighted by Haugh (2021) [14], Gelman et al. (2013) [10], and Thompson and Semma (2020) [23]. First of all, the choice of priors allows us to express our prior beliefs about the quantities of interest: we will see several examples on this point with our hierarchical models in Section 4. Another very useful advantage of Bayesian statistics in the case of our hierarchical analysis is the possibility to build very flexible models. The fact that we can visualize the final distributions with Bayesian statistics also leads to results that are easy to understand and interpret. The more intuitive Bayesian interpretation of the results is particularly visible with credible intervals, the Bayesian counterpart of frequentist confidence intervals. While the 95% credible interval for a parameter θ can be interpreted as having a 95% probability of containing θ, this is not true for the 95% confidence interval for θ. Indeed, a frequentist analysis assumes that θ is unknown but not random. Therefore, a given 95% confidence interval contains θ with probability 0 or 1. The interpretation of a confidence interval is less intuitive: if we repeated the experiment infinitely many times, we would expect the 95% confidence intervals to contain the true value of θ 95% of the time. As mentioned by Haugh (2021) [14], there are also some drawbacks to the Bayesian approach. The first one is the subjectivity induced by the choice of the prior: we will discuss this difficulty and how to partially handle it in our practical implementation. The second one is the high computational cost of Bayesian analysis: this is in particular the case for the Bayesian sampling challenge, which we address in the next two sections.

An important challenge in Bayesian statistics is the resolution of the sampling problem. As highlighted by Haugh (2021) [14], it is the recent progress in sampling methods that made it possible to carry out the calculations necessary for Bayesian analysis. After a century of domination of frequentist analysis, this sampling progress has brought Bayesian analysis back to the forefront. The sampling problem is the problem of simulating from the posterior π(θ | y) without knowing the denominator (normalizing constant) in the expression

π(θ | y) = p̄(θ) / Z_p̄ , with p̄(θ) = p(y | θ) · π(θ) and Z_p̄ = ∫ p̄(θ) dθ.

This problem appears in Bayesian models where the unnormalized posterior p̄(θ) can be evaluated, but the normalizing constant Z_p̄ cannot be computed analytically. Several Markov Chain Monte Carlo (MCMC) sampling methods have been developed to solve this sampling problem (Betancourt, 2017 [13]; Haugh, 2021 [14]). The common idea of these methods is to construct a Markov chain for which the distribution we want to sample (that is, p̄(θ)/Z_p̄) is the stationary distribution. After having achieved this construction, we only need to simulate the Markov chain until stationarity is achieved to get the desired sample. In order to obtain a high quality and quick sampling for our hierarchical models, we use the Hamiltonian Monte Carlo (HMC) version of MCMC sampling, as recommended by Neal (2011) [15].
HMC is a recent version of MCMC sampling that uses the formalism of Hamiltonian dynamics in order to improve the quality and the efficiency of sampling for multidimensional and complex models. Very briefly, Hamiltonian dynamics is a branch of physics that describes the movement of a physical particle using the Hamiltonian formalism (a particular way of writing the physical equations of motion). The very interesting point for us is that the formalism of Hamiltonian dynamics can be applied to very diverse systems, including sampling in Bayesian statistics. The idea is to treat the parameter vector whose posterior distribution we want to sample as the position q of a particle in Hamiltonian dynamics. We then randomly associate a distribution to the momentum p (mass multiplied by velocity) of this same particle, and simulate the evolution of this particle in the state space (p, q) using Hamiltonian dynamics and the leapfrog method (an improvement of the Euler method that is better adapted to Hamiltonian dynamics). We finally accept the new state (p, q) of the particle with a probability that depends on how well the total energy is conserved at this new state: the better the conservation, the higher the acceptance probability. If we do not accept the new state, we stay at the previous state. As this very short introduction provides only the basic idea of the HMC method, we refer readers interested in these questions to the very interesting paper by Neal (2011) [15], which goes into more detail in a very didactic way. The original HMC algorithm requires the user to set the step size ε and the number of steps L for the leapfrog part of the algorithm. Fortunately, Hoffman and Gelman (2014) [12] proposed a new version of HMC that sets L and ε automatically, which is called the No-U-Turn Sampler (NUTS). Given the advantages of HMC sampling for hierarchical models, we use in this master thesis the probabilistic programming language Stan, presented by Carpenter et al. (2017) [24] and the Stan Development Team (2021) [9], which provides Bayesian inference with NUTS.

We will now present the two models we used to carry out our hierarchical analysis of the impact of the multifaceted program on asset ownership. While our first model uses directly as inputs the coefficients and standard errors of the site-level regressions, our second model uses the full dataset of individuals as input. In this section, we will present both models, their optimization and the related prior choices. We will also see the advantages and difficulties of each one.

Our first model is inspired by Rubin (1981) [8] and by Thompson and Semma (2020) [23]. In this model, we begin by carrying out, in each site s, an ITT regression of the outcome on the treatment assignment, of the form

y_is = α_s + τ_s · T_is + ε_is ,

where y_is is the outcome of household i in site s and T_is its treatment assignment. We then use the estimates τ̂_s of the treatment effect τ_s and their associated standard errors σ̂_s obtained with each site-level regression to build the following model:

τ̂_s ∼ N(τ_s, σ̂_s²) , τ_s ∼ N(τ, σ²) , for s = 1, ..., 6.

In other words, we assume here that the observed treatment effect τ̂_s for site s is drawn from the normal distribution N(τ_s, σ̂_s²) around the true treatment effect for site s. According to DuMouchel (1994) [26], if the estimates of site treatment effects are not based on very small samples, this assumption is likely to be a good one. We also assume here that the mean τ_s of this normal distribution of the treatment effect for site s is drawn from the normal distribution N(τ, σ²) of the treatment effect over all possible sites in the universe. τ and σ are considered as random and must be assigned a prior.
As we believe that no prior knowledge exists about τ and σ, we assign them weakly informative priors:

τ ∼ N(0, 5) and σ ∼ half-Cauchy(0, 5).

The choice of a half-Cauchy distribution (that is, a Cauchy distribution defined over the positive reals only) is usual for a parameter, like σ, that is strictly positive, as highlighted by McElreath (2016) [17]. We give a visual summary of this first model in Figure 1 for more intuition. Let us note that this model is the Bayesian version of the frequentist random-effects model for meta-analysis, as highlighted by Thompson and Semma (2020) [23] and Biggerstaff et al. (1994) [27]. In the frequentist version, used for instance by Fabregas et al. (2019) [28] in development economics, the difference is that the parameters τ and σ are not considered as random and are directly estimated.

As highlighted by Gelman et al. (2013) [10], this direct ("centered") formulation of the model can lead to very slow convergence for HMC, because no single step size works well for the whole joint distribution of τ_s, τ and σ. Indeed, according to Gelman et al., the HMC trajectories are unlikely to go into the region where σ is close to 0, and are then unlikely to leave this region when they are inside it. Therefore, they propose to write this model with the following ("non-centered") parametrization:

τ_s = τ + σ · η_s , with η_s ∼ N(0, 1) and τ̂_s ∼ N(τ_s, σ̂_s²).

The idea here is to take the means and standard deviations out of the original Gaussian distribution, which leaves only a standardized Gaussian prior. While the model is technically the same as before, this "non-centered" form of the model permits much more efficient sampling, according to McElreath (2016) [17].

We also carry out the data analysis with a second hierarchical model, inspired by Gelman and Hill (2006) [11] and Gelman et al. (2013) [10]. The main idea of this second hierarchical model is to suppose that the outcome of interest y_i for an individual i has a distribution of the form

y_i ∼ N(X_i β_s[i], σ_y²) ,

where X_i is the vector of individual-level predictors (including an intercept), s[i] is the site of individual i, and β_s is the vector of coefficients of site s. In the simple case where X_i = (1, T_i), with T_i the treatment assignment, the vector β_s = (α_s, τ_s) contains the intercept α_s and the treatment effect τ_s of site s: we can notice that both are allowed to vary between sites. Returning to the general case and following Gelman and Hill (2006) [11], we suppose that the coefficients β_s for each site s follow a multivariate normal distribution:

β_s ∼ N(μ_s, Σ).

For instance, in the case where X_i = (1, T_i), we would have

(α_s, τ_s) ∼ N((μ_α,s, μ_τ,s), Σ) , with Σ = [ σ_α² , ρ σ_α σ_τ ; ρ σ_α σ_τ , σ_τ² ],

with σ_α and σ_τ the standard deviations of the intercept and of the treatment effect across sites, and ρ their correlation parameter. In their model, Gelman and Hill (2006) [11] take for the prior mean μ_s of β_s a simple vector parameter μ, which does not vary across sites. As recommended by the Stan Development Team (2021) [9], we choose instead to include in our model site-level information through site-level predictors Z_s (including an intercept through Z_s,1 = 1). The idea is to model the prior mean μ_s of β_s itself as a regression over the site-level predictors Z_s as follows:

μ_s = Z_s γ ,

with γ being the vector of site-level coefficients. We can give each element of γ a weakly informative prior, such as a normal distribution centered on 0 with a moderately large scale. Regarding the prior on the covariance matrix Σ of β_s, Gelman and Hill (2006) [11] propose to use a scaled inverse Wishart distribution. This choice is motivated by the fact that the inverse Wishart distribution is the conjugate prior (prior with the same distribution family as the posterior) for the covariance matrix of a multivariate normal distribution (Σ in our case). Indeed, using the conjugate prior is computationally convenient when using Bugs, the programming language based on the Gibbs sampler (another MCMC algorithm) used by Gelman and Hill. As we are using Stan, which is based on Hamiltonian Monte Carlo, there is no such restriction in our case. Therefore, we follow instead a more intuitive approach recommended by the Stan Development Team (2021) [9].
The idea is to decompose the prior on the covariance matrix Σ into a scale vector θ and a correlation matrix Ω as follows:

Σ = diag(θ) · Ω · diag(θ) , with θ_k = sqrt(Σ_k,k) and Ω_k,l = Σ_k,l / (θ_k θ_l).

The advantage of this decomposition is that we can then impose a separate prior on the scale and on the correlation. For the elements of the scale vector θ, the Stan Development Team recommends using a weakly informative prior like a half-Cauchy distribution (that is, a Cauchy distribution defined over the positive reals only) with a small scale. For the correlation matrix Ω, they recommend an LKJ prior, Ω ∼ LKJcorr(η) [29]. To visualize these last steps more easily, let us return to the example where X_i = (1, T_i). In this case, we have

θ = (σ_α, σ_τ) and Ω = [ 1 , ρ ; ρ , 1 ].

With a LKJcorr(1) prior, all values of ρ (between -1 and 1) are equally possible. When we increase η (η ≥ 1), the extreme values of ρ (close to -1 and 1) become less likely. Therefore, as our prior belief is that there should not be a strong correlation between α and τ, we take η ≥ 1, like η = 2. We give a visual summary of this second model in Figure 2 for more intuition.

The advantage of this second model compared to the first one of Section 4.1 is that we can include both individual-level and group-level information in our model to improve predictions. In addition, we saw that we can include new forms of prior information if available (about correlations, standard deviations, etc.). However, the calculations for this second model are heavier than for the first one, as we use the whole dataset as input, and not only the coefficients of the site-level regressions. The heavier calculations and the greater complexity of this model can lead to very long computing times, which can be an obstacle to practical implementation. We will now explore how to address this computational issue.

To address the computational issue highlighted in the previous section, the Stan Development Team (2021) [9] recommends vectorizing the Stan code of our model. The idea is to create local variables for Xβ and Σ, which reduces the sampling time for y and β by avoiding unnecessary repetitions of calculations that would happen with loops (see the Stan code in the Appendix for more details). In addition, as the vectorization can be insufficient to optimize HMC sampling, the Stan Development Team proposes to combine it with a Cholesky-factor optimization. Indeed, we can notice that, as a correlation matrix, Ω is symmetric positive definite (SPD). Therefore, according to the Cholesky factorization theorem (van de Geijn, 2011) [30], there is a lower triangular matrix Ω_L such that

Ω = Ω_L · Ω_L'.

The idea of the Cholesky-factor optimization is to take advantage of this factorization of Ω to reduce the number of matrix multiplications in our sampling. This can be done by defining our β by the alternative form

β_s = Z_s γ + diag(θ) · Ω_L · u_s ,

with u_s a random vector of independent standard normal components and Ω_L the Cholesky factor of Ω. If we want our Ω to have a LKJcorr(η) prior, we have to assign to Ω_L the equivalent Cholesky-factorized prior:

Ω_L ∼ LKJcorrCholesky(η).

We can check that, as desired, β_s then follows the N(Z_s γ, Σ) distribution with Σ = diag(θ) Ω diag(θ). A last point that can be optimized is the sampling from the half-Cauchy distribution θ_k ∼ Cauchy(λ, ω) constrained by θ_k > 0. Instead of sampling θ_k directly, we can sample a uniform variable and apply the inverse cumulative distribution function of the Cauchy distribution: if W ∼ uniform(0, 1), then F_θ_k^(-1)(W) ∼ Cauchy(λ, ω). This gives the sampler a bounded, better-behaved parameter space to explore.

We will now present the results we obtained with our Bayesian hierarchical analysis. To begin, we justify our choices for the implementation of the two Bayesian hierarchical models in the case of the multifaceted program. Then, we present some sampling diagnostics of our Bayesian analysis.
After these first steps, we present the posterior distributions and the pooling we obtained for the parameters of interest.

For Models 2 and 2bis, we have to make an additional choice, about site-level predictors. We decide to use for both models the following site-level predictors Z: the value of the asset transfer (measured in local goat price) and the presence of a health component in the program. We did not introduce other site-level predictors, as we found that the other site-level information available was difficult to convert into a comparable predictor for all sites.

As explained by the Stan Development Team (2021) [9] and Gabry and Modrák (2021) [32], the presence of divergent transitions during the exploration by HMC of the target posterior distribution may be due to the use of a step size that is too large for the exploration of the possibly small features of the target distribution: they therefore recommend using a smaller step size, which we try with our data. As the issue does not disappear even with a smaller step size, we have to address it with the optimized (non-centered) parametrizations presented in Section 4.

We present the posterior distributions obtained for the ITT effects of the multifaceted program on household total asset ownership for our 3 models graphically in Figure 3 and numerically in Tables 1, 2 and 3. We also present the estimates and standard errors for these ITT effects obtained with the separate site-level regressions (No-Pooling) in Table 4. For instance, there is a probability of approximately 25% that τ_3 is below -0.01 for this model.

Figure 3: Density of posterior distributions of the ITT effect of the multifaceted program on total asset ownership obtained with Models 1, 2 and 2bis. The blue area corresponds to the values included in the 95% credible intervals.

These results are very close to the ones of the original analysis of Banerjee et al. (2015) [5]. Indeed, similarly to the original study, we find with all the models a significantly positive impact of the intervention on asset ownership, except for Honduras, where the value 0 is included in the 95% credible intervals. We will come back in Section 6 to the interpretation of the similarity between the results of our analysis and the original analysis, but let us first quantify the level of pooling obtained with our hierarchical analysis. Given the important proximity of the results for Models 1 and 2, we will focus on the analysis of information pooling with Model 1 only.

To evaluate the pooling of information between the different sites, we use the approach proposed by Gelman and Pardoe (2006) [34]. This approach, also used by Meager (2019) [3], will enable us to compare our results with the results she found by applying a Bayesian hierarchical analysis to study the impact of microcredit expansion interventions. In their very interesting paper, Gelman and Pardoe (2006) [34] propose what they call a pooling factor, which represents for each site s the extent of information pooling with other sites. In the case of Model 1, it can be written as

ω_s = σ̂_s² / (σ̂_s² + σ̄²) ,

with σ̄ the mean of the posterior distribution of σ obtained with our hierarchical model, and σ̂_s² the sampling variance defined in Section 4.1. To get an intuition of what this pooling factor ω_s represents, let us take two extreme cases. In the case where our posterior results indicate no heterogeneity of the ITT effect τ_s between sites, σ̄² is very close to 0. This leads to a pooling factor very close to ω_s = 1, meaning that we have a full pooling of information between sites.
This is quite intuitive: as our posterior results indicate no heterogeneity of the ITT effect τ_s between sites, the effect we measure in India should tell us exactly what happens in another site like Bangladesh, so there is a lot of pooling. On the contrary, in the case where our posterior results indicate a very important heterogeneity of the ITT effect τ_s between sites compared to the sampling error in site s, we have that σ̄² >> σ̂_s². This leads to a pooling factor close to ω_s = 0, meaning that we have no pooling of information between sites. This is again very intuitive: as our posterior indicates a lot of heterogeneity of the ITT effect τ_s between sites, the effect we measure in India does not provide us with much information about what happens in another site like Bangladesh, so there is little pooling.

After this quick introduction of the pooling factor ω_s, we present the posterior distributions obtained for the between-site heterogeneity parameter σ of our Model 1 graphically in Figure 4 and numerically in Table 5. We also present the pooling factors ω_s obtained with this model in each site s in Table 6.

Figure 4: Density of posterior distributions of the between-site heterogeneity parameter σ obtained with Model 1. The blue area corresponds to the values included in the 95% credible intervals.

We can notice from these results that our hierarchical Model 1 leads to pooling factors ω_s ranging from 2% to 7% in the different sites. This is a very small amount of pooling compared to the ones found by Meager (2019) [3] in her analysis of microcredit expansion, which were around 50%, with variations according to the outcomes observed. In the next section, we will try to understand where these differences in pooling between our results with the multifaceted program and Meager's results with microcredit expansion come from. We will also try to understand why Model 1 and Model 2 give very similar results.

We will now provide an interpretation of the main results we obtained with our Bayesian hierarchical analysis. To begin, we analyze the important differences between our results and the ones of the first application of Bayesian hierarchical analysis in development economics by Meager (2019) [3]. Then, we focus on the interpretation of the important similarity between the results we obtained with Model 1 and with the more complex Model 2. Finally, we conclude on the knowledge that this Bayesian hierarchical analysis brings us about the multifaceted program.

In Section 5.4, we noticed that our hierarchical Model 1 leads to pooling factors ω_s ranging from 2% to 7% in the different sites, which corresponds to an average over the different sites of ω̄ = 3%. This average pooling ω̄ is much lower than the ones obtained by Meager (2019) [3] in her study of microcredit expansion, ranging from 30% to 50% depending on the variable observed. To understand whether this difference between our results and Meager's comes from the model we use or more fundamentally from the data we analyze, we apply our Bayesian hierarchical Model 1 to her microcredit data, and find pooling averages ω̄ of the same order of magnitude as those she found. Therefore, the difference with the results of the Bayesian hierarchical analysis by Meager does not seem to come from the model, but rather from the data. Compared to the site-level regressions of the multifaceted program, the ones of the microcredit expansion program lead to treatment effects for which the standard errors σ̂_s are of the same magnitude as the estimates τ̂_s.
This is linked to the fact that the effects measured for the microcredit expansion program are closer to 0 (and less significant) than the effects measured for the multifaceted program. To understand better how the amount of pooling changes with the properties of our data, we decide to run some simulations. As the data input of our Model 1, excluding the weakly informative priors, consists only of the estimates of the treatment effect τ̂_s and their associated standard errors σ̂_s obtained with the site-level regressions, we run simulations in which we respectively increase and decrease τ̂_s and σ̂_s, and see what happens to the average amount of pooling. The results of these simulations are presented in Table 7. According to our simulations, the first-order behavior of our Bayesian hierarchical Model 1 seems to be the following. When the site-level estimates τ̂_s are close enough to one another relative to their associated standard errors, the model seems to consider that the observed heterogeneity is mostly due to sampling error of the ITT effect measure in each site, and will therefore enact a high amount of pooling of information between sites. On the contrary, when the site-level estimates τ̂_s are far from one another (always using their associated standard errors as the metric), the model seems to consider that the observed heterogeneity is mostly due to true heterogeneity of the ITT effect between sites, and will not enact a high amount of pooling of information between sites. This applies in particular to Honduras, where the measured effect was close to zero. Thus, according to our models, the disappointing results in Honduras reflect true heterogeneity between sites, and are not simply due to sampling error. In the next section, we present some ideas for further research that could be used to deepen the analysis of the multifaceted program.

We will finally give some ideas for further research about the multifaceted program that could be explored in future work. First, we discuss the possibility of including more informative priors. Then, we mention some further methods relative to model comparison. Finally, we propose some ideas, and their related challenges, about the modelling of the interaction between the different outcomes of the multifaceted program, and about the inclusion of more complexity in our models.

In our analysis, we obtained very similar results with our different models. However, it might happen that these models lead to diverse results for other outcomes of the multifaceted program, or even for the analysis of another program. In this case, we would be further interested in selecting one model among our different proposals, and we should use model selection methods. As highlighted by Vehtari et al. (2016) [35] and Haugh (2021) [14], such Bayesian model comparison can be carried out for instance with leave-one-out cross-validation or the widely applicable information criterion (WAIC).

Regarding the joint modelling of the different outcomes of the multifaceted program, let us note that such an analysis presents an important challenge. Such modelling would lead to complex models, with more levels than the models we have implemented in our analysis with one outcome only. Indeed, there would be not only a hierarchy due to the presence of different sites, but also a hierarchy due to the presence of different outcomes. Given that our two-level models, which are much simpler, already required optimization methods to obtain good sampling with our HMC, we can imagine that more complex models would face even greater computational issues, which could perhaps prevent the analysis from being carried out.
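To make this last idea more concrete, here is a minimal sketch, not part of the thesis code, of how Model 1 could be extended to K outcomes while letting the site-level effects be correlated across outcomes. The data structure (one estimate and standard error per site and outcome), the variable names and the priors are our own assumptions for illustration.

```stan
// Hypothetical sketch: Model 1 extended to K outcomes, with site-level effects
// correlated across outcomes (non-centered parametrization). Not the thesis code.
data {
  int<lower=1> S;                    // number of sites
  int<lower=1> K;                    // number of outcomes
  matrix[S, K] tau_hat;              // site-level ITT estimates, one column per outcome
  matrix<lower=0>[S, K] se;          // associated standard errors
}
parameters {
  vector[K] tau;                     // overall effect for each outcome
  vector<lower=0>[K] sigma;          // between-site heterogeneity for each outcome
  cholesky_factor_corr[K] L_Omega;   // correlation of site effects across outcomes
  matrix[K, S] u;                    // standardized site effects
}
transformed parameters {
  // tau_s[s, ] = tau' + (diag(sigma) * L_Omega * u[, s])'
  matrix[S, K] tau_s = rep_matrix(tau', S)
                       + (diag_pre_multiply(sigma, L_Omega) * u)';
}
model {
  tau ~ normal(0, 5);                // weakly informative priors
  sigma ~ cauchy(0, 5);              // half-Cauchy thanks to the lower=0 constraint
  L_Omega ~ lkj_corr_cholesky(2);
  to_vector(u) ~ std_normal();
  // measurement model: each estimate is drawn around the true site effect
  to_vector(tau_hat) ~ normal(to_vector(tau_s), to_vector(se));
}
```

Even this simple extension already adds a K × K correlation matrix and S × K site-level effects, which illustrates why sampling such multi-outcome models is computationally more demanding than the two-level models used in this thesis.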
These results suggest that a very large part of the observed heterogeneity of the impact between sites is due to true heterogeneity between sites. This differs importantly from the results of the first application of Bayesian hierarchical analysis in development economics by Meager (2019) [3]. In her study of microcredit expansion, she found levels of pooling between sites much higher than the ones we found, suggesting that an important proportion of the observed heterogeneity was due to sampling error in her case. The results of our simulations allow us to understand the reason for this difference. In fact, the Bayesian hierarchical models we used tend to have the following first-order behavior. When the site-level estimates of the program impact are close enough to one another relative to their associated standard errors, the model seems to consider that the observed heterogeneity is mostly due to sampling error of the ITT effect measure in each site, and will therefore enact a high amount of pooling of information between sites. This is the case with Meager's analysis. On the contrary, when the site-level estimates of the program impact are far from one another (always using their associated standard errors as the metric), the model seems to consider that the observed heterogeneity is mostly due to true heterogeneity of the ITT effect between sites, and will not enact a high amount of pooling of information between sites. This is the case with our analysis. Let us finally note that the important similarity of the results obtained with our Models 1 and 2 is perhaps linked to the fact that there is very low pooling of information in our case. Therefore, this similarity of the results obtained with the simpler Model 1 and the more complex Model 2 should not be taken as a general result: the two models could lead to different results with other datasets that lead to stronger pooling. Thus, for the study of other programs, it can always be interesting to also implement the more detailed Model 2 when the required data is available.

The implementation of Model 2 is done in the R script Model_2_no_baseline.R as follows. We first prepare the data that will be used by the sampler. We then call the Stan file that contains our hierarchical Model 2, to run the sampler on this prepared data:

    fit_2 <- stan(
      file = "Model_2.stan",  # Stan file that contains Model 2
      ...

The data of the multifaceted program has been made available by the authors at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/NHIXNT. We used in particular the household-level data file pooled_hh_postanalysis.dta, which can be obtained (in the folder ScienceDataRelease/data_modified) by running the Stata do-files of the authors (located in the folder ScienceDataRelease/dofiles).

In this Appendix, we include some additional information about Markov Chain Monte Carlo (MCMC) and Hamiltonian Monte Carlo (HMC), inspired by the didactic explanations provided by Lambert (2018) [37], Betancourt (2017) [13] and Neal (2011) [15]. The first-generation MCMC algorithm is the Metropolis-Hastings algorithm. The idea of this algorithm is to obtain a sample from the posterior distribution

π(θ | y) = p(y | θ) · π(θ) / ∫ p(y | θ) · π(θ) dθ.

Given that the denominator of this expression is independent of θ, the idea is to sample directly from p(y | θ) · π(θ). As explained by Lambert (2018) [37], this is done through the following steps:

• First, we draw an initial value θ_0 of θ.
• Then, for a large number n of iterations, we repeat the following steps:

- a) We draw a proposal θ_proposal from a proposal distribution q(θ_proposal | θ_current step).

- b) We calculate:

α = [ p(y | θ_proposal) · π(θ_proposal) / q(θ_proposal | θ_current step) ] / [ p(y | θ_current step) · π(θ_current step) / q(θ_current step | θ_proposal) ].

We include q(θ_current step | θ_proposal) and q(θ_proposal | θ_current step) in order to correct for possible asymmetries of the proposal distribution q.

- c) Finally: if α > 1, we accept the proposal (θ_current step = θ_proposal). If 0 < α < 1, we accept the proposal with probability α, and reject it (θ_current step stays unchanged) with probability 1 − α.

A possible choice for the proposal distribution q is a Normal distribution centered on θ_current step: in this case, the algorithm is called Random-Walk Metropolis-Hastings. The advantage of using a Normal distribution, which is symmetric, is that it gives a simpler formula to calculate α:

α = [ p(y | θ_proposal) · π(θ_proposal) ] / [ p(y | θ_current step) · π(θ_current step) ].

Random-Walk Metropolis-Hastings has an important drawback: it is very inefficient when the posterior is multidimensional, as in our case. Indeed, when the dimensionality of the posterior distribution increases, the new values θ_proposal proposed randomly around θ_current step by the algorithm at each step will often be located in zones of very low density of the posterior. Therefore, a very small proportion of these proposals will be accepted, leading to very low sampling efficiency. We therefore need a more efficient way to propose new values θ_proposal: this is exactly the aim of HMC.

As highlighted by Lambert (2018) [37], the difference between HMC and Metropolis-Hastings lies in the way we make the proposals θ_proposal, which is based on a physical analogy. The idea is to attribute to a physical particle the position θ_current step, make it evolve in space for some time, and take as proposal θ_proposal the new position of this particle after this evolution. More precisely, we associate to a physical particle, with position θ and momentum M, the total energy

E(θ, M) = −log[ p(y | θ) · π(θ) ] + M²/2 ,

and consider the joint distribution p(θ, M) ∝ exp(−E(θ, M)). The marginal distribution of p(θ, M) for θ is then proportional to p(y | θ) · π(θ), that is, the posterior distribution π(θ | y). So we can sample (θ, M), and then we only have to look at the values of θ to get this posterior distribution. More precisely, to generate proposals for (θ, M), we proceed as follows:

• We draw for our particle a new value of the momentum M* from N(0, 1).

• We then let our particle explore the space (θ, M), starting from (θ, M*), using a discretization of the Hamiltonian equations. After t steps, we reach the state (θ*, M**), which gives our proposal θ* for Metropolis-Hastings.

• We then execute the last steps of the Metropolis-Hastings algorithm (Section 9.3.1) with this proposal.

Every time we propose a new value of the momentum M, the particle moves up or down to a new energy state, allowing us to sample the whole space, as illustrated in Figure 5. The aim of this Appendix was to provide the idea behind Hamiltonian Monte Carlo. For further details, I recommend reading the detailed explanations by Lambert (2018) [37], Betancourt (2017) [13] and Neal (2011) [15].
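As a complement to the second step above, the discretization used in practice is the leapfrog integrator already mentioned in Section 3. A standard way to write one leapfrog step of size ε (the textbook form given by Neal (2011) [15], not a listing from the thesis; we assume unit mass, so that velocity equals the momentum M, and write U(θ) = −log[p(y | θ) · π(θ)] for the potential energy):

```latex
% One leapfrog step of size \epsilon (Neal, 2011), with U(\theta) = -\log[p(y \mid \theta)\,\pi(\theta)].
\begin{align*}
M_{t+\epsilon/2} &= M_t - \tfrac{\epsilon}{2}\,\nabla_\theta U(\theta_t), \\
\theta_{t+\epsilon} &= \theta_t + \epsilon\, M_{t+\epsilon/2}, \\
M_{t+\epsilon} &= M_{t+\epsilon/2} - \tfrac{\epsilon}{2}\,\nabla_\theta U(\theta_{t+\epsilon}).
\end{align*}
```

Because each step only requires the gradient of the log unnormalized posterior, the unknown normalizing constant Z_p̄ never needs to be computed, which is what makes HMC applicable to our hierarchical models.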
References

[1] Estimates of the impact of COVID-19 on global poverty
[2] Six Randomized Evaluations of Microcredit: Introduction and Further Steps
[3] Understanding the Average Impact of Microcredit Expansions: A Bayesian Hierarchical Analysis of Seven Randomized Experiments
[4] Microfinance and the Illusion of Development: From hubris to nemesis in thirty years
[5] A multifaceted program causes lasting progress for the very poor: Evidence from six countries
[6] Labor Markets and Poverty in Village Economies
[7] Unpacking a Multi-Faceted Program to Build Sustainable Income for the Very Poor
[8] Estimation in Parallel Randomized Experiments
[9] Stan Modeling Language Users Guide and Reference Manual, 2.27
[10] Bayesian Data Analysis
[11] Data Analysis Using Regression and Multilevel/Hierarchical Models
[12] The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo
[13] A Conceptual Introduction to Hamiltonian Monte Carlo
[14] A Tutorial on Markov Chain Monte-Carlo and Bayesian Modeling
[15] MCMC Using Hamiltonian Dynamics. In Handbook of Markov Chain Monte Carlo
[16] Bayesian analysis of randomized controlled trials
[17] Statistical rethinking: a Bayesian course with examples in R and Stan
[18] Field Experiments and the Practice of Economics
[19] Why Do People Stay Poor? STICERD - Economic Organisation and Public Policy Discussion Papers Series. Suntory and Toyota International Centres for Economics and Related Disciplines
[20] Experimental Evidence on Returns to Capital and Access to Finance in Mexico
[21] The Miracle of Microfinance? Evidence from a Randomized Evaluation
[22] Microfinance misses its mark. Stanford Social Innovation Review (Summer)
[23] An alternative approach to frequentist meta-analysis: A demonstration of Bayesian meta-analysis in adolescent
[24] Stan: A Probabilistic Programming Language
[25] RStan: the R interface to Stan
[26] Hierarchical Bayes linear models for meta-analysis
[27] Passive smoking in the workplace: classical and Bayesian meta-analyses
[28] SMS-extension and farmer behavior: lessons from six RCTs in East Africa
[29] Generating random correlation matrices based on vines and extended onion method
[30] Notes on Cholesky Factorization
[31] Supplementary Materials for A multifaceted program causes lasting progress for the very poor: Evidence from six countries
[32] Visual MCMC diagnostics using the bayesplot package
[33] Bayesian linear mixed models using Stan: A tutorial for psychologists, linguists, and cognitive scientists
[34] Bayesian Measures of Explained Variance and Pooling in Multilevel (Hierarchical) Models
[35] Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC
[36] loo: Efficient leave-one-out cross-validation and WAIC for Bayesian models
[37] A student's guide to Bayesian statistics

Appendix 1: Methods for the implementation on R and Stan

The full code is available on GitHub at https://github.com/louischarlot/Bayesian_hierarchical_analysis_multifaceted_program_extreme_poverty. We want to provide here some practical information about the implementation of our Bayesian hierarchical analysis in R and Stan. The first step is to download RStan, the R interface to Stan, following the steps indicated at https://github.com/stan-dev/rstan/wiki/RStan-Getting-Started. Let us note for interested readers that Stata and Python interfaces to Stan are also available. Once the installation is completed, we have to create different Stan files to write the hierarchical models we want to sample with HMC for our Bayesian analysis.
Let us note that a Stan model file is written in the Stan modeling language and is then translated to C++ and compiled: it is the resulting C++ program that carries out the HMC sampling, communicating with R to import the input data and to export the results.

The code for Model 1 can be included in a Stan file Model_1.stan. Using the same variable notations as in Section 4.1, it can be written as follows:
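The original listing is not reproduced in this extract; below is a minimal sketch of what Model_1.stan could contain, following the non-centered parametrization and the priors described in Section 4.1. The variable names are illustrative and may differ from the original file in the GitHub repository.

```stan
// Illustrative sketch of Model 1 (non-centered parametrization), following Section 4.1.
// Variable names are ours; the original Model_1.stan may differ.
data {
  int<lower=1> S;                  // number of sites
  vector[S] tau_hat;               // site-level ITT estimates
  vector<lower=0>[S] se;           // associated standard errors
}
parameters {
  real tau;                        // effect over all possible sites
  real<lower=0> sigma;             // between-site heterogeneity
  vector[S] eta;                   // standardized site-level deviations
}
transformed parameters {
  vector[S] tau_s = tau + sigma * eta;   // tau_s ~ normal(tau, sigma), non-centered
}
model {
  tau ~ normal(0, 5);              // weakly informative priors of Section 4.1
  sigma ~ cauchy(0, 5);            // half-Cauchy because sigma > 0
  eta ~ std_normal();
  tau_hat ~ normal(tau_s, se);     // measurement model: tau_hat_s ~ N(tau_s, se_s^2)
}
```

Written this way, the sampler explores tau, sigma and the standardized deviations eta rather than the site effects tau_s directly, which is the non-centered form that avoids the slow HMC convergence discussed in Section 4.1.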