Using Satellite Imagery and Deep Learning to Evaluate the Impact of Anti-Poverty Programs

Luna Yue Huang, Solomon Hsiang, Marco Gonzalez-Navarro

April 23, 2021

The rigorous evaluation of anti-poverty programs is key to the fight against global poverty. Traditional evaluation approaches rely heavily on repeated in-person field surveys to measure changes in economic well-being and thus program effects. However, this is known to be costly, time-consuming, and often logistically challenging. Here we provide the first evidence that we can conduct such program evaluations based solely on high-resolution satellite imagery and deep learning methods. Our application estimates changes in household welfare in the context of a recent anti-poverty program in rural Kenya. The approach we use is based on a large literature documenting a reliable relationship between housing quality and household wealth. We infer changes in household wealth based on satellite-derived changes in housing quality and obtain consistent results with the traditional field-survey-based approach. Our approach can be used to obtain inexpensive and timely insights on program effectiveness in international development programs.

This makes the data less useful for studying the very target of many international development programs: people living under the poverty line. Additionally, the low spatial granularity of night light prevents it from being used to evaluate programs reliant on fine spatial variations, including most randomized controlled trials in which households in close proximity to one another are assigned to different treatments.
We propose an alternative approach: we analyze daytime imagery using a deep-learning model [20] to explicitly measure the quality of housing, a tangible and verifiable asset that is known to be a powerful proxy for household wealth. Even in communities where electrification rates are low, housing quality remains a strong predictor of wealth, in part because housing accounts for a sizable portion (10-20%) of total household expenditure globally [28]. Furthermore, in many rural and low-income contexts, individuals do not migrate often [29] and tend to frequently upgrade their housing by expanding or building new structures on their property, making housing footprint a meaningful proxy for welfare that responds to improved economic conditions. In this study, we focus on building footprint because it can be precisely measured at scale with modern deep learning techniques. Many features of buildings other than footprint are observable with satellite imagery; for example, roof material [30, 31]. One of the main advantages of the method proposed here, compared to alternative "black-box" machine learning approaches to measuring wealth that utilize all available information contained in satellite images (such as convolutional neural networks [6, 11] or random kitchen sinks [32]), is that it allows the exclusion of subsets of satellite-derived outcomes that may have been directly impacted by the intervention. We show the benefits of this feature of our method in the context of the experiment we evaluate. Specifically, households were eligible for the GiveDirectly study as long as their roofing was of low quality (thatched). Due to this eligibility criterion, treatment households were "prompted" to use the GiveDirectly transfer to upgrade their roofing, as a way to signal to the experimenters that they had put the cash to good use.
An improvement in roofs among participating households, beyond what would be expected from wealth increases alone, biases estimates of wealth when a method cannot exclude subsets of outcomes. In contrast, it is straightforward for our method to focus exclusively on subsets of available information that were not directly affected (in this case, building footprints) while ignoring problematic outcomes (such as roof material), in order to provide unbiased estimates of wealth effects.

We evaluate a development intervention conducted in 2014-2017 in 653 villages in rural Kenya [15]. GiveDirectly, a US charity, implemented a randomized controlled trial of unconditional cash transfers to rural households via mobile money, using as the sole eligibility criterion whether the household lived under a thatched roof (a low-quality roof material that served as a simple means test). Each treatment household received $1,000 (equivalent to about 75% of annual household expenditure) in a lump sum, and could spend it however they wished. To evaluate the effectiveness of the program, GiveDirectly randomly selected 328 villages as the treatment group, where eligible households (about 1/3 of the population) received transfers, and used the remaining 325 villages as the control group. The authors conducted extensive household surveys before and after the distribution of the transfers to measure program impacts, as is current practice in the evaluation literature.

Mapping Treatment Intensity and Housing Quality.

To evaluate program impacts, we first construct a map that shows the intensity of the anti-poverty program (hereafter "treatment") in different geographical units (in this case, it is simplest to work with raster grid cells). This geocoded information is obtained from program implementation records, which document where the program was administered.
Because of the extremely high granularity of satellite-derived housing quality metrics, it is feasible to study programs that induce fine spatial variation, such as household-level randomized trials. Importantly, the variation in treatment intensity has to be either random (if induced by an experiment) or as good as random (in a natural experiment setting), as is the case for any credible program evaluation project. For the GiveDirectly experiment, we construct the treatment intensity map from a local census fielded in 2014-2015, which surveyed all 65,385 households living in the study area [15]. The census data record each household's geo-location, and indicate whether it belongs to the treatment (T), control (C), or out-of-sample (O) group (Figure 1a). Among the three groups, only the treatment households eventually received the cash transfer from GiveDirectly. The control households were randomized into not receiving the transfer, whereas the out-of-sample households were never eligible to participate in the program. Our sample contains 11,055 treatment households and 10,682 control households in total. We lay out a regular grid, and count the number of treatment households in each grid cell (Figure 1b). As every transfer was roughly USD 1,000, this variable can be interpreted as the amount of cash infusion (in $1,000) into a given grid cell, and is our preferred measure of treatment intensity (Figure 1c).

We then measure housing quality from the daytime satellite imagery with a deep learning model (see Supplementary Figure S9 and Supplementary Materials D). After post-processing, each predicted building instance is represented by a polygon and a "representative" roof color (Figure 1e). The Mask R-CNN model conducts instance segmentation (as opposed to semantic segmentation), meaning that it is able to identify every building instance separately, even if they are adjacent to each other. As such, we can measure housing outcomes for each household. We extract two metrics for each built structure: the size of the building footprint, and the type of roof material.
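The grid-cell counting step described above can be sketched as follows. This is a minimal illustration, not the paper's code; the inputs (parallel coordinate arrays plus a boolean treated flag per household) are a hypothetical data layout.

```python
import numpy as np

def treatment_intensity_grid(lons, lats, treated, cell=0.001):
    """Count treated households per grid cell.

    Each count can be read as the cash infusion into the cell in
    thousands of USD, since every transfer was roughly $1,000.
    The 0.001-degree default cell is about 100 m at the equator.
    Returns a dict {(col, row): number of treated households}.
    """
    ix = np.floor(np.asarray(lons) / cell).astype(int)
    iy = np.floor(np.asarray(lats) / cell).astype(int)
    grid = {}
    for x, y, t in zip(ix, iy, treated):
        if t:
            grid[(x, y)] = grid.get((x, y), 0) + 1
    return grid
```

A sparse dict keeps the sketch simple; a dense raster (e.g., `np.histogram2d`) would serve equally well for regression-ready output.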
The roofs are classified into three types (tin, thatched, and painted) based on their color profiles (Supplementary Figure S3). Compared to tin roofs, thatched roofs are generally of lower quality [15, 35]. (Painted roofs are relatively uncommon in the study area.) In prior work, roof reflectance and roof color have been shown to be good proxies for housing quality [30, 31]. As such, we aggregate the total building footprint to measure all housing assets (Figure 1f).

Estimating the Program Effects on Housing Quality.

We regress the remotely sensed outcomes on treatment intensity to estimate the causal effects of the GiveDirectly cash transfer. We choose a spatial resolution of 0.001° (approximately 100m), such that most grid cells contain 0-5 households. We exploit only the experimentally induced random variation in treatment intensity for identification, and account for pre-determined differences in program eligibility. Intuitively, consider two grid cells: one containing a household that received the transfer, and the other containing a household that was eligible for the transfer but did not receive it because it was randomized into the control group. With valid randomization [15], the differences in outcomes between the two can be attributed to the cash transfer. We plot the causal effects on night light and housing quality as cash infusion intensity increases (Figure 3, in color), without making assumptions on the structure of the effects. The results suggest that the effects grow linearly with the amount of cash infusion. We therefore also report an "average" effect, estimated under the assumption that each $1,000 transfer generates an effect of the same magnitude (Figure 3, panel subtitles).
We demonstrate the validity of the empirical strategy further by running 100 placebo simulations: we artificially generate placebo cash transfers that did not actually take place but are consistent with the original randomization design, and estimate their treatment effects (Figure 3, in gray). The resulting estimates are reassuringly centered around zero. We observe statistically significant and economically sizable effects on housing quality, on both the extensive margin (larger building footprint) (Figure 3a) and the intensive margin (higher quality roofs) (Figure 3b). On average, a $1,000 cash transfer significantly increased building footprint by 7.9 square meters (95% CI: [2.3, 13.…]). The estimated effect on night light, in contrast, is, if anything, negative. This may be because of low demand for electrification [38], or the poor sensitivity of night light in low-income, rural regions [6].

Recovering the Program Effects on Economic Well-being with Engel Curves.

We recover the program effects on household economic well-being with a canonical economic concept, the Engel curve. Engel curves describe how household expenditures on particular goods or services depend on households' economic well-being. For example, it is widely known that poorer families spend a larger share of their expenditure on food. Engel curves have long been used to infer economic well-being without needing detailed information on prices, as it is straightforward to measure how much of a household's expenditure is spent on food [21-24]. We adapt this concept to housing quality by exploiting the fact that someone who lives in a larger house is likely to be wealthier than someone who lives in a smaller house (Figure 4a). By the same logic, if we observe that someone's house size increased, then we can infer what level of wealth is associated with such a house size, as if they were moving up the Engel curve. Mathematically, the slope of the Engel curve represents the ratio between the change in house size and the change in wealth.
We divide the change in house size (Figure 3) by the slope of the Engel curve (Figure 4a) to infer the corresponding change in wealth (Figure 4b). Importantly, the validity of this approach depends on the assumption that the Engel curve does not shift in response to the treatment, which could happen due to relative price changes of the good or taste changes. In this study, we derive housing Engel curves from an endline survey of the GiveDirectly trial participants conducted between May 2016 and June 2017, which includes 4,578 geo-coded households who were eligible for the transfer. Of these households, only those assigned to the control group are used for the estimation. In Figure 4a, we show the relationship between survey-based measures of economic well-being (x-axis) and remotely sensed night light or housing quality measures (y-axis). The Engel curves are estimated with a linear regression (dotted lines). The non-linear fit with LOESS (solid lines) shows only small deviations from the linear regression line, and we cannot reject the null hypothesis that these Engel curves are linear (see Methods). The Engel curves are also roughly monotonically increasing, validating the choice of these variables as wealth proxies. The Engel curves can be derived from any geo-coded consumption and expenditure survey, as long as the surveyed households are (or can be re-weighted to be) representative of the sample in the preceding treatment effect estimation step. Notably, the sample does not necessarily have to include anyone who has received the treatment, opening up the possibility of using existing data sources (such as the Living Standards Measurement Study, LSMS) to estimate Engel curves. We demonstrate this by comparing the Engel curves derived from two distinct samples: households who were deemed eligible to receive the cash transfers (meaning that they used to live in thatched-roof houses), and households who were not.
While all the households live in the same area in western Kenya, the ineligible households are generally wealthier than the eligible ones. Their Engel curves, however, are similar within the same range of wealth (Supplementary Figure S8). We scale the program effects on each remotely sensed outcome by the Engel curve slope to estimate the impacts of the GiveDirectly transfer on household wealth, measured by aggregating the values of a variety of assets as recorded in household surveys. In Figure 4b, we compare the satellite-derived estimates against the survey-based estimates, which are computed from rich endline household survey data and taken from Table 1 in the original paper.

Why is the estimate based on tin-roof area much larger than the survey-based estimate? We argue this is due to the violation of a key assumption: the Engel curve used to estimate changes in wealth cannot change directly in response to the treatment, only through its wealth effects. To give intuition for why this matters, consider a program that directly gives people food. In such a case, we can no longer look at food consumption to infer program effects on economic well-being, because the relationship between food and income will be altered directly by the program, and households will "look" wealthier than they really are based on their food consumption. More relevant for impact evaluation using satellite data, this example is analogous to examining the impacts of a program that provides roads to a region. One would need to exclude the program roads themselves contained in satellite images and look at other correlates of welfare to estimate the impacts of such a roads program in an unbiased manner. In the GiveDirectly case, only households that lived in thatched-roof houses were eligible for the study. Households' usual consumption patterns of high-quality tin roofs might have been affected by this eligibility criterion.
One can observe that treatment households owned more tin-roof buildings than control households with the same amount of wealth (Supplementary Figure S7). This may have been a result of households interpreting the treatment as a "labelled" cash transfer [39]. These results highlight the importance of using interpretable proxies when evaluating programs with machine learning predictions.

An emerging literature is making great progress in mapping poverty with satellite imagery and machine learning, with high spatial granularity and at scale [6-13]. Typically, a machine learning model first learns the mapping between the input satellite images and ground truth labels of wealth or consumption expenditure, assembled from geo-coded household surveys. Then, the model generates predicted poverty maps for every region in the sample, including those with no survey coverage. The model implicitly combines and executes two tasks: (1) extracting semantically meaningful observations of, say, housing quality, agricultural productivity, or infrastructure from raw satellite images; and (2) inferring economic well-being from the consumption patterns of these private or public goods (similar to the Engel curve analysis in this study). While the flexibility of machine learning models helps improve predictive performance, the difficulty of interpretation makes it almost impossible to know or constrain which private or public goods are identified and utilized by the model. Since black-box machine learning models utilize as much information as possible from the input satellite images, it is very likely that the Engel curves of at least some of the observed goods will change (as with the tin-roof area variable in this study), introducing biases into the estimated program effects.
In this study, we disentangle the two tasks, so that the first task can be framed as a traditional object detection and segmentation task, allowing us to leverage extensive research in computer science, and the second task becomes more transparent and explicit, with testable assumptions (for example, via Supplementary Figure S7).

This paper provides compelling evidence that RCT program evaluations aimed at improving household welfare can be conducted solely on the basis of satellite imagery and deep learning methods. This approach has the advantage of being inexpensive and timely, suggesting great promise as a complement to, and in some cases a substitute for, in-person survey data collection methods. However, it bears noting that a fundamental limitation of evaluating programs based on satellite imagery is that, in order to be measurable from space, the programs being evaluated have to generate impacts on the built landscape. This prevents applicability to programs targeted at development challenges that are unlikely to impact the built environment, such as improved teaching methods at schools. Another limitation is that welfare is a household or individual concept, whereas satellite images capture characteristics about a place. Mapping household welfare to housing as we do here requires a tight link between structures and households through limited mobility. While migration rates are very low in the GiveDirectly study area [15], this may be a challenge for programs that impact mobility, such as transportation infrastructure programs.

Figure 2: Mapping treatment intensity and remotely sensed outcomes in the GiveDirectly study area in 2019. (a) Treatment intensity represents the number of households who received a $1,000 cash transfer from GiveDirectly. (b) Building footprint measures the total area covered by any building, shown as a percentage of the total area.
(c) Tin-roof area measures the total footprint of buildings with roofs made of tin (a high-quality construction material), shown as a percentage of the total area. (d) Night light is the average radiance in the Visible Infrared Imaging Radiometer Suite (VIIRS) Day/Night Band (DNB). In all panels, the gray lines outline the GiveDirectly study area in Siaya, Kenya. Grid cells without trial participants are omitted and shown in white. n = 2,501.

Constructing the Treatment Intensity Map.

To construct the treatment intensity map, we utilize data from a baseline census, which was conducted by the authors of the original paper in 2014-2015. The census identified all 65,385 households (roughly 280,000 people) residing in 653 villages in the study area, and recorded their GPS coordinates, whether each household was eligible for the GiveDirectly cash transfer, and whether they had been randomized into the treatment or control group [1]. To address measurement errors of the GPS collection devices, we discard 58 outliers (households recorded as living more than 2 kilometers away from their village centers) and impute these, along with 4 other missing GPS coordinates, with the village center coordinates. Then, we convert these household records into a raster map. We lay out a regular grid and count, in each grid cell, the number of households that ultimately received the GiveDirectly cash transfer (see Figure 1 and Figure 2a). Grid cells containing no eligible households are excluded. To account for pre-determined policy intensity differences, we record (and later control for) the number of households that were eligible for the cash transfer, regardless of whether they had been randomized into the treatment or control group.

Obtaining High-resolution Daytime Satellite Images.

We utilize high-resolution daytime satellite images from Google Static Maps [2].
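Returning to the census pre-processing in the treatment-intensity construction above, the outlier-and-imputation step can be sketched as follows. This is a minimal illustration with a hypothetical data layout (dicts with a `village` key and optional `coords`); it is not the paper's code.

```python
import math

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometers between two (lon, lat) points."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = p2 - p1
    dlon = math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def clean_coordinates(households, village_centers, max_km=2.0):
    """Replace missing coordinates, and coordinates more than max_km
    from the household's village center, with the center itself."""
    cleaned = []
    for hh in households:
        center = village_centers[hh["village"]]
        coords = hh.get("coords")
        if coords is None or haversine_km(*coords, *center) > max_km:
            coords = center
        cleaned.append({**hh, "coords": coords})
    return cleaned
```

Imputing with the village center, as in the text, keeps every censused household in the raster rather than dropping it.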
These images have a spatial resolution of about 30cm per pixel (at the equator), and contain only the RGB (red, green, blue) bands (see Figure 1d and Supplementary Figure S2 for examples). These images come from a variety of commercial providers, such as Maxar (formerly DigitalGlobe) and Airbus, and have been seamlessly mosaicked together. They have also been geo-referenced and pre-processed to remove clouds and address other data quality issues. Google does not provide the exact timestamps for these images, but we estimate their approximate acquisition dates.

Loosely speaking, the Mask R-CNN model operates as follows. First, the model proposes a large number of "regions of interest", each of which potentially contains a building. Then, the model uses convolutional filters to identify patterns within the proposed region that are indicative of the presence of buildings, such as sharp edges, highly reflective roofs, and building shadows. Finally, the model predicts whether each proposed region contains a building, as well as whether each pixel is occupied by the building.

We train the Mask R-CNN model with a multi-step process and a transfer learning framework, as described in greater detail in Supplementary Materials C. Publicly available building footprint datasets in rural and low-income regions are rare, and they often differ substantially in spatial resolution, sensor instrument, and landscape from the inference images (that is, the target images that the model will make predictions for). Relying solely on publicly available training data is therefore insufficient for achieving satisfactory predictive performance. We curate a set of in-sample annotations by randomly sampling 120 images from all the Google Static Maps images in the study area, and manually creating high-quality building footprint annotations for them.
We pre-train the Mask R-CNN model on large, publicly available datasets such as COCO (Common Objects in Context) and Open AI Tanzania, and fine-tune it on this set of in-sample annotations. The model predictions are highly accurate. The overall F1 score (a standard performance metric for instance segmentation) on a random subset of inference images is 0.79 (Supplementary Figure S1). The F1 score is the harmonic mean of precision (the proportion of model-identified buildings that are actual buildings) and recall (the proportion of actual buildings that are correctly identified by the model). Here, a building is deemed correctly identified if the predicted pixel mask and the ground truth pixel mask overlap sufficiently (more precisely, if the intersection of the two masks is more than 50% of their union). As a reference point, the winner of the 2nd SpaceNet building footprint extraction competition reported an F1 score of 0.69 [4]. This suggests that the Mask R-CNN model used in this study performs well, although building footprint segmentation in rural, less complex scenes is generally easier than in modern cities, so these metrics are not directly comparable.

We post-process the model-predicted pixel masks by converting them to polygons, and simplifying the polygons with the Douglas-Peucker algorithm with a pixel tolerance of 3. For each polygon, we compute two housing quality metrics: building footprint and roof material type. We then lay out a regular grid, assign each building to a grid cell based on the centroid of its polygon, and aggregate to obtain two metrics at the pixel level: building footprint (Figure 2b) and tin-roof area (Figure 2c). First, we measure the size of each building polygon and convert it to square meters. We correct for area distortion, which is induced by the Web Mercator projection system that Google Static Maps uses.
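The pixel-to-area conversion and its latitude correction can be sketched as follows. This is an illustrative sketch using the standard Web Mercator ground-resolution formula (256-pixel tiles); the zoom level of 19, which yields roughly 30 cm per pixel at the equator as stated above, is our assumption, not a detail confirmed by the text.

```python
import math

EARTH_CIRCUMFERENCE_M = 40075016.686  # meters at the equator

def pixel_area_m2(lat_deg, zoom=19):
    """Ground area (m^2) covered by one Web Mercator pixel.

    Web Mercator stretches distances by 1/cos(latitude), so apparent
    areas are inflated by 1/cos^2(latitude); scaling the per-pixel
    ground resolution by cos(latitude) corrects for this distortion.
    """
    res = EARTH_CIRCUMFERENCE_M * math.cos(math.radians(lat_deg)) / (256 * 2 ** zoom)
    return res ** 2

def footprint_m2(n_pixels, lat_deg, zoom=19):
    """Convert a building's pixel count to square meters."""
    return n_pixels * pixel_area_m2(lat_deg, zoom)
```

Near the equator (as in Siaya) the correction is tiny, but applying it keeps footprints comparable across latitudes.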
This metric may appear larger than what one expects for the size of homes in a low-income context (Figure 4), because (1) it represents the footprint of the entire building, which is typically larger than the livable area; and (2) it accounts for both residential and non-residential structures, since the model is not able to distinguish between the two. Second, we estimate the type of roof material based on the color of the roof, and compute the footprint of tin-roof buildings in each grid cell. For each building, we take all the pixels associated with the given building instance, and assign a "representative" roof color by computing the average values in the RGB (red, green, blue) channels. Since Euclidean distances between color vectors in the RGB color space do not reflect perceptual differences, we project all the RGB color vectors into the CIELAB color space, and cluster these roof color vectors into 8 groups by running the K-means clustering algorithm. We further classify these 8 groups into three types of roof materials: tin roof, thatched roof, and painted roof (Supplementary Figure S3), and compute the total footprint of tin-roof buildings.

Obtaining the Night Light Data.

To measure nighttime luminosity, we use the Visible Infrared Imaging Radiometer Suite Day/Night Band (VIIRS-DNB) data product, accessed through Google Earth Engine [5, 6]. The VIIRS-DNB data product excludes areas impacted by cloud cover and corrects for stray light [7]. However, it has not been filtered to screen out lights from aurora, fires, boats, and other temporal lights, and lights are not separated from background (non-light) values [5]. This data product has a native spatial resolution of 15 arc-seconds (approximately 463 meters at the equator), and we resample the data by nearest neighbor interpolation when necessary. We average over all the monthly observations in 2019 to construct a single cross-sectional observation, reducing seasonality effects and maintaining consistency with the daytime satellite imagery (Figure 2d).
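The roof-color clustering step described above can be sketched as follows. Note two simplifications relative to the text: the paper clusters in the perceptually uniform CIELAB space (in practice one would first convert, e.g., with `skimage.color.rgb2lab`), whereas this hand-rolled Lloyd's-algorithm sketch clusters raw RGB vectors for brevity, and the grouping of clusters into tin/thatched/painted is done manually in the paper, not by the code.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Minimal K-means (Lloyd's algorithm) for grouping roof colors.

    points : array-like of color vectors (one per building)
    Returns (labels, cluster centers).
    """
    rng = np.random.default_rng(seed)
    pts = np.asarray(points, dtype=float)
    centers = pts[rng.choice(len(pts), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest center
        dists = np.linalg.norm(pts[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute centers (keep old center if a cluster goes empty)
        new_centers = np.array([
            pts[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

The per-building "representative" color fed into this step is simply the mean of the building's pixels in each channel, as described in the text.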
The VIIRS-DNB data product is considered superior to the more widely used night light data, DMSP-OLS (the United States Air Force Defense Meteorological Satellite Program Operational Linescan System), because it preserves finer spatial details, has a lower detection limit, and displays no saturation over bright lights [8]. This ensures that we conduct a fair comparison with the most modern and high-quality night light data product.

Estimating the Program Effects on Housing Quality.

The main econometric specification for Figure 3 regresses each remotely sensed outcome in a grid cell (night light, building footprint, or tin-roof area) on indicators for the number of recipient households in that cell (Equation 1). To account for pre-existing differences in population density or wealth, which may cause non-random variation in treatment intensity, we flexibly control for the number of eligible households per grid cell, and exclude grid cells with no eligible households. Because the grid cells are fairly small and the number of observations for k > 2 is small, we bin the number of recipient households into k ∈ K = {0, 1, 2, 2+} to preserve statistical power. Standard errors are calculated à la Conley, with a uniform kernel and a 3km cutoff [9-12]. To reduce the effects of outliers (due to sensor malfunction or machine learning model prediction errors), we winsorize all remotely sensed variables at the 99th percentile.

We run 100 placebo simulations to further demonstrate the validity of the main specification. In each simulation, we randomly assign half of the 68 groups of villages to the high-saturation group, and the other half to the low-saturation group. In the high-saturation groups, we randomly assign 2/3 of the villages to the treatment group (and the rest to the control group), whereas in the low-saturation groups, we assign only 1/3 of the villages to the treatment group (and the rest to the control group). This mimics the two-tier randomization scheme of the original trial [1].
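One placebo draw under the two-tier scheme just described can be sketched as follows. This is a minimal illustration of the randomization logic only (the subsequent re-estimation step is omitted), with village groups passed as lists of village identifiers.

```python
import random

def placebo_assignment(village_groups, seed):
    """One placebo draw mimicking the trial's two-tier randomization:
    half of the village groups are assigned to high saturation (2/3 of
    their villages treated), the other half to low saturation (1/3).

    village_groups : list of lists of village identifiers
    Returns the set of placebo-treated villages.
    """
    rng = random.Random(seed)
    groups = list(village_groups)
    rng.shuffle(groups)  # tier 1: split groups into high/low saturation
    half = len(groups) // 2
    treated = set()
    for tier, share in ((groups[:half], 2 / 3), (groups[half:], 1 / 3)):
        for villages in tier:  # tier 2: treat a share of each group
            vs = list(villages)
            rng.shuffle(vs)
            treated.update(vs[:round(len(vs) * share)])
    return treated
```

Repeating this for 100 seeds and re-running the main specification on each draw yields the gray placebo estimates in Figure 3.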
Using these simulated placebo treatment status variables, we estimate the placebo treatment effects with the econometric specification described in Equation 1. To compute a single pooled treatment effect, we assume linear treatment effects: every transfer of $1,000 has an effect of the same magnitude, regardless of the treatment intensity in that geographical area. The resulting econometric specification (Equation 2) replaces the treatment intensity bins with the amount of cash transferred, where τ is the "average" treatment effect, and all else remains the same as in Equation 1. We conduct two-sided t-tests to assess statistical significance.

Estimating the Engel Curves.

An Engel curve describes how household expenditure on a particular good varies with income, a relationship that can be used to infer households' economic well-being from the consumption patterns of a limited subset of goods [13-16]. The mathematical formulation is

    Q_hp = F_p(W_h),    (3)

where household h with wealth W_h (or another measure of economic well-being) consumes quantity Q_hp of a normal good p, and F_p(·) represents the Engel curve for product p in the population. With a linearity assumption, this simplifies to

    Q_hp = α_p + β_p · W_h,    (4)

where α_p is the intercept and β_p is the slope of a linear Engel curve. In this study, we estimate the Engel curves, that is, the relationships between remotely sensed metrics and survey-based measures of economic well-being (excluding land values, because they are difficult to value given thin local markets [1]). We perform heuristic matching between the buildings and the household survey GPS coordinates, to link variables in the survey with remotely sensed variables. First, we take the baseline census data, which geo-coded every household living in the study area, and assign every building in the satellite images to its closest census GPS coordinate, if the distance between the two is within 250m. This ensures that every building is matched to at most one household. Second, we match GPS coordinates from the survey with GPS coordinates from the census.
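A brute-force sketch of this nearest-coordinate matching, with the 250 m cutoff described above, might look as follows. The data layout (lists of (lon, lat) tuples) is hypothetical, and the flat-Earth distance approximation is only valid for the short distances involved here.

```python
import math

def dist_m(p, q):
    """Approximate ground distance in meters between two (lon, lat)
    points; an equirectangular approximation, fine for sub-kilometer
    distances near the equator."""
    dx = (p[0] - q[0]) * 111320 * math.cos(math.radians((p[1] + q[1]) / 2))
    dy = (p[1] - q[1]) * 111320
    return math.hypot(dx, dy)

def match_to_census(points, census_points, max_m=250.0):
    """For each point, return the index of its nearest census
    coordinate, or None if no census coordinate is within max_m."""
    matches = []
    for p in points:
        d, idx = min((dist_m(p, q), i) for i, q in enumerate(census_points))
        matches.append(idx if d <= max_m else None)
    return matches
```

For the actual sample sizes (tens of thousands of points), one would replace the brute-force scan with a spatial index such as a k-d tree, but the matching rule is the same.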
While the same household should in principle have the same geo-location in both datasets, the two coordinates often differ because of measurement errors of the GPS collection devices, and because the coordinates might be recorded anywhere on the participants' plots, not necessarily at their primary residence. We similarly assign each survey GPS coordinate to its closest census GPS coordinate, if the distance between the two is within 250m. In cases where multiple surveys are assigned to the same census coordinate, we keep only one match. We fit the Engel curves both non-parametrically with LOESS (the solid lines in Figure 4a) and linearly (see Equation 4 and the dotted lines in Figure 4a). When fitting LOESS, we allow for locally-fitted quadratic polynomials, and use 75% of the data points for each fit. We test for non-linearity of the Engel curves in a separate procedure. We first run a linear regression, take the residuals, and fit the residuals with a natural (cubic) spline with 5 knots. We then conduct a two-sided F-test on the coefficients of the natural spline basis, and reject the null hypothesis (linearity) if these coefficients are jointly significant. We cannot reject linearity for any of the three proxies in Figure 4. To reduce the influence of outliers, we winsorize annual expenditure, housing assets, non-housing assets, and total assets at the 1st and 99th percentiles of the eligible and non-eligible samples, respectively. We winsorize at the 1st percentile because outliers with a large amount of debt exist and could otherwise drive the results. We similarly winsorize all the remotely sensed variables at the 99th percentile for the eligible and non-eligible samples. We exclude a small number of renters who do not own any housing assets (31 treatment households, 32 control households, and 55 ineligible households), to simplify the interpretation of the Engel curves.

Recovering the Program Effects on Economic Well-being.

We adapt a prior mathematical formulation that uses the Engel curve to infer changes in economic well-being [15].
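The core of this step, estimating an Engel slope on control households and scaling a remotely sensed treatment effect by it, can be sketched as follows. This is a simplified illustration: the delta-method standard error treats the two estimates as independent (plausible here since they come from different samples, but an assumption on our part).

```python
import math
import numpy as np

def engel_slope(wealth, outcome):
    """OLS fit of the linear Engel curve Q = alpha + beta * W
    (to be estimated on control households only, per the text);
    returns beta and its standard error."""
    W = np.asarray(wealth, dtype=float)
    Q = np.asarray(outcome, dtype=float)
    X = np.column_stack([np.ones_like(W), W])
    coef, *_ = np.linalg.lstsq(X, Q, rcond=None)
    resid = Q - X @ coef
    sigma2 = resid @ resid / (len(W) - 2)        # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)        # OLS covariance matrix
    return coef[1], math.sqrt(cov[1, 1])

def wealth_effect(tau_q, se_tau_q, beta, se_beta):
    """Scale the treatment effect on a remotely sensed outcome by the
    Engel slope: tau_W = tau_Q / beta, with a delta-method SE,
        Var(tau_W) ~ Var(tau_Q)/beta^2 + tau_Q^2 * Var(beta)/beta^4,
    omitting the covariance term (independent samples)."""
    tau_w = tau_q / beta
    var = se_tau_q ** 2 / beta ** 2 + tau_q ** 2 * se_beta ** 2 / beta ** 4
    return tau_w, math.sqrt(var)
```

The second term of the variance is what "additionally accounts for the precision of the slope of the Engel curve"; setting `se_beta = 0` recovers the simpler scaling that treats the slope as known.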
Suppose that one is interested in studying the effect of a plausibly exogenous treatment $Z$ on, say, wealth $W$ (denoted $\tau_W$), but can only inexpensively observe its effect on the consumption of product $p$ (denoted $\hat{\tau}_{Q_p}$). Recall that $\hat{\beta}_p$ is the estimated slope of the linear Engel curve in Equation 4; then

$$\hat{\tau}_W = \frac{\hat{\tau}_{Q_p}}{\hat{\beta}_p}. \tag{5}$$

Using a formula for propagation of error (the multivariate delta method), one can derive the standard error for $\hat{\tau}_W$:

$$SE(\hat{\tau}_W) = \sqrt{\frac{SE(\hat{\tau}_{Q_p})^2}{\hat{\beta}_p^2} + \frac{\hat{\tau}_{Q_p}^2}{\hat{\beta}_p^4}\, SE(\hat{\beta}_p)^2}. \tag{6}$$

This derivation is based on prior work [15], but additionally accounts for the precision of the slope of the Engel curve. A key assumption of this approach is that $\hat{\beta}_p$ does not depend on $Z$; that is, the Engel curve does not change in direct response to the treatment. This is also termed the conditional independence assumption [14]. We estimate the treatment effects on wealth (or other measures of economic well-being) according to Equation 5 and Equation 6, using the treatment effect estimates for remotely sensed variables and the slopes of the Engel curves. We compare the satellite-derived estimates against the survey-based estimates, taken from Table 1, Column 1 in the original paper [1], which were based on the endline household survey data (Figure 4b).

This paper makes use of restricted-access data, which contain personally identifying information of survey participants. Satellite images used in the analyses come from the Google Static Maps API at https://developers.google.com/maps/documentation/maps-static/overview, and redistribution is not possible. However, de-identified data necessary to reproduce all the figures and statistical analyses are freely available at https://github.com/luna983/beyond-nightlight. All code is available at https://github.com/luna983/beyond-nightlight.

Figure S1: The precision-recall curve of the Mask R-CNN model shows satisfactory predictive performance. The Mask R-CNN model is trained and evaluated with 3-fold cross-validation.
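The ratio in Equation 5 and the delta-method standard error in Equation 6 amount to a one-line rescaling plus a variance formula. A minimal sketch (`wealth_effect` is a hypothetical helper; the variance formula assumes the proxy-effect and slope estimates are independent):

```python
import math

def wealth_effect(tau_q, se_tau_q, beta, se_beta):
    """Rescale a treatment effect on a remotely sensed proxy Q into a
    wealth effect via the Engel-curve slope: tau_W = tau_Q / beta.
    The delta-method variance also reflects uncertainty in beta."""
    tau_w = tau_q / beta
    var = (se_tau_q ** 2) / beta ** 2 + (tau_q ** 2) * (se_beta ** 2) / beta ** 4
    return tau_w, math.sqrt(var)

# example: effect of 10 units on the proxy, Engel slope 0.5
tau_w, se = wealth_effect(10.0, 2.0, 0.5, 0.0)   # slope treated as known
_, se_wide = wealth_effect(10.0, 2.0, 0.5, 0.1)  # slope estimated with error
```

When the slope's standard error is set to zero, the formula collapses to the simpler rescaling used in prior work [15]; a positive `se_beta` widens the confidence interval, as intended.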
The evaluation is based on 120 annotated images, randomly sampled from all the input satellite images in Siaya, Kenya. The Mask R-CNN model outputs a confidence score for every predicted building instance, and the precision-recall curve is generated by varying the confidence score threshold below which predicted instances are dropped. A higher threshold makes the model more conservative and corresponds to the left portion of the curve (with high precision and low recall), and vice versa. The dot represents the optimal confidence score threshold, obtained by maximizing F1, the harmonic mean of precision and recall. The main model used in this study employs the optimal threshold, and has a recall of 0.79 and a precision of 0.80.

Figure S2: Ten randomly sampled pairs of input images and deep learning predictions. Ten images are randomly sampled from all the input satellite images in the GiveDirectly study area. Each predicted building is outlined in white and filled with the "representative" roof color.

Figure S3: The distribution and grouping of roof colors. All the buildings in the GiveDirectly study area are split into eight groups by a K-means clustering algorithm, based on their roof colors. The color block on the left represents the "average" roof color of the cluster, and the color blocks on the right represent a random subset of all the roof colors in the given cluster. The number of color blocks on the right is proportional to the size of the cluster. The eight groups are further grouped into tin roof, thatched roof, and painted roof.

We estimate that our evaluation approach costs $0. Thatched-roof houses tend to be harder to identify for human annotators than metal-roof houses, because they are typically smaller, less reflective, and may resemble trees in the overhead imagery. We use the Mask R-CNN model [2] for instance segmentation of buildings on satellite images.
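The threshold selection behind Figure S1 can be illustrated as follows: given predicted instances with confidence scores and an indicator for whether each matched a ground-truth building, sweep candidate thresholds and keep the one maximizing F1. This is a toy sketch (`best_f1_threshold` is a hypothetical name; the real evaluation matches predicted and ground-truth instances by overlap):

```python
def best_f1_threshold(scored_preds, n_true):
    """Pick the confidence threshold maximizing F1.

    scored_preds: list of (confidence, is_true_positive) pairs;
    n_true: number of ground-truth buildings in the evaluation set."""
    best_t, best_f1 = None, -1.0
    for t in sorted({s for s, _ in scored_preds}):
        kept = [(s, ok) for s, ok in scored_preds if s >= t]
        tp = sum(ok for _, ok in kept)
        if tp == 0:
            continue
        precision = tp / len(kept)
        recall = tp / n_true
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

# toy example: four predictions, three ground-truth buildings
t_opt, f1_opt = best_f1_threshold(
    [(0.9, True), (0.8, True), (0.6, False), (0.3, True)], n_true=3
)
```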
The backbone architecture is ResNet50 with Feature Pyramid Networks (FPN). The model is trained with a learning rate of 5 × 10⁻⁴ and a batch size of 10, and optimization is conducted with the Adam optimizer. We implement the deep learning pipeline in Python and PyTorch; in particular, we use the official Torchvision implementation of Mask R-CNN. We train the Mask R-CNN model in a transfer learning framework, with a multi-step process as follows. The model is first pre-trained on the COCO (Common Objects in Context) data set, a large-scale natural image data set containing 80 object categories and around 1.5 million object instances [3]. Although the input images and object categories in COCO differ from our target satellite images, pre-training on a large-scale dataset often provides meaningful performance gains, even when the model is later transferred across domains. The model is then fine-tuned on the Open AI Tanzania building footprint segmentation data set, a collection of high-resolution aerial imagery collected by consumer drones in Zanzibar, Tanzania [4]. These images are representative of rural and peri-urban scenes in a developing country context, in terms of the distribution of the density, sizes, and heights of the buildings. All the buildings in the drone images are identified, outlined, and classified into three categories (completed building, unfinished building, and foundation) by human annotators. This somewhat unusual categorization reflects the large number of unfinished structures in Zanzibar. Most input satellite images in this study contain very few unfinished structures, so we collapse the first two categories into one and drop the third. The native resolution of the drone images is 7cm, and we down-sample the images to about 30cm to match the resolution of the target satellite images.
At training time, 90% of the data are used for training and the remaining 10% for validation, in order to guard against overfitting and to select the best model. Finally, the model is fine-tuned on a set of 120 in-sample annotated images in Siaya, Kenya (see Section C.1 for details). This ensures that training images and inference images belong to the same data distribution. The model is trained on 90% of the images for 25 epochs and evaluated on the 10% held-out set. We keep the best-performing model (at epoch 15). This is the main model used for conducting inference on input satellite images in the GiveDirectly study area.

Throughout the training process, we conduct extensive data augmentation to increase the transferability of the model from one dataset to another. We randomly flip the training images horizontally and vertically, and randomly jitter the brightness, contrast, saturation, and hue of the images. For the Open AI Tanzania dataset, we also randomly blur and crop the images.

We provide additional validation results in rural Mexico, using the 2010 Population and Housing Census [5]. Population count in a rural village (as reported in the 2010 census) is highly correlated with the number of houses in that village (as identified by the deep learning model), with a Pearson correlation coefficient of 0.82 (Supplementary Figure S9b). Population count, however, is only modestly correlated with night light (Supplementary Figure S9a). Night light is less sensitive in smaller, less populated villages, a finding that is consistent with prior work [6]. This comparison is based on the locality-level data set, Principales Resultados por Localidad.

References
- The analysis of household surveys: A microeconometric approach to development policy.
- Poor economics: A radical rethinking of the way to fight global poverty.
- Development Impact Evaluations: State of Play and New Challenges. Tech. rep., Agence Française de Développement.
- Better to be indirect? Testing the accuracy and cost-savings of indirect surveys.
- Social Protection Amidst Social Upheaval: Examining the Impact of a Multi-Faceted Program for Ultra-Poor Households in Yemen. Tech. rep.
- Combining satellite imagery and machine learning to predict poverty.
- Fighting poverty with data.
- Poverty from space: Using high-resolution satellite imagery for estimating economic well-being. Tech. rep.
- Poverty Mapping Using Convolutional Neural Networks Trained on High and Medium Resolution Satellite Images, With an Application in Mexico. In Proceedings of the NIPS 2017 Workshop on Machine Learning for the Developing World.
- Socioecologically informed use of remote sensing data to predict rural household poverty.
- Using publicly available satellite imagery and deep learning to understand economic well-being in Africa.
- Targeting Development Aid with Machine Learning and Mobile Phone Data: Evidence from an Anti-Poverty Intervention in Afghanistan. Unpublished.
- Machine learning can help get COVID-19 aid to those who need it most.

Supplementary References
- Google Static Maps. Maps Static API Usage and Billing.
- Proceedings of the IEEE International Conference on Computer Vision.
- Population and Housing Census of Mexico.
- Combining satellite imagery and machine learning to predict poverty.

The GiveDirectly randomized controlled trial and field survey received IRB approval from Maseno University and the University of California, Berkeley. The AEA Trial Registry RCT ID is AEARCTR-0000505. Informed consent was obtained from all human research participants. The authors declare no conflicts of interest. Supplementary Information is available for this paper. Correspondence and requests for materials should be addressed to Luna Yue Huang (yue huang@berkeley.edu). Reprints and permissions information are available at www.nature.com/reprints.