key: cord-0439912-snkdgpym
authors: Ackermann, Klaus; Chernikov, Alexey; Anantharama, Nandini; Zaman, Miethy; Raschky, Paul A
title: Object Recognition for Economic Development from Daytime Satellite Imagery
date: 2020-09-11
journal: nan
DOI: nan
sha: 142034ba5f25025256a429f86aa043c4d91f65b2
doc_id: 439912
cord_uid: snkdgpym

Reliable data about the stock of physical capital and infrastructure in developing countries is typically very scarce. This is particularly a problem for data at the subnational level, where existing data is often outdated, not consistently measured, or incomplete in coverage. Traditional data collection methods are time- and labor-intensive and costly, which often prohibits developing countries from collecting this type of data. This paper proposes a novel method to extract infrastructure features from high-resolution satellite images. We collected high-resolution satellite images for 5 million 1 km $\times$ 1 km grid cells covering 21 African countries. We contribute to the growing body of literature in this area by training our machine learning algorithm on ground-truth data. We show that our approach strongly improves the predictive accuracy. Our methodology can build the foundation to then predict subnational indicators of economic development for areas where this data is either missing or unreliable.

The efficient allocation of limited funds by local governments and international aid organizations crucially depends on reliable information about the level of socioeconomic indicators. These indicators (e.g. income, education, physical infrastructure, social class) are critical inputs for researchers and policy-makers who address socioeconomic issues. Although data availability and quality for developing countries have been improving in recent years, consistently measured and reliable data is still relatively scarce. Numerous studies have documented the problems of aggregate economic accounts, in particular for Africa, where the data suffers from various conceptual problems, measurement biases, and other errors (e.g. Chen and Nordhaus 2011; Johnson et al. 2013; Jerven and Johnston 2015).

Researchers have probed into alternative options in the absence of reliable official statistics. Among this newer generation of alternative economic data research, a burgeoning literature has emerged that uses satellite imagery of nighttime luminosity as a proxy for economic activity. Work by Sutton and Costanza (2002), Elvidge et al. (2009), Chen and Nordhaus (2011), Henderson, Storeygard, and Weil (2012), Sutton, Elvidge, and Tilottama (2007) and Hodler and Raschky (2014) documents a strong relationship between nighttime luminosity and gross domestic product (GDP) at the national and subnational levels. This allows researchers to generate information for any level of regional analysis, and the likelihood of strategic human manipulation is limited with satellite-generated data. However, luminosity data as a proxy for economic activity is not free from concerns. Satellite sensors have a lower detection bound, and nighttime light emissions below this bound are not captured by the satellites' readings. This leads to a bottom-coding problem, which is particularly an issue in low-output and low-density regions (Chen and Nordhaus 2011). These are very often the regions and countries (e.g. in Africa) where official macroeconomic data is missing or unreliable as well.

Over the past few decades, some parts of the African continent have witnessed large increases in economic development. Nevertheless, the majority of regions within African nations still lag behind. The continent faces further challenges due to localized conflicts (Berman et al. 2017), rapid urbanization (Moyo et al. 2020) and the impacts of the COVID-19 pandemic, among others. A key prerequisite for formulating adequate strategies to address these challenges is reliable socioeconomic data at a spatially granular level. As of now, even data about basic infrastructure such as roads and buildings is not consistently collected across the African continent.

The purpose of this project is to overcome this data problem by applying machine learning and artificial intelligence tools to a vast amount of unstructured data from daytime satellite imagery. Ultimately, this project aims to go beyond the use of nightlight luminosity as a proxy for economic development data and use high-resolution daytime satellite imagery to predict key infrastructure variables at national and subnational levels for less developed countries, such as those in Africa. Daytime images contain more information about the landscape that is correlated with economic activity, but the images are highly complex and unstructured, making the extraction of meaningful information from them rather difficult. Our approach builds upon and further expands the work of Jean et al. (2016a). The standard approach in the literature is to learn a representation from satellite images that allows an interpretation of the pixel activations that are important for predicting night-time light or another target. This representation is then used to predict an aggregated wealth index. Instead, we directly predict infrastructure measures on the ground, while acknowledging the widespread scarcity of ground truth data.

Existing solutions for policy makers in developing countries often rely on traditional data gathering processes (i.e. surveys), which are costly and infrequent. Given the high costs, this data does not cover an entire country but only a subsample of geographic units. Our solution provides a low-cost method to collect valuable insights about economic development for every location in a country. Our methodology provides relevant decision makers in developing countries, as well as NGOs and international organizations, with very accurate counts of buildings and the length of roads for an entire country and continent. For example, accurate building counts and density can be used in natural hazard preparedness tools as an indicator of an area's vulnerability to natural disasters. Information about roads and settlements helps infrastructure agencies to quickly identify areas that lack market access, a key determinant of economic growth in developing countries.

Although relatively new, recent studies have begun to use different daytime satellite images to conduct novel economic research (Donaldson and Storeygard 2016). Daytime images contain more information than night-time images and are thus a good alternative data source for empirical economics. Marx, Stoker, and Suri (2013) used daytime images to analyse the effects of investment on housing quality in the slums of Kibera, Kenya. Investment was calculated based on the age of a household's roof. The results showed that ethnicity plays an important role in determining investment in housing, and that belonging to the same tribe as the local chief has a positive effect on household investment. Engstrom et al. (2017) used daytime satellite imagery and survey data to estimate the poverty rates of 3,500 km² subnational areas in Sri Lanka. Using a convolutional neural network algorithm, they identified object features from raw images that were predictive of poverty estimates. The features examined by the study included built-up areas (buildings), cars, roof types, roads, railroads and different types of agriculture. The results showed that built-up areas, roads and roofing materials had strong effects on poverty rates. A suite of related work has used satellite images to predict population density (e.g. Simonyan and Zisserman 2015; Doupe et al. 2016), urban sprawl (Burchfield et al. 2006), urban markets (Baragwanath et al. 2019), electricity usage (Robinson, Hohmans, and Dilkina 2017), as well as income levels (Pandey, Agarwal, and Krishna 2018). More broadly, we also relate to the growing body of literature that uses other passively collected data to measure local economic activity (e.g. Abelson, Varshney, and Sun 2014; Blumenstock, Cadamuro, and On 2015; Chen and Nordhaus 2011; Henderson, Storeygard, and Weil 2012; Hodler and Raschky 2014). Methodologically, our paper contributes to the large remote-sensing literature that applies high-dimensional techniques to extract features from satellite imagery (e.g. Jean et al. 2016b, 2019; Yeh et al. 2020; Ronneberger, Fischer, and Brox 2015).

In general, reliable data at a more granular spatial level is very scarce for the African continent. This poses a particular challenge if the researcher wants to apply machine learning tools that require some form of ground truth data.

To overcome this problem, we accessed data from two open-data sources. The first one is Open Street Map (OSM), a collaborative project allowing volunteers around the world to contribute georeferenced information in an open-source GIS. We utilized http://download.geofabrik.de/ to retrieve a complete snapshot of all geo-located objects in Africa in 2018. In general, OSM coverage for Africa is very sparse and often non-existent outside urban areas. Our strategy to mitigate this issue was to build an iterative procedure that would help us select areas (1 km × 1 km) with good OSM coverage. We were then able to convert the geometric OSM data into an image mask.

Our image data was collected in 2018 via the Google Maps API, following the exact procedure of Jean et al. (2016b). This data set has been used in various studies (e.g. Jean et al. 2019; Sheehan et al. 2019; Uzkent et al. 2019; Oshri et al. 2018). Again, the same pattern as with the OSM data emerges: the image quality of these freely available African images is not as good as in other places around the world (see figure 1).

In the absence of reliable ground truth data, we selected the architecture based on data that we could make resemble data from our target domain. For buildings, we employ imagery collected by drones in Africa from the "Open Cities AI Challenge: Segmenting Buildings for Disaster Resilience", for which the corresponding ground truth data is provided, and we re-scaled and blurred the drone imagery. For roads, we build a model to select images with almost complete masks, albeit with some missing roads and errors.

We benchmark our proposed methodology against the latest publication of poverty predictions in Africa, using their provided wealth index based on DHS cluster data (Yeh et al. 2020). As is common in this literature, an index is created with a principal component analysis (PCA) from survey respondents. Again, due to data limitations, research in this area has only ever performed an in-sample validation. A true out-of-sample comparison would require a strict separation between the train and test sets, something that is not possible if the PCA is calculated over all data points across all countries, which inflates the prediction results. For this reason, Yeh et al. (2020) also provided an index that is based on within-country survey respondents. This enables us to benchmark against both indices.

In principle, we follow the outline of the well-known U-Net architecture for medical images (Ronneberger, Fischer, and Brox 2015) and modify it for satellite images, creating a Satellite-U-Net (Sat-Unet). Figure 2 provides a general overview of our approach. The network contains 61 layers in total, with 11 major blocks of 3 types: the convolution/down-sampling block, the intermediate convolution block and the de-convolution/up-sampling block. The convolution block, shown in figure 3, consists of a batch normalization layer, two convolution layers with a (3,3) kernel and a dropout layer. The dropout layer is not used in the first down-sampling block. The number of filters in the down-sampling blocks (encoder part) starts at 32 and doubles in each following block, reaching 1024 in the intermediate convolution block, and then decreases in the up-sampling blocks (decoder) by a factor of 0.5 per block. The core difference from Ronneberger, Fischer, and Brox (2015) is that instead of up-sampling layers, we use transposed convolution layers, which perform the reverse convolution operation (Dumoulin and Visin 2016). In addition, we added dropout layers after each convolution and de-convolution block.
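The filter counts described above can be checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' code: the function name `filter_schedule` is hypothetical, and it assumes that each down-sampling block exactly doubles the filter count and each up-sampling block exactly halves it, as the text states.

```python
def filter_schedule(first=32, bottleneck=1024):
    """Filter counts per block for a U-Net-style encoder/decoder.

    Assumption: filters double in every down-sampling block until the
    intermediate (bottleneck) block, then halve in every up-sampling block.
    """
    encoder = []
    filters = first
    while filters < bottleneck:
        encoder.append(filters)
        filters *= 2

    decoder = []
    filters = bottleneck
    for _ in encoder:  # one up-sampling block per down-sampling block
        filters //= 2
        decoder.append(filters)

    return encoder, bottleneck, decoder


encoder, middle, decoder = filter_schedule()
n_blocks = len(encoder) + 1 + len(decoder)
# If every down-sampling block halves the spatial size (an assumption),
# the total spatial reduction factor is 2 ** len(encoder).
reduction = 2 ** len(encoder)
```

Under these assumptions the schedule has five encoder blocks, one intermediate block and five decoder blocks, which agrees with the 11 major blocks reported above.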

In a 400x400 image, the number of zero-class pixels (non-classified space) is up to $10^4$ times higher than the number of pixels belonging to a house or a road (class one), creating a severe class imbalance. We address this issue with a hybrid loss function that combines the binary cross-entropy with the Sørensen-Dice coefficient. As the evaluation metric, we used the intersection over union (the Jaccard index).
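To make the hybrid loss concrete, here is a minimal pure-Python sketch on flattened masks. It uses the standard textbook definitions of binary cross-entropy, the Sørensen-Dice coefficient and the Jaccard index. How exactly the two terms are combined is not specified in the text, so the form `BCE + (1 - Dice)`, the smoothing constant and the 0.5 binarization threshold are assumptions.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over all pixels."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(y_true)


def dice_coefficient(y_true, y_pred, smooth=1.0):
    """Soft Sorensen-Dice coefficient: 2|A ∩ B| / (|A| + |B|)."""
    intersection = sum(t * p for t, p in zip(y_true, y_pred))
    return (2.0 * intersection + smooth) / (sum(y_true) + sum(y_pred) + smooth)


def hybrid_loss(y_true, y_pred):
    """Assumed combination: cross-entropy plus (1 - Dice)."""
    return binary_cross_entropy(y_true, y_pred) + (1.0 - dice_coefficient(y_true, y_pred))


def jaccard_index(y_true, y_pred, threshold=0.5):
    """Intersection over union after binarizing the prediction."""
    binary = [1 if p >= threshold else 0 for p in y_pred]
    intersection = sum(1 for t, b in zip(y_true, binary) if t and b)
    union = sum(1 for t, b in zip(y_true, binary) if t or b)
    return intersection / union if union else 1.0
```

The Dice term is insensitive to the large number of true-negative background pixels, which is why it is often paired with cross-entropy when the positive class is rare.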

Data pre-processing

The input images are 400x400 pixels. To make the input size divisible by 32, as required by the shape reduction coefficients of the network, we added a padding of 8 pixels. On average across our images, the maximum RGB color value was around 180-190 out of a possible 255. We therefore re-scaled the color channels to intensify the colors before feeding the image into the network. For augmentation, we used rotations by 90, 180 and 270 degrees.
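The pre-processing steps can be sketched on a single-channel image stored as a list of rows. This is an illustration under stated assumptions rather than the authors' pipeline: the padding is assumed to be applied on each side (400 + 2 · 8 = 416 = 13 · 32) and filled with zeros, the color re-scaling is assumed to be a linear stretch, the default `observed_max=185` is simply the midpoint of the reported 180-190 range, and all function names are hypothetical.

```python
def per_side_padding(size, factor=32):
    """Padding per side so that size + 2 * padding is a multiple of factor."""
    total = (-size) % factor
    if total % 2:
        raise ValueError("total padding is odd and cannot be split evenly")
    return total // 2


def pad_image(image, pad, fill=0):
    """Pad a 2-D list of pixel values on all four sides (zero fill assumed)."""
    width = len(image[0]) + 2 * pad
    top = [[fill] * width for _ in range(pad)]
    middle = [[fill] * pad + list(row) + [fill] * pad for row in image]
    bottom = [[fill] * width for _ in range(pad)]
    return top + middle + bottom


def stretch_channel(value, observed_max=185, full_scale=255):
    """Linear stretch (assumed) so that observed_max maps to full_scale."""
    return min(full_scale, round(value * full_scale / observed_max))


def rotate90(image):
    """Rotate a 2-D list clockwise by 90 degrees."""
    return [list(row) for row in zip(*image[::-1])]


def rotations(image):
    """The 90, 180 and 270 degree rotations used for augmentation."""
    result = []
    current = image
    for _ in range(3):
        current = rotate90(current)
        result.append(current)
    return result
```

In a segmentation setting the same rotation has to be applied to the image and to its mask so that the labels stay aligned with the pixels.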

The main difficulty in choosing the exact architecture for the road network was the lack of a sufficiently large amount of error-free ground truth data. Therefore, we used the following iterative strategy: 1. Create initial masks from OSM data and train on them. 2. Filter out masks where the model predicts significantly more objects than the OSM mask contains. 3. Retrain the model on the filtered data set. Due to the large possible set of images to train from, around 22 million, we first selected a subset based on OSM data. As our main focus is to get an indicator of economic development, the best case would be to find areas of economic activity. OSM has a classification for commercial buildings, which is rarely used (2,143 of 22 million). We selected areas in the same ADM2 regions as those images, in descending order of square meters occupied by buildings on a uniform grid, until we had selected a base set of 10,000 masks. Next, we trained our Sat-Unet model for roads, using all the masks we had created from OSM as labels.

The Judge: For filtering purposes, we created a Sat-Unet-based model, the Judge, with an additional input for the OSM mask. Using transfer learning, the weights of the pretrained Sat-Unet model were transferred to the bottom layers of the Judge, which create the mask from the original image, while the top layers compute the index of validity, using the calculated mask and the OSM mask as inputs. Combining everything in a single GPU model achieves a more than 50x increase in performance compared to a CPU-based technique. 4. The index of validity is

where i and j are the pixel values of the predicted mask and the OSM mask, respectively. The resulting filtering model decreased the dataset by approximately 40%, filtering out instances like those shown in figure 5. Furthermore, as our network predictions had very low values caused by model uncertainty, we re-trained several Sat-Unet models on the selected images with different random seeds. This ensemble learning significantly increased our predictive performance, as shown in figure 7. The top three are predictions on the test set, while the last image is the combination of all three with the pixels reduced to the skeleton for counting.
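The text does not state how the predictions of the differently seeded models are combined. One common option, shown here purely as an assumption, is a pixel-wise mean followed by a threshold. The function names are hypothetical and the skeletonization step is not covered.

```python
def ensemble_mean(masks):
    """Pixel-wise mean of several predicted masks of identical shape."""
    rows, cols = len(masks[0]), len(masks[0][0])
    return [
        [sum(mask[r][c] for mask in masks) / len(masks) for c in range(cols)]
        for r in range(rows)
    ]


def binarize(mask, threshold):
    """Turn a soft mask into a 0/1 mask."""
    return [[1 if value >= threshold else 0 for value in row] for row in mask]


# Three toy predictions of a 1x2 image from models with different seeds.
predictions = [[[0.2, 0.9]], [[0.4, 0.7]], [[0.0, 0.8]]]
combined = binarize(ensemble_mean(predictions), threshold=0.5)
```

Averaging dampens pixels that only a single model activates, while pixels that all models agree on keep a high value.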

For building recognition, we used the Open Cities AI Challenge data set as the ground truth data set. This data set contains imagery of several African cities at an ultra-high resolution of up to 5 cm per pixel. Each city is split into square areas, and for each image there is a GeoJSON file with vector data describing the contours of the buildings. From this geometric data, we created a contour layer and a centroid layer, which represents the center of every building structure. We down-scaled the images to a scale of 1.53 pixels/meter to match the Google imagery. This high-quality ground-truth data further allowed us to experiment with different architectures. We replaced the encoder part of the Sat-Unet model with the Inception v3 (Xia, Xu, and Nan 2017) and the ResNet50 (He et al. 2016). In every instance, we reset all the weights to random values before training and did not make use of any transfer learning. Table 1 presents the evaluation results on our test set. We also compared how well the network performs in counting the correct houses based on the Jaccard index, shown visually in figure 6.
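Two of the steps above are simple geometry and can be sketched directly. The snippet below is a minimal illustration, not the authors' code: it computes the area-weighted centroid of a building contour with the shoelace formula (one possible definition of "center", assumed here) and the resize factor needed to bring imagery at 5 cm per pixel to 1.53 pixels per meter. Both function names are hypothetical.

```python
def polygon_centroid(points):
    """Area-weighted centroid of a simple polygon given as (x, y) vertices.

    Works for open rings and for closed GeoJSON-style rings whose last
    vertex repeats the first one (the extra edge contributes zero).
    """
    area2 = 0.0  # twice the signed area
    cx = 0.0
    cy = 0.0
    n = len(points)
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        cross = x1 * y2 - x2 * y1
        area2 += cross
        cx += (x1 + x2) * cross
        cy += (y1 + y2) * cross
    if area2 == 0:
        raise ValueError("degenerate polygon with zero area")
    return cx / (3.0 * area2), cy / (3.0 * area2)


def resize_factor(source_m_per_px, target_px_per_m):
    """Factor by which image width and height must be multiplied."""
    return target_px_per_m * source_m_per_px


# 5 cm per pixel is 20 pixels per meter, so the factor is 1.53 / 20.
factor = resize_factor(source_m_per_px=0.05, target_px_per_m=1.53)
```

Because the source resolution is given as "up to" 5 cm per pixel, the factor would have to be computed per image from its actual resolution.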

The performance of the Incep3-Unet and the Resnet50-Unet was quite similar. For the final model selection, we trained both architectures on the previously selected OSM masks and compared their predictive performance on a test set. Table 2 demonstrates the effect of different thresholds on the performance. The threshold is in color intensity units (range 0-255); a true positive (TP) is counted when the predicted centroid is located inside the building contour of the mask; the Pred-To-Mask coefficient is the ratio between the predicted number of houses and the ground truth number; and the false positives (FP) are defined as:

$$FP = 100 \cdot \frac{TotalPred - TP}{TotalPred} \quad (5)$$

Figure 5: The filter removes incomplete OSM masks. From left to right: original image, OSM mask, prediction of the trained model at stage 1.

Figure 6: Africa, building model stage, contour-in-contour evaluation. Buildings that were predicted correctly are in blue, those not predicted in orange. Orange areas of random shapes inside building blocks are usually courtyard areas and are not considered wrong predictions.
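The evaluation quantities defined above (TP, Pred-To-Mask and FP) can be computed as follows. This is a minimal sketch with hypothetical function names. It assumes building contours are simple polygons in pixel coordinates, uses a standard ray-casting test for "centroid inside contour", and counts one TP per predicted centroid that falls inside any contour, which is one possible reading of the definition.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: is (x, y) strictly inside the polygon?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge crosses the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def count_true_positives(centroids, contours):
    """Number of predicted centroids that fall inside some building contour."""
    return sum(
        1
        for (x, y) in centroids
        if any(point_in_polygon(x, y, contour) for contour in contours)
    )


def false_positive_percent(total_pred, tp):
    """FP = 100 * (TotalPred - TP) / TotalPred."""
    return 100.0 * (total_pred - tp) / total_pred


def pred_to_mask(total_pred, total_mask):
    """Ratio of predicted buildings to ground-truth buildings."""
    return total_pred / total_mask


# Toy example: two ground-truth buildings, four predicted centroids.
contours = [
    [(0, 0), (4, 0), (4, 4), (0, 4)],
    [(10, 10), (14, 10), (14, 14), (10, 14)],
]
centroids = [(2, 2), (12, 12), (20, 20), (3, 3)]
tp = count_true_positives(centroids, contours)
fp = false_positive_percent(len(centroids), tp)
ratio = pred_to_mask(len(centroids), len(contours))
```

In this toy case two centroids land in the same building, which shows why the Pred-To-Mask ratio is reported alongside TP and FP: a high TP count alone does not reveal over-counting.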

A threshold of 15 yields the Pred-To-Mask score closest to 1, with acceptable TP and FP rates. Therefore, we picked this threshold for further modeling. The final model, Resnet50-Unet, has 268 layers. To avoid the vanishing gradient problem at a depth of hundreds of layers, ResNet uses skip connections, which add the input of a convolution block to its output. In addition, skip connections give the model the ability to learn the identity function, which guarantees similar performance of the lower and higher layers (He et al. 2016).

We compare our prediction results for buildings and roads to the latest benchmark study in the field of poverty prediction based on DHS data (Yeh et al. 2020). The DHS data is collected in various waves across countries and years. For the comparison, we use the most recent wave available for a country and the aggregated wealth index data. Two indices are provided: the first, wealthpooled, is the PCA calculation across all years, while the second index, wealth, is calculated within country. Yeh et al. (2020) did not provide any out-of-sample estimates. We therefore fully replicated their method by obtaining all the images they used for their locations and training their combined CNN model of multi-spectral and night-light images to determine economic well-being in Africa, with wealth as the label for the last of their training folds (D). The performance results are almost identical in terms of R-squared. In their study, they also found a high correlation of the measures with other types of aggregation, such as the sum total of all assets. Using their wealthpooled predictions as a predictor for wealth, the R-squared is around 0.61, versus 0.67 for wealthpooled.

The DHS location data from Yeh et al. (2020) has 7,315 unique cluster locations in the last wave of each respective country. We use a 5 km radius, corresponding to the possible displacement of survey measurements, as the criterion to select our images. For every square kilometer, we predict the number of buildings and the number of roads, and calculate the night-time light (Elvidge et al. 2017) by grid cell. We then aggregate the roughly 1.5 million images into features by computing the sum, average and quantiles by cluster across all 3 input variables. In total, this leaves us with 6,112 locations. We perform leave-one-out cross-validation (LOOCV) by country, as we are interested in predicting the marginal unit if we were to use the model to get data for one extra country. We also iterate over standard machine learning algorithms without any hyperparameter tuning, using only the default settings. Table 3 and table 4 present the results for out of sample and out of country predictions, respectively. As expected, using an outcome measure that is normalized across all samples inflates the performance. In comparison to the previous literature, our predictions show an increased predictive performance, both in and out of sample.

Figure 7: Ensembling: The first three are predicted masks based on different random seeds, the last one is the resulting mask.
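The leave-one-country-out evaluation can be illustrated with a small, dependency-free sketch. It is not the authors' pipeline: a one-feature ordinary least squares line stands in for the "standard machine learning algorithms", the data is synthetic, and the function names are hypothetical. The point is only the splitting logic, in which every country is held out once and scored by a model fitted on all other countries.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(y_true, y_pred):
    """Coefficient of determination; NaN if the targets are constant."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot if ss_tot else float("nan")


def leave_one_country_out(rows):
    """rows: (country, feature, target) tuples. Returns R-squared per held-out country."""
    scores = {}
    for held_out in sorted({country for country, _, _ in rows}):
        train = [(x, y) for c, x, y in rows if c != held_out]
        test = [(x, y) for c, x, y in rows if c == held_out]
        slope, intercept = fit_line([x for x, _ in train], [y for _, y in train])
        predictions = [slope * x + intercept for x, _ in test]
        scores[held_out] = r_squared([y for _, y in test], predictions)
    return scores


# Synthetic clusters: feature = building count, target = wealth index.
rows = [("A", 1, 3), ("A", 2, 5), ("B", 3, 7), ("B", 4, 9), ("C", 5, 11), ("C", 6, 13)]
scores = leave_one_country_out(rows)
```

Any quantity that is fitted on the pooled data, such as a PCA-based index computed across all countries, leaks information across this split, which is the concern raised above about normalized outcome measures.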

This paper introduces a novel and scalable method to predict road and housing infrastructure from daytime satellite imagery. Compared to existing approaches, we achieve higher predictive performance by training a U-Net-style architecture using ground-truth data from a subset of images. Using satellite images from 21 African countries, we show how our method can be used to generate very granular information about the stock of housing and road infrastructure for regions of the world where reliable information about the local level of economic development is hardly available. Consistently measured and comparable indicators of local economic development are crucial inputs for governments in developing countries, as well as for international organizations, when deciding where to allocate scarce public funds and development aid.

The predictions generated by our method can be directly included in existing decision support systems. For example, international organizations such as the Red Cross are using similar data at the local level to evaluate an area's vulnerability to natural hazards. Our data can be considered a more granular complement to existing measures of the local stock of physical infrastructure. Numerous charitable organizations already rely on satellite imagery to identify districts of African countries that are among the least developed (e.g. Abelson, Varshney, and Sun 2014). Our approach provides a low-cost and scalable alternative to identify areas that are in need. In addition, the Open Street Map mapping community would benefit from our findings as well. The road prediction model could be used worldwide to help complete the road network or to help narrow down possible errors in the data.

Finally, our approach is an important methodological contribution to the large group of scholars from varying disciplines working in the area of poverty measurement. The majority of the existing research focuses on predicting poverty based on aggregate household wealth. This paper shows that predicting poverty measures can also be viewed as a simple high-dimensional feature representation problem. Our study is a proof-of-concept exercise to show that combining daytime satellite imagery, open-source ground truth data and machine learning tools can translate unstructured image data into valuable insights about local economic development at an unprecedented scale.

Table 4: Predictive performance of satellite predictions, R-squared based on LOOCV on out of country predictions by country using wealthpooled

Targeting Direct Cash Transfers to the Extremely Poor

Detecting urban markets with satellite imagery: An application to India

This Mine Is Mine! How Minerals Fuel Conflicts in Africa

Predicting poverty and wealth from mobile phone metadata

Causes of Sprawl: A Portrait from Space. The Quarterly

Using luminosity data as a proxy for economic statistics

The View from Above: Applications of Satellite Data in Economics

Equitable development through deep learning: The case of subnational population density estimation

A guide to convolution arithmetic for deep learning

VIIRS night-time lights

A global poverty map derived from satellite data

Evaluating the relationship between spatial and spectral features derived from high spatial resolution satellite data and urban poverty in Colombo, Sri Lanka

Deep residual learning for image recognition

Measuring Economic Growth from Outer Space

Regional favoritism

Combining satellite imagery and machine learning to predict poverty

Combining satellite imagery and machine learning to predict poverty

Tile2Vec: Unsupervised Representation Learning for Spatially Distributed Data

Statistical tragedy in Africa? Evaluating the database for African economic development

Is newer better? Penn World Table Revisions and their impact on growth estimates

The economics of slums in the developing world

African Cities Disrupting the Urban Future

Infrastructure Quality Assessment in Africa using Satellite Imagery and Deep Learning

Multitask deep learning for predicting poverty from satellite images

A deep learning approach for population estimation from satellite imagery

U-net: Convolutional networks for biomedical image segmentation

Predicting Economic Development using Geolocated Wikipedia Articles

Very deep convolutional networks for large-scale image recognition

Estimation of Gross Domestic Product at Sub-National Scales Using Nighttime Satellite Imagery

Global estimates of market and non-market values derived from nighttime satellite imagery, land cover, and ecosystem service valuation

The NextGenCities Africa Programme

Learning to Interpret Satellite Images using Wikipedia. IJCAI

Inception-v3 for flower classification

Using publicly available satellite imagery and deep learning to understand economic well-being in Africa