title: SustainBench: Benchmarks for Monitoring the Sustainable Development Goals with Machine Learning
authors: Yeh, Christopher; Meng, Chenlin; Wang, Sherrie; Driscoll, Anne; Rozi, Erik; Liu, Patrick; Lee, Jihyeon; Burke, Marshall; Lobell, David B.; Ermon, Stefano
date: 2021-11-08

Progress toward the United Nations Sustainable Development Goals (SDGs) has been hindered by a lack of data on key environmental and socioeconomic indicators, which historically have come from ground surveys with sparse temporal and spatial coverage. Recent advances in machine learning have made it possible to utilize abundant, frequently-updated, and globally available data, such as from satellites or social media, to provide insights into progress toward SDGs. Despite promising early results, approaches to using such data for SDG measurement thus far have largely been evaluated on different datasets or used inconsistent evaluation metrics, making it hard to understand whether performance is improving and where additional research would be most fruitful. Furthermore, processing satellite and ground survey data requires domain knowledge that many in the machine learning community lack. In this paper, we introduce SustainBench, a collection of 15 benchmark tasks across 7 SDGs, including tasks related to economic development, agriculture, health, education, water and sanitation, climate action, and life on land. Datasets for 11 of the 15 tasks are released publicly for the first time. Our goals for SustainBench are to (1) lower the barriers to entry for the machine learning community to contribute to measuring and achieving the SDGs; (2) provide standard benchmarks for evaluating machine learning models on tasks across a variety of SDGs; and (3) encourage the development of novel machine learning methods where improved model performance facilitates progress towards the SDGs.

In 2015, the United Nations (UN) proposed 17 Sustainable Development Goals (SDGs) to be achieved by 2030, to promote prosperity while protecting the planet [2]. The SDGs span social, economic, and environmental spheres, ranging from ending poverty to achieving gender equality to combating climate change (see Table A1). Progress toward the SDGs is traditionally monitored through statistics collected by civil registrations, population-based surveys, and censuses. However, such data collection is expensive and requires adequate statistical capacity, and many countries go decades between making ground measurements on key SDG indicators [20]. Only roughly half of SDG indicators have regular data from more than half of the world's countries [94]. These data gaps severely limit the ability of the international community to track progress toward the SDGs. Advances in machine learning (ML) have shown promise in helping plug these data gaps, demonstrating how sparse ground data can be combined with abundant, cheap, and frequently updated sources of novel sensor data to measure a range of SDG-related outcomes [70, 20]. For instance, data from satellite imagery, social media posts, and/or mobile phone activity can predict poverty [15, 52, 109], annual land cover [35, 18], deforestation [42, 50], agricultural cropping patterns [69, 103], crop yields [11, 110], and the location and impact of natural disasters [25, 92].
As a timely example of real-world impact, the governments of Bangladesh, Mozambique, Nigeria, Togo, and Uganda used ML-based poverty and cropland maps generated from satellite imagery or phone records to target economic aid to their most vulnerable populations during the COVID-19 pandemic [14, 38, 56, 66]. Other recent work demonstrates using ML-based poverty maps to measure the effectiveness of large-scale infrastructure investments [78]. But further methodological progress on the "big data approach" to monitoring SDGs is hindered by a number of key challenges. First, downloading and working with both novel input data (e.g., from satellites) and ground-based household surveys requires domain knowledge that many in the ML community lack. Second, existing approaches have been evaluated on different datasets, data splits, or evaluation metrics, making it hard to understand whether performance is improving and where additional research would be most fruitful [20]. This is in stark contrast to canonical ML datasets like MNIST, CIFAR-10 [60], and ImageNet [81] that have standardized inputs, outputs, and evaluation criteria and have therefore facilitated remarkable algorithmic advances [43, 28, 57, 44, 47]. Third, methods used so far are often adapted from models originally designed for canonical deep learning datasets (e.g., ImageNet). However, the datasets and tasks relevant to SDGs are unique enough to merit their own methodology. For example, gaps in monitoring SDGs are widest in low-income countries, where only sparse ground labels are available to train or validate predictive models. To facilitate methodological progress, this paper presents SUSTAINBENCH, a compilation of datasets and benchmarks for monitoring the SDGs with machine learning. Our goals are to (1) lower the barriers to entry by supplying high-quality domain-specific datasets in development economics and environmental science, (2) provide benchmarks to standardize evaluation on tasks related to SDG monitoring, and (3) encourage the ML community to evaluate and develop novel methods on problems of global significance where improved model performance facilitates progress towards SDGs. In SUSTAINBENCH, we curate a suite of 15 benchmark tasks across 7 SDGs where we have relatively high-quality ground truth labels: No Poverty (SDG 1), Zero Hunger (SDG 2), Good Health and Well-being (SDG 3), Quality Education (SDG 4), Clean Water and Sanitation (SDG 6), Climate Action (SDG 13), and Life on Land (SDG 15). Figure 1 summarizes the datasets in SUSTAINBENCH. Although results for some tasks have been published previously, data for 11 of the 15 tasks are being made public for the first time. We provide baseline models for each task and a public leaderboard. To our knowledge, this is the first set of large-scale cross-domain datasets targeted at SDG monitoring compiled with standardized data splits to enable benchmarking. SUSTAINBENCH is not only valuable for improving sustainability measurement but also offers challenging tasks for ML, allowing for the development of self-supervised learning (Section 3.7), meta-learning (Section 3.7), and multimodal/multi-task learning methods (Sections 3.1 and 3.3 to 3.5) on real-world datasets.
In the remainder of this paper, Section 2 surveys related datasets; Section 3 introduces the SDGs and datasets covered by SUSTAINBENCH; Section 4 summarizes state-of-the-art models on each dataset and where methodological advances are needed; and Section 5 highlights the impact, limitations, and future directions of this work. The Appendix includes detailed information about the inputs, labels, and tasks for each dataset. Our work builds on a growing body of research that seeks to measure SDG-relevant indicators, including those cited above. These individual studies typically focus on only one SDG-related task, but even within a specific SDG domain (e.g., poverty prediction), most tasks lack standardized datasets with clear, replicable benchmarks [20]. In comparison, SUSTAINBENCH is a compilation of datasets that covers 7 SDGs and provides 15 standardized, replicable tasks with established benchmarks. Table 1 compares SUSTAINBENCH against existing datasets that pertain to SDGs, are publicly available, provide ML-friendly inputs/outputs, and specify standardized evaluation metrics. Perhaps the most closely related benchmark dataset is WILDS [59], which provides a comprehensive benchmark for distribution shifts in real-world applications. However, WILDS is not focused on SDGs, and although it includes a poverty mapping task, our poverty dataset covers 5× more countries. There also exist a number of datasets for performing satellite or aerial imagery tasks related to the SDGs [23, 86, 89, 108, 96, 62, 41, 4, 26] which share similarities with the inputs of SUSTAINBENCH on certain benchmarks. For example, [86] compiled imagery from the Sentinel-1/2 satellites, which we also use for SDG monitoring tasks, and the Radiant Earth Foundation has compiled datasets for crop type mapping [77], a task we also include. However, SUSTAINBENCH's goal is to provide a broader view of what ML can do for SDG monitoring; it is differentiated in its focus on multiple SDGs, multiple inputs, and on low-income regions in particular. For tasks where existing datasets are abundant (e.g., cropland and land cover classification), SUSTAINBENCH has tasks that address remaining challenges in the domain (e.g., learning from weak labels, sharing knowledge across the globe). Appendix D provides task-by-task comparisons of SUSTAINBENCH datasets with prior work. In this section, we introduce the SUSTAINBENCH datasets and provide background on the SDGs that they help monitor. Seven SDGs are currently covered: No Poverty (SDG 1), Zero Hunger (SDG 2), Good Health and Well-being (SDG 3), Quality Education (SDG 4), Clean Water and Sanitation (SDG 6), Climate Action (SDG 13), and Life on Land (SDG 15). We describe how progress toward each goal is traditionally monitored, the gaps that currently exist in monitoring, and how certain indicators can be monitored using non-traditional datasets instead. Figure 1 summarizes the SDG, inputs, outputs, tasks, and original reference of each dataset, and Figures 2 and A1 visualize how many SDG indicators are covered by SUSTAINBENCH in each country. All of the datasets are easily downloaded via a Python package that integrates with the PyTorch ML framework [75].

Figure 2: A map of how many SDGs are covered in SUSTAINBENCH for every country. SUSTAINBENCH has global coverage with an emphasis on low-income countries. In total, 119 countries have at least one task in SUSTAINBENCH.
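To illustrate the download workflow mentioned above, the sketch below shows how a SUSTAINBENCH task might be plugged into a standard PyTorch pipeline. The function name `get_dataset`, the task identifier, and the batch structure are illustrative assumptions rather than the package's documented API; consult the released package for the actual interface.

```python
# Hypothetical usage sketch; `get_dataset` and the task name are assumptions,
# not the documented sustainbench API.
from torch.utils.data import DataLoader
import sustainbench  # the Python package described above

train_set = sustainbench.get_dataset("poverty_over_space", split="train")
loader = DataLoader(train_set, batch_size=64, shuffle=True, num_workers=4)

for images, labels in loader:
    # images: satellite image tensors; labels: cluster-level asset wealth index
    pass
```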
Despite decades of declining poverty rates, an estimated 8.4% of the global population remains in extreme poverty as of 2019, and progress has slowed in recent years [93]. But data on poverty remain surprisingly sparse, hampering efforts at monitoring local progress, targeting aid to those who need it, and evaluating the effectiveness of antipoverty programs [20]. In most African countries, for example, nationally representative consumption or asset wealth surveys, the key source of internationally comparable poverty measurements, are only available once every four years or less often [109]. For SUSTAINBENCH, we processed survey data from two international household survey programs: the Demographic and Health Surveys (DHS) [48] and the Living Standards Measurement Study (LSMS). Both constitute nationally representative household-level data on assets, housing conditions, and education levels, among other attributes. Notably, only LSMS data form a panel, i.e., the same households are surveyed repeatedly, facilitating comparisons over time. Using a principal components analysis (PCA) approach [31, 85], we summarize the survey data into a single scalar asset wealth index per "cluster," which roughly corresponds to a village or local community. We refer to cluster-level wealth (or its absence) as "poverty". Previous research has shown that widely available imagery sources, including satellite imagery [52, 109] and crowd-sourced street-level imagery [64], can be effective for predicting cluster-level asset wealth when used as inputs in deep learning models. SUSTAINBENCH includes two regression tasks for poverty prediction at the cluster level, both using imagery inputs to estimate an asset wealth index. The first task (Section 3.1.1) predicts poverty over space, and the second task (Section 3.1.2) predicts poverty changes over time. The poverty prediction over space task involves predicting a cluster-level asset wealth index which represents the "static" asset wealth of a cluster at a given point in time. For this task, the labels and inputs are created in a similar manner as in [109], but with about 5× as many examples. Dataset Following techniques developed in previous works [52, 109], we assembled asset wealth data for 2,079,036 households living in 86,936 clusters across 48 countries, drawn from DHS surveys conducted between 1996 and 2019, computing a cluster-level asset wealth index as described above. We provide satellite and street-level imagery inputs, gathered and processed according to established procedures [109, 64]. The 255×255×8px satellite images have 7 multispectral bands from Landsat daytime satellites and 1 nightlights band from either the DMSP or VIIRS satellites. The images are rescaled to a resolution of 30m/px and are geographically centered around each surveyed cluster's geocoordinates. Geocoordinates in the public survey data are "jittered" by up to 10km from the true locations to protect the privacy of surveyed households [19]. For each cluster location, we also retrieved up to 300 crowd-sourced street-level images from Mapillary. We evaluate model performance using the squared Pearson correlation coefficient (r²) between predicted and observed values of the asset wealth index on held-out test countries. Appendix D.1 has more dataset details. For predicting temporal changes in poverty, we construct a PCA-based index of changes in asset ownership using LSMS data.
For this task, the labels and inputs provided are similar to [109], with small improvements in image and label quality. Dataset We provide labels for 1,665 instances of cluster-level asset wealth change from 1,287 clusters in 5 African countries. We use the same satellite imagery sources from the previous poverty prediction task. In this task, however, for each cluster we provide images from the two points in time (before and after) used to compute the difference in asset ownership, instead of only from a single point in time. Because street-level images were only available for ∼1% of clusters, we do not provide them for this task. We evaluate model performance using the squared Pearson correlation coefficient (r²) on predictions and labels in held-out cluster locations. Appendix D.2 has more dataset details. The number of people who suffer from hunger has risen since 2015, with 690 million, or 9% of the world's population, affected by chronic hunger [93]. At the same time, 40% of habitable land on Earth is already devoted to agricultural activities, making agriculture by far the largest human impact on the natural landscape [5]. The second SDG is to "end hunger, achieve food security and improved nutrition, and promote sustainable agriculture." In addition to ending hunger and malnutrition in all forms, the targets under SDG 2 include doubling the productivity of small-scale food producers and promoting sustainable food production [93]. While data on agricultural practices and farm productivity are traditionally obtained via farm surveys, such data are rare and often of low quality [20]. Satellite imagery offers the opportunity to monitor agriculture more cheaply and more accurately, by mapping cropland, crop types, crop yields, field boundaries, and agricultural practices like cover cropping and conservation tillage. We discuss the SUSTAINBENCH datasets for SDG 2 below. One indicator for SDG 2 is the proportion of agricultural area under productive and sustainable agriculture [93]. Existing state-of-the-art datasets on land cover [18, 35] are derived from satellite time series and include a cropland class. However, the maps are known to have large errors in regions of the world like Sub-Saharan Africa where ground labels are sparse [56]. Therefore, while mapping cropland is largely a solved problem in settings with ample labels, devising methods to efficiently generate georeferenced labels and accurately map cropland in low-resource regions remains an important and challenging research direction. Dataset We release a dataset for performing weakly supervised cropland classification in the U.S. using data from [102], which has not been released previously. While densely segmented labels are time-consuming and infeasible to generate for a large region like Africa, pixel-level and image-level labels are easier to create. The inputs are image tiles taken by the Landsat satellites and composited over the 2017 growing season, and the labels are either binary {cropland, not cropland} at single pixels or {≥ 50% cropland, < 50% cropland} for the entire image. Labels are generated from a high-quality USDA dataset on land cover [69]. Train, validation, and test sets are split along geographic blocks, and we evaluate models by overall accuracy and F1-score. We also encourage the use of semi-supervised and active learning methods to relieve the labeling burden needed to map cropland.
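As one concrete (hypothetical) way to use these weak labels, a segmentation network can be trained with a pixel-level cross-entropy term at the few labeled pixels plus an image-level term that pushes the mean predicted cropland fraction to the correct side of 50%. This is a minimal sketch of the general idea, not necessarily the loss used in [102]:

```python
import torch
import torch.nn.functional as F

def weak_supervision_loss(logits, pixel_labels, pixel_mask, image_labels):
    """Combine the two weak label types to supervise a segmentation net.

    logits:       (B, 1, H, W) raw segmentation scores.
    pixel_labels: (B, H, W) {0,1} cropland labels, valid where pixel_mask is True.
    pixel_mask:   (B, H, W) boolean mask marking the sparsely labeled pixels.
    image_labels: (B,) 1 if a tile is >= 50% cropland, else 0.
    """
    probs = torch.sigmoid(logits.squeeze(1))  # (B, H, W)
    # Pixel-level term: cross-entropy at the labeled pixels only.
    pixel_loss = F.binary_cross_entropy(
        probs[pixel_mask], pixel_labels[pixel_mask].float())
    # Image-level term: the mean predicted cropland fraction should land on
    # the correct side of 50%, treating the fraction as a tile probability.
    frac = probs.mean(dim=(1, 2))  # (B,)
    image_loss = F.binary_cross_entropy(frac, image_labels.float())
    return pixel_loss + image_loss
```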
Spatially disaggregated crop type maps are needed to assess agricultural diversity and estimate yields. In high-income countries across North America and Europe, crop type maps are produced annually by departments of agriculture using farm surveys and satellite imagery [69]. However, no such maps are regularly available for middle- and low-income countries. Mapping crop types in the Global South faces challenges of irregularly shaped fields, small fields, intercropping, sparse ground truth labels, and highly heterogeneous landscapes [83]. We release two crop type datasets in Sub-Saharan Africa and point the reader to additional datasets hosted by the Radiant Earth Foundation [77] (Table 1). We recommend that ML researchers use all available datasets to ensure model generalizability. We re-release the dataset from [83] in Ghana and South Sudan in a format more familiar to the ML community. The inputs are growing-season time series of imagery from three satellites (Sentinel-1, Sentinel-2, and PlanetScope) in 2016 and 2017, and the outputs are semantic segmentations of crop types. Ghana samples are labeled for maize, groundnut, rice, and soybean, while South Sudan samples are labeled for maize, groundnut, rice, and sorghum. We use the same train, validation, and test sets as [83], which preserve the relative percentages of crop types across the splits. We evaluate models using overall accuracy and macro F1-score. We release the dataset used in [58] and [54] to map crop types in three regions of Kenya. Since the timing of growth and spectral signature are two main ways to distinguish crop types, the inputs are annual time series from the Sentinel-2 multi-spectral satellite. The outputs are crop types (9 possible classes). There are a total of 39,762 pixels belonging to 5,746 fields. The training, validation, and test sets are split by region rather than by field in order to develop models that generalize across geography. Our evaluation metrics are overall accuracy and macro F1-score. In order to double the productivity (or yield) of smallholder farms, we first have to measure it, yet accurate local-level yield measurements are exceedingly rare in most of the world. In SUSTAINBENCH, we release county-level yields collected from various government databases; these can still aid in forecasting production, evaluating agricultural policy, and assessing the effects of climate change. Dataset Our dataset is based on the datasets used in [110] and [101]. We release county-level yields for 857 counties in the U.S., 135 in Argentina, and 32 in Brazil for the years 2005-16. The inputs are spectral band and temperature histograms over each county for the harvest season from the MODIS satellite. The ground truth labels are the regional soybean yield per harvest, in metric tonnes per cultivated hectare, retrieved from government data. See Appendix D.6 for more details. Models are evaluated using root mean squared error (RMSE) and R² of predictions against the ground truth. The imbalance of data by country motivates the use of transfer learning approaches.
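To make the histogram input format concrete, the sketch below bins per-pixel MODIS values into per-band, per-timestep histograms in the spirit of [110]; the bin edges, normalization, and array layout here are illustrative assumptions.

```python
import numpy as np

def county_histograms(pixels, n_bins=32, value_range=(0.0, 1.0)):
    """Bin a county's per-pixel band values into per-timestep histograms.

    pixels: (T, N, B) array -- T timesteps, N cropland pixels in the county,
            B reflectance/temperature bands.
    Returns a (n_bins, T, B) array, matching the 32 x 32 x 9 histograms
    described for the yield task (bin edges here are assumptions).
    """
    T, _, B = pixels.shape
    hist = np.empty((n_bins, T, B), dtype=np.float32)
    for t in range(T):
        for b in range(B):
            counts, _ = np.histogram(pixels[t, :, b], bins=n_bins,
                                     range=value_range)
            hist[:, t, b] = counts / max(counts.sum(), 1)  # normalize
    return hist
```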
Since agricultural practices are usually implemented at the level of an entire field, field boundaries can help reduce noise and improve performance when mapping crop types and yields. Furthermore, field boundaries are a prerequisite for today's digital agriculture services that help farmers optimize yields and profits [98]. Statistics that can be derived from field delineation, such as the size and distribution of crop fields, have also been used to study productivity [21, 27], mechanization [61], and biodiversity [37]. Field boundary datasets are rare and only sparsely labeled in low-income regions, so we release a large dataset from France to aid in model development. Dataset We re-release the dataset introduced in Aung et al. [9]. The dataset consists of Sentinel-2 satellite imagery in France over 3 time ranges: January-March, April-June, and July-September in 2017. Each image has resolution 224×224 pixels, corresponding to a 2.24 km × 2.24 km area on the ground. Each satellite image comes with the corresponding binary masks of boundaries and areas of farm parcels. The dataset consists of a total of 1,966 samples. We use a different data split from [9] to remove overlap between the train, validation, and test splits. Following [9], we use the Dice score between the ground truth boundaries and predicted boundaries as the performance metric. Despite significant progress on improving global health outcomes (e.g., halving child mortality rates since 2000 [93]), the lack of local-level measurements in many developing countries continues to constrain the monitoring, targeting, and evaluation of health interventions. We examine two health indicators: female body mass index (BMI), a key input to understanding both food insecurity and obesity; and child mortality rate (deaths under age 5), an official SDG 3 indicator considered to be a summary measure of a society's health. Previous works have demonstrated using satellite imagery [67] or street-level Mapillary imagery [64] as inputs for predicting BMI. While we are unaware of any prior works using such imagery inputs for predicting child mortality rates, "there is evidence that child mortality is connected to environmental factors such as housing quality, slum-like conditions, and neighborhood levels of vegetation" [51], which are certainly observable in imagery. We provide cluster-level average labels for women's BMI and child mortality rates compiled from DHS surveys. There are 94,866 cluster-level BMI labels computed from 1,781,403 women of childbearing age, excluding pregnant women. There are 105,582 cluster-level labels for child mortality rates computed from 1,936,904 children under age 5. As in the poverty prediction over space task (Section 3.1.1), the inputs for predicting the health labels are satellite and street-level imagery, and models are evaluated using the r² metric on labels from held-out test countries. SDG 4 includes targets that by 2030 all children and adults "complete free, equitable and quality primary and secondary education". Increasing educational attainment (measured by years of schooling completed) is known to increase wealth and social mobility, and higher educational attainment in women is strongly associated with improved child nutrition and decreased child mortality [40]. Previous works have demonstrated the ability of deep learning methods to predict educational attainment from both satellite images [112] and street-level images [36, 64]. We provide cluster-level average years of educational attainment by women of reproductive age, compiled from the same DHS surveys used for creating the asset wealth labels in the poverty prediction task. The 122,435 cluster-level labels were computed from 3,013,286 women across 56 countries. As in the poverty prediction over space task (Section 3.1.1), the inputs for predicting women's educational attainment are satellite and street-level imagery, and models are evaluated using the r² metric on labels from held-out test countries.
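All of the DHS-based tasks above share the same evaluation metric: the squared Pearson correlation coefficient between predictions and labels on held-out countries. A minimal sketch:

```python
import numpy as np
from scipy.stats import pearsonr

def r2_metric(y_pred, y_true):
    """Squared Pearson correlation between predictions and labels.

    Note: this is the square of the correlation coefficient, which differs
    from the 'coefficient of determination' R^2 used for the yield task.
    """
    r, _ = pearsonr(np.asarray(y_pred), np.asarray(y_true))
    return r ** 2
```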
Clean water and sanitation are fundamental to human health, but as of 2020, two billion people globally do not have access to safe drinking water, and 2.3 billion lack a basic hand-washing facility with soap and water [84]. Access to improved sanitation and clean water is known to be associated with lower rates of child mortality [65, 33]. We provide cluster-level averages of a water quality index and a sanitation index compiled from the same DHS surveys used for creating the asset wealth labels in the poverty prediction task. The 87,938 (water index) and 89,271 (sanitation index) cluster-level labels were computed from 2,105,026 (water index) and 2,143,329 (sanitation index) households across 49 countries. As in the poverty prediction over space task (Section 3.1.1), the inputs for predicting the water quality and sanitation indices are satellite and street-level imagery, and models are evaluated using the r² metric on labels from held-out test countries. Since SUSTAINBENCH includes labels for child mortality in many of the same clusters that have sanitation index labels, we encourage researchers to take advantage of the known associations between these variables. SDG 13 aims to combat climate change and its disruptive impacts on national economies and local livelihoods [68]. Monitoring emissions and environmental regulatory compliance are key steps toward SDG 13. Brick manufacturing is a major source of carbon emissions and air pollution in South Asia, with an industry largely composed of small-scale, informal producers. Identifying brick kilns from satellite imagery is a scalable method to improve compliance with environmental regulations and measure their impact on nearby populations. A recent study [63] trained a CNN to detect kilns and hand-validated the predictions, providing ground truth kiln locations in Bangladesh from October 2018 to May 2019. Dataset The high-resolution satellite imagery used in [63] could not be shared publicly because it is proprietary. Hence, we provide a lower-resolution alternative: Sentinel-2 imagery, which is available through Google Earth Engine [39]. We retrieved 64×64×13 tiles at 10m/pixel resolution from the same time period and labeled each image as not containing a brick kiln (class 0) or containing a brick kiln (class 1) based on the ground truth locations in [63]. Human activity has altered over 75% of the earth's surface, reducing forest cover, degrading once-fertile land, and threatening an estimated 1 million animal and plant species with extinction [93]. Our understanding of land cover, i.e., the physical material on the surface of the earth, and its changes is not uniform across the globe. Existing state-of-the-art land cover maps [18] are significantly more accurate in high-income regions than low-income ones, as the latter have few ground truth labels [56]. The following two datasets seek to reduce this gap via representation learning and transfer learning. One approach to increase the performance of land cover classification in regions with few labels is to use unsupervised or self-supervised learning to improve satellite/aerial image representations, so that downstream tasks require fewer labels to perform well. Dataset We release the high-resolution aerial imagery dataset from [53], which spans a 2,500 km² (12-billion-pixel) area of the Central Valley, CA in the U.S. The output is image-level land cover (66 classes), where labels are generated from a high-quality USDA dataset [69].
The region is divided into geographically contiguous blocks for the train, validation, and test sets. The user may use the training imagery in any way to learn representations, and we provide a test set of up to 200,000 tiles (100×100px) for evaluation. The evaluation metrics are overall accuracy and macro F1-score. A second strategy for increasing performance in label-scarce regions is to transfer knowledge learned from classifying land cover in high-income regions to low-income ones. Dataset We release the global dataset of satellite time series from [104]. The dataset samples 692 regions of size 10 km × 10 km around the globe; for each region, 500 latitude/longitude coordinates are sampled. The input is a time series from the MODIS satellite over the course of a year, and the output is land cover type (17 possible classes). Users have the option of splitting regions into train, validation, and test sets at random or by continent. The evaluation metrics are overall accuracy, F1-score, and kappa score. The results from [104] are reported with all regions from Africa as the test set, but the user can choose to hold out other continents, for which the label quality will be higher. SUSTAINBENCH provides a benchmark and public leaderboard website for the datasets described in Section 3. Each dataset has standard train-test splits with well-defined performance metrics detailed in Appendix E. We also welcome community submissions using additional data sources beyond what is provided in SUSTAINBENCH, such as for pre-training or regularization. Table 2 summarizes the baseline models and results. Code to reproduce our baseline models is available on GitHub. Here, we highlight some main takeaways from our baseline models. First, there is significant room for improvement for models that can take advantage of multi-modal inputs. Specifically, our baseline model for the DHS survey-based tasks only uses the satellite imagery inputs, and its poor performance on predicting child mortality and women's educational attainment demonstrates the need to leverage additional data sources, such as the street-level imagery we provide. Second, ML model development can lead to significant gains in performance for SDG-related tasks. While the original paper that compiled SUSTAINBENCH's field delineation dataset achieved a Dice score of 0.61 with a standard U-Net [9], we applied a new attention-based CNN developed specifically for field delineation [99] and achieved a 0.87 Dice score. For more task-specific discussions, please see Appendix E. This paper introduces SUSTAINBENCH, which, to the best of our knowledge, is the largest compilation to date of datasets and benchmarks for monitoring the SDGs with machine learning (ML). The SDGs are arguably the most urgent challenges the world faces today, and it is important that the ML community contribute to solving these global issues. As progress towards SDGs is often hindered by a lack of ground survey data, especially in low-income countries, ML algorithms designed for monitoring SDGs are important for leveraging non-traditional data sources that are cheap, globally available, and frequently updated to fill in data gaps. ML-based estimates provide policymakers from governments and aid organizations with more frequent and comprehensive insights [109, 20, 52]. The tasks defined in SUSTAINBENCH can directly translate into real-world impact.
For example, during the COVID-19 pandemic, the government of Togo collaborated with researchers to use satellite imagery, phone data, and ML to map poverty [14] and cropland [56] in order to target cash payments to the jobless. Recent work in Uganda demonstrates how ML-based poverty maps can be used to measure the effectiveness of large-scale infrastructure investments [78]. ML-based analyses of satellite images in Kenya (using the labels described in Section 3.2.2) were recently used to identify soil nitrogen deficiency as the limiting factor in maize yields, thereby facilitating targeted agricultural intervention [54]. Finally, the development of a new attention-based neural network architecture enabled the delineation of 1.7 million fields in Australia from satellite imagery [99]. These field boundaries have been productized and facilitate the adoption of digital agriculture, which can improve yields while minimizing environmental pollution [24]. Although ML approaches have demonstrated value on a variety of tasks related to SDGs [109, 20, 64, 53, 52, 101, 103], the "big data approach" has its limits. ML models cannot completely replace ground surveys. Imperfect predictions from ML models may introduce biases that propagate through downstream policy decisions, leading to negative societal impacts. The use of survey data, high-resolution remote sensing images, and street-level images may also raise privacy concerns, despite efforts to protect individual privacy. We refer the reader to Appendix F for a detailed treatment of ethical concerns in SUSTAINBENCH, including mitigation strategies we implemented. Despite these limitations, ML applications have the greatest potential for positive impact in low-income countries, where gaps in monitoring SDGs are widest due to the persistent lack of survey data. While SUSTAINBENCH is the largest SDG-focused ML dataset and benchmark to date, it is by no means complete. Field surveys are extremely costly, and labeling images for model training requires significant manual effort by experts, limiting the amount of data released in SUSTAINBENCH to quantities smaller than those of many canonical ML datasets (e.g., ImageNet). In addition, many SDGs and indicators are not included in the current version. These missing SDG indicators can be placed into 3 categories. First, several tasks can be included in future versions of SUSTAINBENCH by drawing on existing data. For example, measures of gender equality (SDG 5) and access to affordable and clean energy (SDG 7) already exist in the surveys used to create labels for SUSTAINBENCH tasks but will require additional processing before release. Recent works have also pioneered deep learning methods for identifying illegal fishing from satellite images [74] (SDG 14) and monitoring biodiversity from camera traps [13] (SDG 15). Table 1 includes a few relevant datasets from this first category. Second, some SDG indicators require additional research to discover non-traditional data modalities that can be used to monitor them. Finally, not all SDGs are measurable using ML or need improved measurement capabilities from ML models. For example, international cooperation (SDG 17) is perhaps best measured by domestic and international policies and agreements. For the ML community, SUSTAINBENCH also provides opportunities to test state-of-the-art ML models on real-world data and develop novel algorithms.
For example, the tasks based on DHS household survey data share the same inputs and thus facilitate multi-task training. In particular, we encourage researchers to take advantage of the known strong associations between asset wealth, child mortality, women's education, and sanitation labels [33, 40]. The combination of satellite and street-level imagery for these tasks also enables multi-modal representation learning. On the other hand, the land cover classification and cropland mapping tasks provide new real-world datasets for evaluating and developing self-supervised, weakly supervised, unsupervised, and meta-learning algorithms. We welcome exploration of methods beyond our provided baseline models. Ultimately, we hope SUSTAINBENCH will lower the barrier to entry for the ML community to contribute toward monitoring SDGs and highlight challenges for ML researchers to address. In the long run, we plan to continue expanding the datasets and benchmarks as new data sources become available. We believe that standardized datasets and benchmarks like those in SUSTAINBENCH are imperative to both novel method development and real-world impact. The Landsat, DMSP, NAIP, and VIIRS satellite images provided in SustainBench are in the public domain. PlanetScope imagery and Mapillary street-level imagery are provided under the CC BY-SA 4.0 license. Sentinel-2 imagery is provided under the Open Access compliant Creative Commons CC BY-SA 3.0 IGO license. Sentinel-1 imagery is free to access, reproduce, and distribute. Likewise, MODIS imagery is free to reuse and redistribute. Our inclusion of labels derived from DHS survey data is within the DHS program Terms of Use, as the labels are aggregated to the cluster level and do not include any of the original "micro-level" data, and no individuals are identified. Our inclusion of labels derived from LSMS survey data is within the LSMS access policy, as we do not redistribute any of the raw data files. The Argentina crop yield labels are provided under the CC BY 2.5 AR license. United States crop yield labels are also free to access and reproduce. The brick kiln binary classification labels were manually hand-labeled by ourselves and our collaborators and therefore do not have any licensing restrictions. SUSTAINBENCH itself is released under a CC BY-SA 4.0 license, which is compatible with all of the licenses for the datasets included. Our datasets are stored on Google Drive at the following link: https://drive.google.com/drive/folders/1jyjK5sKGYegfHDjuVBSxCoj49TD830wL?usp=sharing. Due to the large size of our dataset, we were unable to find any existing research data repository (e.g., Zenodo, Dataverse) willing to accommodate our dataset. The GitHub repo with code used to process the datasets and run our baseline models is located at https://github.com/sustainlab-group/sustainbench/. The dataset will be maintained by the Stanford Sustainability and AI Lab. Today, six years after the unveiling of the SDGs, many gaps still exist in monitoring progress. Official tracking of data availability is conducted by the UN Statistical Commission, which classifies each indicator into one of three tiers: the indicator is well-defined and data are regularly produced by at least 50% of countries (Tier I), the indicator is well-defined but data are not regularly produced by countries (Tier II), or the indicator is currently not well-defined (Tier III).

Figure A1: Maps of geographic SUSTAINBENCH coverage per SDG.
Some indicators have regularly produced data while others do not, and 4 indicators are a mix depending on the data of interest (Table A1) [94]. For example, for monitoring global poverty (SDG 1), the proportion of a country's population living below the international poverty line (Indicator 1.1.1) is reported annually for all countries, but the economic loss attributed to natural and man-made disasters (Indicator 1.5.2) is only sparsely documented. We provide descriptions of the 17 Sustainable Development Goals (SDGs) in Table A1. In this section, we detail the process of constructing the poverty, health, education, and water and sanitation labels from DHS surveys. We also give more information about the input imagery that we provide as part of SUSTAINBENCH. Labels from DHS survey data We constructed several indices using survey data from the Demographic and Health Surveys (DHS) program, which is funded by the US Agency for International Development (USAID) and has conducted nationally representative household-level surveys in over 90 countries. For SUSTAINBENCH, we combined survey data covering 56 countries from 179 unique surveys with questions on women's education, women's BMI, under-5 mortality, household asset ownership, water quality, and sanitation (toilet) quality. We chose surveys between 1996 (the first year that nightlights imagery is available) and 2019 (the latest year with available DHS surveys) for which geographic data were available. The full list of surveys is shown in Table A3. Asset Wealth Index. While the SDG indicators define poverty lines expressed in average expenditure (a.k.a. consumption) per day, survey data are much more widely available for household asset wealth than for expenditure. Furthermore, asset wealth is considered a less noisy measure of households' long-run economic well-being [85, 32] and is actively used for targeting social programs [32, 7]. To summarize household-level survey data into a scalar asset wealth index, standard approaches perform principal components analysis (PCA) of survey responses and project them onto the first principal component [31, 85]. The household-level asset wealth index is commonly averaged to create a cluster-level index, where a "cluster" roughly corresponds to a village or local community. The asset wealth index is built using household asset ownership and infrastructure information as done in prior works [109]. We include the number of rooms used for sleeping in a home (capped at 25); binary indicators for whether the household has electricity and owns a radio, TV, refrigerator, motorcycle, car, or phone (or cellphone); and the quality of floors, water source, and toilet. As "floor type", "water source type", and "toilet type" are reported from DHS as descriptive categorical variables (e.g., "piped water"/"flush to pit latrine"), we convert the descriptions to a numeric scale, a standard technique for processing survey data [65]. We use a 1-5 scale where lower numbers indicate the water source is less developed (e.g., straight from a lake) while higher numbers indicate higher levels of technology/development (e.g., piped water); we use similar 1-5 scales for toilet type and floor type. To calculate the index, we use the first principal component of all the variables mentioned above at the household level, and report the mean at the cluster level. The asset wealth index calculation includes 2,081,808 households total from 87,119 clusters in 48 countries, with a median of 22 households per cluster.
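The sketch below illustrates this index construction, assuming a household-level table that already contains the recoded numeric variables; the recoding dictionary shown is hypothetical, and the true category-to-rank assignments follow the 1-5 scales described above.

```python
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical 1-5 recoding for one categorical variable; the actual
# category-to-rank assignments follow the scales described above and [65].
WATER_RANK = {"surface water": 1, "unprotected well": 2, "protected well": 3,
              "public tap": 4, "piped water": 5}

def asset_wealth_index(households: pd.DataFrame) -> pd.Series:
    """First-principal-component index per household, averaged per cluster.

    Assumes one row per household, a 'cluster' column, and numeric asset
    columns (rooms, electricity, radio, ..., recoded floor/water/toilet).
    """
    features = households.drop(columns=["cluster"])
    z = (features - features.mean()) / features.std()  # standardize columns
    # Project onto the first principal component (its sign is arbitrary).
    index = PCA(n_components=1).fit_transform(z)[:, 0]
    return pd.Series(index, index=households.index).groupby(
        households["cluster"]).mean()
```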
Many surveys are dropped because they do not include one of the 12 variables we use to construct the index. The final number of clusters with asset wealth labels in SUSTAINBENCH is only 86,936, as several clusters did not have corresponding satellite imagery inputs. Note that households from these clusters with missing imagery still contributed to the PCA computation, since these clusters were excluded from SUSTAINBENCH only after the PCA-based index had already been constructed. The women's education metric is created by taking the cluster-level mean of "education in single years". Following [40], we capped the years of education at 18, a common threshold in many surveys which helps avoid outliers. The women's education metric includes data from 2,910,286 women in 56 countries, with a median of 24 women per cluster. To create the women's BMI metric, we first exclude all pregnant women, as BMI is not adjusted for them. Using the sample of women for whom BMI is appropriate, we take the cluster-level mean. For all indices, we excluded the calculated index for a cluster if fewer than 5 observations were used to create it. For the asset wealth, sanitation, and water indices, an observation unit is a household; for the women's education, BMI, and under-5 mortality measures, the observation unit is an individual. We also excluded several hundred clusters for which satellite imagery could not be obtained. For all of the tasks based on DHS survey data, we use a uniform train/validation/test dataset split by country. Delineating by country ensures that there is no overlap between any of the splits, i.e., a model trained on our train split will not have "seen" any part of any image from the test split. The splits are listed in Table A2. The main source of inputs for these tasks is satellite imagery, collected and processed in a similar manner as [109]. For each DHS surveyed country and year, we created 3-year median composites of daytime surface reflectance images captured by the Landsat 5, 7, and 8 satellites. Each composite takes the median of each cloud-free pixel available during a 3-year period centered on the year of the DHS survey. (Note the difference from [109], which only chose three distinct 3-year periods for compositing.) As described in [109], the motivation for using 3-year composites is two-fold. First, multi-year median compositing has seen success in similar applications for gathering clear satellite imagery [10], and even in 1-year composites we observed substantial influence of clouds in some regions, given imperfections in the cloud mask. Second, the outcomes that we predict (wealth, health, education, and infrastructure) tend to evolve slowly over time, and we did not want our inputs to be distorted by seasonal or short-run variation. These daytime images provide the 7 multispectral (MS) bands of our model inputs. We also include nighttime lights ("nightlights") imagery, using the same sources as [109]. For DMSP, which provides annual composites, we apply the intercalibration from [46] to ensure that the DMSP values are comparable across time (a procedure which [109] did not follow). For VIIRS, which provides monthly composites, we perform 3-year median compositing similar to the Landsat images, taking the median of each monthly average radiance over a 3-year period centered on the year of the DHS survey. All nightlights images are resized using nearest-neighbor upsampling to cover the same spatial area as each Landsat image.
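The nearest-neighbor resizing of the nightlights band can be done with simple integer index mapping; a minimal sketch, assuming numpy arrays:

```python
import numpy as np

def upsample_nearest(nl, out_hw=(255, 255)):
    """Nearest-neighbor upsample a nightlights tile onto the Landsat grid,
    covering the same spatial extent at the finer resolution."""
    h, w = nl.shape
    rows = np.arange(out_hw[0]) * h // out_hw[0]  # map output row -> source row
    cols = np.arange(out_hw[1]) * w // out_hw[1]  # map output col -> source col
    return nl[np.ix_(rows, cols)]
```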
The MS and NL satellite imagery were processed in and exported from Google Earth Engine [39]. For each cluster from a given DHS surveyed country-year, we provide one 255×255×8 image (7 MS bands, 1 NL band) centered on the cluster's geocoordinates at a scale of 30 m/pixel. See Figure A2 for an example of an image in our dataset. In our released code, we provide the mean and standard deviation of each band across the entire dataset for input normalization. The exact image collections we used on Google Earth Engine are documented in our released dataset-processing code. For future releases of SUSTAINBENCH, we would like to update all of the Landsat imagery to the newer "Collection 2" products. New Collection 1 products will not be released beyond January 1, 2022, so we would not be able to use the existing Collection 1 imagery source for future DHS surveys. We would also like to update the VIIRS imagery to the official annual composites released by the Earth Observation Group. We did not provide such imagery in SUSTAINBENCH because they were not available on Google Earth Engine at the time SUSTAINBENCH was compiled. Mapillary Images Mapillary [71] provides a platform for crowd-sourced, geo-tagged street-level imagery. It provides an API to access data such as images, map features, and object detections, automatically blurring faces of human subjects and license plates [72] and allowing users who upload images to manually blur any that are missed [3] for privacy. We retrieved only images that intersect with a DHS cluster. A given image must satisfy two conditions to intersect with a DHS cluster: 1) its geo-coordinates must be within 0.1 degrees latitude and longitude of the cluster's geo-location, and 2) it must have been captured within 3 years before or after the year of the DHS datapoint. Each image has metadata, including a unique ID, timestamp of capture in milliseconds, year of capture, latitude, and longitude. All downloaded images have 3 channels (RGB), and the length of the shorter side is 1,024 pixels. Approximately 18.7% of all DHS clusters, spanning 48 countries, have a non-zero number of Mapillary images. Of these clusters with Mapillary images, the number of images ranges from 1 to a maximum of 300, with a mean of 76 and median of 94. The total number of Mapillary images included in SUSTAINBENCH is approximately 1.7 million. Figure A3 shows some example Mapillary images.
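A sketch of the two intersection conditions as a filter (the function and argument names are our own, not from the Mapillary API):

```python
def intersects_cluster(img_lat, img_lon, img_year,
                       cluster_lat, cluster_lon, survey_year):
    """Both conditions for a Mapillary image to 'intersect' a DHS cluster:
    within 0.1 degrees latitude and longitude of the cluster, and captured
    within 3 years before or after the DHS survey year."""
    near = (abs(img_lat - cluster_lat) <= 0.1
            and abs(img_lon - cluster_lon) <= 0.1)
    recent = abs(img_year - survey_year) <= 3
    return near and recent
```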
Table A5 summarizes the related works for the DHS-based tasks in SUSTAINBENCH. As shown in Table A4, the DHS-based datasets in SUSTAINBENCH build on the previous works of Jean et al. [52] and Yeh et al. [109], which pioneered the application of computer vision on satellite imagery to estimate a cluster-level asset wealth index. Notably, for the task of predicting poverty over space, SUSTAINBENCH's dataset is nearly 5× larger than the dataset included in [109] (over 2× the number of countries, and 3× the temporal coverage). Our dataset also has advantages over other related works, which often rely on proprietary imagery inputs [52, 45, 36], are limited to a small number of countries [12, 30, 64, 36, 105], or have coarser label resolution [73]. Other researchers have explored using non-imagery inputs for poverty prediction, including Wikipedia text data [87] and cell phone records [15]; while such multi-modal data are not currently in SUSTAINBENCH, we are considering including them in future versions. For the non-poverty tasks pertaining to health, education, and water/sanitation, there are extremely few ML-friendly datasets. Head et al. [45] comes closest to SUSTAINBENCH in having predicted similar indicators (women's BMI, women's education, and clean water) derived from DHS survey data. Also, like us, their results suggest that satellite imagery may be less accurate at predicting these non-poverty labels in developing countries. However, because they used proprietary imagery inputs, their dataset is not accessible and cannot serve as a public benchmark. A large collaborative effort [65] gathered survey and census data for creating clean water and sanitation labels in over 80 countries, but they did not provide satellite imagery inputs and only publicly released outputs of their geostatistical model, not the labels themselves. Again, SUSTAINBENCH has significant advantages over other related works that use proprietary data [45, 36, 67], are limited to a small number of countries [36, 64], or do not publicly release their labels [65]. Dataset Impact Most low-income regions lack data on income and wealth at fine spatial scales. Even at coarse spatial scales, temporal resolution can still be poor; Figure 1 in Burke et al. [20] shows that, in some countries, as many as two decades can pass between successive nationally representative economic surveys. Inferring economic welfare from satellite or street-level imagery offers one solution to the lack of surveys. Indeed, many governments turned to ML-based poverty mapping techniques during the COVID-19 pandemic to identify and prioritize vulnerable populations for targeted aid programs. For example, the government of Togo wanted to send aid to over 500,000 vulnerable people impacted by the COVID-19 pandemic.

Unlike DHS surveys, LSMS provides panel data, i.e., the same households are surveyed over time, facilitating comparisons over time. We start by compiling the same survey variables as in the DHS asset index, except for refrigerator ownership, because it is not included in the LSMS Uganda survey. (See the previous section for details on the survey variables included for the DHS asset index.) As with the DHS asset index, we convert the "floor type", "water source type", and "toilet type" variables from descriptive categorical variables to a 1-5 ranked scale. Based on the panel survey data, we calculate two PCA-based measures of change in asset wealth over time for each household: diffOfIndex and indexOfDiff. For diffOfIndex, we first assign each household-year an asset index computed as the first principal component of all the asset variables; this is the same approach used for the DHS asset index. Then, for each household, we calculate the difference in the asset index across years, which yields a "change in asset index" (hence the name diffOfIndex). In contrast, indexOfDiff is created by first calculating the difference in asset variables in households across pairs of surveys for each country and then computing the first principal component of these differences; for each household, this yields an "index of change in assets" across years (hence the name indexOfDiff). These measures are then averaged to the cluster level to create cluster-level labels. We excluded a cluster if it contained fewer than 3 surveyed households. The LSMS-based labels include data for 2,763 cluster-years (comprising 17,215 household-years) from 11 surveys for 5 African countries. Table A6 gives the full list of LSMS surveys used, and Table A7 gives the number of clusters and households included for each country. See Figure A4 for an example of the satellite imagery inputs.
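A sketch of the two change measures, assuming aligned (households × asset variables) arrays for the two survey years; the standardization details are assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA

def first_pc(x):
    """First principal component of standardized columns (sign is arbitrary)."""
    z = (x - x.mean(axis=0)) / x.std(axis=0)
    return PCA(n_components=1).fit_transform(z)[:, 0]

def change_measures(assets_t0, assets_t1):
    """assets_t0, assets_t1: (households, variables) arrays for the same
    households in two survey years, in the same row order."""
    # diffOfIndex: index every household-year jointly, then difference.
    idx = first_pc(np.vstack([assets_t0, assets_t1]))
    n = assets_t0.shape[0]
    diff_of_index = idx[n:] - idx[:n]
    # indexOfDiff: difference the asset variables first, then index.
    index_of_diff = first_pc(assets_t1 - assets_t0)
    return diff_of_index, index_of_diff
```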
The labels and inputs provided in SUSTAINBENCH for this task are similar (but not identical) to the labels and inputs used in [109]. While the underlying LSMS survey data used are the same, there are 3 key differences. 1. In SUSTAINBENCH, for each country, we only used data from households that are present in all surveys of that country. In Uganda, for example, we only keep households that were surveyed repeatedly in all of the 2005, 2009, and 2013 surveys. This is different from [109], which included any household that was present in two survey years, e.g., a household in Uganda 2005 and Uganda 2009, but not Uganda 2013. 2. The recoding of the floor, water, and toilet quality variables was made more consistent across countries and now closely matches the ranking introduced in [65]. 3. As in the case of the DHS-based datasets, the satellite imagery inputs have been improved. See Table A4 for details. To the best of our knowledge, the LSMS-based poverty change over time dataset in SUSTAINBENCH and its predecessor in [109] are the only datasets specifically designed as an index of asset wealth change. For related works on mapping poverty, see the "Comparison with Related Works" for DHS-based tasks in Appendix D.1. We release a dataset for performing weakly supervised classification of cropland in the United States using the data from Wang et al. [102], which has not been released previously. While densely segmented labels are time-consuming and infeasible to generate for a region as large as Sub-Saharan Africa, pixel-level and image-level labels are often already available and much easier to create. Figure A5 shows an example from the dataset. The study area spans from 37°N to 41°30'N and from 94°W to 86°W, and covers an area of over 450,000 km² in the United States Midwest. We chose this region because the US Department of Agriculture (USDA) maintains high-quality pixel-level land cover labels across the US [69], allowing us to evaluate the performance of algorithms. Land cover-wise, the study region is 44% cropland and 56% non-crop (mostly temperate forest). The Landsat Program is a series of Earth-observing satellites jointly managed by the USGS and NASA. Landsat 8 provides moderate-resolution (30m) satellite imagery in seven surface reflectance bands (ultra blue, blue, green, red, near infrared, shortwave infrared 1, shortwave infrared 2) designed to serve a wide range of scientific applications. Images are collected on a 16-day cycle. We computed a single composite by taking the median value at each pixel and band from January 1, 2017 to December 31, 2017. We used the quality assessment band delivered with the Landsat 8 images to mask out clouds and shadows prior to computing the median composite. The resulting seven-band image spans 4.5 degrees latitude and 8.0 degrees longitude and contains just over 500 million pixels. The composite was then divided into 200,000 tiles of 50×50 pixels each. This full dataset was not released previously with Wang et al. [102]. The ground truth labels from the Cropland Data Layer [69] are at the same spatial resolution as Landsat, so for every Landsat pixel there is a corresponding {cropland, not cropland} label. For each image, we generate two types of weak labels: (1) single pixel and (2) image-level, both with the goal of generating dense semantic segmentation predictions. The image-level label is in {≥ 50% cropland, < 50% cropland}.
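A sketch of deriving the two weak label types from a dense mask; the random sampling scheme is an assumption, and [102] may select labeled pixels differently:

```python
import numpy as np

def weak_labels(dense_mask, rng=None):
    """Derive both weak label types from a dense {0,1} cropland mask
    (e.g., a 50x50 tile of Cropland Data Layer labels)."""
    rng = np.random.default_rng(0) if rng is None else rng
    h, w = dense_mask.shape
    # (1) Single-pixel label: the class at one randomly chosen pixel.
    i, j = int(rng.integers(h)), int(rng.integers(w))
    pixel_label = ((i, j), int(dense_mask[i, j]))
    # (2) Image-level label: 1 if the tile is >= 50% cropland, else 0.
    image_label = int(dense_mask.mean() >= 0.5)
    return pixel_label, image_label
```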
Comparison with Related Works Cropland has already been mapped globally [18, 35] or for the continent of Africa [106] in multiple state-of-the-art land cover maps. However, existing land cover maps are known to have low accuracy throughout the Global South [56]. One reason behind this low accuracy is that existing maps have been created with SVM or tree-based algorithms that take into account a single pixel at a time [18, 35, 106]. Kerner et al. [56] showed that a multi-headed LSTM (still trained on single pixels) outperformed SVM and random forest classifiers on cropland prediction in Togo. Using a larger spatial context, e.g., in a CNN, could lead to further accuracy gains. However, ground label scarcity remains a bottleneck for applying deep learning models to map cropland. Wang et al. [102] showed that weak labels in the form of single-pixel or image-level classes can still supervise a U-Net to segment cropland at accuracies better than SVM or random forest classifiers. We release this dataset, which is the first dataset for weakly supervised cropland mapping, as a benchmark for algorithm development. The dataset is in the U.S. Midwest because cropland labels there are of high accuracy; methods developed on this dataset could be paired with newly generated weak labels in low-income regions to generate novel, high-accuracy cropland maps (see below for an example application). Dataset Impact High-accuracy cropland mapping in the Global South can have significant impacts on the planning of government programs and downstream tasks like crop type mapping and yield prediction. For instance, during the COVID-19 pandemic, the government of Togo announced a program to boost national food production by distributing aid to farmers. However, the government lacked high-resolution spatial information about the distribution of farms across Togo, which was crucial for designing this program. Existing global land cover maps, despite including a cropland class, were low in accuracy across Togo. The government collaborated with researchers at the University of Maryland to solve this problem, and in Kerner et al. [56] the authors created a high-resolution map of cropland in Togo for 2019 in under 10 days. The authors pointed out that this case study demonstrates "a successful transition of machine learning research to operational rapid response for a real humanitarian crisis" [56].

Figure A6: An example from the crop type mapping dataset [83]. The left image represents a satellite image time series (the figure displays PlanetScope imagery) and the right image represents a segmentation map.

Figure A7: Example time series of the GCVI band computed from Sentinel-2 satellite bands [58], after clouds were masked out. Both examples happen to be of the crop type "Cassava".

As introduced in [83], these datasets contain satellite imagery from Ghana and South Sudan. Sentinel-1 (10m resolution), Sentinel-2 (10m resolution), and Planet's PlanetScope (3m resolution) time series imagery are used as inputs for this task. As described in [83], Planet imagery is incorporated to help mitigate issues from high cloud cover and small field sizes. We include three S1 bands (VV, VH, VH/VV), ten S2 bands (blue, green, red, near infrared, four red edge bands, two shortwave infrared bands), and all four PlanetScope bands (blue, green, red, near infrared). We also construct normalized difference vegetation index (NDVI) and green chlorophyll vegetation index (GCVI) bands for PlanetScope and S2 imagery.
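The two vegetation indices are standard per-pixel band ratios, computed from the red, green, and near-infrared (NIR) bands:

```python
def ndvi(nir, red, eps=1e-8):
    """Normalized difference vegetation index: (NIR - Red) / (NIR + Red)."""
    return (nir - red) / (nir + red + eps)

def gcvi(nir, green, eps=1e-8):
    """Green chlorophyll vegetation index: NIR / Green - 1."""
    return nir / (green + eps) - 1.0
```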
Ground truth labels consist of a 64×64 pixel segmentation map, with each pixel containing a crop label. Ghana locations are labeled for Maize, Groundnut, Rice, and Soya Bean, while South Sudan locations are labeled for Sorghum, Maize, Rice, and Groundnut.

Comparison with Related Works SUSTAINBENCH's crop type datasets and existing crop type datasets are summarized in Table A8. A version of SUSTAINBENCH's Ghana/South Sudan dataset was released previously and is currently housed on Radiant MLHub. We highlight key differences between SUSTAINBENCH's dataset and the one used in Rustowicz et al. [83]. We use the same train, validation, and test splits used in [83], though we use the full 64×64 imagery provided, while [83] further subdivided imagery into 32×32 pixel grids due to memory constraints. We also include variable-length time series with zero padding and masking, while [83] trimmed the respective time series down to the same length. We include variable-length time series with the reasoning that future research should be extendable to variable-length time-series imagery. The metrics cited in Table 2 are from the models in [83].

The train, validation, and test sets are split by region to encourage discovery of features and development of methods that generalize across regions. One region is the training and validation region, while the other two regions are test regions.

Comparison with Related Works SUSTAINBENCH's crop type datasets and existing crop type datasets are compared in Table A8. A dataset was only included in the table if it is publicly available and provides inputs and outputs in ML-friendly formats. There is considerable work underway in the remote sensing community, led by the Radiant Earth Foundation, to collect and disseminate crop type data to improve state-of-the-art classification. SUSTAINBENCH's crop type dataset in Kenya complements existing datasets. It is one of the largest available crop type datasets in a smallholder system. It also has defined train/val/test splits and baselines, which not all public crop type datasets do. One of the train/val/test split options is also designed to test model generalizability across geography by splitting along geographic clusters, which no other datasets do. We recommend that ML researchers test their methods on as many available datasets as possible to ensure model generalizability.

Dataset Impact The crop type labels that we released in Kenya were the same labels used to create the first-ever maize classification and yield map across that entire country [54]. Kenya is one of the largest maize producers in sub-Saharan Africa, and studying maize production there could improve food security in the region. Jin et al. [54] used a random forest trained on seasonal median composites of satellite imagery to predict maize with an accuracy of only 63%. It is worth investigating how other machine learning models using a year's full time series could improve on this. As an example of downstream impact, the analysis in [54] revealed that 72% of variation in predicted maize yields could be explained by soil factors, suggesting that increasing nitrogen fertilizer application should be a priority for increasing smallholder yields in Kenya.

These datasets are constructed as an expansion of the dataset used in [101]. They are created using Moderate Resolution Imaging Spectroradiometer (MODIS) satellite imagery, which is freely accessible via Google Earth Engine and provides coverage of the entire globe.
Specifically, we use 8-day composites of MODIS images to get 7 bands of surface reflectance at different wavelengths (3 visible and 4 infrared bands) from the MOD09A1 collection [97], 2 bands of day and night surface temperatures from MYD11A2 [100], and a land cover mask from MCD12Q1 [34] to distinguish cropland from other land. For each of the 9 bands of reflectance and temperature imagery and each of the 32 timesteps within a year's harvest season, we bin pixel values into 32 ranges, giving a final histogram of size 32 × 32 × 9 (see the sketch below). We create one such dataset for each of Argentina, Brazil, and the United States, with 9,049 datapoints for the United States, 1,615 for Argentina, and 384 for Brazil. The ground truth labels are the regional crop yield per harvest, in metric tonnes per cultivated hectare, as collected from the Argentine Undersecretary of Agriculture [8], the Brazilian Institute of Geography and Statistics [17], and the United States Department of Agriculture [95].

Comparison with Related Works SUSTAINBENCH releases the crop yield datasets from two previous works [110, 101] for the first time. To date, very few crop yield datasets exist, because yields require expensive farm survey techniques (e.g., crop cuts) to measure. The datasets that do contain field-level yields are privately held by researchers, government agencies, or NGOs. SUSTAINBENCH's datasets therefore provide yields at the county level. Furthermore, crop yield prediction is challenging as it requires processing a temporal sequence of satellite images. We provide ML-friendly inputs in the form of histograms of weather and satellite features over each county.

Dataset Impact Tracking crop yields is crucial to measuring agricultural development and deciding resource allocation, with downstream applications to food security, price stability, and agricultural worker income. Notably, most developed countries invest in forecasting and tracking crop yield. For example, the European Commission JRC's crop yield forecasts and crop production estimates inform the EU's Common Agricultural Policy and other agricultural programs [1]. By involving satellite images in the crop yield prediction process, we aim to make timely predictions available in developing countries where ground surveys are costly and infrequent. Furthermore, we provide satellite histograms rather than human-engineered indices like NDVI, which are more human-friendly for visualization but discard a significant amount of potentially relevant information. In doing so, we hope to encourage the development of ML techniques that make use of more complete and useful features to generate better predictions.
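To make the histogram inputs described above concrete, the sketch below (our illustration, with hypothetical array shapes and caller-supplied per-band ranges; the released data fixes its own bin edges) converts one county-year of MODIS pixels into the 32 × 32 × 9 histogram:

```python
import numpy as np

def to_histogram(pixels, band_ranges, n_bins=32):
    """pixels: (n_pixels, 32, 9) array of cropland-masked MODIS values over
    32 timesteps and 9 bands; band_ranges: (9, 2) per-band (min, max)."""
    n_pixels, n_steps, n_bands = pixels.shape
    hist = np.zeros((n_bins, n_steps, n_bands))
    for b in range(n_bands):
        edges = np.linspace(band_ranges[b, 0], band_ranges[b, 1], n_bins + 1)
        for t in range(n_steps):
            hist[:, t, b], _ = np.histogram(pixels[:, t, b], bins=edges)
    return hist / max(n_pixels, 1)  # normalize counts to frequencies
```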
As introduced in [9], the dataset consists of Sentinel-2 satellite imagery in France over the 3 time ranges January-March, April-June, and July-September in 2017. Each image has resolution 224 × 224, corresponding to a 2.24 km × 2.24 km area on the ground. Each satellite image comes with corresponding binary masks of the boundaries and areas of farm parcels. The dataset consists of 1,572 training samples, 198 validation samples, and 196 test samples. We use a different data split from Aung et al. [9] to remove overlap between the train, validation, and test splits. An example of the dataset is shown in Figure A9.

Figure A9: An example from the field delineation dataset [9]. From left to right: (a) an input Sentinel-2 image, (b) its corresponding delineated boundaries, and (c) its corresponding segmentation masks.

Comparison with Related Works To our knowledge, SUSTAINBENCH has released the first public field boundary dataset with satellite image inputs and ML-friendly outputs. Some countries in Europe (e.g., France) have made vector files of field boundaries public on their government websites, but without corresponding satellite imagery inputs or raster field boundary outputs; we provide both. While field segmentation datasets from the U.S., South Africa, and Australia were used in prior field delineation research [107, 98, 99], none of those datasets are publicly available. We are also currently working on collecting field boundaries in low-income countries, but this data will be added to SUSTAINBENCH at a later date.

Dataset Impact Automated field delineation makes it easier for farmers to access field-level analytics; previously, manual boundary input was a major deterrent to adopting digital agriculture [98]. Digital agriculture can improve yields while minimizing the use of inputs like fertilizer that cause environmental pollution, with the net effect of increasing farmer profit. The development of a new attention-based neural network architecture (called FracTAL ResUNet) enabled the delineation of 1.7 million fields in Australia from satellite imagery [99]. These field boundaries have since been productized by CSIRO, the Australian government agency for scientific research. This is an example where a novel deep learning architecture enabled the creation of operational products in agriculture. However, the Australia dataset is not publicly available. Our goal is for the release of SUSTAINBENCH's field boundary dataset in France to enable further architecture development and identify which model works best for field delineation.

Brick manufacturing is a major source of pollution in South Asia, but the industry is largely composed of small-scale, informal producers, making it difficult to monitor and regulate. Identifying brick kilns automatically from satellite imagery can help improve compliance with environmental regulations and measure their impact on the health of nearby populations.

Figure A10: An example of Sentinel-2 satellite imagery for brick kiln classification. On the left is a positive example of an image showing a brick kiln, while the right image is a negative example (i.e., no brick kiln).

We provide Sentinel-2 satellite imagery at 10m/pixel resolution available through Google Earth Engine [39]. The images have size 64×64×13 px, where the order of the bands corresponds to bands B1 through B12 in the Earth Engine Data Catalog, where B2 is Blue, B3 is Green, and B4 is Red. The other bands include aerosols, color infrared, short-wave infrared, and water vapor data.

Comparison with Related Works A recent study detected brick kilns from high-resolution (1m/pixel) satellite imagery and hand-validated the predictions, providing ground truth locations of brick kilns in Bangladesh for the period October 2018 to May 2019 [63]. That imagery could not be shared publicly because it is proprietary; hence, we provide Sentinel-2 satellite imagery instead. With help from domain experts, we verified the labels of each image as not containing a brick kiln (class 0) or containing a brick kiln (class 1) based on the ground truth locations provided by [63]. There were roughly 374,000 examples total, with 6,329 positives.
We sampled 25% of the remaining negatives, removed any null values, and included the remaining 67,284 negative examples in our dataset.

Dataset Impact SUSTAINBENCH introduces the first publicly released dataset of this size and quality for detecting brick kilns across Bangladesh from satellite imagery. This dataset was manually labeled and verified in-house by domain experts. Brick kiln detection is a challenging task because of the sparsity of kilns and the lack of similar training data, but with recent developments in satellite monitoring [63], it plays a key role in informing policy developed by public health experts, industry stakeholders (e.g., kiln owners), and government agencies [88]. SUSTAINBENCH is the first to contribute a large dataset for this task, and the results of models can be utilized by policymakers.

The dataset from Jean et al. [53] uses imagery from the USDA's National Agriculture Imagery Program (NAIP), which provides aerial imagery for public use with four spectral bands (red (R), green (G), blue (B), and infrared (N)) at 0.6 m ground resolution.

Figure A11: Example images from the NAIP dataset collected by Jean et al. [53]. The left image is an example of the "Grapes" class and the right image is an example of the "Urban" class.

Comparison with Related Works Representation learning on natural images often uses canonical computer vision datasets like ImageNet and Pascal VOC to evaluate new methods. Satellite imagery lacks an analogous dataset. The high-resolution aerial imagery dataset released in SUSTAINBENCH aims to fill this void for land cover mapping with high-resolution inputs in particular. We note that, for object detection or lower-resolution inputs, repurposing a dataset like fMoW [23], SpaceNet [96], Sen12MS [86], or BigEarthNet [89] would also be appropriate. To our knowledge, such repurposing has not yet been done.

Dataset Impact Many tasks in sustainability monitoring have abundant unlabeled imagery but scarce labels. Land cover mapping in low-income regions is one example; crop type mapping in smallholder systems is another. By learning representations of satellite images in an unsupervised or self-supervised way, we may be able to improve performance on SDG-related tasks for the same number of training labels.

Wang et al. [104] sampled one thousand 10 km × 10 km regions uniformly at random from the Earth's land surface, and removed regions with fewer than 2 unique land cover classes as well as regions where one land cover type comprises more than 80% of the region's area. This resulted in 692 regions across 105 countries. The authors placed the 103 regions from Sub-Saharan Africa into the meta-test set and split the remainder into 485 meta-train and 104 meta-val regions at random. We provide the user with the option of placing any continent into the meta-test set and splitting the other continents' regions at random between the meta-train and meta-val sets. In each region, 500 points were sampled uniformly at random. At each point, the MODIS Terra Surface Reflectance 8-Day time series was exported for January 1, 2018 to December 31, 2018 (Figure A12). MODIS collects 7 bands, and NDVI was computed as an eighth feature, resulting in a time series of dimension 8 × 46. Global land cover labels came from the MODIS Terra+Aqua Combined Land Cover Product, which classifies every 500m-by-500m pixel into one of 17 land cover classes (e.g., grassland, cropland, desert).
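As described in Appendix E below, Wang et al. [104] define 1-shot, 2-way classification tasks over these points. A hypothetical sketch of sampling one such episode from a region's arrays (time series of shape (500, 8, 46) and integer labels of shape (500,)) is:

```python
import numpy as np

def sample_episode(series, labels, n_way=2, k_shot=1, n_query=5, seed=None):
    """Sample a k-shot, n-way episode; assumes each sampled class has at
    least k_shot + n_query labeled points in the region."""
    rng = np.random.default_rng(seed)
    classes = rng.choice(np.unique(labels), size=n_way, replace=False)
    support, query = [], []
    for c in classes:
        idx = rng.permutation(np.flatnonzero(labels == c))
        support.append(series[idx[:k_shot]])
        query.append(series[idx[k_shot:k_shot + n_query]])
    return classes, np.concatenate(support), np.concatenate(query)
```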
Comparison with Related Works This SUSTAINBENCH dataset from [104] is, to our knowledge, the first few-shot learning dataset released for satellite data. Because land cover products are available globally (albeit with varying accuracy), Wang et al. [104] created a few-shot dataset for land cover classification.

Dataset Impact Our hope is that this dataset can be included in evaluations of few-shot learning algorithms to see how they perform on real-world time series, and that new algorithms will improve knowledge sharing from high-income regions to low-income ones, increasing performance on remote sensing tasks with few labels in low-income regions.

Code to reproduce baseline models new to SUSTAINBENCH can be found in our GitHub repo.

E.1 DHS survey-based regression tasks (SDGs 1, 3, 4, 6)

The DHS survey-based regression tasks include predicting an asset wealth index (SDG 1), women's BMI and child mortality rates (SDG 3), women's educational attainment (SDG 4), and water and sanitation indices (SDG 6). We adapt the KNN scalar NL model from [109] as the SUSTAINBENCH baseline model for these tasks. We chose this model for its simplicity and its high performance on predicting asset wealth, as noted in [109]. For each label, we fitted a k-nearest neighbors (k-NN) regressor implemented using scikit-learn; the k hyperparameter was tuned on the validation split, taking integer values between 1 and 20, inclusive. The input to the k-NN model is the mean nightlights value from the nightlights band of the satellite input image, with separate models trained for the DMSP (survey year ≤ 2011) and VIIRS (survey year ≥ 2012) bands.

We observe that our KNN nightlights baseline model roughly matches the performance described in [109] on the poverty prediction over space task (r² = 0.63). However, its r² values for predicting the other non-poverty labels are much lower: child mortality rate (r² = 0.01), women's BMI (0.42), women's education (0.26), water index (0.40), sanitation index (0.36). Our result is in line with a similar observation made by [45], which also found that models trained on satellite images were better at predicting the asset wealth index than other non-poverty labels in 4 African countries. This suggests that predicting these other labels requires different models and/or inputs. Indeed, this is why SUSTAINBENCH provides street-level imagery in addition to satellite imagery. While SUSTAINBENCH provides street-level images for many DHS clusters, we do not yet have baseline models that take advantage of the street-level imagery. Some preliminary results using street-level imagery to predict asset wealth and women's BMI are shown in [64], although the authors only tested their models on India and Kenya (compared to the ~50 countries included for DHS-based tasks in SUSTAINBENCH).

The goal for the task with a single-pixel label is to predict whether the single labeled pixel in the image is cropland or not. The goal for the task with image-level labels is to detect whether the majority (≥50%) of pixels in an image belong to the cropland category. In both cases, the model is a U-Net trained using the binary cross entropy loss

L(y, ŷ) = −[ y log ŷ + (1 − y) log(1 − ŷ) ],   (1)

where y is either the single-pixel label or the image-level binary label and ŷ is the single-pixel or image-level model prediction. The evaluation metrics are test set accuracy, precision, recall, and F1 score. Details about the dataset are provided in Appendix D.3.
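The sketch below illustrates the two forms of weak supervision in Equation (1); the mean pooling used to produce the image-level prediction is our assumption, standing in for whatever pooling the released baseline uses.

```python
import torch
import torch.nn.functional as F

def weak_bce_loss(logits, labels, pixel_idx=None):
    """logits: (B, 1, 50, 50) per-pixel U-Net outputs; labels: (B,) floats
    in {0, 1}. pixel_idx: (B, 2) row/col of the single labeled pixel, or
    None for image-level labels."""
    if pixel_idx is not None:  # single-pixel supervision
        b = torch.arange(logits.shape[0])
        pred = logits[b, 0, pixel_idx[:, 0], pixel_idx[:, 1]]
    else:  # image-level supervision via mean pooling (an assumption)
        pred = logits.mean(dim=(1, 2, 3))
    return F.binary_cross_entropy_with_logits(pred, labels)
```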
Comparison with Related Works As mentioned in Appendix D.3, existing cropland products have been created using SVMs or tree-based algorithms that consider a single pixel at a time [18, 35, 106]. In Togo, Kerner et al. [56] showed that a multi-headed LSTM (still trained on single pixels) outperformed these classifiers on cropland prediction. Since SUSTAINBENCH's cropland dataset is a static mosaic over the growing season, we chose to keep the U-Net from Wang et al. [102] as the backbone architecture for the baseline. Segmentation models that are more state-of-the-art than the U-Net would be good candidates to surpass this baseline. Active learning or semi-supervised learning methods could also beat a baseline that uses randomly sampled weak labels for supervision. Future updates to this cropland dataset can include the temporal dimension for cropland mapping as well.

The architecture described in Rustowicz et al. [83] obtained an average F1 score and overall accuracy of 0.57 and 0.61 in Ghana and 0.70 and 0.85 in South Sudan, respectively, demonstrating the difficulty of this task. We use the same train, validation, and test splits as [83]. However, we use the full 64 × 64 imagery provided, while [83] further subdivided imagery into 32 × 32 pixel grids due to memory constraints. We also include variable-length time series with zero padding and masking, while [83] trimmed the respective time series down to the same length. We include variable-length time series with the reasoning that future research should be extendable to variable-length time-series imagery. Due to these changes, we do not include baseline models from [83] for this iteration of the dataset. We provide more details in Appendix D.4.

Comparison with Related Works Like cropland maps, most operational works classifying crop types employ SVM or random forest classifiers [69, 49]. The baseline model that we use from Rustowicz et al. [83] improves upon these by using an LSTM-CNN. Recent models used in other, non-operational works include 1D CNNs and 3D CNNs [103] and kNN [55]. A recent review comparing five deep learning models found that 1D CNN, LSTM-CNN, and GRU-CNN all achieved high accuracy on classifying crop types in China, with statistically insignificant differences between them [111].

The crop type data in Kenya come from three regions: Bungoma, Busia, and Siaya. We provide ML researchers with the option of splitting fields randomly or by region. The former setup tests the crop type classifier's ability to distinguish crop types in-domain, while the latter tests the classifier's out-of-domain generalization. In Table 2, we show results for the latter from [58]. In Kluger et al. [58], the authors trained on one region and tested on the other two in order to design algorithms that transfer from one region to another. In order to generalize across regions, they corrected for (1) crop type class distribution shift and (2) feature shift between regions by estimating the shift using a linear model. The features used are the coefficients of a harmonic regression on Sentinel-2 time series (a sketch of this featurization appears below). (In the field of remote sensing, the Fourier transform is a common way to extract features from time series [54].) The results from Kluger et al. [58] show that harmonic features achieve a macro F1-score of 0.30 when averaged across the three test sets, highlighting the difficulty of this problem. Note that this baseline did not include the Non-crop class in the analysis.
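A minimal sketch of harmonic-regression featurization, assuming an annual period and two harmonics (the exact specification in [58] may differ):

```python
import numpy as np

def harmonic_features(t, y, n_harmonics=2, period=365.0):
    """t: observation times in days; y: one band/index series (e.g., GCVI).
    Returns least-squares coefficients, used as classifier features."""
    cols = [np.ones_like(t)]
    for k in range(1, n_harmonics + 1):
        cols.append(np.sin(2 * np.pi * k * t / period))
        cols.append(np.cos(2 * np.pi * k * t / period))
    X = np.stack(cols, axis=1)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef
```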
Comparison with Related Works We expect that, for in-domain crop type classification, the methods mentioned previously (1D CNN, LSTM-CNN, GRU-CNN) will outperform the random forests and LDA used in [54] and [58]. However, for cross-region crop type classification, Kluger et al. [58] found that a simpler LDA classifier outperformed a more complex random forest. Nonetheless, deep learning-based algorithms that are designed for out-of-domain generalization could outperform the baseline. To our knowledge, these methods have not yet been tested on crop type mapping.

The task is to predict the county-level crop yield for that season, in metric tonnes per cultivated hectare, from the MODIS spectral histograms. We split the task into three separate subtasks of crop yield prediction in the United States, Argentina, and Brazil, and provide a 60-20-20 train-validation-test split. For each subtask, we encourage the use of transfer learning and other cross-dataset training, especially given the imbalance in data availability between the United States, Argentina, and Brazil.

Comparison with Related Works Several past works apply machine learning algorithms to human-engineered satellite features, such as linear regression over NDVI [76] and EVI2 [16]. The papers that originally compiled SUSTAINBENCH's datasets compared against these methods and outperformed them. A few other works, like Sun et al. [90], apply different architectures to spectral histograms similar to those provided in SUSTAINBENCH. Still other methods report results trained on ground-based data, such as ground-level images of crops [91], but these datasets have not been made public.

Given an input satellite image, the goal is to output the delineated boundaries between farm parcels, or the segmentation masks of farm parcels [9]. Similar to [9], given the predicted delineated boundaries of an image, we use the Dice score as the evaluation metric:

Dice = 2·TP / (2·TP + FP + FN),   (2)

where "TP" denotes True Positives, "FP" denotes False Positives, and "FN" denotes False Negatives. As discussed in [9], the Dice score in Equation (2) has been widely used in image segmentation tasks and is often argued to be a better metric than accuracy when class imbalance between boundary and non-boundary pixels exists.

Comparison with Related Works While the original paper that compiled SUSTAINBENCH's field delineation dataset achieved a Dice score of 0.61 with a standard U-Net [9], we applied a new attention-based CNN developed specifically for field delineation [99] and achieved a 0.87 Dice score. To our knowledge, this is the state-of-the-art deep learning model for field delineation.

The task is binary classification on satellite imagery, where class 0 ("no kiln") means there is no brick kiln present in the image and class 1 ("kiln") means there is a brick kiln. The training-validation split of the provided Sentinel-2 imagery is 80-20. The ResNet50 [43] model trained in [63] achieved 94.2% accuracy on classifying high-resolution (1m/pixel) imagery; the authors hand-validated all positive predictions and 25% of negative predictions. That imagery was not released publicly because it is proprietary, so we instead report a baseline validation accuracy of 94.5% from training a ResNet50 model on lower-resolution Sentinel-2 imagery using only the Red, Green, and Blue bands (B4, B3, B2); a sketch of this baseline follows. In addition to accuracy on the validation set, AUC, precision, and recall are also valuable metrics given the class skew toward negative examples.
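A minimal sketch of such an RGB ResNet50 baseline, assuming torchvision; this is our illustration, and the SUSTAINBENCH training script may use different hyperparameters:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

# Binary brick kiln classifier on 64x64 RGB crops (Sentinel-2 B4, B3, B2).
model = resnet50(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 2)  # {no kiln, kiln}

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(images, labels):
    # images: (B, 3, 64, 64) float tensor; labels: (B,) long tensor
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```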
Jean et al. [53] performed land cover classification using features learned through an unsupervised contrastive-loss algorithm named Tile2Vec. Since the features are learned in an entirely unsupervised way, they can be paired with any number of labels to train a classifier. At n = 1,000 labels, Tile2Vec features with a multi-layer perceptron (MLP) classifier achieved 0.55 accuracy; at n = 10,000, Tile2Vec features with an MLP achieved 0.58 accuracy. Notably, Tile2Vec features outperformed end-to-end training of a CNN sharing the same architecture as the feature encoder up to n = 50,000 labels.

Comparison with Related Works Jean et al. [53] was the first work to apply the distributional hypothesis from NLP to satellite imagery in order to learn features in an unsupervised way. Tile2Vec features outperformed features learned via other unsupervised algorithms like autoencoders and PCA. Methods that have not yet been tried but could yield high-quality representations include inpainting missing tiles, solving a jigsaw puzzle of scrambled satellite tiles, colorization, and other self-supervised learning techniques. Recently, [80] proposed a representation learning approach that uses randomly sampled patches from satellite imagery as convolutional filters in a CNN encoder, which could also be tested on this dataset.

Wang et al. [104] defined 1-shot, 2-way land cover classification tasks in each region, and compared the performance of a meta-learned CNN with pre-training/fine-tuning and training from scratch. The meta-learned CNN performed the best on the meta-test set. The meta-learning algorithm used was model-agnostic meta-learning (MAML). The MAML-trained model achieved an accuracy of 0.74, F1-score of 0.72, and kappa score of 0.32 when averaged over all regions in Sub-Saharan Africa in the meta-test set. Unlike other classification benchmarks in SUSTAINBENCH, this benchmark uses the kappa statistic to evaluate models, because accuracy and F1-scores can vary widely across regions depending on the class distribution, and it is not clear from the values alone whether a given accuracy or F1-score is good or bad. We note that, as previously mentioned, existing land cover products tend to be less accurate in low-income regions such as Sub-Saharan Africa than in high-income regions. As a result, the MODIS land cover product used as ground truth will have errors in low-income regions. We suggest users also apply meta-learning and other transfer learning algorithms using other continents (e.g., North America, Europe) as the meta-test set for algorithm evaluation purposes.

Comparison with Related Works To our knowledge, [104] and [82] (by the same authors) were the first works to apply meta-learning to land cover classification in order to simulate sharing knowledge from high-income regions to low-income ones. The baseline cited in Table 2 uses MAML, one of the most widely used meta-learning algorithms. As the field of meta-learning is advancing quickly, we hope ML researchers will evaluate the latest meta-learning algorithms on this land cover classification dataset.

Because the SDGs are high-stakes issues with direct societal impacts ranging from local to global levels, it is imperative to exercise caution in addressing them. Researchers must be aware of and work to address potential biases in the training data and in the generated predictions. For example, current models have been observed to over-predict wealth in poor regions and under-predict wealth in rich regions [52].
If such a model were used to distribute aid, the poor would receive less than they should. Much work remains to be done to understand and rectify the biases present in ML model predictions before they can play a significant role in policy-making.

Because the SUSTAINBENCH dataset involves remote sensing and geospatial data that covers areas with private property, data privacy can be a concern. We summarize below the risks of revealing information about individuals present in each dataset.
• For our survey data (see Tables A3 and A6), the geocoordinates for DHS and LSMS survey data are jittered randomly up to 2 km for urban clusters and 10 km for rural clusters to protect survey participant privacy [19]. Furthermore, geocoordinates and labels are only released for "clusters" (roughly villages or small towns); no household or individually identifiable data is released.
• Mapillary images, as well as satellite images from Landsat, Sentinel-1, Sentinel-2, MODIS, DMSP, NAIP, and PlanetScope, are all publicly available. In particular, all of these satellites other than PlanetScope are low-resolution. Mapillary automatically blurs faces of human subjects and license plates, and it allows users who upload images to manually blur parts of images for privacy. Thus it is very difficult to obtain individually identifiable information from these images, and we believe that they do not directly constitute a privacy concern.
• The crop yield statistics, made publicly available by the governments of the US, Argentina, and Brazil, are published after aggregating over such large areas that the yields of individual farms cannot be derived.
• The crop type dataset released by Rustowicz et al. [83] has no geolocation information that would allow tracing to individuals. The satellite imagery released also has noise added so that it is more difficult to identify the original location and time at which the imagery was taken. The crop type dataset released in Kenya likewise does not include geolocation.
• For the field delineation dataset, boundary shapefiles are publicly available from the French government as part of the European Union's Common Agricultural Policy [9]. The data has been stripped of any identifying information about farmers.
• Brick kiln labels were generated by one of the authors under the guidance of domain experts. The version of this dataset released in SUSTAINBENCH consists of Sentinel-2 imagery, from which very few privacy-concerning details can be seen (see Figure A10).
• The labels used for the representation learning task and out-of-domain land cover classification task are products of other machine learning algorithms. They are publicly available and do not reveal information about individuals.

References
Crop yield forecasting.
The 2030 Agenda for Sustainable Development.
xView3: Dark Vessels.
Machine Learning and Mobile Phone Data Can Improve the Targeting of Humanitarian Assistance.
Multidimensional Poverty Measurement and Analysis.
Farm parcel delineation using spatio-temporal convolutional networks.
Landsat-based classification in the cloud: An opportunity for a paradigm shift in land cover monitoring. Remote Sensing of Environment.
Towards fine resolution global maps of crop yields: Testing multiple methods and satellites in three countries.
Poverty Mapping Using Convolutional Neural Networks Trained on High and Medium Resolution Satellite Images, With an Application in Mexico.
The iWildCam 2020 Competition Dataset.
Machine learning can help get COVID-19 aid to those who need it most.
Predicting poverty and wealth from mobile phone metadata.
Forecasting crop yield using remotely sensed vegetation indices and crop phenology metrics.
Produção agrícola municipal: produção das lavouras temporárias [Municipal agricultural production: production of temporary crops].
Copernicus Global Land Cover Layers-Collection 2.
Geographic displacement procedure and georeferenced data release policy for the Demographic and Health Surveys.
Using satellite imagery to understand and promote sustainable development.
Identification of the inverse relationship between farm size and productivity: An empirical analysis of peasant agricultural production.
Deep Neural Networks and Transfer Learning for Food Crop Identification in UAV Images. Drones.
Functional map of the world.
CSIRO. ePaddocks Australian Paddock Boundaries.
A global database of historic and real-time flood events based on social media.
DeepGlobe 2018: A Challenge to Parse the Earth through Satellite Images.
Land productivity and plot size: Is measurement error driving the inverse relationship?
Density estimation using Real NVP.
VIIRS night-time lights.
Poverty from space: using high-resolution satellite imagery for estimating economic well-being.
Estimating Wealth Effects Without Expenditure Data-Or Tears: An Application To Educational Enrollments In States Of India.
The effect of water and sanitation on child health: evidence from the demographic and health surveys 1986-2007.
MCD12Q1 MODIS/Terra+Aqua Land Cover Type Yearly L3 Global 500m SIN Grid V006.
Global land cover mapping from MODIS: algorithms and early results.
Using deep learning and Google Street View to estimate the demographic makeup of neighborhoods across the United States.
Cash in the City: Emerging Lessons from Implementing Cash Transfers in Urban Africa.
Google Earth Engine: Planetary-scale geospatial analysis for everyone. Remote Sensing of Environment.
Mapping local variation in educational attainment across Africa.
A dataset for assessing building damage from satellite imagery.
High-Resolution Global Maps of 21st-Century Forest Cover Change.
Deep residual learning for image recognition.
Momentum contrast for unsupervised visual representation learning.
Can Human Development be Measured with Satellite Imagery?
DMSP-OLS Radiance Calibrated Nighttime Lights Time Series with Intercalibration. Remote Sensing.
Densely connected convolutional networks.
Demographic and Health Surveys (various).
Assessment of an Operational System for Crop Type Map Production Using High Temporal and Spatial Resolution Satellite Optical Imagery.
ForestNet: Classifying Drivers of Deforestation in Indonesia using Deep Learning on Satellite Imagery.
Estimating spatial inequalities of urban child mortality.
Combining satellite imagery and machine learning to predict poverty.
Tile2Vec: Unsupervised Representation Learning for Spatially Distributed Data.
Smallholder maize area and yield mapping at national scales with Google Earth Engine.
Field-Level Crop Type Classification with k Nearest Neighbors: A Baseline for a New Kenya Smallholder Dataset.
Rapid Response Crop Maps in Data Sparse Regions.
Two shifts for crop mapping: Leveraging aggregate crop statistics to improve satellite-based maps in new regions.
WILDS: A Benchmark of in-the-Wild Distribution Shifts.
Learning multiple layers of features from tiny images.
Challenges and opportunities in mapping land use intensity globally.
Objects in Context in Overhead Imagery.
Scalable deep learning to identify brick kilns and aid regulatory capacity.
Predicting Livelihood Indicators from Community-Generated Street-Level Imagery.
Mapping geographical inequalities in access to drinking water and sanitation facilities in low-income and middle-income countries.
National cash transfer responses to Covid-19: operational lessons learned for social protection system-strengthening and future shocks.
Use of Deep Learning to Examine the Association of the Built Environment With Prevalence of Neighborhood Adult Obesity.
Climate Change.
USDA National Agricultural Statistics Service Cropland Data Layer. Published crop-specific data layer.
United Nations Department of Economic and Social Affairs, Division for Sustainable Development.
The Mapillary Vistas Dataset for Semantic Understanding of Street Scenes.
Accurate Privacy Blurring at Scale.
Using remotely sensed night-time light as a proxy for poverty in Africa.
Illuminating dark fishing fleets in North Korea.
PyTorch: An Imperative Style, High-Performance Deep Learning Library.
Semantic Segmentation of Crop Type in Africa: A Novel Dataset and Analysis of Deep Learning Methods.
Exploring Alternative Measures of Welfare in the Absence of Expenditure Data.
SEN12MS-A Curated Dataset of Georeferenced Multi-Spectral Sentinel-1/2 Imagery for Deep Learning and Data Fusion.
Predicting Economic Development using Geolocated Wikipedia Articles.
A Better Brick: Solving an Airborne Health Threat.
Bigearthnet: A large-scale benchmark archive for remote sensing image understanding.
County-Level Soybean Yield Prediction Using Deep CNN-LSTM Model.
Convolutional neural networks in predicting cotton yield from images of commercial fields.
Satellite imaging reveals increased proportion of population exposed to floods.
The Sustainable Development Goals Report 2021.
Tier Classification for Global SDG Indicators.
SpaceNet: A Remote Sensing Dataset and Challenge Series.
MOD09A1 MODIS/Terra Surface Reflectance 8-Day L3 Global 500m SIN Grid V006.
Deep learning on edge: Extracting field boundaries from satellite images with a convolutional neural network.
Detect, consolidate, delineate: Scalable mapping of field boundaries using satellite images.
MYD11A2 MODIS/Aqua Land Surface Temperature/Emissivity 8-Day L3 Global 1km SIN Grid V006.
Deep Transfer Learning for Crop Yield Prediction with Remote Sensing Data.
Weakly supervised deep learning for segmentation of remote sensing imagery.
Mapping Crop Types in Southeast India with Smartphone Crowdsourcing and Deep Learning.
Meta-learning for few-shot time series classification.
Socioecologically informed use of remote sensing data to predict rural household poverty.
Automated cropland mapping of continental Africa using Google Earth Engine cloud computing.
Conterminous United States crop field size quantification from multi-temporal Landsat data.
Bag-of-visual-words and spatial extensions for land-use classification.
Using publicly available satellite imagery and deep learning to understand economic well-being in Africa.
Deep Gaussian Process for Crop Yield Prediction Based on Remote Sensing Data.
Evaluation of Five Deep Learning Models for Crop Type Mapping Using Sentinel-2 Time Series Images with Missing Information.
A Framework for Sample Efficient Interval Estimation with Control Variates.
National Statistical Office, Government of Malawi.
Tanzania National Bureau of Statistics (NBS). Tanzania National Panel Survey. Tanzania: NBS. Ref: TZA_2010_NPS-R2_v03_M. Dataset downloaded on September 5, 2021.
Tanzania National Bureau of Statistics (NBS). Tanzania National Panel Survey Report (NPS) - Wave. Tanzania: NBS. Ref: TZA_2012_NPS-R3_v01_M. Dataset downloaded on

Acknowledgments
The authors would like to thank everyone from the Stanford Sustainability and AI Lab for constructive feedback and discussion; the Mapillary team for technical support on the dataset; Rose Rustowicz for helping compile the crop type mapping dataset in Ghana and South Sudan; Anna X. Wang and Jiaxuan You for their help in making the crop yield dataset; and Han Lin Aung and Burak Uzkent for permission to release the field delineation dataset. This work was supported by NSF awards (#1651565, #1522054), the Stanford Institute for Human-Centered AI (HAI), the Stanford King Center, the United States Agency for International Development (USAID), a Sloan Research Fellowship, and the Global Innovation Fund.