key: cord-0688313-huwoab22 authors: Tiede, Dirk; Schwendemann, Gina; Alobaidi, Ahmad; Wendt, Lorenz; Lang, Stefan title: Mask R‐CNN‐based building extraction from VHR satellite data in operational humanitarian action: An example related to Covid‐19 response in Khartoum, Sudan date: 2021-05-06 journal: Trans GIS DOI: 10.1111/tgis.12766 sha: 659c98c548e5b7a21a9138de32a3d0adadc7a1a3 doc_id: 688313 cord_uid: huwoab22 Within the constraints of operational work supporting humanitarian organizations in their response to the Covid‐19 pandemic, we conducted building extraction for Khartoum, Sudan. We extracted approximately 1.2 million dwellings and buildings, using a Mask R‐CNN deep learning approach from a Pléiades very high‐resolution satellite image with 0.5 m pixel resolution. Starting from an untrained network, we digitized a few hundred samples and iteratively increased the number of samples by validating initial classification results and adding them to the sample collection. We were able to strike a balance between the need for timely information and the accuracy of the result by combining the output from three different models, each aiming at distinctive types of buildings, in a post‐processing workflow. We obtained a recall of 0.78, precision of 0.77 and F1 score of 0.78, and were able to deliver first results only 10 days after the initial request. The procedure shows the great potential of convolutional neural network frameworks in combination with GIS routines for dwelling extraction even in an operational setting.
A central piece of information required for almost any humanitarian intervention is an accurate estimate of the population in need, and of the locations or areas where the affected population resides. In particular in protracted crises, traditional census data are missing or outdated. Building footprint polygons can be used to estimate population numbers, either by distributing known population numbers from larger administrative units onto the individual footprints (top-down, dasymetric mapping approach; cf. Eicher & Brewer, 2001) or by conducting microcensuses to calculate average occupancy rates per building over a small subset of buildings, and then extrapolating across the entire city (bottom-up approach; for an overview see Checchi, Stewart, Palmer, & Grundy, 2013). When building footprints are not available or up to date, for example in OpenStreetMap (OSM), they can be extracted from very high-resolution satellite images. In Khartoum, Sudan, a city with an estimated population of the order of 5.1 million inhabitants (Zerboni et al., 2020) and a surface area of around 1,000 km², the latest census dates back to 2008 (United Nations Department of Economic and Social Affairs, 2019). OSM contains a fairly complete street network, but building footprints are available only for a few parts of the city.
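The top-down, dasymetric approach mentioned above reduces to distributing a known administrative population total over the extracted footprints in proportion to their area. A minimal sketch in plain Python; the population total and footprint areas are illustrative values, not data from this study:

```python
def dasymetric_allocate(admin_total, footprint_areas):
    """Distribute a known population total over building footprints
    proportionally to footprint area (top-down dasymetric mapping)."""
    total_area = sum(footprint_areas)
    if total_area == 0:
        raise ValueError("no built-up area to allocate to")
    return [admin_total * a / total_area for a in footprint_areas]

# Illustrative example: 1,000 people over three buildings of 100/300/600 m².
print(dasymetric_allocate(1000, [100, 300, 600]))  # [100.0, 300.0, 600.0]
```

The bottom-up approach would instead multiply each footprint (or dwelling count) by an average occupancy rate derived from a microcensus.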
Therefore, in April 2020 Médecins sans Frontières (MSF) requested a recent map of dwelling density distribution across the city, and later a map of individual dwellings/building footprints. MSF supports four hospitals and five clinics in the city, providing care to Covid-19 patients, promoting infection prevention and control measures, and helping to keep essential medical services up and running. MSF used these maps to estimate health-care demand at the health-care posts they support. In this article we report on the use of a deep learning approach in such an operational humanitarian response context in a training-sample-scarce situation, in combination with GIS post-processing routines to combine and refine the results of the individual deep learning classifiers. Machine learning approaches, especially deep convolutional neural networks (CNNs), have demonstrated unprecedented object-extraction accuracies, also on remotely sensed images. Still, ready-to-use networks face challenges with unfamiliar contexts and less typical situations, where specific sample databases do not (yet) exist. Training or retraining established networks requires considerable effort in these situations, since the networks do not generalize very well (Marcus, 2018). We present our applied approach, balancing the need for timely information against the accuracy of the result in a sample-scarce and quite complex situation of dense buildings in a highly dynamic, growing mega-city (Zerboni et al., 2020). Automated dwelling and building extraction from very high-resolution (VHR) optical satellite imagery for humanitarian purposes, mainly for population estimation in refugee or internally displaced person camps, but increasingly including informal settlements within fast-growing cities, has about two decades of history (see, for example, Bjorgo, 2000; Checchi et al., 2013; Ehrlich et al., 2009; Witmer, 2015).
(Semi-)operational solutions were developed to support humanitarian actors with up-to-date information, especially in dynamically developing situations (Tiede, Füreder, Lang, Hölbling, & Zeil, 2013). Owing to their limitations in spatial context analysis, pixel-based approaches were rapidly replaced by object-based image analysis (OBIA) or mathematical morphology-based approaches (Giada, De Groeve, Ehrlich, & Soille, 2003; Kemper & Heinzel, 2014; Kemper, Jenerowicz, Pesaresi, & Soille, 2011; Knoth & Pebesma, 2017; Lang, Tiede, Hölbling, Füreder, & Zeil, 2010; Spröhnle et al., 2014; Stängel, Tiede, Lüthje, Füreder, & Lang, 2014; Tiede, Krafft, Füreder, & Lang, 2017), while the development of semi-automated and fully automated algorithms took quite some time. Challenges in an operational context of humanitarian actions include: quality issues of the images available in crisis situations; the small scale of the objects/dwellings to be extracted; the unplanned, dynamic structure of the settlements; and varying environmental conditions. While knowledge-based approaches have clear advantages (e.g., sample data are not needed), they are time-consuming to set up and limited in their transferability under the ever-changing environmental and atmospheric conditions of the target scenes. For some years now, CNNs have become prominent in satellite image analysis (Hoeser, Bachofer, & Kuenzer, 2020; Ma et al., 2019), and various sample databases have been established with the aim of training transferable and more generically applicable networks.¹ Nevertheless, current applications still need a lot of additional training samples to obtain acceptable results, especially for non-standard tasks in information extraction. CNNs are nowadays also used in humanitarian mapping tasks; for example, Quinn et al.
(2018) reviewed recent developments and conducted initial experiments for selected refugee camps where manually mapped data were available, concluding that full automation is not yet possible. Ghorbanzadeh, Tiede, Dabiri, Sudmanns, and Lang (2018) and Ghorbanzadeh, Tiede, Wendt, Sudmanns, and Lang (2020) showed for a single refugee camp, albeit with different VHR satellite sensors and different time-steps, how CNNs can be coupled with knowledge-based OBIA approaches. Lu, Koperski, Kwan, and Li (2020) extracted tents in a Syrian refugee camp, comparing their proposed fully convolutional network, based on an ImageNet-pretrained VGG-16 network, with different existing CNN networks and manually labelled data. Most documented approaches have in common that they are not benchmarked in an operational setting. Also, manually labelled data for the whole area under investigation were available, and timely results were not considered crucial in these experimental works. We conclude that they have limited capacity in sample-scarce situations. Building extraction not specifically related to humanitarian response is another broad field of deep learning application; in particular, the DeepGlobe 2018 Satellite Image Understanding Challenge should be mentioned here (Demir et al., 2018). In this challenge, data and training samples for Khartoum were also available, but a cross-check showed that the data set provided (a WorldView-3 image from 2015 in off-nadir view) and the respective training samples were not readily useful for this task: the very detailed building footprints were slightly shifted and outdated compared to our satellite scene. The special situation we face in humanitarian mapping, namely fast delivery of results for previously unmapped areas with large numbers of informal dwelling structures, some of them small and attached to each other, is a challenge for training-data-hungry CNNs.
Perfectly transferable, already trained networks are not (yet) available, so retraining on top of an initially trained network is needed. Also, additional information is often not available for these areas, and the most recent images often do not show perfect conditions with respect to viewing angle, seasonality, atmospheric disturbances and so on. The biggest challenge in this operational work remains providing enough samples in a short timeframe to train the CNN so that the results fulfil two main goals: fast delivery of a map of the area indicating density and sizes of the buildings; and sufficiently high accuracy to support humanitarian operational work. In the following we report on our workflow, aiming not so much at the ultimate accuracy CNNs can reach under perfect conditions, but rather at fostering operational usage in humanitarian operations, optimizing both timeliness and accuracy in sample-scarce situations. The overall workflow is depicted in Figure 1 and explained in detail in this section. We used a Pléiades 1A image, acquired on 8 November 2019 and pansharpened to 0.5 m ground sampling distance, which covered approximately 825 km² of the central and eastern part of Khartoum (see Figure 2). Dwelling extraction and dwelling density mapping were done under the premise of targeting smaller units and their respective density (hence the term "dwelling"), not the precise mapping of building footprints for cadastral purposes and the like. A Mask R-CNN approach (He, Gkioxari, Dollár, & Girshick, 2017) was employed because of its remarkable results in instance segmentation and its object-extraction capability. Mask R-CNN is an extension of Faster R-CNN (Ren, He, Girshick, & Sun, 2017), a class of region-based CNN that has been speed-optimized for classification as well as object detection, using proposed regions (bounding boxes) for multiple objects present in an image.
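At 0.5 m ground sampling distance, a scene covering roughly 825 km² amounts to several billion pixels, so both training and inference have to run on image tiles. A minimal sketch of the tile bookkeeping in plain Python; the 512 px tile size and 64 px overlap are illustrative assumptions, not parameters from this study:

```python
import math

def tile_origins(width_px, height_px, tile=512, overlap=64):
    """Upper-left pixel origins of overlapping tiles covering a raster.
    The stride is tile - overlap; the last row/column is clamped so the
    raster edge is always covered."""
    stride = tile - overlap
    xs = list(range(0, max(width_px - tile, 0) + 1, stride))
    if xs[-1] + tile < width_px:
        xs.append(width_px - tile)
    ys = list(range(0, max(height_px - tile, 0) + 1, stride))
    if ys[-1] + tile < height_px:
        ys.append(height_px - tile)
    return [(x, y) for y in ys for x in xs]

# 825 km² at 0.5 m GSD: side length in pixels if the scene were square.
side_px = round(math.sqrt(825e6 / 0.5**2))
print(side_px)                        # 57446
print(len(tile_origins(2000, 1000)))  # 15 tiles for a small 2000 x 1000 raster
```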
Mask R-CNN extends this approach using small fully convolutional networks applied to each region, predicting an object-specific segmentation mask per pixel, thus extracting target objects without background. Sampling therefore focuses on target objects only; no background or negative samples are needed, which is an advantage over other approaches in a time-critical, operational setting. Mask R-CNN has been successfully applied in similar satellite image analysis tasks, for example for sparse and multi-sized object detection in VHR images (Wang, Tao, Wang, Wang, & Li, 2019), building extraction (Wen et al., 2019) or within the DeepGlobe Building Extraction Challenge (Zhao, Kang, Jung, & Sohn, 2018). We used Mask R-CNN as implemented in the Python API for Esri's ArcGIS environment, which allows access to the underlying deep learning algorithms within a GIS environment to perform sample collection, training sample generation, classification, post-processing and map production in a seamless workflow, saving time in the production process. Mask R-CNN provides GIS-ready objects, and it outperforms other approaches in this respect. Due to the huge number of dwellings and the lack of a priori sample data, we followed a parallel approach: two trained image analysts digitized dwellings on image subsets representing specific generic building footprints; in parallel, a Mask R-CNN network was trained using the growing sample database.
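Because Mask R-CNN predicts a per-pixel mask for each detected object, turning a prediction into a GIS-ready footprint is essentially a raster-to-vector step. A minimal sketch, assuming a binary mask per object and the 0.5 m pixel size of the Pléiades scene; the actual conversion in the ArcGIS workflow is handled by the framework itself:

```python
def mask_to_footprint(mask, gsd=0.5):
    """Convert a per-object binary mask (rows of 0/1) to a simple
    footprint summary: area in m² and pixel bounding box.
    gsd: ground sampling distance in metres per pixel."""
    rows = [i for i, row in enumerate(mask) if any(row)]
    cols = [j for row in mask for j, v in enumerate(row) if v]
    if not rows:
        return None  # empty mask, no object
    n_pixels = sum(sum(row) for row in mask)
    return {
        "area_m2": n_pixels * gsd * gsd,
        "bbox": (min(cols), min(rows), max(cols), max(rows)),
    }

mask = [
    [0, 1, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
]
print(mask_to_footprint(mask))  # {'area_m2': 1.75, 'bbox': (1, 0, 3, 2)}
```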
Initial results of the Mask R-CNN network for different subsets of the image data were checked and validated by the interpreters against visual results, and the positively evaluated ones were added to the growing sample database without additional digitizing work; several iterations helped to enhance the sample database much more rapidly and efficiently than a purely manual approach.

FIGURE 1 Overall workflow of training sample generation, done manually and supported by initial Mask R-CNN analysis on subsets; post-processing of final results; and map production

The goal was to find a trade-off between time-consuming manual sample extraction and a trained network able to deliver a solid overview of building density and structure distribution across the city. Based on the initial results, three different models were trained: one model for small dwellings using training sample augmentation (rotation by increments of 10°); a second one for larger dwellings without augmentation; and a third model for very dense blocks of buildings. Initial tests with the second model showed that augmentation did not lead to better results, most likely due to the shadows cast by the buildings. The shadows seem to influence the object detection process, and rotation of directed shadows did not improve the results here. The third model focused on building blocks, meaning larger aggregates of concatenated dwellings, in particular in densely built-up areas with large numbers of very small dwellings. The blocks of buildings (sampled for the densest parts of the city) should help to indicate these areas in the final density maps. Here, too, the sample size was increased through augmentation. For all three models, ResNet-50 was selected as the backbone. The numbers of training samples used for the training of the three final models (a mixture of manually digitized samples and validated initial results of the Mask R-CNN models), and their average size, are summarized in Table 1.
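The rotation augmentation used for the small-dwelling model can be sketched as follows: each digitized sample polygon is rotated in 10° increments, yielding 36 variants per sample. This is a simplified plain-Python illustration (rotation about the vertex centroid); the actual augmentation is handled inside the training framework:

```python
import math

def rotate_polygon(coords, angle_deg):
    """Rotate a polygon (list of (x, y) vertices) about its vertex centroid."""
    cx = sum(x for x, _ in coords) / len(coords)
    cy = sum(y for _, y in coords) / len(coords)
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    return [
        (cx + (x - cx) * cos_a - (y - cy) * sin_a,
         cy + (x - cx) * sin_a + (y - cy) * cos_a)
        for x, y in coords
    ]

def augment(coords, step_deg=10):
    """Rotated copies in step_deg increments (10°, as for the small-dwelling model)."""
    return [rotate_polygon(coords, a) for a in range(0, 360, step_deg)]

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(len(augment(square)))  # 36 rotated variants per sample
```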
We deliberately used low probability thresholds in order to create dwelling footprint layers for each size category; afterwards we employed post-processing based on the intersection of footprints and the probability values of each polygon to reduce double-counting and errors. The aim of this strategy was to "relax" the network to some degree, and to recognize non-typical building instances for which only a few samples had been collected. Post-processing was done with the help of knowledge-based GIS routines (feature union, removal of identical features, elimination/dissolving of slivers) and removal of dwellings smaller than 7 m² (defined as the minimum dwelling size based on visual inspection). For the final map production, a visual check and manual cleaning of obvious misclassifications/errors were also conducted. For the accuracy assessment, a 50 m × 50 m grid was created over the area of Khartoum. Then 355 of these cells were randomly selected (see Figure 2) and dwellings were manually delineated by two trained mapping experts not involved in the study itself. The assessment was conducted on the post-processed data for the dwelling classes before the manual refinement took place, in order to evaluate only the automated part of the approach. Precision, recall, and F1 were calculated based on true positives (TP), false positives (FP), and false negatives (FN) detected within the validation cells. Precision indicates the proportion of extracted dwellings that correspond to target dwellings:

precision = TP / (TP + FP)    (1)

Recall indicates the proportion of target dwellings in the validation data that were correctly detected by the approach:

recall = TP / (TP + FN)    (2)

F1 is used to balance the precision and recall parameters:

F1 = 2 × (precision × recall) / (precision + recall)    (3)

For an evaluation of our building blocks result, a comparison with a new data set published after our map production in July 2020 was conducted (Dooley, Boo, Leasure, & Tatem, 2020). This data set is based on recent building footprint analyses, which are not yet publicly available.
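The probability-based reduction of double counting can be illustrated with a simplified stand-in for the GIS routines: wherever footprints from different models intersect, only the higher-probability one is kept, and anything below the 7 m² minimum dwelling size is dropped. Axis-aligned boxes replace real footprint polygons here, so this is a sketch of the logic rather than the actual workflow:

```python
def boxes_overlap(a, b):
    """a, b: (xmin, ymin, xmax, ymax) axis-aligned boxes."""
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

def box_area(a):
    return (a[2] - a[0]) * (a[3] - a[1])

def deduplicate(footprints, min_area=7.0):
    """footprints: list of (box, probability) pooled from the models.
    Keep the higher-probability footprint wherever two overlap, and
    drop anything below the minimum dwelling size (7 m² here)."""
    kept = []
    for box, prob in sorted(footprints, key=lambda f: -f[1]):
        if box_area(box) < min_area:
            continue
        if not any(boxes_overlap(box, k) for k, _ in kept):
            kept.append((box, prob))
    return kept

candidates = [
    ((0, 0, 4, 4), 0.9),     # 16 m² dwelling, highest probability
    ((3, 3, 7, 7), 0.6),     # overlaps the first -> dropped
    ((10, 10, 12, 12), 0.8), # 4 m² -> below minimum size
]
print(len(deduplicate(candidates)))  # 1
```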
In total, we extracted approximately 1.4 million dwellings automatically (1,099,753 small dwellings from the first model and 297,770 larger dwellings from the second model) and 26,512 dense building blocks. These figures were reduced to 1.17 million dwellings after the automated post-processing steps (890,074 small dwellings and 283,069 larger dwellings) and 20,091 dense building blocks (Table 2). Figure 3 shows a subset of the area analysed and the dwelling footprints of the small and large dwelling categories before post-processing took place. This was achieved using only a few hundred completely manually digitized training samples, complemented by a few thousand automatically extracted and manually verified samples (see Table 1 and Figure 1). Despite limited staff time and computing power (three researchers involved part-time in the analysis process, including manual sample generation and automation; training and analysis conducted on a standard workstation with an NVIDIA Quadro P4000 8 GB graphics card), we were able to deliver first results within 10 days of receiving the request, followed by some more detailed analyses and final map production a few days later. Before the final maps were created, some manual refinement took place, mainly to remove clutter such as obvious outliers and misclassifications. Two final maps were produced: a dwelling density map, in which dwelling area was aggregated to a 1 ha hexagon grid for an overview of the different dwelling density zones within the city; and a second map showing the extracted dwellings. We distinguish six dwelling categories by size, ranging from 7 m² to over 900 m² in surface area, and overlaid the results on the outlines of the additionally extracted building blocks to indicate the highest dwelling density areas (Figure 4).
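The density-map aggregation can be sketched as binning dwelling centroids into a hexagon grid whose cells have an area of 1 ha. The flat-top axial hexagon scheme and centroid-based assignment below are assumptions for illustration; the production map was built with GIS tooling:

```python
import math

# Hexagon area = (3*sqrt(3)/2) * size², solved for size so that area = 1 ha.
HEX_SIZE = math.sqrt(10_000 / (3 * math.sqrt(3) / 2))  # ~62.04 m

def hex_cell(x, y, size=HEX_SIZE):
    """Axial (q, r) index of the flat-top hexagon containing point (x, y)."""
    q = (2 / 3) * x / size
    r = (-1 / 3 * x + math.sqrt(3) / 3 * y) / size
    # Cube rounding: round all three cube coordinates, then fix the one
    # with the largest rounding error so they still sum to zero.
    xq, zr = q, r
    yq = -xq - zr
    rx, ry, rz = round(xq), round(yq), round(zr)
    dx, dy, dz = abs(rx - xq), abs(ry - yq), abs(rz - zr)
    if dx > dy and dx > dz:
        rx = -ry - rz
    elif dy > dz:
        ry = -rx - rz
    else:
        rz = -rx - ry
    return rx, rz

def density_grid(dwellings):
    """Sum dwelling area (m²) per 1 ha hexagon; dwellings: [(x, y, area_m2)]."""
    grid = {}
    for x, y, area in dwellings:
        cell = hex_cell(x, y)
        grid[cell] = grid.get(cell, 0.0) + area
    return grid

grid = density_grid([(5, 5, 20.0), (8, 2, 35.0), (500, 500, 50.0)])
print(grid[(0, 0)])  # 55.0 -> two dwellings fall into the origin hexagon
```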
In 172 of the 355 validation cells, no dwellings were present, either in the validation data or in the automated extraction (see the yellow cells in Figure 2). This shows that false positives and false negatives do not stem from confusion with non-dwelling classes outside the urban area; the results match the city structure well. Instead, errors are mostly related to problems within the dense building areas. For the remaining 183 cells containing dwellings, the accuracy figures are shown in Table 3. In light of the limited time available for this information request, the validation shows quite good results for these 183 cells. The number of buildings is very similar, and in particular the extent of the area covered by buildings seems to be a very good match, but the actual area overlap reveals some problems with exact locations. The dense building block layer serves in our approach as a final indicator map for locations where very dense structures occur. Figure 6 shows a subset of such a densely built-up area and the delineated building blocks. This result acts as an uncertainty layer providing additional information, and was validated visually before map production. A comparison with a new data set published after our map production (Dooley et al., 2020) is shown in Figure 7; it exhibits a very similar pattern for the highest building density areas. Accuracy values reported for the DeepGlobe Building Extraction Challenge are not directly comparable to ours, since its evaluation procedure used intersection over union, while we evaluated the spatial match of whole buildings rather than only their intersecting parts. The reason is that we were aiming for a correct number of buildings for population estimation in this humanitarian context, less for exact building delineation. Nevertheless, the comparison of the results underpins the challenges of the complex situation of dense buildings in a highly dynamic mega-city like Khartoum.
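The cell-wise counting of TP, FP and FN behind Table 3 can be sketched as a greedy one-to-one matching of extracted dwellings to reference dwellings; since the article does not spell out the exact matching rule, the simple overlap criterion below is an assumption:

```python
def overlaps(a, b):
    """a, b: (xmin, ymin, xmax, ymax) axis-aligned boxes."""
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

def evaluate(extracted, reference):
    """Greedy one-to-one matching of extracted to reference dwellings by
    spatial overlap (an assumed criterion), then precision/recall/F1
    following Equations (1)-(3)."""
    unmatched_ref = list(reference)
    tp = 0
    for box in extracted:
        for ref in unmatched_ref:
            if overlaps(box, ref):
                unmatched_ref.remove(ref)  # each reference matched at most once
                tp += 1
                break
    fp = len(extracted) - tp
    fn = len(unmatched_ref)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

ext = [(0, 0, 2, 2), (5, 5, 7, 7), (20, 20, 22, 22)]
ref = [(1, 1, 3, 3), (5, 5, 7, 7), (10, 10, 12, 12)]
print(evaluate(ext, ref))  # precision = recall = F1 = 2/3 here
```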
Major problems in our study therefore occurred mainly due to the very dense and attached nature of buildings in some parts of the city. The introduction of an additional dense building block detection layer helped to identify these dense areas, also showing the areas with higher uncertainty in our approach. Visual inspection prior to map production revealed that larger rectangular dwellings were mixed up with smaller dwellings surrounded by a rectangular wall and a similar cast shadow. These errors were minor, but should be addressed in the sampling process in the future. The iterative sampling strategy, interacting with initial training results from the Mask R-CNN approach, significantly reduced the time and effort required for training sample production. Accuracy limitations need to be taken into consideration whenever fast delivery of initial results must be balanced against reliability to match user expectations in the humanitarian response of concern. In this case the user, MSF, was satisfied with the accuracy of the map delivered; the information rapidly gained on some 1.2 million dwelling footprints could not have been achieved by other methods in such a short time.

FIGURE 6 Dense building blocks (red outlines) delineated by the third Mask R-CNN model to compensate for underestimation of small dwellings in very densely built-up areas. Dwelling types are categorized according to their size (see legend)

FIGURE 7 Comparison of the extracted dense building blocks (right) with a building density layer produced by Dooley et al. (2020; left), aggregating building density to approximately 100 m grid cells

Our workflow for Mask R-CNN-based dwelling extraction from VHR data shows great potential in operational workflows for humanitarian response. The ultimate goal of a trained network that performs convincingly with little or no additional sampling has not yet been fully achieved.
However, as shown in our study, interactive sampling with feedback loops for refinement improved the speed of the sampling process and helped to address limitations identified early in the workflow. Future work is needed to elaborate on the reuse of existing models (e.g., from the DeepGlobe Building Extraction Challenge). Similar exercises can help in improving results and reducing the amount of training data needed for transferability across different images and areas.

We acknowledge the support of Vitoria Barbosa and Tanya Singh in producing the validation data set. This research was supported by the Austrian Federal Ministry for Digital and Economic Affairs, the National Foundation for Research, Technology and Development, the Christian Doppler Research Association, and MSF Austria (Ärzte ohne Grenzen Sektion Österreich). Data available on request from the authors. https://orcid.org/0000-0002-5473-3344

ENDNOTE
1 See, for example, https://github.com/chrieke/awesome-satellite-imagery-datasets

REFERENCES
Using very high spatial resolution multispectral satellite sensor imagery to monitor refugee camps
Validity and feasibility of a satellite imagery-based method for rapid estimation of displaced populations
DeepGlobe 2018: A challenge to parse the Earth through satellite images
Gridded maps of building patterns throughout sub-Saharan Africa (Version 1.1)
Can Earth observation help to improve information on population? Indirect population estimations from EO derived geo-spatial data: Contribution from GMOSS
Dasymetric mapping and areal interpolation: Implementation and evaluation
Dwelling extraction in refugee camps using CNN: First experiences and lessons learnt. International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences
Transferable instance segmentation of dwellings in a refugee camp: Integrating CNN and OBIA
Information extraction from very high resolution satellite imagery over Lukole refugee camp
Mask R-CNN
Object detection and image segmentation with deep learning on Earth observation data: A review. Part II
Object detection and image segmentation with deep learning on Earth observation data: A review. Part I: Evolution and recent trends
Mapping and monitoring of refugees and internally displaced people using EO data
Enumeration of dwellings in Darfur camps from GeoEye-1 satellite images using mathematical morphology
Detecting dwelling destruction in Darfur through object-based change analysis of very high-resolution imagery
Earth observation tools and services to increase the effectiveness of humanitarian assistance
Earth observation (EO)-based ex post assessment of internally displaced person (IDP) camp evolution and population dynamics in Zam Zam
Deep learning for effective refugee tent extraction near Syria-Jordan border
Deep learning in remote sensing applications: A meta-analysis and review
Deep learning: A critical appraisal
Humanitarian applications of machine learning with remote-sensing data: Review and case study in refugee settlement mapping
Faster R-CNN: Towards real-time object detection with region proposal networks
Earth observation-based dwelling detection approaches in a highly complex refugee camp environment: A comparative study
Object-based image analysis using VHR satellite imagery for monitoring the dismantling of a refugee camp after a crisis: The case of Lukole, Tanzania
Automated analysis of satellite imagery to provide information products for humanitarian relief operations in refugee camps: From scientific development towards operational services
Stratified template matching to support refugee camp analysis in OBIA workflows
United Nations Department of Economic and Social Affairs
Big Map R-CNN for object detection in large-scale remote sensing images
Automatic building extraction from Google Earth images under complex backgrounds based on deep instance segmentation network
Remote sensing of violent conflict: Eyes from above
Exploiting deep learning and volunteered geographic information for mapping buildings in Kano
The Khartoum-Omdurman conurbation: A growing megacity at the confluence of the Blue and White Nile Rivers