key: cord-0926060-be8l6yok
authors: Lopez-Rincon, Alejandro; Tonda, Alberto; Mendoza-Maldonado, Lucero; Mulders, Daphne G. J. C.; Molenkamp, Richard; Perez-Romero, Carmina A.; Claassen, Eric; Garssen, Johan; Kraneveld, Aletta D.
title: Classification and specific primer design for accurate detection of SARS-CoV-2 using deep learning
date: 2021-01-13
journal: Sci Rep
DOI: 10.1038/s41598-020-80363-5
sha: 7b3401352e81c826ab202d28ca9684d61469324b
doc_id: 926060
cord_uid: be8l6yok

In this paper, deep learning is coupled with explainable artificial intelligence techniques for the discovery of representative genomic sequences in SARS-CoV-2. A convolutional neural network classifier is first trained on 553 sequences from the National Genomics Data Center repository, separating the genome of different virus strains from the Coronavirus family with 98.73% accuracy. The network’s behavior is then analyzed, to discover sequences used by the model to identify SARS-CoV-2, ultimately uncovering sequences exclusive to it. The discovered sequences are validated on samples from the National Center for Biotechnology Information and Global Initiative on Sharing All Influenza Data repositories, and are proven to be able to separate SARS-CoV-2 from different virus strains with near-perfect accuracy. Next, one of the sequences is selected to generate a primer set, and tested against other state-of-the-art primer sets, obtaining competitive results. Finally, the primer is synthesized and tested on patient samples (n = 6 previously tested positive), delivering a sensitivity similar to routine diagnostic methods, and 100% specificity. The proposed methodology has a substantial added value over existing methods, as it is able to both automatically identify promising primer sets for a virus from a limited amount of data, and deliver effective results in a minimal amount of time. Considering the possibility of future pandemics, these characteristics are invaluable to promptly create specific detection methods for diagnostics.

www.nature.com/scientificreports/ of site 8872 located in ORF1ab gene and site 28,144 located in ORF8 gene gradually increased from 0 to 29% as the epidemic progressed 11 . Apart from the false negative test problems, SARS-CoV-2 assays can yield a small portion of false positives through nonspecific detection of other Coronaviruses, as the virus is closely related to other Coronavirus organisms 12 . In addition, SARS-CoV-2 may be present with other respiratory infections, hindering its identification 13, 14 .

Thus, it is fundamental to improve existing diagnostic tools to contain the spread. For example, diagnostic tools combining computed tomography (CT) scans with deep learning have been proposed, achieving an improved detection accuracy of 82.9% 15 . Another solution being used for studying SARS-CoV-2, is sequencing of the viral complementary DNA (cDNA). For example, we can use this sequencing data with cDNA, resulting from the PCR of the original viral RNA; e.g. Real-Time PCR amplicons to identify the SARS-CoV-2 16 .

Classification using viral sequencing techniques is mainly based on alignment methods such as FASTA 17 and BLAST 18 . These methods rely on the assumption that cDNA sequences share common features, and their order prevails among different sequences 19, 20 . However, these methods suffer from the necessity of needing base sequences for the detection 21 . Nevertheless, it is necessary to develop innovative improved diagnostic tools that target the genome to improve the identification of pathogenic variants, as sometimes several tests are needed to have an accurate diagnosis. Therefore, as an alternative, deep learning methods have been suggested for classification of DNA sequences. The advantage of these methods are that they do not need pre-selected features to identify or classify DNA sequences. Deep Learning has been efficiently used for classification of DNA sequences, using one-hot label encoding and Convolutional Neural Networks (CNN) 22, 23 , albeit the examples in literature are featuring DNA sequences of length up to 500 bps, only.

In particular, for the case of viruses, NGS genomic samples might not be identified by BLAST, as there are no reference sequences valid for all genomes, as viruses have high mutation frequency 24 . Alternative solutions based on deep learning have been proposed to classify viruses, by dividing sequences into pieces of fixed length, ranging from 300 bps 24 to 3000 bps 25 . However, this approach has the negative effect of potentially ignoring part of the information contained in the input sequence, that is disregarded if it cannot completely fill a piece of fixed size. The global impact of SARS-CoV-2 prompted researchers to apply effective alignment-free methods to the classification of the virus: for example, in 26 the authors propose the use of Machine Learning Digital Signal Processing for separating the virus from similar strains, with remarkable accuracy. Nevertheless, there is no human-readable information that can be extracted from their black-box procedure, so the biological insight provided by their approach is limited. In order to offer further understanding to experts, techniques from the field of explainable AI (XAI) 27,28 could be potentially effective; and it is interesting to remark that similar consideration specifically for the medical domain have already appeared in literature 29 .

Given the impact of the world-wide outbreak, international efforts have been made to simplify the access to viral genomic data and metadata through international repositories, such as the National Genomics Data Center (NGDC) repository 11 , the National Center for Biotechnology Information (NCBI) repository 30 and the Global Initiative on Sharing All Influenza Data (GISAID) repository 31 , expecting that the ease of access to information would make it possible to develop medical countermeasures to control the disease worldwide, as it happened in similar cases earlier [32] [33] [34] . Thus, taking advantage of the available information of international resources without any political and/or economic borders, we propose an innovative system based on viral gene sequencing.

Using a CNN to separate Coronaviruses belonging to different strains 35 , including SARS-CoV-2, we apply techniques inspired by XAI in computer vision to discover representative cDNA sequences that the network uses to classify SARS-CoV-2. We then validate the discovered sequences on datasets not used during the training of the CNN, and show how to exploit them to create a novel, highly informative set of sequence features (e.g. viral sequences). Such sequences can be later inspected and analyzed by human experts. Experimental results show that the new set of sequence features leads traditional, simple classifiers, to correctly assess SARS-CoV-2 with remarkable accuracy (> 99%). A few of the discovered sequences also possess the correct characteristics for potentially becoming primers, as just checking for their presence in samples is enough to specifically identify SARS-CoV-2. Laboratory testing on the most promising sequences identified, showed that the primers found by our approach can be a viable alternative to the commonly adopted primers at the time of writing. These results could pave the way to an automatic procedure for the design of primers, see Fig. 1 for the proposed workflow.

CNN classification and feature construction. The trained CNN described in the "Methods" section obtained a mean accuracy of 98.73% in a 10-fold stratified cross-validation. Observing the confusion matrix for the 5 considered classes, reported in Fig. 2 it is remarkable to notice that even samples from underrepresented classes were mostly correctly positioned. Such an encouraging result can indicate that the network was routinely able to uncover meaningful sequences to separate the different classes of viruses.

Once the network is trained, in a first analysis, we plot the inputs and outputs of the convolutional layer, to visually inspect for patterns. As an example, in Fig. 3a we report the visualization of the first 1250 bps of each of the 553 samples from the NGDC 11 repository.

Since each filter in the network slides a 21-bps window over the input, and for each step produces a single value, the output of a filter is a sequence of values in (0, 1). The output of the max pooling for each of the 12 filters is then further inspected for patterns. It is noticeable how samples belonging to different classes can be already visually distinguished. At this step, we identify filter 0 as the most promising, as it seems to focus on a few relevant points in the genome, that could correspond to meaningful cDNA sequences.

Given this data, it is now possible to identify the 21-bps sequences that obtained the highest output values in the max pooling layer of filter 0, in a section of 148 positions. This process results in 210 (31, www.nature.com/scientificreports/ convolutional filter, in a specific 148-position interval of the original genome: the first max pooling feature will cover positions 1-148, the second will cover position 149-296, and so on. We show the complete set of max pooling features for the complete data 4410 (210*21) arranged one after the other, in Fig. 3b . The CNN architecture is described in the methods section, the visualization of the filter, and max pooling are available in the Supplementary Materials, Section 1.

Analyzing the different sequence values appearing in the max pooling feature space, we get a total of 3827 unique 21-bps cDNA sequences, that can potentially be informative for identifying different virus strains. For example, sequence AGG TAA CAA ACC AAC CAA CTT is only found inside the class of SARS-CoV-2, in 59 out of the 66 available samples. Sequence CAC GAG TAA CTC GTC TAT CTT is present again only in SARS-CoV-2, in 63 out of the 66 samples. The experiments presented in the following subsections to validate our method have different objectives and make use of different datasets. A summary of all the experiments and datasets used is shown in Fig. 1 .

Recapitulating the results of experiments 1-4 (see Fig. 1 ), we discovered 12 meaningful 21-bps sequences that best characterize SARS-CoV-2. For all the analyzed data, these sequences appear only in SARS-CoV-2 samples and not in any other viruses, as summarized in Table 1 . Remarkably, our results outperform earlier publications using machine learning for identifying SARS-CoV-2 (see for example 26 ) , with the added benefit of producing human-readable results instead of a plain black box classifier.

Laboratory validation of the candidate primer set. We calculated the frequency of appearance of different primer sets' sequences used in SARS-CoV-2 RT-PCR tests developed by World Health Organization (WHO) referral laboratories, and compared it to our primer design on the GISAID dataset ( Table 2 ). All of the sequences have a frequency of appearance of > 99% , with the exception of CHINA-CDC-N-F, with a 68.52%. This is consistent with the percentage of genomes with mutation in the primer region, as stated by latest GISAID update summary of August 11th, 2020 31 . For the in-silico analysis of specificity, we compared all the primers sets' sequences with the NCBI-B and NGDC dataset, and the results show that HKU-N-F, HKU-N-R, Charite-E-F, Charite-E-R and US-CDC-N2-F are not specific to SARS-CoV-2, as they bind to SARS-CoV, too. The rest of the sequences, including our design, only appear in SARS-CoV-2. In summary, of the 8 different primer sets, 3 of them appear to not be specific to SARS-CoV-2; and by frequency of appearance, our design is the 3rd best option among the remaining 5, considering the lowest frequency between the -F and -R primer. This is a remarkable result, considering that the proposed primer set has been extracted by an almost completely automated procedure.

To validate the data obtained in-silico by laboratory methods, a conventional PCR was performed on cDNA obtained from RNA from SARS-CoV-2 and other human coronaviruses. In addition, RNAs from nasopharyngeal swabs from six patients previously diagnosed with SARS-CoV-2 infection and four patients negative for SARS-CoV-2 by routine diagnostic method 5 were analyzed with the same conventional PCR (Fig. 4) . Different dilutions of SARS-CoV-2 RNA were detected with similar sensitivity compared to the diagnostic reference assay. (Fig. 4 , lanes 1-8). Our candidate primer set exclusively detected SARS-CoV-2 and did not amplify RNA from other human coronaviruses (Fig. 4, lanes 9-14) . The candidate primer set was able to detect SARS-CoV-2 RNA from patient samples previously found positive for SARS-CoV-2, but not in patients previously found negative (Fig. 4 , lanes [15] [16] [17] [18] [19] [20] [21] [22] [23] [24] . Although further validation will be required to develop this candidate primer set into a diagnostic assay, our results clearly demonstrate the power of our method to select potential sequences for further validation.

Being able to reliably identify SARS-CoV-2 and distinguish it from other similar pathogens is important to contain its spread. The time of processing samples and the availability of reliable diagnostic tests is a challenge during an outbreak. Developing innovative diagnostic tools that target the genome to improve the identification of pathogens, can help reduce health costs and time to identify the infection, instead of using unsuitable treatments www.nature.com/scientificreports/ or testing. Moreover, it is necessary to perform an accurate classification to identify the different species of Coronavirus, the genetic variants that could appear in the future, and the co-infections with other pathogens. Given the high transmissibility of the SARS-CoV-2, the proper diagnosis of the disease is urgent, to stop the virus from spreading further. Considering the false negatives given by the standard RT-qPCR detection, better implementations such as using deep learning are necessary in order to properly detect the virus. While the accuracy of current RT-qPCR testing is around 70%, and CT scans with deep learning go up at 83%, we believe that the use of the sequences detected by a CNN-based methodology has the potential to improve the accuracy of the diagnosis.

Our results, show that by targeting one out of the 12 selected 21-bps specific sequences, we are able to distinguish SARS-CoV-2, from any other virus (> 99%). Further testing is necessary to confirm these promising results so it is essential to create multidisciplinary groups that work to stop the outbreak. Finally, as an interesting remark, by comparing the discovered sequences against other hosts, we noticed that from the 12 sequences (Table 1) . This is consistent with the findings of Zhang et al. 36, 37 , and could point to the zootonic origin of the virus. Nevertheless, more data is necessary. As a result of the high density populations, and ever growing interaction between people, it is possible that other pandemics may occur. We believe that our methodology has a substantial added value over traditional methods, because it is a fast method and only limited set of viral sequencing data is needed. Moreover, this procedure led to a primer set with a very high specificity for SARS-CoV-2 with at least the same accuracy as the best primers sets in the world developed by WHO referral laboratories. Thus, thinking forward, our methodology can be applied in future viral pandemics to speed up the development of accurate detection methods for diagnosis and thereby contribute to limit the spread of a virus.

The CNN used during all the experiments is composed of one convolutional layer with 12 different filters or weights (each with window size 21, and an even padding of 10 steps on each side) with maxpooling (pool size 148 and stride 1), a fully connected layer (196 rectified linear units with dropout probability 0.5), and a final softmax layer with 5 units, to differentiate the different classes of Coronavirus strains. The optimizer used is Adaptive Momentum (ADAM) 38 , with learning rate 10 −5 and a batch size of 50 samples, run for 1000 epochs 35 . A graphical summary of the CNN used in the experiments is reported in Fig. 5 .

The convolutional layer of the network, in simple terms, is analyzing subsequences of 21 base pairs that can appear in different points of the virus genome. We selected 21 as designed primers for RT-PCR tests have a length of 18-22 bps normally. The pool size of the maxpooling represents the interval in which a specific 21-bps Table 1 . Percentage of appearance for each of the 12 discovered 21-bps sequences across the different datasets, and comparison to similar viruses in nature and other hosts. 1 21-bps sequence resulting from experiment 2. 2 21-bps sequence resulting from experiment 3. www.nature.com/scientificreports/ sequence can be recognized (in this case, 148 positions). Through the training process, the convolutional layer is de-facto learning new features to characterize the problem, directly from the data. In this specific case, the new features are 21-bps sequences that can more easily separate different virus strains. By analyzing the result of each filter in a convolutional layer, and how its output interacts with the corresponding max pooling, it is possible to detect human-readable sequences of base pairs that might provide domain experts with relevant information. It is important to notice that these sequences are not bound to specific locations of the genome; thanks to its structure, the CNN is able to detect them and recognize their importance even if their position is displaced in different samples. We downloaded 583 sequences (*.fasta files) from the NGDC on March 15th, 2020 (Table 3) . We divided the samples into 5 classes, taking into account both the number of available samples, and the seasonality of the related diseases. SARS-CoV-2 on its own as class 0, as the main objective is to separate this virus from all the others. The 240 samples from MERS-CoV/HCoV-EMC were assigned to class 1, as the related disease is mostly geographically limited to areas of Saudi Arabia 39 . HCoV-OC43, HCoV-229E and HCoV-NL63 have been reported to have winter seasonality, while HCoV-HKU1 has spring-summer seasonality. Nevertheless, there are instances where HCoV-NL63 had a spring-summer seasonality closer to HCoV-HKU1 [40] [41] [42] . Thus, also considering the number of samples, we grouped together HCoV-OC43, and HCoV-229E as class 2, while HCoV-HKU1 and HCoV-NL63 were grouped as class 3. We included HCoV-4408 in class 2 as well, as there were only 2 samples available: HCoV-4408 is subgroup A of the betacoronavirus genus, as HCoV-OC43 43 . Finally, as we deemed important to distinguish SARS-CoV-2 from SARS-CoV, SARS-CoV was assigned to its own class (class 4) even if only 10 samples were available.

We left out 30 SARS-CoV-2 sequences and then performed a 10-fold stratified cross-validation, with the remaining data divided into 80% training (8 folds), 10% validation (1 fold), 10% testing (1 fold). The stratified cross-validation preserves the ratio of classes in the original dataset for each fold, thus making the final accuracy more reliable.

Identifying SARS-CoV-2. Experiment 1: Validation on the NGDC dataset. We downloaded the dataset from the NGDC repository 11 on March 15th, 2020. We removed repeated sequences and applied the procedure to translate the data into the sequence feature space. This leaves us with a frequency table of 3827 features (21bps sequences) with 583 samples (Table 3 (left) ). Next, we ran a state-of-the-art feature selection algorithm 45, 46 , to reduce the sequences needed to identify different virus strain to the bare minimum. Remarkably, we are then able to correctly differentiate all the coronavirus (MERS-CoV, SARS-CoV-2, SARS-CoV-1, etc) samples using only 53 of the original 3827 sequences, obtaining a 100% accuracy in a 10-fold cross-validation with a simpler and more traditional classifier, such as Logistic Regression. The list of the 53 features is available in the Supplementary Materials, Section 2.

Experiment 2: Validation on the NCBI dataset. We downloaded data from the NCBI 30 repository on March 15th, 2020, with the following query: gene = "ORF1ab" AND host = "homo sapiens" AND "complete genome". The query resulted in 407 non-repeated sequences (Table 3 (right)). We call this dataset NCBI-A, where 68 sequences belong to SARS-CoV-2. Then, we applied the procedure to translate the data into the set of sequence features, and we run the same state-of-the-art feature selection algorithm 45 . The result is a list of 10 different sequences (Table 1) , for which just checking for their presence is enough to differentiate between SARS-CoV-2 and other viruses in the dataset, with a 100% accuracy. Each of the sequences, in fact, only appears in SARS-CoV-2 samples. Table 3 . Organism, assigned label, and number of samples in the unique sequences for the NGDC repository (left) and query: gene = "ORF1ab" AND host = "homo sapiens" AND "complete genome" in the NCBI repository (right). We use the NCBI organism naming convention 44 . , for a total of more than 584 other viruses (not considering strains and isolates). Then, we applied the procedure to translate the data into the sequence feature space and run the feature reduction algorithm 45 . This results in 2 extra sequences of 21 bps: just by checking for their presence, we are able to separate SARS-CoV-2 from the rest of the samples with a 100% accuracy (Table 1) . We design a specific primer set for detection of SARS-CoV-2 using Primer3plus 47 . We use TAG CAC TCT CCA AGG GTG TTC as forward primer and GCA AAG CCA AAG CCT CAT TA as reverse primer, obtaining an amplicon size of 179 bps. Then, we run an in-silico PCR test using FastPCR 6.7 52 with default parameters in NC045512.2 used as a reference SARS-CoV-2 sequence, this yields Tm = 56.2 • C for the forward primer, Tm = 53.1 • C for the reverse primer and Ta = 58 • C. In addition, we calculated the frequency of appearance of different primers sets' sequences used in SARS-CoV-2 RT-qPCR tests developed by WHO referral laboratories and compared it to our primer design sequences in 52,645 sequences from the GISAID repository and the 583 samples of different coronaviruses from the NGDC dataset from experiment 1. Table 2) . We selected this primers as they are the ones more commonly used as stated in the GISAID status update of August 11, 2020. We do not consider degenerate primer sets. Experiment 6: Validation of the candidate primer set in biological samples.. Viral RNA was isolated from cellcultured SARS-CoV-2, SARS-1, MERS-CoV, hCoV-NL63, hCoV-OC43, hCoV-229E, and from nasopharyngeal swabs from n = 10 patients by MagNA Pure LC (Roche Diagnostics, The Netherlands) using the total nucleic acid isolation kit. The RNA was converted into cDNA using SuperscriptIII (Thermo-Fisher Scientific, USA) and random hexamers. Subsequently, conventional PCR was performed on the cDNA using HotStar Taq DNA polymerase (Qiagen, The Netherlands) with 400 nM forward primer (5′-AG CAC TCT CCA AGG GTG TTC -3′) and 400 nM reverse primer (5′-GCA AAG CCA AAG CCT CAT TA-3′) and the following cycling conditions: 15 min at 95 • C, followed by 40 cycles of 1 min at 95 • C, 1 min at 5 • C and 1 min at 72 • C. The PCR products were visualized by electrophoresis. The same RNA was used in a diagnostics reference assay by Corman et al. 5 

Cycle threshold values form this reference assay were used for estimating sensitivity.

All the necessary scripts to reproduce the experiments are stored on the public GitHub repository: https :// githu b.com/stepp enwol f0/prime rs-sars-cov-2. Due to storage limits, while the data to reproduce Experiment 1 is included in the repository, for www.nature.com/scientificreports/

Coronavirus genomics and bioinformatics analysis.. Viruses

Genomic characterisation and epidemiology of 2019 novel coronavirus: Implications for virus origins and receptor binding

COVID-19) (World Health Organization

Combination of RT-qPCR testing and clinical features for diagnosis of COVID-19 facilitates management of SARS-CoV-2 outbreak

Detection of 2019 novel coronavirus (2019-ncov) by real-time RT-PCR

Evaluating the accuracy of different respiratory specimens in the laboratory diagnosis and monitoring the viral shedding of 2019-ncov infections

Antibody responses to SARS-CoV-2SARS-CoV-2 in patients of novel coronavirus disease 2019

False-negative results of initial RT-PCR assays for COVID-19: A systematic review

False negative tests for SARS-CoV-2 infection-challenges and implications

Next generation sequencing of viral rna genomes

Correlation of chest CT and RT-PCR testing in coronavirus disease 2019 (COVID-19) in China: A report of 1014 cases

Co-infections in people with COVID-19: A systematic review and meta-analysis

Clinical diagnosis of 8274 samples with 2019-novel coronavirus in Wuhan

A deep learning algorithm using CT images to screen for corona virus disease (COVID-19)

The first case of 2019 novel coronavirus pneumonia imported into Korea from Wuhan, China: Implication for infection prevention and control measures

Rapid and sensitive sequence comparison with fastp and fasta

Basic local alignment search tool

Applications of alignment-free methods in epigenomics

Alignment-free sequence comparison-a review

Phylogenetically diverse tt virus viremia among pregnant women

Dna sequence classification by convolutional neural network

A deep learning approach to dna sequence classification

Deep learning on raw dna sequences for identifying viral genomes in human samples

Identifying viruses from metagenomic data by deep learning

Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study

Defense Advanced Research Projects Agency (DARPA)

Explainable AI: Interpreting, Explaining and Visualizing Deep Learning

What do we need to build explainable AI systems for the medical domain?

dbsnp: The NCBI database of genetic variation

Global initiative on sharing all influenza data-from vision to reality

How ownership rights over microorganisms affect infectious disease control and innovation: A root-cause analysis of barriers to data sharing as experienced by key stakeholders

Managing severe acute respiratory syndrome (SARS) intellectual property rights: The possible role of patent pooling

Threats to timely sharing of pathogen sequence data

Accurate identification of SARS-CoV-2 from viral genome sequences using deep learning

A genomic perspective on the origin and emergence of SARS-CoV-2

Extreme genomic cpg deficiency in SARS-CoV-2 and evasion of host antiviral defense

A method for stochastic optimization

Middle east respiratory syndrome coronavirus

Human coronavirus infections in Israel: Epidemiology, clinical symptoms and summer seasonality of HCoV-HKU1

Human coronavirus circulation in the United States 2014-2017

Seasonality of coronavirus 229e, hku1, nl63 and oc43 from 2014-2020

Fatal interstitial pneumonia associated with bovine coronavirus in cows from southern Italy

The nucleotide sequence database

Automatic discovery of 100-MIRNA signature for cancer classification using ensemble feature selection

Machine learning-based ensemble recursive feature selection of circulating mirnas for cancer tumor classification

Primer3plus, an enhanced web interface to primer3

SARS-coronavirus open reading frame-8b triggers intracellular stress pathways and activates nlrp3 inflammasomes

The ORF3a protein of SARS-CoV-2 induces apoptosis in cells

Augmentation of chemokine production by severe acute respiratory syndrome coronavirus 3a/x1 and 7a/x4 proteins through nf-κ b activation

Severe acute respiratory syndrome coronavirus ORF3A protein interacts with caveolin

Fastpcr software for pcr primer and probe design and repeat search

The daily board of the Medical Ethics Committee Erasmus MC (METC-2015-306) of Rotterdam, The Netherlands, has reviewed the above mentioned research. As a result of this review, the Committee informs you that the rules laid down in the Medical Research Involving Human Subjects Act (informed consent), do not apply to this research proposal . The samples are from anonymized leftover material from patients for the purpose of improving diagnostics. All experiments were performed in accordance with relevant guidelines and regulations.

L.M.M., C.A.P. made the biological analysis, and primer design. A.L.R. and A.T. made the programming, data collection and experiments in silico. D.M. and R.M. made the PCR validation. E.C., A.D.K. and J.G. made the experiment and study design. All the authors contributed to the writing.

The authors declare no competing interests.

The online version contains supplementary material available at https ://doi. org/10.1038/s4159 8-020-80363 -5.Correspondence and requests for materials should be addressed to A.L.-R.Reprints and permissions information is available at www.nature.com/reprints.Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creat iveco mmons .org/licen ses/by/4.0/.