key: cord-0608269-5mnmh84s authors: Radanliev, Petar; Roure, David De; Walton, Rob; Kleek, Max Van; Santos, Omar; Montalvo, Rafael Mantilla; Maddox, La Treall title: What country, university or research institute, performed the best on COVID-19? Bibliometric analysis of scientific literature date: 2020-05-19 journal: nan DOI: nan sha: d02a4f8346b8d40725fa25fb68fe4c44d85610c2 doc_id: 608269 cord_uid: 5mnmh84s In this article, we conduct data mining to discover the countries, universities and companies, produced or collaborated the most research on Covid-19 since the pandemic started. We present some interesting findings, but despite analysing all available records on COVID-19 from the Web of Science Core Collection, we failed to reach any significant conclusions on how the world responded to the COVID-19 pandemic. Therefore, we increased our analysis to include all available data records on pandemics and epidemics from 1900 to 2020. We discover some interesting results on countries, universities and companies, that produced collaborated most the most in research on pandemic and epidemics. Then we compared the results with the analysing on COVID-19 data records. This has created some interesting findings that are explained and graphically visualised in the article. What country, university or research institute, performed the best on COVID-19? Bibliometric analysis of scientific literature 1 Since the COVID-19 pandemic started, we have seen an increasing number of scientific research articles, on a wide variety of related topics. Some of these topics are more closely related than we realise. For example, the research on tracking and monitoring of cases, is closely related to digital solutions, e.g. mobile apps. In this study, we use computable statistical methods, to investigate the correlations between, different scientific research records on the COVID-19 pandemic. Apart from investigating the connections between different topics, we also investigate the data records for patterns in the response from different countries. Our objective is that by investigating individual responses, we can provide scientific insights into some of the emerging concerns in the media, e.g. the World Health Organisation (WHO) speed of response 1 . There are topics that we consider beyond the scope of this study, such as the concerns on the origin of the disease 2 . Nevertheless, our aim is to provide statistical analysis, that can assist other researchers in answering these difficult questions. With the global focus on the pandemic, the data records are changing dramatically, at a fast rate. Since research data records are often categorised by year and not by months, it could be challenging at a future date to find scientific data to model, with precision, the research response at different stages of the response. We considered this study to be of significant relevance in this stage of the pandemic, because it provides statistical results that can be seen as a snapshot in time. Our rationale was based on the fact that the pandemic has been in existence for a few of months, and the scientific per-review process last few months. Hence, we assume that at present, we are reviewing research papers that have been produced at the beginning of the pandemic, and just published online now. We applied semi-automatic and automatic analysis of big data, to extract interesting, unusual and unknown patterns, from data records on COVID-19, published until 16 th of May 2020. We analysed 3094 data records, which constituted all data records in existence at present -from the Web of Science Core Collection on COVID-19. To compliment this, we conducted a second analysis of 138,624 historical data records from the Web of Science Core Collection, on pandemics and epidemics, covering the time period from 1900 to 2020. We used the historical data records, to compare with the current scientific research on COVID-19, and we use quantitative analysis to derive unexpected conclusions on the speed of response, from the most prominent organisations in pandemic research. In the investigation, we applied cluster analysis, anomaly detection analysis, association rule mining, and sequential pattern mining, among other methods. The findings of this study are presented in groupings of data records, and categorisations of patterns from the input data, which can be used or reproduced in future studies for predictive analytics, e.g. forecasting of insights on monitoring and management of future pandemics. Our objectives are to use computable statistical methods, to conduct bibliometric data mining on scientific research records and to answer some emerging questions on COVID-19. In the study, we investigate: After identifying the answers to these research questions, we focus on a new set of research questions: 4. What country produced the most research papers on pandemic and epidemics from 1900 to 2020? 5. What universities and companies have published most research on pandemic and epidemics from 1900 to 2020? 6. Which countries/universities collaborated most in research papers on pandemic and epidemics from 1900 to 2020? We use a variety of statistical methods (e.g three-fields plot, factorial analyse, historical analysis, network map analysis, etc.) to compare the findings from these questions. In this study, we differentiate our data mining approach from data analysis. We consider data analysis to be related to testing the effectiveness of specific models or hypothesis. We differentiate this from the data mining in our study, which we consider as using computerised statistical models to uncover interesting, unusual and unknown patterns from big data. Therefore, any reference to analysis in this study, e.g. historical analysis, bibliometric analysis, etc., refers to data mining and not data analysis. We conducted a brief literature review, to identify the current gaps in knowledge and to structure our research questions around these gaps. We found a related study on scientometric analysis of COVID-19 and coronaviruses [1] . Hence, we structured our questions on bibliometric analysis of COVID-19, compared with historic data on pandemics and epidemics. The main differences in this article in the approach. Scientometric analysis is focused on the performance of different authors, or journals. The bibliometric analysis in this article is focused on analysing national responses, institutional outputs, and the correlations between research findings. Similar research is present from March 2020 [2] , and presents analysis of 564 data records. Since then, the number of scientific research data records has increased to 3094. In addition, we use different statistical methods in our data mining and visualisation, which enables us to compare the COVID-19 analysis, with 138,624 historical data records on pandemics and epidemics. This is significantly different that a bibliometric analysis of 564 data records. The third study we reviewed to structure our research questions, was a study from March 2020, based on 183 data records from PubMed and analysed Identified and analysed the title, author, language, time, type and focus [3] . To differentiate our focus on looking at the same problem, from a different aspect, we used Web of Science data records, and we focused on clustering, classification, association, regression, summarisation and anomaly detection. In this study, we applied computable methods for statistical analysis, including R Studio, 'Biliometrix' package [4] , and VOSviewer [5] . To extract big data from established scientific databases, we used the Web of Science Core Collection, which contains data records from over 21,100 peer-reviewed, high-quality scholarly journals published worldwide, in over 250 disciplines 3 . Data mining represents a process of discovering new knowledge from patterns in big data. Usual methods applied include a combination of machine learning and statistics, on analysing big database systems. Data mining is considered a research field that combines computer science and statistics, in designing intelligent methods for extracting new information and for knowledge discovery from existing databases. The data mining in this study involved data management, data pre-processing, model inference and complexity considerations, postprocessing of discovered results considerations, visualisation, and interestingness metrics. Bibliometric analysis, or bibliometrics, is the practice of using statistical methods to analyse research publications, books, articles, and other scientific communications. In this bibliometric analysis, we extracted data records from the Web of Science Core Collection, and we analysed the records with three different data mining tools, (1) the Web of Science analyse results built-in tool; (2) data mining with VOSviewer; (3) data mining with the R Studio 'Bibliometrix' package. The first search for data records was on the Web of Science Core Collection. We searched for all data records on COVID-19 and we extracted 3094 data records (search performed on the date: 17 th of May 2020). We also conducted a second search for TOPIC: (pandemic) OR TOPIC: (epidemic), which resulted with 138,624 data records. Both data sets were analysed with the Web of Science analyse results built-in tool. Only the smaller data set was analysed with computerised statistical analysis, using the R Studio 'Bibliometrix' package. This was not by choice, but because of practicality. The Web of Science has data extraction limit of 500 records, to download the 3094 data records, we extracted 7 different files, and we merged the files using the 'Sublime Text' program. To repeat this process on 138,624 data records, we would need to extract 277 separate data files, and merge into one. This was considered tiresome and not practical. Instead, for the VOSviewer data mining, we used the Web of Science tool to identify the top 1000 most relevant data records and we used this data set as representative sample of the 138,624 data records. Only the 1000 most relevant data records are used for the VOSviewer data mining on pandemic and epidemics. To analyse all data records available on the Web of Science Core Collection, we used the built-in result analysis tool. First, we categorised the data records in researcher areas Figure 1 . From Figure 1 , we can see that current research is focused on the medical aspect of COVID-19. There is very little scientific research on the digital aspect of monitoring and managing the pandemic. Other relevant research areas are also missing, such as guidance on privacy preserving mobile app design for pandemic management, the role of internet of things in pandemic management, philosophical perspective on long term societal changes caused by the pandemic. In the first wave of the pandemic, the focus seems to have been predominantly on the medical aspects. Learning from this result, we can conclude that all other research areas become secondary in the immediate threat of pandemics -death. Therefore, scientific research on these topics should be ongoing and constantly advancing, in anticipation of similar pandemics happening in the future, without notice. To analyse if such preparations were happening in the past, we analyse the data records on COVID-19, and we compare the results with a historical analysis of data records on pandemic management. In Figure 2 , we categorise the data records by country. From Figure 2 , we can see that most scientific research is happening in the US and China, followed by the UK and Italy. Those are some of the worst affected countries at this present time. Although Spain and Iran are also in the hardest hit countries, the scientific research from these two countries is not showing as strong. Therefore, it is indicative, but not conclusive that countries that are most affected, are also most productive in terms of scientific research. To advance this analysis, we categorised the data records by organisations (enhanced) in Figure 3 . What becomes visible from the categorisation in Figure 3 , is that among the most reputable universities, which usually predominate such categorisations, we now have Wuhan University, where the pandemic originated (was first detected). The organisations (enhanced) in Figure 3 categorises the 3094 data records, to include research from associated organisations. We compare the 3094 data records on COVID-19, with the second data file on pandemics and epidemics records from 1900-2020, containing 138,624 data records in Figure 4 . What becomes clear when we compare the two classifications from Figure 3 and Figure 4 , is that some of the best performing universities on COVID-19, are not even present on the list of best performing research organisations on global pandemics and epidemics from the historical analysis. This indicates that there is either a global shift in scientific research, or the early affected regions e.g. Wuhan, have been most productive in scientific research on COVID-19. The second seems more likely. Since the Figure 3 and Figure 4 are classifying organisation-enhanced categories, to get a different perspective on organisations own research production, we did a second categorisation of the 3094 data records, by organisations own research Figure 5 . By categorising the organisations own research, we present a different result from the same data records. In the simple categorisation Figure 5 , we can see that Chines universities are currently in the lead, and we can also see that University of Teheran is also working on this topic, and its much higher ranked in terms of productivity from the previous categorisations -top performing organisation enhanced categorisation in Figure 3 . When we compare the Figure 5 -which is visualising the 3094 data record file, with Figure 6 -which is visualising the 138,624 data record file, we can see a further confirmation that the top performing institutions by output on COVID-19 ( Figure 5) , are not representative of the best performing research institutions ( Figure 6 ). This could signify that the world, for unclear reasons, was slow in responding with scientific research on COVID-19. We could speculate that the world didn't take COVID-19 seriously, or that Chinese knew something that the rest of the world didn't, but we have no data to confirm such speculations. What we can confirm with certainty, is that the Chinese research institutes acted much faster than the rest of the world, including the leading research organisation on pandemics and epidemics. What we can see in the categorisation by funding (based on the 3094 data record file) in Figure 7 , is that China is in the lead, but the US has more distributed funding programme, and if we sum up all the funding, we could get a different result. What is surprising however, is the weak performance of EU funding agencies. There are only 6 data records from the EU funds. When we compare the Figure 7 -which is visualising the 3094 data record file, with Figure 8 -which is visualising the 138,624 data record file, we can see that the organisations that have historically provided most of the funding on pandemics and epidemics, are not in the lead. Since COVID-19 is a global pandemic, the classifications in Figure 7 , should be similar with Figure 8 . The results are very different, and it is uncertain why the global response was so much slower than the Chinese response. But when we compare the institutions in Figure 5 , with countries that got most affected in the early stages of COVID-19, we can see the connection between countries affected early, and increased data records. Such assumptions from the categorisations, based on the automatic data mining using the Web of Science analyse results built-in tool, are speculative. We need more specific data mining methods to analyse this data records further. In the following section, we apply semi-automated data mining to look for association rule learning, anomaly detection, and regression to accompany and enhance our clustering and classification. We continued our data mining with using a computerised statistical analysis, using the VOSviewer computer program. From the 138,624 data records on pandemic and epidemics on the Web of Science Core Collection, for the VOSviewer data mining, we used the top 1000 most relevant data records and we considered this data set as representative sample of the 138,624 data records. We exported two separate text files and we used the two files for the data mining with VOSviewer. In Figure 9 , we can see the VOSviewer visualisation by country and collaborations between countries. In VOSviewer, we can select specific relationships of one country, and we can zoom in the image for more detailed data mining. It is however relatively easy to identify the US, England, Australia and China as the leading countries in the top 1000 historical data records on pandemics and epidemics. The Figure 9 and Figure 10 are both based on Lin/Log modularity normalisation. We conducted normalisation by association strength, and by fractionalisation normalisation, but the Lin/Log modularity normalisation presented better visualisation of the research collaborations between countries. For data mining on collaborations between institutions, we used associated strength normalisation, and since we wanted to investigate the collaborations, we set a limit on data records that included collaborations, this reduced our data records from 1000 to 90 analysed in Figure 11 . Although it's difficult to see in the Figure 11 visualisation, the VOSviewer identified 8 clusters, with the US universities predominating the biggest two clusters, and Chinese universities appearing in the third cluster. We continued our data mining in the next section with using a computerised statistical analysis, using the R Studio 'Bibliometrix' package. Since the Web of Science has data extraction limit of 500 records, to download the 3094 data records, we extracted 7 different files, and we merged the files using the 'Sublime Text' program. Then we downloaded the file in the 'Bibliometrix' package, 'Biblioshiny' function. Our data mining was based on association rule and clustering, using thee-fields plots, factorial analysis, collaboration network, conceptual map design, etc. The first graph we present (Figure 12 ) is based on association rule, investigating the relationship between variables, e.g. from all records on COVID-19, we used association rule to determine which other keywords are related in research, like SARS, infection, virus, etc. Apart from association rule, to design the three-fields plot in Figure 12 , we also applied clustering to discover and associate data records by countries of origin. The three-fields plot in Figure 12 , is similar to the research by country using the Web of Science result analysis tool, in Figure 2 . The difference in the visualisation is that in Figure 12 we can see the keywords associations between data records from individual country. While in Figure 2 , we can only see classifications of data records by country. To find a regression function, that estimates the relationship between data records, with the smallest amount of error, we developed a collaboration network map (Figure 13 ), using country in the network parameters, with equivalence normalisation, in a circle network layout, using Louvain clustering algorithm and the minimum number of edges set at 2. Since density is the proportion of present edges from all possible edges in the collaboration network, in Figure 14 , we can see the strongest collaboration networks, in edge connections and colour coding. Just to clarify these connections in the collaboration network map, edge density equals number of edges divided by maximal number of edges. Hence, an edge density in Figure 14 is defined of overlapping and weighted in graph communities. However, it is possible that edge variations in multiple keywords mainly reflect the variations in few underlying keywords. Hence, in Figure 15 , we applied factorial analysis as a statistical method to identify joint keywords in response to unnoticed (concealed) keywords. The parameters we applied in the factorial analysis (Figure 15 ), included 'multicorrespondence analysis', with field of analysis being the keywords of the records, with automatics clustering and a maximum number of terms 50. In Figure 15 we describe variability among the correlated keywords with potentially lower number of unobserved keywords (factors), aiming to identify independent latent keywords. In other words, we wanted to reduce the number of keywords in the data records. Our objective was to find the latent factors that create a commonality in the data records, and we applied factorial analysis because it is a statistical method that can identify smaller number of underlying variables, within large numbers of observed variables 4 . The factorial analysis derives two classifications of keywords (in Figure 15 ). The classification in blue, represents keywords like management, care, exposure, response, therapy, health, impact, risk, etc. The classification in red, represents more specific keywords, like respiratory syndrome, functional receptor, acute respiratory syndrome, etc. What we can see in the Figure 15 conceptual structure map, is factorial analysis of 3094 records, presenting classification of common keywords from all data records, in two classifications. The most interesting findings from this study was that institutions that are established as leaders in scientific research on pandemic and epidemics, have responded much slower than organisations that are located in the areas where COVID-19 first occurred. Wuhan University is currently very high in the classification of scientific research on COVID-19. In the alternative classification, using historical records on most prominent organisations in this field, the Wuhan and Teheran Universities are not present in the statistical classification. This triggers many questions on have the leading organisations on pandemic and epidemic management reacted as the appropriate speed? Is so, why are they behind in the production of scientific journal? Has there been a gap in communications and data sharing? Leaving these organisations oblivious to what was happening? Or was it that our global alert mechanisms failed to act? Did we ignore the warning signs? These are just some of the emerging questions from this study. Until these questions are answered, conspiracy theories would continue to spread. With the COVID-19 pandemic slowing down, we should also be seeking answers to these questions to prevent a second wave, and most importantly, to prevent the same mistakes happening in future pandemics. Our research findings can serve as the background work and starting point for future studies on developing spatial indices, such as large spatio-temporal datasets, and multidimensional objects, for computer decision support systems based on artificial intelligence. As the scientific research on COVID-19 continues to expand, the publications are becoming more fragmented, which makes the accumulation of knowledge challenging to navigate. In this article, we present the results from three different data mining methods for bibliometric science mapping, which can be replicated by other scientist seeking intellectual structure of the current research records. We found individual tools being restrictive, and we propose a multi-tool approach that enables faster results from statistical and graphical packages, aligned to bibliographical databases. With the use of these statistical methods, we presented interesting analysis and visualisations of the research connections between areas and countries, on the emerging patterns from national responses, and we provide scientific insights on some of the emerging concerns about the speed of response. Our aim was to provide statistical snapshot in time, and to assist other researchers, in answering difficult questions, in the future. There are obvious limitations in interestingness metrics, such as lack of insights into negative relationships, lack of statistical base on COVID-19. In addition, since we can only present results that emerge from the data, this study lacks an objective criterion for assessment. By lack of objective criterion, we refer to the lack of clearly defined research objectives, in specific terms that can be used to confirm if the terms of the objective criterion definitions are met. Instead, the visualisations in this article are representative of the statistical data records, on our search parameters, available on the 17 th of May 2020. In the spirit of reproducible research, we include our data records in this submission. The scientific literature on Coronaviruses, COVID-19 and its associated safety-related research dimensions: A scientometric analysis and scoping review A Bibliometric Analysis of COVID-19 Research Activity: A Call for Increased Output Coronavirus disease 2019: A bibliometric analysis and review bibliometrix: An R-tool for comprehensive science mapping analysis Software survey: VOSviewer, a computer program for bibliometric mapping Although very detailed, the collaboration network in Figure 13 seems a bit cluttered. To present a better visualisation of the data records, in Figure 14 , we kept the same parameters, but we reduced the minimum number of edges at 7.